ARROW-10322: [C++][Dataset] Minimize Expression #8894
Conversation
Force-pushed adc5551 to 7fc23dd
R centos-7 (gcc 4.8) compilation fails: https://github.com/apache/arrow/pull/8894/checks?check_run_id=1572105784#step:9:513
pitrou left a comment:
Ok, I think I've reviewed the C++ side of this.
```cpp
};

RETURN_NOT_OK(CanonicalizeAndFoldConstants());

for (const auto& guarantee : conjunction_members) {
```
Instead of looping on conjunction members, have you tried to match all members at once in the post-visit callback in DirectComparisonSimplification?
What would be the advantage?
Visiting the tree only once may be cheaper, though that would have to be measured.
```cpp
RETURN_NOT_OK(ExtractKnownFieldValuesImpl(&conjunction_members, &known_values));

ARROW_ASSIGN_OR_RAISE(expr,
                      ReplaceFieldsWithKnownValues(known_values, std::move(expr)));
```
Is this useful, given that DirectComparisonSimplification should catch these cases as well?
It should indeed catch those cases. My plan was to extract a function

```cpp
Result<map<FieldRef, {
  Datum min,
  Datum max,
  bool nullable,
  bool include_min,
  bool include_max
}>> ExtractKnownRanges(guarantee);
```

for independent testing (which wouldn't handle the equality case), but I haven't done so yet. In any case this shouldn't represent a perf penalty, since the conjunction_members which correspond to those equality conditions are extracted before running DirectComparisonSimplification.
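As a rough illustration of what such a function could compute, here is a stdlib-only sketch (the `KnownRange` struct and `ApplyLowerBound` helper are hypothetical names, not Arrow code) of tightening a per-field interval as comparison guarantees are folded in:

```cpp
#include <optional>

// Hypothetical sketch: a per-field interval, tightened as comparison
// guarantees like "field > v" or "field >= v" are folded in.
struct KnownRange {
  std::optional<double> min, max;
  bool include_min = false, include_max = false;
};

// Fold a lower-bound guarantee into the range, keeping the tighter bound:
// a larger value wins, and at equal values an exclusive bound wins.
void ApplyLowerBound(KnownRange* r, double value, bool inclusive) {
  if (!r->min || value > *r->min || (value == *r->min && !inclusive)) {
    r->min = value;
    r->include_min = inclusive;
  }
}
```

An analogous `ApplyUpperBound` would handle `<` / `<=`, and an equality guarantee would set both bounds inclusively.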
Another reason for the separation is that DirectComparisonSimplification requires that the simplified expression is also a comparison. It will not catch cases such as `is_in(a, [1,2,3])` where `a == 4`, so we wouldn't be able to skip that partition.
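A stdlib-only sketch (names hypothetical, not Arrow's API) of the case described above: once the guarantee `a == 4` is substituted, membership in `[1, 2, 3]` folds to a constant, even though the simplified expression is no longer a comparison:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical sketch: given the guarantee field == guaranteed_value,
// an is_in(field, value_set) filter folds to a constant boolean; a
// false result lets the scanner skip the whole partition.
bool FoldIsInGivenEquality(int guaranteed_value, const std::vector<int>& value_set) {
  return std::find(value_set.begin(), value_set.end(), guaranteed_value) !=
         value_set.end();
}
```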
Force-pushed 9718255 to 9a00cef
Doesn't necessarily need to happen in this PR, but: I personally don't find this a very user-friendly error message with "FieldRef.Name" in it (I think the previous was more readable)
+1
+1
I can certainly change the error message, but I'd like to wait for a follow-up.
@github-actions crossbow submit test-conda-python-3.7-dask-latest test-conda-python-3.7-pandas-master test-conda-python-3.7-pandas-latest
Ah, the bot doesn't work at the moment, I suppose. I ran the dask/parquet tests locally, and they are passing. I also ran my taxi-dataset dask notebook with some queries using this branch, but apparently there is a bug in the latest version of dask to read those (something I need to investigate and report/fix in dask). But that also happens with released pyarrow 2.0.0, so it is not caused by this PR.
@kszucs fixed this today (it was blocked by INFRA), so it should work if you rebase.
nealrichardson left a comment:
The python failure looks legit: Seems that the
Revision: 9a00cef5c82a032107d36b7fd47d1ec8a14703b0

Submitted crossbow builds: ursa-labs/crossbow @ actions-824
(the failing test-conda-python-3.7-pandas-latest is the same serialization test failure as mentioned above)
(I reviewed the minimal python changes, which look good, and looked at part of the C++ dataset changes, but no need to wait on further review from my side)
pitrou left a comment:
I didn't review everything again.
cpp/src/arrow/compute/cast.cc (outdated)
Can you put this inside the anonymous namespace above?
Also, nit but I don't understand why this is in cast.{h,cc}. I would expect to only find the cast functions here. A new scalar_nested.{h,cc} would seem a more logical place.
(I'm also curious why it's called "project". It sounds rather imprecise, though it may be the conventional term for this operation?)
"project" is the conventional term. I'll move it to a separate header/source.
Add comment as for AddSimpleCast above?
Actually, AddSimpleCast is no longer referenced anywhere, so I'll just rename this.
For the record, is there a fast path when `conversion.second == 1`? Otherwise, perhaps create a JIRA for it.
There is a fast path for this.
cpp/src/arrow/dataset/expression.h (outdated)
By the way, perhaps tag these APIs experimental, so that we can change them without warning?
Is a comment at the top of the file insufficient?
```cpp
/// \param[in] by A StructArray whose columns will be used as grouping criteria.
/// \return A StructArray mapping unique rows (in field "values", represented as a
///         StructArray with the same fields as `by`) to lists of indices where
///         that row appears (in field "groupings").
```
Perhaps make this more explicit in the docstring then?
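For illustration, a stdlib-only sketch (not Arrow's implementation; names hypothetical) of the mapping the docstring describes, with a string standing in for a unique row of the `by` columns:

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch of what MakeGroupings computes: each unique key
// (a row of the grouping columns, here a string) maps to the list of
// row indices where it appears ("values" -> "groupings").
std::map<std::string, std::vector<int>> MakeGroupingsSketch(
    const std::vector<std::string>& keys) {
  std::map<std::string, std::vector<int>> groupings;
  for (int i = 0; i < static_cast<int>(keys.size()); ++i) {
    groupings[keys[i]].push_back(i);
  }
  return groupings;
}
```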
```cpp
encoded = column.mutable_array();

auto indices =
    std::make_shared<Int32Array>(encoded->length, std::move(encoded->buffers[1]));
```
Similarly, std::move seems wrong if the column is already dictionary-encoded.
```cpp
*fused_indices = checked_pointer_cast<Int32Array>(new_fused_indices.make_array());

// XXX should probably cap this at 2**15 or so
ARROW_CHECK(!internal::MultiplyWithOverflow(size_, dictionary_size, &size_));
```
Ping.
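For reference, a stdlib-only sketch (names hypothetical; the real fix landed as a follow-up in #9130) of replacing the aborting `ARROW_CHECK` with a checked multiply whose failure the caller can turn into an error status:

```cpp
#include <cstdint>

// Sketch: grow the fused group count by a dictionary's size; report
// overflow to the caller instead of aborting. Uses the GCC/Clang
// __builtin_mul_overflow intrinsic.
bool GrowGroupCount(int32_t* size, int32_t dictionary_size) {
  int32_t result;
  if (__builtin_mul_overflow(*size, dictionary_size, &result)) {
    return false;  // caller raises an error status instead of crashing
  }
  *size = result;
  return true;
}
```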
Force-pushed 1ca0dc6 to b14a456
non-Apache CI: https://github.com/bkietz/arrow/runs/1659121616
Although it's clearly related, I personally still find it a bit of a strange name for this specific (user-exposed) function (but I am certainly not very familiar with the different contexts where "project" gets used; e.g., in Python/pandas this term is basically never used).
See also: #8894

The "project" compute function is not really intended for direct use; it's primarily a convenience for exposing expressions to projection: https://issues.apache.org/jira/browse/ARROW-11174. As such, maybe it should be hidden instead of exposed to python?

Closes #9131 from bkietz/11166-Add-bindings-for-ProjectO

Authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
…ictionary columns

Enables usage of dictionary columns as partition columns on write. Additionally resolves some partition-related follow-ups from #8894 (@pitrou):

- raise an error status [instead of aborting](#8894) for overflowing maximum group count
- handle dictionary index types [other than int32](#8894)
- don't build an unused null bitmap [in CountsToOffsets](#8894)
- improve docstrings for [MakeGroupings, ApplyGroupings](#8894)

At some point, we'll probably want to support null grouping criteria. (For now, this PR adds a test asserting that nulls in any grouping column raise an error.) This will require adding an option/overload/... of dictionary_encode which places nulls in the dictionary instead of the indices, and ensuring Partitionings can format nulls appropriately. This would allow users to write a partitioned dataset which preserves nulls sensibly:

```
data/
  col=a/
    part-0.parquet  # col is "a" throughout
  col=b/
    part-1.parquet  # col is "b" throughout
  part-2.parquet    # col is null throughout
```

Closes #9130 from bkietz/10247-Cannot-write-dataset-with

Lead-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
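The proposed dictionary_encode variant could behave like this stdlib-only sketch (a hypothetical API, using `std::optional<int>` to stand in for a nullable value): nulls get their own dictionary entry, so every index is valid and the null group can be formatted like any other partition key:

```cpp
#include <map>
#include <optional>
#include <vector>

// Hypothetical sketch: dictionary-encode with nulls placed in the
// dictionary (as a distinct entry) rather than as null indices.
struct Encoded {
  std::vector<std::optional<int>> dictionary;
  std::vector<int> indices;
};

Encoded EncodeNullsInDictionary(const std::vector<std::optional<int>>& values) {
  Encoded out;
  std::map<std::optional<int>, int> codes;  // value -> dictionary code
  for (const auto& v : values) {
    auto it = codes.find(v);
    if (it == codes.end()) {
      it = codes.emplace(v, static_cast<int>(out.dictionary.size())).first;
      out.dictionary.push_back(v);
    }
    out.indices.push_back(it->second);
  }
  return out;
}
```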
This PR replaces the `Expression` class hierarchy with a simpler discriminated union of literals, field references, and calls to compute functions. This reduces the overhead of supporting new compute functions in dataset filters: execution and validation against a schema are already implemented and tested in `compute::`. Only serialization and equality comparison need to be manually wired up, and only if the function requires nontrivial function options.

TODO: replace projection with expressions as well. The `struct` function is provided to replace `RecordBatchProjector`, invocable with `call` as with any other compute function:
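To make the shape concrete, a minimal stdlib sketch of such a discriminated union (illustrative only, not Arrow's real definition): an expression is a literal value, a reference to a field, or a call to a named compute function applied to sub-expressions.

```cpp
#include <memory>
#include <string>
#include <variant>
#include <vector>

struct Expression;

// A call to a named compute function over sub-expressions.
struct Call {
  std::string function_name;
  std::vector<std::shared_ptr<Expression>> arguments;
};

// The discriminated union: literal | field reference | call.
struct Expression {
  std::variant<int, std::string, Call> impl;
};

bool IsLiteral(const Expression& e) { return std::holds_alternative<int>(e.impl); }
```

Dispatch that previously required virtual methods on the class hierarchy becomes a `std::visit` (or `holds_alternative` check) over the union.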