@bkietz (Member) commented Dec 10, 2020

This PR replaces the Expression class hierarchy with a simpler discriminated union of:

  • literal values
  • field references
  • call expressions, which simply wrap a function name, a vector of arguments, and options

```c++
Expression add_1_to_i32 = call("add", {field_ref("i32"), literal(1)});
```

This reduces the overhead of supporting new compute functions in dataset filters: execution and validation against a schema are already implemented and tested in compute::. Only serialization and equality comparison need to be manually wired up, and only if the function requires nontrivial function options.
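For intuition, here is a minimal, self-contained sketch of the discriminated-union shape described above. The type definitions and factory helpers are stand-ins invented for this example (Arrow's actual Datum, FieldRef, and FunctionOptions are richer), so treat it as an illustration of the design, not the real implementation:

```c++
#include <memory>
#include <string>
#include <variant>
#include <vector>

struct Datum {};                        // stand-in for arrow::Datum (a literal value)
struct FieldRef { std::string name; };  // stand-in for arrow::FieldRef
struct FunctionOptions {};              // stand-in for compute::FunctionOptions

struct Expression {
  struct Call {
    std::string function_name;
    std::vector<Expression> arguments;         // recursion; fine for std::vector in C++17
    std::shared_ptr<FunctionOptions> options;  // null when the function needs no options
  };
  std::variant<Datum, FieldRef, Call> impl;    // the discriminated union
};

// Factory helpers mirroring the names used above.
Expression literal(Datum value) { return Expression{std::move(value)}; }
Expression field_ref(std::string name) { return Expression{FieldRef{std::move(name)}}; }
Expression call(std::string function, std::vector<Expression> arguments,
                std::shared_ptr<FunctionOptions> options = nullptr) {
  return Expression{
      Expression::Call{std::move(function), std::move(arguments), std::move(options)}};
}
```

With this shape there is no Expression subclass to write per operation: a new compute function becomes usable in a filter via call(...) as soon as it is registered.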

TODO: replace projection with expressions as well. The struct function is provided to replace RecordBatchProjector, invocable with call as with any other compute function:

```c++
Expression projection = call("struct", {field_ref("f32"), add_1_to_i32},
                             StructOptions{"f32_renamed", "i32 + 1"});
```

@nealrichardson nealrichardson force-pushed the 10322-Minimize-Expression-to-a- branch from adc5551 to 7fc23dd Compare December 15, 2020 18:48
@nealrichardson (Member)

R centos-7 (gcc 4.8) compilation fails: https://github.com/apache/arrow/pull/8894/checks?check_run_id=1572105784#step:9:513

@pitrou (Member) left a comment

Ok, I think I've reviewed the C++ side of this.

```c++
};
RETURN_NOT_OK(CanonicalizeAndFoldConstants());

for (const auto& guarantee : conjunction_members) {
```
Member:

Instead of looping on conjunction members, have you tried to match all members at once in the post-visit callback in DirectComparisonSimplification?

Member Author:

What would be the advantage?

Member:

Visiting the tree only once may be cheaper, though that would have to be measured.

```c++
RETURN_NOT_OK(ExtractKnownFieldValuesImpl(&conjunction_members, &known_values));

ARROW_ASSIGN_OR_RAISE(expr,
                      ReplaceFieldsWithKnownValues(known_values, std::move(expr)));
```
Member:

Is this useful, given that DirectComparisonSimplification should catch these cases as well?

Member Author:

It should indeed catch those cases. My plan was to extract a function along these lines:

```c++
struct KnownRange {
  Datum min, max;
  bool nullable;
  bool include_min, include_max;
};

Result<std::map<FieldRef, KnownRange>> ExtractKnownRanges(const Expression& guarantee);
```

for independent testing (which wouldn't handle the equality case), but I haven't done so yet. In any case this shouldn't represent a perf penalty, since the conjunction_members which correspond to those equality conditions are extracted before running DirectComparisonSimplification.

Member Author:

Another reason for the separation: DirectComparisonSimplification requires that the simplified expression also be a comparison, so it would not catch cases such as is_in(a, [1,2,3]) under the guarantee a == 4, and we wouldn't be able to skip that partition.
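To make that partition-skipping example concrete, here is a rough sketch of the substitution idea, reusing the stand-in Expression types from the sketch near the top of this thread. ReplaceFieldsWithKnownValues is the pass named in the diff above, but this body is illustrative, not the actual implementation:

```c++
#include <map>
// (assumes the stand-in Expression definitions from the earlier sketch)

Expression ReplaceFieldsWithKnownValues(const std::map<std::string, Datum>& known,
                                        Expression expr) {
  if (auto* ref = std::get_if<FieldRef>(&expr.impl)) {
    // Under a guarantee like a == 4, rewrite field_ref("a") to literal(4).
    auto it = known.find(ref->name);
    return it != known.end() ? literal(it->second) : expr;
  }
  if (auto* call_expr = std::get_if<Expression::Call>(&expr.impl)) {
    // Recurse into arguments: is_in(a, [1,2,3]) becomes is_in(4, [1,2,3]),
    // which constant folding can then reduce to the literal false.
    for (auto& arg : call_expr->arguments) {
      arg = ReplaceFieldsWithKnownValues(known, std::move(arg));
    }
  }
  return expr;
}
```

A comparison-only simplifier cannot produce that result, since false is not itself a comparison; substitution followed by constant folding can.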

@github-actions github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Dec 21, 2020
@bkietz bkietz force-pushed the 10322-Minimize-Expression-to-a- branch from 9718255 to 9a00cef Compare January 4, 2021 22:51
Member:

This doesn't necessarily need to happen in this PR, but I personally don't find an error message with "FieldRef.Name" in it very user-friendly (I think the previous one was more readable).

Member:

+1

Member:

+1

Member Author:

I can certainly change the error message, but I'd like to defer that to a follow-up.

@jorisvandenbossche (Member)

@github-actions crossbow submit test-conda-python-3.7-dask-latest test-conda-python-3.7-pandas-master test-conda-python-3.7-pandas-latest

@nealrichardson nealrichardson marked this pull request as ready for review January 5, 2021 19:40
@jorisvandenbossche (Member)

Ah, the bot doesn't work at the moment, I suppose. I ran the dask/parquet tests locally, and they are passing.

I also ran my tax-dataset dask notebook with some queries using this branch, but apparently there is a bug in the latest version of dask when reading those (something I need to investigate and report/fix in dask). That also happens with released pyarrow 2.0.0, though, so it is not caused by this PR.

@nealrichardson (Member)

[@jorisvandenbossche] Ah, the bot doesn't work at the moment I suppose.

@kszucs fixed this today (it was blocked by INFRA), so it should work if you rebase.

@nealrichardson (Member) left a comment

From the R side, this looks good, and @jonkeane and I have already built on it (#8947) and are ready to rebase and proceed with that PR once this one merges.

@jorisvandenbossche (Member)

The Python failure looks legit:

```
 ________________________ test_expression_serialization _________________________

    def test_expression_serialization():
        a = ds.scalar(1)
        b = ds.scalar(1.1)
        c = ds.scalar(True)
        d = ds.scalar("string")
        e = ds.scalar(None)
        f = ds.scalar({'a': 1})
        g = ds.scalar(pa.scalar(1))

        all_exprs = [a, b, c, d, e, f, g, a == b, a > b, a & b, a | b, ~c,
                     d.is_valid(), a.cast(pa.int32(), safe=False),
                     a.cast(pa.int32(), safe=False), a.isin([1, 2, 3]),
                     ds.field('i64') > 5, ds.field('i64') == 5,
                     ds.field('i64') == 7]
        for expr in all_exprs:
            assert isinstance(expr, ds.Expression)
            restored = pickle.loads(pickle.dumps(expr))
>           assert expr.equals(restored)
E           assert False
E            +  where False = <built-in method equals of pyarrow._dataset.Expression object at 0x7effd5d024a0>(<pyarrow.dataset.Expression is_in(1, value_set=[\n  -2459565876494606883,\n  -2459565876494606883,\n  -2459565876494606883\n], skip_nulls)>)
E            +    where <built-in method equals of pyarrow._dataset.Expression object at 0x7effd5d024a0> = <pyarrow.dataset.Expression is_in(1, value_set=[\n  1,\n  2,\n  3\n], skip_nulls)>.equals

opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/tests/test_dataset.py:419: AssertionError
```

Seems that the value_set is viewed/deserialized with the wrong type.

@github-actions (bot) commented Jan 5, 2021

Revision: 9a00cef5c82a032107d36b7fd47d1ec8a14703b0

Submitted crossbow builds: ursa-labs/crossbow @ actions-824

| Task | Status |
| --- | --- |
| test-conda-python-3.7-dask-latest | Github Actions |
| test-conda-python-3.7-pandas-latest | Github Actions |
| test-conda-python-3.7-pandas-master | Github Actions |

@jorisvandenbossche (Member)

(the failing test-conda-python-3.7-pandas-latest is the same serialization test failure as mentioned above)

@jorisvandenbossche (Member)

(I reviewed the minimal Python changes, which look good, and looked at part of the C++ dataset changes, but no need to wait on further review from my side.)

@pitrou (Member) left a comment

I didn't review everything again.

Member:

Can you put this inside the anonymous namespace above?
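For readers unfamiliar with the idiom, a generic illustration (not the PR's actual code) of what moving a definition into the anonymous namespace buys:

```c++
// Definitions in an anonymous namespace have internal linkage: they are
// private to this translation unit and do not appear among the library's
// exported symbols.
namespace {

int HelperUsedOnlyInThisFile(int x) { return x + 1; }

}  // namespace

int PublicEntryPoint(int x) { return HelperUsedOnlyInThisFile(x); }
```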

Member:

Also, a nit, but I don't understand why this is in cast.{h,cc}; I would expect to find only the cast functions here. A new scalar_nested.{h,cc} would seem a more logical place.

Member:

(I'm also curious why it's called "project". It sounds rather imprecise, though it may be the conventional term for this operation?)

Member Author:

"project" is the conventional term. I'll move it to a separate header/source.

Member:

Add a comment, as for AddSimpleCast above?

Member Author:

Actually, AddSimpleCast is no longer referenced anywhere, so I'll just rename this.

Member:

For the record, is there a fast path when conversion.second == 1? Otherwise, perhaps create a JIRA for it.

Member Author:

There is a fast path for this.

Member:

By the way, perhaps tag these APIs experimental, so that we can change them without warning?
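A hypothetical sketch of what such tagging could look like. The docstring wording and the exact signature are assumptions (MakeGroupings is the grouping helper discussed below), not the PR's actual code:

```c++
#include <memory>

#include <arrow/array.h>
#include <arrow/result.h>

/// \brief Map each unique row of `by` to the list of indices at which it occurs.
///
/// \note This API is EXPERIMENTAL and may change or be removed without
/// warning in a future release.
arrow::Result<std::shared_ptr<arrow::StructArray>> MakeGroupings(
    const arrow::StructArray& by);
```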

```c++
/// \param[in] by A StructArray whose columns will be used as grouping criteria.
/// \return A StructArray mapping unique rows (in field "values", represented as a
/// StructArray with the same fields as `by`) to lists of indices where
/// that row appears (in field "groupings").
```
Member:

Perhaps make this more explicit in the docstring then?

```c++
encoded = column.mutable_array();

auto indices =
    std::make_shared<Int32Array>(encoded->length, std::move(encoded->buffers[1]));
```
Member:

Similarly, std::move seems wrong if the column is already dictionary-encoded.
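A self-contained toy illustration of the concern, using stand-in types rather than Arrow's: when the column is already dictionary-encoded, encoded aliases data the caller still owns, so stealing the indices buffer with std::move would corrupt the caller's array, while copying the shared_ptr merely shares it:

```c++
#include <cassert>
#include <memory>
#include <vector>

// Stand-in for ArrayData; only the indices buffer matters for this example.
struct FakeArrayData {
  std::shared_ptr<std::vector<int>> indices_buffer;
};

int main() {
  auto callers_data = std::make_shared<FakeArrayData>();
  callers_data->indices_buffer =
      std::make_shared<std::vector<int>>(std::vector<int>{0, 1, 2});

  // "Already dictionary-encoded" case: encoded aliases the caller's data.
  FakeArrayData& encoded = *callers_data;

  // Unsafe: auto stolen = std::move(encoded.indices_buffer); would leave the
  // caller's buffer pointer null. Copying the shared_ptr is safe; only the
  // refcount changes.
  auto shared = encoded.indices_buffer;

  assert(shared == callers_data->indices_buffer);  // caller's buffer intact
  return 0;
}
```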

```c++
*fused_indices = checked_pointer_cast<Int32Array>(new_fused_indices.make_array());

// XXX should probably cap this at 2**15 or so
ARROW_CHECK(!internal::MultiplyWithOverflow(size_, dictionary_size, &size_));
```
Member:

Ping.
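Per the follow-up commit referenced at the end of this thread, the resolution was to raise an error status instead of aborting. A sketch of that shape; the wrapper name and message are hypothetical, though MultiplyWithOverflow's return-true-on-overflow convention matches the snippet above:

```c++
#include <cstdint>

#include <arrow/status.h>
#include <arrow/util/int_util_overflow.h>  // assumed header for MultiplyWithOverflow

arrow::Status GrowGroupCount(int64_t dictionary_size, int64_t* size) {
  if (arrow::internal::MultiplyWithOverflow(*size, dictionary_size, size)) {
    return arrow::Status::Invalid("Number of groups would overflow int64");
  }
  return arrow::Status::OK();
}
```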

@bkietz bkietz force-pushed the 10322-Minimize-Expression-to-a- branch from 1ca0dc6 to b14a456 Compare January 6, 2021 18:35
@bkietz (Member Author) commented Jan 6, 2021

@jorisvandenbossche (Member)

[@pitrou] (I'm also curious why it's called "project". It sounds rather imprecise, though it may be the conventional term for this operation?)

[@bkietz] "project" is the conventional term. I'll move it to a separate header/source.

Although it's clearly related, I personally still find it a bit of a strange name for this specific (user-exposed) function (but I am certainly not very familiar with the different contexts where "project" gets used; in Python/pandas, for example, the term is basically never used).
In the Dataset context, we typically speak about projection when, for example, defining a subset of the columns to return, correct? But here you already have the subset of arrays/scalars and only combine them into a StructArray. Naively, I would expect a project function to receive a record batch and return a subset of it (with potentially renamed, reordered, etc. fields), so this feels like a level lower than an actual 'project' operation.
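For comparison, the function under discussion operates roughly at the level of this sketch, which uses the public StructArray::Make API as a stand-in for the kernel (ZipToStruct is a hypothetical wrapper name): it bundles already-materialized columns with field names into a StructArray, rather than selecting columns out of a record batch:

```c++
#include <memory>
#include <string>
#include <vector>

#include <arrow/api.h>

// Hypothetical wrapper: the inputs are already-computed arrays, and the
// output merely zips them with field names into a single StructArray.
arrow::Result<std::shared_ptr<arrow::StructArray>> ZipToStruct(
    const std::vector<std::shared_ptr<arrow::Array>>& columns,
    const std::vector<std::string>& names) {
  return arrow::StructArray::Make(columns, names);
}
```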

@jorisvandenbossche (Member)

The ProjectOptions also still need to be exposed in Python -> opened https://issues.apache.org/jira/browse/ARROW-11166

bkietz added a commit that referenced this pull request Jan 11, 2021
See also: #8894

The "project" compute function is not really intended for direct use; it's primarily a convenience for exposing expressions to projection: https://issues.apache.org/jira/browse/ARROW-11174

As such, maybe it should be hidden instead of exposed to python?

Closes #9131 from bkietz/11166-Add-bindings-for-ProjectO

Authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
bkietz added a commit that referenced this pull request Jan 15, 2021
…ictionary columns

Enables usage of dictionary columns as partition columns on write.

Additionally resolves some partition-related follow ups from #8894 (@pitrou):
- raise an error status [instead of aborting](#8894) for overflowing maximum group count
- handle dictionary index types [other than int32](#8894)
- don't build an unused null bitmap [in CountsToOffsets](#8894)
- improve docstrings for [MakeGroupings, ApplyGroupings](#8894)

At some point, we'll probably want to support null grouping criteria. (For now, this PR adds a test asserting that nulls in any grouping column raise an error.) This will require adding an option/overload/... of dictionary_encode which places nulls in the dictionary instead of the indices, and ensuring Partitionings can format nulls appropriately. This would allow users to write a partitioned dataset which preserves nulls sensibly:

```
data/
    col=a/
        part-0.parquet # col is "a" throughout
    col=b/
        part-1.parquet # col is "b" throughout
    part-2.parquet # col is null throughout
```

Closes #9130 from bkietz/10247-Cannot-write-dataset-with

Lead-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
@bkietz bkietz deleted the 10322-Minimize-Expression-to-a- branch February 25, 2021 16:11
alamb pushed a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021

…ictionary columns (same commit message as above, with links pointing to apache/arrow#8894)

Labels: Component: C++, Component: R, needs-rebase (A PR that needs to be rebased by the author)