@bkietz (Member) commented Dec 10, 2020

This PR replaces the Expression class hierarchy with a simpler discriminated union of:

  • literal values
  • field references
  • call expressions, which simply wrap a function name, a vector of arguments, and options

```c++
Expression add_1_to_i32 = call("add", {field_ref("i32"), literal(1)});
```

This reduces the overhead of supporting new compute functions in dataset filters: execution and validation against a schema are already implemented and tested in compute::. Only serialization and equality comparison need to be manually wired up, and only if the function requires nontrivial function options.
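For intuition, here is a minimal, self-contained sketch of the discriminated-union shape described above. The type definitions and factory helpers are stand-ins invented for this example (Arrow's actual Datum, FieldRef, and FunctionOptions are richer), so treat it as an illustration of the design, not the real implementation:

```c++
#include <memory>
#include <string>
#include <variant>
#include <vector>

struct Datum {};                        // stand-in for arrow::Datum (a literal value)
struct FieldRef { std::string name; };  // stand-in for arrow::FieldRef
struct FunctionOptions {};              // stand-in for compute::FunctionOptions

struct Expression {
  struct Call {
    std::string function_name;
    std::vector<Expression> arguments;         // recursion; fine for std::vector in C++17
    std::shared_ptr<FunctionOptions> options;  // null when the function needs no options
  };
  std::variant<Datum, FieldRef, Call> impl;    // the discriminated union
};

// Factory helpers mirroring the names used above.
Expression literal(Datum value) { return Expression{std::move(value)}; }
Expression field_ref(std::string name) { return Expression{FieldRef{std::move(name)}}; }
Expression call(std::string function, std::vector<Expression> arguments,
                std::shared_ptr<FunctionOptions> options = nullptr) {
  return Expression{
      Expression::Call{std::move(function), std::move(arguments), std::move(options)}};
}
```

With this shape there is no Expression subclass to write per operation: a new compute function becomes usable in a filter via call(...) as soon as it is registered.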

TODO: replace projection with expressions as well. The struct function is provided to replace RecordBatchProjector, invocable with call as with any other compute function:

```c++
Expression projection = call("struct", {field_ref("f32"), add_1_to_i32},
                             StructOptions{"f32_renamed", "i32 + 1"});
```

@nealrichardson nealrichardson force-pushed the 10322-Minimize-Expression-to-a- branch from adc5551 to 7fc23dd Compare December 15, 2020 18:48
@nealrichardson (Member)

R centos-7 (gcc 4.8) compilation fails: https://github.com/apache/arrow/pull/8894/checks?check_run_id=1572105784#step:9:513

@pitrou (Member) left a comment

Ok, I think I've reviewed the C++ side of this.

```c++
};
RETURN_NOT_OK(CanonicalizeAndFoldConstants());

for (const auto& guarantee : conjunction_members) {
```
Member:

Instead of looping on conjunction members, have you tried to match all members at once in the post-visit callback in DirectComparisonSimplification?

Member Author:

What would be the advantage?

Member:

Visiting the tree only once may be cheaper, though that would have to be measured.

```c++
RETURN_NOT_OK(ExtractKnownFieldValuesImpl(&conjunction_members, &known_values));

ARROW_ASSIGN_OR_RAISE(expr,
                      ReplaceFieldsWithKnownValues(known_values, std::move(expr)));
```
Member:

Is this useful, given that DirectComparisonSimplification should catch these cases as well?

Member Author:

It should indeed catch those cases. My plan was to extract a function along these lines:

```c++
struct KnownRange {
  Datum min, max;
  bool nullable;
  bool include_min, include_max;
};

Result<std::map<FieldRef, KnownRange>> ExtractKnownRanges(const Expression& guarantee);
```

for independent testing (which wouldn't handle the equality case), but I haven't done so yet. In any case this shouldn't represent a perf penalty, since the conjunction_members which correspond to those equality conditions are extracted before running DirectComparisonSimplification.

Member Author:

Another reason for the separation: DirectComparisonSimplification requires that the simplified expression also be a comparison, so it would not catch cases such as is_in(a, [1,2,3]) under the guarantee a == 4, and we wouldn't be able to skip that partition.
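To make that partition-skipping example concrete, here is a rough sketch of the substitution idea, reusing the stand-in Expression types from the sketch near the top of this thread. ReplaceFieldsWithKnownValues is the pass named in the diff above, but this body is illustrative, not the actual implementation:

```c++
#include <map>
// (assumes the stand-in Expression definitions from the earlier sketch)

Expression ReplaceFieldsWithKnownValues(const std::map<std::string, Datum>& known,
                                        Expression expr) {
  if (auto* ref = std::get_if<FieldRef>(&expr.impl)) {
    // Under a guarantee like a == 4, rewrite field_ref("a") to literal(4).
    auto it = known.find(ref->name);
    return it != known.end() ? literal(it->second) : expr;
  }
  if (auto* call_expr = std::get_if<Expression::Call>(&expr.impl)) {
    // Recurse into arguments: is_in(a, [1,2,3]) becomes is_in(4, [1,2,3]),
    // which constant folding can then reduce to the literal false.
    for (auto& arg : call_expr->arguments) {
      arg = ReplaceFieldsWithKnownValues(known, std::move(arg));
    }
  }
  return expr;
}
```

A comparison-only simplifier cannot produce that result, since false is not itself a comparison; substitution followed by constant folding can.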

@github-actions github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Dec 21, 2020
@bkietz bkietz force-pushed the 10322-Minimize-Expression-to-a- branch from 9718255 to 9a00cef Compare January 4, 2021 22:51
Member:

This doesn't necessarily need to happen in this PR, but I personally don't find an error message with "FieldRef.Name" in it very user-friendly (I think the previous one was more readable).

Member:

+1

Member:

+1

Member Author:

I can certainly change the error message, but I'd like to defer that to a follow-up.

@jorisvandenbossche (Member)

@github-actions crossbow submit test-conda-python-3.7-dask-latest test-conda-python-3.7-pandas-master test-conda-python-3.7-pandas-latest

@nealrichardson nealrichardson marked this pull request as ready for review January 5, 2021 19:40
@jorisvandenbossche (Member)

Ah, the bot doesn't work at the moment, I suppose. I ran the dask/parquet tests locally, and they are passing.

I also ran my tax-dataset dask notebook with some queries using this branch, but apparently there is a bug in the latest version of dask when reading those (something I need to investigate and report/fix in dask). That also happens with released pyarrow 2.0.0, though, so it is not caused by this PR.

@nealrichardson (Member)

[@jorisvandenbossche] Ah, the bot doesn't work at the moment I suppose.

@kszucs fixed this today (it was blocked by INFRA), so it should work if you rebase.

@nealrichardson (Member) left a comment

From the R side, this looks good, and @jonkeane and I have already built on it (#8947) and are ready to rebase and proceed with that PR once this one merges.

@jorisvandenbossche (Member)

The Python failure looks legit:

```
 ________________________ test_expression_serialization _________________________

    def test_expression_serialization():
        a = ds.scalar(1)
        b = ds.scalar(1.1)
        c = ds.scalar(True)
        d = ds.scalar("string")
        e = ds.scalar(None)
        f = ds.scalar({'a': 1})
        g = ds.scalar(pa.scalar(1))

        all_exprs = [a, b, c, d, e, f, g, a == b, a > b, a & b, a | b, ~c,
                     d.is_valid(), a.cast(pa.int32(), safe=False),
                     a.cast(pa.int32(), safe=False), a.isin([1, 2, 3]),
                     ds.field('i64') > 5, ds.field('i64') == 5,
                     ds.field('i64') == 7]
        for expr in all_exprs:
            assert isinstance(expr, ds.Expression)
            restored = pickle.loads(pickle.dumps(expr))
>           assert expr.equals(restored)
E           assert False
E            +  where False = <built-in method equals of pyarrow._dataset.Expression object at 0x7effd5d024a0>(<pyarrow.dataset.Expression is_in(1, value_set=[\n  -2459565876494606883,\n  -2459565876494606883,\n  -2459565876494606883\n], skip_nulls)>)
E            +    where <built-in method equals of pyarrow._dataset.Expression object at 0x7effd5d024a0> = <pyarrow.dataset.Expression is_in(1, value_set=[\n  1,\n  2,\n  3\n], skip_nulls)>.equals

opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/tests/test_dataset.py:419: AssertionError
```

Seems that the value_set is viewed/deserialized with the wrong type.

@github-actions (bot) commented Jan 5, 2021

Revision: 9a00cef5c82a032107d36b7fd47d1ec8a14703b0

Submitted crossbow builds: ursa-labs/crossbow @ actions-824

| Task | Status |
| --- | --- |
| test-conda-python-3.7-dask-latest | Github Actions |
| test-conda-python-3.7-pandas-latest | Github Actions |
| test-conda-python-3.7-pandas-master | Github Actions |

@jorisvandenbossche (Member)

(the failing test-conda-python-3.7-pandas-latest is the same serialization test failure as mentioned above)

@jorisvandenbossche (Member)

(I reviewed the minimal Python changes, which look good, and looked at part of the C++ dataset changes, but no need to wait on further review from my side.)

@pitrou (Member) left a comment

I didn't review everything again.

Member:

Can you put this inside the anonymous namespace above?
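For readers unfamiliar with the idiom, a generic illustration (not the PR's actual code) of what moving a definition into the anonymous namespace buys:

```c++
// Definitions in an anonymous namespace have internal linkage: they are
// private to this translation unit and do not appear among the library's
// exported symbols.
namespace {

int HelperUsedOnlyInThisFile(int x) { return x + 1; }

}  // namespace

int PublicEntryPoint(int x) { return HelperUsedOnlyInThisFile(x); }
```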

Member:

Also, a nit, but I don't understand why this is in cast.{h,cc}; I would expect to find only the cast functions here. A new scalar_nested.{h,cc} would seem a more logical place.

Member:

(I'm also curious why it's called "project". It sounds rather imprecise, though it may be the conventional term for this operation?)

Member Author:

"project" is the conventional term. I'll move it to a separate header/source.

Member:

Add a comment, as for AddSimpleCast above?

Member Author:

Actually, AddSimpleCast is no longer referenced anywhere, so I'll just rename this.

Member:

For the record, is there a fast path when conversion.second == 1? Otherwise, perhaps create a JIRA for it.

Member Author:

There is a fast path for this.

Member:

By the way, perhaps tag these APIs experimental, so that we can change them without warning?
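A hypothetical sketch of what such tagging could look like. The docstring wording and the exact signature are assumptions (MakeGroupings is the grouping helper discussed below), not the PR's actual code:

```c++
#include <memory>

#include <arrow/array.h>
#include <arrow/result.h>

/// \brief Map each unique row of `by` to the list of indices at which it occurs.
///
/// \note This API is EXPERIMENTAL and may change or be removed without
/// warning in a future release.
arrow::Result<std::shared_ptr<arrow::StructArray>> MakeGroupings(
    const arrow::StructArray& by);
```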

```c++
/// \param[in] by A StructArray whose columns will be used as grouping criteria.
/// \return A StructArray mapping unique rows (in field "values", represented as a
/// StructArray with the same fields as `by`) to lists of indices where
/// that row appears (in field "groupings").
```
Member:

Perhaps make this more explicit in the docstring then?

```c++
encoded = column.mutable_array();

auto indices =
    std::make_shared<Int32Array>(encoded->length, std::move(encoded->buffers[1]));
```
Member:

Similarly, std::move seems wrong if the column is already dictionary-encoded.
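A self-contained toy illustration of the concern, using stand-in types rather than Arrow's: when the column is already dictionary-encoded, encoded aliases data the caller still owns, so stealing the indices buffer with std::move would corrupt the caller's array, while copying the shared_ptr merely shares it:

```c++
#include <cassert>
#include <memory>
#include <vector>

// Stand-in for ArrayData; only the indices buffer matters for this example.
struct FakeArrayData {
  std::shared_ptr<std::vector<int>> indices_buffer;
};

int main() {
  auto callers_data = std::make_shared<FakeArrayData>();
  callers_data->indices_buffer =
      std::make_shared<std::vector<int>>(std::vector<int>{0, 1, 2});

  // "Already dictionary-encoded" case: encoded aliases the caller's data.
  FakeArrayData& encoded = *callers_data;

  // Unsafe: auto stolen = std::move(encoded.indices_buffer); would leave the
  // caller's buffer pointer null. Copying the shared_ptr is safe; only the
  // refcount changes.
  auto shared = encoded.indices_buffer;

  assert(shared == callers_data->indices_buffer);  // caller's buffer intact
  return 0;
}
```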

```c++
*fused_indices = checked_pointer_cast<Int32Array>(new_fused_indices.make_array());

// XXX should probably cap this at 2**15 or so
ARROW_CHECK(!internal::MultiplyWithOverflow(size_, dictionary_size, &size_));
```
Member:

Ping.
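Per the follow-up commit referenced at the end of this thread, the resolution was to raise an error status instead of aborting. A sketch of that shape; the wrapper name and message are hypothetical, though MultiplyWithOverflow's return-true-on-overflow convention matches the snippet above:

```c++
#include <cstdint>

#include <arrow/status.h>
#include <arrow/util/int_util_overflow.h>  // assumed header for MultiplyWithOverflow

arrow::Status GrowGroupCount(int64_t dictionary_size, int64_t* size) {
  if (arrow::internal::MultiplyWithOverflow(*size, dictionary_size, size)) {
    return arrow::Status::Invalid("Number of groups would overflow int64");
  }
  return arrow::Status::OK();
}
```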

@bkietz bkietz force-pushed the 10322-Minimize-Expression-to-a- branch from 1ca0dc6 to b14a456 Compare January 6, 2021 18:35
@bkietz (Member Author) commented Jan 6, 2021

@jorisvandenbossche (Member)

[@pitrou] (I'm also curious why it's called "project". It sounds rather imprecise, though it may be the conventional term for this operation?)

[@bkietz] "project" is the conventional term. I'll move it to a separate header/source.

Although it's clearly related, I personally still find it a bit of a strange name for this specific (user-exposed) function (but I am certainly not very familiar with the different contexts where "project" gets used; in Python/pandas, for example, the term is basically never used).
In the Dataset context, we typically speak about projection when, for example, defining a subset of the columns to return, correct? But here you already have the subset of arrays/scalars and only combine them into a StructArray. Naively, I would expect a project function to receive a record batch and return a subset of it (with potentially renamed, reordered, etc. fields), so this feels like a level lower than an actual 'project' operation.
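For comparison, the function under discussion operates roughly at the level of this sketch, which uses the public StructArray::Make API as a stand-in for the kernel (ZipToStruct is a hypothetical wrapper name): it bundles already-materialized columns with field names into a StructArray, rather than selecting columns out of a record batch:

```c++
#include <memory>
#include <string>
#include <vector>

#include <arrow/api.h>

// Hypothetical wrapper: the inputs are already-computed arrays, and the
// output merely zips them with field names into a single StructArray.
arrow::Result<std::shared_ptr<arrow::StructArray>> ZipToStruct(
    const std::vector<std::shared_ptr<arrow::Array>>& columns,
    const std::vector<std::string>& names) {
  return arrow::StructArray::Make(columns, names);
}
```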

@jorisvandenbossche (Member)

The ProjectOptions also still need to be exposed in Python -> opened https://issues.apache.org/jira/browse/ARROW-11166

bkietz added a commit that referenced this pull request Jan 11, 2021
See also: #8894

The "project" compute function is not really intended for direct use; it's primarily a convenience for exposing expressions to projection: https://issues.apache.org/jira/browse/ARROW-11174

As such, maybe it should be hidden instead of exposed to python?

Closes #9131 from bkietz/11166-Add-bindings-for-ProjectO

Authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
bkietz added a commit that referenced this pull request Jan 15, 2021
…ictionary columns

Enables usage of dictionary columns as partition columns on write.

Additionally resolves some partition-related follow ups from #8894 (@pitrou):
- raise an error status [instead of aborting](#8894) for overflowing maximum group count
- handle dictionary index types [other than int32](#8894)
- don't build an unused null bitmap [in CountsToOffsets](#8894)
- improve docstrings for [MakeGroupings, ApplyGroupings](#8894)

At some point, we'll probably want to support null grouping criteria. (For now, this PR adds a test asserting that nulls in any grouping column raise an error.) This will require adding an option/overload/... of dictionary_encode which places nulls in the dictionary instead of the indices, and ensuring Partitionings can format nulls appropriately. This would allow users to write a partitioned dataset which preserves nulls sensibly:

```
data/
    col=a/
        part-0.parquet # col is "a" throughout
    col=b/
        part-1.parquet # col is "b" throughout
    part-2.parquet # col is null throughout
```

Closes #9130 from bkietz/10247-Cannot-write-dataset-with

Lead-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
@bkietz bkietz deleted the 10322-Minimize-Expression-to-a- branch February 25, 2021 16:11
alamb pushed a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021

…ictionary columns (same commit message as above, with links pointing to apache/arrow#8894)

Labels: Component: C++, Component: R, needs-rebase (A PR that needs to be rebased by the author)