Skip to content

Fix combined filtering and column projection in read_parquet#13666

Merged
rapids-bot[bot] merged 10 commits intorapidsai:branch-23.08from
rjzamora:filter-and-project
Jul 11, 2023
Merged

Fix combined filtering and column projection in read_parquet#13666
rapids-bot[bot] merged 10 commits intorapidsai:branch-23.08from
rjzamora:filter-and-project

Conversation

@rjzamora
Copy link
Member

@rjzamora rjzamora commented Jul 6, 2023

Description

Follow-up to #13334 for the special case that the filters argument includes column names that are not included in the current column projection (i.e. the columns argument). Although this pattern is not a common case at the moment, it is perfectly valid, and will become more common when cudf is used as a dask-expr backend (since the predicate-pushdown optimizations in dask-expr are significantly more advanced than those in dask-dataframe).

Note:
Prior to #13334, the special case targeted by this PR would not have produced any run-time errors, but it also wouldn't have produced proper filtering in many cases. Now that cudf.read_parquet does produce proper row-wise filtering, it turns out that we now need to sacrifice a bit of IO in cases like this. Although this is unfortunate, I personally feel that it is still worthwhile to enforce row-wise filtering.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rjzamora rjzamora added bug Something isn't working 2 - In Progress Currently a work in progress Python Affects Python cuDF API. non-breaking Non-breaking change labels Jul 6, 2023
@rjzamora rjzamora self-assigned this Jul 6, 2023
@rjzamora rjzamora requested a review from a team as a code owner July 6, 2023 19:28
@rjzamora rjzamora requested review from bdice and mroeschke July 6, 2023 19:28
@rjzamora rjzamora added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Jul 6, 2023
Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the inclusion of the filtering columns can be made a bit simpler, but otherwise logic looks good to me, thanks.

# we do NEED these columns for accurate filtering.
projection = None
if columns and filters:
filtered_columns = _filtered_columns(filters)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point you've normalized the filters into list[list[tuple]] so _filtered_columns doesn't need to recurse to flatten things, and you can just do:

filtered_columns = set(v[0] for v in itertools.chain.from_iterable(filters))

Perhaps then:

projected_columns = columns # this could be None so it's fine
if columns and filters:
    projected_columns = columns
    columns = sorted(set(v[0] for v in itertools.chain.from_iterable(filters)) | set(columns))

WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good point about filters being normalized here. In that case, itertools.chain makes a lot of sense.

projected_columns = columns # this could be None so it's fine

Interestingly, this line caused failures, and exposed a bug in my logic. columns can technically include fields that are used to set the index, and so we need to make sure projected_columns only includes current column names before the getitem operation at the end of this function.

Since we need extra logic to check the elements of projected_columns, I decided it probably makes more sense for the projected_columns default to be None (rather than columns).

@rjzamora rjzamora added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Jul 11, 2023
@rjzamora
Copy link
Member Author

/merge

@rapids-bot rapids-bot bot merged commit 091de4d into rapidsai:branch-23.08 Jul 11, 2023
@rjzamora rjzamora deleted the filter-and-project branch July 11, 2023 19:05
@wence-
Copy link
Contributor

wence- commented Jul 12, 2023

Thanks Rick.

rapids-bot bot pushed a commit that referenced this pull request Jul 17, 2023
…et` (#13697)

This is the dask-cudf version of #13666, which fixes the case that the `filters` argument includes column names that are not included in the `columns` argument to `cudf.read_parquet`. It turns out that we need to add the exact same fix for the dask-specific `read_parquet` code path as well.  Note that it was just an oversight to leave this out of #13666 - This is currently a dask-expressions blocker.

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #13697
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

5 - Ready to Merge Testing and reviews complete, ready to merge bug Something isn't working non-breaking Non-breaking change Python Affects Python cuDF API.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants