Fix combined filtering and column projection in `read_parquet` by rjzamora · Pull Request #13666 · rapidsai/cudf

rjzamora · 2023-07-06T19:28:55Z

Description

Follow-up to #13334 for the special case that the filters argument includes column names that are not included in the current column projection (i.e. the columns argument). Although this pattern is not a common case at the moment, it is perfectly valid, and will become more common when cudf is used as a dask-expr backend (since the predicate-pushdown optimizations in dask-expr are significantly more advanced than those in dask-dataframe).

Note:
Prior to #13334, the special case targeted by this PR would not have produced any run-time errors, but it also wouldn't have produced proper filtering in many cases. Now that cudf.read_parquet does produce proper row-wise filtering, it turns out that we now need to sacrifice a bit of IO in cases like this. Although this is unfortunate, I personally feel that it is still worthwhile to enforce row-wise filtering.
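The IO trade-off described in the note can be illustrated with a minimal pure-Python sketch. This is not the cudf implementation: the in-memory `TABLE`, the `read_rows` helper, and its parameters are hypothetical names introduced only for illustration. The point is that a column used only for filtering still has to be read before it can be dropped from the result.

```python
# Hypothetical in-memory "file": each row maps column name -> value.
TABLE = [
    {"a": 1, "b": 10, "c": "x"},
    {"a": 2, "b": 20, "c": "y"},
    {"a": 3, "b": 30, "c": "z"},
]


def read_rows(table, columns, predicate_column, threshold):
    """Illustrative only: keep rows where predicate_column > threshold,
    then return just the requested columns."""
    # Extra IO: the predicate column must be read even if it was not requested.
    io_columns = sorted(set(columns) | {predicate_column})
    loaded = [{name: row[name] for name in io_columns} for row in table]
    # Row-wise filtering happens on the loaded data.
    kept = [row for row in loaded if row[predicate_column] > threshold]
    # Project back down to the columns the caller asked for.
    projected = [{name: row[name] for name in columns} for row in kept]
    return io_columns, projected


io_columns, result = read_rows(
    TABLE, columns=["a"], predicate_column="b", threshold=15
)
```

Here the caller asks only for `a`, but `b` is also read so that the filter can be applied row by row; `b` is dropped again before the result is returned.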

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

wence-

I think the inclusion of the filtering columns can be made a bit simpler, but otherwise logic looks good to me, thanks.

wence- · 2023-07-10T10:18:51Z

python/cudf/cudf/io/parquet.py

+    # we do NEED these columns for accurate filtering.
+    projection = None
+    if columns and filters:
+        filtered_columns = _filtered_columns(filters)


At this point you've normalized the filters into list[list[tuple]] so _filtered_columns doesn't need to recurse to flatten things, and you can just do:

filtered_columns = set(v[0] for v in itertools.chain.from_iterable(filters))

Perhaps then:

projected_columns = columns  # this could be None so it's fine
if columns and filters:
    projected_columns = columns
    columns = sorted(set(v[0] for v in itertools.chain.from_iterable(filters)) | set(columns))

WDYT?

rjzamora · Jul 10, 2023

Ah, good point about filters being normalized here. In that case, itertools.chain makes a lot of sense.

projected_columns = columns # this could be None so it's fine

Interestingly, this line caused failures, and exposed a bug in my logic. columns can technically include fields that are used to set the index, and so we need to make sure projected_columns only includes current column names before the getitem operation at the end of this function.

Since we need extra logic to check the elements of projected_columns, I decided it probably makes more sense for the projected_columns default to be None (rather than columns).
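The outcome of this exchange can be sketched roughly as follows. This is a hedged, standalone illustration rather than the merged code: `plan_projection` and `select_projected` are hypothetical helper names, and the filters are assumed to be already normalized to `list[list[tuple]]` as described above.

```python
import itertools


def plan_projection(columns, filters):
    """Hypothetical helper: widen `columns` so every filtered column is read.

    `filters` is assumed to be normalized to list[list[tuple]], where each
    tuple looks like (column_name, operator, value).
    """
    projected_columns = None  # default: no extra projection step is needed
    if columns and filters:
        projected_columns = list(columns)
        filter_columns = {
            pred[0] for pred in itertools.chain.from_iterable(filters)
        }
        columns = sorted(filter_columns | set(columns))
    return columns, projected_columns


def select_projected(frame_columns, projected_columns):
    """Hypothetical helper: names to keep in the final getitem step.

    Entries of `projected_columns` that are not current column names
    (for example, fields that were used to set the index) are skipped.
    """
    if projected_columns is None:
        return list(frame_columns)
    return [name for name in projected_columns if name in frame_columns]


filters = [[("b", ">", 15)], [("c", "==", "x"), ("a", "<", 3)]]
read_columns, projected = plan_projection(["a", "idx"], filters)
final_columns = select_projected(["a", "b", "c"], projected)
```

In this example `idx` stands in for a field that ends up in the index, so it is requested in `columns` but is not among the frame's column names when the final selection happens.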


rjzamora · 2023-07-11T19:05:51Z

/merge

wence- · 2023-07-12T07:51:01Z

Thanks Rick.

…et` (#13697)

This is the dask-cudf version of #13666, which fixes the case that the `filters` argument includes column names that are not included in the `columns` argument to `cudf.read_parquet`. It turns out that we need to add the exact same fix for the dask-specific `read_parquet` code path as well.

Note that it was just an oversight to leave this out of #13666 - This is currently a dask-expressions blocker.

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- Lawrence Mitchell (https://github.com/wence-)

URL: #13697

rjzamora added 2 commits July 6, 2023 11:27

enable filtering on columns outside the current column projection

9576c64

tweak comment

393a7bf

rjzamora added the bug (Something isn't working), 2 - In Progress (Currently a work in progress), Python (Affects Python cuDF API), and non-breaking (Non-breaking change) labels Jul 6, 2023

rjzamora self-assigned this Jul 6, 2023

rjzamora requested a review from a team as a code owner July 6, 2023 19:28

rjzamora requested review from bdice and mroeschke July 6, 2023 19:28

Merge branch 'branch-23.08' into filter-and-project

0e83825

rjzamora mentioned this pull request Jul 6, 2023

Support cudf as a DataFrame backend dask/dask-expr#212

Merged

rjzamora added the 3 - Ready for Review (Ready for review by team) label and removed the 2 - In Progress (Currently a work in progress) label Jul 6, 2023

rjzamora added this to the Parquet continuous improvement milestone Jul 6, 2023

wence- approved these changes Jul 10, 2023


rjzamora added 7 commits July 10, 2023 12:50

Merge branch 'branch-23.08' into filter-and-project

0dfa0dc

use itertools

7d6e2ec

Merge branch 'branch-23.08' into filter-and-project

6eea303

Merge branch 'branch-23.08' into filter-and-project

3aec9a5

ensure list

a74184c

Merge branch 'filter-and-project' of https://github.com/rjzamora/cudf …

df3ff01

…into filter-and-project

split out list comprehension and use _column_names

356044e

rjzamora added the 5 - Ready to Merge (Testing and reviews complete, ready to merge) label and removed the 3 - Ready for Review (Ready for review by team) label Jul 11, 2023

rapids-bot bot merged commit 091de4d into rapidsai:branch-23.08 Jul 11, 2023

rjzamora deleted the filter-and-project branch July 11, 2023 19:05

rjzamora mentioned this pull request Jul 14, 2023

Fix combined filtering and column projection in dask_cudf.read_parquet #13697

Merged

3 tasks

rapids-bot[bot] merged 10 commits into rapidsai:branch-23.08 from rjzamora:filter-and-project
