Fix combined filtering and column projection in read_parquet#13666
Fix combined filtering and column projection in read_parquet#13666rapids-bot[bot] merged 10 commits intorapidsai:branch-23.08from
read_parquet#13666Conversation
wence-
left a comment
There was a problem hiding this comment.
I think the inclusion of the filtering columns can be made a bit simpler, but otherwise logic looks good to me, thanks.
python/cudf/cudf/io/parquet.py
Outdated
| # we do NEED these columns for accurate filtering. | ||
| projection = None | ||
| if columns and filters: | ||
| filtered_columns = _filtered_columns(filters) |
There was a problem hiding this comment.
At this point you've normalized the filters into list[list[tuple]] so _filtered_columns doesn't need to recurse to flatten things, and you can just do:
filtered_columns = set(v[0] for v in itertools.chain.from_iterable(filters))
Perhaps then:
projected_columns = columns # this could be None so it's fine
if columns and filters:
projected_columns = columns
columns = sorted(set(v[0] for v in itertools.chain.from_iterable(filters)) | set(columns))WDYT?
There was a problem hiding this comment.
Ah, good point about filters being normalized here. In that case, itertools.chain makes a lot of sense.
projected_columns = columns # this could be None so it's fine
Interestingly, this line caused failures, and exposed a bug in my logic. columns can technically include fields that are used to set the index, and so we need to make sure projected_columns only includes current column names before the getitem operation at the end of this function.
Since we need extra logic to check the elements of projected_columns, I decided it probably makes more sense for the projected_columns default to be None (rather than columns).
…into filter-and-project
|
/merge |
|
Thanks Rick. |
…et` (#13697) This is the dask-cudf version of #13666, which fixes the case that the `filters` argument includes column names that are not included in the `columns` argument to `cudf.read_parquet`. It turns out that we need to add the exact same fix for the dask-specific `read_parquet` code path as well. Note that it was just an oversight to leave this out of #13666 - This is currently a dask-expressions blocker. Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #13697
Description
Follow-up to #13334 for the special case that the
filtersargument includes column names that are not included in the current column projection (i.e. thecolumnsargument). Although this pattern is not a common case at the moment, it is perfectly valid, and will become more common when cudf is used as a dask-expr backend (since the predicate-pushdown optimizations in dask-expr are significantly more advanced than those in dask-dataframe).Note:
Prior to #13334, the special case targeted by this PR would not have produced any run-time errors, but it also wouldn't have produced proper filtering in many cases. Now that
cudf.read_parquetdoes produce proper row-wise filtering, it turns out that we now need to sacrifice a bit of IO in cases like this. Although this is unfortunate, I personally feel that it is still worthwhile to enforce row-wise filtering.Checklist