Delay row-wise parquet filtering until just after IO by rjzamora · Pull Request #10478 · dask/dask

rjzamora · 2023-08-29T19:19:26Z

While investigating benchmark results in dask-expr, @phofl noticed that predicate pushdown was giving us surprisingly bad performance in some cases. Although defining filters is clearly beneficial when we can drop entire row-groups from the dataset, the row-wise filtering step that pyarrow applies at read time is much less beneficial. In fact, it is typically faster to delay the row-wise filtering step until just after IO is complete (either with pandas or pyarrow).

This PR proposes the minimal possible change needed to delay the row-wise filtering step until just after IO. Note that row-groups and hive-partitions will already have been filtered before the modified function is executed. Therefore, these changes do not require loading data that row-group or partition filtering would have excluded.
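As a rough illustration of what "filtering just after IO" means, here is a minimal sketch that applies DNF-style filters to a pandas DataFrame that is already in memory. The helper name `apply_filters_after_io` and the small operator table are assumptions made for this example; this is not the code changed in this PR, and it only covers a subset of the operators that parquet filters normally accept.

```python
import operator

import pandas as pd

# Comparison operators handled by this sketch (an assumed subset).
_OPS = {
    "==": operator.eq,
    "=": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def apply_filters_after_io(df, filters):
    """Apply DNF-style filters to a DataFrame that was already read.

    ``filters`` is either a list of (column, op, value) tuples (a single
    conjunction) or a list of such lists (a disjunction of conjunctions).
    Hypothetical helper, for illustration only.
    """
    if not filters:
        return df
    # Normalize a flat list of tuples into a one-element disjunction
    if isinstance(filters[0], tuple):
        filters = [filters]

    mask = pd.Series(False, index=df.index)
    for conjunction in filters:
        and_mask = pd.Series(True, index=df.index)
        for column, op, value in conjunction:
            if op == "in":
                and_mask &= df[column].isin(value)
            elif op == "not in":
                and_mask &= ~df[column].isin(value)
            else:
                and_mask &= _OPS[op](df[column], value)
        mask |= and_mask
    return df[mask]


# Stand-in for a partition that was just read from parquet
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": ["a", "b", "a", "b"]})
result = apply_filters_after_io(df, [("x", ">", 1), ("y", "==", "a")])
```

In this sketch the row-group and partition pruning would still happen before the read; only the row-wise mask is computed afterwards.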

phofl

Looks good generally, I'll do some tests as well

phofl · 2023-08-30T08:28:43Z

dask/dataframe/io/parquet/arrow.py


+            io_filters = None
+            if filters:
+                # Only apply row-wise filters at IO time if we are


Are you sure about this part?

rjzamora · Aug 30, 2023

I'm hoping this case is somewhat rare. It means that we are filtering on 1+ columns that will not be included in the current column projection. Therefore, in order to delay filtering until after IO, we would need to read in those extra columns, apply filters, and then drop them after IO. Doing that may be worth it, but I figured we could hold off on that.
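The decision described above could be sketched roughly as follows. The function names `filter_columns` and `split_filters` are hypothetical, and the sketch assumes filters are given as (column, op, value) tuples in DNF form; it is not the code from `arrow.py`.

```python
def filter_columns(filters):
    """Collect the column names referenced by DNF-style filters."""
    if not filters:
        return set()
    # Normalize a flat list of tuples into a one-element disjunction
    if isinstance(filters[0], tuple):
        filters = [filters]
    return {
        column
        for conjunction in filters
        for column, _op, _value in conjunction
    }


def split_filters(filters, columns):
    """Return ``(io_filters, post_io_filters)`` for one read.

    If every filtered column is part of the projection ``columns``
    (or no projection is set), row-wise filtering is deferred until
    after IO. Otherwise the filters stay at IO time, because the
    filtered column would not be available after the read.
    """
    if not filters:
        return None, None
    if columns is not None and not filter_columns(filters) <= set(columns):
        # Filtering on a column outside the projection: keep it at IO time
        return filters, None
    return None, filters


io_filters, post_io_filters = split_filters([("x", ">", 1)], ["x", "y"])
```

The alternative mentioned above (reading the extra columns, filtering, then dropping them) would remove the IO-time branch at the cost of reading more data.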

rjzamora · 2023-10-31T14:42:55Z

@phofl - This does not affect dask-expr much anymore, but my understanding is that this change still improves performance when filters=... is explicitly set by the user. Should we try to get this in?

rjzamora added 2 commits August 29, 2023 12:03

avoid pyarrow filtering at IO time - filtering the table just after IO is faster

b4d57db

formatting

39aa463

rjzamora added dataframe parquet labels Aug 29, 2023

rjzamora requested a review from phofl August 29, 2023 19:19

rjzamora self-assigned this Aug 29, 2023

github-actions bot added the io label Aug 29, 2023

rjzamora added 2 commits August 29, 2023 18:13

fix dropped column problem

ffa3c8f

fix formatting

ff7c41e

phofl reviewed Aug 30, 2023

rjzamora mentioned this pull request Sep 20, 2023

Remove predicate pushdown from read_parquet dask/dask-expr#305

Merged

github-actions bot added the needs attention label Oct 2, 2023

github-actions bot removed the needs attention label Jan 15, 2024

rjzamora wants to merge 4 commits into dask:main from rjzamora:avoid-io-filters

2 participants
