Skip to content

Delay row-wise parquet filtering until just after IO#10478

Open
rjzamora wants to merge 4 commits intodask:mainfrom
rjzamora:avoid-io-filters
Open

Delay row-wise parquet filtering until just after IO#10478
rjzamora wants to merge 4 commits intodask:mainfrom
rjzamora:avoid-io-filters

Conversation

@rjzamora
Copy link
Copy Markdown
Member

While investigating benchmark results in dask-expr, @phofl noticed that predicate pushdown was giving us surprisingly bad performance in some cases. Although defining filters is clearly beneficial when we can drop entire row-groups from the dataset, the row-wise filtering step that pyarrow applies at read time is much less beneficial. In fact, it is typically faster to delay the row-wise filtering step until just after IO is complete (either with pandas or pyarrow).

This PR proposes the minimal possible change needed to delay the row-wise filtering step until just after IO. Note that row-groups and hive-partitions will already have been filtered before the modified function is executed. Therefore, these changes do not mean that we will need data to be in memory for filtering.

@rjzamora rjzamora requested a review from phofl August 29, 2023 19:19
@rjzamora rjzamora self-assigned this Aug 29, 2023
@github-actions github-actions bot added the io label Aug 29, 2023
Copy link
Copy Markdown
Collaborator

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good generally, I'll do some tests as well


io_filters = None
if filters:
# Only apply row-wise filters at IO time if we are
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure about this part?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm hoping this case is somewhat rare. It means that we are filtering on 1+ columns that will not included in the current column projection. Therefore, in order to delay filtering until after IO, we would need to read in those extra columns, apply filters, and then drop them after IO. Doing that may be worth it, but I figured we could hold off on that.

@github-actions github-actions bot added the needs attention It's been a while since this was pushed on. Needs attention from the owner or a maintainer. label Oct 2, 2023
@rjzamora
Copy link
Copy Markdown
Member Author

@phofl - This does not effect dask-expr much anymore, but my understanding is that this change still improves performance when filters=... is explicitly set by the user. Should we try to get this in?

@github-actions github-actions bot removed the needs attention It's been a while since this was pushed on. Needs attention from the owner or a maintainer. label Jan 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants