Delay row-wise parquet filtering until just after IO#10478
Delay row-wise parquet filtering until just after IO#10478
Conversation
phofl
left a comment
There was a problem hiding this comment.
Looks good generally, I'll do some tests as well
|
|
||
| io_filters = None | ||
| if filters: | ||
| # Only apply row-wise filters at IO time if we are |
There was a problem hiding this comment.
Are you sure about this part?
There was a problem hiding this comment.
I'm hoping this case is somewhat rare. It means that we are filtering on 1+ columns that will not included in the current column projection. Therefore, in order to delay filtering until after IO, we would need to read in those extra columns, apply filters, and then drop them after IO. Doing that may be worth it, but I figured we could hold off on that.
|
@phofl - This does not effect dask-expr much anymore, but my understanding is that this change still improves performance when |
While investigating benchmark results in dask-expr, @phofl noticed that predicate pushdown was giving us surprisingly bad performance in some cases. Although defining
filtersis clearly beneficial when we can drop entire row-groups from the dataset, the row-wise filtering step that pyarrow applies at read time is much less beneficial. In fact, it is typically faster to delay the row-wise filtering step until just after IO is complete (either with pandas or pyarrow).This PR proposes the minimal possible change needed to delay the row-wise filtering step until just after IO. Note that row-groups and hive-partitions will already have been filtered before the modified function is executed. Therefore, these changes do not mean that we will need data to be in memory for filtering.