Is your feature request related to a problem? Please describe.
It would be great if filter operations in a SQL query could be pushed down to the IO layer for formats like parquet that support filtering based on statistics/metadata.
Describe the solution you'd like
There are a couple of ways this could be tackled:
- Implement this as part of the graph-optimization step within Dask HLGs, where Dask would automatically push the filter steps down to the IO layer. This has been mentioned in a few dask issues/discussions (High Level Expressions dask/dask#7933 and Add filter optimization for remote parquet dask/dask#7090 (comment)), so there might be value in solving it at that level, since the feature could benefit others in the Dask community as well.
- Implement this by adding custom rule-based optimizations during the planning stage that push the filters as close to the table scan step as possible, along with logic that passes the filter predicates down to the IO stage for formats that support this. There has been some initial mention of this approach as well in Optimization: Go from general logical plans to dask-specific plans? #183.
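To make the second option concrete, here is a toy sketch of a rule-based pushdown pass. This is not dask-sql's actual planner: the `Scan`/`Projection`/`Filter` classes and the `push_filters` rule are invented for illustration. The idea is to walk the logical plan and move each filter below intervening operators until it reaches the table scan, where supported predicates can be handed to the IO layer:

```python
# Toy logical-plan nodes and a pushdown rule (illustrative only; not
# dask-sql's real plan representation).
from dataclasses import dataclass, field


@dataclass
class Scan:
    table: str
    # Predicates that the parquet/ORC reader can evaluate against
    # row-group statistics, e.g. ("year", "==", 2021).
    io_filters: list = field(default_factory=list)


@dataclass
class Projection:
    child: object
    columns: list


@dataclass
class Filter:
    child: object
    predicate: tuple


def push_filters(plan):
    """Recursively move Filter nodes toward the Scan."""
    if isinstance(plan, Filter):
        child = push_filters(plan.child)
        if isinstance(child, Scan):
            # The predicate reached the scan: absorb it as an IO-level filter.
            child.io_filters.append(plan.predicate)
            return child
        if isinstance(child, Projection) and plan.predicate[0] in child.columns:
            # The projection keeps the filtered column, so the filter can
            # safely be swapped below it; keep pushing from there.
            child.child = push_filters(Filter(child.child, plan.predicate))
            return child
        # Cannot push further; leave the filter where it is.
        return Filter(child, plan.predicate)
    if isinstance(plan, Projection):
        plan.child = push_filters(plan.child)
    return plan


plan = Filter(Projection(Scan("events"), ["year", "value"]),
              ("year", "==", 2021))
optimized = push_filters(plan)
# The filter now sits on the Scan as an IO-level predicate.
```

A real implementation would also need to handle joins, aggregations, and predicates that only partially translate to the IO layer, but the shape of the rule is the same.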
Describe alternatives you've considered
While true predicate pushdown is not possible within dask-sql today, an alternative is to pass filters while reading parquet/ORC datasets with the Dask DataFrame API; users could then create tables from those dataframes.
Curious to know what others think and whether they have a preference between the two approaches. I'm personally leaning towards the first option, since it has the potential to benefit other Dask users as well, but I'm not sure how complex the implementation would be.
cc: @rjzamora, who is probably more familiar than I am with parquet/ORC in Dask and the HLG work there.