Description
Is your feature request related to a problem or challenge?
@appletreeisyellow added PruningStatistics::row_counts() in #9223 which allows better pruning of columns which are all null.
However, I believe we have not hooked that API up into the ParquetExec, so it won't prune row groups based on this information.
For example, if column `a` is all NULL, a predicate `a > 5` can never be true, but the ParquetExec won't be able to prune row groups or pages in this case.
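To illustrate the check this enables, here is a minimal sketch in plain Rust. `ColumnStats` and `can_prune_all_null` are hypothetical names made up for this example, not DataFusion types; the real pruning logic works on Arrow arrays via `PruningStatistics`.

```rust
/// Hypothetical per-container statistics for one column
/// (illustrative only, not a DataFusion type).
#[derive(Debug, Clone, Copy)]
struct ColumnStats {
    /// Number of NULL values in the container, if known.
    null_count: Option<u64>,
    /// Total number of rows in the container, if known.
    row_count: Option<u64>,
}

/// Returns true when the container is known to hold only NULLs,
/// so a comparison such as `a > 5` cannot evaluate to true for any row.
/// If either statistic is missing, we must be conservative and keep the container.
fn can_prune_all_null(stats: ColumnStats) -> bool {
    match (stats.null_count, stats.row_count) {
        (Some(nulls), Some(rows)) => nulls == rows,
        _ => false,
    }
}
```

Without the row count, a null count alone cannot show that a column is entirely NULL, which is why the missing `row_counts` implementation blocks this pruning.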
Describe the solution you'd like
Implement RowGroupPruningStatistics::row_counts
And PagesPruningStatistics::row_counts
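The shape of the row group implementation might look roughly like the sketch below. It uses a simplified stand-in trait (`RowCounts`) and a hypothetical `RowGroupMeta` struct rather than the real `PruningStatistics` trait and parquet metadata types, and it returns a plain `Vec<u64>` where the real API returns an Arrow array. The idea is one row count per container, or `None` if the counts are not usable.

```rust
/// Hypothetical stand-in for a row group's metadata (illustrative only).
struct RowGroupMeta {
    /// Row count as stored in the file; signed in this sketch.
    num_rows: i64,
}

/// Simplified stand-in for the statistics API: one entry per container.
trait RowCounts {
    fn row_counts(&self) -> Option<Vec<u64>>;
}

/// Hypothetical wrapper over the row groups being considered for pruning.
struct RowGroupStats<'a> {
    row_groups: &'a [RowGroupMeta],
}

impl RowCounts for RowGroupStats<'_> {
    fn row_counts(&self) -> Option<Vec<u64>> {
        // Collecting into Option<Vec<_>> yields None if any count is negative,
        // so callers fall back to "unknown" instead of using a bad value.
        self.row_groups
            .iter()
            .map(|rg| u64::try_from(rg.num_rows).ok())
            .collect()
    }
}
```

The page-level version would follow the same pattern with one entry per data page instead of one per row group.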
Describe alternatives you've considered
I think the row counts can be found on https://docs.rs/parquet/latest/parquet/format/struct.ColumnMetaData.html
So this ticket should be a matter of copying the row counts correctly and writing some tests in https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/parquet/row_group_pruning.rs / https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/parquet/page_pruning.rs
Additional context
No response