Prune columns / pages that are all null in ParquetExec by connecting up row_counts in pruning statistics #9961

@alamb

Is your feature request related to a problem or challenge?

@appletreeisyellow added PruningStatistics::row_counts() in #9223 which allows better pruning of columns which are all null.

However, I believe we have not hooked that API up into the ParquetExec, so it won't prune row groups based on this information.

For example, if column `a` is all NULL, a predicate `a > 5` can never be true, but the ParquetExec won't be able to prune row groups or pages for this case.
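To make the reasoning concrete, here is a minimal, dependency-free sketch of the check. `ContainerStats`, `is_all_null`, and `can_prune_for_comparison` are hypothetical names used only for illustration; they are not DataFusion APIs, and the sketch is not the actual pruning implementation.

```rust
/// Hypothetical stand-in for the statistics of one container
/// (a row group or a page) for a single column.
#[derive(Debug, Clone, Copy)]
struct ContainerStats {
    /// Number of NULL values, if known.
    null_count: Option<u64>,
    /// Total number of rows, if known.
    row_count: Option<u64>,
}

/// Returns true only when both counts are known and every row is NULL.
/// If either statistic is missing we must be conservative and say "no".
fn is_all_null(stats: ContainerStats) -> bool {
    match (stats.null_count, stats.row_count) {
        (Some(nulls), Some(rows)) => nulls == rows,
        _ => false,
    }
}

/// A comparison such as `a > 5` evaluates to NULL (never true) for a NULL
/// input, so a container whose column is entirely NULL can be skipped.
fn can_prune_for_comparison(stats: ContainerStats) -> bool {
    is_all_null(stats)
}
```

Without a row count, the null count alone cannot establish that a container is all NULL, which is why the sketch refuses to prune when `row_count` is `None`.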

Describe the solution you'd like

Implement RowGroupPruningStatistics::row_counts

https://github.com/apache/arrow-datafusion/blob/2dad90425bacb98a3c2a4214faad53850c93104e/datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs#L345-L347

And PagesPruningStatistics::row_counts

https://github.com/apache/arrow-datafusion/blob/2dad90425bacb98a3c2a4214faad53850c93104e/datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs#L550-L552

Describe alternatives you've considered

I think the row counts can be found on https://docs.rs/parquet/latest/parquet/format/struct.ColumnMetaData.html

So this ticket should be a matter of copying the row counts correctly and writing some tests in https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/parquet/row_group_pruning.rs / https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/parquet/page_pruning.rs
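As a rough illustration of the "copying the row counts" step, the sketch below collects one row count per row group. It assumes the count is available as a signed 64-bit integer, as in the Parquet file metadata; `RowGroupMeta` and this free-standing `row_counts` function are hypothetical stand-ins, not the parquet or DataFusion types, and the real method would return an Arrow array rather than a `Vec`.

```rust
/// Hypothetical stand-in for a row group's metadata; only the row count
/// is modeled here.
struct RowGroupMeta {
    num_rows: i64,
}

/// Produces one entry per row group, in order. A negative count is treated
/// as unknown (`None`) rather than being cast to a huge unsigned value.
fn row_counts(row_groups: &[RowGroupMeta]) -> Vec<Option<u64>> {
    row_groups
        .iter()
        .map(|rg| {
            if rg.num_rows >= 0 {
                Some(rg.num_rows as u64)
            } else {
                None
            }
        })
        .collect()
}
```

A test in the spirit of the files linked above would then check that a row group whose null count equals its row count is skipped for a predicate like `a > 5`.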

Additional context

No response

Labels: enhancement, help wanted
