BUG: Filtering on columns in read_parquet with partitioned datasets working incorrectly

While debugging with @rjzamora, we found the following problem when filtering on a column that is not a partition key in a partitioned dataset (with the `pyarrow` engine):

- When you have a partitioned dataset (e.g. with a path like `/year=2020/month=7/data*.parquet`), the "dask partitions" map to single files (not row groups, even though the default is `split_row_groups=True`).
- However, the statistics are gathered per row group while the parts are files, which gives a mismatch between `parts` and `stats` at https://github.com/dask/dask/blob/4ae913b13df7c5018e24eea6ee76392a6ef59f3e/dask/dataframe/io/parquet/arrow.py#L325
- As a result, when filtering the parts (in `apply_filters`), the wrong statistics are used to decide which parts to skip (too many parts are skipped, losing data).
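To make the mismatch concrete, here is a minimal pure-Python sketch. It does not use dask; the file names, the column `x`, the min/max values and the helper `filter_by_position` are all made up for illustration, and it assumes that the filtering step pairs `parts` and `stats` by position.

```python
# Hypothetical layout: 2 files, each containing 2 row groups.
parts = [
    "year=2020/month=7/data0.parquet",
    "year=2020/month=7/data1.parquet",
]

# Per-row-group min/max for a non-partition column "x" (made-up values).
stats = [
    {"file": 0, "min": 0, "max": 9},
    {"file": 0, "min": 10, "max": 19},
    {"file": 1, "min": 20, "max": 29},
    {"file": 1, "min": 30, "max": 39},
]


def filter_by_position(parts, stats, lo):
    """Keep parts whose positionally paired stats could satisfy x >= lo.

    This is only valid when parts and stats have the same granularity.
    """
    return [part for part, stat in zip(parts, stats) if stat["max"] >= lo]


# Filter x >= 25: data1.parquet holds x in [20, 39] and should be kept.
selected = filter_by_position(parts, stats, lo=25)

# What a correct per-file decision would give.
expected = [
    part
    for i, part in enumerate(parts)
    if any(s["file"] == i and s["max"] >= lo for s in stats for lo in [25])
]

print(selected)  # both files are paired with row groups of file 0
print(expected)
```

In this toy setup the second file is paired with the statistics of the second row group of the first file, so it is dropped even though it contains matching rows.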

The easiest short-term fix is to disallow `gather_statistics=True` for a partitioned dataset. The default is already not to gather statistics for a partitioned dataset, but if you manually set `gather_statistics=True`, statistics are gathered and then used incorrectly, as described above.
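A rough sketch of what such a guard could look like. The function name `resolve_gather_statistics` and the `is_partitioned` flag are hypothetical and not part of dask; this only illustrates the proposed behaviour.

```python
def resolve_gather_statistics(gather_statistics, is_partitioned):
    """Hypothetical guard: refuse explicit statistics on partitioned datasets.

    gather_statistics may be True, False or None (None meaning "use default").
    """
    if is_partitioned:
        if gather_statistics:
            raise ValueError(
                "gather_statistics=True is not supported for partitioned "
                "datasets: statistics are per row group, parts are per file."
            )
        # Default for partitioned datasets: do not gather statistics.
        return False
    # Assumption for this sketch: gather by default when not partitioned.
    return True if gather_statistics is None else bool(gather_statistics)
```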

The longer-term fix (to actually implement filtering for partitioned datasets) would be either to combine the per-row-group statistics into per-file statistics (aggregating them), or to keep, for each file, a list of the filtered row-group indices to read.
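Both options can be sketched in plain Python. The data layout (dicts with `file`, `row_group`, `min`, `max`) and the helper names are assumptions for illustration only, not the structures dask uses internally.

```python
from collections import defaultdict

# Made-up per-row-group statistics for a non-partition column.
row_group_stats = [
    {"file": 0, "row_group": 0, "min": 0, "max": 9},
    {"file": 0, "row_group": 1, "min": 10, "max": 19},
    {"file": 1, "row_group": 0, "min": 20, "max": 29},
    {"file": 1, "row_group": 1, "min": 30, "max": 39},
]


def aggregate_stats_per_file(row_group_stats):
    """Option 1: combine row-group min/max into one min/max per file."""
    per_file = {}
    for rg in row_group_stats:
        cur = per_file.get(rg["file"])
        if cur is None:
            per_file[rg["file"]] = {
                "file": rg["file"], "min": rg["min"], "max": rg["max"]
            }
        else:
            cur["min"] = min(cur["min"], rg["min"])
            cur["max"] = max(cur["max"], rg["max"])
    # One entry per file, ordered by file index, so it lines up with parts.
    return [per_file[key] for key in sorted(per_file)]


def row_groups_to_read(row_group_stats, lo):
    """Option 2: per file, list the row groups that may contain x >= lo."""
    selected = defaultdict(list)
    for rg in row_group_stats:
        if rg["max"] >= lo:
            selected[rg["file"]].append(rg["row_group"])
    return dict(selected)


file_stats = aggregate_stats_per_file(row_group_stats)
to_read = row_groups_to_read(row_group_stats, lo=25)
```

The first option keeps one part per file and only allows skipping whole files; the second allows skipping individual row groups inside a file, at the cost of carrying the row-group indices along with each part.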

(sorry for the bug report in "words" for now, instead of a code example)

BUG: Filtering on columns in read_parquet with partitioned datasets working incorrectly #5959
