From debugging with @rjzamora, we see the following problem with filtering on a column (not on a partition key) with a partitioned dataset (with the pyarrow engine):
- When you have a partitioned dataset (e.g. with a path like /year=2020/month=7/data*.parquet), the "dask partitions" map to single files (and not to row groups, even though the default is split_row_groups=True).
- However, the statistics are gathered per row group while the parts are files, giving a mismatch between parts and stats at dask/dask/dataframe/io/parquet/arrow.py, line 325 in 4ae913b: return (meta, stats, parts)
- That means that when filtering the parts (in apply_filters), the wrong statistics are used to decide which parts to skip (too many parts are skipped, losing data).
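
To make the mismatch concrete, here is a minimal pure-Python sketch. It is not dask's actual code: the toy statistics, the file names, and the naive_filter helper are hypothetical, and it assumes the filter pairs parts and statistics by position.

```python
# Hypothetical toy data: 2 files with 2 row groups each, min/max for a column "x".
# One statistics entry per ROW GROUP ...
row_group_stats = [
    {"file": "year=2020/month=7/data0.parquet", "min": 0, "max": 9},
    {"file": "year=2020/month=7/data0.parquet", "min": 10, "max": 19},
    {"file": "year=2020/month=8/data0.parquet", "min": 20, "max": 29},
    {"file": "year=2020/month=8/data0.parquet", "min": 30, "max": 39},
]
# ... but one part per FILE.
parts = [
    "year=2020/month=7/data0.parquet",
    "year=2020/month=8/data0.parquet",
]


def naive_filter(parts, stats, lo):
    """Keep parts[i] if stats[i] could contain values >= lo.

    Assumption for illustration: parts and stats are paired by position,
    which is only valid when there is exactly one stats entry per part.
    """
    return [part for part, s in zip(parts, stats) if s["max"] >= lo]


# Filter "x >= 25": the month=8 file holds values 20..39 and should be kept,
# but it gets paired with the statistics of the SECOND row group of month=7.
kept = naive_filter(parts, row_group_stats, lo=25)
print(kept)
```

In this sketch both parts are dropped, although the second file contains matching rows.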
The easiest short-term fix is to disallow gather_statistics=True for a partitioned dataset. (By default, statistics are not gathered for a partitioned dataset; but if you manually set gather_statistics=True they are gathered, and then used incorrectly as described above.)
The longer-term fix (to actually implement filtering for partitioned datasets) would be either to combine the statistics per row group into statistics per file (aggregate the statistics), or to keep a list of the filtered row-group indices to read per file.
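
Both longer-term options can be sketched in plain Python. This is only an illustration under assumptions: the statistics layout and the helper names aggregate_per_file and row_groups_to_read are hypothetical and not part of dask. It also ignores missing statistics and null counts, which a real implementation would have to handle.

```python
# Hypothetical per-row-group statistics for a column "x".
row_group_stats = [
    {"file": "month=7/data0.parquet", "row_group": 0, "min": 0, "max": 9},
    {"file": "month=7/data0.parquet", "row_group": 1, "min": 10, "max": 19},
    {"file": "month=8/data0.parquet", "row_group": 0, "min": 20, "max": 29},
    {"file": "month=8/data0.parquet", "row_group": 1, "min": 30, "max": 39},
]


def aggregate_per_file(stats):
    """Option 1: merge row-group min/max into one entry per file."""
    per_file = {}
    for rg in stats:
        agg = per_file.setdefault(
            rg["file"], {"file": rg["file"], "min": rg["min"], "max": rg["max"]}
        )
        agg["min"] = min(agg["min"], rg["min"])
        agg["max"] = max(agg["max"], rg["max"])
    # dicts preserve insertion order, so files come out in first-seen order
    return list(per_file.values())


def row_groups_to_read(stats, lo):
    """Option 2: per file, list the row-group indices that may match x >= lo."""
    selected = {}
    for rg in stats:
        if rg["max"] >= lo:
            selected.setdefault(rg["file"], []).append(rg["row_group"])
    return selected


# Option 1: one statistics entry per file, so it lines up with file-based parts.
file_stats = aggregate_per_file(row_group_stats)
kept_files = [s["file"] for s in file_stats if s["max"] >= 35]

# Option 2: finer-grained, only the matching row groups of each file are read.
selected = row_groups_to_read(row_group_stats, lo=35)
```

The first option is simpler but can only skip whole files; the second keeps the row-group granularity at the cost of carrying an index list with each part.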
(sorry for the bug report in "words" for now, instead of a code example)