Skip to content

BUG: Filtering on columns in read_parquet with partitioned datasets working incorrectly #5959

@jorisvandenbossche

Description

@jorisvandenbossche

From debugging with @rjzamora, we see the following problem with filtering on a column (not on a partition key) with a partitioned dataset (with the pyarrow engine):

  • When you have a partitioned dataset (so eg with a path like /year=2020/month=7/data*.parquet), then the "dask partitions" map to single files (and not row groups, although the default is split_row_groups=True).
  • But, the statistics are gathered per row group, while the parts are thus files, giving a mismatch between parts and stats at
    return (meta, stats, parts)
    )
  • That means that when filtering the parts (in apply_filters), wrong statistics are used for deciding on skipping the parts (too many parts are skipped, loosing data)

The easiest short term fix is to disallow gather_statistics=True when having a partitioned dataset (the default is actually to not gather statistics in a partitioned dataset, but when you manually set gather_statistics=True, it will do that, but then incorrectly as described above).

The longer term fix (to actually implement filtering for partitioned datasets) would be to either combine the statistics per row group into statistics per file (aggregate the statistics), or otherwise keep a list of filtered row_groups indices to read per file.

(sorry for the bug report in "words" for now, instead of a code example)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions