
Parquet statistics dropped due to non-numeric filename sort in dirs w/ ≥10 partitions #7248

@ryan-williams

Some read_parquet code paths sort part files lexicographically instead of using natural_sort_key (numeric sort), which produces an order like:

part.0.parquet
part.1.parquet
part.10.parquet
part.11.parquet
part.12.parquet
part.2.parquet
part.3.parquet
part.4.parquet
part.5.parquet
part.6.parquet
part.7.parquet
part.8.parquet
part.9.parquet
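The ordering above can be reproduced with a plain `sorted()` call. The snippet below is a minimal sketch that contrasts it with a numeric-aware sort; `natural_key` is a hypothetical stand-in written for this illustration, not dask's `natural_sort_key`.

```python
import re

def natural_key(name):
    # Hypothetical helper: split into digit and non-digit runs and
    # compare the digit runs as integers rather than as strings.
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", name)]

files = [f"part.{i}.parquet" for i in range(13)]

# Default string sort: "part.10" lands before "part.2".
lexicographic = sorted(files)

# Numeric-aware sort: part.0, part.1, part.2, ..., part.12.
natural = sorted(files, key=natural_key)
```

With 10 or fewer part files (part.0 through part.9) the two orders coincide, which is why the problem only shows up in larger directories.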

When there is no _metadata file written alongside the part files in the same order, read_parquet can wrongly conclude that a given column is not sorted and bail out of gathering partition statistics. As a result, divisions that should be populated end up missing.
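To illustrate why file order matters here, the sketch below runs a simplified sortedness check over per-file min/max statistics. The statistics are made up for the example (part i is assumed to hold index values 10*i through 10*i + 9), and `looks_sorted` is a hypothetical check, not dask's actual implementation.

```python
import re

# Hypothetical per-file statistics: part i holds index values 10*i .. 10*i + 9.
stats_by_file = {
    f"part.{i}.parquet": {"min": 10 * i, "max": 10 * i + 9} for i in range(13)
}

def natural_key(name):
    # Hypothetical helper: compare digit runs as integers.
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", name)]

def looks_sorted(file_order):
    # Simplified check: each file's max must not exceed the next file's min.
    stats = [stats_by_file[name] for name in file_order]
    return all(prev["max"] <= nxt["min"] for prev, nxt in zip(stats, stats[1:]))

# Lexicographic order puts part.12 (max 129) right before part.2 (min 20),
# so the column appears unsorted even though the data is sorted.
lexicographic_ok = looks_sorted(sorted(stats_by_file))

# Numeric order visits the files in the order they were written.
natural_ok = looks_sorted(sorted(stats_by_file, key=natural_key))
```

Under these assumptions the check fails only because of the order in which the files are visited, not because of anything in the data itself.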

This issue occurs in dask/dataframe/io/tests/test_parquet.py::test_read_dir_nometa, but check_divisions=False in the relevant assert_eq call causes the test to miss it.

I should have a fix incoming momentarily.
