Parquet statistics dropped due to non-numeric filename sort in dirs w/ ≥10 partitions #7248

Some `read_parquet` code paths sort part files lexicographically instead of using [`natural_sort_key`](https://github.com/dask/dask/blob/656028e1b88cf76a77a1eb491e301d710d0fbeac/dask/utils.py#L1165) (numeric sort), which produces an order like:
```
part.0.parquet
part.1.parquet
part.10.parquet
part.11.parquet
part.12.parquet
part.2.parquet
part.3.parquet
part.4.parquet
part.5.parquet
part.6.parquet
part.7.parquet
part.8.parquet
part.9.parquet
```
If there is no `_metadata` file that was written alongside the part files in the same order, `read_parquet` can wrongly conclude that a column is not sorted and give up on gathering partition statistics. That in turn leaves `divisions` unset when they should have been populated.
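To make the failure mode concrete, here is a minimal, dask-free sketch. `natural_key` is a stand-in written for this example (it is not dask's `natural_sort_key`, though it is meant to behave similarly), and the per-file min/max statistics and the `looks_sorted` check are hypothetical simplifications, not the actual logic in `read_parquet`.

```python
import re


def natural_key(name):
    """Split out digit runs so they compare as integers (illustrative stand-in)."""
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", name)]


files = [f"part.{i}.parquet" for i in range(13)]

lexicographic = sorted(files)              # plain string sort
natural = sorted(files, key=natural_key)   # numeric-aware sort

# Hypothetical (min, max) statistics for a sorted column:
# file i is assumed to hold values in [10*i, 10*i + 9].
stats = {f"part.{i}.parquet": (10 * i, 10 * i + 9) for i in range(13)}


def looks_sorted(order):
    """True if every file's min is greater than the previous file's max."""
    bounds = [stats[name] for name in order]
    return all(prev[1] < cur[0] for prev, cur in zip(bounds, bounds[1:]))


print(lexicographic[:4])
print(looks_sorted(lexicographic))
print(looks_sorted(natural))
```

With the string sort, `part.12.parquet` is immediately followed by `part.2.parquet`, so the min/max ranges go backwards and the column looks unsorted even though the data on disk is fine.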

This issue is triggered in [`dask/dataframe/io/tests/test_parquet.py::test_read_dir_nometa`](https://github.com/dask/dask/blob/656028e1b88cf76a77a1eb491e301d710d0fbeac/dask/dataframe/io/tests/test_parquet.py#L2231), but [the relevant `assert_eq` passes `check_divisions=False`](https://github.com/dask/dask/blob/656028e1b88cf76a77a1eb491e301d710d0fbeac/dask/dataframe/io/tests/test_parquet.py#L2243), so the test does not catch it.

I should have a fix incoming momentarily.