
Custom sort key for dataframe.read_parquet with multiple paths as input #10045

@alexgorban


dask.dataframe.read_parquet accepts a list of paths as input. Under the hood it sorts the paths with natural_sort_key in several places. In my case the parquet files are not in natural order, and I'd like to sort them myself and have read_parquet keep that order, so I monkey-patch it like this to make it work:

with (mock.patch.object(dask.dataframe.io.parquet.core, 'natural_sort_key', new=lambda p: p),
      mock.patch.object(dask.dataframe.io.parquet.utils, 'natural_sort_key', new=lambda p: p)):    
    return dd.read_parquet(paths, index='index', calculate_divisions=True)

It would be great to have a parameter that controls the sort key (either disabling natural_sort_key or supplying my own).
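To make the ordering difference concrete, here is a minimal pure-Python sketch. It does not call dask: `natural_key` is a hypothetical stand-in for a natural sort key (digit runs compared as integers), the file names are made up, and it assumes the key is simply passed to `sorted`. One thing it suggests is that, under that assumption, replacing the key with `lambda p: p` would give plain lexicographic order rather than the caller's original order.

```python
import re

def natural_key(path):
    # Hypothetical stand-in for a natural sort key:
    # runs of digits are compared as integers, everything else as text.
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", path)]

# Made-up file names, deliberately listed in a custom (non-sorted) order.
paths = ["part.10.parquet", "part.2.parquet", "part.1.parquet"]

# Natural order: numeric parts sort by value.
natural_order = sorted(paths, key=natural_key)

# Identity key: plain string comparison, i.e. lexicographic order.
identity_order = sorted(paths, key=lambda p: p)

print(natural_order)
print(identity_order)
```

In this sketch neither result matches the order of `paths` as given, which is why a parameter that disables sorting entirely (rather than only swapping the key) may be what is needed to preserve an externally chosen order.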

Labels: dataframe, io, needs attention, parquet
