dask DataFrame.read_parquet doesn't preserve the paths order #8829

@rajeee

Description

What happened:
Dask does not honor the order of the files listed in the paths passed to `read_parquet`.

What you expected to happen:
I would have expected the partition ordering to be consistent with the list of paths given to `read_parquet`. It seems the input paths are sorted internally, with no option to disable this.

Minimal Complete Verifiable Example:

import dask.dataframe as dd
import pandas as pd
df1 = pd.DataFrame({'i': [1, 1], 'A': [1.0, 2.0], 'B': [11.0, 12.0]})
df1.to_parquet("df1.parquet")
df2 = pd.DataFrame({'i': [0, 0], 'A': [3.0, 4.0], 'B': [13.0, 14.0]})
df2.to_parquet("df2.parquet")
forward_df = dd.read_parquet(["df1.parquet", "df2.parquet"], engine='pyarrow')
reverse_df = dd.read_parquet(["df2.parquet", "df1.parquet"], engine='pyarrow')
forward_df.npartitions  # 2
reverse_df.npartitions  # 2

forward_df.get_partition(0).compute() # returns df1
reverse_df.get_partition(0).compute() # Also returns df1. Expected to return df2

Anything else we need to know?:

Environment:

  • Dask version:
  • Python version:
  • Operating System:
  • Install method (conda, pip, source):
Cluster Dump State:

Metadata

    Labels

    dataframe · feature (Something is missing) · needs attention (It's been a while since this was pushed on. Needs attention from the owner or a maintainer.) · parquet
