Labels: dataframe, feature, needs attention, parquet
Description
What happened:
Dask doesn't honor the order of files in the list of paths passed to read_parquet
What you expected to happen:
I would have expected the partition ordering to match the list of paths passed to read_parquet. It appears the input paths are sorted internally, with no option to disable this.
Minimal Complete Verifiable Example:
```python
import dask.dataframe as dd
import pandas as pd

df1 = pd.DataFrame({'i': [1, 1], 'A': [1.0, 2.0], 'B': [11.0, 12.0]})
df1.to_parquet("df1.parquet")
df2 = pd.DataFrame({'i': [0, 0], 'A': [3.0, 4.0], 'B': [13.0, 14.0]})
df2.to_parquet("df2.parquet")

forward_df = dd.read_parquet(["df1.parquet", "df2.parquet"], engine='pyarrow')
reverse_df = dd.read_parquet(["df2.parquet", "df1.parquet"], engine='pyarrow')

forward_df.npartitions  # 2
reverse_df.npartitions  # 2

forward_df.get_partition(0).compute()  # returns df1
reverse_df.get_partition(0).compute()  # also returns df1; expected df2
```
Anything else we need to know?:
Environment:
- Dask version:
- Python version:
- Operating System:
- Install method (conda, pip, source):
Cluster Dump State: