Closed
Labels: needs triage (Needs a response from a contributor)
Description
Describe the issue:
With a dataframe that has npartitions > 1, saving it to parquet and reading it back changes the number of partitions when you index the resulting dataframe on one of its columns.
Minimal Complete Verifiable Example:
import dask.dataframe as dd
import pandas as pd
import numpy as np
n_partitions = 10
data = {
'x': np.random.randn(1000),
'y': np.random.randn(1000),
'genes': np.random.choice(['GeneA', 'GeneB', 'GeneC', 'GeneD'], size=1000),
'instance_id': np.arange(1000)
}
df = pd.DataFrame(data)
df['genes'] = df['genes'].astype('category')
ddf = dd.from_pandas(df, npartitions=n_partitions)
ddf.to_parquet('test.parquet')
test = dd.read_parquet("test.parquet")
Now when printing:
print(ddf.map_partitions(len).compute())
print(ddf['x'].map_partitions(len).compute())
Both return:
Out[13]:
0 1000
0 1000
0 1000
0 1000
0 1000
0 1000
0 1000
0 1000
0 1000
0 1000
dtype: int64
Printing the following:
print(test.map_partitions(len).compute())
print(test['x'].map_partitions(len).compute())
The first statement gives the same result as with ddf. However, the second print gives:
Out[15]:
0 3000
0 3000
0 4000
dtype: int64
Environment:
- Dask version: dask from main
- Python version: tested with multiple Python versions
- Operating System: Windows, but reproduced on macOS
- Install method (conda, pip, source): pip development installation