different number of partitions when indexing after reading dataframe #12193

@melonora

Description

Describe the issue:
When a dataframe has npartitions > 1, saving it to parquet and reading it back in changes the number of partitions once you index the resulting dataframe on one of its columns.
Minimal Complete Verifiable Example:

import dask.dataframe as dd
import pandas as pd
import numpy as np

n_partitions = 10
data = {
    'x': np.random.randn(1000),
    'y': np.random.randn(1000),
    'genes': np.random.choice(['GeneA', 'GeneB', 'GeneC', 'GeneD'], size=1000),
    'instance_id': np.arange(1000)
}

df = pd.DataFrame(data)
df['genes'] = df['genes'].astype('category')
ddf = dd.from_pandas(df, npartitions=n_partitions)

ddf.to_parquet('test.parquet')

test = dd.read_parquet("test.parquet")

Now when printing:

print(ddf.map_partitions(len).compute())
print(ddf['x'].map_partitions(len).compute())

Both return:

0    1000
0    1000
0    1000
0    1000
0    1000
0    1000
0    1000
0    1000
0    1000
0    1000
dtype: int64

Printing the following instead:

print(test.map_partitions(len).compute())
print(test['x'].map_partitions(len).compute())

The first statement gives the same result as before with ddf. The second print, however, gives:

0    3000
0    3000
0    4000
dtype: int64

Environment:

  • Dask version: dask from main
  • Python version: tested with multiple Python versions
  • Operating System: Windows, but reproduced on macOS
  • Install method (conda, pip, source): pip development installation
