
Non-deterministic groupby result when using disk / partd shuffle method  #10034

@fjetter

Description


When running a groupby with the default disk shuffle backend, I am receiving non-deterministic results:

import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame({
    'plain_int64': [1, 2, 3],
    'dup_strings': [0, 1, 0]
})

ddf = dd.from_pandas(df, npartitions=2)
dd_series = ddf.groupby('dup_strings').transform('first')
dd_series.compute()

Most of the time, this example returns

   plain_int64
1            2
0            1
2            1

but in about 2-3% of runs I get a different result:

   plain_int64
1            2
2            3
0            3

The pandas output is:

   plain_int64
0            1
1            2
2            1
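For reference, the expected result can be reproduced with pandas alone (a minimal check, independent of dask):

```python
import pandas as pd

# Same data as the reproducer above
df = pd.DataFrame({
    'plain_int64': [1, 2, 3],
    'dup_strings': [0, 1, 0],
})

# transform('first') broadcasts each group's first value back to the
# original row positions: group 0 covers rows 0 and 2, group 1 covers row 1
result = df.groupby('dup_strings').transform('first')
print(result)
```

Because pandas operates on a single in-memory frame, this result is deterministic; the dask result should agree with it on every run.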

When switching to the tasks backend using

import dask
dask.config.set({"dataframe.shuffle.method": "tasks"})

the non-deterministic behavior goes away.

Note: anyone using a distributed cluster will automatically use the tasks or p2p shuffle method and will not be affected by this.

Metadata

Assignees: no one assigned
Labels: core, dataframe, needs attention (it's been a while since this was pushed on; needs attention from the owner or a maintainer)
Milestone: none