
Non-deterministic groupby result when using disk / partd shuffle method  #10034

@fjetter

Description


When running a groupby with the default disk shuffle backend, I am receiving non-deterministic results:

import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame({
    'plain_int64': [1, 2, 3],
    'dup_strings': [0, 1, 0]
})

ddf = dd.from_pandas(df, npartitions=2)
dd_series = ddf.groupby('dup_strings').transform('first')
dd_series.compute()

Most of the time, this example returns

   plain_int64
1            2
0            1
2            1

but in about 2-3% of runs I get a different result:

   plain_int64
1            2
2            3
0            3

The pandas output is:

   plain_int64
0            1
1            2
2            1
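For reference, the expected result can be reproduced with pandas alone (a minimal check, independent of dask):

```python
import pandas as pd

# Same data as the reproducer above
df = pd.DataFrame({
    'plain_int64': [1, 2, 3],
    'dup_strings': [0, 1, 0],
})

# transform('first') broadcasts each group's first value back to the
# original row positions: group 0 covers rows 0 and 2, group 1 covers row 1
result = df.groupby('dup_strings').transform('first')
print(result)
```

Because pandas operates on a single in-memory frame, this result is deterministic; the dask result should agree with it on every run.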

When switching to the tasks backend using

import dask
dask.config.set({"dataframe.shuffle.method": "tasks"})

the non-deterministic behavior goes away.

Note: anyone using a distributed cluster will automatically use the tasks or p2p shuffle method and will not be affected by this.

Metadata

Assignees: no one assigned
Labels: core, dataframe, needs attention (it's been a while since this was pushed on; needs attention from the owner or a maintainer)
Milestone: none