Non deterministic groupby result when using disk / partd shuffle method #10034
Open
Labels: core, dataframe, needs attention (it's been a while since this was pushed on; needs attention from the owner or a maintainer)
Description
When running a groupby using the default disk backend, I am receiving non-deterministic results:
```python
import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame({
    'plain_int64': [1, 2, 3],
    'dup_strings': [0, 1, 0]
})
ddf = dd.from_pandas(df, npartitions=2)
dd_series = ddf.groupby('dup_strings').transform('first')
dd_series.compute()
```

Most of the time, this example returns
|   | plain_int64 |
|---|---|
| 1 | 2 |
| 0 | 1 |
| 2 | 1 |
but in about 2-3% of runs I get a different result:
|   | plain_int64 |
|---|---|
| 1 | 2 |
| 2 | 3 |
| 0 | 3 |
The pandas output is:
|   | plain_int64 |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 1 |
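For comparison, the pandas reference result above can be checked directly. This is a minimal sketch using only pandas (dask is not needed to verify the expected single-machine output):

```python
import pandas as pd

df = pd.DataFrame({
    'plain_int64': [1, 2, 3],
    'dup_strings': [0, 1, 0]
})

# transform('first') broadcasts the first value seen in each group
# (in row order) back to every row of that group:
#   group 0 -> rows 0 and 2, first plain_int64 is 1
#   group 1 -> row 1, first plain_int64 is 2
result = df.groupby('dup_strings').transform('first')
print(result['plain_int64'].tolist())  # [1, 2, 1]
```

Because pandas processes the whole frame in one piece, this result is always deterministic, which is what makes the occasional divergent dask output above a bug rather than an ordering artifact.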
When switching to the tasks backend using

```python
import dask

dask.config.set({"dataframe.shuffle.method": "tasks"})
```

the problem goes away.
Note: anyone using a distributed cluster will automatically use the tasks or p2p shuffle method and will not be affected by this.