
Shuffle groupby default #9453

Merged
ian-r-rose merged 5 commits into dask:main from ian-r-rose:shuffle-groupby-default
Sep 13, 2022

Conversation

@ian-r-rose (Collaborator)

I expect this to conflict significantly with the proposed alternative in #9446, though this PR is much less ambitious. I'm just trying to select some defaults that are likely to have better performance based on some of the investigations in #9406 (comment).

I feel pretty good about the choice of having shuffle on-by-default when split_out > 1. I'm less sure about what to do about split_every. When we have fewer, larger partitions, we probably want it to be close to one. When we have many smaller partitions, it probably makes sense for it to be eight or sixteen, as @rjzamora was finding (note that in this context split_every is more of a repartitioning granularity rather than a tree-reduction width). I am not sure if there really is a great solution short of the things that @rjzamora is trying in #9446, since we don't typically know partition sizes ahead of time. (Though we could in some instances, like with parquet!)
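The trade-off described here can be sketched with a toy cost model (a hypothetical helper, not dask internals): each combine task concatenates up to split_every intermediate results, so its peak input size grows with split_every, while the number of reduction rounds shrinks as split_every grows.

```python
import math

def combine_cost(npartitions, partition_bytes, split_every):
    """Toy model of a split_every-ary combine step: returns the number
    of reduction rounds and the rough peak bytes concatenated by any
    single combine task. (Assumed model, not dask's implementation.)"""
    assert split_every >= 2, "a fan-in below 2 makes no progress"
    rounds, n = 0, npartitions
    while n > 1:
        n = math.ceil(n / split_every)
        rounds += 1
    peak = split_every * partition_bytes  # each task concatenates this much
    return rounds, peak

# Few, large partitions: a small split_every keeps per-task memory low.
# Many, small partitions: split_every of 8 or 16 stays cheap and shallow.
```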

split_every=1 for shuffle-based groupby
@ian-r-rose (Collaborator, Author)

Okay, I take it back, split_every=1 as a default for shuffling is a bad idea. But I still think it's worthwhile to have it as an option, as I do think it is the right choice when there are few, large partitions per worker.

This is largely orthogonal to the choice of when to default to the shuffle-based groupby, which I still believe should just be when split_out > 1.

@rjzamora (Member) left a comment

Thanks @ian-r-rose ! I don't think I agree with the split_every analysis, but I definitely agree that the shuffle-based algorithm should be default for split_out>1.

I expect this to conflict significantly with the proposed alternative in #9446

I wouldn't worry about that PR (I will most likely close it). I'm not crazy about adding a new keyword argument to aggregate anyway :)

Comment on lines +1662 to +1666:

    if shuffle is None:
        if split_out > 1:
            shuffle = shuffle or config.get("shuffle", None) or "disk"
        else:
            shuffle = False
@rjzamora (Member)

I'm not sure that we ever want to use shuffle ="disk" by default. It is quite slow (definitely slower than ACA).

@ian-r-rose (Collaborator, Author)

I was also unsure about this -- I did this in the interests of being consistent with the other places where a default shuffle implementation is chosen (basically, default to disk, unless there is a Client active).

Do you think disk is also a bad idea there?

@rjzamora (Member)

Yes (sorry for the delayed response). My personal opinion is that shuffle="disk" should never be default behavior. This PR is using shuffle="task"/"p2p" benchmark results to rationalize a change in default behavior. I don't expect shuffle="disk" to be very performant at all.

@ian-r-rose (Collaborator, Author)

Since this is new functionality, I would be comfortable changing the default to "tasks", despite the inconsistency with other shuffling defaults.

I can also open a new issue for making "tasks" the default everywhere.
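The revised default under discussion could look something like the following sketch (hypothetical helper, not the actual dask source): an explicit user choice always wins, a configured method comes next, and "tasks" rather than "disk" is the fallback when split_out > 1.

```python
def default_shuffle_method(shuffle, split_out, configured=None):
    """Hypothetical sketch of the default-selection logic discussed
    above, with "tasks" (not "disk") as the fallback."""
    if shuffle is not None:
        return shuffle                # explicit user choice wins
    if split_out > 1:
        return configured or "tasks"  # configured method, else "tasks"
    return False                      # single output partition: use ACA
```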

@ian-r-rose (Collaborator, Author)

Thanks for taking a look @rjzamora! I'm now backing off of my initial proposal to make split_every default to 1, but I think I'd still advocate for it to be allowed to be one in this case.

I don't like that I haven't been able to think of a good heuristic for what it should be, but all I can think of involves comparing number of partitions, representative partition sizes, and number of workers.

@rjzamora (Member) commented Sep 2, 2022

Thanks for taking a look @rjzamora! I'm now backing off of my initial proposal to make split_every default to 1, but I think I'd still advocate for it to be allowed to be one in this case.

That seems fair to me. For both the shuffle and aca algorithms, split_every is meant to specify how many chunk-task outputs can be safely (and perhaps efficiently) concatenated together. In the aca algorithm, this is used to specify the "k" in the k-ary reduction tree, while in the shuffle algorithm it tells us how many adjacent partitions should be coalesced after the initial blockwise chunk aggregation. These concepts are very similar, but the lower limits happen to be slightly different.

Take-away: We should indeed support a distinct "lower-limit" on split_every for the shuffle-based algorithm.
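This distinction can be made concrete with a small sketch (assumed semantics, not dask source): in aca, split_every is the fan-in "k" of the reduction tree, so values below 2 make no progress, while in the shuffle algorithm it is a coalescing factor for which 1 is a legitimate lower limit meaning "do not coalesce at all".

```python
import math

def valid_split_every(algorithm, split_every):
    """Distinct lower limits for the two algorithms (assumed semantics)."""
    if algorithm == "aca":
        return split_every >= 2   # tree fan-in below 2 makes no progress
    if algorithm == "shuffle":
        return split_every >= 1   # 1 simply means "no coalescing"
    raise ValueError(algorithm)

def shuffled_output_partitions(npartitions, split_every):
    # Coalescing split_every adjacent partitions after the blockwise step
    return math.ceil(npartitions / split_every)
```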

@jrbourbeau (Member) left a comment

Just checking in here, it looks like there's good agreement -- is this safe to merge?

@rjzamora (Member) commented Sep 6, 2022

it looks like there's good agreement -- is this safe to merge?

I think the only sticking point for me is that I'd like to avoid using shuffle="disk" by default on the threaded scheduler - I'd expect that to be a significant performance regression compared to shuffle=False. Therefore, we need to use shuffle="tasks" explicitly, or somehow make the shuffle default scheduler-dependent.
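One way to make the default scheduler-dependent, sketched with a hypothetical helper (dask's real dispatch is more involved): fall back to "tasks" rather than "disk" when no distributed client is active.

```python
def pick_shuffle_backend(split_out, has_distributed_client):
    """Hypothetical scheduler-dependent default that never picks "disk"."""
    if split_out <= 1:
        return False  # ACA path; no shuffle needed
    # With a distributed client a distributed shuffle is available;
    # on the threaded scheduler fall back to "tasks" instead of "disk".
    return "p2p" if has_distributed_client else "tasks"
```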

@mrocklin (Member) commented Sep 6, 2022 via email

@rjzamora (Member) commented Sep 6, 2022

Disk is much better for the local scheduler. It has the same scalability
and out of memory properties as the p2p shuffle

This is probably true for larger-than-memory data, but I definitely see a significant performance regression for "large" data (5-6GB) that still fits comfortably in memory:

    from dask.datasets import timeseries

    ddf = timeseries(end='2002-01-01', id_lam=1e12)

    %time ddf.groupby("id").agg({"x": "mean"}, split_out=4, shuffle=False).compute()
    # Wall time: 12.9 s

    %time ddf.groupby("id").agg({"x": "mean"}, split_out=4, shuffle="tasks").compute()
    # Wall time: 13.2 s

    %time ddf.groupby("id").agg({"x": "mean"}, split_out=4, shuffle="disk").compute()
    # Wall time: 31 s

Generally speaking, the shuffle-vs-aca decision is a bit more complicated for the threaded scheduler.

@jrbourbeau mentioned this pull request on Sep 13, 2022
@ian-r-rose (Collaborator, Author)

From my end, this is ready. Is there anything else to be done from your perspective @rjzamora?

@rjzamora (Member) left a comment

LGTM - Thanks @ian-r-rose !


Development

Successfully merging this pull request may close these issues.

Shuffle-based groupby aggregation by default

4 participants