Shuffle-based groupby aggregation by default

In https://github.com/dask/dask/pull/9302, @rjzamora added a new shuffle-based groupby algorithm which @ian-r-rose demonstrated has significant performance improvements in certain cases https://github.com/dask/dask/pull/9302#issuecomment-1216782830. Today this new shuffle-based groupby is gated behind an optional `shuffle=` keyword argument (off by default). We should probably turn it on by default in cases where it makes sense. 

Here's a brief summary of @ian-r-rose's findings (see https://github.com/dask/dask/pull/9302#issuecomment-1216782830 for more details)

> TL;DR
> 
> This largely confirms @rjzamora's points above and in conversations I've had with him:
> 
>     ACA-based groupby/aggs do fine for low-cardinality cases (say, where the number of groups is less than a few percent of the dataframe size).
>     Once the cardinality gets larger, it does indeed make sense to start using the shuffle-based algorithm. To me, the range where split_out > 1 and where it's still better to use ACA looks pretty darn narrow. A rule of thumb where shuffle is on-by-default when split_out>1 doesn't sound like the worst idea in the world to me.
>     Memory-usage is indeed higher for task-based shuffling than ACA in the lower-cardinality cases. But that advantage is wiped out in higher-cardinality cases, so it actually doesn't worry me much. And task-based shuffling is less likely to knock over a worker in the last few aggs of the ACA-based approach.
>     p2p shuffling has much better memory performance. I ran into a couple of problems when using it here that I'll open as separate issues, but the results are encouraging.

Opening this issue so we don't loose track of this topic 


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Shuffle-based groupby aggregation by default #9406

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Shuffle-based groupby aggregation by default #9406

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions