Skip to content

Shuffle-based groupby aggregation by default #9406

@jrbourbeau

Description

@jrbourbeau

In #9302, @rjzamora added a new shuffle-based groupby algorithm which @ian-r-rose demonstrated has significant performance improvements in certain cases #9302 (comment). Today this new shuffle-based groupby is gated behind an optional shuffle= keyword argument (off by default). We should probably turn it on by default in cases where it makes sense.

Here's a brief summary of @ian-r-rose's findings (see #9302 (comment) for more details)

TL;DR

This largely confirms @rjzamora's points above and in conversations I've had with him:

ACA-based groupby/aggs do fine for low-cardinality cases (say, where the number of groups is less than a few percent of the dataframe size).
Once the cardinality gets larger, it does indeed make sense to start using the shuffle-based algorithm. To me, the range where split_out > 1 and where it's still better to use ACA looks pretty darn narrow. A rule of thumb where shuffle is on-by-default when split_out>1 doesn't sound like the worst idea in the world to me.
Memory-usage is indeed higher for task-based shuffling than ACA in the lower-cardinality cases. But that advantage is wiped out in higher-cardinality cases, so it actually doesn't worry me much. And task-based shuffling is less likely to knock over a worker in the last few aggs of the ACA-based approach.
p2p shuffling has much better memory performance. I ran into a couple of problems when using it here that I'll open as separate issues, but the results are encouraging.

Opening this issue so we don't loose track of this topic

Metadata

Metadata

Assignees

Labels

dataframeenhancementImprove existing functionality or make things work better

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions