-
-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Closed
Labels
dataframeenhancementImprove existing functionality or make things work betterImprove existing functionality or make things work better
Description
In #9302, @rjzamora added a new shuffle-based groupby algorithm which @ian-r-rose demonstrated has significant performance improvements in certain cases #9302 (comment). Today this new shuffle-based groupby is gated behind an optional shuffle= keyword argument (off by default). We should probably turn it on by default in cases where it makes sense.
Here's a brief summary of @ian-r-rose's findings (see #9302 (comment) for more details)
TL;DR
This largely confirms @rjzamora's points above and in conversations I've had with him:
ACA-based groupby/aggs do fine for low-cardinality cases (say, where the number of groups is less than a few percent of the dataframe size). Once the cardinality gets larger, it does indeed make sense to start using the shuffle-based algorithm. To me, the range where split_out > 1 and where it's still better to use ACA looks pretty darn narrow. A rule of thumb where shuffle is on-by-default when split_out>1 doesn't sound like the worst idea in the world to me. Memory-usage is indeed higher for task-based shuffling than ACA in the lower-cardinality cases. But that advantage is wiped out in higher-cardinality cases, so it actually doesn't worry me much. And task-based shuffling is less likely to knock over a worker in the last few aggs of the ACA-based approach. p2p shuffling has much better memory performance. I ran into a couple of problems when using it here that I'll open as separate issues, but the results are encouraging.
Opening this issue so we don't loose track of this topic
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
dataframeenhancementImprove existing functionality or make things work betterImprove existing functionality or make things work better