Support shuffle-based groupby aggregations in dask_cudf#11800
Support shuffle-based groupby aggregations in dask_cudf#11800rapids-bot[bot] merged 22 commits intorapidsai:branch-22.10from
Conversation
wence-
left a comment
There was a problem hiding this comment.
Ugh, sorry, all of my suggestions have resulted in disaster. Not sure if the better option given we're close to release is go back to the old code and go round again, or to try and fix things up.
I'm inclined perhaps to revert and try again more systematically, WDYT?
I think the safest path forward is to stick with the ugly What we want for the impending release is to support |
|
One more thing we can do for 22.10 is to improve the explicit-comms shuffle logic a bit to avoid creating/removing a "_partitions" column when it already exists. |
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## branch-22.10 #11800 +/- ##
===============================================
Coverage ? 87.52%
===============================================
Files ? 133
Lines ? 21801
Branches ? 0
===============================================
Hits ? 19081
Misses ? 2720
Partials ? 0 Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
|
Update: All changes related to explicit-comms have been removed from this PR - The plan for 22.10 is to support |
Reverts #992, which had led to unexpected issues. See rapidsai/cudf#11800 (review) cc @wence- Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #1001
wence-
left a comment
There was a problem hiding this comment.
Thanks, and sorry for all the going around in circles
|
@gpucibot merge |
## Description This PR fixes a subtle bug introduced in #11800. While working on the corresponding dask-cuda benchmark for that PR rapidsai/dask-cuda#979, we discovered that non-deterministic column ordering in `_groupby_partition_agg` and `_tree_node_agg` can trigger metadata-enforcement errors in follow-up operations. This PR simply sorts the output column ordering in those functions (so that the column ordering is always deterministic). Note that this bug is difficult to reproduce in a pytest, because it rarely occurs with a smaller number of devices (I need to use a full dgx machine to consistently trigger the error). ## Checklist - [ ] I am familiar with the [Contributing Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md). - [ ] New or existing tests cover these changes. - [ ] The documentation is up to date with these changes. Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Ashwin Srinath (https://github.com/shwina)
Description
This PR corresponds to the
dask_cudfversion of dask/dask#9302 (adding a shuffle-based algorithm for high-cardinality groupby aggregations). The benefits of this algorithm are most significant for cases wheresplit_out>1is necessary:NOTES:
shuffle="explicit-comms"is also supported (whendask_cudais installed)dask.dataframeanddask_cudf, the specialized_shuffle_aggregateis currently necessary.Checklist