Add Groupby benchmark #979
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##   branch-22.10    #979        +/-   ##
==========================================
  Coverage              ?      0.00%
==========================================
  Files                 ?         16
  Lines                 ?       2109
  Branches              ?          0
==========================================
  Hits                  ?          0
  Misses                ?       2109
  Partials              ?          0
```
```python
        "help": "Fraction of rows that are unique groups",
    },
    {
        "name": "--sort",
```
Note that I am seeing errors when I run the benchmark with `--shuffle "explicit-comms" --split_out 2 --sort`. The cause seems to be that dask uses a shallow copy in `DataFrame.drop`, and this `"_partitions"`-column drop is somehow not working correctly after an explicit-comms shuffle.
I'm not exactly sure why the `"_partitions"` column is not getting dropped properly here (I am struggling to create a simple-enough reproducer to explore this properly). However, I suspect that the custom round-tripping of the input dataframe through a persisted list of futures is hitting an undefined edge case.
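To illustrate the suspected mechanism (a hedged sketch, not the actual dask code path): at the pandas level, `DataFrame.drop` returns a new frame rather than mutating the original, so if two references to the same partition object are floating around after the futures round-trip, one of them can still see the helper column:

```python
import pandas as pd

# Sketch of the suspicion above: dropping "_partitions" produces a new
# frame; any other reference to the original partition still carries the
# helper column. (Column names here are illustrative.)
df = pd.DataFrame({"key": [0, 1], "_partitions": [0, 1]})
dropped = df.drop(columns=["_partitions"])  # new frame, column removed

assert "_partitions" not in dropped.columns
assert "_partitions" in df.columns  # the original reference is untouched
```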
I see a (possibly different) error:

```
python -m dask_cuda.benchmarks.local_cudf_groupby --shuffle explicit-comms --split_out 2 --sort -d 0,1,3,4 -p tcp
```

```
File "/root/dask/dask/dataframe/utils.py", line 429, in check_matching_columns
    raise ValueError(
ValueError: The columns in the computed data do not match the columns in the provided metadata
Order of columns does not match
```

With the columns being:

```
Index(['key', 'int64___count', 'int64___max', 'float64___count',
       'float64___sum'],
      dtype='object')
Index(['key', 'int64___max', 'int64___count', 'float64___sum',
       'float64___count'],
      dtype='object')
```
This also occurs with `--shuffle tasks`.
My crystal-ball says that somewhere the columns for aggregation are being constructed through a set, which doesn't provide run-to-run consistent ordering, but I can't find it.
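For anyone unfamiliar with why a set would cause this: Python's string hashing is randomized per interpreter run (`PYTHONHASHSEED`), so iterating a set of column names can yield a different order in each worker process, while sorting restores a stable order. A minimal illustration (column names borrowed from the traceback above):

```python
# Set iteration order for strings depends on the per-run hash seed, so
# list(cols) may differ between interpreter runs / worker processes.
cols = {"int64___count", "int64___max", "float64___count", "float64___sum"}

unstable = list(cols)  # order not guaranteed across runs
stable = sorted(cols)  # deterministic, run-to-run consistent

assert stable == [
    "float64___count",
    "float64___sum",
    "int64___count",
    "int64___max",
]
```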
> This occurs also with --shuffle tasks

Oh no! You see this same error for the command you shared above, but with `--shuffle tasks`?
Yes (not always); it is more likely with more devices.
Thanks for pointing this out @wence-. It seems like we need to tweak the groupby-aggregation functions (defined in dask_cudf) to output intermediate DataFrame results with sorted column names. Otherwise the ordering is non-deterministic, and we open ourselves up to failed metadata-enforcement checks in downstream operations :/
Update: rapidsai/cudf#11836 should resolve this problem in dask_cudf.
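The shape of that fix can be sketched as follows (a hedged illustration only; `_finalize_partition` is a hypothetical helper, not the actual dask_cudf function, and pandas stands in for cudf):

```python
import pandas as pd

# Sketch: after building a partial aggregation, reindex with sorted
# column names so every partition emits the same column order no matter
# what order the set/dict of aggregations was iterated in.
def _finalize_partition(gb: pd.DataFrame) -> pd.DataFrame:
    # hypothetical helper illustrating the fix, not dask_cudf code
    return gb[sorted(gb.columns)]

part = pd.DataFrame({"int64___max": [3], "int64___count": [2]})
out = _finalize_partition(part)

assert list(out.columns) == ["int64___count", "int64___max"]
```

With every partition's output columns sorted, dask's metadata check (`check_matching_columns`) can no longer fail on ordering alone.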
## Description

This PR fixes a subtle bug introduced in #11800. While working on the corresponding dask-cuda benchmark for that PR (rapidsai/dask-cuda#979), we discovered that non-deterministic column ordering in `_groupby_partition_agg` and `_tree_node_agg` can trigger metadata-enforcement errors in follow-up operations. This PR simply sorts the output column ordering in those functions (so that the column ordering is always deterministic).

Note that this bug is difficult to reproduce in a pytest, because it rarely occurs with a smaller number of devices (I need to use a full dgx machine to consistently trigger the error).

## Checklist

- [ ] I am familiar with the [Contributing Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md).
- [ ] New or existing tests cover these changes.
- [ ] The documentation is up to date with these changes.

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Ashwin Srinath (https://github.com/shwina)
@madsbk @pentschev - Any interest in getting this benchmark in for 22.10? (Shouldn't be a blocker, but a reduction-based benchmark would be "nice to have")
Co-authored-by: Mads R. B. Kristensen <madsbk@gmail.com>
@madsbk can you accept the change, then we can merge it in.
@gpucibot merge |
Add simple groupby benchmark using the same template as `local_cudf_merge.py` and `local_cudf_shuffle.py`.