Add Groupby benchmark #979
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##   branch-22.10    #979        +/-   ##
==========================================
  Coverage              ?      0.00%
==========================================
  Files                 ?         16
  Lines                 ?       2109
  Branches              ?          0
==========================================
  Hits                  ?          0
  Misses                ?       2109
  Partials              ?          0
```
```python
        "help": "Fraction of rows that are unique groups",
    },
    {
        "name": "--sort",
```
Note that I am seeing errors when I run the benchmark with `--shuffle "explicit-comms" --split_out 2 --sort`. The cause seems to be that dask uses a shallow copy in `DataFrame.drop`, and this `"_partitions"`-column drop is somehow not working correctly after an explicit-comms shuffle.
I'm not exactly sure why the `"_partitions"` column is not getting dropped properly here (I am struggling to create a simple-enough reproducer to explore this properly). However, I suspect that the custom round-tripping of the input dataframe through a persisted list of futures is hitting an undefined edge case.
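To illustrate the suspected mechanism (a hedged sketch, not the actual dask code path): at the pandas level, `DataFrame.drop` returns a new frame rather than mutating the original, so if two references to the same partition object are floating around after the futures round-trip, one of them can still see the helper column:

```python
import pandas as pd

# Sketch of the suspicion above: dropping "_partitions" produces a new
# frame; any other reference to the original partition still carries the
# helper column. (Column names here are illustrative.)
df = pd.DataFrame({"key": [0, 1], "_partitions": [0, 1]})
dropped = df.drop(columns=["_partitions"])  # new frame, column removed

assert "_partitions" not in dropped.columns
assert "_partitions" in df.columns  # the original reference is untouched
```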
I see a (possibly different) error:

```
python -m dask_cuda.benchmarks.local_cudf_groupby --shuffle explicit-comms --split_out 2 --sort -d 0,1,3,4 -p tcp
```

```
File "/root/dask/dask/dataframe/utils.py", line 429, in check_matching_columns
    raise ValueError(
ValueError: The columns in the computed data do not match the columns in the provided metadata
Order of columns does not match
```

With the columns being:

```
Index(['key', 'int64___count', 'int64___max', 'float64___count',
       'float64___sum'],
      dtype='object')
Index(['key', 'int64___max', 'int64___count', 'float64___sum',
       'float64___count'],
      dtype='object')
```
This also occurs with `--shuffle tasks`.
My crystal-ball says that somewhere the columns for aggregation are being constructed through a set, which doesn't provide run-to-run consistent ordering, but I can't find it.
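For anyone unfamiliar with why a set would cause this: Python's string hashing is randomized per interpreter run (`PYTHONHASHSEED`), so iterating a set of column names can yield a different order in each worker process, while sorting restores a stable order. A minimal illustration (column names borrowed from the traceback above):

```python
# Set iteration order for strings depends on the per-run hash seed, so
# list(cols) may differ between interpreter runs / worker processes.
cols = {"int64___count", "int64___max", "float64___count", "float64___sum"}

unstable = list(cols)  # order not guaranteed across runs
stable = sorted(cols)  # deterministic, run-to-run consistent

assert stable == [
    "float64___count",
    "float64___sum",
    "int64___count",
    "int64___max",
]
```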
> This occurs also with --shuffle tasks

Oh no! You see this same error for the command you shared above, but with `--shuffle tasks`?
Yes (not always); it is more likely with more devices.
Thanks for pointing this out @wence-. It seems like we need to tweak the groupby-aggregation functions (defined in dask_cudf) to output intermediate DataFrame results with sorted column names. Otherwise the ordering is non-deterministic, and we open ourselves up to failed metadata-enforcement checks in downstream operations :/
Update: rapidsai/cudf#11836 should resolve this problem in dask_cudf.
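The shape of that fix can be sketched as follows (a hedged illustration only; `_finalize_partition` is a hypothetical helper, not the actual dask_cudf function, and pandas stands in for cudf):

```python
import pandas as pd

# Sketch: after building a partial aggregation, reindex with sorted
# column names so every partition emits the same column order no matter
# what order the set/dict of aggregations was iterated in.
def _finalize_partition(gb: pd.DataFrame) -> pd.DataFrame:
    # hypothetical helper illustrating the fix, not dask_cudf code
    return gb[sorted(gb.columns)]

part = pd.DataFrame({"int64___max": [3], "int64___count": [2]})
out = _finalize_partition(part)

assert list(out.columns) == ["int64___count", "int64___max"]
```

With every partition's output columns sorted, dask's metadata check (`check_matching_columns`) can no longer fail on ordering alone.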
## Description

This PR fixes a subtle bug introduced in #11800. While working on the corresponding dask-cuda benchmark for that PR (rapidsai/dask-cuda#979), we discovered that non-deterministic column ordering in `_groupby_partition_agg` and `_tree_node_agg` can trigger metadata-enforcement errors in follow-up operations. This PR simply sorts the output column ordering in those functions (so that the column ordering is always deterministic).

Note that this bug is difficult to reproduce in a pytest, because it rarely occurs with a smaller number of devices (I need to use a full dgx machine to consistently trigger the error).

## Checklist

- [ ] I am familiar with the [Contributing Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md).
- [ ] New or existing tests cover these changes.
- [ ] The documentation is up to date with these changes.

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Ashwin Srinath (https://github.com/shwina)
@madsbk @pentschev - Any interest in getting this benchmark in for 22.10? (Shouldn't be a blocker, but a reduction-based benchmark would be "nice to have")
Co-authored-by: Mads R. B. Kristensen <madsbk@gmail.com>
@madsbk can you accept the change, then we can merge it in.
@gpucibot merge |
Add simple groupby benchmark using the same template as `local_cudf_merge.py` and `local_cudf_shuffle.py`.