
Add Groupby benchmark#979

Merged
rapids-bot[bot] merged 12 commits into rapidsai:branch-22.10 from rjzamora:groupby-benchmark
Oct 6, 2022

Conversation

@rjzamora
Member

Add simple groupby benchmark using the same template as local_cudf_merge.py and local_cudf_shuffle.py.
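The workload this benchmark times can be sketched as follows. This is a pandas illustration only (the real benchmark runs on dask_cudf across GPUs), and the schema here is an assumption loosely based on the column names that appear in the error reports later in this thread:

```python
import numpy as np
import pandas as pd

# Hypothetical schema: a grouping key plus one int64 and one float64
# value column, mirroring names seen in the error output below.
rng = np.random.default_rng(42)
n = 1_000
df = pd.DataFrame(
    {
        "key": rng.integers(0, 100, n),      # grouping column
        "int64": rng.integers(0, 1_000, n),
        "float64": rng.random(n),
    }
)

# The groupby-aggregation the benchmark would measure (in dask_cudf,
# this runs partition-wise with optional shuffling and split_out).
agg = df.groupby("key").agg(
    {"int64": ["max", "count"], "float64": ["sum", "count"]}
)
```

In the distributed version, the interesting knobs are the shuffle method and the number of output partitions (`--shuffle`, `--split_out`), which exercise the code paths discussed below.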

@github-actions github-actions bot added the python python code needed label Aug 18, 2022
@rjzamora rjzamora added the 2 - In Progress (Currently a work in progress), benchmark (Results of a benchmark run), feature request (New feature or request), and non-breaking (Non-breaking change) labels and removed the python (python code needed) label Aug 18, 2022
@github-actions

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

@codecov-commenter

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.10@c0ae66c).
Patch has no changes to coverable lines.

Additional details and impacted files
@@              Coverage Diff               @@
##             branch-22.10    #979   +/-   ##
==============================================
  Coverage                ?   0.00%           
==============================================
  Files                   ?      16           
  Lines                   ?    2109           
  Branches                ?       0           
==============================================
  Hits                    ?       0           
  Misses                  ?    2109           
  Partials                ?       0           



@rjzamora rjzamora marked this pull request as ready for review September 29, 2022 18:33
@rjzamora rjzamora requested a review from a team as a code owner September 29, 2022 18:33
@rjzamora rjzamora added the 3 - Ready for Review (Ready for review by team) label and removed the 2 - In Progress (Currently a work in progress) label Sep 29, 2022
"help": "Fraction of rows that are unique groups",
},
{
"name": "--sort",
Member Author

Note that I am seeing errors when I run the benchmark with --shuffle "explicit-comms" --split_out 2 --sort, and the cause seems to be that dask is using a shallow copy in DataFrame.drop, and this "_partitions"-column drop is somehow not working right after an explicit-comms shuffle.

I'm not exactly sure why the "_partitions" column is not getting dropped properly here (I am struggling to create a simple-enough reproducer to explore this properly). However, I suspect that the custom round-tripping of the input dataframe from a persisted list of futures is somehow running us into an undefined edge case.
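For readers unfamiliar with the pattern being discussed: a shuffle assigns each row to an output partition via a temporary "_partitions" column, which must be dropped after the data has been moved. The sketch below illustrates that round trip with plain pandas; it is not the dask/explicit-comms implementation, and the reported bug is precisely that the final drop did not take effect after an explicit-comms shuffle:

```python
import pandas as pd

df = pd.DataFrame({"key": [2, 0, 1, 2], "x": [1.0, 2.0, 3.0, 4.0]})

# Assign each row to an output partition (here: key modulo 2, as a
# stand-in for whatever partitioning function the shuffle uses).
df = df.assign(_partitions=df["key"] % 2)

# ... the shuffle would move rows between workers by "_partitions" ...
shuffled = df.sort_values("_partitions")

# The helper column must be removed afterwards. DataFrame.drop returns
# a (shallow) copy; the bug report above is that this drop did not
# stick after the explicit-comms code path.
result = shuffled.drop(columns=["_partitions"])
```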

Contributor

I see a (possibly different) error:

python -m dask_cuda.benchmarks.local_cudf_groupby --shuffle explicit-comms  --split_out 2 --sort -d 0,1,3,4   -p tcp
=>   File "/root/dask/dask/dataframe/utils.py", line 429, in check_matching_columns
    raise ValueError(
ValueError: The columns in the computed data do not match the columns in the provided metadata
Order of columns does not match

With the columns being:

Index(['key', 'int64___count', 'int64___max', 'float64___count',
       'float64___sum'],
      dtype='object')

Index(['key', 'int64___max', 'int64___count', 'float64___sum',
       'float64___count'],
      dtype='object')

Contributor

This occurs also with --shuffle tasks

Contributor

My crystal-ball says that somewhere the columns for aggregation are being constructed through a set, which doesn't provide run-to-run consistent ordering, but I can't find it.
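The hypothesis is plausible because Python randomizes string hashing per interpreter run (controlled by PYTHONHASHSEED), so iterating a set of column-name strings can yield a different order from one run to the next, while sorting restores determinism. A minimal sketch, using the column names from the error message above:

```python
# Aggregation output columns, as seen in the ValueError above.
cols = ["int64___count", "int64___max", "float64___count", "float64___sum"]

# Iteration order over a set is arbitrary and, because of hash
# randomization, can differ between interpreter runs.
from_set = list(set(cols))

# Sorting gives the same order on every run, which is what the
# metadata-enforcement check needs.
deterministic = sorted(set(cols))
```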

Member Author

This occurs also with --shuffle tasks

Oh no! You see this same error for the command you shared above, but with --shuffle tasks?

Contributor

yes (not always), more likely with more devices

Member Author

Thanks for pointing this out @wence-. It seems like we need to tweak the groupby-aggregation functions (defined in dask_cudf) to output intermediate DataFrame results with sorted column names. Otherwise the ordering is non-deterministic, and we open ourselves up to failed metadata-enforcement checks in downstream operations :/
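The proposed fix amounts to normalizing the column order of every intermediate aggregation frame. A pandas sketch of that idea (the actual change lands in dask_cudf's internal aggregation helpers, not in a public function like this one):

```python
import pandas as pd

def with_sorted_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return df with its columns in lexicographic order, so every
    partition's intermediate result has an identical, deterministic
    layout regardless of how the column names were assembled."""
    return df[sorted(df.columns)]

# Hypothetical intermediate result from one partition's aggregation.
part = pd.DataFrame(
    {"key": [0, 1], "int64___max": [5, 7], "int64___count": [3, 2]}
)
stable = with_sorted_columns(part)
```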

Member Author

Update: rapidsai/cudf#11836 should resolve this problem in dask_cudf.

@rjzamora rjzamora changed the title [WIP] Add Groupby benchmark Add Groupby benchmark Sep 29, 2022
raydouglass pushed a commit to rapidsai/cudf that referenced this pull request Sep 30, 2022
## Description
This PR fixes a subtle bug introduced in #11800. While working on the corresponding dask-cuda benchmark (rapidsai/dask-cuda#979), we discovered that non-deterministic column ordering in `_groupby_partition_agg` and `_tree_node_agg` can trigger metadata-enforcement errors in follow-up operations. This PR simply sorts the output column ordering in those functions (so that the column ordering is always deterministic).

Note that this bug is difficult to reproduce in a pytest, because it rarely occurs with a smaller number of devices (I need to use a full dgx machine to consistently trigger the error).

## Checklist
- [ ] I am familiar with the [Contributing Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md).
- [ ] New or existing tests cover these changes.
- [ ] The documentation is up to date with these changes.

Authors:
   - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
   - GALI PREM SAGAR (https://github.com/galipremsagar)
   - Ashwin Srinath (https://github.com/shwina)
@rjzamora
Member Author

rjzamora commented Oct 5, 2022

@madsbk @pentschev - Any interest in getting this benchmark in for 22.10? (Shouldn't be a blocker, but a reduction-based benchmark would be "nice to have")

Member

@madsbk madsbk left a comment

Thanks @rjzamora, LGTM. I only have one minor change.

Co-authored-by: Mads R. B. Kristensen <madsbk@gmail.com>
@quasiben
Member

quasiben commented Oct 6, 2022

@madsbk can you accept the change? Then we can merge it in.

@quasiben
Member

quasiben commented Oct 6, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 60ea5be into rapidsai:branch-22.10 Oct 6, 2022
@rjzamora rjzamora deleted the groupby-benchmark branch October 6, 2022 14:13
@rjzamora
Member Author

rjzamora commented Oct 6, 2022

Thanks @madsbk and @quasiben !


Labels

3 - Ready for Review (Ready for review by team), benchmark (Results of a benchmark run), feature request (New feature or request), non-breaking (Non-breaking change), python (python code needed)


6 participants