[Experimental] Add cardinality argument to GroupBy.aggregate by rjzamora · Pull Request #9446 · dask/dask

rjzamora · 2022-08-31T15:32:16Z

Adds an optional cardinality argument to GroupBy.aggregate:

"""
cardinality : float or "infer", optional
    Approximate ratio of aggregated data size with respect to the
    initial data size. If specified, this ratio will be used to override
    the defaults for ``split_every``, ``split_out``, and ``shuffle``.
    If ``"infer"`` is specified, the first non-empty partition will be
    used to estimate the global cardinality ratio.
"""

This PR is related to the discussion in #9406 (on setting good defaults automatically). Note that the specific heuristics used to set split_every and split_out in this PR may need to be tweaked; more benchmarking is necessary.
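For illustration only, here is a rough, pure-Python sketch of what the ``"infer"`` option is described as doing: take the first non-empty partition and use it to estimate the ratio of aggregated size to initial size. The function name `estimate_cardinality` and the representation of partitions as plain sequences of group keys are assumptions made for this sketch, not code from the PR.

```python
from typing import Hashable, Iterable, Optional, Sequence


def estimate_cardinality(
    partitions: Iterable[Sequence[Hashable]],
) -> Optional[float]:
    """Hypothetical helper: estimate (unique groups) / (rows).

    Only the first non-empty partition is inspected, mirroring the
    docstring's description of ``cardinality="infer"``.
    """
    for keys in partitions:
        if len(keys) > 0:
            # Number of distinct group keys relative to the row count.
            return len(set(keys)) / len(keys)
    # Every partition was empty, so there is nothing to estimate from.
    return None


# The first partition is empty, so the second one is sampled:
# 2 distinct keys over 4 rows.
partitions = [[], ["a", "b", "a", "a"], ["c", "d"]]
ratio = estimate_cardinality(partitions)
```

A single partition can misrepresent the global ratio (for example, when the data is sorted by the group key), so an estimate like this is best treated as a starting point.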

Tests added / passed
Passes pre-commit run --all-files

ian-r-rose · 2022-09-01T23:40:33Z

dask/dataframe/groupby.py

+                ),
+                self.obj.npartitions,
+            )
+            split_every = split_every or min(max(int(1.0 / cardinality), 2), 32)


I think it's important to allow split_every to be one (cf #9406 (comment)), so that there isn't any repartitioning to fewer partitions before shuffling. I found this makes a significant difference for the case when there are fewer, larger partitions.

Otherwise, this looks like a reasonable heuristic to me.
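To make the suggestion concrete, here is a minimal standalone sketch of the clamp from the diff line above, with the lower bound exposed as a parameter. The function name `default_split_every` and the `min_split` / `max_split` parameters are hypothetical; only the `min(max(int(1.0 / cardinality), 2), 32)` expression comes from the PR.

```python
def default_split_every(
    cardinality: float, min_split: int = 2, max_split: int = 32
) -> int:
    """Hypothetical helper: clamp int(1 / cardinality) to [min_split, max_split]."""
    if cardinality <= 0:
        raise ValueError("cardinality must be positive")
    return min(max(int(1.0 / cardinality), min_split), max_split)


# Lower bound of 2, as in the diff: a ratio of 1.0 still yields 2.
as_in_diff = default_split_every(1.0)

# Lower bound of 1, as suggested in this review comment.
with_lower_bound_one = default_split_every(1.0, min_split=1)

# Low-cardinality data hits the upper bound of 32.
low_cardinality = default_split_every(0.01)
```

With `min_split=1`, a cardinality ratio near 1.0 produces `split_every=1`, which is the case this comment asks to allow.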

rjzamora · 2022-09-07T13:15:09Z

Closing this for now.

add cardinality argument to aggregate

fc1594f

rjzamora added the dataframe and enhancement labels Aug 31, 2022

Merge remote-tracking branch 'upstream/main' into cardinality-arg

8670b0b

gjoseph92 mentioned this pull request Sep 1, 2022

Automation of benchmark comparison coiled/benchmarks#292

Closed

ian-r-rose reviewed Sep 1, 2022

ian-r-rose mentioned this pull request Sep 1, 2022

Shuffle groupby default #9453

Merged

3 tasks

rjzamora closed this Sep 7, 2022

rjzamora wants to merge 2 commits into dask:main from rjzamora:cardinality-arg

2 participants
