[Experimental] Add cardinality argument to GroupBy.aggregate#9446

Closed
rjzamora wants to merge 2 commits into dask:main from rjzamora:cardinality-arg

@rjzamora
Member

Adds an optional cardinality argument to GroupBy.aggregate:

"""
cardinality : float or "infer", optional
    Approximate ratio of aggregated data size with respect to the
    initial data size. If specified, this ratio will be used to override
    the defaults for ``split_every``, ``split_out``, and ``shuffle``.
    If ``"infer"`` is specified, the first non-empty partition will be
    used to estimate the global cardinality ratio.
"""
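
As a rough illustration of what the ``"infer"`` option describes, the sketch below estimates the ratio from the first non-empty partition as (number of distinct group keys) / (number of rows). It is plain Python rather than dask code, and the helper names `estimate_cardinality_ratio` and `infer_from_partitions` are hypothetical; the PR's actual implementation may differ.

```python
from typing import Hashable, Iterable, Sequence


def estimate_cardinality_ratio(keys: Iterable[Hashable]) -> float:
    """Return distinct keys / total rows for one partition's group keys."""
    keys = list(keys)
    if not keys:
        raise ValueError("cannot estimate a ratio from an empty partition")
    return len(set(keys)) / len(keys)


def infer_from_partitions(partitions: Sequence[Sequence[Hashable]]) -> float:
    """Use the first non-empty partition as a sample for the global ratio.

    Hypothetical stand-in for the "infer" behaviour described above.
    """
    for part in partitions:
        if len(part) > 0:
            return estimate_cardinality_ratio(part)
    raise ValueError("all partitions are empty")


# The first partition is empty, so the second one is sampled:
# 2 distinct keys over 4 rows.
ratio = infer_from_partitions([[], ["a", "a", "b", "b"], ["c"]])
```

A single partition is only a sample, so the estimate can be far off when keys are unevenly distributed across partitions.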

This PR is related to the discussion in #9406 on setting good defaults automatically. Note that the specific heuristics used here to set split_every and split_out may need to be tweaked; more benchmarking is necessary.

  • Tests added / passed
  • Passes pre-commit run --all-files

@rjzamora rjzamora added dataframe enhancement Improve existing functionality or make things work better labels Aug 31, 2022
),
self.obj.npartitions,
)
split_every = split_every or min(max(int(1.0 / cardinality), 2), 32)
Collaborator

I think it's important to allow split_every to be one (cf #9406 (comment)), so that there isn't any repartitioning to fewer partitions before shuffling. I found this makes a significant difference for the case when there are fewer, larger partitions.

Otherwise, this looks like a reasonable heuristic to me.
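
For reference, the split_every expression under review can be isolated as a small function to see what it yields for different ratios. This is a sketch: the function name `default_split_every` and the `lower` parameter are hypothetical, added only to show the effect of relaxing the lower bound from 2 to 1 as suggested above.

```python
from typing import Optional


def default_split_every(
    cardinality: float,
    split_every: Optional[int] = None,
    lower: int = 2,
    upper: int = 32,
) -> int:
    """Clamp int(1 / cardinality) to [lower, upper] unless split_every is given.

    With lower=2 this mirrors the expression in the diff; lower=1 is the
    hypothetical variant that lets split_every reach one.
    """
    return split_every or min(max(int(1.0 / cardinality), lower), upper)


# Expression as written in the diff (lower bound of 2).
as_written = [default_split_every(c) for c in (1.0, 0.5, 0.25, 0.01)]

# Same inputs with the lower bound relaxed to 1.
relaxed = [default_split_every(c, lower=1) for c in (1.0, 0.5, 0.25, 0.01)]
```

The two variants differ only when the ratio is above 0.5, where int(1 / cardinality) is 1: the expression as written bumps that to 2, while the relaxed bound keeps it at 1.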

@ian-r-rose ian-r-rose mentioned this pull request Sep 1, 2022
3 tasks
@rjzamora
Member Author

rjzamora commented Sep 7, 2022

Closing this for now.

@rjzamora rjzamora closed this Sep 7, 2022
Labels

dataframe enhancement Improve existing functionality or make things work better

2 participants