Skip to content

Groupby median#9516

Merged
jrbourbeau merged 26 commits intodask:mainfrom
ian-r-rose:groupby-median
Oct 14, 2022
Merged

Groupby median#9516
jrbourbeau merged 26 commits intodask:mainfrom
ian-r-rose:groupby-median

Conversation

@ian-r-rose
Copy link
Collaborator

@ian-r-rose ian-r-rose commented Sep 23, 2022

Implements groupby/median, both as a single function and as a dictionary-based agg spec. We take a similar approach to groupby/apply, in that we shuffle based on the group keys before performing an embarassingly parallel groupby/agg.

This is similar to #9302, but with the important difference that the initial aggregation cannot actually do any aggregation, since we must have the full group together to perform an exact group-wise aggregation. On the other hand, we can't do a naive groupby/apply due to needing to specifically handle the observed keyword for grouping by categoricals. So instead the initial aggregation is replaced by an initial set-index, with special handling for observed to make sure the unobserved keys show up as expected. This is not the only possible approach, but does the job.

@github-actions github-actions bot added dataframe documentation Improve or add to documentation io labels Sep 23, 2022
@github-actions github-actions bot removed io documentation Improve or add to documentation labels Oct 8, 2022
@ian-r-rose ian-r-rose marked this pull request as ready for review October 11, 2022 21:33
@ian-r-rose ian-r-rose added the feature Something is missing label Oct 11, 2022
Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ian-r-rose. I took a quick look and this looks good from a high-level. It also looks like this won't impact any existing groupby code -- so I'm happy with the changes here if you are.

@rjzamora your thoughts are certainly welcome here if you have bandwidth to take a look (though no obligation though)

return aggfunc(grouped, **kwargs)


def _groupby_aggregate_spec(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just for ergonomics, or does the existing _groupby_aggregate function not fully have the functionality we need?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was mostly for ergonomics. A previous version of this just fed spec into _groupby_aggregate as an optional kwarg, but I didn't really like having that function have two different behaviors based on proliferating kwargs, so I split it out. Mostly a gut feeling, rather than any strong reasoning.

@jrbourbeau jrbourbeau merged commit dcdb90e into dask:main Oct 14, 2022
@jrbourbeau jrbourbeau mentioned this pull request Jan 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dataframe feature Something is missing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement groupby/median

2 participants