Conversation
some special attention in that we cannot aggregate before shuffling.
jrbourbeau
left a comment
There was a problem hiding this comment.
Thanks @ian-r-rose. I took a quick look and this looks good from a high-level. It also looks like this won't impact any existing groupby code -- so I'm happy with the changes here if you are.
@rjzamora your thoughts are certainly welcome here if you have bandwidth to take a look (though no obligation though)
| return aggfunc(grouped, **kwargs) | ||
|
|
||
|
|
||
| def _groupby_aggregate_spec( |
There was a problem hiding this comment.
Is this just for ergonomics, or does the existing _groupby_aggregate function not fully have the functionality we need?
There was a problem hiding this comment.
This was mostly for ergonomics. A previous version of this just fed spec into _groupby_aggregate as an optional kwarg, but I didn't really like having that function have two different behaviors based on proliferating kwargs, so I split it out. Mostly a gut feeling, rather than any strong reasoning.
Implements groupby/median, both as a single function and as a dictionary-based agg spec. We take a similar approach to groupby/apply, in that we shuffle based on the group keys before performing an embarassingly parallel groupby/agg.
This is similar to #9302, but with the important difference that the initial aggregation cannot actually do any aggregation, since we must have the full group together to perform an exact group-wise aggregation. On the other hand, we can't do a naive groupby/apply due to needing to specifically handle the
observedkeyword for grouping by categoricals. So instead the initial aggregation is replaced by an initial set-index, with special handling forobservedto make sure the unobserved keys show up as expected. This is not the only possible approach, but does the job.pre-commit run --all-files