Groupby median by ian-r-rose · Pull Request #9516 · dask/dask

ian-r-rose · 2022-09-23T20:11:29Z

Implements groupby/median, both as a single function and as a dictionary-based agg spec. We take a similar approach to groupby/apply, in that we shuffle based on the group keys before performing an embarrassingly parallel groupby/agg.
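To make the shuffle-then-aggregate idea concrete, here is a minimal pure-Python sketch. It is not the dask implementation: the helper names (shuffle_by_key, median_per_partition, groupby_median) and the list-of-tuples "partitions" are assumptions made purely for illustration. The point is that once all rows for a key sit in the same partition, each partition can compute its medians independently.

```python
from collections import defaultdict
from statistics import median


def shuffle_by_key(partitions, npartitions):
    """Route every (key, value) row to the output partition chosen by hash(key).

    After this step, all rows sharing a key live in exactly one partition.
    """
    out = [[] for _ in range(npartitions)]
    for part in partitions:
        for key, value in part:
            out[hash(key) % npartitions].append((key, value))
    return out


def median_per_partition(partition):
    """Exact group-wise median within one partition (no cross-partition data needed)."""
    groups = defaultdict(list)
    for key, value in partition:
        groups[key].append(value)
    return {key: median(values) for key, values in groups.items()}


def groupby_median(partitions, npartitions=2):
    """Hypothetical driver: shuffle first, then aggregate each partition independently."""
    result = {}
    for part in shuffle_by_key(partitions, npartitions):
        # Keys never span partitions after the shuffle, so update() cannot clobber.
        result.update(median_per_partition(part))
    return result


# Rows for key "a" start out spread across both input partitions.
partitions = [
    [("a", 1), ("b", 10), ("a", 5)],
    [("a", 3), ("b", 30), ("c", 7)],
]
medians = groupby_median(partitions)
```

In this sketch the per-partition step is trivially parallel, which is the property the description relies on; the cost is that the shuffle moves every row rather than a reduced summary.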

This is similar to #9302, with one important difference: the initial aggregation step cannot actually aggregate anything, since an exact group-wise median requires the full group to be in one place. On the other hand, we can't do a naive groupby/apply, because we need to specifically handle the observed keyword when grouping by categoricals. So instead the initial aggregation is replaced by an initial set-index, with special handling for observed to make sure the unobserved keys show up as expected. This is not the only possible approach, but it does the job.
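A small sketch of the two constraints described above, again in plain Python rather than dask code. The first part shows why per-partition medians cannot be combined into an exact result. The second part uses a hypothetical helper, fill_unobserved, to mimic what a categorical grouping with unobserved keys is expected to produce: every declared category appears in the output, with NaN for categories that have no rows. The helper name and the dict-based result are assumptions for illustration only.

```python
import math
from statistics import median

# One group's values, split across two partitions before any shuffle.
part1 = [1, 2, 100]
part2 = [3, 4]

# Exact median needs the whole group together.
exact = median(part1 + part2)

# Combining per-partition medians generally gives a different number,
# which is why the initial step cannot pre-aggregate.
median_of_medians = median([median(part1), median(part2)])


def fill_unobserved(result, categories):
    """Hypothetical helper: emit every declared category, NaN where no rows were seen."""
    return {cat: result.get(cat, math.nan) for cat in categories}


observed_result = {"a": 3, "b": 20.0}
full_result = fill_unobserved(observed_result, categories=["a", "b", "c"])
```

Here exact and median_of_medians differ, and full_result carries an entry for "c" even though no row had that key.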

Closes #9489 (Implement groupby/median)
Tests added / passed
Passes pre-commit run --all-files

jrbourbeau

Thanks @ian-r-rose. I took a quick look and this looks good at a high level. It also looks like this won't impact any existing groupby code -- so I'm happy with the changes here if you are.

@rjzamora your thoughts are certainly welcome here if you have bandwidth to take a look (no obligation, though)

jrbourbeau · 2022-10-13T17:35:22Z

dask/dataframe/groupby.py

    return aggfunc(grouped, **kwargs)


+def _groupby_aggregate_spec(


Is this just for ergonomics, or does the existing _groupby_aggregate function not fully have the functionality we need?

This was mostly for ergonomics. A previous version of this just fed spec into _groupby_aggregate as an optional kwarg, but I didn't really like having that function have two different behaviors based on proliferating kwargs, so I split it out. Mostly a gut feeling, rather than any strong reasoning.

dask/dataframe/groupby.py

dask/dataframe/tests/test_groupby.py

Ian Rose and others added 21 commits September 19, 2022 10:44

WIP more shuffle-based groupby/agg, including for medians, which need some special attention in that we cannot aggregate before shuffling.

f01287f

Don't select grouped-by columns

5eaa55b

Cleanup

c4b48cf

Handle series.

08444f0

Add shuffle-based agg to a bunch of functions

8b88b85

Consolidate logic

17b85d6

Better handle meta for median

8b60f8a

Pass observed through

7fe03df

Extract median stuff for another branch

a7af367

Increase coverage of agg functions to cover split_out>1/shuffle

3120638

Workaround for parquet writing failure using some datetime series but not others (dask#9500)

b662a85

bump version to 2022.9.1

3e31d56

WIP median

fa2837e

experimenting with categoricals

da9eb3b

Make pass

9b5da87

WIP

4142c33

messy, but passing

4c4d382

Correct name

5cdc37b

Remove unused function

120eccf

Cleanup

1aa5e50

Add some documentation

674c55c

github-actions bot added the dataframe, documentation, and io labels Sep 23, 2022

Ian Rose added 2 commits October 6, 2022 12:48

Implement median for agg spec

8ccd332

Merge branch 'main' into groupby-median

8eded87

github-actions bot removed the io and documentation labels Oct 8, 2022

Ian Rose added 2 commits October 11, 2022 13:19

Cleanup

dd4c389

Add test

90c1f88

ian-r-rose marked this pull request as ready for review October 11, 2022 21:33

ian-r-rose added the feature label Oct 11, 2022

jrbourbeau reviewed Oct 13, 2022

Allow shuffle method to be configured in .median()

d52a8f3

jrbourbeau approved these changes Oct 14, 2022

jrbourbeau merged commit dcdb90e into dask:main Oct 14, 2022

jrbourbeau mentioned this pull request Jan 11, 2023

Groupby Quantile #9824

Open

fjetter mentioned this pull request Jan 17, 2023

Dont require dask config shuffle default #9826

Merged

jrbourbeau merged 26 commits into dask:main from ian-r-rose:groupby-median
