Shuffle-based groupby for single functions #9504

Merged
jrbourbeau merged 15 commits into dask:main from ian-r-rose:more-shuffle-based-groupby
Oct 7, 2022
@ian-r-rose (Collaborator) commented Sep 19, 2022

This adds the new shuffle kwarg to more of the groupby aggregations and enables shuffle-based aggregation in them. As a result, df.groupby("a").sum() and df.groupby("a").agg("sum") now both have access to the same aggregations.

New functions with shuffling:

  • sum
  • prod
  • min
  • max
  • idxmin
  • idxmax
  • count
  • mean
  • size
  • first
  • last
  • head
  • tail

While I'm here, I also looked at adding groupby().median(). This one is a bit unusual, in that it skips the initial agg step before the shuffle (there is a dummy set_index step instead). I'm leaving it for a follow-up.
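For readers unfamiliar with the pattern, here is a minimal pure-Python sketch of a shuffle-based groupby sum with the "initial agg step before the shuffle" mentioned above. It is only an illustration of the idea, not the code in this PR: the function names (`bucket`, `groupby_sum`, `shuffle_groupby_sum`), the list-of-tuples partitions, and the crc32-based routing are all assumptions made for the example.

```python
from collections import defaultdict
from zlib import crc32


def bucket(key, npartitions_out):
    """Pick an output partition for a key (hypothetical routing rule)."""
    # crc32 is stable across processes, unlike the built-in hash() for str
    return crc32(str(key).encode()) % npartitions_out


def groupby_sum(rows):
    """In-memory groupby-sum over an iterable of (key, value) rows."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)


def shuffle_groupby_sum(partitions, npartitions_out):
    # 1. Initial agg: reduce each input partition on its own
    partials = [groupby_sum(part) for part in partitions]

    # 2. Shuffle: send each partial result to the partition that owns its key
    shuffled = [[] for _ in range(npartitions_out)]
    for partial in partials:
        for key, value in partial.items():
            shuffled[bucket(key, npartitions_out)].append((key, value))

    # 3. Final agg: every key now lives in exactly one output partition
    return [groupby_sum(rows) for rows in shuffled]


partitions = [
    [("x", 1), ("y", 2), ("x", 3)],
    [("y", 4), ("z", 5)],
    [("x", 6), ("z", 7)],
]
result = shuffle_groupby_sum(partitions, npartitions_out=2)
merged = {key: value for part in result for key, value in part.items()}
```

Because each key is routed to a single output partition, the result stays partitioned and never has to be collected into one place, which is the usual motivation for shuffling when there are many groups.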

TODO

  • Add some more tests
  • Possibly add cov, var, nunique, which need special treatment

@ian-r-rose ian-r-rose marked this pull request as draft September 19, 2022 22:43
@ian-r-rose ian-r-rose marked this pull request as ready for review September 20, 2022 18:31
@jrbourbeau (Member) left a comment

Thanks @ian-r-rose -- apologies for the delay. Taking a look at this now

@jrbourbeau (Member) left a comment

Thanks @ian-r-rose. Overall these changes look good -- just left a few small-ish comments / questions

cc @rjzamora for visibility

@ian-r-rose ian-r-rose force-pushed the more-shuffle-based-groupby branch from 7a71d2b to cf975fc Compare October 7, 2022 19:24
@ian-r-rose ian-r-rose force-pushed the more-shuffle-based-groupby branch from cf975fc to 57cd786 Compare October 7, 2022 19:30
@jrbourbeau (Member) left a comment

Thanks @ian-r-rose -- let's merge after CI finishes

@jrbourbeau jrbourbeau merged commit 5ba240b into dask:main Oct 7, 2022
@ian-r-rose (Collaborator, Author) commented

Thanks for the review @jrbourbeau !

Development

Successfully merging this pull request may close these issues.

Implement shuffle-based-groupby for simpler aggregations