Optimize groupby when `by` contains `ddf.index` by jsignell · Pull Request #8442 · dask/dask

jsignell · 2021-12-02T16:26:00Z

Closes #8361 (Optimized groupby aggregations when grouping by a sorted index)
Tests added / passed
Passes pre-commit run --all-files

gjoseph92

Nice, I think this is pretty much it! We'll need to figure out how to apply this to other groupby methods too. I think the list of things calling aca directly is:

dask/dataframe/groupby.py

jsignell · 2021-12-03T19:30:34Z

Just for my own reference, these are the examples I've been playing around with:

import dask

ddf = dask.datasets.timeseries(freq="1H")
ddf = ddf.set_index("name", divisions=("Alice", "Laura", "Ursula", "Zelda")).persist()
ddf.groupby("name").first()
# ntasks = 6

and comparing that with a naive approach:

import dask

ddf = dask.datasets.timeseries(freq="1H").persist()
ddf.groupby("name").first()
# ntasks = 65
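A rough, pure-Python sketch of why the indexed case can get away with so few tasks. This is an illustration with made-up data and hypothetical helper names, not the dask implementation: when partitions are sorted on the grouping key and no key spans two partitions, each partition can be reduced on its own and the results simply combined. When keys are scattered, that shortcut gives wrong answers, so a cross-partition reduction is needed.

```python
def first_per_group(rows):
    """Return {key: first value seen} for a list of (key, value) rows."""
    out = {}
    for key, value in rows:
        out.setdefault(key, value)
    return out


def per_partition(partitions):
    """Shortcut: reduce each partition independently, then merge the dicts.

    Only correct if every key lives in exactly one partition.
    """
    merged = {}
    for part in partitions:
        merged.update(first_per_group(part))
    return merged


def global_first(partitions):
    """Reference answer: look at all rows in their original order."""
    rows = [row for part in partitions for row in part]
    return first_per_group(rows)


# Hypothetical data: sorted by key, no key spans two partitions.
aligned = [
    [("Alice", 1), ("Alice", 2), ("Bob", 3)],
    [("Laura", 4), ("Laura", 5)],
    [("Ursula", 6), ("Zelda", 7)],
]

# Same rows, but keys are scattered across partitions.
scattered = [
    [("Alice", 1), ("Laura", 4), ("Zelda", 7)],
    [("Alice", 2), ("Ursula", 6)],
    [("Bob", 3), ("Laura", 5)],
]

# The shortcut matches the reference only for the aligned layout.
print(per_partition(aligned) == global_first(aligned))
print(per_partition(scattered) == global_first(scattered))
```

The second comparison fails because "Alice" and "Laura" appear in more than one partition, which is the situation the general (non-index) groupby path has to handle.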

jsignell · 2021-12-13T17:23:33Z

Ok I think I've hit them all now. Let me know if you have ideas for more tests.

jsignell · 2021-12-13T17:24:18Z

Actually. I think I missed aggregate. I'll get that one after lunch

jsignell · 2021-12-13T21:01:58Z

Turns out aggregate is a little tricky. I'm still looking at it.


jsignell · 2022-02-01T15:53:04Z

Ok this has now sat for a while and I am tempted to merge it and handle aggregate later. Thoughts @gjoseph92?

gjoseph92 · 2022-02-01T16:06:57Z

I do really want to see this. I haven't looked at aggregate myself so I don't have a sense of how difficult it is. If we don't have time to get it right now, I agree it's probably worth merging as is—still adds a lot of value. Would be good to document in some way that aggregate won't be as efficient on the index for now?

jsignell · 2022-02-01T22:21:08Z

Ok so either tests never really passed or something strange is going on. I'm going to keep investigating, but for now I am suspicious about how _groupby_raise_unaligned mutates the kwargs.

gjoseph92 · 2022-02-24T20:05:34Z

@jsignell just checking in, have you looked into this any more? Anything I can help with?

jsignell · 2022-02-24T20:23:28Z

> @jsignell just checking in, have you looked into this any more? Anything I can help with?

It turned into a bit of a mess. I realized that _groupby_raise_unaligned was mutating the kwargs, which I think was an issue because I was running it more than once.
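For illustration only, a generic sketch of how a helper that mutates a shared kwargs dict can behave differently the second time it runs. The helper names and data here are hypothetical, and this is not the actual `_groupby_raise_unaligned` code.

```python
def groupby_helper(rows, kwargs):
    """Hypothetical helper: pops `by` out of the dict it was handed."""
    by = kwargs.pop("by")  # mutates the caller's dict
    return sorted({row[by] for row in rows})


def groupby_helper_safe(rows, kwargs):
    """Same logic, but works on a shallow copy of the dict."""
    kwargs = dict(kwargs)  # the caller's dict stays untouched
    by = kwargs.pop("by")
    return sorted({row[by] for row in rows})


rows = [{"name": "Alice", "x": 1}, {"name": "Bob", "x": 2}]

# Mutating version: the first call succeeds and empties the shared dict.
shared = {"by": "name"}
first = groupby_helper(rows, shared)
try:
    groupby_helper(rows, shared)
    second_call_failed = False
except KeyError:
    # `by` was already popped by the first call.
    second_call_failed = True

# Copying version: repeated calls see the same input.
shared_safe = {"by": "name"}
safe_first = groupby_helper_safe(rows, shared_safe)
safe_second = groupby_helper_safe(rows, shared_safe)
```

Copying at the top of the helper is one simple way to make it safe to call repeatedly with the same dict.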

jsignell · 2022-04-19T12:52:50Z

Good lord. I tried to sort this one out again yesterday and I'm just getting all kinds of errors. I think I'll need to go through commit by commit and try to figure out where this went sideways.

github-actions bot added the dataframe label Dec 2, 2021

gjoseph92 reviewed Dec 2, 2021

dask/dataframe/groupby.py (outdated)

gjoseph92 reviewed Dec 3, 2021

dask/dataframe/groupby.py (outdated)

dask/dataframe/groupby.py (outdated)

jsignell and others added 8 commits February 1, 2022 10:39

First attempt at using map_partitions instead of aca

430b58b

Only do the special case if divisions are unique

554f0f8

Add warning for case where divisions are duplicated

a10dbf5

Update dask/dataframe/groupby.py

541fec3

Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>

Update dask/dataframe/groupby.py

ca37d16

Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>

Fix merge and don't include data in warning

3e87dfc

All groupby transformations and aggregations now have fastpath

73d2e29

Fix

27ff4ce

jsignell force-pushed the optimized-groupby branch from d9bc63d to 27ff4ce February 1, 2022 15:50

jsignell marked this pull request as ready for review February 1, 2022 15:52

jsignell mentioned this pull request Feb 17, 2022

Add Groupby.rank for DataFrame and Series GroupBy #8659

Open

3 tasks

ian-r-rose mentioned this pull request Apr 21, 2022

DataFrame groupby+apply/transform/shift on index results in unwanted shuffle #8959

Open

github-actions bot added the needs attention label Aug 26, 2024
