Skip to content

Optimize groupby when by contains ddf.index#8442

Open
jsignell wants to merge 8 commits intodask:mainfrom
jsignell:optimized-groupby
Open

Optimize groupby when by contains ddf.index#8442
jsignell wants to merge 8 commits intodask:mainfrom
jsignell:optimized-groupby

Conversation

@jsignell
Copy link
Member

@jsignell jsignell commented Dec 2, 2021

Copy link
Collaborator

@gjoseph92 gjoseph92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I think this is pretty much it! We'll need to figure out how to apply this to other groupby methods too, I think the list of things calling aca directly are:

  • _cum_agg
  • var
  • cov
  • aggregate
  • SeriesGroupBy.nunique (could use a totally different implementation when divisions are known I think)

@jsignell
Copy link
Member Author

jsignell commented Dec 3, 2021

Just for my own reference, this is the examples I've been playing around with:

import dask

ddf = dask.datasets.timeseries(freq="1H")
ddf = ddf.set_index("name", divisions=("Alice", "Laura", "Ursula", "Zelda")).persist()
ddf.groupby("name").first()
# ntasks = 6

and comparing that with a naive approach:

import dask

ddf = dask.datasets.timeseries(freq="1H").persist()
ddf.groupby("name").first()
# ntasks = 65

@jsignell
Copy link
Member Author

Ok I think I've hit them all now. Let me know if you have ideas for more tests.

@jsignell
Copy link
Member Author

Actually. I think I missed aggregate. I'll get that one after lunch

@jsignell
Copy link
Member Author

Turns out aggregate is a little tricky. I'm still looking at it.

@jsignell jsignell marked this pull request as ready for review February 1, 2022 15:52
@jsignell
Copy link
Member Author

jsignell commented Feb 1, 2022

Ok this has now sat for a while and I am tempted to merge it and handle aggregate later. Thoughts @gjoseph92?

@gjoseph92
Copy link
Collaborator

I do really want to see this. I haven't looked at aggregate myself so I don't have a sense of how difficult it is. If we don't have time to get it right now, I agree it's probably worth merging as is—still adds a lot of value. Would be good to document in some way that aggregate won't be as efficient on the index for now?

@jsignell
Copy link
Member Author

jsignell commented Feb 1, 2022

Ok so either tests never really passed or something strange is going on. I'm going to keep investigating, but for now I am suspicious about how _groupby_raise_unaligned mutates the kwargs.

@gjoseph92
Copy link
Collaborator

@jsignell just checking in, have you looked into this any more? Anything I can help with?

@jsignell
Copy link
Member Author

@jsignell just checking in, have you looked into this any more? Anything I can help with?

It turned into a bit of a mess. I realized that groupby_raise_unaligned was mutating the kwargs which I think was an issue because I was running it more than once.

@jsignell
Copy link
Member Author

Good lord. Tried to sort this one out again yesterday and it's just getting all kinds of errors. I think I'll need to go through commit by commit and try to figure out where this went sideways.

@github-actions github-actions bot added the needs attention It's been a while since this was pushed on. Needs attention from the owner or a maintainer. label Aug 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dataframe needs attention It's been a while since this was pushed on. Needs attention from the owner or a maintainer.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Optimized groupby aggregations when grouping by a sorted index

2 participants