Optimize groupby when by contains ddf.index #8442
Nice, I think this is pretty much it! We'll need to figure out how to apply this to other groupby methods too. I think the methods that call aca directly are:
- _cum_agg
- var
- cov
- aggregate
- SeriesGroupBy.nunique (could use a totally different implementation when divisions are known, I think)
|
Just for my own reference, these are the examples I've been playing around with:
import dask
ddf = dask.datasets.timeseries(freq="1H")
ddf = ddf.set_index("name", divisions=("Alice", "Laura", "Ursula", "Zelda")).persist()
ddf.groupby("name").first()
# ntasks = 6
and comparing that with a naive approach:
import dask
ddf = dask.datasets.timeseries(freq="1H").persist()
ddf.groupby("name").first()
# ntasks = 65
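For anyone following along, here is a rough sketch of why the indexed version can get away with so few tasks, as I understand it. When the group key is the index and divisions are known, every group lives in exactly one partition, so per-partition results can be combined without a cross-partition reduction. This is plain Python, not dask's actual code path; the toy rows and the helpers `first_per_group` and `split_by_divisions` are hypothetical and exist only for illustration.

```python
import bisect

# Hypothetical toy rows of (name, value); not the timeseries dataset above.
rows = [("Alice", 1), ("Bob", 2), ("Alice", 3),
        ("Zelda", 4), ("Laura", 5), ("Bob", 6)]
divisions = ("Alice", "Laura", "Ursula", "Zelda")


def first_per_group(records):
    """Map each key to the first value seen for it."""
    out = {}
    for key, value in records:
        out.setdefault(key, value)
    return out


def split_by_divisions(records, divisions):
    """Route each row to one partition by key.

    Assumes every key lies within the divisions; the last partition
    includes its right edge.
    """
    nparts = len(divisions) - 1
    parts = [[] for _ in range(nparts)]
    for key, value in records:
        i = min(bisect.bisect_right(divisions, key) - 1, nparts - 1)
        parts[i].append((key, value))
    return parts


parts = split_by_divisions(rows, divisions)

# Per-partition results can simply be merged: no key appears in two partitions.
merged = {}
for part in parts:
    result = first_per_group(part)
    assert not (merged.keys() & result.keys())
    merged.update(result)

# The naive approach looks at all rows at once.
naive = first_per_group(rows)
print(merged == naive)
```

Without known divisions, the same key could show up in several partitions, which is why the naive version needs a combine step across partitions and ends up with many more tasks.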
|
Ok I think I've hit them all now. Let me know if you have ideas for more tests. |
|
Actually. I think I missed aggregate. I'll get that one after lunch |
|
Turns out aggregate is a little tricky. I'm still looking at it. |
Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>
d9bc63d to 27ff4ce
|
Ok this has now sat for a while and I am tempted to merge it and handle |
|
I do really want to see this. I haven't looked at |
|
Ok so either tests never really passed or something strange is going on. I'm going to keep investigating, but for now I am suspicious about how |
|
@jsignell just checking in, have you looked into this any more? Anything I can help with? |
It turned into a bit of a mess. I realized that groupby_raise_unaligned was mutating the kwargs, which I think was an issue because I was running it more than once. |
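The general hazard, sketched with hypothetical helpers (these are not dask's actual functions, just an illustration of mutating a shared options dict and then calling the helper a second time):

```python
def helper_mutating(options):
    # Hypothetical helper: pops "by" out of the caller's dict as a side effect.
    by = options.pop("by")
    return list(by)


def helper_safe(options):
    # Same logic, but works on a shallow copy so the caller's dict is untouched.
    options = dict(options)
    by = options.pop("by")
    return list(by)


shared = {"by": ("name",), "sort": True}
first = helper_mutating(shared)
try:
    helper_mutating(shared)  # "by" is already gone on the second call
    second_call_failed = False
except KeyError:
    second_call_failed = True

shared_safe = {"by": ("name",), "sort": True}
a = helper_safe(shared_safe)
b = helper_safe(shared_safe)  # safe to run more than once
```

Copying before modifying is one possible fix; another would be to avoid modifying the options at all and build a new dict to pass along.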
|
Good lord. Tried to sort this one out again yesterday and it's just getting all kinds of errors. I think I'll need to go through commit by commit and try to figure out where this went sideways. |
pre-commit run --all-files