Implement {Series,DataFrame}GroupBy fillna methods#8869
Implement {Series,DataFrame}GroupBy fillna methods#8869jrbourbeau merged 27 commits intodask:mainfrom
fillna methods#8869Conversation
ian-r-rose
left a comment
There was a problem hiding this comment.
Thanks @pavithraes, this is looking great! The implementation is looking good to me, I think we could get some more coverage of the optional args, but otherwise I think this looks close.
I also notice that there are forwardfill and backfill aliases in pandas, we could think about adding those as well while we're at it.
601d509 to
eff9714
Compare
|
@ian-r-rose Thanks for the review!
Looks like I also noticed the Full tracebackddf.groupby("A").fillna(0, axis=1)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/generic.py in _get_axis_number(cls, axis)
545 try:
--> 546 return cls._AXIS_TO_AXIS_NUMBER[axis]
547 except KeyError:
KeyError: 1
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/groupby/generic.py in _transform_general(self, func, *args, **kwargs)
1316 try:
-> 1317 path, res = self._choose_path(fast_path, slow_path, group)
1318 except TypeError:
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/groupby/generic.py in _choose_path(self, fast_path, slow_path, group)
1393 path = slow_path
-> 1394 res = slow_path(group)
1395
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/groupby/generic.py in <lambda>(group)
1386 fast_path = lambda group: func(group, *args, **kwargs)
-> 1387 slow_path = lambda group: group.apply(
1388 lambda x: func(x, *args, **kwargs), axis=self.axis
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwargs)
8739 )
-> 8740 return op.apply()
8741
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/apply.py in apply(self)
687
--> 688 return self.apply_standard()
689
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/apply.py in apply_standard(self)
811 def apply_standard(self):
--> 812 results, res_index = self.apply_series_generator()
813
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/apply.py in apply_series_generator(self)
827 # ignore SettingWithCopy here in case the user mutates
--> 828 results[i] = self.f(v)
829 if isinstance(results[i], ABCSeries):
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/groupby/generic.py in <lambda>(x)
1387 slow_path = lambda group: group.apply(
-> 1388 lambda x: func(x, *args, **kwargs), axis=self.axis
1389 )
~/Developer/Dask/dask/dask/dataframe/groupby.py in _fillna_groups(self, groups, **kwargs)
1987 def _fillna_groups(self, groups, **kwargs):
-> 1988 return groups.fillna(**kwargs)
1989
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
310 )
--> 311 return func(*args, **kwargs)
312
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/series.py in fillna(self, value, method, axis, inplace, limit, downcast)
4815 ) -> Series | None:
-> 4816 return super().fillna(
4817 value=value,
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/generic.py in fillna(self, value, method, axis, inplace, limit, downcast)
6320 axis = 0
-> 6321 axis = self._get_axis_number(axis)
6322
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/generic.py in _get_axis_number(cls, axis)
547 except KeyError:
--> 548 raise ValueError(f"No axis named {axis} for object type {cls.__name__}")
549
ValueError: No axis named 1 for object type Series
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
<ipython-input-4-f9b1a98697a0> in <module>
----> 1 ddf.groupby("A").fillna(0, axis=1)
~/Developer/Dask/dask/dask/dataframe/groupby.py in fillna(self, value, method, limit, axis)
1993 "groupby-fillna with value=dict/Series/DataFrame is currently not supported"
1994 )
-> 1995 meta = self._meta_nonempty.transform(
1996 self._fillna_groups, value=value, method=method, limit=limit, axis=axis
1997 )
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/groupby/generic.py in transform(self, func, engine, engine_kwargs, *args, **kwargs)
1355 @Appender(_transform_template)
1356 def transform(self, func, *args, engine=None, engine_kwargs=None, **kwargs):
-> 1357 return self._transform(
1358 func, *args, engine=engine, engine_kwargs=engine_kwargs, **kwargs
1359 )
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/groupby/groupby.py in _transform(self, func, engine, engine_kwargs, *args, **kwargs)
1444
1445 if not isinstance(func, str):
-> 1446 return self._transform_general(func, *args, **kwargs)
1447
1448 elif func not in base.transform_kernel_allowlist:
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/groupby/generic.py in _transform_general(self, func, *args, **kwargs)
1320 except ValueError as err:
1321 msg = "transform must return a scalar value for each group"
-> 1322 raise ValueError(msg) from err
1323
1324 if isinstance(res, Series):
ValueError: transform must return a scalar value for each group |
ian-r-rose
left a comment
There was a problem hiding this comment.
Thanks @pavithraes, this is looking close.
I did some experimenting with grouping by multiple columns, and ran into some troubles with transform not liking empty groups very much, but apply did just fine. I wonder if we should just simplify things by using apply everywhere
Thanks for checking! I agree that we can use apply throughout. I'll also include a test for grouping by multiple columns. :) |
|
@pavithraes This change in upstream pandas might also affect what you are doing here. |
Thanks for the note! I just triggered an upstream build to check and it does seem to be affected. :/ |
|
@jrbourbeau Thanks for helping review this! To solve the upstream build issues, we need to add I think we can't add that right now because Dask's groupby-apply doesn't support the |
ian-r-rose
left a comment
There was a problem hiding this comment.
I rebased your PR on top of #8961 to see if it still works with the group_keys changes, and it does 🎉 !
The bad news is: with the group_keys changes I think there is one more thing we need to cover. I tried parameterizing your test_fillna over group_keys=[True, False, None]. Everything works on pandas 1.4.x, but on 1.5.x, group_keys=True presents a problem: in that case, the apply returns with extra multiindex columns for your grouped-by items.
I think the fix should be fairly straightforward, however. For the case where self.group_keys is True in the dd.GroupBy object, we can do a final map_partitions to df.droplevel(by). That is to say, we can easily drop the extra index columns at the end when group_keys is True. It's annoying, but doable
|
@ian-r-rose Thanks for the notes! I've pushed the |
|
I'm not sure why the macos-python-3.8 test is failing with: https://github.com/dask/dask/runs/6209266570?check_suite_focus=true#step:6:21560 :/ Maybe it's related to #8889 (comment)? Edit: Looks like it was a one-off. |
ian-r-rose
left a comment
There was a problem hiding this comment.
Sorry to be slow in reviewing @pavithraes, and thanks for your persistence. This looks great! I have one minor comment that I would consider optional. But from my perspective, this is ready to go into our next release!
Co-authored-by: Ian Rose <ian.r.rose@gmail.com>
|
@jrbourbeau I think this is ready to merge! Could you please help take a quick look? |
jrbourbeau
left a comment
There was a problem hiding this comment.
Thanks @pavithraes and @ian-r-rose! There's a merge conflict that's popped up (apologies for the delayed response). @pavithraes would you mind fixing that conflict? Otherwise, this looks good to go
|
@jrbourbeau Thanks for taking a look! I've fixed the conflict. :) |
|
Thank you @pavithraes! |
ffill#8708pre-commit run --all-filesTODO before merging: