Implement {Series,DataFrame}GroupBy `fillna` methods by pavithraes · Pull Request #8869 · dask/dask

pavithraes · 2022-03-31T15:32:53Z

Closes #8708 (Missing Built-in methods for SeriesGroupBy ffill)
Tests added / passed
Passes pre-commit run --all-files

TODO before merging:

Add docstring
Open new issue for implementing value = dict/Series/DataFrame (done: Dask Dataframe groupby-fillna for other value types #8922)
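
The restriction to scalar fill values noted in the TODO can be pictured as a small guard. The sketch below is an assumption about its shape, not the code merged in this PR: the helper name `check_fill_value` is hypothetical, and only the error message is taken from the traceback posted later in this thread.

```python
import numpy as np


def check_fill_value(value):
    """Hypothetical guard: accept scalars (or None), reject container-like values."""
    # np.isscalar is True for Python/NumPy scalars and strings,
    # and False for dict, list, Series and DataFrame objects.
    if value is not None and not np.isscalar(value):
        raise NotImplementedError(
            "groupby-fillna with value=dict/Series/DataFrame is currently not supported"
        )
    return value


check_fill_value(0)     # scalar: accepted
check_fill_value(None)  # no value given: accepted
try:
    check_fill_value({"B": 0})  # dict: rejected
except NotImplementedError as err:
    print(err)
```

Raising early keeps the unsupported cases explicit until the follow-up issue (#8922) is addressed.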

ian-r-rose

Thanks @pavithraes, this is looking great! The implementation looks good to me. I think we could get some more coverage of the optional args, but otherwise this looks close.

I also notice that there are forwardfill and backfill aliases in pandas, we could think about adding those as well while we're at it.

dask/dataframe/groupby.py

dask/dataframe/tests/test_groupby.py

pavithraes · 2022-04-05T19:36:07Z

@ian-r-rose Thanks for the review!

I also notice that there are forwardfill and backfill aliases in pandas, we could think about adding those as well while we're at it.

Looks like pad and backfill have been deprecated in pandas, so I think we don't need to add them?

I also noticed the transform implementation doesn't work for axis=1, it throws: ValueError: transform must return a scalar value for each group. I'm assuming this is because transform applies the function on a column-by-column basis?

Full traceback

ddf.groupby("A").fillna(0, axis=1)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/generic.py in _get_axis_number(cls, axis)
    545         try:
--> 546             return cls._AXIS_TO_AXIS_NUMBER[axis]
    547         except KeyError:

KeyError: 1

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/groupby/generic.py in _transform_general(self, func, *args, **kwargs)
   1316             try:
-> 1317                 path, res = self._choose_path(fast_path, slow_path, group)
   1318             except TypeError:

~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/groupby/generic.py in _choose_path(self, fast_path, slow_path, group)
   1393         path = slow_path
-> 1394         res = slow_path(group)
   1395 

~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/groupby/generic.py in <lambda>(group)
   1386             fast_path = lambda group: func(group, *args, **kwargs)
-> 1387             slow_path = lambda group: group.apply(
   1388                 lambda x: func(x, *args, **kwargs), axis=self.axis

~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwargs)
   8739         )
-> 8740         return op.apply()
   8741 

~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/apply.py in apply(self)
    687 
--> 688         return self.apply_standard()
    689 

~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/apply.py in apply_standard(self)
    811     def apply_standard(self):
--> 812         results, res_index = self.apply_series_generator()
    813 

~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/apply.py in apply_series_generator(self)
    827                 # ignore SettingWithCopy here in case the user mutates
--> 828                 results[i] = self.f(v)
    829                 if isinstance(results[i], ABCSeries):

~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/groupby/generic.py in <lambda>(x)
   1387             slow_path = lambda group: group.apply(
-> 1388                 lambda x: func(x, *args, **kwargs), axis=self.axis
   1389             )

~/Developer/Dask/dask/dask/dataframe/groupby.py in _fillna_groups(self, groups, **kwargs)
   1987     def _fillna_groups(self, groups, **kwargs):
-> 1988         return groups.fillna(**kwargs)
   1989 

~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    310                 )
--> 311             return func(*args, **kwargs)
    312 

~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/series.py in fillna(self, value, method, axis, inplace, limit, downcast)
   4815     ) -> Series | None:
-> 4816         return super().fillna(
   4817             value=value,

~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/generic.py in fillna(self, value, method, axis, inplace, limit, downcast)
   6320             axis = 0
-> 6321         axis = self._get_axis_number(axis)
   6322 

~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/generic.py in _get_axis_number(cls, axis)
    547         except KeyError:
--> 548             raise ValueError(f"No axis named {axis} for object type {cls.__name__}")
    549 

ValueError: No axis named 1 for object type Series

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<ipython-input-4-f9b1a98697a0> in <module>
----> 1 ddf.groupby("A").fillna(0, axis=1)

~/Developer/Dask/dask/dask/dataframe/groupby.py in fillna(self, value, method, limit, axis)
   1993                 "groupby-fillna with value=dict/Series/DataFrame is currently not supported"
   1994             )
-> 1995         meta = self._meta_nonempty.transform(
   1996             self._fillna_groups, value=value, method=method, limit=limit, axis=axis
   1997         )

~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/groupby/generic.py in transform(self, func, engine, engine_kwargs, *args, **kwargs)
   1355     @Appender(_transform_template)
   1356     def transform(self, func, *args, engine=None, engine_kwargs=None, **kwargs):
-> 1357         return self._transform(
   1358             func, *args, engine=engine, engine_kwargs=engine_kwargs, **kwargs
   1359         )

~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/groupby/groupby.py in _transform(self, func, engine, engine_kwargs, *args, **kwargs)
   1444 
   1445         if not isinstance(func, str):
-> 1446             return self._transform_general(func, *args, **kwargs)
   1447 
   1448         elif func not in base.transform_kernel_allowlist:

~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/groupby/generic.py in _transform_general(self, func, *args, **kwargs)
   1320             except ValueError as err:
   1321                 msg = "transform must return a scalar value for each group"
-> 1322                 raise ValueError(msg) from err
   1323 
   1324             if isinstance(res, Series):

ValueError: transform must return a scalar value for each group
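
The innermost error in the traceback can be reproduced without Dask or groupby. A minimal sketch, assuming only pandas: when a function is applied column by column it receives a Series, and a Series has no axis 1, so forwarding `axis=1` to `fillna` fails. The Series below is made up for the example.

```python
import pandas as pd

column = pd.Series([1.0, None, 3.0], name="B")

# axis=0 is the only axis a Series has, so this works.
filled = column.fillna(0, axis=0)

# Forwarding axis=1 to a single column raises a ValueError,
# which transform then re-raises with its own message.
try:
    column.fillna(0, axis=1)
except ValueError as err:
    message = str(err)
    print(message)
```

This suggests the failure comes from passing `axis=1` down to each column rather than from the fill itself.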

ian-r-rose · 2022-04-06T15:46:38Z

Looks like pad and backfill have been deprecated in pandas, so I think we don't need to add them?

Sounds good to me, good catch.

I also noticed the transform implementation doesn't work for axis=1

Interesting! Let's take a look in person.

dask/dataframe/groupby.py

dask/dataframe/tests/test_groupby.py

dask/dataframe/groupby.py

ian-r-rose

Thanks @pavithraes, this is looking close.

I did some experimenting with grouping by multiple columns and ran into some trouble with transform not liking empty groups very much, but apply did just fine. I wonder if we should simplify things by using apply everywhere.

dask/dataframe/groupby.py

pavithraes · 2022-04-08T13:04:24Z

I did some experimenting with grouping by multiple columns and ran into some trouble with transform not liking empty groups very much, but apply did just fine. I wonder if we should simplify things by using apply everywhere.

Thanks for checking! I agree that we can use apply throughout. I'll also include a test for grouping by multiple columns. :)
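
To illustrate the apply-based approach discussed here, the following is a minimal pandas-only sketch, not the Dask implementation: each group is filled independently by calling `ffill` inside `groupby(...).apply`. The frame and column names are made up for the example, and the value column is selected explicitly so the grouping column stays out of the function's input.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 1, 2, 2],
        "B": [1.0, np.nan, np.nan, 4.0],
    }
)

# group_keys=False asks pandas not to prepend the group key to the index.
via_apply = (
    df.groupby("A", group_keys=False)[["B"]]
    .apply(lambda group: group.ffill())
    .sort_index()
)

# The built-in pandas groupby method, for comparison.
builtin = df.groupby("A")[["B"]].ffill()

# The NaN at index 2 is the first row of group 2,
# so nothing is carried over from group 1.
print(via_apply)
```

Both results fill only within a group, which is the behaviour the Dask methods in this PR aim to match.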

ian-r-rose · 2022-04-08T20:16:45Z

@pavithraes This change in upstream pandas might also affect what you are doing here.

pavithraes · 2022-04-11T19:17:50Z

This change in upstream pandas might also affect what you are doing here.

Thanks for the note! I just triggered an upstream build to check and it does seem to be affected. :/

pavithraes · 2022-04-12T18:09:29Z

@jrbourbeau Thanks for helping review this!

To solve the upstream build issues, we need to add group_keys=False to the apply calls in this PR (False, because the fillna output does not include the columns we grouped by).

I think we can't add that right now because Dask's groupby-apply doesn't support the group_keys parameter yet, so we may need to resolve this with the other failing pandas-upstream-tests in #8875?

ian-r-rose

I rebased your PR on top of #8961 to see if it still works with the group_keys changes, and it does 🎉 !

The bad news is: with the group_keys changes I think there is one more thing we need to cover. I tried parameterizing your test_fillna over group_keys=[True, False, None]. Everything works on pandas 1.4.x, but on 1.5.x, group_keys=True presents a problem: in that case, the apply returns with extra multiindex columns for your grouped-by items.

I think the fix should be fairly straightforward, however. For the case where self.group_keys is True in the dd.GroupBy object, we can do a final map_partitions to df.droplevel(by). That is to say, we can easily drop the extra index levels at the end when group_keys is True. It's annoying, but doable.
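
The cleanup step described above can be sketched in plain pandas. This is an illustration under assumptions, not the code from this PR: it checks whether `apply` prepended the group key as an extra index level (as reported here for pandas 1.5 with `group_keys=True`) and, if so, drops that level by name. The frame and the `by` variable are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 1, 2, 2],
        "B": [1.0, np.nan, np.nan, 4.0],
    }
)
by = "A"

filled = df.groupby(by, group_keys=True)[["B"]].apply(lambda group: group.fillna(0))

# On pandas versions where group_keys=True prepends the key, the index is a
# MultiIndex of (A, original index); dropping the key level restores the original.
if isinstance(filled.index, pd.MultiIndex):
    filled = filled.droplevel(by)

filled = filled.sort_index()
print(filled.index.tolist())
```

The thread proposes running the same `droplevel` call on each partition through `map_partitions`; the `isinstance` check here is only a convenience so the sketch behaves the same across pandas versions.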

pavithraes · 2022-04-26T12:26:08Z

@ian-r-rose Thanks for the notes! I've pushed the map_partitions fix that you suggested, and it seems to work locally with pandas 1.4.2 and 1.5.0!

pavithraes · 2022-04-28T11:42:56Z

I'm not sure why the macos-python-3.8 test is failing with: https://github.com/dask/dask/runs/6209266570?check_suite_focus=true#step:6:21560 :/

Maybe it's related to #8889 (comment)?

Edit: Looks like it was a one-off.

ian-r-rose

Sorry to be slow in reviewing @pavithraes, and thanks for your persistence. This looks great! I have one minor comment that I would consider optional. But from my perspective, this is ready to go into our next release!

dask/dataframe/groupby.py

Co-authored-by: Ian Rose <ian.r.rose@gmail.com>

ian-r-rose

Thanks @pavithraes!

pavithraes · 2022-05-02T19:23:57Z

@jrbourbeau I think this is ready to merge! Could you please help take a quick look?

jrbourbeau

Thanks @pavithraes and @ian-r-rose! There's a merge conflict that's popped up (apologies for the delayed response). @pavithraes would you mind fixing that conflict? Otherwise, this looks good to go

pavithraes · 2022-05-10T17:57:45Z

@jrbourbeau Thanks for taking a look! I've fixed the conflict. :)

ian-r-rose · 2022-05-10T21:32:52Z

Thank you @pavithraes!

pavithraes added 2 commits March 31, 2022 20:57

use transform to implement groupby-fillna

0d66c9f

formatting

242b40c

pavithraes marked this pull request as draft March 31, 2022 15:33

github-actions bot added the dataframe label Mar 31, 2022

ian-r-rose reviewed Apr 1, 2022

dask/dataframe/groupby.py

dask/dataframe/groupby.py (outdated)

dask/dataframe/groupby.py (outdated)

dask/dataframe/tests/test_groupby.py (outdated)

pavithraes added 4 commits April 5, 2022 22:22

remove axis keyword from ffill() and bfill()

cdc3e0f

add top-level _fillna_groups()

46600e3

add NotImplementedError for non-scalar values

7b690dc

update tests

eff9714

pavithraes force-pushed the groupby-fillna branch from 601d509 to eff9714 Compare April 5, 2022 18:08

linting fixes

d6fc0f5

ian-r-rose reviewed Apr 6, 2022

dask/dataframe/groupby.py (outdated)

ian-r-rose reviewed Apr 6, 2022

dask/dataframe/tests/test_groupby.py

pavithraes added 4 commits April 6, 2022 22:31

move _fillna_groups() outside _GroupBy

19692d6

linting fixes

eb2738c

remove axis from ffill() and bfill()

e0df9ef

implement axis=1 using apply()

24f8120

pavithraes commented Apr 7, 2022

dask/dataframe/groupby.py (outdated)

pavithraes marked this pull request as ready for review April 7, 2022 14:13

ian-r-rose self-requested a review April 7, 2022 16:04

pavithraes commented Apr 7, 2022

dask/dataframe/groupby.py (outdated)

ian-r-rose reviewed Apr 7, 2022

dask/dataframe/groupby.py (outdated)

dask/dataframe/groupby.py (outdated)

dask/dataframe/groupby.py (outdated)

pavithraes added 2 commits April 8, 2022 16:53

remove transform() and use 'by' to drop cols

2314be6

add tests for multi-column groupby

16cd842

add api docs

6818434

github-actions bot added the documentation Improve or add to documentation label Apr 8, 2022

remove limit from ffill and bfill tests

cce69fe

pavithraes added 2 commits April 11, 2022 19:50

add comment to _fillna_group()

9975dec

Test upstream CI build [test-upstream]

0de215f

pavithraes mentioned this pull request Apr 13, 2022

Dask Dataframe groupby-fillna for other value types #8922

Open

pavithraes requested a review from ian-r-rose April 19, 2022 16:04

ian-r-rose reviewed Apr 21, 2022

pavithraes added 7 commits April 25, 2022 23:19

updates for handling

578c96e

updates for handling group_keys

e8121fe

Merge branch 'groupby-fillna' of github.com:pavithraes/dask into 8708…

948fcec

…-groupby-fillna

linting fixes

779c90e

test with upstream pandas [test-upstream]

0205e82

add pandas 1.5.0 check

f1aabda

trigger upstream tests [test-upstream]

d31c6f0

pavithraes requested a review from ian-r-rose April 26, 2022 12:27

Merge branch 'main' into 8708-groupby-fillna

4de46cb

ian-r-rose approved these changes Apr 28, 2022

dask/dataframe/groupby.py (outdated)

Use MethodCache instead of lambda function

bd465ed

Co-authored-by: Ian Rose <ian.r.rose@gmail.com>

ian-r-rose approved these changes Apr 29, 2022

jrbourbeau approved these changes May 10, 2022

Merge branch 'main' into 8708-groupby-fillna

c3088d4

jrbourbeau merged commit 5fbda77 into dask:main May 10, 2022

pavithraes deleted the groupby-fillna branch May 11, 2022 13:01

erayaslan pushed a commit to erayaslan/dask that referenced this pull request May 12, 2022

Implement {Series,DataFrame}GroupBy fillna methods (dask#8869)

98004fd

Co-authored-by: Ian Rose <ian.r.rose@gmail.com>
