API: User-control of result keys in GroupBy.apply#34998
API: User-control of result keys in GroupBy.apply#34998jreback merged 110 commits intopandas-dev:mainfrom
Conversation
|
Is it possible to leverage the existing as_index keyword instead of introducing a new one? I've never been quite clear on what the former should do, so maybe it should cover at least part of this |
jreback
left a comment
There was a problem hiding this comment.
this is exactly what group_keys keys does no? https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html?highlight=groupby#pandas.DataFrame.groupby
|
adding more keywords is a huge -1 here. Ok if you deprecate and remove one or more, but adding more to an already really complex operation is sure to cause way more bugs. |
|
The documentation of In [10]: def f(x):
...: return x.copy() # same index
In [11]: def g(x):
...: return x.copy().rename(lambda x: x + 1) # different index
In [12]: df = pd.DataFrame({"A": ['a', 'b'], "B": [1, 2]})
In [4]: df.groupby("A", group_keys=True).apply(f)
Out[4]:
A B
0 a 1
1 b 2
In [5]: df.groupby("A", group_keys=True).apply(g)
Out[5]:
A B
A
a 1 a 1
b 2 b 2
In [6]: df.groupby("A", group_keys=False).apply(f)
Out[6]:
A B
0 a 1
1 b 2
In [7]: df.groupby("A", group_keys=False).apply(g)
Out[7]:
A B
1 a 1
2 b 2 |
|
In [8]: df = DataFrame({"key": [1, 1, 1, 2, 2, 2, 3, 3, 3], "value": range(9)})
...: df.groupby("key", group_keys=True).apply(lambda x: x[:2])
...:
...:
...:
Out[8]:
key value
key
1 0 1 0
1 1 1
2 3 2 3
4 2 4
3 6 3 6
7 3 7
In [9]: df = DataFrame({"key": [1, 1, 1, 2, 2, 2, 3, 3, 3], "value": range(9)})
...: df.groupby("key", group_keys=False).apply(lambda x: x[:2])
...:
...:
...:
...:
Out[9]:
key value
0 1 0
1 1 1
3 2 3
4 2 4
6 3 6
7 3 7So it seems the intent is similar to what we want. However I don't know about the defaults. |
|
might be ok to add result_type keyword similar to what we use in apply but must depreciate group_keys then |
|
Well, 50 failures in |
|
It does seem like the intent of
In [2]: df = pd.DataFrame({"A": [0, 0, 1], "B": [1, 2, 3]})
In [3]: gb = df.groupby("A", group_keys=False)
In [4]: gb.group_keys
Out[4]: False
In [5]: gb[['B']].group_keys
Out[5]: True
In [6]: gb['B'].group_keys
Out[6]: TrueI'll push a commit later with some updates to always honor group keys, but my initial reaction is that this is too large of a change. I'm sure there are people relying on the buggy behavior, so we'll need to do this with a deprecation. |
|
OK, tests should hopefully pass, but this isn't mergeable yet because of the behavior change. Namely honoring # 1.0.4
In [2]: df = pd.DataFrame({"A": [0, 0, 1], "B": [1, 2, 3]})
In [3]: df.groupby("A").apply(lambda x: x)
Out[3]:
A B
0 0 1
1 0 2
2 1 3# PR
In [3]: df.groupby("A").apply(lambda x: x)
Out[3]:
A B
A
0 0 0 1
1 0 2
1 2 1 3My next commit will have compatibility layer. We'll change the default Note that this doesn't help people who were already doing |
|
Here's the warning In [1]: import pandas as pd
In [2]: df = pd.DataFrame({"A": [0, 0, 1], "B": [1, 2, 3]})
In [3]: df.groupby("A").apply(lambda x: x)
/Users/taugspurger/.virtualenvs/pandas-dev/bin/ipython:1: FutureWarning: Not prepending group keys to the result index of transform-like apply. In the future, the group keys will be included in the index, regardless of whether the applied function returns a like-indexed object.
To preserve the previous behavior, use
>>> .groupby(..., group_keys=False)
To adopt the future behavior and silence this warning, use
>>> .groupby(..., group_keys=True)
#!/Users/taugspurger/Envs/pandas-dev/bin/python
Out[3]:
A B
0 0 1
1 0 2
2 1 3 |
|
On board with the direction here; I was initially thinking through if the default should be True or False going forward but I think the way you have it is nice if can get green |
|
will look again soon |
…9-result-type � Conflicts: � pandas/core/groupby/groupby.py
…pandas into 34809-result-type
|
@jreback ping |
…9-result-type � Conflicts: � pandas/core/groupby/groupby.py
…pandas into 34809-result-type
|
@jreback - ping |
|
thanks @rhshadrach and @TomAugspurger nice to have the consistency! and of course let the bug reports roll in...... |
|
Really glad to see this in. Great work @rhshadrach! |
|
@TomAugspurger this is causing warnings in test_multiindex_group_all_columns_when_empty, can you update the affected tests to avoid/catch these warnings? |
|
Thanks for finding these @jbrockmendel - this will be addressed in #48159 |
- [x] This PR adds support for `group_keys` in `groupby`. Starting pandas 1.5.0, issues around `group_keys` have been resolved: pandas-dev/pandas#34998 pandas-dev/pandas#47185 - [x] This PR defaults `group_keys` to `False` which is the same as what pandas is going to be defaulting to in the future version. - [x] Required to unblock `pandas-1.5.0` upgrade in cudf: #11617 Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Bradley Dice (https://github.com/bdice) - Ashwin Srinath (https://github.com/shwina) URL: #11659
|
@TomAugspurger want to make a PR enforcing this deprecation? |
|
@TomAugspurger - more than happy if you want to take this, otherwise I can take it. |
|
Thanks Richard. It’s all yours :)
…On Oct 28, 2022 at 2:04:22 PM, Richard Shadrach ***@***.***> wrote:
@TomAugspurger <https://github.com/TomAugspurger> - more than happy if
you want to take this, otherwise I can take it.
—
Reply to this email directly, view it on GitHub
<#34998 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAKAOIUKBYS3GFUR6RWSG73WFQPTNANCNFSM4OIX2YCA>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Currently we determine whether to include the group keys as a level of the result's MultiIndex by looking at whether the result's index matches the original DataFrame. This provides a keyword to control that behavior so that users can consistently get the group keys or not, regardless of whether the udf happens to be a transform or not. The default behavior remains the same.
I called the keyword
result_group_keys, but would welcome alternatives.Closes #34809
Closes #31612
Closes #14927
Closes #13056
Closes #27212
Closes #9704
cc @WillAyd and @jorisvandenbossche from the issue.