Do not allow iterating a DataFrameGroupBy by bryanwweber · Pull Request #8696 · dask/dask

bryanwweber · 2022-02-09T16:12:09Z

I ran into this issue yesterday (similar to #8695, a naive Pandas-aware mistake) and the fix seemed simple enough.

Closes #5124 (Dask 2.1.0, KeyError: 'Column not found: 0' ?)
Tests added / passed
Passes pre-commit run --all-files

Closes #5124.
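A minimal sketch of why iterating such an object can surface a confusing KeyError, and how an explicit guard changes that. This is not the code from this PR: `LegacyGroupBy` and `GuardedGroupBy` are hypothetical stand-ins, and the guard message is abbreviated. It relies only on Python's fallback iteration protocol, which calls `__getitem__(0)`, `__getitem__(1)`, ... when a class defines `__getitem__` but no `__iter__`.

```python
class LegacyGroupBy:
    """Hypothetical stand-in: column lookup via __getitem__, no __iter__."""

    def __init__(self, columns):
        self.columns = list(columns)

    def __getitem__(self, key):
        # Mimics a column lookup that fails for unknown keys.
        if key not in self.columns:
            raise KeyError(f"Column not found: {key}")
        return key


class GuardedGroupBy(LegacyGroupBy):
    """Same stand-in, but iteration is rejected explicitly."""

    def __iter__(self):
        # Assumed wording; the real message in the PR is longer.
        raise NotImplementedError(
            "Iteration of a group-by object is not supported. "
            "To access individual groups, use 'get_group'."
        )


def try_iterate(obj):
    """Return (exception name, message) raised by iterating obj, or None."""
    try:
        for _ in obj:
            pass
    except Exception as exc:
        return type(exc).__name__, str(exc)
    return None


# Without __iter__, Python falls back to __getitem__(0), which is
# interpreted as a lookup of a column named 0.
legacy_error = try_iterate(LegacyGroupBy(["name", "x"]))

# With the guard, the failure names the actual problem.
guarded_error = try_iterate(GuardedGroupBy(["name", "x"]))
```

In this toy version, `legacy_error` carries the same "Column not found: 0" text as the title of #5124, while `guarded_error` is a NotImplementedError that points the user at the supported alternatives.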

phobson · 2022-02-09T16:15:26Z

This is definitely an improvement

phobson · 2022-02-09T22:38:34Z

To expand on my previous comment, I think this should be merged now, and if additional work is needed to get this into the desired state, I'd put that in a separate PR.

ian-r-rose

Thanks! I'm also trying to think of ways to better communicate that you can't directly compute() a DataFrameGroupBy object. It has a lot of things in common with DataFrame, and can produce Dask DataFrames, but it's not a true collection in the sense of

import dask

ddf = dask.datasets.timeseries()
dask.is_dask_collection(ddf) # True
dask.is_dask_collection(ddf.groupby("name")) # False

ian-r-rose · 2022-02-09T22:38:07Z

dask/dataframe/groupby.py

+            "may be slow. You probably want to use 'apply' to execute a function for "
+            "all the columns. To access individual groups, use 'get_group'. To list "
+            "all the group names, use 'df[<group column>].unique().compute()'."


Suggested change

-            "may be slow. You probably want to use 'apply' to execute a function for "
-            "all the columns. To access individual groups, use 'get_group'. To list "
-            "all the group names, use 'df[<group column>].unique().compute()'."
+            "may be slow. You may want to use 'apply' or 'transform' to execute a function for "
+            "all the groups. To access individual groups, use 'get_group'. To list "
+            "all the group names, use 'df.groupby(<group-columns-or-index>).size().compute()'."

Suggesting a different way to get the group names which should also work for multiple columns or the index.

Thanks @ian-r-rose! The only quibble I have is with the last change from .unique() to .size(). I think the former would give you the values of the groups to put into 'get_group()', whereas the latter just tells you how many rows are in the dataset, right?

bryanwweber · 2022-02-09T23:00:11Z

I'm also trying to think of ways to better communicate that you can't directly compute() a DataFrameGroupBy object

I opened #8695 to discuss that, since it's a little bit different than this case ☺️

jcrist

Thanks y'all!

Do not allow iterating a DataFrameGroupBy

d37e59f

Closes #5124.

github-actions bot added the dataframe label Feb 9, 2022

phobson approved these changes Feb 9, 2022

ian-r-rose reviewed Feb 9, 2022

jcrist approved these changes Feb 15, 2022

jcrist merged commit de88ce9 into dask:main Feb 15, 2022

jcrist merged 1 commit into dask:main from bryanwweber:groupby-iter-notimplemented
