Skip to content

Do not allow iterating a DataFrameGroupBy#8696

Merged
jcrist merged 1 commit intodask:mainfrom
bryanwweber:groupby-iter-notimplemented
Feb 15, 2022
Merged

Do not allow iterating a DataFrameGroupBy#8696
jcrist merged 1 commit intodask:mainfrom
bryanwweber:groupby-iter-notimplemented

Conversation

@bryanwweber
Copy link
Copy Markdown
Contributor

I ran into this issue yesterday (similar to #8695, a naive Pandas-aware mistake) and the fix seemed simple enough.

@phobson
Copy link
Copy Markdown
Contributor

phobson commented Feb 9, 2022

This is definitely an improvement

@phobson
Copy link
Copy Markdown
Contributor

phobson commented Feb 9, 2022

To expand on my previous comment, I think this is should be merged now and if additional work is needed to get this in a desired state, I'd put that in a separate PR.

Copy link
Copy Markdown
Collaborator

@ian-r-rose ian-r-rose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I'm also trying to think of ways to better communicate that you can't directly compute() a DataFrameGroupby object. It has a lot of things in common with DataFrame, and can produce Dask DataFrames, but it's not a true collection in the sense of

import dask

ddf = dask.datasets.timeseries()
dask.is_dask_collection(ddf) # True
dask.is_dask_collection(ddf.groupby("name")) # False

Comment on lines +1130 to +1132
"may be slow. You probably want to use 'apply' to execute a function for "
"all the columns. To access individual groups, use 'get_group'. To list "
"all the group names, use 'df[<group column>].unique().compute()'."
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"may be slow. You probably want to use 'apply' to execute a function for "
"all the columns. To access individual groups, use 'get_group'. To list "
"all the group names, use 'df[<group column>].unique().compute()'."
"may be slow. You may want to use 'apply' or 'transform' to execute a function for "
"all the groups. To access individual groups, use 'get_group'. To list "
"all the group names, use 'df.groupby(<group-columns-or-index>).size().compute()'."

Suggesting a different way to get the group names which should also work for multiple columns or the index.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ian-r-rose! The only quibble I have is with the last change from .unique() to .size(). I think the former would give you the values of the groups to put into `get_group()' whereas the latter just tells you how many rows are in the dataset right?

@bryanwweber
Copy link
Copy Markdown
Contributor Author

I'm also trying to think of ways to better communicate that you can't directly compute() a DataFrameGroupby object

I opened #8695 to discuss that, since it's a little bit different than this case ☺️

Copy link
Copy Markdown
Member

@jcrist jcrist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks y'all!

@jcrist jcrist merged commit de88ce9 into dask:main Feb 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Dask 2.1.0, KeyError: 'Column not found: 0' ?

4 participants