Add attrs property to Series/Dataframe #6742

Merged
TomAugspurger merged 15 commits into dask:master from Illviljan:Illviljan-daskattrs
Oct 19, 2020

Contributor

@Illviljan Illviljan commented Oct 17, 2020

Pandas has a property called attrs that supports attaching arbitrary metadata, such as physical units, to DataFrames and persisting it across operations. I've added support for that property in the dask Series/Dataframe as well.
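For readers who haven't used attrs yet, here is a minimal pandas-only sketch of the property described above. The column names and metadata values are made up for illustration, and only `DataFrame.copy()` is shown as an operation that carries attrs along; other operations may behave differently depending on the pandas version.

```python
import pandas as pd

# Illustrative data; the keys stored in attrs are arbitrary user metadata.
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df.attrs = {"date": "2020-10-16"}
df.attrs["unit"] = "kg"

# copy() propagates attrs to the new object.
copied = df.copy()
print(copied.attrs)
```

attrs is a plain dict attached to the object, so anything that builds a new DataFrame or Series has to copy it over explicitly for the metadata to survive.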

attrs doesn't work well with dask Series yet, because the pandas iloc method doesn't currently preserve the attrs dict. Dask uses df.iloc when creating the _meta dataframe in make_meta_pandas(x, index=None), so the attrs of every Series are lost.

I've added some simple tests that pass for DataFrames but fail for Series. They should pass once iloc preserves the attrs dict, but I find this important enough that I'd like the failing tests to stay as a reminder.

This fails currently when testing series because df.iloc[:0], which is used in make_meta_pandas(x, index=None), does not keep the attrs.
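A rough sketch of the slicing step mentioned above, using only pandas. It does not reproduce make_meta_pandas itself, only the df.iloc[:0] call named in the description. Whether attrs survive the slice depends on the installed pandas version, so the snippet reports the result instead of assuming one.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3.0, 4.0]})
df.attrs = {"date": "2020-10-16"}

# An empty slice keeps the columns and dtypes but drops all rows,
# which is what makes it usable as lightweight metadata.
meta = df.iloc[:0]
print(len(meta))
print(list(meta.columns))
print(meta.dtypes.to_dict())

# Version dependent: check rather than assume that attrs were kept.
attrs_survived = meta.attrs == df.attrs
print("attrs survived iloc[:0]:", attrs_survived)
```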
Member

@TomAugspurger TomAugspurger left a comment

Thanks for working on this. A few comments.

def test_attrs():
    df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})
    df.attrs = {"date": "2020-10-16"}
    df.A.attrs["unit"] = "kg"
Member

I wouldn't recommend setting the attrs on a Series like this. It's not clear to me that it's a case supported by pandas (indexing into a DataFrame and setting on a Series).

Contributor Author

I don't think it would be consistent with the rest of the DataFrame methods if it wasn't supported.
Why should df.A[0] = 10 or df.A.values[0] = 100 or df.A.name = "A_new" be allowed but not df.A.attrs["unit"] = "kg"?

What way were you thinking? I'm not even sure how to do this in any other way to be honest. I've never initialized series separately and then appended them to a dataframe.

Member

There's some discussion at pandas-dev/pandas#35425, but let's not focus on it here.

For this PR we just need two tests. One on a dd.from_pandas(dataframe_with_attrs) and a second test with dd.from_pandas(series_with_attrs).

@TomAugspurger
Member

I think attrs is new in pandas 1.0, so will have to be skipped for versions older than that.

@Illviljan
Contributor Author

@TomAugspurger The tests are now skipped on older versions, but I suppose I have to increase the minimum pandas version as well?

Member

@TomAugspurger TomAugspurger left a comment

When we extract the attrs, we should do so conditionally on the pandas version.
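One way the version-conditional extraction could look, as a sketch only. The flag name `PANDAS_HAS_ATTRS` and the helper `extract_attrs` are hypothetical names chosen for illustration, not the code that was merged; the 1.0 threshold comes from the comment above that attrs is new in pandas 1.0.

```python
import pandas as pd

# Hypothetical flag: compare the major/minor version against 1.0,
# the release where attrs is said to have been introduced.
PANDAS_HAS_ATTRS = tuple(int(part) for part in pd.__version__.split(".")[:2]) >= (1, 0)


def extract_attrs(obj):
    """Hypothetical helper: return a copy of obj.attrs, or {} on old pandas."""
    if PANDAS_HAS_ATTRS:
        return dict(obj.attrs)
    return {}


s = pd.Series([1, 2], name="A")
if PANDAS_HAS_ATTRS:
    s.attrs["unit"] = "kg"
print(extract_attrs(s))
```

Returning a copy keeps callers from mutating the original object's metadata by accident; on pandas older than 1.0 the helper degrades to an empty dict instead of raising AttributeError.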

Illviljan and others added 4 commits October 19, 2020 21:44
Co-authored-by: Tom Augspurger <TomAugspurger@users.noreply.github.com>
Co-authored-by: Tom Augspurger <TomAugspurger@users.noreply.github.com>
Co-authored-by: Tom Augspurger <TomAugspurger@users.noreply.github.com>
@TomAugspurger
Member

Looks good, thanks!

@TomAugspurger TomAugspurger merged commit ba576a2 into dask:master Oct 19, 2020
kumarprabhu1988 pushed a commit to kumarprabhu1988/dask that referenced this pull request Oct 29, 2020

Successfully merging this pull request may close these issues.

Implement Series/DataFrame.attrs
