Skip to content

ARROW-1895/ARROW-1897: [Python] Add field_name to pandas index metadata#1397

Closed
cpcloud wants to merge 9 commits intoapache:masterfrom
cpcloud:ARROW-1895
Closed

ARROW-1895/ARROW-1897: [Python] Add field_name to pandas index metadata#1397
cpcloud wants to merge 9 commits intoapache:masterfrom
cpcloud:ARROW-1895

Conversation

@cpcloud
Copy link
Contributor

@cpcloud cpcloud commented Dec 6, 2017

No description provided.

@wesm
Copy link
Member

wesm commented Dec 6, 2017

@jorisvandenbossche would you mind reviewing and making sure this jives with the discussion so far?

@jorisvandenbossche
Copy link
Member

One special case that I encountered in #1386 is a DataFrame with column name None (from ipc when serializing a Series without name).
This case is not yet handled here:

In [6]: pa.Table.from_pandas(pd.DataFrame({None: [1,2,3]}))
Out[6]: 
pyarrow.Table
None: int64
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "pandas_type": "mixed", "numpy_type": "object", "meta'
            b'data": null}], "columns": [{"name": null, "field_name": null, "p'
            b'andas_type": "int64", "numpy_type": "int64", "metadata": null}, '
            b'{"name": null, "field_name": "__index_level_0__", "pandas_type":'
            b' "int64", "numpy_type": "int64", "metadata": null}], "pandas_ver'
            b'sion": "0.22.0.dev0+260.g5da3759"}'}

So for the column, "name": null, "field_name": null, are both null, while field_name should be "None"

@cpcloud
Copy link
Contributor Author

cpcloud commented Dec 6, 2017

@jorisvandenbossche How is it possible in practice to get a Series with a name of None? How can one exist without explicitly constructing it or assigning its name to be None?

@jorisvandenbossche
Copy link
Member

Indeed by just constructing it and not getting it as a column from a dataframe.
Such a dataframe as I show above is constructed in the ipc code to serialize Series objects:

def _serialize_pandas_series(obj):
return _serialize_pandas_dataframe(pd.DataFrame({obj.name: obj}))

It is also tested explicitly: https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_ipc.py#L458

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the None column name comment, looks good to me! And resembles the discussion IMO.

Added some minor comments.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe update the docstring here as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, will do

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

column names are automatically stringified, should the same be done for index columns? (so str(level.name))
(although I don't find it that important to support, as it is never roundtrippable)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to preserve the None value when parsing null in JSON.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, of course. (So for columns it is converted to string when not None)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the comment below you already describe the logical name as the "user-facing" name ("There must be the same number of logical names (user-facing) and physical names (fields in the arrow Table)")
I would add here as well an explanation how a "logical name" should be interpreted for clarity

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also assert that idx0['name'] is None, or is that already tested elsewhere adequately?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add that

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For completeness, I would assert that for the other columns, the name and field_name are equal (this is obvious from the current code, but not explicitly tested):

col1, col2, idx0, foo = js['columns']

assert col1['name'] == col1['field_name']
assert col2['name'] == col2['field_name']

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, will do

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in principle, IMO this should be the string "None" to make field_name perfectly usable for schema lookup.
Otherwise you need to do schema.get_field_index(field_name or "None")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, this works without that, let me see what's happening.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're actually explicitly handling this case later on in table_to_blockmanager so this is okay.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the current code works without (just as it did work before with a "name" of None) as it is handled by table_to_blockmanager.
But that means that you will always have to special case this option, and for me that should be the point of "field_name" that schema.get_field_index(field_name) is guaranteed to not error (and you thus don't have to special case None)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I've implemented this. Pushing it up now.

@cpcloud cpcloud force-pushed the ARROW-1895 branch 3 times, most recently from d448fe0 to 672d011 Compare December 7, 2017 14:40
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Thanks for the updates, once this is merged, I will rebase my PR #1386 against this to use the new functionality.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not that important, but doing columns= ['a', None, '__index_level_0__'] inside the DataFrame call is a bit simpler

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, thank you. That is much better.

@cpcloud cpcloud changed the title ARROW-1895: [Python] Add field_name to pandas index metadata ARROW-1895/ARROW-1897: [Python] Add field_name to pandas index metadata Dec 7, 2017
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests are failing on this one. Maybe this is only the case for unicode and not for bytes, and thus only for not PY2 ?

@wesm
Copy link
Member

wesm commented Dec 10, 2017

I can merge this tomorrow once the build is passing, will also take a brief look through

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, let me look into the test failure

Change-Id: I8f90992bdda8aa0852e0a96d5078c3fe6df61352
@wesm
Copy link
Member

wesm commented Dec 10, 2017

I'm going to be AFK for about 5 or 6 hours, @cpcloud or @xhochy please go and ahead and merge this once the Python builds run in Travis CI so @jorisvandenbossche can rebase. I'll be back on later today to work on some other patches for 0.8.0

Copy link
Member

@xhochy xhochy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM

@xhochy xhochy closed this in 501d60e Dec 10, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants