ARROW-1895/ARROW-1897: [Python] Add field_name to pandas index metadata#1397
ARROW-1895/ARROW-1897: [Python] Add field_name to pandas index metadata#1397cpcloud wants to merge 9 commits intoapache:masterfrom
Conversation
|
@jorisvandenbossche would you mind reviewing and making sure this jives with the discussion so far? |
|
One special case that I encountered in #1386 is a DataFrame with column name So for the column, |
|
@jorisvandenbossche How is it possible in practice to get a |
|
Indeed by just constructing it and not getting it as a column from a dataframe. arrow/python/pyarrow/serialization.py Lines 194 to 195 in 1d519d8 It is also tested explicitly: https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_ipc.py#L458 |
jorisvandenbossche
left a comment
There was a problem hiding this comment.
Apart from the None column name comment, looks good to me! And resembles the discussion IMO.
Added some minor comments.
python/pyarrow/pandas_compat.py
Outdated
There was a problem hiding this comment.
maybe update the docstring here as well
python/pyarrow/pandas_compat.py
Outdated
There was a problem hiding this comment.
column names are automatically stringified, should the same be done for index columns? (so str(level.name))
(although I don't find it that important to support, as it is never roundtrippable)
There was a problem hiding this comment.
This is to preserve the None value when parsing null in JSON.
There was a problem hiding this comment.
Ah yes, of course. (So for columns it is converted to string when not None)
python/pyarrow/pandas_compat.py
Outdated
There was a problem hiding this comment.
in the comment below you already describe the logical name as the "user-facing" name ("There must be the same number of logical names (user-facing) and physical names (fields in the arrow Table)")
I would add here as well an explanation how a "logical name" should be interpreted for clarity
There was a problem hiding this comment.
Should we also assert that idx0['name'] is None, or is that already tested elsewhere adequately?
There was a problem hiding this comment.
For completeness, I would assert that for the other columns, the name and field_name are equal (this is obvious from the current code, but not explicitly tested):
col1, col2, idx0, foo = js['columns']
assert col1['name'] == col1['field_name']
assert col2['name'] == col2['field_name']
There was a problem hiding this comment.
So in principle, IMO this should be the string "None" to make field_name perfectly usable for schema lookup.
Otherwise you need to do schema.get_field_index(field_name or "None")
There was a problem hiding this comment.
Hm, this works without that, let me see what's happening.
There was a problem hiding this comment.
We're actually explicitly handling this case later on in table_to_blockmanager so this is okay.
There was a problem hiding this comment.
Yes, the current code works without (just as it did work before with a "name" of None) as it is handled by table_to_blockmanager.
But that means that you will always have to special case this option, and for me that should be the point of "field_name" that schema.get_field_index(field_name) is guaranteed to not error (and you thus don't have to special case None)
There was a problem hiding this comment.
Ok, I've implemented this. Pushing it up now.
d448fe0 to
672d011
Compare
jorisvandenbossche
left a comment
There was a problem hiding this comment.
Looks good!
Thanks for the updates, once this is merged, I will rebase my PR #1386 against this to use the new functionality.
There was a problem hiding this comment.
not that important, but doing columns= ['a', None, '__index_level_0__'] inside the DataFrame call is a bit simpler
There was a problem hiding this comment.
Yep, thank you. That is much better.
There was a problem hiding this comment.
Tests are failing on this one. Maybe this is only the case for unicode and not for bytes, and thus only for not PY2 ?
|
I can merge this tomorrow once the build is passing, will also take a brief look through |
wesm
left a comment
There was a problem hiding this comment.
+1, let me look into the test failure
Change-Id: I8f90992bdda8aa0852e0a96d5078c3fe6df61352
|
I'm going to be AFK for about 5 or 6 hours, @cpcloud or @xhochy please go and ahead and merge this once the Python builds run in Travis CI so @jorisvandenbossche can rebase. I'll be back on later today to work on some other patches for 0.8.0 |
No description provided.