ARROW-1754: [Python] alternative fix for duplicate index/column name that preserves index name if available#1408
Conversation
|
Can this be closed? |
|
My opinion is to merge this, but I had the feeling nobody else was feeling strongly in favor of it. See the top-level post for my reasoning. |
|
OK, if @cpcloud could take a look at this and advise (since he worked on this code most recently) I'm fine with merging |
|
Looking now. |
python/pyarrow/pandas_compat.py
Outdated
There was a problem hiding this comment.
Should we be concerned about the linear search for index.name not in column_names? If so, let's create a set outside the loop below that we can check so that we don't need to do a full scan of the column names for every index column.
There was a problem hiding this comment.
I did some timings, and conversion to a set typically takes twice the time of a single search in the list. So you already need to have 3 index levels to benefit from this, and I don't think this is the typical use case?
So I would personally leave it as is, but can certainly also easily add the suggestion.
|
LGTM other than the comment. Should be rebased to run tests against current master. |
|
Rebased |
54c5bf5 to
b7e0560
Compare
|
LGTM |
|
Regarding the PR backlog, given the comments above I think there was agreement to merge this. |
|
@jorisvandenbossche Yep that's a good idea, I can merge on green. |
|
Seems like this could be a stale merge -- doesn't look like it got the ARROW-2062 patch |
|
I see the ARROW-2062 commit in the history of this branch: https://github.com/jorisvandenbossche/arrow/commits/index-names (I fetched upstream master just before I merged / pushed) But, it is failing on travis (amongst others, a timeout for the first (gcc) build), is that the reason you were thinking this is not up to date? |
…name if available Change-Id: I68ca058b7d038a9f30d265aeaad192d0f86757cc
|
It looked a lot like the failure that was happening before ARROW-2062, I triggered a new build to see if it's transient |
|
Hmm, still timing out on the first one (but the other failures seems resolved) |
|
No problem, I'm merging this, thanks @jorisvandenbossche! |
|
Thanks for merging! |
Related to the discussion about the pandas metadata specification in pandas-dev/pandas#18201, and an alternative to #1271.
I don't open this PR because it should necessarily be merged, I just want to show that it is not that difficult to both fix ARROW-1754 and preserve index names as field names when possible (as this was mentioned in pandas-dev/pandas#18201 as the reason to make this change to not preserve index names).
The diff is partly a revert of #1271, but then adapted to the current codebase.
Main reasons I prefer to preserve index names: 1) usability in pyarrow itself (if you would want to work with pyarrow Tables created from pandas) and 2) when interchanging parquet files with other people / other non-pandas systems, then it would be much nicer to not have
__index_level_n__column names if possible.