ARROW-1754: [Python] alternative fix for duplicate index/column name that preserves index name if available by jorisvandenbossche · Pull Request #1408 · apache/arrow

jorisvandenbossche · 2017-12-10T21:58:53Z

Related to the discussion about the pandas metadata specification in pandas-dev/pandas#18201, and an alternative to #1271.

I am not opening this PR because it necessarily should be merged; I just want to show that it is not that difficult to both fix ARROW-1754 and preserve index names as field names when possible (this was cited in pandas-dev/pandas#18201 as the reason for the change to stop preserving index names).
The diff is partly a revert of #1271, adapted to the current codebase.

Main reasons I prefer to preserve index names: 1) usability in pyarrow itself (if you want to work with pyarrow Tables created from pandas), and 2) when exchanging parquet files with other people or other non-pandas systems, it would be much nicer not to have __index_level_n__ column names where they can be avoided.
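
As a rough illustration of the naming rule argued for here (a sketch, not the actual diff; the helper index_field_names and its arguments are hypothetical names), an index level keeps its own name unless it is unnamed or clashes with a column name, in which case it falls back to the __index_level_n__ pattern:

```python
def index_field_names(column_names, index_names):
    """Pick a field name for each index level (hypothetical helper).

    Assumption: a level keeps its name when it has one and the name
    does not collide with a regular column; otherwise it gets a
    positional placeholder of the form __index_level_n__.
    """
    result = []
    for i, name in enumerate(index_names):
        if name is not None and name not in column_names:
            result.append(name)
        else:
            result.append("__index_level_{}__".format(i))
    return result


# "date" is kept; the unnamed level and the level that duplicates
# column "a" both fall back to the positional placeholder.
print(index_field_names(["a", "b"], ["date", None, "a"]))
# → ['date', '__index_level_1__', '__index_level_2__']
```

With this rule, only unnamed or conflicting index levels would show up under placeholder names in a Table or parquet file.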

wesm · 2018-01-17T17:24:43Z

Can this be closed?

jorisvandenbossche · 2018-01-17T17:30:42Z

My opinion is to merge this, but I had the feeling nobody else was feeling strongly in favor of it. See the top-level post for my reasoning.

wesm · 2018-01-17T19:51:48Z

OK, if @cpcloud could take a look at this and advise (since he worked on this code most recently) I'm fine with merging

cpcloud · 2018-01-17T23:58:12Z

Looking now.

cpcloud · 2018-01-18T00:03:49Z

python/pyarrow/pandas_compat.py

Should we be concerned about the linear search for index.name not in column_names? If so, let's create a set outside the loop below and check against that, so that we don't need a full scan of the column names for every index column.

jorisvandenbossche · 2018-01-24

I did some timings, and conversion to a set typically takes about twice as long as a single search in the list. So you need at least 3 index levels to benefit from this, and I don't think that is the typical use case.
I would personally leave it as is, but I can also easily add the suggestion.

cpcloud · 2018-01-24

Fine by me.
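
The trade-off discussed in this review thread can be checked with a small timing sketch. The two helpers below are hypothetical stand-ins for the loop in pandas_compat.py, not the code under review: one scans the list of column names for every index level, the other builds a set once outside the loop. Absolute timings depend on the machine and on the number of columns and index levels, so no expected numbers are given.

```python
import timeit


def names_list_scan(column_names, index_names):
    # Linear search in the list for every index level.
    return [
        name if name is not None and name not in column_names
        else "__index_level_{}__".format(i)
        for i, name in enumerate(index_names)
    ]


def names_set_lookup(column_names, index_names):
    # One-off conversion to a set, then constant-time membership checks.
    seen = set(column_names)
    return [
        name if name is not None and name not in seen
        else "__index_level_{}__".format(i)
        for i, name in enumerate(index_names)
    ]


# Illustrative data only: 100 columns and 3 index levels, one of which
# collides with a column name.
columns = ["col{}".format(i) for i in range(100)]
levels = ["lvl0", "lvl1", "col5"]

# Both variants must agree on the resulting names.
assert names_list_scan(columns, levels) == names_set_lookup(columns, levels)

for fn in (names_list_scan, names_set_lookup):
    elapsed = timeit.timeit(lambda: fn(columns, levels), number=10000)
    print("{}: {:.4f}s for 10000 calls".format(fn.__name__, elapsed))
```

Varying the length of levels shows where the up-front cost of building the set starts to pay off.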

cpcloud · 2018-01-18T00:05:23Z

LGTM other than the comment. Should be rebased to run tests against current master.

wesm · 2018-01-18T00:24:28Z

Rebased

cpcloud · 2018-01-24T22:17:47Z

LGTM

jorisvandenbossche · 2018-02-01T21:15:04Z

Regarding the PR backlog, given the comments above I think there was agreement to merge this.
There are no merge conflicts yet, but should I update with master to ensure tests are still passing?

cpcloud · 2018-02-01T21:17:11Z

@jorisvandenbossche Yep that's a good idea, I can merge on green.

wesm · 2018-02-01T22:19:23Z

Seems like this could be a stale merge -- doesn't look like it got the ARROW-2062 patch

jorisvandenbossche · 2018-02-01T22:25:06Z

I see the ARROW-2062 commit in the history of this branch: https://github.com/jorisvandenbossche/arrow/commits/index-names (I fetched upstream master just before I merged / pushed)

However, it is failing on Travis (among other things, a timeout on the first (gcc) build). Is that why you thought this was not up to date?

wesm · 2018-02-02T04:43:52Z

It looked a lot like the failure that was happening before ARROW-2062; I triggered a new build to see if it's transient.

jorisvandenbossche · 2018-02-02T08:13:52Z

Hmm, still timing out on the first one (but the other failures seem resolved).

wesm · 2018-02-02T17:25:11Z

No problem, I'm merging this, thanks @jorisvandenbossche!

jorisvandenbossche · 2018-02-02T19:50:46Z

Thanks for merging!

jorisvandenbossche mentioned this pull request Dec 10, 2017

DOC: Update parquet metadata format description around index levels pandas-dev/pandas#18201

Merged

cpcloud reviewed Jan 18, 2018

wesm force-pushed the index-names branch from f8e5b79 to b7e0560 Compare January 18, 2018 00:24

jorisvandenbossche force-pushed the index-names branch from 54c5bf5 to b7e0560 Compare January 24, 2018 21:58

cpcloud approved these changes Jan 24, 2018

alternative fix for duplicate index/column name that preserves index …

eef1d33

…name if available Change-Id: I68ca058b7d038a9f30d265aeaad192d0f86757cc

wesm force-pushed the index-names branch from 80f0492 to eef1d33 Compare February 2, 2018 04:43

wesm closed this in 5042863 Feb 2, 2018

jorisvandenbossche deleted the index-names branch February 2, 2018 19:50
