
[REVIEW] Preserve index when writing partitioned parquet datasets with pyarrow#6282

Merged
TomAugspurger merged 14 commits into dask:master from rjzamora:fix-partitioned-index
Jun 24, 2020
Conversation

@rjzamora
Member

@rjzamora rjzamora commented Jun 3, 2020

May address #6277

The _write_partitioned logic does not currently preserve the index. This PR fixes that bug, but it still does not preserve an index name of None (the name is converted to "index", because dask.dataframe.to_parquet immediately resets the index and parquet assigns it this name).

Note that the _write_partitioned logic was mostly copied from write_to_dataset in pyarrow, where the same bug was fixed in pyarrow#7054. Since those fixes are in, and ARROW-8244 has been addressed, we may be able to use write_to_dataset for pyarrow>=1.0. With that said, I got a test failure (test_to_parquet_pyarrow_w_inconsistent_schema_by_partition_succeeds_w_manual_schema) when making the same changes as pyarrow#7054, so this solution is slightly different. More specifically, I ran into issues when removing the ignore_metadata=True option for pyarrow-to-pandas conversion, so I am manually resetting the index when necessary.
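The index-dropping behavior and the fix can be illustrated with a minimal pandas-only sketch (column and index names here are hypothetical; this is not the actual _write_partitioned code):

```python
import pandas as pd

# Illustrative stand-in for a partitioned write: split a frame by a
# partition column and keep each piece separately.
df = pd.DataFrame(
    {"part": ["a", "a", "b"], "x": [1, 2, 3]},
    index=pd.Index([10, 20, 30], name="myindex"),
)

# Buggy path: keeping only the data columns per partition silently
# drops "myindex".
buggy = {k: sub[["x"]].reset_index(drop=True) for k, sub in df.groupby("part")}
assert "myindex" not in buggy["a"].columns

# Fixed path: reset the index into the frame first, so it is written
# alongside the other columns and can be restored on read.
fixed = {
    k: sub.reset_index().drop(columns="part") for k, sub in df.groupby("part")
}
roundtripped = pd.concat(fixed.values()).set_index("myindex")
assert list(roundtripped.index) == [10, 20, 30]
```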

Questions/TODO:

  • Can we preserve an index name of None? This question is the primary reason for the "WIP" status of this PR.
  • Can we use the write_to_dataset for pyarrow >=1.0? Since this solution is different from pyarrow#7054, this may be tricky.
  • Tests added / passed
  • Passes black dask / flake8 dask

rjzamora and others added 2 commits June 16, 2020 16:53
Co-authored-by: Tom Augspurger <TomAugspurger@users.noreply.github.com>
@TomAugspurger
Member

Thanks, this is looking nice. Are you wanting to investigate the index.name = None case here too?

@rjzamora
Member Author

Are you wanting to investigate the index.name = None case here too?

I'm still looking into this; it is proving to be pretty tricky.

@rjzamora
Member Author

@TomAugspurger - I spent more time than I expected investigating the best way to preserve an index labelled None. This PR implements the null-preservation feature by reserving the column name "__null_dask_index__" for any index whose original name is None. That is, rather than letting pandas change None to "index" during the write, I am using a more obscure name, and assuming that any index with that name should become None during a read.

We should be able to do this in a "cleaner" way by working with the pandas metadata only. However, the fact that we are starting the to_parquet logic by resetting the index (along with the need to support multiple engines, partitioning, and appending) turns this into a bit of a headache. Note that I do have a rough/messy version working for pyarrow only with this approach, but I prefer the "__null_dask_index__"-based approach for now.
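The round trip described above can be sketched in pandas alone (an illustrative sketch of the idea, not the PR's implementation):

```python
import pandas as pd

NONE_LABEL = "__null_dask_index__"  # the reserved name described above

df = pd.DataFrame({"x": [1, 2, 3]})
assert df.index.name is None

# Write side: label the unnamed index with the reserved name before
# resetting it, so the written column is unambiguous.
out = df.rename_axis(NONE_LABEL).reset_index()
assert NONE_LABEL in out.columns

# Read side: a column carrying the reserved name becomes the index
# again, and its name is mapped back to None.
restored = out.set_index(NONE_LABEL)
restored.index.name = None

assert restored.index.name is None
assert list(restored["x"]) == [1, 2, 3]
```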

@rjzamora
Member Author

The most recent commits allow us to handle a None index from pandas/pyarrow. I needed to modify the read_metadata API to make this happen.

Member

@TomAugspurger left a comment


cc @martindurant if you have time to take a look.

@rjzamora are there any changes we could make to the pandas metadata that might have made this whole process easier, or perhaps avoided the need for the special NONE_LABEL?

@rjzamora
Member Author

@rjzamora are there any changes we could make to the pandas metadata that might have made this whole process easier, or perhaps avoided the need for the special NONE_LABEL?

I'd like to think about this a bit more, but I get the feeling that the challenge with the label of None is all on the dask/fastparquet side.

@rjzamora rjzamora changed the title [WIP] Preserve index when writing partitioned parquet datasets with pyarrow [REVIEW] Preserve index when writing partitioned parquet datasets with pyarrow Jun 23, 2020
@rjzamora
Member Author

@TomAugspurger - I may be able to work out a way to avoid the special NONE_LABEL, but it seems tricky enough for fastparquet that I'd prefer to prioritize my read_parquet-dev time on other performance-related improvements. Do you get the feeling that the current state of this PR is a reasonable stopping point, or is the NONE_LABEL change too much of a "sideways" move?

@TomAugspurger
Member

Agreed with your assessment of priorities. Let me skim through here one more time, but I think we're good.

@TomAugspurger
Member

Thanks @rjzamora!

@rjzamora rjzamora deleted the fix-partitioned-index branch June 24, 2020 13:50
kumarprabhu1988 pushed a commit to kumarprabhu1988/dask that referenced this pull request Oct 29, 2020
…dask#6282)

* handle auto-index detection for partitioned datasets in pyarrow
