
[REVIEW] Preserve index when writing partitioned parquet datasets with pyarrow#6282

Merged
TomAugspurger merged 14 commits into dask:master from rjzamora:fix-partitioned-index
Jun 24, 2020
Conversation

@rjzamora
Member

@rjzamora rjzamora commented Jun 3, 2020

May address #6277

The _write_partitioned logic does not currently preserve the index. This PR fixes that bug, but it still does not preserve an index name of None (the name is converted to "index", because dask.dataframe.to_parquet immediately resets the index and parquet assigns it this name).

Note that the _write_partitioned logic was mostly copied from write_to_dataset in pyarrow, where the same bug was fixed in pyarrow#7054. Since those fixes are in, and ARROW-8244 has been addressed, we may be able to use write_to_dataset for pyarrow>=1.0. With that said, I got a test failure (test_to_parquet_pyarrow_w_inconsistent_schema_by_partition_succeeds_w_manual_schema) when making the same changes as pyarrow#7054, so this solution is slightly different. More specifically, I ran into issues when removing the ignore_metadata=True option for pyarrow-to-pandas conversion, so I am manually resetting the index when necessary.
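The index-dropping behavior and the fix can be illustrated with a minimal pandas-only sketch (column and index names here are hypothetical; this is not the actual _write_partitioned code):

```python
import pandas as pd

# Illustrative stand-in for a partitioned write: split a frame by a
# partition column and keep each piece separately.
df = pd.DataFrame(
    {"part": ["a", "a", "b"], "x": [1, 2, 3]},
    index=pd.Index([10, 20, 30], name="myindex"),
)

# Buggy path: keeping only the data columns per partition silently
# drops "myindex".
buggy = {k: sub[["x"]].reset_index(drop=True) for k, sub in df.groupby("part")}
assert "myindex" not in buggy["a"].columns

# Fixed path: reset the index into the frame first, so it is written
# alongside the other columns and can be restored on read.
fixed = {
    k: sub.reset_index().drop(columns="part") for k, sub in df.groupby("part")
}
roundtripped = pd.concat(fixed.values()).set_index("myindex")
assert list(roundtripped.index) == [10, 20, 30]
```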

Questions/TODO:

  • Can we preserve an index name of None? This question is the primary reason for the "WIP" status of this PR.
  • Can we use the write_to_dataset for pyarrow >=1.0? Since this solution is different from pyarrow#7054, this may be tricky.
  • Tests added / passed
  • Passes black dask / flake8 dask

rjzamora and others added 2 commits June 16, 2020 16:53
Co-authored-by: Tom Augspurger <TomAugspurger@users.noreply.github.com>
@TomAugspurger
Member

Thanks, this is looking nice. Are you wanting to investigate the index.name = None case here too?

@rjzamora
Member Author

Are you wanting to investigate the index.name = None case here too?

I'm still looking into this; it is proving to be pretty tricky.

@rjzamora
Member Author

@TomAugspurger - I spent more time than I expected investigating the best way to preserve an index labelled None. This PR implements the null-preservation feature by reserving the column name "__null_dask_index__" for any index whose original name is None. That is, rather than letting pandas change None to "index" during the write, I am using a more obscure name, and assuming that any index with that name should become None during a read.

We should be able to do this in a "cleaner" way by working with the pandas metadata only. However, the fact that we are starting the to_parquet logic by resetting the index (along with the need to support multiple engines, partitioning, and appending) turns this into a bit of a headache. Note that I do have a rough/messy version working for pyarrow only with this approach, but I prefer the "__null_dask_index__"-based approach for now.
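The round trip described above can be sketched in pandas alone (an illustrative sketch of the idea, not the PR's implementation):

```python
import pandas as pd

NONE_LABEL = "__null_dask_index__"  # the reserved name described above

df = pd.DataFrame({"x": [1, 2, 3]})
assert df.index.name is None

# Write side: label the unnamed index with the reserved name before
# resetting it, so the written column is unambiguous.
out = df.rename_axis(NONE_LABEL).reset_index()
assert NONE_LABEL in out.columns

# Read side: a column carrying the reserved name becomes the index
# again, and its name is mapped back to None.
restored = out.set_index(NONE_LABEL)
restored.index.name = None

assert restored.index.name is None
assert list(restored["x"]) == [1, 2, 3]
```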

@rjzamora
Member Author

The most recent commits allow us to handle a None index from pandas/pyarrow. I needed to modify the read_metadata API to make this happen.

Member

@TomAugspurger left a comment


cc @martindurant if you have time to take a look.

@rjzamora are there any changes we could make to the pandas metadata that might have made this whole process easier, or perhaps avoided the need for the special NONE_LABEL?

@rjzamora
Member Author

@rjzamora are there any changes we could make to the pandas metadata that might have made this whole process easier, or perhaps avoided the need for the special NONE_LABEL?

I'd like to think about this a bit more, but I get the feeling that the challenge with the label of None is all on the dask/fastparquet side.

@rjzamora rjzamora changed the title [WIP] Preserve index when writing partitioned parquet datasets with pyarrow [REVIEW] Preserve index when writing partitioned parquet datasets with pyarrow Jun 23, 2020
@rjzamora
Member Author

@TomAugspurger - I may be able to work out a way to avoid the special NONE_LABEL, but it seems tricky enough for fastparquet that I'd prefer to prioritize my read_parquet-dev time on other performance-related improvements. Do you get the feeling that the current state of this PR is a reasonable stopping point, or is the NONE_LABEL change too much of a "sideways" move?

@TomAugspurger
Member

Agreed with your assessment of priorities. Let me skim through here one more time, but I think we're good.

@TomAugspurger
Member

Thanks @rjzamora!

@rjzamora rjzamora deleted the fix-partitioned-index branch June 24, 2020 13:50
kumarprabhu1988 pushed a commit to kumarprabhu1988/dask that referenced this pull request Oct 29, 2020
…dask#6282)

* handle auto-index detection for partitioned datasets in pyarrow
