[REVIEW] Preserve index when writing partitioned parquet datasets with pyarrow #6282
TomAugspurger merged 14 commits into dask:master
Conversation
Co-authored-by: Tom Augspurger <TomAugspurger@users.noreply.github.com>
Thanks, this is looking nice. Are you wanting to investigate the …
I'm still looking into this; it is proving to be pretty tricky.
@TomAugspurger - I spent more time than I expected trying to investigate the best way to preserve an index labelled as `None`. We should be able to do this in a "cleaner" way by working with the pandas metadata only. However, the fact that we are starting the …
Most recent commits allow us to handle …
TomAugspurger left a comment
cc @martindurant if you have time to take a look.
@rjzamora are there any changes we could make to the pandas metadata that might have made this whole process easier, or perhaps avoided the need for the special `NONE_LABEL`?
I'd like to think about this a bit more, but I get the feeling that the challenge with the label of …
@TomAugspurger - I may be able to work out a way to avoid the special `NONE_LABEL` …
Agreed with your assessment of priorities. Let me skim through here one more time, but I think we're good.
Thanks @rjzamora!
…dask#6282)
* handle auto-index detection for partitioned datasets in pyarrow
May address #6277
The `_write_partitioned` logic does not currently preserve the index. This PR fixes this bug, but still does not preserve the index name if it is `None` (the name is converted to `"index"`, because `dask.dataframe.to_parquet` immediately resets the index and parquet will assign it this name).

Note that the `_write_partitioned` logic was mostly copied from `write_to_dataset` in pyarrow, where the same bug was fixed in pyarrow#7054. Since those fixes are in, and ARROW-8244 has been addressed, we may be able to use `write_to_dataset` for `pyarrow>=1.0`. With that said, I got a test failure (`test_to_parquet_pyarrow_w_inconsistent_schema_by_partition_succeeds_w_manual_schema`) when making the same changes as pyarrow#7054, so this solution is slightly different. More specifically, I ran into issues when removing the `ignore_metadata=True` option for pyarrow-to-pandas conversion, so I am manually resetting the index when necessary.

Questions/TODO:
- … `None`? This question is the primary reason for the "WIP" status of this PR
- `write_to_dataset` for pyarrow >=1.0? Since this solution is different from pyarrow#7054, this may be tricky.
- `black dask` / `flake8 dask`
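To illustrate the naming issue described above, a minimal pandas-only sketch (hypothetical data, not from the PR) of why an unnamed index picks up the literal name `"index"` once it is reset before writing:

```python
import pandas as pd

# Hypothetical frame: "part" would be the partitioning column.
df = pd.DataFrame({"a": [1, 2], "part": ["x", "y"]}, index=[10, 20])

# to_parquet resets the index before writing; since the index has
# no name, reset_index falls back to the literal column name
# "index", which is what then lands in the parquet file.
flat = df.reset_index()
print(list(flat.columns))
# -> ['index', 'a', 'part']
```

If the index had a real name (e.g. `df.index.name = "id"`), that name would survive the reset, which is why only the `None` case needs special handling.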