Labels: hygiene (Improve code quality and reduce maintenance overhead), needs attention (It's been a while since this was pushed on. Needs attention from the owner or a maintainer.), parquet
Description
Describe the issue:
Because of apache/arrow#24440 (https://issues.apache.org/jira/browse/ARROW-8244), dask is using its own code to write a table to a partitioned pyarrow dataset. The code lives here:
dask/dask/dataframe/io/parquet/arrow.py
Lines 89 to 101 in 3124376
```python
def _write_partitioned(
    table,
    df,
    root_path,
    filename,
    partition_cols,
    fs,
    pandas_to_arrow_table,
    preserve_index,
    index_cols=(),
    return_metadata=True,
    **kwargs,
):
```
Since this code was added in March 2020, pyarrow has come a long way; the original issue was fixed in pyarrow in April 2020. There is a TODO in the Dask code to re-integrate `write_to_dataset`, but we still haven't done it. The custom code has already required multiple bug fixes, and we can't be sure it is up to date with pyarrow.
- Dask version: 2023.2.0+11.g0890b96b
cc @rjzamora .