Re-integrate pyarrow write_to_dataset instead of arrow._write_partitioned in dask #9968

@j-bennet

Description

Describe the issue:

Because of apache/arrow#24440 (https://issues.apache.org/jira/browse/ARROW-8244), dask is using its own code to write a table to a partitioned pyarrow dataset. The code lives here:

def _write_partitioned(
    table,
    df,
    root_path,
    filename,
    partition_cols,
    fs,
    pandas_to_arrow_table,
    preserve_index,
    index_cols=(),
    return_metadata=True,
    **kwargs,
):

Since this code was added in March 2020, pyarrow has come a long way; the original issue was fixed in pyarrow in April 2020. There is a TODO in the Dask code to re-integrate write_to_dataset, but it still needs to be done. The custom code has already required multiple bugfixes, and we can't be sure it is up to date with pyarrow.

  • Dask version: 2023.2.0+11.g0890b96b

cc @rjzamora .

Metadata

    Labels

  • hygiene: Improve code quality and reduce maintenance overhead
  • needs attention: It's been a while since this was pushed on. Needs attention from the owner or a maintainer.
  • parquet
