[Python][Parquet] Add write_to_dataset option to populate the "file_path" metadata fields #24440

@asfimport

Description

Prior to [dask#6023](https://github.com/dask/dask/pull/6023), Dask used the write_to_dataset API to write partitioned Parquet datasets. That PR switches to a (hopefully temporary) custom solution, because the API makes it difficult to populate the "file_path" column-chunk metadata fields that are returned through the optional metadata_collector kwarg. Dask needs these fields set correctly in order to generate a proper global "_metadata" file.

Possible solutions to this problem:

  1. Optionally populate the file-path fields within write_to_dataset
  2. Always populate the file-path fields within write_to_dataset
  3. Return the file paths for the data written within write_to_dataset (up to the user to manually populate the file-path fields)

Reporter: Rick Zamora / @rjzamora
Assignee: Joris Van den Bossche / @jorisvandenbossche

PRs and other links:

Note: This issue was originally created as ARROW-8244. Please see the migration documentation for further details.
