Prior to [dask#6023](https://github.com/dask/dask/pull/6023), Dask used the write_to_dataset API to write partitioned Parquet datasets. That PR switches to a (hopefully temporary) custom solution, because write_to_dataset makes it difficult to populate the "file_path" column-chunk metadata fields of the objects returned through the optional metadata_collector kwarg. Dask needs to set these fields correctly in order to generate a proper global "_metadata" file.
Possible solutions to this problem:
- Optionally populate the file-path fields within write_to_dataset
- Always populate the file-path fields within write_to_dataset
- Return the file paths for the data written within write_to_dataset (up to the user to manually populate the file-path fields)
Reporter: Rick Zamora / @rjzamora
Assignee: Joris Van den Bossche / @jorisvandenbossche
PRs and other links:
Note: This issue was originally created as ARROW-8244. Please see the migration documentation for further details.