ARROW-8244: [Python] Fix parquet.write_to_dataset to set file path in metadata_collector by jorisvandenbossche · Pull Request #6797 · apache/arrow

jorisvandenbossche · 2020-04-01T08:54:13Z

This explores a potential fix for ARROW-8244. It seems rather straightforward to set the file path in write_to_dataset (write_table does not do this because the user passes a full path there, so no relative path is known).

cc @rjzamora does this look like the correct logic?

github-actions · 2020-04-01T09:02:26Z

https://issues.apache.org/jira/browse/ARROW-8244

rjzamora · 2020-04-01T16:53:19Z

cc @rjzamora does this look like the correct logic?

Thank you for working on this @jorisvandenbossche!

Yes, these changes are exactly what I had in mind. I don't think there is any danger in setting the file_path here, because the returned metadata will never be included in the file where the data is stored (the only case where the value of None is technically correct). Leaving the field empty in write_table makes sense, because the user is already specifying the full path and may want to manually construct a partitioned dataset. However, for write_to_dataset, I would argue that the only correct behavior is to set the file_path by default, and allow the user to modify it if desired/necessary (which is what you have in this PR).
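To make the distinction above concrete, here is a minimal, library-free sketch of the behaviour being discussed. It is a simplified model, not pyarrow's implementation: `FakeFileMetaData`, `model_write_table`, `model_write_to_dataset`, and the `part-N.parquet` naming are all hypothetical stand-ins, and no files are actually written. It only shows the idea that a single-file write leaves `file_path` as None, while a dataset write records each file's path relative to the dataset root.

```python
import posixpath
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FakeFileMetaData:
    """Hypothetical stand-in for the metadata collected per written file."""
    num_rows: int
    file_path: Optional[str] = None  # None until someone sets it explicitly


def model_write_table(rows: List[dict], where: str,
                      metadata_collector: Optional[list] = None) -> None:
    """Model of a single-file write: the caller passes a full path, so no
    relative path is known and file_path is left as None."""
    if metadata_collector is not None:
        metadata_collector.append(FakeFileMetaData(num_rows=len(rows)))


def model_write_to_dataset(rows: List[dict], root_path: str,
                           partition_col: Optional[str] = None,
                           metadata_collector: Optional[list] = None) -> None:
    """Model of a dataset write: the path relative to root_path is known,
    so it is recorded on the metadata that was just collected."""
    groups: Dict[str, List[dict]] = {}
    if partition_col is None:
        groups[""] = list(rows)  # non-partitioned case: one file in the root
    else:
        for row in rows:
            subdir = f"{partition_col}={row[partition_col]}"
            groups.setdefault(subdir, []).append(row)

    for i, (subdir, group) in enumerate(groups.items()):
        relative = posixpath.join(subdir, f"part-{i}.parquet")
        model_write_table(group, posixpath.join(root_path, relative),
                          metadata_collector)
        if metadata_collector is not None:
            # Set the relative path by default; the caller can still
            # overwrite file_path on the collected objects afterwards.
            metadata_collector[-1].file_path = relative


rows = [{"year": 2019, "v": 1}, {"year": 2020, "v": 2}, {"year": 2020, "v": 3}]
collector: list = []
model_write_to_dataset(rows, "dataset_root", partition_col="year",
                       metadata_collector=collector)
paths = [meta.file_path for meta in collector]
```

In this model, `paths` holds one root-relative entry per partition directory, which is the information a combined metadata file would need in order to locate each data file.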


jorisvandenbossche · 2020-04-02T09:00:35Z

@rjzamora Thanks for the feedback! I agree that just setting the file path is probably the only sensible behaviour, so we can simply change that.

I added a test for the non-partitioned case as well.

fsaintjacques

LGTM!

jorisvandenbossche added 2 commits April 2, 2020 10:46

ARROW-8244: [Python] Fix parquet.write_to_dataset to set file path in…

1787a02

… metadata_collector

add test for non-partitioned write_to_dataset

1eb7ef9

jorisvandenbossche force-pushed the ARROW-8244 branch from 2f446b9 to 1eb7ef9 on April 2, 2020 08:59

fsaintjacques approved these changes Apr 2, 2020


fsaintjacques closed this in ac3bfe4 Apr 2, 2020

jorisvandenbossche deleted the ARROW-8244 branch April 2, 2020 19:26

asfimport mentioned this pull request Apr 5, 2020

[Python][Parquet] Add write_to_dataset option to populate the "file_path" metadata fields #24440

Closed

jorisvandenbossche wants to merge 2 commits into apache:master from jorisvandenbossche:ARROW-8244
