ARROW-16122: [Python] Change use_legacy_dataset default and deprecate no-longer supported keywords in parquet.write_to_dataset #12811
Conversation
jorisvandenbossche left a comment:
Thanks for taking a look at this!
So the reason this is otherwise failing is that with the new dataset implementation we use a fixed file name (part-0.parquet), while before we were using a uuid filename. Therefore, with the non-legacy writer, the same file gets overwritten on each iteration of the loop.
To what extent would this be something that users also could bump into? We could in theory use the basename_template argument to replicate this "uuid" filename behaviour inside pq.write_to_dataset.
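For illustration, a minimal sketch (not the PR's actual code) of how `basename_template` can mimic the legacy uuid-based filenames; the table, output directory, and partition column are made up:

```python
import uuid

import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"year": [2020, 2020, 2021], "value": [1, 2, 3]})

# The legacy writer named files "<uuid>.parquet", while ds.write_dataset
# defaults to "part-{i}.parquet", so repeated writes into the same partition
# directory overwrite each other. A uuid-based template avoids that collision.
ds.write_dataset(
    table,
    "dataset_root",  # hypothetical output directory
    format="parquet",
    partitioning=["year"],
    partitioning_flavor="hive",
    basename_template=uuid.uuid4().hex + "-{i}.parquet",
    existing_data_behavior="overwrite_or_ignore",
)
```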
Of course, that makes sense.
Yes, I think we should add what you are suggesting to pq.write_to_dataset. Will try and commit it for review!
Hm, thinking aloud: existing_data_behavior controls how the dataset handles data that already exists. If I implement a unique way of writing parquet files when using the new API in write_to_dataset, I will also have to set existing_data_behavior to overwrite_or_ignore. That will then cause trouble when exposing the same parameter in https://issues.apache.org/jira/browse/ARROW-15757.
I could check whether the parameter is specified or not, but I am not sure if there will be additional complications.
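One way to do that check, sketched with a `None` sentinel default (the actual signature added for ARROW-15757 may differ):

```python
import pyarrow.dataset as ds


def write_to_dataset(table, root_path, existing_data_behavior=None, **kwargs):
    # Hypothetical sketch: only fall back to "overwrite_or_ignore" (needed so
    # the uuid-based filenames keep appending across repeated calls) when the
    # caller did not explicitly pick a behaviour themselves.
    if existing_data_behavior is None:
        existing_data_behavior = "overwrite_or_ignore"
    ds.write_dataset(
        table,
        root_path,
        format="parquet",
        existing_data_behavior=existing_data_behavior,
        **kwargs,
    )
```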
So what I don't understand here: dataset.write_dataset defaults to raising an error if there is existing data, so why doesn't the above test fail with that error (instead of failing because we now overwrote the files)?
Thinking more about this: if we switch the default as we are now doing, I think we should try to preserve the current behaviour of overwriting/adding data (otherwise it would be quite a breaking change for people using pq.write_to_dataset this way). We can still try to deprecate this and later move towards the same default as the dataset.write_dataset implementation.
But that can be done at a later stage with a proper deprecation warning (e.g. detect if the directory already exists and is not empty, and in that case indicate this will start raising an error in the future).
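A rough sketch of what that future deprecation check could look like (the helper name and the warning message are made up):

```python
import os
import warnings


def _warn_on_existing_data(root_path):
    # Hypothetical helper: if the target directory already exists and is not
    # empty, announce that pq.write_to_dataset will eventually raise here,
    # matching ds.write_dataset's default existing_data_behavior="error".
    if os.path.isdir(root_path) and os.listdir(root_path):
        warnings.warn(
            "Writing to a non-empty directory currently adds or overwrites "
            "files; in a future version this will raise an error instead.",
            FutureWarning,
            stacklevel=3,
        )
```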
Yes, good point.
I will step back and redo the issue so that the default stays True and a deprecation warning is raised in this case.
For the later stage, is it worth creating a JIRA already?
In this case, where the default will still be True, does exposing the write_dataset keywords still make sense, or should we wait until the default changes?
Follow up on this thread: I kept the new implementation as the default and added the use of basename_template to mimic the legacy behaviour.
At a later stage I think it will be important to rearrange the keywords for write_to_dataset to first list the keywords connected to the new API.
With the latest changes, it should be possible to now change the hardcoded use_legacy_dataset=True to use_legacy_dataset=use_legacy_dataset?
Same here as my comment above (https://github.com/apache/arrow/pull/12811/files#r845291213), as the TODO comment also indicates: calling this twice with use_legacy_dataset=False would overwrite the same file.
Not sure why I didn't get this yesterday 🤦♀️
Same here, this test can now be updated?
python/pyarrow/tests/test_dataset.py
So here this fails when using the new dataset implementation, because dataset.write_dataset(..) doesn't support the parquet row_group_size keyword (to which chunk_size gets translated); ParquetFileWriteOptions doesn't support this keyword.
On the parquet side, this is also the only keyword that is not passed to the ParquetWriter init (and thus to parquet's WriterProperties or ArrowWriterProperties), but to the actual write_table call. In C++ this can be seen at:
arrow/cpp/src/parquet/arrow/writer.h, lines 62 to 71 at 76d064c:

```cpp
  static ::arrow::Status Open(const ::arrow::Schema& schema, MemoryPool* pool,
                              std::shared_ptr<::arrow::io::OutputStream> sink,
                              std::shared_ptr<WriterProperties> properties,
                              std::shared_ptr<ArrowWriterProperties> arrow_properties,
                              std::unique_ptr<FileWriter>* writer);

  virtual std::shared_ptr<::arrow::Schema> schema() const = 0;

  /// \brief Write a Table to Parquet.
  virtual ::arrow::Status WriteTable(const ::arrow::Table& table, int64_t chunk_size) = 0;
```
cc @westonpace do you remember whether there has been prior discussion of how the row_group_size/chunk_size setting from Parquet fits into the dataset API?
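For reference, a small illustration of the same split on the Python side (file name and values are made up): compression and similar options configure the writer, while row_group_size goes to the write call itself.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"a": list(range(10))})

# compression etc. map to the writer properties at construction time;
# row_group_size is passed per write call and controls how row groups are cut.
with pq.ParquetWriter("example.parquet", table.schema, compression="snappy") as writer:
    writer.write_table(table, row_group_size=5)  # -> two row groups of 5 rows
```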
The dataset API now has a max_rows_per_group, I see, but that doesn't necessarily relate directly to Parquet row groups?
It's more generic, about how many rows are written in one go, but effectively that is therefore also a maximum Parquet row group size (since a row group needs to be written in one go)?
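If that reading is correct, the dataset writer's grouping keywords could stand in for the Parquet row_group_size/chunk_size; a sketch, with availability of these keywords depending on the Arrow version:

```python
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"a": list(range(100_000))})

ds.write_dataset(
    table,
    "dataset_root",  # hypothetical output directory
    format="parquet",
    basename_template="part-{i}.parquet",
    existing_data_behavior="overwrite_or_ignore",
    # Bounds how many rows are written in one go, which in practice also caps
    # the size of each Parquet row group in the produced files.
    min_rows_per_group=10_000,
    max_rows_per_group=10_000,
)
```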
Maybe we can open a follow-up JIRA for this one?
python/pyarrow/parquet.py
@jorisvandenbossche should other keywords also be passed? (partitioning_flavor, max_*)
@jorisvandenbossche the PR should be ready now. I exposed
jorisvandenbossche left a comment:
Thanks for the updates!
With the latest changes, it should be possible to now change the hardcoded use_legacy_dataset=True to use_legacy_dataset=use_legacy_dataset?
Same here, this test can now be updated?
@jorisvandenbossche I applied the suggestions 👍
jorisvandenbossche left a comment:
Looks good!
python/pyarrow/tests/test_dataset.py
Maybe we can open a follow-up JIRA for this one?
We should also ensure to suppress the warnings when running the tests (when we are explicitly testing
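For the tests that exercise the legacy path on purpose, something like the following could silence the expected warning (assuming the deprecation is emitted as a FutureWarning; the test name and data are made up):

```python
import pytest

import pyarrow as pa
import pyarrow.parquet as pq


@pytest.mark.filterwarnings("ignore::FutureWarning")
def test_write_to_dataset_legacy(tmp_path):
    # Deliberately use the deprecated code path without cluttering the test
    # output with the deprecation warning.
    table = pa.table({"year": [2020, 2021], "value": [1, 2]})
    pq.write_to_dataset(table, str(tmp_path), partition_cols=["year"],
                        use_legacy_dataset=True)
```

Alternatively, a test could wrap the call in `with pytest.warns(FutureWarning):` to assert that the warning is actually raised.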
…rect some of the tests to not fail due to the change
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Created https://issues.apache.org/jira/browse/ARROW-16241 for the follow-up on the warnings when explicitly using
…es it impossible to maintain old behavior

This PR tries to pass `existing_data_behavior` to `write_to_dataset` in case of the new dataset implementation. Connected to #12811.

Closes #12838 from AlenkaF/ARROW-15757

Lead-authored-by: Alenka Frim <frim.alenka@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
Benchmark runs are scheduled for baseline = 4f08a9b and contender = 1763622. 1763622 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Remove the lines that unconditionally set `partitioning` and `file_visitor` in `pq.write_to_dataset` to None. This is a leftover from #12811 where additional `pq.write_dataset` keywords were exposed.

Closes #13062 from AlenkaF/ARROW-16420

Authored-by: Alenka Frim <frim.alenka@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
This PR tries to amend `pq.write_to_dataset` to:
- deprecate `use_legacy_dataset=True` and already switch the default to `False`;
- deprecate the keywords (specific to `use_legacy_dataset=True`) that won't be supported in the new implementation.
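A condensed sketch of the deprecation pattern this describes; the real signature and warning text in the PR differ:

```python
import warnings


def write_to_dataset(table, root_path, use_legacy_dataset=None, **kwargs):
    # None means "use the new default": the dataset-based implementation.
    if use_legacy_dataset is None:
        use_legacy_dataset = False
    elif use_legacy_dataset:
        warnings.warn(
            "Passing use_legacy_dataset=True is deprecated and the legacy "
            "implementation will be removed in a future version.",
            FutureWarning,
            stacklevel=2,
        )
    if use_legacy_dataset:
        ...  # legacy ParquetWriter-based implementation
    else:
        ...  # pyarrow.dataset.write_dataset-based implementation
```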