
Conversation

@AlenkaF
Member

@AlenkaF AlenkaF commented Apr 6, 2022

This PR tries to amend pq.write_to_dataset to:

  1. raise a deprecation warning for use_legacy_dataset=True and already switch the default to False.
  2. raise deprecation warnings for all keywords (when use_legacy_dataset=True) that won't be supported in the new implementation.
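A minimal sketch of the warning pattern described in the two points above (the exact message wording and the set of affected keywords are illustrative assumptions, not the PR's actual code):

```python
import warnings

def write_to_dataset(table, root_path, use_legacy_dataset=False, **kwargs):
    # Sketch only: the real function dispatches to the legacy or dataset-based writer.
    if use_legacy_dataset:
        warnings.warn(
            "Passing 'use_legacy_dataset=True' to parquet.write_to_dataset is "
            "deprecated and will be removed in a future version.",
            FutureWarning, stacklevel=2,
        )
        # Keywords that the new implementation will not support also warn.
        for kw in ("partition_filename_cb",):  # assumed example of such a keyword
            if kw in kwargs:
                warnings.warn(
                    f"The '{kw}' keyword is deprecated and will not be supported "
                    "with the new dataset implementation.",
                    FutureWarning, stacklevel=2,
                )
```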

@github-actions

github-actions bot commented Apr 6, 2022

@AlenkaF AlenkaF marked this pull request as ready for review April 7, 2022 05:58
Member
@jorisvandenbossche jorisvandenbossche left a comment

Thanks for taking a look at this!

Member

So the reason this is otherwise failing is that with the new dataset implementation we use a fixed file name (part-0.parquet), while before we were using a uuid-based filename. Therefore, with the non-legacy writer, each iteration of the loop overwrites the same file.

To what extent is this something that users could also bump into? We could in theory use the basename_template argument to replicate this "uuid" filename behaviour inside pq.write_to_dataset.
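A rough sketch of that idea, assuming a uuid-based basename_template is forwarded to pyarrow.dataset.write_dataset (an illustration of the suggestion, not the code in this PR):

```python
import uuid
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"year": [2020, 2021], "value": [1.0, 2.0]})

# A uuid-based template makes every call write uniquely named files, like the
# legacy writer did, instead of re-writing part-0.parquet each time.
basename_template = "{}-{{i}}.parquet".format(uuid.uuid4().hex)

ds.write_dataset(
    table,
    "dataset_root",                       # illustrative output path
    format="parquet",
    partitioning=["year"],
    partitioning_flavor="hive",
    basename_template=basename_template,
    existing_data_behavior="overwrite_or_ignore",  # allow adding into an existing directory
)
```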

Member Author

Of course, that makes sense.
Yes, I think we should add what you are suggesting to pq.write_to_dataset. I will try it and commit for review!

Member Author
@AlenkaF AlenkaF Apr 8, 2022

Hm, thinking aloud: existing_data_behavior controls how the dataset will handle data that already exists. If I implement a unique way of writing parquet files when using the new API in write_to_dataset, I will also have to set existing_data_behavior to overwrite_or_ignore. That will then cause trouble when exposing the same parameter in https://issues.apache.org/jira/browse/ARROW-15757.

I could add a check for whether the parameter is specified or not, but I am not sure if there will be additional complications.
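A possible shape for that check, using a sentinel default so we can tell whether the user passed the parameter (the names and the fallback value here are illustrative assumptions, not this PR's code):

```python
import pyarrow.dataset as ds

_UNSET = object()  # sentinel: lets us distinguish "not passed" from an explicit value

def write_to_dataset(table, root_path, existing_data_behavior=_UNSET, **kwargs):
    if existing_data_behavior is _UNSET:
        # Not specified by the user: keep the legacy-compatible behaviour of
        # adding files into an existing directory.
        existing_data_behavior = "overwrite_or_ignore"
    ds.write_dataset(
        table, root_path, format="parquet",
        existing_data_behavior=existing_data_behavior, **kwargs,
    )
```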

Member

So what I don't understand here: the dataset.write_dataset function defaults to raising an error if there is existing data? But then why doesn't the above test fail with that error, instead of failing because we now overwrote the files?

Member

Thinking more about this: if we switch the default as we are now doing, I think we should try to preserve the current behaviour of overwriting/adding data (otherwise it would be quite a breaking change for people using pq.write_to_dataset this way). We can still try to deprecate this and later move towards the same default as the dataset.write_dataset implementation.
But that can be done at a later stage with a proper deprecation warning (e.g. detect if the directory already exists and is not empty, and in that case indicate this will start raising an error in the future).
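A sketch of that future deprecation idea (not part of this PR; the helper name and message wording are assumptions):

```python
import os
import warnings

def _warn_on_existing_data(root_path):
    # If the target directory already exists and is not empty, tell users that
    # the current add/overwrite behaviour will become an error by default.
    if os.path.isdir(root_path) and os.listdir(root_path):
        warnings.warn(
            "Writing into a non-empty dataset directory currently adds to or "
            "overwrites existing files; a future version will raise an error by "
            "default, matching dataset.write_dataset.",
            FutureWarning, stacklevel=2,
        )
```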

Member Author

Yes, good point.
I will step back and redo the issue in a way that the default stays True and it raises a deprecation warning in this case.

For the later stage, is it worth creating a JIRA already?

Member Author

In this case, where the default will still be True, does exposing write_dataset keywords still make sense, or should we wait until the default changes?

Member Author

Follow-up on this thread: I kept the new implementation as the default and added the use of basename_template to mimic the legacy behaviour.

At a later stage I think it will be important to rearrange the keywords for write_to_dataset to first list the keywords connected to the new API.

Member

With the latest changes, it should be possible to now change the hardcoded use_legacy_dataset=True to use_legacy_dataset=use_legacy_dataset?

Member

Same here as my comment above (https://github.com/apache/arrow/pull/12811/files#r845291213), as the TODO comment also indicates: calling this twice with use_legacy_dataset=False would overwrite the same file.

Member Author

Not sure why I didn't get this yesterday 🤦‍♀️

Member

Same here, this test can now be updated?

Member

So here this fails when using the new dataset implementation, because dataset.write_dataset(...) doesn't support the parquet row_group_size keyword (to which chunk_size gets translated); ParquetFileWriteOptions doesn't support this keyword.

On the parquet side, this is also the only keyword that is not passed to the ParquetWriter init (and thus to parquet's WriterProperties or ArrowWriterProperties), but to the actual write_table call. In C++ this can be seen at:

static ::arrow::Status Open(const ::arrow::Schema& schema, MemoryPool* pool,
                            std::shared_ptr<::arrow::io::OutputStream> sink,
                            std::shared_ptr<WriterProperties> properties,
                            std::shared_ptr<ArrowWriterProperties> arrow_properties,
                            std::unique_ptr<FileWriter>* writer);

virtual std::shared_ptr<::arrow::Schema> schema() const = 0;

/// \brief Write a Table to Parquet.
virtual ::arrow::Status WriteTable(const ::arrow::Table& table, int64_t chunk_size) = 0;

cc @westonpace do you remember whether it has been discussed before how the row_group_size/chunk_size setting from Parquet fits into the dataset API?

Member
@jorisvandenbossche jorisvandenbossche Apr 7, 2022

The dataset API now has a max_rows_per_group, I see, but that doesn't necessarily directly relate to Parquet row groups?

It's more generic (it controls how many rows are written in one go), but effectively it therefore also acts as a maximum Parquet row group size, since a row group needs to be written in one go?
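For reference, a small illustration of the two settings being compared, assuming a pyarrow version where write_dataset accepts max_rows_per_group (paths are illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

table = pa.table({"x": list(range(100_000))})

# Parquet writer: row_group_size / chunk_size is an argument of write_table
# itself, not of WriterProperties / ArrowWriterProperties.
pq.write_table(table, "single_file.parquet", row_group_size=10_000)

# Dataset writer: max_rows_per_group caps how many rows are written in one go,
# which for Parquet also acts as an upper bound on the row group size.
ds.write_dataset(table, "rows_per_group_dataset", format="parquet",
                 max_rows_per_group=10_000)
```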

Member

Maybe we can open a follow-up JIRA for this one?

Member Author

@jorisvandenbossche should other keywords also be passed? (partitioning_flavor, max_*)

@AlenkaF
Member Author

AlenkaF commented Apr 15, 2022

@jorisvandenbossche the PR should be ready now. I exposed basename_template and it now mimics the legacy behaviour.
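A hedged usage sketch of the exposed keyword, based on the description above (output path and template string are illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"year": [2020, 2021], "value": [1.0, 2.0]})

pq.write_to_dataset(
    table,
    "dataset_root",                         # illustrative output path
    partition_cols=["year"],
    basename_template="chunk-{i}.parquet",  # "{i}" is filled in per written file
    use_legacy_dataset=False,
)
```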

Member
@jorisvandenbossche jorisvandenbossche left a comment

Thanks for the updates!

Member

With the latest changes, it should be possible to now change the hardcoded use_legacy_dataset=True to use_legacy_dataset=use_legacy_dataset?

Member

Same here, this test can now be updated?

@AlenkaF
Member Author

AlenkaF commented Apr 15, 2022

@jorisvandenbossche I applied the suggestions 👍

Member
@jorisvandenbossche jorisvandenbossche left a comment

Looks good!

Member

Maybe we can open a follow-up JIRA for this one?

@jorisvandenbossche
Member

We should also make sure to suppress the warnings when running the tests (when we are explicitly testing use_legacy_dataset=True and thus want to ignore the warning), but that could maybe be done as a follow-up as well.
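One possible way to do that in the tests, sketched with pytest's filterwarnings mark (the warning class and message fragment are assumptions, not the actual warning text):

```python
import pytest
import pyarrow as pa
import pyarrow.parquet as pq

# Ignore the deprecation warning in tests that intentionally exercise the legacy path.
@pytest.mark.filterwarnings("ignore:.*use_legacy_dataset.*:FutureWarning")
def test_write_to_dataset_legacy(tmp_path):
    table = pa.table({"a": [1, 2, 3]})
    pq.write_to_dataset(table, str(tmp_path), use_legacy_dataset=True)
```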

@AlenkaF
Member Author

AlenkaF commented Apr 19, 2022

Created https://issues.apache.org/jira/browse/ARROW-16241 for the follow-up on the warnings when explicitly using use_legacy_dataset=True in the tests.

@jorisvandenbossche jorisvandenbossche changed the title ARROW-16122: [Python] Deprecate no-longer supported keywords in parquet.write_to_dataset ARROW-16122: [Python] Change use_legacy_dataset default and deprecate no-longer supported keywords in parquet.write_to_dataset Apr 21, 2022
@AlenkaF AlenkaF deleted the ARROW-16122 branch April 21, 2022 08:44
kszucs pushed a commit that referenced this pull request Apr 22, 2022
…es it impossible to maintain old behavior

This PR tries to pass `existing_data_behavior` to `write_to_dataset` in case of the new dataset implementation.

Connected to #12811.

Closes #12838 from AlenkaF/ARROW-15757

Lead-authored-by: Alenka Frim <frim.alenka@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
@ursabot

ursabot commented Apr 23, 2022

Benchmark runs are scheduled for baseline = 4f08a9b and contender = 1763622. 1763622 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Failed ⬇️0.38% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.55% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] ec2-t3-xlarge-us-east-2 (1763622b): https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/565
[Failed] test-mac-arm (1763622b): https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/553
[Failed] ursa-i9-9960x (1763622b): https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/551
[Finished] ursa-thinkcentre-m75q (1763622b): https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/563
[Finished] ec2-t3-xlarge-us-east-2 (4f08a9b6): https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/564
[Finished] test-mac-arm (4f08a9b6): https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/552
[Failed] ursa-i9-9960x (4f08a9b6): https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/550
[Finished] ursa-thinkcentre-m75q (4f08a9b6): https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/562
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

jorisvandenbossche pushed a commit that referenced this pull request May 19, 2022
Remove the lines that unconditionally set `partitioning` and `file_visitor` in `pq.write_to_dataset` to None. This is a leftover from #12811 where additional `pq.write_dataset` keywords were exposed.

Closes #13062 from AlenkaF/ARROW-16420

Authored-by: Alenka Frim <frim.alenka@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>