ARROW-16122: [Python] Change use_legacy_dataset default and deprecate no-longer supported keywords in parquet.write_to_dataset #12811
Conversation
jorisvandenbossche left a comment:
Thanks for taking a look at this!
So the reason this is otherwise failing is that with the new dataset implementation we use a fixed file name (part-0.parquet), while before we were using a uuid filename. Therefore, with the non-legacy writer, the same file gets overwritten on each iteration of the loop.
To what extent would this be something that users also could bump into? We could in theory use the basename_template argument to replicate this "uuid" filename behaviour inside pq.write_to_dataset.
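For illustration, a minimal sketch (not the PR's actual code) of how `basename_template` can mimic the legacy uuid-based filenames; the table, output directory, and partition column are made up:

```python
import uuid

import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"year": [2020, 2020, 2021], "value": [1, 2, 3]})

# The legacy writer named files "<uuid>.parquet", while ds.write_dataset
# defaults to "part-{i}.parquet", so repeated writes into the same partition
# directory overwrite each other. A uuid-based template avoids that collision.
ds.write_dataset(
    table,
    "dataset_root",  # hypothetical output directory
    format="parquet",
    partitioning=["year"],
    partitioning_flavor="hive",
    basename_template=uuid.uuid4().hex + "-{i}.parquet",
    existing_data_behavior="overwrite_or_ignore",
)
```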
Of course, that makes sense.
Yes, I think we should add what you are suggesting to pq.write_to_dataset. Will try and commit it for review!
Hm, thinking aloud: existing_data_behavior controls how the dataset handles data that already exists. If I implement a unique way of writing parquet files when using the new API in write_to_dataset, I will also have to set existing_data_behavior to overwrite_or_ignore. That will then cause trouble when exposing the same parameter in https://issues.apache.org/jira/browse/ARROW-15757.
I could check whether the parameter is specified or not, but I am not sure if there will be additional complications.
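One way to do that check, sketched with a `None` sentinel default (the actual signature added for ARROW-15757 may differ):

```python
import pyarrow.dataset as ds


def write_to_dataset(table, root_path, existing_data_behavior=None, **kwargs):
    # Hypothetical sketch: only fall back to "overwrite_or_ignore" (needed so
    # the uuid-based filenames keep appending across repeated calls) when the
    # caller did not explicitly pick a behaviour themselves.
    if existing_data_behavior is None:
        existing_data_behavior = "overwrite_or_ignore"
    ds.write_dataset(
        table,
        root_path,
        format="parquet",
        existing_data_behavior=existing_data_behavior,
        **kwargs,
    )
```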
So what I don't understand here: dataset.write_dataset defaults to raising an error if there is existing data, so why doesn't the above test fail with that error (instead of failing because we now overwrote the files)?
Thinking more about this: if we switch the default as we are now doing, I think we should try to preserve the current behaviour of overwriting/adding data (otherwise it would be quite a breaking change for people using pq.write_to_dataset this way). We can still try to deprecate this and later move towards the same default as the dataset.write_dataset implementation.
But that can be done at a later stage with a proper deprecation warning (e.g. detect if the directory already exists and is not empty, and in that case indicate this will start raising an error in the future).
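A rough sketch of what that future deprecation check could look like (the helper name and the warning message are made up):

```python
import os
import warnings


def _warn_on_existing_data(root_path):
    # Hypothetical helper: if the target directory already exists and is not
    # empty, announce that pq.write_to_dataset will eventually raise here,
    # matching ds.write_dataset's default existing_data_behavior="error".
    if os.path.isdir(root_path) and os.listdir(root_path):
        warnings.warn(
            "Writing to a non-empty directory currently adds or overwrites "
            "files; in a future version this will raise an error instead.",
            FutureWarning,
            stacklevel=3,
        )
```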
Yes, good point.
I will step back and redo the issue so that the default stays True and a deprecation warning is raised in this case.
For the later stage, is it worth creating a JIRA already?
In this case, where the default will still be True, does exposing the write_dataset keywords still make sense, or should we wait until the default changes?
Follow up on this thread: I kept the new implementation as the default and added the use of basename_template to mimic the legacy behaviour.
At a later stage I think it will be important to rearrange the keywords for write_to_dataset to first list the keywords connected to the new API.
With the latest changes, it should be possible to now change the hardcoded use_legacy_dataset=True to use_legacy_dataset=use_legacy_dataset?
Same here as my comment above (https://github.com/apache/arrow/pull/12811/files#r845291213), as the TODO comment also indicates: calling this twice with use_legacy_dataset=False would overwrite the same file.
Not sure why I didn't get this yesterday 🤦♀️
Same here, this test can now be updated?
python/pyarrow/tests/test_dataset.py
So here this fails when using the new dataset implementation, because dataset.write_dataset(..) doesn't support the parquet row_group_size keyword (to which chunk_size gets translated); ParquetFileWriteOptions doesn't support this keyword.
On the parquet side, this is also the only keyword that is not passed to the ParquetWriter init (and thus to parquet's WriterProperties or ArrowWriterProperties), but to the actual write_table call. In C++ this can be seen at:
arrow/cpp/src/parquet/arrow/writer.h, lines 62 to 71 at 76d064c:

```cpp
  static ::arrow::Status Open(const ::arrow::Schema& schema, MemoryPool* pool,
                              std::shared_ptr<::arrow::io::OutputStream> sink,
                              std::shared_ptr<WriterProperties> properties,
                              std::shared_ptr<ArrowWriterProperties> arrow_properties,
                              std::unique_ptr<FileWriter>* writer);

  virtual std::shared_ptr<::arrow::Schema> schema() const = 0;

  /// \brief Write a Table to Parquet.
  virtual ::arrow::Status WriteTable(const ::arrow::Table& table, int64_t chunk_size) = 0;
```
cc @westonpace do you remember whether there has been prior discussion of how the row_group_size/chunk_size setting from Parquet fits into the dataset API?
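For reference, a small illustration of the same split on the Python side (file name and values are made up): compression and similar options configure the writer, while row_group_size goes to the write call itself.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"a": list(range(10))})

# compression etc. map to the writer properties at construction time;
# row_group_size is passed per write call and controls how row groups are cut.
with pq.ParquetWriter("example.parquet", table.schema, compression="snappy") as writer:
    writer.write_table(table, row_group_size=5)  # -> two row groups of 5 rows
```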
The dataset API now has a max_rows_per_group, I see, but that doesn't necessarily relate directly to Parquet row groups?
It's more generic, about how many rows are written in one go, but effectively that is therefore also a maximum Parquet row group size (since a row group needs to be written in one go)?
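If that reading is correct, the dataset writer's grouping keywords could stand in for the Parquet row_group_size/chunk_size; a sketch, with availability of these keywords depending on the Arrow version:

```python
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"a": list(range(100_000))})

ds.write_dataset(
    table,
    "dataset_root",  # hypothetical output directory
    format="parquet",
    basename_template="part-{i}.parquet",
    existing_data_behavior="overwrite_or_ignore",
    # Bounds how many rows are written in one go, which in practice also caps
    # the size of each Parquet row group in the produced files.
    min_rows_per_group=10_000,
    max_rows_per_group=10_000,
)
```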
Maybe we can open a follow-up JIRA for this one?
python/pyarrow/parquet.py
@jorisvandenbossche should other keywords also be passed? (partitioning_flavor, max_*)
@jorisvandenbossche the PR should be ready now. I exposed
jorisvandenbossche left a comment:
Thanks for the updates!
With the latest changes, it should be possible to now change the hardcoded use_legacy_dataset=True to use_legacy_dataset=use_legacy_dataset?
Same here, this test can now be updated?
@jorisvandenbossche I applied the suggestions 👍
jorisvandenbossche left a comment:
Looks good!
python/pyarrow/tests/test_dataset.py
Maybe we can open a follow-up JIRA for this one?
We should also ensure to suppress the warnings when running the tests (when we are explicitly testing
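For the tests that exercise the legacy path on purpose, something like the following could silence the expected warning (assuming the deprecation is emitted as a FutureWarning; the test name and data are made up):

```python
import pytest

import pyarrow as pa
import pyarrow.parquet as pq


@pytest.mark.filterwarnings("ignore::FutureWarning")
def test_write_to_dataset_legacy(tmp_path):
    # Deliberately use the deprecated code path without cluttering the test
    # output with the deprecation warning.
    table = pa.table({"year": [2020, 2021], "value": [1, 2]})
    pq.write_to_dataset(table, str(tmp_path), partition_cols=["year"],
                        use_legacy_dataset=True)
```

Alternatively, a test could wrap the call in `with pytest.warns(FutureWarning):` to assert that the warning is actually raised.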
…rect some of the tests to not fail due to the change
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Created https://issues.apache.org/jira/browse/ARROW-16241 for the follow-up on the warnings when explicitly using
…es it impossible to maintain old behavior

This PR tries to pass `existing_data_behavior` to `write_to_dataset` in case of the new dataset implementation. Connected to #12811.

Closes #12838 from AlenkaF/ARROW-15757

Lead-authored-by: Alenka Frim <frim.alenka@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
Benchmark runs are scheduled for baseline = 4f08a9b and contender = 1763622. 1763622 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Remove the lines that unconditionally set `partitioning` and `file_visitor` in `pq.write_to_dataset` to None. This is a leftover from #12811 where additional `pq.write_dataset` keywords were exposed.

Closes #13062 from AlenkaF/ARROW-16420

Authored-by: Alenka Frim <frim.alenka@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
This PR tries to amend `pq.write_to_dataset` to:
- deprecate `use_legacy_dataset=True` and already switch the default to `False`;
- deprecate the keywords (specific to `use_legacy_dataset=True`) that won't be supported in the new implementation.
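A condensed sketch of the deprecation pattern this describes; the real signature and warning text in the PR differ:

```python
import warnings


def write_to_dataset(table, root_path, use_legacy_dataset=None, **kwargs):
    # None means "use the new default": the dataset-based implementation.
    if use_legacy_dataset is None:
        use_legacy_dataset = False
    elif use_legacy_dataset:
        warnings.warn(
            "Passing use_legacy_dataset=True is deprecated and the legacy "
            "implementation will be removed in a future version.",
            FutureWarning,
            stacklevel=2,
        )
    if use_legacy_dataset:
        ...  # legacy ParquetWriter-based implementation
    else:
        ...  # pyarrow.dataset.write_dataset-based implementation
```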