[Data] Allow specifying partitioning style or flavor in write_parquet() by wingkitlee0 · Pull Request #59102 · ray-project/ray

wingkitlee0 · 2025-12-02T02:27:03Z

Description

Currently, write_parquet is hard-coded to use hive partitioning. This PR allows passing partitioning_flavor via arrow_parquet_args/arrow_parquet_args_fn.

The default behaviors differ between Ray Data and pyarrow:

- Ray Data defaults to "hive", which applies when partitioning_flavor is not specified.
- pyarrow uses None to represent directory partitioning, so users can pass partitioning_flavor=None to get that style.
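To make the difference between the two flavors concrete, here is a minimal pure-Python sketch of the directory names each style produces for a partition column. The helper `partition_dir` is hypothetical and only illustrates the naming convention (key=value segments for hive, value-only segments for directory partitioning). It is not Ray Data or pyarrow code.

```python
from pathlib import PurePosixPath


def partition_dir(base, partition_cols, row, flavor="hive"):
    """Return the directory a row would land in (illustrative helper only).

    flavor="hive" -> key=value segments, e.g. base/grp=3
    flavor=None   -> value-only segments (directory partitioning), e.g. base/3
    """
    if flavor not in ("hive", None):
        raise ValueError(f"unsupported flavor in this sketch: {flavor!r}")
    path = PurePosixPath(base)
    for col in partition_cols:
        segment = f"{col}={row[col]}" if flavor == "hive" else str(row[col])
        path = path / segment
    return str(path)


row = {"id": 13, "grp": 3}
hive_path = partition_dir("out", ["grp"], row, flavor="hive")
dir_path = partition_dir("out", ["grp"], row, flavor=None)
print(hive_path)  # out/grp=3
print(dir_path)   # out/3
```

With this PR, the corresponding Ray Data call would presumably be along the lines of `ds.write_parquet(path, partition_cols=["grp"], partitioning_flavor=None)`, since partitioning_flavor travels through arrow_parquet_args. That call shape is inferred from the description above, not copied from the diff.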

Also, I did not use the Partitioning class used by ray.data.read_parquet, which seems like overkill here (e.g., partition_cols is already exposed as a top-level argument).

Finally, I have rearranged the docstring a little bit.

Related issues

NA

python/ray/data/_internal/datasource/parquet_datasink.py

wingkitlee0 · 2025-12-03T01:45:48Z

One question: if users specify partitioning_flavor but don't set partition_cols, pyarrow will raise an error. It seems useful to raise the error in Ray Data before the pipeline runs.

goutamvenkat-anyscale · 2025-12-03T18:31:52Z

> One question: if users specify partitioning_flavor but don't set partition_cols, pyarrow will raise an error. It seems useful to raise the error in Ray Data before the pipeline runs.

We can add this validation in ParquetDatasink's constructor. Feel free to add it in this PR.

python/ray/data/_internal/datasource/parquet_datasink.py

wingkitlee0 · 2025-12-13T21:20:48Z

Hi @goutamvenkat-anyscale this is ready for re-review. Thanks!

goutamvenkat-anyscale

Just 1 comment. But looks good otherwise

goutamvenkat-anyscale

lgtm! Thanks

Signed-off-by: Kit Lee <7000003+wingkitlee0@users.noreply.github.com>

bveeramani

Overall LGTM

bveeramani · 2025-12-17T02:54:09Z

python/ray/data/_internal/datasource/parquet_datasink.py

+            self.arrow_parquet_args_fn is not None
+            and "partitioning_flavor" in self.arrow_parquet_args_fn()
+        ):
+            assert (


Nit: I think ValueError is more appropriate here because this is an error with user-provided values rather than an internal correctness issue.

Fixed. Also fixed another assert in the same function.
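As a rough illustration of the validation discussed in this thread, the sketch below checks both the static arguments and the arguments returned by the callable, and raises ValueError when partitioning_flavor is given without partition_cols. The function name `validate_partitioning_args` and the error message are assumptions made for this example. The merged code lives in ParquetDatasink and may differ.

```python
def validate_partitioning_args(
    partition_cols, arrow_parquet_args=None, arrow_parquet_args_fn=None
):
    """Raise ValueError if partitioning_flavor is given without partition_cols.

    Hypothetical helper for illustration; not the merged Ray Data code.
    """
    static_args = arrow_parquet_args or {}
    dynamic_args = arrow_parquet_args_fn() if arrow_parquet_args_fn is not None else {}
    flavor_given = (
        "partitioning_flavor" in static_args
        or "partitioning_flavor" in dynamic_args
    )
    if flavor_given and not partition_cols:
        raise ValueError(
            "partitioning_flavor was specified, but partition_cols is empty. "
            "Set partition_cols to the columns to partition by."
        )


# Valid: a flavor together with partition columns passes silently.
validate_partitioning_args(["grp"], {"partitioning_flavor": None})

# Invalid: a flavor supplied through the callable, but no partition columns.
try:
    validate_partitioning_args(
        None, arrow_parquet_args_fn=lambda: {"partitioning_flavor": "hive"}
    )
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```

One practical reason to prefer ValueError over assert for user input: assert statements are skipped entirely when Python runs with the -O flag, so an assert-based check can silently disappear.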

bveeramani · 2025-12-17T02:55:30Z

python/ray/data/_internal/datasource/parquet_datasink.py

            self.arrow_parquet_args_fn, **self.arrow_parquet_args
        )
        user_schema = write_kwargs.pop("schema", None)
+        # Extract partitioning_flavor before the closure to preserve it across retries


I felt confused by this comment. What does extracting partitioning_flavor have to do with retries?

I removed this comment since it's more of a Python thing. This was originally pointed out by the Cursor review: write_kwargs is mutable, so fields popped inside the function on the first try are no longer available on the second try.
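The retry pitfall described here can be reproduced with plain Python. This is a minimal sketch, with hypothetical names (`make_writer_buggy`, `make_writer_fixed`) standing in for the real write function. Calling the returned closure twice simulates a retry.

```python
def make_writer_buggy(write_kwargs):
    def write():
        # Pops from the shared dict on every call, so the key is gone on a retry
        # and the default ("hive") is silently used instead.
        return write_kwargs.pop("partitioning_flavor", "hive")

    return write


def make_writer_fixed(write_kwargs):
    # Pop once, outside the closure; the closure only reads the captured value.
    flavor = write_kwargs.pop("partitioning_flavor", "hive")

    def write():
        return flavor

    return write


buggy = make_writer_buggy({"partitioning_flavor": None})
buggy_results = [buggy(), buggy()]  # second call simulates a retry

fixed = make_writer_fixed({"partitioning_flavor": None})
fixed_results = [fixed(), fixed()]

print(buggy_results)  # [None, 'hive']
print(fixed_results)  # [None, None]
```

In the buggy variant the retry falls back to the default flavor, which is why the value is extracted before the closure is defined.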

python/ray/data/dataset.py

bveeramani · 2025-12-17T02:57:46Z

python/ray/data/tests/test_parquet.py

+
+    ds = ray.data.range(1000).add_column("grp", lambda x: x["id"] % 10)
+
+    with tempfile.TemporaryDirectory() as tmp_dir:


Nit: Consider using the tmp_path pytest fixture instead of tempfile. You can avoid the context manager, and I think it'll be easier to read

Thanks. TIL about this built-in fixture.
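For comparison, here is a small sketch of the two styles. `write_marker` is a hypothetical stand-in for the code under test (it is not part of the Ray test suite). `tmp_path` is pytest's built-in fixture, which hands each test a fresh directory as a pathlib.Path.

```python
import tempfile
from pathlib import Path


def write_marker(directory):
    """Stand-in for the code under test: writes one file into `directory`."""
    target = Path(directory) / "marker.txt"
    target.write_text("ok")
    return target


def test_with_tempfile():
    # Needs an explicit context manager and an extra indentation level.
    with tempfile.TemporaryDirectory() as tmp_dir:
        assert write_marker(tmp_dir).read_text() == "ok"


def test_with_tmp_path(tmp_path):
    # pytest injects `tmp_path` as a pathlib.Path to a per-test directory.
    assert write_marker(tmp_path).read_text() == "ok"
```

The fixture version drops the `with` block, and the test body receives a Path object directly instead of a string.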

wingkitlee0 force-pushed the kit/write-parquet-partition-style branch 2 times, most recently from 9f47731 to 0be1f96 Compare December 2, 2025 02:57

wingkitlee0 changed the title ~~Add option to specify partitioning style/flavor in write_parquet~~ [Data] Add option to specify partitioning style/flavor in write_parquet Dec 2, 2025

wingkitlee0 marked this pull request as ready for review December 2, 2025 03:53

wingkitlee0 requested a review from a team as a code owner December 2, 2025 03:53

wingkitlee0 changed the title ~~[Data] Add option to specify partitioning style/flavor in write_parquet~~ [Data] Allow specifying partitioning style/flavor in write_parquet Dec 2, 2025

wingkitlee0 changed the title ~~[Data] Allow specifying partitioning style/flavor in write_parquet~~ [Data] Allow specifying partitioning style or flavor in write_parquet() Dec 2, 2025

cursor bot reviewed Dec 2, 2025

python/ray/data/_internal/datasource/parquet_datasink.py (outdated)

wingkitlee0 force-pushed the kit/write-parquet-partition-style branch from 0c17dc7 to 57556d2 Compare December 2, 2025 04:27

goutamvenkat-anyscale added the data label (Ray Data-related issues) Dec 2, 2025

goutamvenkat-anyscale reviewed Dec 2, 2025

python/ray/data/_internal/datasource/parquet_datasink.py (outdated)

wingkitlee0 force-pushed the kit/write-parquet-partition-style branch 3 times, most recently from f1c2c7d to 72e32c9 Compare December 2, 2025 17:13

goutamvenkat-anyscale reviewed Dec 2, 2025

python/ray/data/_internal/datasource/parquet_datasink.py

wingkitlee0 force-pushed the kit/write-parquet-partition-style branch from 194aa09 to 19b0b7b Compare December 3, 2025 01:34

wingkitlee0 force-pushed the kit/write-parquet-partition-style branch 3 times, most recently from b8ef583 to 27308bb Compare December 3, 2025 04:50

wingkitlee0 added the go label (add ONLY when ready to merge, run all tests) Dec 3, 2025

wingkitlee0 force-pushed the kit/write-parquet-partition-style branch from 207ac82 to 91b6f06 Compare December 4, 2025 05:24

cursor bot reviewed Dec 4, 2025

python/ray/data/_internal/datasource/parquet_datasink.py (outdated)

wingkitlee0 requested a review from goutamvenkat-anyscale December 6, 2025 02:38

wingkitlee0 force-pushed the kit/write-parquet-partition-style branch from 91b6f06 to a272663 Compare December 6, 2025 02:41

wingkitlee0 requested a review from a team December 9, 2025 23:44

goutamvenkat-anyscale reviewed Dec 16, 2025

wingkitlee0 force-pushed the kit/write-parquet-partition-style branch from 9293640 to b96c587 Compare December 16, 2025 21:57

goutamvenkat-anyscale approved these changes Dec 16, 2025

[Data] Add option to specify partitioning style/flavor in write_parquet

54b42c1

Signed-off-by: Kit Lee <7000003+wingkitlee0@users.noreply.github.com>

wingkitlee0 force-pushed the kit/write-parquet-partition-style branch from b96c587 to 54b42c1 Compare December 16, 2025 22:16

bveeramani reviewed Dec 17, 2025

address pr comments

f8b4c83

Signed-off-by: Kit Lee <7000003+wingkitlee0@users.noreply.github.com>

bveeramani merged commit 65d735e into ray-project:master Dec 17, 2025
6 checks passed

