[Data] Allow specifying partitioning style or flavor in write_parquet() #59102

Merged
bveeramani merged 2 commits into ray-project:master from wingkitlee0:kit/write-parquet-partition-style
Dec 17, 2025

Conversation

@wingkitlee0 wingkitlee0 commented Dec 2, 2025

Description

Currently, write_parquet is hard-coded to use hive partitioning. This PR allows passing partitioning_flavor via arrow_parquet_args/arrow_parquet_args_fn.

The default behaviors differ between Ray Data and pyarrow:

  • Ray Data defaults to "hive", which applies when partitioning_flavor is not specified.
  • pyarrow uses None to represent directory partitioning, so users can pass partitioning_flavor=None to select it.
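
To make the two flavors concrete, here is a minimal sketch of the directory layout each one produces. `partition_dir` is a hypothetical helper written only for illustration; it is not part of Ray Data or pyarrow and merely mimics the path naming convention of each flavor.

```python
from typing import Optional, Sequence


def partition_dir(
    columns: Sequence[str],
    values: Sequence[object],
    flavor: Optional[str] = "hive",
) -> str:
    """Return the relative directory for one partition (illustrative only)."""
    if len(columns) != len(values):
        raise ValueError("columns and values must have the same length")
    if flavor == "hive":
        # Hive style encodes the column name in every path segment.
        segments = [f"{c}={v}" for c, v in zip(columns, values)]
    elif flavor is None:
        # Directory style keeps only the values; the column names
        # have to be supplied separately when reading the data back.
        segments = [str(v) for v in values]
    else:
        raise ValueError(f"unsupported flavor in this sketch: {flavor!r}")
    return "/".join(segments)


print(partition_dir(["year", "grp"], [2025, 3]))               # year=2025/grp=3
print(partition_dir(["year", "grp"], [2025, 3], flavor=None))  # 2025/3
```

Assuming extra keyword arguments are forwarded as arrow_parquet_args, the matching Ray Data call would presumably look like `ds.write_parquet(path, partition_cols=["grp"], partitioning_flavor=None)`.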

Also, I did not reuse the Partitioning class from ray.data.read_parquet, which seems like overkill here (e.g., partition_cols is already exposed as a top-level argument).

Finally, I have rearranged the docstring a little bit.

Related issues

NA

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

@wingkitlee0 wingkitlee0 force-pushed the kit/write-parquet-partition-style branch 2 times, most recently from 9f47731 to 0be1f96 Compare December 2, 2025 02:57
@wingkitlee0 wingkitlee0 changed the title Add option to specify partitioning style/flavor in write_parquet [Data] Add option to specify partitioning style/flavor in write_parquet Dec 2, 2025
@wingkitlee0 wingkitlee0 marked this pull request as ready for review December 2, 2025 03:53
@wingkitlee0 wingkitlee0 requested a review from a team as a code owner December 2, 2025 03:53
@wingkitlee0 wingkitlee0 changed the title [Data] Add option to specify partitioning style/flavor in write_parquet [Data] Allow specifying partitioning style/flavor in write_parquet Dec 2, 2025
@wingkitlee0 wingkitlee0 changed the title [Data] Allow specifying partitioning style/flavor in write_parquet [Data] Allow specifying partitioning style or flavor in write_parquet() Dec 2, 2025
@wingkitlee0 wingkitlee0 force-pushed the kit/write-parquet-partition-style branch from 0c17dc7 to 57556d2 Compare December 2, 2025 04:27
@goutamvenkat-anyscale goutamvenkat-anyscale added the data Ray Data-related issues label Dec 2, 2025
@wingkitlee0 wingkitlee0 force-pushed the kit/write-parquet-partition-style branch 3 times, most recently from f1c2c7d to 72e32c9 Compare December 2, 2025 17:13
@wingkitlee0 wingkitlee0 force-pushed the kit/write-parquet-partition-style branch from 194aa09 to 19b0b7b Compare December 3, 2025 01:34
@wingkitlee0
Contributor Author

One question: if users specify partitioning_flavor but don't set partition_cols, pyarrow will raise an error. It seems useful to raise the error in Ray Data before the pipeline runs.

@wingkitlee0 wingkitlee0 force-pushed the kit/write-parquet-partition-style branch 3 times, most recently from b8ef583 to 27308bb Compare December 3, 2025 04:50
@wingkitlee0 wingkitlee0 added the go add ONLY when ready to merge, run all tests label Dec 3, 2025
@goutamvenkat-anyscale
Contributor

One question: if users specify partitioning_flavor but don't set partition_cols, pyarrow will raise an error. It seems useful to raise the error in Ray Data before the pipeline runs.

We can add this validation in ParquetDatasink's constructor. Feel free to add it in this PR.
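
A minimal sketch of what such an up-front check could look like. `validate_partitioning_args` is a hypothetical standalone function, not the actual ParquetDatasink code; it assumes the flavor can arrive either through the static arrow_parquet_args dict or through the arrow_parquet_args_fn callable, and it raises ValueError because the problem lies in user-supplied arguments.

```python
from typing import Any, Callable, Dict, List, Optional


def validate_partitioning_args(
    partition_cols: Optional[List[str]],
    arrow_parquet_args: Dict[str, Any],
    arrow_parquet_args_fn: Optional[Callable[[], Dict[str, Any]]] = None,
) -> None:
    """Fail fast if a partitioning flavor is given without partition columns."""
    # Check for key presence rather than truthiness: None is a valid flavor.
    flavor_given = "partitioning_flavor" in arrow_parquet_args or (
        arrow_parquet_args_fn is not None
        and "partitioning_flavor" in arrow_parquet_args_fn()
    )
    if flavor_given and not partition_cols:
        raise ValueError(
            "partitioning_flavor was specified, but partition_cols is empty. "
            "Set partition_cols or drop partitioning_flavor."
        )


# Passes: flavor and partition columns are both provided.
validate_partitioning_args(["grp"], {"partitioning_flavor": None})

# Fails before any data is written.
try:
    validate_partitioning_args(None, {"partitioning_flavor": "hive"})
except ValueError as exc:
    print(f"rejected: {exc}")
```

Running the check at construction time surfaces the mistake when the write is defined rather than after the pipeline has started.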

@wingkitlee0 wingkitlee0 force-pushed the kit/write-parquet-partition-style branch from 207ac82 to 91b6f06 Compare December 4, 2025 05:24
@wingkitlee0 wingkitlee0 force-pushed the kit/write-parquet-partition-style branch from 91b6f06 to a272663 Compare December 6, 2025 02:41
@wingkitlee0 wingkitlee0 requested a review from a team December 9, 2025 23:44
@wingkitlee0
Contributor Author

Hi @goutamvenkat-anyscale this is ready for re-review. Thanks!

Contributor

@goutamvenkat-anyscale goutamvenkat-anyscale left a comment

Just 1 comment. But looks good otherwise

@wingkitlee0 wingkitlee0 force-pushed the kit/write-parquet-partition-style branch from 9293640 to b96c587 Compare December 16, 2025 21:57
Contributor

@goutamvenkat-anyscale goutamvenkat-anyscale left a comment

lgtm! Thanks

Signed-off-by: Kit Lee <7000003+wingkitlee0@users.noreply.github.com>
@wingkitlee0 wingkitlee0 force-pushed the kit/write-parquet-partition-style branch from b96c587 to 54b42c1 Compare December 16, 2025 22:16
Member

@bveeramani bveeramani left a comment

Overall LGTM

self.arrow_parquet_args_fn is not None
and "partitioning_flavor" in self.arrow_parquet_args_fn()
):
assert (
Member

Nit: I think ValueError is more appropriate here because this is an error with user-provided values rather than an internal correctness issue.

Contributor Author

Fixed. Also fixed another assert in the same function.

self.arrow_parquet_args_fn, **self.arrow_parquet_args
)
user_schema = write_kwargs.pop("schema", None)
# Extract partitioning_flavor before the closure to preserve it across retries
Member

I felt confused by this comment. What does extracting partition_flavor have to do with retries?

Contributor Author

@wingkitlee0 wingkitlee0 Dec 17, 2025

I removed this comment since it's more of a Python thing. This was originally pointed out by the Cursor review: write_kwargs is mutable, so the fields are not available on the second try if they were already popped (inside the function) on the first try.
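
The pitfall can be reproduced in plain Python. The sketch below uses two hypothetical factories, `make_buggy_writer` and `make_safe_writer`, that stand in for the write closure; they are not the Ray Data code, only a demonstration of why popping from a shared dict inside a retried function loses the value.

```python
from typing import Any, Callable, Dict


def make_buggy_writer(write_kwargs: Dict[str, Any]) -> Callable[[], Any]:
    def write() -> Any:
        # pop() mutates the shared dict, so the key is gone on a second call
        # and the default ("hive") silently takes over.
        return write_kwargs.pop("partitioning_flavor", "hive")

    return write


def make_safe_writer(write_kwargs: Dict[str, Any]) -> Callable[[], Any]:
    # Extract once, outside the closure; every attempt sees the same value.
    flavor = write_kwargs.pop("partitioning_flavor", "hive")

    def write() -> Any:
        return flavor

    return write


buggy = make_buggy_writer({"partitioning_flavor": None})
print(buggy(), buggy())  # None hive

safe = make_safe_writer({"partitioning_flavor": None})
print(safe(), safe())  # None None
```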


ds = ray.data.range(1000).add_column("grp", lambda x: x["id"] % 10)

with tempfile.TemporaryDirectory() as tmp_dir:
Member

Nit: Consider using the tmp_path pytest fixture instead of tempfile. You can avoid the context manager, and I think it'll be easier to read

Contributor Author

Thanks. TIL about this built-in fixture.
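
For reference, a minimal sketch of a test written against the tmp_path fixture. `write_marker_files` is a hypothetical stand-in for the partitioned write (it is not Ray Data code), so the example runs without Ray installed; pytest supplies tmp_path as a fresh pathlib.Path per test, and no context manager is needed.

```python
from pathlib import Path


def write_marker_files(root: Path, groups: range) -> None:
    """Stand-in for a partitioned write: one hive-style directory per group."""
    for g in groups:
        part_dir = root / f"grp={g}"
        part_dir.mkdir(parents=True, exist_ok=True)
        (part_dir / "data.txt").write_text(str(g))


def test_write_creates_one_dir_per_group(tmp_path: Path) -> None:
    # pytest injects tmp_path; no context manager or manual cleanup in the test.
    write_marker_files(tmp_path, range(10))
    created = sorted(p.name for p in tmp_path.iterdir())
    assert created == sorted(f"grp={g}" for g in range(10))
```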

Signed-off-by: Kit Lee <7000003+wingkitlee0@users.noreply.github.com>
@bveeramani bveeramani merged commit 65d735e into ray-project:master Dec 17, 2025
6 checks passed
zzchun pushed a commit to zzchun/ray that referenced this pull request Dec 18, 2025
…() (ray-project#59102)

## Description
Currently, `write_parquet` has been hard-coded to use `hive`
partititoning. This PR allows passing `partitioning_flavor` via
`arrow_parquet_args`/`arrow_parquet_args_fn`.

Since the default behaviors are different between Ray Data and pyarrow:
- Ray Data defaults to "hive", which is the case when we do not specify
this `partitioning_flavor`
- pyarrow uses `None` to represent dictionary partitioning. So we can
use partitioning_flavor=None

Also, I did not use the Partitioning class in `ray.data.read_parquet`,
which seems to be overkill (e.g., we exposed `partition_cols` as
top-level args here..)

Finally, I have rearranged the docstring a little bit.

## Related issues
NA

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Kit Lee <7000003+wingkitlee0@users.noreply.github.com>
Yicheng-Lu-llll pushed a commit to Yicheng-Lu-llll/ray that referenced this pull request Dec 22, 2025
…() (ray-project#59102)

## Description
Currently, `write_parquet` has been hard-coded to use `hive`
partititoning. This PR allows passing `partitioning_flavor` via
`arrow_parquet_args`/`arrow_parquet_args_fn`.

Since the default behaviors are different between Ray Data and pyarrow:
- Ray Data defaults to "hive", which is the case when we do not specify
this `partitioning_flavor`
- pyarrow uses `None` to represent dictionary partitioning. So we can
use partitioning_flavor=None

Also, I did not use the Partitioning class in `ray.data.read_parquet`,
which seems to be overkill (e.g., we exposed `partition_cols` as
top-level args here..)

Finally, I have rearranged the docstring a little bit.

## Related issues
NA

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Kit Lee <7000003+wingkitlee0@users.noreply.github.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…() (ray-project#59102)

## Description
Currently, `write_parquet` has been hard-coded to use `hive`
partititoning. This PR allows passing `partitioning_flavor` via
`arrow_parquet_args`/`arrow_parquet_args_fn`.

Since the default behaviors are different between Ray Data and pyarrow:
- Ray Data defaults to "hive", which is the case when we do not specify
this `partitioning_flavor`
- pyarrow uses `None` to represent dictionary partitioning. So we can
use partitioning_flavor=None

Also, I did not use the Partitioning class in `ray.data.read_parquet`,
which seems to be overkill (e.g., we exposed `partition_cols` as
top-level args here..)

Finally, I have rearranged the docstring a little bit.

## Related issues
NA

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Kit Lee <7000003+wingkitlee0@users.noreply.github.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>