[Data] Allow specifying partitioning style or flavor in write_parquet()#59102
Conversation
9f47731 to
0be1f96
Compare
0c17dc7 to
57556d2
Compare
f1c2c7d to
72e32c9
Compare
194aa09 to
19b0b7b
Compare
|
One question... if users specify |
b8ef583 to
27308bb
Compare
We can add this validation in |
207ac82 to
91b6f06
Compare
91b6f06 to
a272663
Compare
|
Hi @goutamvenkat-anyscale this is ready for re-review. Thanks! |
goutamvenkat-anyscale
left a comment
There was a problem hiding this comment.
Just 1 comment. But looks good otherwise
9293640 to
b96c587
Compare
goutamvenkat-anyscale
left a comment
There was a problem hiding this comment.
lgtm! Thanks
Signed-off-by: Kit Lee <7000003+wingkitlee0@users.noreply.github.com>
b96c587 to
54b42c1
Compare
| self.arrow_parquet_args_fn is not None | ||
| and "partitioning_flavor" in self.arrow_parquet_args_fn() | ||
| ): | ||
| assert ( |
There was a problem hiding this comment.
Nit: I think ValueError is more appropriate here because this is an error with user-provided values rather an internal correctness
There was a problem hiding this comment.
Fixed. also fixed another assert in the same function.
| self.arrow_parquet_args_fn, **self.arrow_parquet_args | ||
| ) | ||
| user_schema = write_kwargs.pop("schema", None) | ||
| # Extract partitioning_flavor before the closure to preserve it across retries |
There was a problem hiding this comment.
I felt confused by this comment. What does extracting partition_flavor have to do with retries?
There was a problem hiding this comment.
I removed this comment since it's more of a python thing. This was originally pointed by the cursor review: write_kargs is mutable and so the fields are not available in the second try if they are already popped (inside the function) in the first try.
|
|
||
| ds = ray.data.range(1000).add_column("grp", lambda x: x["id"] % 10) | ||
|
|
||
| with tempfile.TemporaryDirectory() as tmp_dir: |
There was a problem hiding this comment.
Nit: Consider using the tmp_path pytest fixture instead of tempfile. You can avoid the context manager, and I think it'll be easier to read
There was a problem hiding this comment.
Thanks. TIL this builtin fixture.
Signed-off-by: Kit Lee <7000003+wingkitlee0@users.noreply.github.com>
…() (ray-project#59102) ## Description Currently, `write_parquet` has been hard-coded to use `hive` partititoning. This PR allows passing `partitioning_flavor` via `arrow_parquet_args`/`arrow_parquet_args_fn`. Since the default behaviors are different between Ray Data and pyarrow: - Ray Data defaults to "hive", which is the case when we do not specify this `partitioning_flavor` - pyarrow uses `None` to represent dictionary partitioning. So we can use partitioning_flavor=None Also, I did not use the Partitioning class in `ray.data.read_parquet`, which seems to be overkill (e.g., we exposed `partition_cols` as top-level args here..) Finally, I have rearranged the docstring a little bit. ## Related issues NA ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Kit Lee <7000003+wingkitlee0@users.noreply.github.com>
…() (ray-project#59102) ## Description Currently, `write_parquet` has been hard-coded to use `hive` partititoning. This PR allows passing `partitioning_flavor` via `arrow_parquet_args`/`arrow_parquet_args_fn`. Since the default behaviors are different between Ray Data and pyarrow: - Ray Data defaults to "hive", which is the case when we do not specify this `partitioning_flavor` - pyarrow uses `None` to represent dictionary partitioning. So we can use partitioning_flavor=None Also, I did not use the Partitioning class in `ray.data.read_parquet`, which seems to be overkill (e.g., we exposed `partition_cols` as top-level args here..) Finally, I have rearranged the docstring a little bit. ## Related issues NA ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Kit Lee <7000003+wingkitlee0@users.noreply.github.com>
…() (ray-project#59102) ## Description Currently, `write_parquet` has been hard-coded to use `hive` partititoning. This PR allows passing `partitioning_flavor` via `arrow_parquet_args`/`arrow_parquet_args_fn`. Since the default behaviors are different between Ray Data and pyarrow: - Ray Data defaults to "hive", which is the case when we do not specify this `partitioning_flavor` - pyarrow uses `None` to represent dictionary partitioning. So we can use partitioning_flavor=None Also, I did not use the Partitioning class in `ray.data.read_parquet`, which seems to be overkill (e.g., we exposed `partition_cols` as top-level args here..) Finally, I have rearranged the docstring a little bit. ## Related issues NA ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Kit Lee <7000003+wingkitlee0@users.noreply.github.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
Description
Currently,
write_parquethas been hard-coded to usehivepartititoning. This PR allows passingpartitioning_flavorviaarrow_parquet_args/arrow_parquet_args_fn.Since the default behaviors are different between Ray Data and pyarrow:
partitioning_flavorNoneto represent dictionary partitioning. So we can use partitioning_flavor=NoneAlso, I did not use the Partitioning class in
ray.data.read_parquet, which seems to be overkill (e.g., we exposedpartition_colsas top-level args here..)Finally, I have rearranged the docstring a little bit.
Related issues
NA
Additional information