[Data] Prevent filename collisions on write by bveeramani · Pull Request #53890 · ray-project/ray

bveeramani · 2025-06-17T16:49:44Z

Why are these changes needed?

Currently, Ray Data uses a counter to determine dataset IDs and, consequently, written filenames. The problem with this approach is that if you re-run a job, Ray Data might reuse the same filenames and overwrite existing data, even if you specify save_mode="append".
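
To make the failure mode concrete, here is a minimal sketch that does not use Ray at all. The `run_job` helper and the `<dataset id>_<task index>.parquet` name format are hypothetical, chosen only to illustrate how a counter that restarts in every process leads to repeated filenames.

```python
import itertools


def run_job(num_tasks: int) -> list[str]:
    """Simulate one run of a job that writes one file per task."""
    # A fresh process starts with a fresh counter, so the first dataset
    # created in each run always gets ID 0 (an assumption of this sketch).
    dataset_ids = itertools.count()
    dataset_id = next(dataset_ids)
    # Hypothetical name format: "<dataset id>_<task index>.parquet".
    return [f"{dataset_id}_{task:06}.parquet" for task in range(num_tasks)]


first_run = run_job(num_tasks=3)
second_run = run_job(num_tasks=3)

# Both runs produce identical names, so the second run would overwrite
# the first run's files instead of adding new ones next to them.
collisions = set(first_run) & set(second_run)
print(sorted(collisions))
```

Because nothing in the name changes between runs, an append-style write to the same directory has no way to avoid clobbering earlier output.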

Related issue number

Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- [ ] I've run scripts/format.sh to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
    - [ ] Unit tests
    - [ ] Release tests
    - [ ] This PR is not tested :(

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

Copilot

Pull Request Overview

This PR augments the filename generation API to include a per-write UUID, preventing collisions when re-running jobs in append mode.

- Added write_uuid parameter to the filename provider interface and implementations
- Generated and propagated a UUID through the write planning stage to all write tasks
- Updated tests to pass a fixed write_uuid and verify deterministic and unique filenames
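
The interface change summarized above can be sketched roughly as follows. This is not Ray's actual class: `SketchFilenameProvider`, its method signature, and the name format are assumptions made for illustration. Only the idea of passing a `write_uuid` into filename generation comes from the PR.

```python
from typing import Any


class SketchFilenameProvider:
    """Hypothetical stand-in for a filename provider; not Ray's actual class."""

    def __init__(self, file_format: str):
        self._file_format = file_format

    def get_filename_for_block(
        self, block: Any, write_uuid: str, task_index: int, block_index: int
    ) -> str:
        # The write UUID distinguishes separate write operations; the task and
        # block indices distinguish files within a single write.
        return f"{write_uuid}_{task_index:06}_{block_index:06}.{self._file_format}"


provider = SketchFilenameProvider("parquet")

# With a fixed write_uuid (as a test might pass), the name is deterministic.
name_a = provider.get_filename_for_block(None, "spam", task_index=0, block_index=0)
name_b = provider.get_filename_for_block(None, "spam", task_index=0, block_index=0)

# A different write_uuid yields a different name for the same task and block.
name_c = provider.get_filename_for_block(None, "eggs", task_index=0, block_index=0)
```

Keeping the UUID as an explicit argument, rather than generating it inside the provider, is what lets tests pin it to a fixed value and still check that distinct writes get distinct names.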

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| python/ray/data/tests/test_filename_provider.py | Updated tests to include `write_uuid` in filename calls |
| python/ray/data/datasource/filename_provider.py | Extended provider interface and docs to accept `write_uuid` |
| python/ray/data/datasource/file_datasink.py | Passed `write_uuid` into row/block filename generation |
| python/ray/data/_internal/planner/plan_write_op.py | Generated a UUID and attached it to all write tasks |
| python/ray/data/_internal/datasource/parquet_datasink.py | Added `write_uuid` argument for parquet filename calls |


alexeykudinkin · 2025-06-17T18:37:24Z

python/ray/data/_internal/planner/plan_write_op.py

+
    # Add a UUID to write tasks to prevent filename collisions. This is a UUID for the
    # overall write operation, not the individual write tasks.
+    write_uuid = uuid.uuid4().hex


Let's use the dataset ID; we don't need a different one.

@alexeykudinkin that'd change the way datasets appear in our observability tools. Are we okay with that?

I don't mind, but in the interest of speed, I tried to leave that aspect unchanged.

Ah, I keep forgetting that we call it a UUID, but it actually isn't one (and we should fix that madness, by the way).
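
The point made in the diff comment, that the UUID belongs to the overall write operation and not to individual tasks, can be sketched as below. `WriteTaskContext` and `plan_write` are hypothetical names for illustration; only `uuid.uuid4().hex` is taken from the diff.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteTaskContext:
    """Hypothetical per-task context; the field names are assumptions."""

    write_uuid: str
    task_index: int


def plan_write(num_tasks: int) -> list[WriteTaskContext]:
    # Generate the UUID once, at planning time, so that every task in this
    # write operation shares the same value.
    write_uuid = uuid.uuid4().hex
    return [WriteTaskContext(write_uuid, i) for i in range(num_tasks)]


first = plan_write(num_tasks=2)
second = plan_write(num_tasks=2)

# Tasks within one write share a UUID; separate writes get different ones.
print(first[0].write_uuid == first[1].write_uuid)
print(first[0].write_uuid == second[0].write_uuid)
```

Unlike a counter-based dataset ID, a random UUID does not reset when the job is re-run, which is why the two can't simply be swapped without also changing how the dataset ID is produced.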



Initial commit

c401b76

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

Copilot AI review requested due to automatic review settings June 17, 2025 16:49

bveeramani requested a review from a team as a code owner June 17, 2025 16:49

bveeramani assigned raulchen Jun 17, 2025

Copilot AI reviewed Jun 17, 2025

Address review comments

71c04a0

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

bveeramani added the "go" label (add ONLY when ready to merge, run all tests) Jun 17, 2025

raulchen approved these changes Jun 17, 2025

alexeykudinkin reviewed Jun 17, 2025

bveeramani and others added 8 commits June 17, 2025 12:14

Fix bug

acea2fc

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

Merge branch 'master' into fix-collisions

02745f5

Merge branch 'master' into fix-collisions

f47ff38

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

Address review comments

46e9f7e

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

Merge branch 'fix-collisions' of https://github.com/ray-project/ray i…

c01b9fe

…nto fix-collisions Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

Fix bug

19ae2a8

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

Fix some testS

9c658f5

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

Merge branch 'master' into fix-collisions

3d3967a

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

bveeramani enabled auto-merge (squash) June 19, 2025 07:42

Merge branch 'master' into fix-collisions

8e74aa1

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

github-actions bot disabled auto-merge June 19, 2025 09:38

bveeramani merged commit 6316d98 into master Jun 19, 2025
5 checks passed

bveeramani deleted the fix-collisions branch June 19, 2025 10:41

bveeramani merged 11 commits into master from fix-collisions
