Skip to content

[Data] Prevent filename collisions on write#53890

Merged
bveeramani merged 11 commits intomasterfrom
fix-collisions
Jun 19, 2025
Merged

[Data] Prevent filename collisions on write#53890
bveeramani merged 11 commits intomasterfrom
fix-collisions

Conversation

@bveeramani
Copy link
Copy Markdown
Member

Why are these changes needed?

Currently, Ray Data uses a counter to determine dataset IDs and consequently written filenames. The issue with this approach is that if you re-run a job, Ray Data might re-use the same filenames and override existing data, even if you specify save_mode="append".

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Copilot AI review requested due to automatic review settings June 17, 2025 16:49
@bveeramani bveeramani requested a review from a team as a code owner June 17, 2025 16:49
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR augments the filename generation API to include a per-write UUID, preventing collisions when re-running jobs in append mode.

  • Added write_uuid parameter to the filename provider interface and implementations
  • Generated and propagated a UUID through the write planning stage to all write tasks
  • Updated tests to pass a fixed write_uuid and verify deterministic and unique filenames

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
python/ray/data/tests/test_filename_provider.py Updated tests to include write_uuid in filename calls
python/ray/data/datasource/filename_provider.py Extended provider interface and docs to accept write_uuid
python/ray/data/datasource/file_datasink.py Passed write_uuid into row/block filename generation
python/ray/data/_internal/planner/plan_write_op.py Generated a UUID and attached it to all write tasks
python/ray/data/_internal/datasource/parquet_datasink.py Added write_uuid argument for parquet filename calls

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani bveeramani added the go add ONLY when ready to merge, run all tests label Jun 17, 2025

# Add a UUID to write tasks to prevent filename collisions. This a UUID for the
# overall write operation, not the individual write tasks.
write_uuid = uuid.uuid4().hex
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use dataset-id we don't need a different one

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexeykudinkin that'd change the way datasets appear in our observability tools. Are we okay that?

I don't mind, but in the interest of speed, I tried to leave that aspect unchanged

Copy link
Copy Markdown
Contributor

@alexeykudinkin alexeykudinkin Jun 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, i keep forgetting that we call it uuid, but it's actually not (and we'd fix that madness btw)

bveeramani and others added 8 commits June 17, 2025 12:14
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
…nto fix-collisions

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani bveeramani enabled auto-merge (squash) June 19, 2025 07:42
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@github-actions github-actions bot disabled auto-merge June 19, 2025 09:38
@bveeramani bveeramani merged commit 6316d98 into master Jun 19, 2025
5 checks passed
@bveeramani bveeramani deleted the fix-collisions branch June 19, 2025 10:41
minerharry pushed a commit to minerharry/ray that referenced this pull request Jun 27, 2025
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

<!-- Please give a short summary of the change and the problem this
solves. -->

Currently, Ray Data uses a counter to determine dataset IDs and
consequently written filenames. The issue with this approach is that if
you re-run a job, Ray Data might re-use the same filenames and override
existing data, even if you specify `save_mode="append"`.

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

<!-- Please give a short summary of the change and the problem this
solves. -->

Currently, Ray Data uses a counter to determine dataset IDs and
consequently written filenames. The issue with this approach is that if
you re-run a job, Ray Data might re-use the same filenames and override
existing data, even if you specify `save_mode="append"`.

## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants