[Data] Use iterator in write ops instead of accumulating all of the blocks in memory (#57108)
Conversation
Code Review
This pull request aims to optimize memory usage in `write_datasink` by using an iterator. While the change correctly introduces an iterator, it doesn't fully achieve the memory-saving goal because it still accumulates all write results in memory before processing. I've provided a review comment with a suggested code change to process the results incrementally, which should properly address the underlying memory issue.
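The accumulate-vs-fold distinction the review raises can be shown with a toy model. This is a minimal sketch, not Ray's actual API: `ToyWriteResult`, its fields, and the `combine` helper are all illustrative stand-ins for the real per-bundle write results.

```python
from dataclasses import dataclass


# Toy stand-in for a per-bundle write result; field names are
# illustrative, not Ray Data's actual WriteResult schema.
@dataclass
class ToyWriteResult:
    num_rows: int
    size_bytes: int


def combine(a: ToyWriteResult, b: ToyWriteResult) -> ToyWriteResult:
    # Fold two per-bundle results into one running aggregate.
    return ToyWriteResult(a.num_rows + b.num_rows, a.size_bytes + b.size_bytes)


def write_results_stream():
    # Stands in for the per-bundle results yielded by the execution iterator.
    for rows in (10, 20, 30):
        yield ToyWriteResult(rows, rows * 8)


# Fold incrementally: only the running aggregate stays in memory,
# never the full list of per-bundle results.
total = ToyWriteResult(0, 0)
for res in write_results_stream():
    total = combine(total, res)

print(total.num_rows, total.size_bytes)  # 60 480
```

The memory win comes from the fold keeping O(1) state regardless of how many bundles the stream yields.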
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
/gemini review
Code Review
This pull request refactors the write operations to use an iterator-based approach for handling write results, which is a great improvement for memory efficiency. Instead of accumulating all block references and getting their results at once with ray.get(), the new implementation processes them incrementally. This avoids holding all results in memory, which is particularly beneficial for large datasets.
The changes include:
- Modifying `Dataset.write_datasink` to use `_execute_to_iterator` and process write results in a streaming fashion.
- Introducing a `WriteResult.combine` class method to aggregate per-bundle write results.
- Relocating the `gen_datasink_write_result` helper function to a more appropriate module.
The implementation is solid. I've suggested a minor enhancement to use a generator expression, which further improves memory usage by avoiding an intermediate list of results. Overall, this is a valuable optimization.
```python
iter_, stats = self._write_ds._execute_to_iterator()
write_results = []

for bundle in iter_:
    res = ray.get(bundle.block_refs)
    # Generate write result report
    write_results.append(_gen_datasink_write_result(res))

combined_write_result = WriteResult.combine(*write_results)
```
Great work on switching to an iterator to process write results incrementally! This significantly improves memory efficiency.
To further enhance this and make it even more memory-friendly, you can use a generator expression instead of creating an intermediate `write_results` list. This avoids materializing all `WriteResult` objects in memory before combining them.
Also, the `stats` variable returned by `_execute_to_iterator()` is not used, so it can be replaced with `_` to signal that it's intentionally ignored.
```python
iter_, _ = self._write_ds._execute_to_iterator()
write_results_iter = (
    _gen_datasink_write_result(ray.get(bundle.block_refs))
    for bundle in iter_
)
combined_write_result = WriteResult.combine(*write_results_iter)
```
## Changes

Execute as a typical iteration, writing and releasing resulting blocks incrementally rather than holding on to them until the whole op completes.

## Related issue number

## Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run pre-commit jobs to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

> [!NOTE]
> Write ops now consume results via an iterator and aggregate per-bundle `WriteResult` using a new combine API, avoiding accumulating all blocks in memory.
>
> - **Write Path (`Dataset.write_datasink`)**:
>   - Switch to streaming execution (`_execute_to_iterator`) to process write results incrementally instead of `ray.get` on all blocks.
>   - Accumulate per-bundle results via `_gen_datasink_write_result` and aggregate with `WriteResult.combine`; log and callback use the combined result.
> - **APIs/Utilities**:
>   - Move the write-result aggregation helper to `ray.data.datasource.datasink` as `_gen_datasink_write_result`.
>   - Add `WriteResult.combine(...)` for aggregating multiple write results.
> - **Code Cleanup**:
>   - Remove `gen_datasink_write_result` from `planner/plan_write_op.py` and related imports; adjust imports in `dataset.py` accordingly.
>
> Written by Cursor Bugbot for commit 4c5a373. This will update automatically on new commits.
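A combine-style classmethod of the shape described above can be sketched as follows. This is a minimal sketch assuming an API shaped like `WriteResult.combine(*results)`; the field names and internals of `SketchWriteResult` are hypothetical, and the real Ray Data class carries more state.

```python
from dataclasses import dataclass


@dataclass
class SketchWriteResult:
    # Illustrative fields only; not the real WriteResult schema.
    num_rows: int = 0
    size_bytes: int = 0

    @classmethod
    def combine(cls, *results: "SketchWriteResult") -> "SketchWriteResult":
        # Aggregate any number of per-bundle results into one total.
        total = cls()
        for r in results:
            total.num_rows += r.num_rows
            total.size_bytes += r.size_bytes
        return total


# Works with both lists and generator expressions at the call site.
parts = (SketchWriteResult(n, n * 4) for n in (1, 2, 3))
agg = SketchWriteResult.combine(*parts)
print(agg.num_rows, agg.size_bytes)  # 6 24
```

Accepting `*results` keeps the call site flexible, at the cost of the unpacking building an argument tuple; a variant taking a single iterable would avoid even that.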