[data] Download op fusion / removal of interleaved partitioners by omatthew98 · Pull Request #56462 · ray-project/ray

omatthew98 · 2025-09-11T18:31:40Z

Why are these changes needed?

If a pipeline chains multiple downloads, e.g. `ds.with_column("bytes_1", download("uri_1")).with_column("bytes_2", download("uri_2")).with_column("bytes_3", download("uri_3"))`, the resulting operator structure is `URIPartitioner->URIDownloader->URIPartitioner->URIDownloader->URIPartitioner->URIDownloader`. Each `URIPartitioner` operator is implemented as an ActorPoolMapOperator with a concurrency of 1. In chained downloads these operators become bottlenecks, and scaling up their concurrency consumes additional resources that are then unavailable to other operators.

This PR solves the problem by deferring some of the partitioning to the `URIDownloader`, so the interleaved partitioners can be removed. The resulting operator structure is `URIPartitioner->URIDownloader->URIDownloader->URIDownloader`, which delivers much better performance in these cases.
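The change in operator structure can be pictured with a small toy model. This is only a sketch, not Ray Data's planner code: the plan is modeled as a plain list of operator names, and `naive_plan` and `remove_interleaved_partitioners` are hypothetical helpers invented for illustration.

```python
from typing import List

PARTITIONER = "URIPartitioner"
DOWNLOADER = "URIDownloader"


def naive_plan(num_downloads: int) -> List[str]:
    """Operator chain before the change: one partitioner per download."""
    return [PARTITIONER, DOWNLOADER] * num_downloads


def remove_interleaved_partitioners(plan: List[str]) -> List[str]:
    """Drop every partitioner that directly follows a downloader.

    The leading partitioner is kept. Hypothetical model only; the real
    planner operates on operator objects, not strings.
    """
    fused: List[str] = []
    for op in plan:
        if op == PARTITIONER and fused and fused[-1] == DOWNLOADER:
            continue  # skip a partitioner sandwiched between downloads
        fused.append(op)
    return fused


before = naive_plan(3)
after = remove_interleaved_partitioners(before)
print("->".join(before))
print("->".join(after))
```

In this model, three chained downloads go from three single-concurrency partitioner stages to one, which is the structural effect described above.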

Related issue number

Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in `doc/source/tune/api/` under the
      corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
    - [ ] Unit tests
    - [ ] Release tests
    - [ ] This PR is not tested :(

Signed-off-by: Matthew Owen <mowen@anyscale.com>

gemini-code-assist

Code Review

This pull request introduces a valuable optimization for chained download operations by removing intermediate partitioner operators, which should significantly improve performance in those scenarios. The logic to detect chained downloads and defer block splitting to the URIDownloader is well-implemented. I have identified a critical syntax error related to an invalid return statement in a generator function, and a high-severity bug in the block splitting logic that could result in an incorrect number of output blocks. Addressing these issues will ensure the correctness and robustness of this new optimization.
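The review does not show the block splitting code, so the following is only a generic illustration of how a split can produce the wrong number of pieces, not the PR's implementation. One common pitfall is using a fixed chunk size of `ceil(len(rows) / n)`: with 9 rows and 4 requested pieces, that yields only 3 chunks. The hypothetical helper `split_into_n` below distributes the remainder instead, so it yields exactly `n` non-empty chunks whenever there are at least `n` rows.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def split_into_n(rows: Sequence[T], n: int) -> Iterator[List[T]]:
    """Yield contiguous chunks of `rows`, exactly min(n, len(rows)) of them.

    Illustrative helper, not part of Ray Data.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    n = min(n, len(rows))  # never emit empty chunks
    if n == 0:
        return  # empty input: yield nothing
    base, extra = divmod(len(rows), n)
    start = 0
    for i in range(n):
        # The first `extra` chunks each take one additional row.
        size = base + (1 if i < extra else 0)
        yield list(rows[start:start + size])
        start += size


chunks = list(split_into_n(list(range(9)), 4))
print([len(c) for c in chunks])
```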

python/ray/data/_internal/planner/plan_download_op.py

bveeramani

Gemini comments, but otherwise LGTM

gvspraveen · 2025-09-11T18:59:45Z

Could you also add a test for the multiple chained downloads use case?



omatthew98 added 2 commits September 11, 2025 11:13

add in removal of interleaved partitioners

11172d2

Signed-off-by: Matthew Owen <mowen@anyscale.com>

minor tweaks

6d44079

Signed-off-by: Matthew Owen <mowen@anyscale.com>

omatthew98 requested a review from a team as a code owner September 11, 2025 18:31

omatthew98 requested a review from bveeramani September 11, 2025 18:32

gemini-code-assist bot reviewed Sep 11, 2025


bveeramani approved these changes Sep 11, 2025


ray-gardener bot added the "data" label (Ray Data-related issues) Sep 11, 2025

omatthew98 added 2 commits September 11, 2025 13:27

pr feedback

41058e6

Signed-off-by: Matthew Owen <mowen@anyscale.com>

adding in test

1758c8f

Signed-off-by: Matthew Owen <mowen@anyscale.com>

omatthew98 added the "go" label (add ONLY when ready to merge, run all tests) Sep 11, 2025

bveeramani merged commit a9a57a6 into ray-project:master Sep 11, 2025
6 checks passed

bveeramani merged 4 commits into ray-project:master from omatthew98:mowen/download-op-fusion
