Skip to content

[data] Adding in updated code to from uris release test#56091

Merged
bveeramani merged 8 commits intoray-project:masterfrom
omatthew98:mowen/update-from-uris-release-test
Sep 5, 2025
Merged

[data] Adding in updated code to from uris release test#56091
bveeramani merged 8 commits intoray-project:masterfrom
omatthew98:mowen/update-from-uris-release-test

Conversation

@omatthew98
Copy link
Copy Markdown
Contributor

Why are these changes needed?

Update the from uris release test to use the new code from #55824.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Matthew Owen <mowen@anyscale.com>
@omatthew98 omatthew98 requested a review from bveeramani August 29, 2025 18:59
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the read_from_uris benchmark to use the new ray.data.expressions.download API. The change is a good simplification, but I've found a critical performance issue due to the removal of dataset repartitioning, which would cause the benchmark to run serially. I've also added a minor suggestion to improve code conciseness. Please review the detailed comments.

Comment on lines +25 to +28
images = []
for b in batch["image_bytes"]:
image = Image.open(io.BytesIO(b)).convert("RGB")
images.append(np.array(image))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This for-loop can be written more concisely and idiomatically using a list comprehension, which improves readability.

Suggested change
images = []
for b in batch["image_bytes"]:
image = Image.open(io.BytesIO(b)).convert("RGB")
images.append(np.array(image))
images = [
np.array(Image.open(io.BytesIO(b)).convert("RGB"))
for b in batch["image_bytes"]
]

@ray-gardener ray-gardener bot added data Ray Data-related issues release-test release test labels Aug 29, 2025
Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: Matthew Owen <mowen@anyscale.com>
return table.set_column(table.schema.get_field_index("key"), "key", new_col)

ds = metadata.map_batches(convert_key, batch_format="pyarrow")
ds = ds.with_column("image_bytes", download("key"))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is batch_format necessary?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't explicitly necessary, I could rewrite the function to use default/numpy batches. The function I already was testing with used PyArrow batches in the convert key function so specifying the batches is necessary to work with that.

Signed-off-by: Matthew Owen <mowen@anyscale.com>
@omatthew98 omatthew98 requested a review from a team as a code owner September 2, 2025 19:17
Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: Matthew Owen <mowen@anyscale.com>
@omatthew98
Copy link
Copy Markdown
Contributor Author

Test Results Updated Test in PR:

  • Fixed size: 229.85s (build)
  • Autoscaling: 500.54s (build)

Test Results Latest Master:

  • Fixed size: 1283.89s (build)
  • Autoscaling: 1334.39s (build)

@omatthew98 omatthew98 added the go add ONLY when ready to merge, run all tests label Sep 2, 2025

ds = metadata.map_batches(convert_key, batch_format="pyarrow")
ds = ds.with_column("image_bytes", download("key"))
ds = ds.map_batches(decode_images)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: i think this might be simpler as map

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that might be simpler, but my thought was when we have the decode API we should be applying that with a map batches similar to this so better to update the test to match that pattern.

compute_strategy=partition_compute, # Use actor-based compute for callable class
ray_remote_args=op._ray_remote_args,
ray_remote_args_fn=op._ray_remote_args_fn,
supports_fusion=False, # Prevent fusion with upstream operations that use different compute strategies
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a drive-by bug fix?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, upstream operators could fuse with the download operator and it was causing an assertion to fail.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was the previous supports_fusion=True also causing performance issues?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assertion was failing

 File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 382, in _wrapped_udf_map_fn
    assert ray.data._map_actor_context is not None

so the code could not run / there wasn't really any performance to compare.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh i see this wasn't a test

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@omatthew98 is there a more holistic fix we should make? Like, is disabling fusion the best move long-term?

Confused why this happens in this case but not with regular map operator fusion.

Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: Matthew Owen <mowen@anyscale.com>
ray_remote_args: Optional[Dict[str, Any]] = None,
):
super().__init__("Download", input_op, ray_remote_args=ray_remote_args)
from ray.data._internal.compute import ActorPoolStrategy
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bveeramani Added a better fix here. TLDR by not specifying the compute as we initialize the logical operator it defaults to the TaskPoolStrategy for compute causing us to improperly fuse with other TaskPoolStrategy operators. Since the first operator will be implemented with a callable class / actor, we should use ActorPoolStrategy as the compute for the download logical operator.

@bveeramani bveeramani merged commit 4a9c49c into ray-project:master Sep 5, 2025
5 checks passed
sampan-s-nayak pushed a commit to sampan-s-nayak/ray that referenced this pull request Sep 8, 2025
…56091)

## Why are these changes needed?
Update the from uris release test to use the new code from
ray-project#55824.

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: sampan <sampan@anyscale.com>
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
…56091)

## Why are these changes needed?
Update the from uris release test to use the new code from
ray-project#55824.

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
wyhong3103 pushed a commit to wyhong3103/ray that referenced this pull request Sep 12, 2025
…56091)

## Why are these changes needed?
Update the from uris release test to use the new code from
ray-project#55824.

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
ZacAttack pushed a commit to ZacAttack/ray that referenced this pull request Sep 24, 2025
…56091)

## Why are these changes needed?
Update the from uris release test to use the new code from
ray-project#55824.

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: zac <zac@anyscale.com>
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
## Why are these changes needed?
Update the from uris release test to use the new code from
#55824.

## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…56091)

## Why are these changes needed?
Update the from uris release test to use the new code from
ray-project#55824.

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Matthew Owen <mowen@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests release-test release test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants