Skip to content

[Data] Update image embedding benchmark to use download#56245

Merged
bveeramani merged 6 commits intomasterfrom
update-embed-benchmark
Sep 12, 2025
Merged

[Data] Update image embedding benchmark to use download#56245
bveeramani merged 6 commits intomasterfrom
update-embed-benchmark

Conversation

@bveeramani
Copy link
Copy Markdown
Member

@bveeramani bveeramani commented Sep 4, 2025

Why are these changes needed?

The existing image embedding release test downloads images from URIs in a pandas DataFrame. Previously, it used a boto-based UDF, but now it’s updated to use the built-in Ray API introduced in #55824.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani bveeramani marked this pull request as draft September 4, 2025 17:04
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the image embedding benchmark by refactoring the data loading pipeline to use the new ray.data.expressions.download feature. This simplifies the code and likely improves performance. Additionally, the benchmark now uses a real embedding model instead of a fake one, which makes the results more realistic and valuable. My review includes a high-severity comment about a potential performance issue due to the removal of explicit parallelism control, which could affect the validity of the benchmark results.

@richardliaw richardliaw added the data Ray Data-related issues label Sep 5, 2025
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani bveeramani marked this pull request as ready for review September 11, 2025 19:43
@bveeramani bveeramani added the release-test release test label Sep 11, 2025
@bveeramani bveeramani changed the title [Data] Update image embedding benchmark [Data] Update image embedding benchmark to use download Sep 11, 2025
Copy link
Copy Markdown
Contributor

@omatthew98 omatthew98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating!

@@ -0,0 +1,21 @@
cloud_id: {{env["ANYSCALE_CLOUD_ID"]}}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we are forking from the existing compute configs that we previously used?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Motivation is to decouple from all of the other release tests so we can change them more easily (e.g., if we add an optimization and are able to use smaller nodes)

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
…ct/ray into update-embed-benchmark

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani bveeramani enabled auto-merge (squash) September 12, 2025 00:58
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Sep 12, 2025
@bveeramani bveeramani merged commit c05037e into master Sep 12, 2025
7 checks passed
@bveeramani bveeramani deleted the update-embed-benchmark branch September 12, 2025 01:31
ZacAttack pushed a commit to ZacAttack/ray that referenced this pull request Sep 24, 2025
…t#56245)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

<!-- Please give a short summary of the change and the problem this
solves. -->

The existing image embedding release test downloads images from URIs in
a pandas DataFrame. Previously, it used a boto-based UDF, but now it’s
updated to use the built-in Ray API introduced in
ray-project#55824.

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: zac <zac@anyscale.com>
marcostephan pushed a commit to marcostephan/ray that referenced this pull request Sep 24, 2025
…t#56245)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

<!-- Please give a short summary of the change and the problem this
solves. -->

The existing image embedding release test downloads images from URIs in
a pandas DataFrame. Previously, it used a boto-based UDF, but now it’s
updated to use the built-in Ray API introduced in
ray-project#55824.

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Marco Stephan <marco@magic.dev>
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

<!-- Please give a short summary of the change and the problem this
solves. -->

The existing image embedding release test downloads images from URIs in
a pandas DataFrame. Previously, it used a boto-based UDF, but now it’s
updated to use the built-in Ray API introduced in
#55824.

## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
…t#56245)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

<!-- Please give a short summary of the change and the problem this
solves. -->

The existing image embedding release test downloads images from URIs in
a pandas DataFrame. Previously, it used a boto-based UDF, but now it’s
updated to use the built-in Ray API introduced in
ray-project#55824.

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
snorkelopstesting2-coder pushed a commit to snorkel-marlin-repos/ray-project_ray_pr_56245_9294f0e8-3a0c-400a-8b54-adb85c4d0aa3 that referenced this pull request Oct 22, 2025
snorkelopstesting1-a11y added a commit to snorkel-marlin-repos/ray-project_ray_pr_56245_9294f0e8-3a0c-400a-8b54-adb85c4d0aa3 that referenced this pull request Oct 22, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…t#56245)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

<!-- Please give a short summary of the change and the problem this
solves. -->

The existing image embedding release test downloads images from URIs in
a pandas DataFrame. Previously, it used a boto-based UDF, but now it’s
updated to use the built-in Ray API introduced in
ray-project#55824.

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests release-test release test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants