[data] Adding in updated code to from uris release test#56091
[data] Adding in updated code to from uris release test#56091bveeramani merged 8 commits intoray-project:masterfrom
Conversation
Signed-off-by: Matthew Owen <mowen@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request updates the read_from_uris benchmark to use the new ray.data.expressions.download API. The change is a good simplification, but I've found a critical performance issue due to the removal of dataset repartitioning, which would cause the benchmark to run serially. I've also added a minor suggestion to improve code conciseness. Please review the detailed comments.
| images = [] | ||
| for b in batch["image_bytes"]: | ||
| image = Image.open(io.BytesIO(b)).convert("RGB") | ||
| images.append(np.array(image)) |
There was a problem hiding this comment.
This for-loop can be written more concisely and idiomatically using a list comprehension, which improves readability.
| images = [] | |
| for b in batch["image_bytes"]: | |
| image = Image.open(io.BytesIO(b)).convert("RGB") | |
| images.append(np.array(image)) | |
| images = [ | |
| np.array(Image.open(io.BytesIO(b)).convert("RGB")) | |
| for b in batch["image_bytes"] | |
| ] |
Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: Matthew Owen <mowen@anyscale.com>
| return table.set_column(table.schema.get_field_index("key"), "key", new_col) | ||
|
|
||
| ds = metadata.map_batches(convert_key, batch_format="pyarrow") | ||
| ds = ds.with_column("image_bytes", download("key")) |
There was a problem hiding this comment.
why is batch_format necessary?
There was a problem hiding this comment.
It isn't explicitly necessary, I could rewrite the function to use default/numpy batches. The function I already was testing with used PyArrow batches in the convert key function so specifying the batches is necessary to work with that.
Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: Matthew Owen <mowen@anyscale.com>
|
|
||
| ds = metadata.map_batches(convert_key, batch_format="pyarrow") | ||
| ds = ds.with_column("image_bytes", download("key")) | ||
| ds = ds.map_batches(decode_images) |
There was a problem hiding this comment.
Nit: i think this might be simpler as map
There was a problem hiding this comment.
Agree that might be simpler, but my thought was when we have the decode API we should be applying that with a map batches similar to this so better to update the test to match that pattern.
| compute_strategy=partition_compute, # Use actor-based compute for callable class | ||
| ray_remote_args=op._ray_remote_args, | ||
| ray_remote_args_fn=op._ray_remote_args_fn, | ||
| supports_fusion=False, # Prevent fusion with upstream operations that use different compute strategies |
There was a problem hiding this comment.
Yeah, upstream operators could fuse with the download operator and it was causing an assertion to fail.
There was a problem hiding this comment.
was the previous supports_fusion=True also causing performance issues?
There was a problem hiding this comment.
This assertion was failing
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 382, in _wrapped_udf_map_fn
assert ray.data._map_actor_context is not None
so the code could not run / there wasn't really any performance to compare.
There was a problem hiding this comment.
oh i see this wasn't a test
There was a problem hiding this comment.
@omatthew98 is there a more holistic fix we should make? Like, is disabling fusion the best move long-term?
Confused why this happens in this case but not with regular map operator fusion.
Signed-off-by: Matthew Owen <mowen@anyscale.com>
| ray_remote_args: Optional[Dict[str, Any]] = None, | ||
| ): | ||
| super().__init__("Download", input_op, ray_remote_args=ray_remote_args) | ||
| from ray.data._internal.compute import ActorPoolStrategy |
There was a problem hiding this comment.
@bveeramani Added a better fix here. TLDR by not specifying the compute as we initialize the logical operator it defaults to the TaskPoolStrategy for compute causing us to improperly fuse with other TaskPoolStrategy operators. Since the first operator will be implemented with a callable class / actor, we should use ActorPoolStrategy as the compute for the download logical operator.
…56091) ## Why are these changes needed? Update the from uris release test to use the new code from ray-project#55824. ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Matthew Owen <mowen@anyscale.com> Signed-off-by: sampan <sampan@anyscale.com>
…56091) ## Why are these changes needed? Update the from uris release test to use the new code from ray-project#55824. ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Matthew Owen <mowen@anyscale.com> Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
…56091) ## Why are these changes needed? Update the from uris release test to use the new code from ray-project#55824. ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Matthew Owen <mowen@anyscale.com> Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
…56091) ## Why are these changes needed? Update the from uris release test to use the new code from ray-project#55824. ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Matthew Owen <mowen@anyscale.com> Signed-off-by: zac <zac@anyscale.com>
## Why are these changes needed? Update the from uris release test to use the new code from #55824. ## Related issue number <!-- For example: "Closes #1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Matthew Owen <mowen@anyscale.com> Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
…56091) ## Why are these changes needed? Update the from uris release test to use the new code from ray-project#55824. ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Matthew Owen <mowen@anyscale.com>
Why are these changes needed?
Update the from uris release test to use the new code from #55824.
Related issue number
Checks
git commit -s) in this PR.scripts/format.shto lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.