[Compiled Graph] Enhance Compile Graph with Multi-Device Support #53395
jjyao merged 2 commits into ray-project:master
Conversation
f34d469 to 9450c7c
Did you override this environment
76ff1d8 to 029d099
Fix #53263. The issue occurred in the `deserialize_from_numpy_or_scalar` function. Even when `with_tensor_transport` explicitly specifies CUDA, the function still automatically selects the currently available accelerator. In this particular scenario, since no GPU was available, it fell back to using a CPU tensor.
Fix #53267. This part of the code was removed in #51032, and the default device was permanently set to cuda:0. This was done because, in general, gpu_ids are aligned with CUDA_VISIBLE_DEVICES, so the logical GPU ID used by the actor is always 0. This behavior works correctly in all current test cases. However, when Ray Serve is used with vLLM, vLLM resets the value of CUDA_VISIBLE_DEVICES, which causes the logical GPU ID inside the actor to no longer be 0. As a result, tensors may be created on the wrong GPU, leading to an invalid memory access.
I tried to add a test case, but I couldn’t reproduce the issue. Could you give me some suggestions?
@ruisearch42 Good day. The issue caused by #51032 has been fixed. Could you please run all necessary test cases? (The failing test suite seems unrelated to this change; could you please retry it?) Thanks.
Triggering the tests.
…-project#51032) This PR improves multi-device support in Compile Graph, which significantly reduces tensor transmission latency by utilizing out-of-band communication. Currently, this feature only supports CUDA's NCCL. Since Ray already supports multiple accelerators, it is necessary to extend Compile Graph to support multiple devices as well. This PR mainly introduces two key changes:

1. Removed the dependency on `cupy.cuda.ExternalStream`. Since this library only supports CUDA devices, we replaced it with a more general stream context manager to accommodate various accelerators. The new implementation uses `torch.{device}.StreamContext`.
2. Replaced hardcoded `torch.cuda.xxx` calls with `AcceleratorRuntime`, which automatically detects the accelerator type and invokes the appropriate device-specific functions.

```python
import ray
import torch
import torch_npu
from ray.dag import InputNode
from ray.experimental.channel.hccl_group import _HcclGroup
from ray.experimental.channel.accelerator_context import register_accelerator_context


@ray.remote
class TorchTensorWorker:
    def __init__(self):
        self.device = torch.device("npu:0")
        torch.npu.set_device(self.device)

    def send(self, shape, dtype, value: int):
        return torch.ones(shape, dtype=dtype, device=self.device) * value

    def recv(self, tensor):
        return (tensor[0].item(), tensor.shape, tensor.dtype)


register_accelerator_context("npu", _HcclGroup)

actor_cls = TorchTensorWorker.options(num_cpus=0, resources={"NPU": 1})
sender = actor_cls.remote()
receiver = actor_cls.remote()

with InputNode() as inp:
    dag = sender.send.bind(inp.shape, inp.dtype, inp[0])
    dag = dag.with_tensor_transport(transport="nccl")
    dag = receiver.recv.bind(dag)

shape = (10,)
dtype = torch.float16
compiled_dag = dag.experimental_compile()
for i in range(3):
    ref = compiled_dag.execute(i, shape=shape, dtype=dtype)
    assert ray.get(ref) == (i, shape, dtype)
print("Success")
```

This PR is the main part of Task 2 in ray-project#51574. It would be better to make the function name more general, such as changing `requires_nccl` to `require_communicator`; this is implemented in ray-project#51061.

Signed-off-by: hipudding <huafengchun@gmail.com>
Co-authored-by: noemotiovon <757486878@qq.com>
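The first key change, replacing the CUDA-only `cupy.cuda.ExternalStream` with a backend-agnostic stream context, can be sketched roughly as follows. This is a simplified illustration, not the PR's exact code: the helper name `stream_context` and the lookup-by-attribute-name pattern are assumptions, but they capture the `torch.{device}.StreamContext` idea described above.

```python
def stream_context(torch_module, device_type, stream):
    """Return a stream context for an arbitrary accelerator backend.

    Instead of hardcoding a CUDA-only wrapper, fetch the torch backend
    submodule by name (torch.cuda, torch.npu, ...) and use its
    StreamContext uniformly for any device type.
    """
    backend = getattr(torch_module, device_type)  # e.g. torch.cuda or torch.npu
    return backend.StreamContext(stream)
```

With a real torch install, `stream_context(torch, "cuda", some_stream)` would produce the same context manager as `torch.cuda.StreamContext(some_stream)`, and other backends such as `npu` plug in purely by name.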
This commit fixes two issues: 1. Fixed the issue where `with_tensor_transport` would automatically select the device based on the environment, even when CUDA was explicitly specified. 2. Fixed an NCCL "invalid memory access" error that occurred when Ray Serve was used with vLLM and PP > 1. Signed-off-by: hipudding <huafengchun@gmail.com>
Hi @ruisearch42 @stephanie-wang, could you please help review this PR when you have time? Thanks!
Morning @ruisearch42 @stephanie-wang. Just a gentle reminder regarding this PR. I understand everyone is quite busy, but I'd really appreciate it if you could take a look when you have a moment. Please let me know if there's anything I should address or clarify in the meantime. BTW, is there a way to run all tests (including post-merge tests) before merging? Thanks.
Hi @hipudding, thanks for the fix! I will review it tomorrow!
ruisearch42
left a comment
LG. Thanks for the fix.
llm_serve_correctness release test passed: https://buildkite.com/ray-project/release/builds/44926#01974b51-5345-4461-8bf0-ad7994da8432
cc @jjyao for merging.
@hipudding Could you add more details to the PR description, e.g., the test failures observed and how this fixes the issues. cc @jjyao
Sure. |
Why are these changes needed?
This PR is the re-merge of #51032: the first commit is the content from #51032; the second commit contains the fix for #53267 and for the test failures mentioned in #53263.
Fix #53263 (Revert "[Compiled Graph] Enhance Compile Graph with Multi-Device Support (#51032)"). This issue was discovered by the `test_torch_tensor_transport_gpu` test case. Since the pre-merge checks did not run GPU tests, the problem was not detected before the merge. The issue occurred in the `deserialize_from_numpy_or_scalar` function. Even when `with_tensor_transport` explicitly specifies CUDA, the function still automatically selects the currently available accelerator. In this particular scenario, since no GPU was available, it fell back to using a CPU tensor, which caused assertion failures in some test cases. The fix is to avoid auto-selecting an accelerator when a specific CUDA device is already specified.
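The device-selection fix can be illustrated with a small pure-Python sketch. The helper name `resolve_device` is hypothetical, a stand-in for the logic inside `deserialize_from_numpy_or_scalar`: honor an explicitly specified device, and only auto-detect when none was given.

```python
def resolve_device(explicit_device=None, available_accelerators=()):
    """Pick the target device for a deserialized tensor.

    If the caller (e.g. via with_tensor_transport) explicitly specified a
    device, use it verbatim. Otherwise fall back to the first available
    accelerator, and finally to the CPU.
    """
    if explicit_device is not None:
        return explicit_device
    return available_accelerators[0] if available_accelerators else "cpu"
```

Before the fix, the auto-detection branch ran even when a device was specified, so on a GPU-less host an explicit "cuda:0" silently became "cpu" and downstream assertions failed.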
Fix #53267 (Release test llm_serve_correctness failed). This issue was discovered by the llm_serve_correctness test case. Since the pre-merge checks did not run the release tests, the problem was not detected before the merge. The root cause is the removal of the `get_devices` function in #51032, which permanently set the default device to cuda:0. This was done because, in general, gpu_ids are aligned with CUDA_VISIBLE_DEVICES, so the logical GPU ID used by the actor is always 0, and the change simplified the handling logic for different backends. This behavior works correctly in all current test cases. However, when Ray Serve is used with vLLM, vLLM resets the value of CUDA_VISIBLE_DEVICES, so the logical GPU ID inside the actor is no longer 0. As a result, tensors may be created on the wrong GPU, leading to an invalid memory access. The fix is to restore the `get_devices` logic and refactor it to support multiple devices.
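The CUDA_VISIBLE_DEVICES interaction can be sketched as below. This is a hypothetical, simplified stand-in for the restored `get_devices` logic, showing why the logical device index is the GPU's position in CUDA_VISIBLE_DEVICES rather than always 0.

```python
import os


def get_local_device_indices(assigned_gpu_ids):
    """Map Ray-assigned physical GPU ids to the logical indices torch sees.

    When CUDA_VISIBLE_DEVICES is set (and possibly rewritten by a library
    such as vLLM), the logical index of a physical GPU is its position in
    that list; it is not guaranteed to be 0.
    """
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        # No masking: physical and logical ids coincide.
        return [int(i) for i in assigned_gpu_ids]
    visible_ids = [v.strip() for v in visible.split(",")]
    return [visible_ids.index(str(i)) for i in assigned_gpu_ids]
```

For example, with CUDA_VISIBLE_DEVICES="2,3", physical GPU 3 is logical device 1, so hardcoding cuda:0 would place tensors on the wrong GPU.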
The GPU tests and release tests have already been triggered for this PR, and the previously failing test cases now pass. If any remaining test cases haven't been executed, please help trigger them. Thanks a lot!
Related issue number
Closes #53267, #53263
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.