[Compiled Graph] Enhance Compile Graph with Multi-Device Support #53395
jjyao merged 2 commits into ray-project:master
Conversation
f34d469 to 9450c7c
Did you override this environment
76ff1d8 to 029d099
Fix #53263. The issue occurred in the `deserialize_from_numpy_or_scalar` function. Even when `with_tensor_transport` explicitly specifies CUDA, the function still automatically selects the currently available accelerator. In this particular scenario, since no GPU was available, it fell back to using a CPU tensor.
Fix #53267. This part of the code was removed in #51032, and the default device was permanently set to cuda:0. This was done because, in general, gpu_ids are aligned with CUDA_VISIBLE_DEVICES, so the logical GPU ID used by the actor is always 0. This behavior works correctly in all current test cases. However, when Ray Serve is used with vLLM, vLLM resets the value of CUDA_VISIBLE_DEVICES, which causes the logical GPU ID inside the actor to no longer be 0. As a result, tensors may be created on the wrong GPU, leading to an invalid memory access.
I tried to add a test case, but I couldn’t reproduce the issue. Could you give me some suggestions?
@ruisearch42 Good day. The issue caused by #51032 has been fixed. Could you please run all necessary test cases? (The failing test suite seems unrelated to this change; could you please retry it?) Thanks.
Triggering the tests.
…-project#51032) This PR improves multi-device support in Compile Graph, which significantly reduces tensor transmission latency by utilizing out-of-band communication. Currently, this feature only supports CUDA's NCCL. Since Ray already supports multiple accelerators, it is necessary to extend Compile Graph to support multiple devices as well. This PR mainly introduces two key changes:

1. Removed the dependency on `cupy.cuda.ExternalStream`. Since this library only supports CUDA devices, we replaced it with a more general stream context manager to accommodate various accelerators. The new implementation uses `torch.{device}.StreamContext`.
2. Replaced hardcoded `torch.cuda.xxx` calls with `AcceleratorRuntime`, which automatically detects the accelerator type and invokes the appropriate device-specific functions.

```python
import ray
import torch
import torch_npu
from ray.dag import InputNode
from ray.experimental.channel.hccl_group import _HcclGroup
from ray.experimental.channel.accelerator_context import register_accelerator_context


@ray.remote
class TorchTensorWorker:
    def __init__(self):
        self.device = torch.device("npu:0")
        torch.npu.set_device(self.device)

    def send(self, shape, dtype, value: int):
        return torch.ones(shape, dtype=dtype, device=self.device) * value

    def recv(self, tensor):
        return (tensor[0].item(), tensor.shape, tensor.dtype)


register_accelerator_context("npu", _HcclGroup)

actor_cls = TorchTensorWorker.options(num_cpus=0, resources={"NPU": 1})
sender = actor_cls.remote()
receiver = actor_cls.remote()

with InputNode() as inp:
    dag = sender.send.bind(inp.shape, inp.dtype, inp[0])
    dag = dag.with_tensor_transport(transport="nccl")
    dag = receiver.recv.bind(dag)

shape = (10,)
dtype = torch.float16
compiled_dag = dag.experimental_compile()
for i in range(3):
    ref = compiled_dag.execute(i, shape=shape, dtype=dtype)
    assert ray.get(ref) == (i, shape, dtype)
print("Success")
```

This PR is the main part of Task 2 in ray-project#51574. It would be better to make the function name more general, such as changing `requires_nccl` to `require_communicator`; this is implemented in ray-project#51061.

Signed-off-by: hipudding <huafengchun@gmail.com>
Co-authored-by: noemotiovon <757486878@qq.com>
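The first key change, replacing the CUDA-only `cupy.cuda.ExternalStream` with a backend-agnostic stream context, can be sketched roughly as follows. This is a simplified illustration, not the PR's exact code: the helper name `stream_context` and the lookup-by-attribute-name pattern are assumptions, but they capture the `torch.{device}.StreamContext` idea described above.

```python
def stream_context(torch_module, device_type, stream):
    """Return a stream context for an arbitrary accelerator backend.

    Instead of hardcoding a CUDA-only wrapper, fetch the torch backend
    submodule by name (torch.cuda, torch.npu, ...) and use its
    StreamContext uniformly for any device type.
    """
    backend = getattr(torch_module, device_type)  # e.g. torch.cuda or torch.npu
    return backend.StreamContext(stream)
```

With a real torch install, `stream_context(torch, "cuda", some_stream)` would produce the same context manager as `torch.cuda.StreamContext(some_stream)`, and other backends such as `npu` plug in purely by name.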
This commit fixes two issues: 1. Fixed the issue where `with_tensor_transport` would automatically select the device based on the environment, even when CUDA was explicitly specified. 2. Fixed an NCCL "invalid memory access" error that occurred when Ray Serve was used with vLLM and PP > 1. Signed-off-by: hipudding <huafengchun@gmail.com>
Hi @ruisearch42 @stephanie-wang, could you please help review this PR when you have time? Thanks!
Morning @ruisearch42 @stephanie-wang. Just a gentle reminder regarding this PR. I understand everyone is quite busy, but I'd really appreciate it if you could take a look when you have a moment. Please let me know if there's anything I should address or clarify in the meantime. BTW, is there a way to run all tests (including post-merge tests) before merging? Thanks.
Hi @hipudding, thanks for the fix! I will review it tomorrow!
ruisearch42
left a comment
LG. Thanks for the fix.
llm_serve_correctness release test passed: https://buildkite.com/ray-project/release/builds/44926#01974b51-5345-4461-8bf0-ad7994da8432
cc @jjyao for merging.
@hipudding Could you add more details to the PR description, e.g., the test failures observed and how this fixes the issues. cc @jjyao
Sure. |
Why are these changes needed?
This PR is the re-merge of #51032: the first commit is the content from #51032; the second commit contains the fix for #53267 and for the test failures mentioned in #53263.
Fix #53263 (Revert "[Compiled Graph] Enhance Compile Graph with Multi-Device Support (#51032)"). This issue was discovered by the `test_torch_tensor_transport_gpu` test case. Since the pre-merge checks did not run GPU tests, the problem was not detected before the merge. The issue occurred in the `deserialize_from_numpy_or_scalar` function. Even when `with_tensor_transport` explicitly specifies CUDA, the function still automatically selects the currently available accelerator. In this particular scenario, since no GPU was available, it fell back to using a CPU tensor, which caused assertion failures in some test cases. The fix is to avoid auto-selecting an accelerator when a specific CUDA device is already specified.
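The device-selection fix can be illustrated with a small pure-Python sketch. The helper name `resolve_device` is hypothetical, a stand-in for the logic inside `deserialize_from_numpy_or_scalar`: honor an explicitly specified device, and only auto-detect when none was given.

```python
def resolve_device(explicit_device=None, available_accelerators=()):
    """Pick the target device for a deserialized tensor.

    If the caller (e.g. via with_tensor_transport) explicitly specified a
    device, use it verbatim. Otherwise fall back to the first available
    accelerator, and finally to the CPU.
    """
    if explicit_device is not None:
        return explicit_device
    return available_accelerators[0] if available_accelerators else "cpu"
```

Before the fix, the auto-detection branch ran even when a device was specified, so on a GPU-less host an explicit "cuda:0" silently became "cpu" and downstream assertions failed.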
Fix #53267 (Release test llm_serve_correctness failed). This issue was discovered by the llm_serve_correctness test case. Since the pre-merge checks did not run the release tests, the problem was not detected before the merge. The root cause is the removal of the `get_devices` function in #51032, which permanently set the default device to cuda:0. This was done because, in general, gpu_ids are aligned with CUDA_VISIBLE_DEVICES, so the logical GPU ID used by the actor is always 0, and the change simplified the handling logic for different backends. This behavior works correctly in all current test cases. However, when Ray Serve is used with vLLM, vLLM resets the value of CUDA_VISIBLE_DEVICES, so the logical GPU ID inside the actor is no longer 0. As a result, tensors may be created on the wrong GPU, leading to an invalid memory access. The fix is to restore the `get_devices` logic and refactor it to support multiple devices.
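The CUDA_VISIBLE_DEVICES interaction can be sketched as below. This is a hypothetical, simplified stand-in for the restored `get_devices` logic, showing why the logical device index is the GPU's position in CUDA_VISIBLE_DEVICES rather than always 0.

```python
import os


def get_local_device_indices(assigned_gpu_ids):
    """Map Ray-assigned physical GPU ids to the logical indices torch sees.

    When CUDA_VISIBLE_DEVICES is set (and possibly rewritten by a library
    such as vLLM), the logical index of a physical GPU is its position in
    that list; it is not guaranteed to be 0.
    """
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        # No masking: physical and logical ids coincide.
        return [int(i) for i in assigned_gpu_ids]
    visible_ids = [v.strip() for v in visible.split(",")]
    return [visible_ids.index(str(i)) for i in assigned_gpu_ids]
```

For example, with CUDA_VISIBLE_DEVICES="2,3", physical GPU 3 is logical device 1, so hardcoding cuda:0 would place tensors on the wrong GPU.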
The GPU tests and release tests have already been triggered for this PR, and the previously failing test cases now pass. If any remaining test cases haven't been executed, please help trigger them. Thanks a lot!
Related issue number
Closes #53267, #53263
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.