
[Compiled Graph] Enhance Compile Graph with Multi-Device Support #53395

Merged
jjyao merged 2 commits into ray-project:master from hipudding:cg_fix
Jun 10, 2025

Conversation

@hipudding
Contributor

@hipudding hipudding commented May 29, 2025

Why are these changes needed?

This PR is the re-merge of #51032: the first commit is the content from #51032, and the second commit contains the fix for #53267 and the test failures mentioned in #53263.

  1. Fix Revert "[Compiled Graph] Enhance Compile Graph with Multi-Device Support (#51032)" #53263. This issue was discovered by the test_torch_tensor_transport_gpu test case. Since the pre-merge checks did not run GPU tests, the problem was not detected before the merge. The issue occurred in the deserialize_from_numpy_or_scalar function: even when with_tensor_transport explicitly specifies CUDA, the function still automatically selects the currently available accelerator. In this particular scenario, since no GPU was available, it fell back to a CPU tensor, which caused assertion failures in some test cases. The fix is to skip automatic accelerator selection when a specific CUDA device is already specified.
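The device-selection logic of the fix can be sketched as follows. This is a minimal illustration, not Ray's actual API; the function and parameter names are hypothetical:

```python
def resolve_target_device(explicit_device, detected_accelerator):
    """Pick the device for a deserialized tensor.

    Hypothetical sketch of the fix described above: honor a device
    that with_tensor_transport specified explicitly, and only fall
    back to auto-detecting an accelerator when none was given.
    """
    if explicit_device is not None:
        # An explicitly requested device (e.g. "cuda:0") must not be
        # overridden, even if no matching accelerator is available.
        return explicit_device
    # No explicit device: auto-detect, defaulting to CPU.
    return detected_accelerator if detected_accelerator else "cpu"
```

The bug was equivalent to skipping the first branch: with no GPU present, auto-detection silently produced a CPU tensor even though CUDA had been requested.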

  2. Fix Release test llm_serve_correctness failed #53267. This issue was discovered by the llm_serve_correctness test case. Since the pre-merge checks did not run the release tests, the problem was not detected before the merge. The root cause is the removal of the get_devices function in PR #51032, which permanently set the default device to cuda:0. This was done because, in general, gpu_ids are aligned with CUDA_VISIBLE_DEVICES, so the logical GPU ID used by the actor is always 0, which simplifies the handling logic for different backends. This behavior works correctly in all current test cases. However, when Ray Serve is used with vLLM, vLLM resets the value of CUDA_VISIBLE_DEVICES, so the logical GPU ID inside the actor is no longer 0. As a result, tensors may be created on the wrong GPU, leading to an invalid memory access. The fix is to restore the get_devices logic and refactor it to support multiple devices.
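The remapping problem can be illustrated with a minimal get_devices-style helper. This is a sketch under the assumption that the actor is handed physical GPU ids; it is not Ray's actual implementation:

```python
import os

def get_devices(assigned_gpu_ids):
    """Map the physical GPU ids assigned to an actor onto logical
    torch device indices, honoring CUDA_VISIBLE_DEVICES.

    If a library such as vLLM rewrites CUDA_VISIBLE_DEVICES, the
    logical index of an assigned GPU is no longer guaranteed to be
    0, so it must be recomputed instead of hardcoding "cuda:0".
    """
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    visible_ids = [int(v) for v in visible.split(",") if v]
    devices = []
    for gpu_id in assigned_gpu_ids:
        # Logical index = position of the physical id within the
        # visible-devices list (fall back to the id itself when the
        # variable is unset or does not contain the id).
        logical = visible_ids.index(gpu_id) if gpu_id in visible_ids else gpu_id
        devices.append(f"cuda:{logical}")
    return devices
```

For example, with CUDA_VISIBLE_DEVICES="2,3", physical GPU 3 maps to logical device cuda:1, not cuda:0, which is exactly the case the hardcoded default got wrong.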

The GPU tests and release tests have already been run for this PR, and the previously failing test cases now pass. If there are any remaining test cases that haven't been executed, please help trigger them. Thanks a lot!

Related issue number

Closes #53267, #53263

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@hipudding hipudding force-pushed the cg_fix branch 3 times, most recently from f34d469 to 9450c7c Compare May 29, 2025 03:42
Contributor

Did you override this environment?

Contributor Author

Fixed.

@hipudding hipudding force-pushed the cg_fix branch 2 times, most recently from 76ff1d8 to 029d099 Compare May 29, 2025 06:08
Contributor Author

Fix #53263. The issue occurred in the deserialize_from_numpy_or_scalar function: even when with_tensor_transport explicitly specifies CUDA, the function still automatically selects the currently available accelerator. In this particular scenario, since no GPU was available, it fell back to using a CPU tensor.

Contributor Author

@hipudding hipudding May 29, 2025

Fix #53267. This part of the code was removed in #51032, and the default device was permanently set to cuda:0. This was done because, in general, gpu_ids are aligned with CUDA_VISIBLE_DEVICES, so the logical GPU ID used by the actor is always 0. This behavior works correctly in all current test cases. However, when Ray Serve is used with vLLM, vLLM resets the value of CUDA_VISIBLE_DEVICES, so the logical GPU ID inside the actor is no longer 0. As a result, tensors may be created on the wrong GPU, leading to an invalid memory access.

Contributor Author

I tried to add a test case, but I couldn’t reproduce the issue. Could you give me some suggestions?

@hipudding hipudding marked this pull request as ready for review May 29, 2025 11:38
@hipudding
Contributor Author

@ruisearch42 Good day. The issue caused by #51032 has been fixed. Could you please run all necessary test cases? (The failing test suite seems unrelated to this change; could you please retry it?) Thanks.

@ruisearch42 ruisearch42 added the go add ONLY when ready to merge, run all tests label May 29, 2025
@ruisearch42
Contributor

@ruisearch42 Good day. The issue caused by #51032 has been fixed. Could you please run all necessary test cases? (The failing test suite seems unrelated to this change; could you please retry it?) Thanks.

Triggering the tests.

hipudding and others added 2 commits May 30, 2025 08:33
…-project#51032)

This PR improves multi-device support in Compile Graph, which
significantly reduces Tensor transmission latency by utilizing
out-of-band communication. Currently, this feature only supports CUDA's
NCCL. Since Ray already supports multiple accelerators, it is necessary
to extend Compile Graph to support multi-device as well.

This PR mainly introduces two key changes:
1. Removed dependency on cupy.cuda.ExternalStream - Since this library
only supports CUDA devices, we replaced it with a more general stream
context manager to accommodate various accelerators. The new
implementation uses torch.{device}.StreamContext.
2. Replaced hardcoded torch.cuda.xxx calls with AcceleratorRuntime -
This allows automatic detection of the accelerator type and invokes the
appropriate device-specific functions.
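The dispatch idea behind the second change can be sketched as a simple registry. The names and structure here are illustrative only, not the actual AcceleratorRuntime implementation in Ray:

```python
class AcceleratorRuntime:
    """Hypothetical sketch: look up device-specific hooks by device
    type instead of calling torch.cuda.* directly, so a new
    accelerator backend can be added without touching call sites."""

    _registry = {}

    @classmethod
    def register(cls, device_type, hooks):
        # Each backend registers its own implementations of the
        # same set of hooks under its device-type string.
        cls._registry[device_type] = hooks

    @classmethod
    def get(cls, device_type):
        return cls._registry[device_type]

# Backends provide the same hook names with device-specific bodies.
AcceleratorRuntime.register("cuda", {"device_name": lambda i: f"cuda:{i}"})
AcceleratorRuntime.register("npu", {"device_name": lambda i: f"npu:{i}"})
```

Call sites then resolve the runtime for the detected device type instead of assuming CUDA, which is what makes the NPU example below possible.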

```python
import ray
import torch
import torch_npu
from ray.dag import InputNode
from ray.experimental.channel.hccl_group import _HcclGroup
from ray.experimental.channel.accelerator_context import register_accelerator_context

@ray.remote
class TorchTensorWorker:
    def __init__(self):
        self.device = torch.device('npu:0')
        torch.npu.set_device(self.device)

    def send(self, shape, dtype, value: int):
        return torch.ones(shape, dtype=dtype, device=self.device) * value

    def recv(self, tensor):
        return (tensor[0].item(), tensor.shape, tensor.dtype)

register_accelerator_context('npu', _HcclGroup)

actor_cls = TorchTensorWorker.options(num_cpus=0, resources={'NPU': 1})

sender = actor_cls.remote()
receiver = actor_cls.remote()

with InputNode() as inp:
    dag = sender.send.bind(inp.shape, inp.dtype, inp[0])
    dag = dag.with_tensor_transport(transport='nccl')
    dag = receiver.recv.bind(dag)

shape = (10,)
dtype = torch.float16

compiled_dag = dag.experimental_compile()
for i in range(3):
    ref = compiled_dag.execute(i, shape=shape, dtype=dtype)
    assert ray.get(ref) == (i, shape, dtype)

print("Success")
```

This PR is the main part of Task 2 in ray-project#51574
It would be better to make the function names more general, such as changing
requires_nccl to require_communicator. This is implemented in ray-project#51061.

Signed-off-by: hipudding <huafengchun@gmail.com>
Co-authored-by: noemotiovon <757486878@qq.com>
This commit fixes two issues:
1. Fixed the issue where with_tensor_transport would automatically
   select the device based on the environment, even when CUDA was
   explicitly specified.
2. Fixed an NCCL "invalid memory access" error that occurred when
   Ray Serve was used with vLLM PP > 1.

Signed-off-by: hipudding <huafengchun@gmail.com>
@hipudding
Contributor Author

hipudding commented May 30, 2025

Hi @ruisearch42 @stephanie-wang , could you please help review this PR when you have time? Thanks !

@stephanie-wang stephanie-wang self-assigned this May 30, 2025
@hipudding
Contributor Author

Morning @ruisearch42 @stephanie-wang. Just a gentle reminder regarding this PR. I understand everyone is quite busy, but I’d really appreciate it if you could take a look when you have a moment.

Please let me know if there’s anything I should address or clarify in the meantime.

BTW, is there a way to run all tests (including post-merge tests) before merging? Thanks.

@ruisearch42
Contributor

Hi @hipudding , thanks for the fix! I will review it tomorrow!
I triggered the GPU tests and they have passed.

Contributor

@ruisearch42 ruisearch42 left a comment

LG. Thanks for the fix.

@ruisearch42
Contributor

llm_serve_correctness release test passed: https://buildkite.com/ray-project/release/builds/44926#01974b51-5345-4461-8bf0-ad7994da8432

@ruisearch42
Contributor

cc @jjyao for merging.

@ruisearch42
Contributor

@hipudding Could you add more details into the PR description: e.g., the test failures observed and how this fixes the issues. cc @jjyao

@hipudding
Contributor Author

@hipudding Could you add more details into the PR description: e.g., the test failures observed and how this fixes the issues. cc @jjyao

Sure.

@jjyao jjyao merged commit eae786c into ray-project:master Jun 10, 2025
5 checks passed
elliot-barn pushed a commit that referenced this pull request Jun 18, 2025

Signed-off-by: hipudding <huafengchun@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025

Signed-off-by: hipudding <huafengchun@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>

Labels

go add ONLY when ready to merge, run all tests

Development

Successfully merging this pull request may close these issues.

Release test llm_serve_correctness failed

5 participants