[Refactor] Rename NCCL-related items to comm_backend #51061
edoakes merged 7 commits into ray-project:master from
Conversation
|
@ruisearch42, |
|
This PR needs to follow #51574. Marking it as a draft. |
|
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
|
This PR improves multi-device support in Compiled Graph, which significantly reduces tensor transmission latency by utilizing out-of-band communication. Currently, this feature only supports CUDA's NCCL. Since Ray already supports multiple accelerators, it is necessary to extend Compiled Graph to support multiple devices as well. This PR mainly introduces two key changes:

1. Removed the dependency on `cupy.cuda.ExternalStream` – since this library only supports CUDA devices, we replaced it with a more general stream context manager to accommodate various accelerators. The new implementation uses `torch.{device}.StreamContext`.
2. Replaced hardcoded `torch.cuda.xxx` calls with `AcceleratorRuntime` – this allows automatic detection of the accelerator type and invokes the appropriate device-specific functions.

### How to add a new backend for CG?

Here's an example for Ascend NPU:

```python
import ray
import torch
import torch_npu
from ray.dag import InputNode

# implement a custom Communicator class
from ray.experimental.channel.hccl_group import _HcclGroup
from ray.experimental.channel.accelerator_context import register_accelerator_context


@ray.remote
class TorchTensorWorker:
    def __init__(self):
        self.device = torch.device("npu:0")
        torch.npu.set_device(self.device)

    def send(self, shape, dtype, value: int):
        return torch.ones(shape, dtype=dtype, device=self.device) * value

    def recv(self, tensor):
        return (tensor[0].item(), tensor.shape, tensor.dtype)


# globally register the accelerator context
register_accelerator_context("npu", _HcclGroup)

actor_cls = TorchTensorWorker.options(num_cpus=0, resources={"NPU": 1})
sender = actor_cls.remote()
receiver = actor_cls.remote()

with InputNode() as inp:
    dag = sender.send.bind(inp.shape, inp.dtype, inp[0])
    dag = dag.with_tensor_transport(transport="nccl")
    dag = receiver.recv.bind(dag)

shape = (10,)
dtype = torch.float16
compiled_dag = dag.experimental_compile()
for i in range(3):
    ref = compiled_dag.execute(i, shape=shape, dtype=dtype)
    assert ray.get(ref) == (i, shape, dtype)
print("Success")
```

This PR is the main part of Task 2 in #51574. It would be better to make the function names more general, such as changing `requires_nccl` to `require_communicator`; this is implemented in #51061.

Signed-off-by: noemotiovon <757486878@qq.com>
Co-authored-by: noemotiovon <757486878@qq.com>
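Change 1 above can be illustrated with a minimal, self-contained sketch of device-agnostic stream dispatch. Everything here (`get_stream_context`, the dummy backend module) is a hypothetical stand-in for `torch.cuda` / `torch.npu`, not Ray's actual implementation; the point is that the device module becomes a parameter instead of a hardcoded `cupy.cuda.ExternalStream`:

```python
import contextlib
from types import SimpleNamespace


def get_stream_context(device_module, stream):
    """Return the device module's stream context manager.

    With a real torch install, device_module would be torch.cuda or
    torch.npu, whose stream(...) returns a StreamContext; here we only
    rely on the module exposing a stream(...) context manager.
    """
    return device_module.stream(stream)


# Dummy backend module standing in for torch.cuda / torch.npu.
entered = []


@contextlib.contextmanager
def _dummy_stream(stream):
    entered.append(stream)  # record which stream was activated
    yield stream


backend = SimpleNamespace(stream=_dummy_stream)

with get_stream_context(backend, "stream-0") as s:
    assert s == "stream-0"
```

Because the backend module is looked up rather than imported directly, an NPU or ROCm backend plugs in without any CUDA-specific code path.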
|
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public Slack channel. If you'd like to keep this open, just leave a comment and the stale label will be removed. |
Force-pushed from d8b677a to f95998a
|
This PR is a follow-up to #51032, which introduced multi-device support in the Compiled Graph by leveraging CUDA's NCCL backend for efficient out-of-band tensor communication. While the current implementation is tightly coupled with NCCL and CUDA, the Compiled Graph runtime is now ready to support a broader spectrum of device types and collective communication backends (e.g., HCCL, RCCL). |
Force-pushed from f95998a to 291cc7c
|
Hi @ruisearch42, @hipudding, this PR aims to generalize the communication backend interface and decouple NCCL-specific logic, as a follow-up to #51032. Happy to hear any feedback or suggestions! 😊 |
ruisearch42
left a comment
Overall LGTM; initial review comments below.
python/ray/experimental/channel/torch_tensor_accelerator_channel.py
|
@ruisearch42, Thank you so much for the timely and careful review! |
|
Hi @ruisearch42, |
|
Oh apologies! Somehow I lost track of this, will review it again tomorrow. |
ruisearch42
left a comment
Thanks for the PR. The issues raised are almost all nitpicks.
python/ray/experimental/channel/torch_tensor_accelerator_channel.py
|
Hi @ruisearch42, |
Force-pushed from 9f4fcda to 8dd034a
|
Triggered tests. We can merge after they all pass. |
|
The multi-GPU test failed. Please take a look. @noemotiovon |
|
Hi @ruisearch42 , |
Force-pushed from f42ed90 to a4bf259
This commit is a follow-up to ray-project#51032, which introduced multi-device support in the Compiled Graph by leveraging CUDA's NCCL backend for efficient out-of-band tensor communication. While the current implementation is NCCL-specific, the Compiled Graph runtime is now ready to support a broader range of device types and collective communication libraries.

To prepare for this generalization, this commit introduces the following changes:

1. Refactored NCCL-specific naming and interfaces
2. Established a pluggable communication backend interface

This refactor does not change the behavior of existing NCCL-based Compiled Graph execution, but lays the foundation for enabling collective communication across diverse hardware accelerators and runtime environments.

Signed-off-by: noemotiovon <757486878@qq.com>
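A rename like the one in change 1 is typically shipped with a backward-compatible alias during a deprecation window. The sketch below is illustrative only: the decorator semantics and the `_transport` attribute are invented for this example and are not Ray's actual API; only the names `requires_nccl` and `require_communicator` come from the PR discussion.

```python
import warnings


def require_communicator(transport="nccl"):
    """Hypothetical generalized decorator: mark a function as needing
    a collective-communication backend (nccl, hccl, rccl, ...)."""
    def deco(fn):
        fn._transport = transport  # record the requested transport
        return fn
    return deco


def requires_nccl():
    # Backward-compatible alias kept during the rename window; callers
    # keep working but are nudged toward the general name.
    warnings.warn(
        "requires_nccl is deprecated; use require_communicator instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return require_communicator("nccl")


@requires_nccl()
def allreduce():
    pass


assert allreduce._transport == "nccl"
```

Existing NCCL call sites keep their behavior, which matches the commit's promise that the refactor is behavior-preserving.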
Signed-off-by: noemotiovon <757486878@qq.com>
Force-pushed from a4bf259 to 631f975
|
Thanks for fixing the issues. Triggered the GPU tests again. |
## Why are these changes needed?

### Background

This PR is a follow-up to ray-project#51032, which introduced multi-device support in the Compiled Graph by leveraging CUDA's NCCL backend for efficient out-of-band tensor communication.

While the current implementation is tightly coupled with NCCL and CUDA, the Compiled Graph runtime is now ready to support a broader spectrum of device types and collective communication backends (e.g., HCCL, RCCL).

### What does this PR do?

To enable extensibility and a backend-agnostic design, this PR introduces the following core changes:

1. **Refactored NCCL-specific naming and APIs.** NCCL-related modules, classes, and function names have been generalized to eliminate hardcoded CUDA/NCCL assumptions.
2. **Introduced a pluggable communication backend interface.** A unified abstraction layer decouples collective communication logic from any specific implementation, making it easier to support alternative collective libraries and device types in the future.

This refactor does not alter the existing behavior of NCCL-based Compiled Graph execution. All current workflows using CUDA+NCCL continue to function as before.

## Related issue number

ray-project#51574

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/

Testing Strategy:
- [x] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(

Signed-off-by: noemotiovon <757486878@qq.com>
Signed-off-by: doyoung <doyoung@anyscale.com>
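The pluggable communication backend interface described above can be sketched as a small registry keyed by device type. All names below (`Communicator`, `register_comm_backend`, `QueueComm`) are hypothetical stand-ins for illustration, not Ray's actual classes; a real backend (NCCL, HCCL, RCCL) would wrap the vendor library behind the same interface.

```python
from abc import ABC, abstractmethod


class Communicator(ABC):
    """Illustrative backend-agnostic interface (not Ray's actual class)."""

    @abstractmethod
    def send(self, value, peer_rank: int): ...

    @abstractmethod
    def recv(self, peer_rank: int): ...


# Registry mapping a device type ("cuda", "npu", ...) to a Communicator class.
_COMM_BACKENDS = {}


def register_comm_backend(device_type: str, cls) -> None:
    _COMM_BACKENDS[device_type] = cls


def get_comm_backend(device_type: str):
    try:
        return _COMM_BACKENDS[device_type]
    except KeyError:
        raise ValueError(f"No communication backend registered for {device_type!r}")


# A fake in-process backend to exercise the interface.
class QueueComm(Communicator):
    def __init__(self):
        self.mailbox = {}

    def send(self, value, peer_rank: int):
        self.mailbox[peer_rank] = value

    def recv(self, peer_rank: int):
        return self.mailbox.pop(peer_rank)


register_comm_backend("fake", QueueComm)
comm = get_comm_backend("fake")()
comm.send([1.0, 2.0], peer_rank=1)
assert comm.recv(peer_rank=1) == [1.0, 2.0]
```

Because callers only see the abstract interface, swapping NCCL for another collective library is a registration change rather than a code change, which is the extensibility goal this PR prepares for.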