[Compiled Graph] Enhance Compile Graph with Multi-Device Support #51032

Merged
stephanie-wang merged 1 commit into ray-project:master from hipudding:cupystream
May 22, 2025

Conversation

@hipudding
Contributor

@hipudding hipudding commented Mar 3, 2025

Why are these changes needed?

This PR improves multi-device support in Compiled Graph, which significantly reduces tensor transmission latency by using out-of-band communication. Currently, this feature only supports CUDA's NCCL. Since Ray already supports multiple accelerators, it is necessary to extend Compiled Graph to support multiple devices as well.

This PR mainly introduces two key changes:
1. Removed dependency on cupy.cuda.ExternalStream – Since this library only supports CUDA devices, we replaced it with a more general stream context manager to accommodate various accelerators. The new implementation uses torch.{device}.StreamContext.
2. Replaced hardcoded torch.cuda.xxx calls with AcceleratorRuntime – This allows automatic detection of the accelerator type and invokes the appropriate device-specific functions.
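The dispatch described in change 2 can be sketched as follows. This is an illustrative stand-in, not Ray's actual AcceleratorRuntime API; the fake module below only mimics the small torch.cuda/torch.npu surface used here, so the sketch runs without torch installed:

```python
from types import SimpleNamespace
from typing import Any

class AcceleratorRuntime:
    """Illustrative sketch: wraps a torch device module (torch.cuda,
    torch.npu, ...) so callers never hard-code torch.cuda.xxx."""

    def __init__(self, device_module: Any, device_type: str):
        self._module = device_module    # e.g. torch.cuda or torch.npu
        self.device_type = device_type  # e.g. "cuda" or "npu"

    def set_device(self, index: int) -> None:
        # Dispatches to torch.<device>.set_device on the real module.
        self._module.set_device(index)

    def current_stream(self) -> Any:
        # Dispatches to torch.<device>.current_stream on the real module.
        return self._module.current_stream()

# A fake device module standing in for torch.cuda / torch.npu.
fake_module = SimpleNamespace(
    set_device=lambda index: None,
    current_stream=lambda: "stream-0",
)

runtime = AcceleratorRuntime(fake_module, "npu")
```

With a real backend, `fake_module` would be `getattr(torch, "npu")` (or `"cuda"`), which is what makes the automatic detection possible.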

How do you add a new backend for CG? Here's an example for Ascend NPU:

import ray
import torch
import torch_npu
from ray.dag import InputNode
# implement a custom Communicator class
from ray.experimental.channel.hccl_group import _HcclGroup
from ray.experimental.channel.accelerator_context import register_accelerator_context

@ray.remote
class TorchTensorWorker:
    def __init__(self):
        self.device = torch.device('npu:0')
        torch.npu.set_device(self.device)

    def send(self, shape, dtype, value: int):
        return torch.ones(shape, dtype=dtype, device=self.device) * value

    def recv(self, tensor):
        return (tensor[0].item(), tensor.shape, tensor.dtype)

# global register accelerator context
register_accelerator_context('npu', _HcclGroup)

actor_cls = TorchTensorWorker.options(num_cpus=0, resources={'NPU': 1})

sender = actor_cls.remote()
receiver = actor_cls.remote()

with InputNode() as inp:
    dag = sender.send.bind(inp.shape, inp.dtype, inp[0])
    dag = dag.with_tensor_transport(transport='nccl')
    dag = receiver.recv.bind(dag)

shape = (10,)
dtype = torch.float16

compiled_dag = dag.experimental_compile()
for i in range(3):
    ref = compiled_dag.execute(i, shape=shape, dtype=dtype)
    assert ray.get(ref) == (i, shape, dtype)

print("Success")

Related issue number

This PR is the main part of Task 2 in #51574
It would be better to make the function name more general, such as changing requires_nccl to require_communicator. This is implemented in #51061.

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@hipudding hipudding marked this pull request as draft March 3, 2025 11:47
@hipudding hipudding force-pushed the cupystream branch 2 times, most recently from f63223c to 27215f3 on March 4, 2025 02:55
@hipudding
Contributor Author

@ruisearch42 Good Day. Could you please review this PR? Thanks.

@hipudding hipudding marked this pull request as ready for review March 4, 2025 06:16
@hipudding hipudding changed the title from "[WIP] Replacement for cupy.cuda.ExternalStream for hardware diversity" to "Replacement for cupy.cuda.ExternalStream for hardware diversity" Mar 4, 2025
@hipudding hipudding changed the title from "Replacement for cupy.cuda.ExternalStream for hardware diversity" to "[aDAG] Replacement for cupy.cuda.ExternalStream for hardware diversity" Mar 4, 2025
Contributor

If the method returns None, is it still necessary to mark it as an abstract method?

Contributor Author

@hipudding hipudding Mar 4, 2025

Keep this class abstract.

@hipudding hipudding force-pushed the cupystream branch 2 times, most recently from a4d06c9 to 705a10d on March 4, 2025 11:18
@jcotant1 jcotant1 added the core Issues that should be addressed in Ray Core label Mar 4, 2025
@hipudding hipudding changed the title from "[aDAG] Replacement for cupy.cuda.ExternalStream for hardware diversity" to "[aDAG] Change for acclerator diversity and add HCCL communication lib" Mar 5, 2025
@ruisearch42
Contributor

Hi @hipudding , is this ready for review? btw, aDAG was renamed to Compiled Graph: https://docs.ray.io/en/latest/ray-core/compiled-graph/ray-compiled-graph.html

@hipudding
Contributor Author

hipudding commented Mar 5, 2025

Hi @hipudding , is this ready for review? btw, aDAG was renamed to Compiled Graph: https://docs.ray.io/en/latest/ray-core/compiled-graph/ray-compiled-graph.html

Yes, I think the main functionality of this PR is ready; I may make some minor adjustments later. Thanks for your review.
I've noticed that some test cases failed. I will fix them ASAP.

@hipudding hipudding changed the title from "[aDAG] Change for acclerator diversity and add HCCL communication lib" to "[Compiled Graph] Change for acclerator diversity and add HCCL communication lib" Mar 5, 2025
@ruisearch42
Contributor

Yes, I think the main functionality of this PR is ready; I may make some minor adjustments later. Thanks for your review. I've noticed that some test cases failed. I will fix them ASAP.

Thanks. I will review tomorrow!

Contributor

@ruisearch42 ruisearch42 left a comment


Trying to better understand the PR

Comment on lines -775 to -786
with self._send_stream:
return self._write()
Contributor

Why don't we want to use context manager?

Contributor Author

@hipudding hipudding Mar 6, 2025

This stream context manager is provided by cupy, and it only supports CUDA devices, so I just removed the context manager and specified the stream in read and write.
But I could implement a stream context manager in Ray instead, so that cupy is not needed. Which is better?

Contributor Author

@hipudding hipudding Mar 6, 2025

Done. Used torch StreamContext instead, to avoid too many changes.
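The replacement can be sketched like this. `device_module` stands for `torch.cuda` or `torch.npu`, both of which provide a `StreamContext` with this interface; the fake classes below only mimic that surface so the sketch runs without torch, and the names are illustrative rather than the merged code:

```python
from typing import Any, Callable

def run_on_stream(device_module: Any, stream: Any, op: Callable[[], Any]) -> Any:
    """Run op under the device's stream context. torch.cuda.StreamContext and
    torch.npu.StreamContext share this interface, so no cupy is needed."""
    with device_module.StreamContext(stream):
        return op()

# Fake stand-ins so the sketch runs without torch installed.
class FakeStreamContext:
    def __init__(self, stream):
        self.stream = stream
    def __enter__(self):
        return self.stream
    def __exit__(self, *exc):
        return False

class FakeDeviceModule:
    StreamContext = FakeStreamContext
```

In the channel code, `op` would be the `_write()` or `_read()` call that previously ran inside `cupy.cuda.ExternalStream`.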

self._intermediate_future = future

def reset_and_wait_intermediate_future(self) -> Any:
def reset_and_wait_intermediate_future(self, stream: Any = None) -> Any:
Contributor

update doc for arg

Contributor Author

@hipudding hipudding Mar 6, 2025

Done. Reverted this part.

self,
val: Any,
wrap_in_gpu_future: bool,
stream: Any = None,
Contributor

update doc for stream

Contributor Author

@hipudding hipudding Mar 6, 2025

Done. Reverted this part.

Comment on lines +10 to +24
class _DriverGroupHolder(Communicator):
"""
Communicator place holder for Driver, Since driver may has no
Accelerators, TorchDeviceManager cannot be used.
"""
Contributor

Can you elaborate a bit why this is needed?

Contributor Author

I replaced all torch.cuda.xxx calls with TorchDeviceManager, including creating a _NcclGroup. The driver keeps a communicator instance itself: it does not join the group, but it needs to record some information such as the world size and the actor handles.
However, the driver may have no GPU (or other accelerator), so it would choose the wrong TorchDeviceManager (CPUTorchDeviceManager), and get_communicator would return None.
Since the driver does not join the group, it uses the Communicator only to store some information. So I created a hardware-independent class for the driver.
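The driver-side placeholder described here can be sketched roughly as below. Names and signatures are illustrative (not the merged `_DriverGroupHolder`): the class only records metadata and rejects data-plane calls, so it never needs a device:

```python
from typing import Any, List, Optional

class DriverGroupHolder:
    """Sketch of a hardware-independent placeholder for the driver.
    The driver never joins the collective group, so this class only
    records metadata (world size, actor handles) and raises on
    communication methods."""

    def __init__(self, world_size: int, actor_handles: List[Any],
                 rank: Optional[int] = None):
        self._world_size = world_size
        self._actor_handles = actor_handles
        self._rank = rank  # None: the driver holds no rank in the group

    def get_world_size(self) -> int:
        return self._world_size

    def get_actor_handles(self) -> List[Any]:
        return self._actor_handles

    def send(self, buf: Any, peer_rank: int) -> None:
        # The driver never participates in collective communication.
        raise NotImplementedError("driver does not participate in send/recv")

holder = DriverGroupHolder(world_size=2, actor_handles=["sender", "receiver"])
```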

Contributor Author

Comment updated.

comm_id: int,
rank: Optional[int],
actor_handles: List["ray.actor.ActorHandle"],
acl_stream: Optional[Any],
Contributor

What is acl_stream? Doc for args.

Contributor Author

acl_stream is the raw stream on Ascend NPU, just like cuda_stream on GPU.
What about naming this stream torch_stream, since acl_stream and cuda_stream are both torch streams now?

Contributor Author

Done. I changed it to npu_stream for better understanding.

def get_world_size(self) -> int:
return self._world_size

def send(self, buf: "torch.Tensor", peer_rank: int) -> None:
Contributor

Many methods in this class look like code duplication from _NcclGroup with minor changes. Should they be unified?

Contributor Author

Yes, we should indeed reduce duplicate code. I will think about how to design it.

Contributor Author

Indeed, these two files share logical similarities and contain a lot of duplicate code. However, there are several differences in the details. For example, torch.cuda.xxx needs to be replaced with torch.npu.xxx, hccl_group requires an additional StreamContext, and the set device operation must be performed before get_unique_id and creating the HCCL group. Therefore, I haven’t found a good way to merge these two files. Do you have any suggestions?
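One possible shape for such a merge (a sketch under assumed names, not something proposed in this PR) is a template-method base class: the shared logic lives once, and the device-specific differences, including the NPU requirement to set the device before creating the group, become overridable hooks. The fake subclass records call order instead of touching hardware:

```python
from typing import Any

class BaseCommGroup:
    """Shared communicator logic; device-specific steps are hooks."""

    def __init__(self, world_size: int, rank: int):
        self._world_size = world_size
        self._rank = rank
        # The ordering constraint lives here once: set-device runs before
        # group creation, which NPU requires and CUDA tolerates.
        self._setup_device()
        self._comm = self._create_group()

    def _setup_device(self) -> None:
        pass  # default backend: nothing to do before group creation

    def _create_group(self) -> Any:
        raise NotImplementedError  # each backend creates its own group

    def get_rank(self) -> int:
        return self._rank

class FakeHcclGroup(BaseCommGroup):
    """Stand-in for an HCCL-backed group; records the call ordering."""

    def __init__(self, world_size: int, rank: int):
        self.calls = []
        super().__init__(world_size, rank)

    def _setup_device(self) -> None:
        self.calls.append("set_device")    # would call torch.npu.set_device

    def _create_group(self) -> Any:
        self.calls.append("create_group")  # would init the HCCL communicator
        return object()
```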

@hipudding hipudding force-pushed the cupystream branch 2 times, most recently from 1e80d9f to d682702 on March 6, 2025 08:03
@hipudding hipudding changed the title from "[Compiled Graph] Change for acclerator diversity and add HCCL communication lib" to "[Compiled Graph] Enhance Compile Graph with Multi-Device Support and HCCL Integration" Mar 7, 2025
@hipudding hipudding requested a review from a team as a code owner March 7, 2025 06:26
@hipudding
Contributor Author

hipudding commented Mar 7, 2025

Hi @ruisearch42, Could you please review this PR again?

BTW, I fixed the test case failure in doc_test_cgraph_nccl. It was blocking this PR.

@hipudding hipudding force-pushed the cupystream branch 6 times, most recently from 61d9600 to d6d4b76 on April 29, 2025 02:24
@hipudding hipudding requested a review from noemotiovon April 29, 2025 02:27
Contributor

@noemotiovon noemotiovon left a comment

LGTM!

Contributor

Not registered yet.

Contributor Author

Got it. Big enough.

@hipudding
Contributor Author

@ruisearch42 @stephanie-wang This PR is ready for review now. Could you please review this PR again? Thanks.

@ruisearch42
Contributor

@ruisearch42 @stephanie-wang This PR is ready for review now. Could you please review this PR again? Thanks.

Thanks for the efforts! Will do!

Contributor

@stephanie-wang stephanie-wang left a comment

Thanks for the contribution! I like the new APIs. Let's check if we can reduce the driver API; eventually it would be good to use this as a single-controller API for Ray GPU objects too.

Contributor

What is "it" in this comment?

Contributor Author

Rewrote the class comment.

Comment on lines 24 to 27
Contributor

Type hints and comments.

Contributor Author

Done.

Contributor

Why use _set_context instead of just passing these args to cls() constructor?

Contributor Author

There are two steps here:
1. Set class attributes, including _torch_module_name and _communicator_cls. These attributes can be configured via register_accelerator_context; if not set, the system will automatically detect whether to use GPU or CPU.
2. Instantiate the AcceleratorContext object based on these class attributes.

If we were to use the constructor here, we would need to introduce global variables to store _torch_module_name and _communicator_cls.

Which approach do you think is better?

Contributor

Hmm, the current code is brittle: if you call AcceleratorContext._set_context, it will also modify a previously created singleton returned by AcceleratorContext.get(). It would be better to make AcceleratorContext.get() return an instance whose attributes are not tied to the class attributes. This makes it a bit nicer for testing, or if we eventually want to support multiple contexts in the same process.

I don't have a strong preference on how you should do that, although I agree it would be nice to not need global variables for the torch_module_name and communicator_cls. You could follow the DataContext pattern, which stores the defaults on the class but uses dataclass and a global default context for the current instance.
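The suggested pattern might look roughly like this (hypothetical names, not Ray's DataContext itself): defaults live on the dataclass, the current instance lives in a module-level slot, so mutating an instance never rewrites the class attributes:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AcceleratorContext:
    # Class-level defaults; each instance gets its own copies.
    torch_module_name: str = "cuda"
    communicator_cls_name: str = "_NcclGroup"

_current_context: Optional[AcceleratorContext] = None

def get_context() -> AcceleratorContext:
    """Return the process-wide context, created from defaults on first use."""
    global _current_context
    if _current_context is None:
        _current_context = AcceleratorContext()
    return _current_context

def set_context(ctx: AcceleratorContext) -> None:
    """Replace the process-wide context (e.g. during registration)."""
    global _current_context
    _current_context = ctx
```

Mutating `get_context().torch_module_name` now changes only that instance; the dataclass defaults stay intact for tests or future per-process contexts.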

Contributor Author

Done.

Contributor

Suggested change
def get_communicator(self, *args, **kwargs):
def create_communicator(self, *args, **kwargs):

Contributor Author

Done.

Contributor

Suggested change
Retrieves the communication group for collective operations.
Creates a communication group for collective operations.

Contributor Author

Done.

Contributor

Rename to do_register_accelerator_context?

Contributor Author

Done.

Comment on lines 760 to 761
Contributor

A bit confusing that this is calling _do_check_has_communicator but the variable is called has_accelerators. Can't the actors have accelerators but not a communicator created yet?

Contributor

Also I think this check is not exactly equivalent to the previous check - is it right that each actor could now have an AcceleratorContext but they could be for different devices? Do we have a unit test for this?

Contributor Author

You're right. I will fix this.

Contributor Author

@hipudding hipudding May 9, 2025

This has been changed to check whether all actors use the same accelerator, to ensure consistency across all actors. A test case was added.

Contributor

Just checking - why was it that this code could be removed?

Contributor Author

This function returns the logical IDs of the CUDA devices visible to the current worker. Currently, in CG, the device at index 0 is used as the default device, and the logical ID of the device at index 0 is also 0, so I believe calling get_cuda_devices is unnecessary. I think this part needs to be double-checked by @ruisearch42.

Contributor

Suggested change
# set custom communicator on all actors
# Set custom communicator on all actors.

Contributor Author

Done.

Comment on lines 748 to 754
Contributor

Is there a way we can take in the _communicator_cls and _torch_module_name as arguments to this function? That makes it explicit what the dependencies of this function are, instead of passing the dependencies through the accelerator context.

Contributor Author

@hipudding hipudding May 7, 2025

Is there a way we can take in the _communicator_cls and _torch_module_name as arguments to this function? That makes it explicit what the dependencies of this function are, instead of passing the dependencies through the accelerator context.

Are you suggesting that _communicator_cls and _torch_module_name should be passed as parameters to the _init_communicator function?

Contributor

Yes.

Contributor Author

Done.

@hipudding
Contributor Author

Thanks for the contribution! I like the new APIs. Let's check if we can reduce the driver API; eventually it would be good to use this as a single-controller API for Ray GPU objects too.

Sorry, I didn’t quite understand what you meant. Could you please explain this sentence again?
“Let’s check if we can reduce the driver API; eventually it would be good to use this as a single-controller API for Ray GPU objects too.”

Contributor

Suggested change
# Not registrey yet.
# Not registered yet.

Contributor Author

This part of the code has been rewritten.

@stephanie-wang
Contributor

Thanks for the contribution! I like the new APIs. Let's check if we can reduce the driver API; eventually it would be good to use this as a single-controller API for Ray GPU objects too.

Sorry, I didn’t quite understand what you meant. Could you please explain this sentence again? “Let’s check if we can reduce the driver API; eventually it would be good to use this as a single-controller API for Ray GPU objects too.”

Ah this was just the comment about removing NotImplemented methods for the CommunicatorHandle class. Eventually we would like to use this API to program collective operations in general Ray programs, not just in compiled graphs (see #51173)

@Bye-legumes
Contributor

I think this currently looks great. The assumption is that all devices that support this should also support torch streams, right? Would it be possible to at least list the APIs a device must support for all these features?

@hipudding
Contributor Author

I think this currently looks great. The assumption is that all devices that support this should also support torch streams, right? Would it be possible to at least list the APIs a device must support for all these features?

Yes, because CG currently only supports out-of-band transmission of torch.Tensor, the accelerator must support the PyTorch interface: not just the torch stream, but also the device- and event-related interfaces. I can add the backend capability requirements to the comments in accelerator_context.py.
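Such a capability check could be sketched as below. The attribute list is an assumption drawn from this discussion (stream, device, and event interfaces), not an official requirements list, and the fake module only stands in for a real torch device module:

```python
from types import SimpleNamespace
from typing import Any, List

# Assumed minimal surface a torch device module (torch.cuda, torch.npu, ...)
# must expose for Compiled Graph's out-of-band tensor transport.
REQUIRED_APIS = ("Stream", "StreamContext", "Event",
                 "current_stream", "set_device")

def missing_backend_apis(device_module: Any) -> List[str]:
    """Return the required attributes the device module does not provide."""
    return [name for name in REQUIRED_APIS if not hasattr(device_module, name)]

# Example: a fake backend module that lacks Event support.
partial_backend = SimpleNamespace(
    Stream=object, StreamContext=object,
    current_stream=lambda: None, set_device=lambda i: None,
)
```

A registration path could run such a check and fail fast with the missing names instead of erroring deep inside the channel code.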

Contributor

@stephanie-wang stephanie-wang left a comment

Thanks, this looks good! Please address the last few comments and then we can merge it.

class AcceleratorContext:
"""
Provides a unified interface for managing different accelerator backends
This eincludes stram management, event creation, device context control,
Contributor

Suggested change
This eincludes stram management, event creation, device context control,
This includes stream management, event creation, device context control,

Contributor Author

Done.

and communicator support for distributed communication.
"""

def __init__(self, torch_module_name: str, commumcator_cls: Type[Communicator]):
Contributor

Suggested change
def __init__(self, torch_module_name: str, commumcator_cls: Type[Communicator]):
def __init__(self, torch_module_name: str, communicator_cls: Type[Communicator]):

Contributor Author

Done.


Args:
torch_module_name: Name of the torch device module (e.g., "cuda", "cpu").
commumcator_cls: Class used to handle communication.
Contributor

Suggested change
commumcator_cls: Class used to handle communication.
communicator_cls: Class used to handle communication.

Contributor Author

Done.

@staticmethod
def set(accelerator_context: "AcceleratorContext") -> None:
"""
Sets the accelerator context to the default context.
Contributor

Suggested change
Sets the accelerator context to the default context.
Overwrites the default accelerator context.

Contributor Author

Done.

"""
return self._communicator_cls(*args, **kwargs)

def get_module_name(self) -> str:
Contributor

Could you change these getter methods to use @property?

Contributor Author

Done.

torch_module_name: The name of the device module under torch.
communicator: The communicator class associated with the device.
"""
accleerator_context = AcceleratorContext(torch_module_name, communicator_cls)
Contributor

Suggested change
accleerator_context = AcceleratorContext(torch_module_name, communicator_cls)
accelerator_context = AcceleratorContext(torch_module_name, communicator_cls)

Contributor Author

Done.

Comment on lines +161 to +163
if self._is_registered:
return self._torch_module_name
return None
Contributor

I found this a bit confusing: I think it is odd that you can call register multiple times and end up with multiple AcceleratorContexts that have _is_registered=True.

Instead, how about removing the _is_registered flag and just having get_module_name. You can create a _global_custom_context along with the _default_accelerator_context, and AcceleratorContext.get() will return the custom context if set else it returns the default context. This makes it more explicit which context is registered or not and whether the context is custom or not.
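That design might be sketched as follows (hypothetical names; the `"cuda"` default stands in for the auto-detected GPU-or-CPU context): the custom context, if registered, always wins, and it is explicit which one is in effect:

```python
from typing import Optional

class AcceleratorContext:
    """Minimal stand-in context carrying the torch device module name."""
    def __init__(self, torch_module_name: str):
        self.torch_module_name = torch_module_name

_custom_context: Optional[AcceleratorContext] = None
_default_context: Optional[AcceleratorContext] = None

def register_accelerator_context(ctx: AcceleratorContext) -> None:
    """Install a custom context (e.g. for an NPU backend)."""
    global _custom_context
    _custom_context = ctx

def is_accelerator_context_registered() -> bool:
    return _custom_context is not None

def get_accelerator_context() -> AcceleratorContext:
    """Custom context wins if registered; otherwise fall back to the default
    (in Ray this would be auto-detected; "cuda" here is a stand-in)."""
    global _default_context
    if _custom_context is not None:
        return _custom_context
    if _default_context is None:
        _default_context = AcceleratorContext("cuda")
    return _default_context
```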

Contributor

I fully agree with your suggestion. I have implemented _global_custom_context following your design and provided a method is_accelerator_context_registered to check whether a context is registered. The code now looks much clearer. Thank you again for your valuable feedback!


class CommunicatorHandle:
"""
A lightweight communicator handle used to store actors in the communicator.
Contributor

Suggested change
A lightweight communicator handle used to store actors in the communicator.
A lightweight communicator handle used by the driver to store handles to the actors in the communicator.

Contributor Author

Done.

@hipudding
Contributor Author

Thanks, this looks good! Please address the last few comments and then we can merge it.

Sorry for all the typos; I should have enabled local spell checking. Thank you for going through so many rounds of review and providing suggestions for improvement. I'll fix these issues as soon as possible.

@stephanie-wang
Contributor

Looks like there are just some lint errors to fix.

This commit introduces multi-device support in Compile Graph,
extending beyond CUDA's NCCL to accommodate various accelerators.
The key changes include:

1. Removed dependency on `cupy.cuda.ExternalStream`, Replaced with
   `torch.{device}.StreamContext` to support diverse hardware.

2. Replaced hardcoded `torch.cuda.xxx` calls with `AcceleratorRuntime`
   Enables automatic detection and invocation of device-specific
   functions.

Co-authored-by: hipudding <huafengchun@gmail.com>

Signed-off-by: noemotiovon <757486878@qq.com>
@hipudding
Contributor Author

hipudding commented May 22, 2025

@stephanie-wang @ruisearch42 Hi, Could you please merge this PR? Thanks.

@daiping8
Contributor

@hipudding Hello, this is a great piece of work that I can learn from. However, I have a few questions—would you mind helping me clarify them?

  1. I couldn't find hccl_group.py. Was it removed from the commit because hccl_group and nccl_group are logically similar and contain a lot of duplicated code?
  2. Is the current method for adding a new backend to CG still as described in your PR—implementing an _XcclGroup class similar to _NcclGroup, and registering it via register_accelerator_context?

Thank you for your time.

@hipudding
Contributor Author

hipudding commented Oct 20, 2025

  1. I couldn't find hccl_group.py. Was it removed from the commit because hccl_group and nccl_group are logically similar and contain a lot of duplicated code?

Yes, this part was removed, leaving a register method for adding your own backend. Currently, the hccl backend can't be added to the Ray project because the Ray community has no Ascend servers and can't test it. I keep a backup of hccl and hccl_group; see 0482edc for an example.

  2. Is the current method for adding a new backend to CG still as described in your PR—implementing an _XcclGroup class similar to _NcclGroup, and registering it via register_accelerator_context?

Yes, you're right. Use this register method to enable your backend.


Labels

  • community-contribution: Contributed by the community
  • core: Issues that should be addressed in Ray Core
  • go: add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants