[collective] Ray Collective Communication Lib Support HCCL Backend#50790

Closed
liuxsh9 wants to merge 8 commits into ray-project:master from liuxsh9:support_ray_collective_on_ascend

Conversation

@liuxsh9
Contributor

@liuxsh9 liuxsh9 commented Feb 21, 2025

Why are these changes needed?

Currently, Ray Core supports scheduling on Ascend NPU devices, but the Ray Collective API does not support communication between NPUs using HCCL. Compared to transfers over PCIe and main memory, HCCL provides several times the bandwidth, which significantly improves the performance of machine learning tasks.

Software Dependencies

  • CANN: Open-source CANN version >= 8.1, available at this link.
  • torch/torch-npu: The single-device-per-process API has no version restrictions; the send_multidevice and recv_multidevice interfaces require a torch-npu release after 2025.2.19.
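At the API level, selecting the new backend would presumably mirror the existing ones, e.g. something like `init_collective_group(world_size, rank, backend="hccl")`. The sketch below shows the backend-dispatch pattern such a change typically follows; the names `HCCLGroup`, `register_backend`, and `create_group` are illustrative, not Ray's actual internals.

```python
# Hypothetical backend registry: maps a backend string to a group class.
_BACKEND_REGISTRY = {}

def register_backend(name):
    """Register a collective group class under a backend name."""
    def deco(cls):
        _BACKEND_REGISTRY[name.lower()] = cls
        return cls
    return deco

@register_backend("hccl")
class HCCLGroup:
    """Placeholder for an HCCL-backed collective group."""
    def __init__(self, world_size, rank):
        self.world_size = world_size
        self.rank = rank

def create_group(backend, world_size, rank):
    """Look up the backend by name and construct its group."""
    try:
        cls = _BACKEND_REGISTRY[backend.lower()]
    except KeyError:
        raise ValueError(f"unsupported collective backend: {backend!r}")
    return cls(world_size, rank)
```

With this shape, adding HCCL support is mostly a matter of registering one more group class rather than touching every call site.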

Related issue number

Checks

  • I've signed off every commit (using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@liuxsh9 liuxsh9 marked this pull request as ready for review February 21, 2025 10:31
@liuxsh9 liuxsh9 marked this pull request as draft February 21, 2025 10:32
@liuxsh9 liuxsh9 changed the title [WIP] Supports the use of the Ray Collective API on Ascend NPU devices [WIP] Ray Collective Communication Lib Support HCCL Backend Feb 21, 2025
@liuxsh9 liuxsh9 force-pushed the support_ray_collective_on_ascend branch from 8ef8711 to 868a90b Compare March 18, 2025 08:07
@hipudding
Contributor

Thanks to @liuxsh9's PR, this enables the use of Ascend NPU's collective communication capabilities in Ray. However, there are still some differences in HcclGroup within the Compiled Graph:

  1. The Compiled Graph creates communication groups from the driver, so it does not need a global root_info store held by an actor.
  2. The collective communication interfaces in the Compiled Graph are a subset of RCCL.

Therefore, could we split hccl_collective_group into two parts?
One part would only provide a wrapper for HCCL's C interfaces (using ctypes).
Both hccl_collective_group and hccl_group would then reuse these HCCL interface wrappers.
This approach would maximize code reuse. What do you think?
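The shared wrapper layer proposed above might look roughly like this. The class name, library path, and symbol are illustrative assumptions; the real wrapper would bind the actual HCCL symbols, and dlopen is deferred so importing the module doesn't require the library to be present.

```python
import ctypes

class HcclLibrary:
    """Thin lazy ctypes wrapper around HCCL's C interface (sketch).

    Both a driver-created group (Compiled Graph) and an actor-coordinated
    group could share this layer instead of each binding HCCL directly.
    """

    def __init__(self, lib_path="libhccl.so"):  # path is an assumption
        self._lib_path = lib_path
        self._lib = None  # dlopen deferred until first use

    def _load(self):
        if self._lib is None:
            self._lib = ctypes.CDLL(self._lib_path)
        return self._lib

    def get_root_info(self):
        # Illustrative only, e.g.: self._load().HcclGetRootInfo(...)
        raise NotImplementedError("bind the real HCCL symbol here")
```

Because loading is lazy, the wrapper can be imported and unit-tested on machines without CANN installed.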

Contributor

I wonder if it's necessary to rename gpu_index to something like device_index?

Contributor Author

Yes, device_index is the more general name. The necessary modifications have been made, though further feedback from reviewers may be needed.

Contributor Author

The changes also rename send/recv_multigpu to send/recv_multidevice, marking the original APIs as deprecated rather than removing them, to preserve compatibility.
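A deprecated-but-working alias like the one described above is commonly done with a small forwarding wrapper. This is a hedged sketch of the pattern, not the PR's actual code; the function bodies and signature are placeholders.

```python
import functools
import warnings

def deprecated_alias(new_name):
    """Keep an old API name working while pointing callers at the new one."""
    def deco(new_func):
        @functools.wraps(new_func)
        def old_func(*args, **kwargs):
            warnings.warn(
                f"this API is deprecated; use {new_name} instead",
                DeprecationWarning,
                stacklevel=2,
            )
            return new_func(*args, **kwargs)
        return old_func
    return deco

def send_multidevice(tensor, dst_rank, dst_index, group_name="default"):
    # Illustrative stand-in for the real collective send.
    return (dst_rank, dst_index)

# Old name forwards to the new implementation and emits a warning.
send_multigpu = deprecated_alias("send_multidevice")(send_multidevice)
```

Existing callers of `send_multigpu` keep working, but get a `DeprecationWarning` nudging them toward `send_multidevice`.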

Contributor

How about enabling copy when calling _flatten_for_scatter_gather?

Contributor Author

That's right, simply enabling copy is more straightforward.

Contributor

I think we could dedup and sort the device ids? Would it be helpful to hit communicator cache?

Contributor Author

@liuxsh9 liuxsh9 Apr 15, 2025

Yes, it has been modified. However, the current Ascend CANN does not support collective communication across multiple processes with multiple devices per process (interfaces like allreduce_multigpu), so the devices list here typically contains only a single value.

liuxsh9 added 3 commits April 14, 2025 12:14
Signed-off-by: liuxsh9 <liuxiaoshuang4@huawei.com>
Signed-off-by: liuxsh9 <liuxiaoshuang4@huawei.com>
Signed-off-by: liuxsh9 <liuxiaoshuang4@huawei.com>
@liuxsh9 liuxsh9 force-pushed the support_ray_collective_on_ascend branch from 6ff196f to e594d68 Compare April 14, 2025 04:17
liuxsh9 added 5 commits April 14, 2025 15:22
Signed-off-by: liuxsh9 <liuxiaoshuang4@huawei.com>
Signed-off-by: liuxsh9 <liuxiaoshuang4@huawei.com>
…ultigpu/recv_mutligpu

Signed-off-by: liuxsh9 <liuxiaoshuang4@huawei.com>
Signed-off-by: liuxsh9 <liuxiaoshuang4@huawei.com>
…cescatter api.

Signed-off-by: liuxsh9 <liuxiaoshuang4@huawei.com>
@liuxsh9 liuxsh9 marked this pull request as ready for review April 15, 2025 04:08
@liuxsh9 liuxsh9 requested a review from a team as a code owner April 15, 2025 04:08
@liuxsh9 liuxsh9 changed the title [WIP] Ray Collective Communication Lib Support HCCL Backend [Collective] Ray Collective Communication Lib Support HCCL Backend Apr 15, 2025
@liuxsh9 liuxsh9 changed the title [Collective] Ray Collective Communication Lib Support HCCL Backend [collective] Ray Collective Communication Lib Support HCCL Backend Apr 15, 2025
@github-actions

github-actions bot commented Jun 6, 2025

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jun 6, 2025
@github-actions

This pull request has been automatically closed because there has been no more activity in the 14 days
since being marked stale.

Please feel free to reopen or open a new pull request if you'd still like this to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for your contribution!
