[collective] Ray Collective Communication Lib Support HCCL Backend#50790

Closed
liuxsh9 wants to merge 8 commits into ray-project:master from liuxsh9:support_ray_collective_on_ascend

Conversation

@liuxsh9
Contributor

@liuxsh9 liuxsh9 commented Feb 21, 2025

Why are these changes needed?

Currently, Ray Core supports scheduling on Ascend NPU devices, but the Ray Collective API does not support communication between NPUs using HCCL. Compared to transfers over PCIe and main memory, HCCL provides several times the bandwidth, which significantly improves the performance of machine learning tasks.

Software Dependencies

  • CANN: Open-source CANN version >= 8.1, available at this link.
  • torch/torch-npu: The single-device-per-process API has no version restrictions; the send_multidevice and recv_multidevice interfaces require a torch-npu release after 2025.2.19.
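At the API level, selecting the new backend would presumably mirror the existing ones, e.g. something like `init_collective_group(world_size, rank, backend="hccl")`. The sketch below shows the backend-dispatch pattern such a change typically follows; the names `HCCLGroup`, `register_backend`, and `create_group` are illustrative, not Ray's actual internals.

```python
# Hypothetical backend registry: maps a backend string to a group class.
_BACKEND_REGISTRY = {}

def register_backend(name):
    """Register a collective group class under a backend name."""
    def deco(cls):
        _BACKEND_REGISTRY[name.lower()] = cls
        return cls
    return deco

@register_backend("hccl")
class HCCLGroup:
    """Placeholder for an HCCL-backed collective group."""
    def __init__(self, world_size, rank):
        self.world_size = world_size
        self.rank = rank

def create_group(backend, world_size, rank):
    """Look up the backend by name and construct its group."""
    try:
        cls = _BACKEND_REGISTRY[backend.lower()]
    except KeyError:
        raise ValueError(f"unsupported collective backend: {backend!r}")
    return cls(world_size, rank)
```

With this shape, adding HCCL support is mostly a matter of registering one more group class rather than touching every call site.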

Related issue number

Checks

  • I've signed off every commit (using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@liuxsh9 liuxsh9 marked this pull request as ready for review February 21, 2025 10:31
@liuxsh9 liuxsh9 marked this pull request as draft February 21, 2025 10:32
@liuxsh9 liuxsh9 changed the title [WIP] Supports the use of the Ray Collective API on Ascend NPU devices [WIP] Ray Collective Communication Lib Support HCCL Backend Feb 21, 2025
@liuxsh9 liuxsh9 force-pushed the support_ray_collective_on_ascend branch from 8ef8711 to 868a90b Compare March 18, 2025 08:07
@hipudding
Contributor

Thanks to @liuxsh9's PR, this enables the use of Ascend NPU's collective communication capabilities in Ray. However, there are still some differences in HcclGroup within the Compiled Graph:

  1. The Compiled Graph creates communication groups from the driver, so it does not need a global root_info store held by an actor.
  2. The collective communication interfaces in the Compiled Graph are a subset of RCCL.

Therefore, could we split hccl_collective_group into two parts?
One part would only provide a wrapper for HCCL's C interfaces (using ctypes).
Both hccl_collective_group and hccl_group would then reuse these HCCL interface wrappers.
This approach would maximize code reuse. What do you think?
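The shared wrapper layer proposed above might look roughly like this. The class name, library path, and symbol are illustrative assumptions; the real wrapper would bind the actual HCCL symbols, and dlopen is deferred so importing the module doesn't require the library to be present.

```python
import ctypes

class HcclLibrary:
    """Thin lazy ctypes wrapper around HCCL's C interface (sketch).

    Both a driver-created group (Compiled Graph) and an actor-coordinated
    group could share this layer instead of each binding HCCL directly.
    """

    def __init__(self, lib_path="libhccl.so"):  # path is an assumption
        self._lib_path = lib_path
        self._lib = None  # dlopen deferred until first use

    def _load(self):
        if self._lib is None:
            self._lib = ctypes.CDLL(self._lib_path)
        return self._lib

    def get_root_info(self):
        # Illustrative only, e.g.: self._load().HcclGetRootInfo(...)
        raise NotImplementedError("bind the real HCCL symbol here")
```

Because loading is lazy, the wrapper can be imported and unit-tested on machines without CANN installed.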

Contributor

I wonder if it's necessary to rename gpu_index to something like device_index?

Contributor Author

Yes, device_index is the more general name. The necessary modifications have been made, though further feedback from reviewers may be needed.

Contributor Author

The changes also rename send/recv_multigpu to send/recv_multidevice, marking the original APIs as deprecated rather than removing them, to preserve compatibility.
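A deprecated-but-working alias like the one described above is commonly done with a small forwarding wrapper. This is a hedged sketch of the pattern, not the PR's actual code; the function bodies and signature are placeholders.

```python
import functools
import warnings

def deprecated_alias(new_name):
    """Keep an old API name working while pointing callers at the new one."""
    def deco(new_func):
        @functools.wraps(new_func)
        def old_func(*args, **kwargs):
            warnings.warn(
                f"this API is deprecated; use {new_name} instead",
                DeprecationWarning,
                stacklevel=2,
            )
            return new_func(*args, **kwargs)
        return old_func
    return deco

def send_multidevice(tensor, dst_rank, dst_index, group_name="default"):
    # Illustrative stand-in for the real collective send.
    return (dst_rank, dst_index)

# Old name forwards to the new implementation and emits a warning.
send_multigpu = deprecated_alias("send_multidevice")(send_multidevice)
```

Existing callers of `send_multigpu` keep working, but get a `DeprecationWarning` nudging them toward `send_multidevice`.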

Contributor

How about enabling copy when calling _flatten_for_scatter_gather?

Contributor Author

That's right, simply enabling copy is more straightforward.

Contributor

I think we could dedup and sort the device ids? Would it be helpful to hit communicator cache?

Contributor Author

@liuxsh9 liuxsh9 Apr 15, 2025

Yes, it has been modified. However, the current Ascend CANN does not support collective communication across multiple processes with multiple devices per process (interfaces like allreduce_multigpu), so the devices list here typically contains only a single value.

liuxsh9 added 3 commits April 14, 2025 12:14
Signed-off-by: liuxsh9 <liuxiaoshuang4@huawei.com>
Signed-off-by: liuxsh9 <liuxiaoshuang4@huawei.com>
Signed-off-by: liuxsh9 <liuxiaoshuang4@huawei.com>
@liuxsh9 liuxsh9 force-pushed the support_ray_collective_on_ascend branch from 6ff196f to e594d68 Compare April 14, 2025 04:17
liuxsh9 added 5 commits April 14, 2025 15:22
Signed-off-by: liuxsh9 <liuxiaoshuang4@huawei.com>
Signed-off-by: liuxsh9 <liuxiaoshuang4@huawei.com>
…ultigpu/recv_mutligpu

Signed-off-by: liuxsh9 <liuxiaoshuang4@huawei.com>
Signed-off-by: liuxsh9 <liuxiaoshuang4@huawei.com>
…cescatter api.

Signed-off-by: liuxsh9 <liuxiaoshuang4@huawei.com>
@liuxsh9 liuxsh9 marked this pull request as ready for review April 15, 2025 04:08
@liuxsh9 liuxsh9 requested a review from a team as a code owner April 15, 2025 04:08
@liuxsh9 liuxsh9 changed the title [WIP] Ray Collective Communication Lib Support HCCL Backend [Collective] Ray Collective Communication Lib Support HCCL Backend Apr 15, 2025
@liuxsh9 liuxsh9 changed the title [Collective] Ray Collective Communication Lib Support HCCL Backend [collective] Ray Collective Communication Lib Support HCCL Backend Apr 15, 2025
@github-actions

github-actions bot commented Jun 6, 2025

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jun 6, 2025
@github-actions

This pull request has been automatically closed because there has been no more activity in the 14 days
since being marked stale.

Please feel free to reopen or open a new pull request if you'd still like this to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for your contribution!
