[collective] Ray Collective Communication Lib Support HCCL Backend #50790
liuxsh9 wants to merge 8 commits into ray-project:master
Conversation
Force-pushed from 8ef8711 to 868a90b
Thanks to @liuxsh9's PR, this enables the use of Ascend NPU's collective communication capabilities in Ray. However, there are still some differences in HcclGroup within the Compiled Graph.
Therefore, could we split hccl_collective_group into two parts:
I wonder if it's necessary to rename gpu_index to something like device_index?
Yes, using device_index is a more universal solution. The necessary modifications have been made, but further feedback from reviewers may be needed.
At the same time, the modifications include renaming send/recv_multigpu to send/recv_multidevice, while marking the original APIs as deprecated but not removing them to ensure compatibility.
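The rename-with-deprecation approach described above can be sketched as a small wrapper that forwards the old name to the new API while emitting a warning. This is a hypothetical illustration of the pattern, not Ray's actual implementation; the function names and signature are simplified stand-ins.

```python
import warnings
from functools import wraps


def deprecated_alias(new_func, old_name):
    """Wrap new_func so it can also be exposed under a deprecated old name.

    Hypothetical helper illustrating the deprecate-but-keep pattern;
    not the actual Ray code.
    """
    @wraps(new_func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{old_name} is deprecated; use {new_func.__name__} instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return new_func(*args, **kwargs)
    return wrapper


def send_multidevice(tensor, dst_rank, dst_device_index, group_name="default"):
    # Stand-in for the renamed collective API; returns its routing
    # arguments so the example is observable without real hardware.
    return (dst_rank, dst_device_index)


# Old name kept for backward compatibility, but it now warns.
send_multigpu = deprecated_alias(send_multidevice, "send_multigpu")
```

Callers using the old name keep working, and the warning nudges them toward the new one.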
How about enabling copy when calling _flatten_for_scatter_gather?
That's right, simply enabling copy is more straightforward.
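For readers unfamiliar with the helper being discussed: a _flatten_for_scatter_gather-style routine packs a list of same-shaped tensors into one contiguous buffer, and enabling copy means the inputs are copied into that buffer eagerly. The numpy sketch below is a simplified illustration of that idea only; the real Ray/torch-npu code differs.

```python
import numpy as np


def flatten_for_scatter_gather(tensor_list, copy=False):
    """Allocate one contiguous buffer of shape (len(tensor_list), *shape).

    When copy=True, each input tensor is copied into its slot so the
    buffer is ready for a scatter/gather collective. Simplified numpy
    sketch of the pattern, not Ray's actual helper.
    """
    first = tensor_list[0]
    buffer = np.empty((len(tensor_list),) + first.shape, dtype=first.dtype)
    if copy:
        for i, t in enumerate(tensor_list):
            buffer[i] = t  # explicit copy into the flat buffer
    return buffer


chunks = [np.ones(4, dtype=np.float32) * i for i in range(3)]
flat = flatten_for_scatter_gather(chunks, copy=True)
```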
I think we could dedup and sort the device ids. Would that help hit the communicator cache?
Yes, this has been modified. However, the current Ascend CANN does not support collective communication across multiple processes with multiple devices per process (interfaces like allreduce_multigpu), so the devices list here typically contains only a single value.
Signed-off-by: liuxsh9 <liuxiaoshuang4@huawei.com>
Signed-off-by: liuxsh9 <liuxiaoshuang4@huawei.com>
Force-pushed from 6ff196f to e594d68
Signed-off-by: liuxsh9 <liuxiaoshuang4@huawei.com>
Signed-off-by: liuxsh9 <liuxiaoshuang4@huawei.com>
…ultigpu/recv_multigpu Signed-off-by: liuxsh9 <liuxiaoshuang4@huawei.com>
Signed-off-by: liuxsh9 <liuxiaoshuang4@huawei.com>
…cescatter api. Signed-off-by: liuxsh9 <liuxiaoshuang4@huawei.com>
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public Slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
This pull request has been automatically closed because there has been no further activity in the last 14 days. Please feel free to reopen or open a new pull request if you'd still like this to be addressed. Again, you can always ask for help on our discussion forum or Ray's public Slack channel. Thanks again for your contribution!
Why are these changes needed?
Currently, Ray Core supports scheduling on Ascend NPU devices, but Ray Collective API does not support communication between NPUs using HCCL. Compared to transfers based on PCIe and main memory, HCCL can provide a bandwidth increase of several times, which significantly enhances the performance of machine learning tasks.
Software Dependencies
The send_multidevice and recv_multidevice interfaces require a version of torch-npu released after 2025.2.19.
Related issue number
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.