[ADAG] Enable NPU (hccl) communication for CG #47658
Bye-legumes wants to merge 44 commits into ray-project:master
Conversation
cc @ruisearch42
ruisearch42 left a comment:
Having a round of review since I was tagged.
Overall looks good. Do you plan to add a test?
Let me know when this is ready to review.
```python
from ray.experimental.channel.nccl_group import _NcclGroup
else:
    from ray.experimental.channel.hccl_group import _HcclGroup as _NcclGroup
```
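For readers following along, here is a minimal standalone sketch of the aliasing pattern under discussion: the channel layer is written against a single communicator symbol, and the backend is swapped at import time based on the accelerator type. The class names below are stand-ins, not the real Ray implementations.

```python
class _NcclGroupImpl:
    """Stand-in for ray.experimental.channel.nccl_group._NcclGroup."""
    backend = "nccl"


class _HcclGroupImpl:
    """Stand-in for ray.experimental.channel.hccl_group._HcclGroup."""
    backend = "hccl"


def resolve_comm_group(device_type: str):
    """Return the communicator class for the given accelerator type.

    On NPUs the HCCL implementation is returned under the same symbol
    the rest of the channel code expects, so no call sites change.
    """
    if device_type == "GPU":
        return _NcclGroupImpl
    return _HcclGroupImpl
```

This shows why the reviewers call it a hack: the substitution is invisible at the call site, which is convenient but hides which backend is actually in use.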
hmm, this looks like a hack. Do you plan to change to a cleaner approach?
OK, I just removed this hack and left a comment in the test. After we refactor the channel we can have a better solution.
```diff
 class GPUCommunicator(ABC):
     """
-    Communicator for a group of aDAG actors on Nvidia GPU.
+    Communicator for a group of aDAG actors on Nvidia GPU or other XPUs.
```
We should probably change the class name to a more general one if this is to support other XPUs. This is not yet used externally so backward compatibility is not an issue.
I agree. As a next step I'd prefer to change it to AcceleratorCommunicator, or just Communicator for all backends. Currently this GPUCommunicator is also called from some top-level code, so I'll keep the name for now.
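As a sketch of what a backend-neutral interface could look like (the name `Communicator` and the method signatures here are assumptions based on this thread, not Ray's actual API):

```python
from abc import ABC, abstractmethod


class Communicator(ABC):
    """Hypothetical accelerator-agnostic communicator interface."""

    @abstractmethod
    def send(self, tensor, peer_rank: int) -> None:
        """Send a tensor to the peer at peer_rank."""

    @abstractmethod
    def recv(self, shape, dtype, peer_rank: int):
        """Receive a tensor of the given shape/dtype from peer_rank."""


class FakeComm(Communicator):
    """Trivial in-memory implementation, just to show the interface is usable."""

    def __init__(self):
        self.mailbox = {}

    def send(self, tensor, peer_rank: int) -> None:
        self.mailbox[peer_rank] = tensor

    def recv(self, shape, dtype, peer_rank: int):
        return self.mailbox[peer_rank]
```

NCCL and HCCL groups would then be sibling subclasses of the same base, instead of one being aliased as the other.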
```python
self._device_id = device_id

if rank is not None:
    assert ray.get_gpu_ids(), "HCCL actor has no NPUs assigned"
```
ray.get_gpu_ids() seems to only get GPU IDs?
True, I just changed it. Also, I think there should be an API to get all accelerator IDs?
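As an illustration of what such an API might look like, here is a hypothetical helper mirroring `ray.get_gpu_ids()` for Ascend NPUs. The function name is invented, and reading `ASCEND_RT_VISIBLE_DEVICES` (the Ascend runtime's device-visibility variable) this way is an assumption, not existing Ray behavior.

```python
import os


def get_npu_ids():
    """Hypothetical analogue of ray.get_gpu_ids() for Ascend NPUs.

    Parses the Ascend runtime's visibility env var into a list of
    device indices; returns [] when no NPUs are assigned.
    """
    visible = os.environ.get("ASCEND_RT_VISIBLE_DEVICES", "")
    return [int(i) for i in visible.split(",") if i.strip()]
```

A generic `ray.get_accelerator_ids()` keyed by accelerator type would sidestep per-vendor helpers entirely.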
```python
if self._closed:
    raise RayChannelError("HCCL group has been destroyed.")

self._comm.send(tensor=value, dst=peer_rank)
```
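A self-contained sketch of the guard pattern in the quoted snippet; the class and error names below are stand-ins for the real `_HcclGroup` and `RayChannelError`, and the "send" is an in-memory placeholder rather than an HCCL call.

```python
class ChannelClosedError(RuntimeError):
    """Stand-in for RayChannelError in this sketch."""


class FakeHcclGroup:
    def __init__(self):
        self._closed = False
        self.sent = []

    def destroy(self):
        # After destruction, any further send must fail loudly.
        self._closed = True

    def send(self, value, peer_rank):
        if self._closed:
            raise ChannelClosedError("HCCL group has been destroyed.")
        self.sent.append((value, peer_rank))
```

Raising on a destroyed group turns a silent hang on a dead communicator into an immediate, catchable error.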
One question I have is: how is this different from nccl_collective_group send/recv? It seems nccl_collective_group just abstracts it at a higher level as _point2point, but is otherwise identical to nccl_group.
If it's supposed to be channel-only, then we can merge this hccl_group and later open another PR for hccl_collective_group.
I think collective is a more general module that can be used by all other Ray modules, while here we need a module specific to the aDAG channel. We can have another PR for hccl_collective_group so it can serve as a utility that makes NPUs easier to use. In collective we can try to solve the double import and the other problems we've met.
So in yesterday's aDAG meeting someone mentioned nccl_collective_group is actually old code, and nccl_group send/receive is what's currently used. We can discuss more to see how to extend it to support collectives as part of the refactor proposal.
There is another PR to support collective fns as a node type: #47621.
I see they implemented collective/allreduce.py, which calls allreduce of the GPUCommunicator in nccl_group.py.
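For context on the collective discussion, here is a purely illustrative sketch of allreduce semantics, not Ray's implementation: every rank contributes a value, and every rank receives the reduced result. With real communicators each rank holds its own tensor; here a list stands in for the ranks to show the contract a channel-level allreduce would satisfy.

```python
def allreduce_sum(values):
    """Toy allreduce: values[i] is rank i's input; each rank gets the sum.

    Real NCCL/HCCL allreduce fuses the reduction and the broadcast into
    one collective; this sketch only models the input/output contract.
    """
    total = values[0]
    for v in values[1:]:
        total = total + v
    return [total] * len(values)
```

The point-to-point send/recv in this PR is the lower-level primitive; collectives like this sit one layer above it.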
Hi @ruisearch42, thanks for your suggestions! I just rewrote some of them and added a test here. The test is runnable on NPU but cannot run on GPU now, so it's an example to show how to run it.
```python
)

torch_npu.npu.set_device(rank)  # Set the NPU device according to the rank
self.ctx = dist.init_process_group(
```
Should we call this process_group?
aha.. This is different from process_group... The Ascend torch_npu handles distributed initialization a little differently, while the other parts are the same: https://github.com/Ascend/pytorch/blob/868b6f8e00eb0fb179fe719a81e13d8ec1860873/test/distributed/test_send_recv.py#L25
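Based on the linked Ascend example, initialization on an NPU host might look roughly like the sketch below. The availability of `torch_npu` and the `"hccl"` backend are assumptions about the Ascend environment (imports are deferred into the function for that reason); this is not runnable on ordinary hardware.

```python
def init_npu_process_group(rank: int, world_size: int):
    """Sketch of NPU process-group setup, following the Ascend test linked above."""
    import torch.distributed as dist
    import torch_npu  # Ascend PyTorch adapter; registers the "hccl" backend

    torch_npu.npu.set_device(rank)  # bind this process to its NPU by rank
    dist.init_process_group(backend="hccl", rank=rank, world_size=world_size)
```

Aside from the backend string and the `torch_npu.npu.set_device` call, the flow mirrors CUDA's `init_process_group(backend="nccl", ...)`.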
@kevin85421 Hi, I just resolved conflicts and it works with vLLM now. Can you check this PR? Thx!
Hi @ruisearch42, can you review this PR?
I also noticed another new PR: #51032
Yes, the goals of these two PRs are the same. I was also inspired by this PR; thanks to @Bye-legumes for the first step. But I prefer to access hccl in a way similar to cuda, and to make it more convenient for more accelerators, so the implementation is different. @ruisearch42, which way is better? Could you give us some suggestions? I think it's better to join these two PRs together. I did not submit code directly to this PR because I was worried it would affect its existing functionality, and this PR may be related to the debugging of vLLM. What do you think, @Bye-legumes? (Of course, we are both authors of this feature.)
Performance Results (Model: Qwen2.5-14B)
@ruisearch42 what's the next step?
We will follow up with #51032
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public Slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.

This pull request has been automatically closed because there has been no activity for 14 days. Please feel free to reopen or open a new pull request if you'd still like this to be addressed. Again, you can always ask for help on our discussion forum or Ray's public Slack channel. Thanks again for your contribution!
Why are these changes needed?
Related issue number
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.