[ADAG] Refactor nccl to communicator channel. by Bye-legumes · Pull Request #48607 · ray-project/ray

Bye-legumes · 2024-11-06T20:09:57Z

Why are these changes needed?

Previous PR: #47845
This PR tries to let ADAG channels run on different hardware while keeping the user-facing API unchanged.
On the user side, pass transport='nccl' or transport='hccl' to select the hardware.
Internally, ADAG treats both as nodes that need a communicator, so at the compiled-graph and channel levels the change is just a rename from nccl to communicator.
At the bottom level, nccl_group or hccl_group is called to provide the hardware-specific communicator.
The API and logic above torch_tensor_communicator_channel.py (previously torch_tensor_nccl_channel.py) stay the same.

To support a new accelerator, one only needs to implement an xccl_group that exposes the same API as the other groups. The new hardware can then be used without further changes.
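
A minimal sketch of the idea, not the code in this PR: every backend group implements one shared interface, and the transport string only selects which group is constructed. All names here (CommunicatorGroup, NcclGroupStub, HcclGroupStub, create_group) are hypothetical, and an in-memory mailbox stands in for the real NCCL/HCCL calls so the example runs without any accelerator.

```python
from abc import ABC, abstractmethod
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple, Type

# Stand-in for the real interconnect (assumption for illustration only):
# (backend, src_rank, dst_rank) -> queue of pending messages.
_MAILBOX: Dict[Tuple[str, int, int], Deque[List[float]]] = defaultdict(deque)


class CommunicatorGroup(ABC):
    """Hypothetical common interface shared by all backend groups."""

    backend: str = ""

    def __init__(self, world_size: int, rank: int) -> None:
        if not 0 <= rank < world_size:
            raise ValueError(f"rank {rank} is outside world size {world_size}")
        self.world_size = world_size
        self.rank = rank

    @abstractmethod
    def send(self, value: List[float], peer_rank: int) -> None:
        """Send a value to the peer with the given rank."""

    @abstractmethod
    def recv(self, peer_rank: int) -> List[float]:
        """Receive the next value sent by the peer with the given rank."""


class _InMemoryGroup(CommunicatorGroup):
    """Fake transport; a real group would call into NCCL, HCCL, etc."""

    def send(self, value: List[float], peer_rank: int) -> None:
        _MAILBOX[(self.backend, self.rank, peer_rank)].append(list(value))

    def recv(self, peer_rank: int) -> List[float]:
        return _MAILBOX[(self.backend, peer_rank, self.rank)].popleft()


class NcclGroupStub(_InMemoryGroup):
    backend = "nccl"


class HcclGroupStub(_InMemoryGroup):
    backend = "hccl"


# Supporting a new accelerator means adding one entry here (hypothetical registry).
_GROUPS: Dict[str, Type[CommunicatorGroup]] = {
    "nccl": NcclGroupStub,
    "hccl": HcclGroupStub,
}


def create_group(transport: str, world_size: int, rank: int) -> CommunicatorGroup:
    """Pick the backend group from the user-facing transport string."""
    try:
        group_cls = _GROUPS[transport]
    except KeyError:
        raise ValueError(f"unsupported transport: {transport!r}") from None
    return group_cls(world_size, rank)


if __name__ == "__main__":
    sender = create_group("hccl", world_size=2, rank=0)
    receiver = create_group("hccl", world_size=2, rank=1)
    sender.send([1.0, 2.0], peer_rank=1)
    print(receiver.recv(peer_rank=0))
```

Code above the group level only calls send and recv on CommunicatorGroup, so in this sketch it never needs to know which hardware backend was selected.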

RFC doc https://docs.google.com/document/d/1zu9SllrEAjPHqs-eeITtrSSbv0rBxtkyCJeweZJl100/edit?usp=sharing

Related issue number

Checks

[√] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
[√] I've run scripts/format.sh to lint the changes in this PR.
[ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
  [ ] I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
[√] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
  [ ] Unit tests
  [ ] Release tests
  [ ] This PR is not tested :(

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>

Bye-legumes · 2024-11-18T21:08:29Z

see #47658

Bye-legumes and others added 13 commits November 6, 2024 15:08

fix

98c9b2d

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>

fix

8187451

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>

fix

fce8cd6

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>

fix

395b0b3

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>

fix

7c36fb3

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>

fix

acafdc7

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>

Merge branch 'master' into refactor_communicator

072beaa

fix

dd01079

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>

fix

32f3c37

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>

Dfix

2f59765

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>

fix

9e0726e

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>

Merge branch 'master' into refactor_communicator

d00ec2e

fix

49c38ca

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>

jcotant1 added the core Issues that should be addressed in Ray Core label Nov 15, 2024

Bye-legumes closed this Nov 18, 2024

Bye-legumes mentioned this pull request Mar 7, 2025

[ADAG]Enable NPU (hccl) communication for CG #47658

Closed

8 tasks

Bye-legumes mentioned this pull request Mar 20, 2025

[Compiled Graph] Enhance Compile Graph with Multi-Device Support #51032

Merged

8 tasks

hainesmichaelc added the community-backlog label May 22, 2025

Bye-legumes wants to merge 13 commits into ray-project:master from Bye-legumes:refactor_communicator

3 participants
