[ADAG] Refactor nccl to communicator channel.#48607

Closed
Bye-legumes wants to merge 13 commits into ray-project:master from Bye-legumes:refactor_communicator

@Bye-legumes
Contributor

Why are these changes needed?

Previous PR: #47845
This PR tries to let ADAG channels run on different hardware while keeping the user-facing API the same.
On the user side, transport='nccl' or transport='hccl' selects the hardware backend.
Internally, ADAG treats both as nodes that need a communicator, so at the compiled-graph and channel level this PR only renames nccl to communicator.
At the lowest level, nccl_group or hccl_group is called to provide the hardware-level communicator.
The API and logic above torch_tensor_communicator_channel.py (previously torch_tensor_nccl_channel.py) stay the same.

To support a new accelerator, one only needs to implement an xccl_group that exposes the same API as the other groups. The new hardware can then be used without further changes.
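The layering described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: the names `Communicator`, `register_transport`, `create_communicator`, `FakeNcclGroup`, and `FakeHcclGroup` are hypothetical, and the groups pass values through an in-memory dict instead of real NCCL or HCCL calls.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Tuple, Type


class Communicator(ABC):
    """Hypothetical backend-agnostic interface that every xccl_group would implement."""

    @abstractmethod
    def send(self, value: Any, peer_rank: int) -> None:
        ...

    @abstractmethod
    def recv(self, peer_rank: int) -> Any:
        ...


# Maps the user-facing transport string to a concrete group class (assumed design).
_REGISTRY: Dict[str, Type[Communicator]] = {}


def register_transport(name: str):
    """Class decorator that registers a group under a transport name."""
    def decorator(cls: Type[Communicator]) -> Type[Communicator]:
        _REGISTRY[name] = cls
        return cls
    return decorator


Mailbox = Dict[Tuple[int, int], List[Any]]


class _InMemoryGroup(Communicator):
    """Stand-in for a hardware group: values travel through a shared dict."""

    def __init__(self, rank: int, mailbox: Mailbox) -> None:
        self.rank = rank
        self._mailbox = mailbox

    def send(self, value: Any, peer_rank: int) -> None:
        # Queue the value under (sender, receiver).
        self._mailbox.setdefault((self.rank, peer_rank), []).append(value)

    def recv(self, peer_rank: int) -> Any:
        # Pop the oldest value sent from peer_rank to this rank.
        return self._mailbox[(peer_rank, self.rank)].pop(0)


@register_transport("nccl")
class FakeNcclGroup(_InMemoryGroup):
    pass


@register_transport("hccl")
class FakeHcclGroup(_InMemoryGroup):
    pass


def create_communicator(transport: str, rank: int, mailbox: Mailbox) -> Communicator:
    """Channel-level code only sees Communicator, never a specific backend."""
    try:
        group_cls = _REGISTRY[transport]
    except KeyError:
        raise ValueError(f"Unsupported transport: {transport!r}") from None
    return group_cls(rank, mailbox)


# Usage: the calling code is identical for 'nccl' and 'hccl'.
mailbox: Mailbox = {}
sender = create_communicator("hccl", rank=0, mailbox=mailbox)
receiver = create_communicator("hccl", rank=1, mailbox=mailbox)
sender.send([1, 2, 3], peer_rank=1)
result = receiver.recv(peer_rank=0)
```

Under this sketch, supporting a new accelerator would mean adding one more registered class with the same `send`/`recv` methods; the code that calls `create_communicator` would not change.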

RFC doc https://docs.google.com/document/d/1zu9SllrEAjPHqs-eeITtrSSbv0rBxtkyCJeweZJl100/edit?usp=sharing

Related issue number

Checks

  • [√] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • [√] I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • [√] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Bye-legumes and others added 13 commits November 6, 2024 15:08
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
@jcotant1 jcotant1 added the core Issues that should be addressed in Ray Core label Nov 15, 2024
@Bye-legumes
Contributor Author

see #47658

Labels

community-backlog core Issues that should be addressed in Ray Core

3 participants