[core][compiled graphs] Meta-issue: Support collective communication ops #47983
Open
Labels
P1 (issue that should be fixed within a few weeks), community-backlog, compiled-graphs, core (issues that should be addressed in Ray Core), enhancement (request for new feature and/or capability), performance
Description
This is a meta-issue tracking progress on tasks related to collective communication in compiled graphs. See the RFC for more details.
Roadmap:
- Initial support for allreduce: [aDAG] Support all reduce collective in aDAG #47621
- [WIP] Unify p2p and collective op code paths: [core][compiled graphs] Unify code paths for NCCL P2P and collectives scheduling #48649
- [WIP] Support other all-to-all patterns, e.g., allgather
- Support all-to-one patterns, e.g., reduce
- Execute multiple p2p NCCL transfers with collective ops: [core][compiled graphs] Execute multiple p2p NCCL transfers with collective ops #47938
- Error detection for users passing different shapes on different workers. Possibilities:
- Support direct_return=False, i.e., cases where the user returns a mix of CPU and GPU data
- Support batching multiple tensors into one communication op
- Support for non-compiled graphs
- Add a CPU-based NCCL communicator for development: [core][compiled graphs] Add CPU-based NCCL communicator for development #47936
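To make the allreduce and shape-error-detection items above concrete, here is a minimal CPU-only sketch of the semantics being tracked: every worker contributes one tensor, and after the op every worker holds the elementwise sum. The function name `allreduce_sum` and the up-front shape check are illustrative assumptions for this sketch, not Ray's actual API; the point of the error-detection roadmap item is that mismatched shapes should fail fast like this rather than hang inside NCCL.

```python
import numpy as np

def allreduce_sum(tensors):
    """Simulate a sum-allreduce across workers on CPU (illustrative only)."""
    shapes = {t.shape for t in tensors}
    if len(shapes) > 1:
        # Roadmap item: detect workers passing different shapes and raise
        # a clear error instead of deadlocking in the collective.
        raise ValueError(f"allreduce requires matching shapes, got {shapes}")
    reduced = np.sum(tensors, axis=0)
    # Each worker receives its own copy of the reduced result.
    return [reduced.copy() for _ in tensors]

# Three "workers" (ranks 0, 1, 2) each contribute a tensor of their rank.
inputs = [np.full(4, rank, dtype=np.float32) for rank in range(3)]
outputs = allreduce_sum(inputs)
# every rank now holds [3., 3., 3., 3.]  (0 + 1 + 2)
```

In a real compiled graph, the sum would be performed by NCCL across GPU workers; this sketch only shows the contract the roadmap items target.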
Use case
No response