[aDAG] Support all reduce collective in aDAG#47621
Merged
stephanie-wang merged 96 commits intoray-project:masterfrom Oct 21, 2024
Merged
[aDAG] Support all reduce collective in aDAG#47621stephanie-wang merged 96 commits intoray-project:masterfrom
stephanie-wang merged 96 commits intoray-project:masterfrom
Conversation
added 30 commits
August 8, 2024 19:25
The input index is used when a task returns multiple values. An index is used to get the corresponding value of the tuple. Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
The downstream tasks of a TaskReturnNode should be readers of a output channel. The task of a TaskRetureNode should have a copy of the output channel in the task of its upstream ClassMethodNode. Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Yuhan Ruan <andyubryh@gmail.com>
Signed-off-by: Yuhan Ruan <andyubryh@gmail.com>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
stephanie-wang
approved these changes
Oct 18, 2024
Contributor
stephanie-wang
left a comment
There was a problem hiding this comment.
Nice work! Left a few remaining comments for cleanups and to see if we can further cut down on the tests that need GPUs to run. The only functionality we really need to test for GPUs is:
- does it provide the expected allreduce result?
- does passing a custom communicator actually result in calling the custom communicator during execution?
Otherwise, all logic should go in unit tests that don't need GPUs to run.
Signed-off-by: Yuhan Ruan <andyubryh@gmail.com>
Signed-off-by: Yuhan Ruan <andyubryh@gmail.com>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Yuhan Ruan <andyubryh@gmail.com>
Signed-off-by: Yuhan Ruan <andyubryh@gmail.com>
Signed-off-by: Yuhan Ruan <andyubryh@gmail.com>
Move Deduplicate P2P & Collective
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
This was referenced Oct 21, 2024
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why are these changes needed?
aDAG currently does not support collective APIs. We would like to add support for collective APIs, starting from allreduce.
This PR support allreduce by introducing a syntax sugar
ray.experimental.collective.allreduce.bind. Thebindaccepts argumentsinput_nodes,op, andtransport. It returns a list ofCollectiveOutputNodeas the allreduce results, with the same size ofinput_nodes. The allreduce results write to newly allocated tensors. In theCOMPUTEoperation ofCollectiveOutputNode, the corresponding NCCL collective API is called. There are no required changes for the input and output channels ofCollectiveOutputNode.Proposed new API:
API Requirements:
Requirements 1-3 are checked in the
_CollectiveOperationconstructor. Requirement 4 is checked by runtime timeout.The operation scheduling is also updated to consider the NCCL collective operation. When a NCCL collective node is selected, all the corresponding collective nodes in the collective group should be selected as well.
Related issue number
Meta-issue: #47983
Checks
git commit -s) in this PR.scripts/format.shto lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.