
(WIP) [core][compiled graphs] Unify code paths for NCCL P2P and collectives scheduling#48649

Closed
AndyUB wants to merge 151 commits into ray-project:master from AndyUB:union-dev-1105

Conversation

AndyUB (Contributor) commented Nov 8, 2024

Why are these changes needed?

This PR unifies the code paths for NCCL P2P and collectives. Before, scheduling for NCCL operations was done by splitting each node into three operations: READ, COMPUTE, and WRITE. This PR simplifies the logic by keeping only the COMPUTE operation. To ensure scheduling still works, NCCL operations are converted into special types of system-created compute nodes.

This PR also allows overlapping NCCL collectives with computation.

NCCL P2P Refactoring

with InputNode() as inp:
  dag = actor1.foo.bind(inp)
  dag = dag.with_tensor_transport("nccl")
  dag = actor2.bar.bind(dag)

Before this PR, compiling this DAG results in a TorchTensorNcclChannel from foo to bar.

This PR adds a NcclSendNode after foo and a NcclRecvNode before bar. The TorchTensorNcclChannel now connects the two added nodes. Since foo and the send node are on the same actor, the channel from foo to the send node is an IntraProcessChannel. The same applies on the recv side.
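The node-splicing pass described above can be sketched with a toy model. All class and field names below are illustrative, not Ray's actual internals; the point is only the shape of the rewrite: for each edge annotated with "nccl" transport, a send node is spliced in on the producer's actor and a recv node on the consumer's actor.

```python
# Hypothetical sketch of the rewrite: splice nccl_send/nccl_recv nodes
# into every edge that requires NCCL tensor transport.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    name: str
    actor: str
    upstream: List["Node"] = field(default_factory=list)
    transport: Optional[str] = None  # e.g. "nccl" set on the producer

def insert_p2p_nodes(consumer: Node) -> None:
    """Rewrite consumer's upstream edges that require NCCL transport."""
    new_upstream = []
    for producer in consumer.upstream:
        if producer.transport == "nccl":
            # Send node lives on the producer's actor, so the
            # producer -> send edge is an intra-process channel.
            send = Node(f"nccl_send({producer.name})", producer.actor, [producer])
            # Recv node lives on the consumer's actor, so the
            # recv -> consumer edge is an intra-process channel.
            recv = Node(f"nccl_recv({producer.name})", consumer.actor, [send])
            new_upstream.append(recv)
        else:
            new_upstream.append(producer)
    consumer.upstream = new_upstream

foo = Node("foo", "actor1", transport="nccl")
bar = Node("bar", "actor2", upstream=[foo])
insert_p2p_nodes(bar)
print([n.name for n in bar.upstream])  # ['nccl_recv(foo)']
```

After the rewrite, the NCCL channel connects only the two spliced nodes, which is what lets the rest of the system treat foo and bar as ordinary compute nodes.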

Multiple Receivers
with InputNode() as inp:
  dag = actor1.foo.bind(inp)
  dag = dag.with_tensor_transport("nccl")
  dag = MultiOutputNode([actor2.bar.bind(dag), actor3.baz.bind(dag)])

In this case, the sender sends to two different receivers.
Only one NcclSendNode is created, and one NcclRecvNode is created per receiver. As before, there is only one TorchTensorNcclChannel.
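The single-send/multi-recv structure can be sketched the same way (again with illustrative names, not Ray's internals): the send node is created once per producer and shared, while each receiver gets its own recv node.

```python
# Illustrative sketch: one shared send node per NCCL producer, one recv
# node per consumer. Plain dicts stand in for the compiled-graph structures.

def splice_p2p(producers_by_consumer):
    """producers_by_consumer: {consumer: [producer, ...]} where producers
    needing NCCL transport are prefixed with 'nccl:'. Returns the rewritten
    edges plus the mapping of created (shared) send nodes."""
    sends = {}  # producer name -> shared send node
    edges = {}
    for consumer, producers in producers_by_consumer.items():
        new_inputs = []
        for p in producers:
            if p.startswith("nccl:"):
                name = p[len("nccl:"):]
                # Create the send node only once per producer.
                sends.setdefault(name, f"send({name})")
                # One recv node per receiving consumer.
                new_inputs.append(f"recv({name})@{consumer}")
            else:
                new_inputs.append(p)
        edges[consumer] = new_inputs
    return edges, sends

edges, sends = splice_p2p({"bar": ["nccl:foo"], "baz": ["nccl:foo"]})
print(sends)  # {'foo': 'send(foo)'}
print(edges)  # {'bar': ['recv(foo)@bar'], 'baz': ['recv(foo)@baz']}
```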

Multiple Senders
with InputNode() as inp:
  branch1 = actor1.foo.bind(inp)
  branch1 = branch1.with_tensor_transport("nccl")
  branch2 = actor2.bar.bind(inp)
  branch2 = branch2.with_tensor_transport("nccl")
  dag = actor3.baz.bind(branch1, branch2)

The receiver receives from two senders.
One NcclSendNode is created per sender, and one NcclRecvNode is created per argument of the receiver. There are two different TorchTensorNcclChannels.

Overlap NCCL Collectives

This is done by prioritizing NCCL operations over non-NCCL operations during scheduling: if both NCCL and non-NCCL operations are ready to be added to the actors' execution schedules, the NCCL operations are always added first.
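A minimal sketch of that priority rule (illustrative, not Ray's actual scheduler): among all ready operations, NCCL ones are popped first, so communication kernels are issued as early as possible and can overlap with subsequent computation.

```python
# Toy priority queue: NCCL operations sort before non-NCCL operations,
# ties broken by execution index.
import heapq

def schedule(ops):
    """ops: list of (exec_index, is_nccl) tuples that are all ready.
    Returns the indices in scheduling order: NCCL first, then by index."""
    heap = [((0 if is_nccl else 1), idx) for idx, is_nccl in ops]
    heapq.heapify(heap)
    order = []
    while heap:
        _, idx = heapq.heappop(heap)
        order.append(idx)
    return order

ready = [(0, False), (1, True), (2, False), (3, True)]
print(schedule(ready))  # [1, 3, 0, 2]
```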

Checks

  • I've signed off every commit (using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Weixin Deng and others added 10 commits October 27, 2024 10:36
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Signed-off-by: Yuhan Ruan <andyubryh@gmail.com>

dengwxn commented Nov 8, 2024

Looks great. Some more TODOs before an initial review as we discussed offline:

  1. Address all the [CL] and [TODO] markers in the code. They are mainly missing comments, unused code blocks, branches to be merged, and variables and functions to be renamed.
  2. Introduce a special op node for NCCL_Collective, similar to the current NCCL_READ and NCCL_WRITE, so that the COMPUTE node does not require NCCL.

cc @dengwxn


dengwxn commented Nov 8, 2024

@anyscalesam Could you help add a go badge to run more CI tests? Thanks!

@AndyUB AndyUB marked this pull request as ready for review November 8, 2024 18:49
This reverts commit 941cb73.

dengwxn commented Nov 9, 2024

> Introduce a special op node for NCCL_Collective similar to the current NCCL_READ and NCCL_WRITE, such that the COMPUTE node does not require NCCL.

After your attempt and a second thought, I think introducing another NCCL_Collective op might not be the best way to separate NCCL and non-NCCL ops. We can skip this and see what others think.


dengwxn commented Nov 9, 2024

As we discussed offline, we should remove all the NCCL_* op nodes; instead, we should create system-level DAG nodes that do NCCL read/write. We will refactor based on this.

dengwxn left a comment

First pass. Structure seems right. Will look into details later.

stephanie-wang (Contributor) left a comment

I think this can be made simpler. Try to think about how you can achieve the following:

  • _NCCLSendNode/_NCCLRecvNode should have the same interface as _CollectiveOperation.
  • If the above is done properly, I believe we can get rid of most of the parts that need to differentiate between send/recv/collective. I.e., there should be only one requires_nccl flag instead of three, and there should be only one kind of DAG op node, a COMPUTE node.

@rkooo567 rkooo567 self-assigned this Nov 12, 2024
@stephanie-wang stephanie-wang self-assigned this Nov 12, 2024
@jcotant1 jcotant1 added the core Issues that should be addressed in Ray Core label Nov 15, 2024
AndyUB added 5 commits April 27, 2025 20:32
…dule_gpu
jjyao (Contributor) commented Apr 29, 2025

@stephanie-wang @AndyUB do you want to continue working on this PR?

stephanie-wang (Contributor) commented:

> @stephanie-wang @AndyUB do you want to continue working on this PR?

Yes, we're still working on this.

stephanie-wang (Contributor) left a comment

Sorry for the delay, I think this is looking close to mergeable.

I'm a bit confused about a few things, though:

  • There are several different collective/p2p operation/node types added. Can you explain how each one is used, i.e., how they reference each other, and do we need all of them?
  • Is there any change in scheduling behavior compared to before?
  • Are there any unit tests that we can add? I.e., tests that don't need to create a full DAG and test the e2e execution.


def __init__(
    self,
    method_args: Tuple[_P2PSendNode],
A Contributor commented:

Why not use the same structure as CollectiveOutputNode, where we create one actual _P2PNode and the send and recv nodes depend on the _P2PNode, via other_args_to_resolve?

Comment on lines +629 to +637
# Convert the abstract P2P operation from scheduling to the executable P2P
# send/recv operation.
if self.requires_nccl_read:
    assert self.nccl_ch is not None
    self.nccl_op = _P2PRecvOperation(self.nccl_ch)
elif self.requires_nccl_write:
    assert self.nccl_ch is not None
    self.nccl_ch.ensure_registered_as_writer()
    self.nccl_op = _P2PSendOperation(self.nccl_ch)
A Contributor commented:

Why do we only need to do this conversion from abstract to executable operation for P2P operations and not for collective operations?

Comment on lines +737 to +739
if input_exc is not None and self.requires_nccl_write:
    input_values = [input_exc]
    input_exc = None
A Contributor commented:

This code can be squashed into the following block.

method_args=(node,),
other_args_to_resolve={
    PARENT_CLASS_NODE_KEY: send_actor_handle,
    P2P_OPERATION_KEY: _P2POperation(),
A Contributor commented:

Where does this get used?

    (3, _DAGNodeOperationType.COMPUTE),
    (3, _DAGNodeOperationType.WRITE),
]
w1_expected_schedule = [0, 1, 2, 5, 3, 4, 7, 6, 8]
A Contributor commented:

Please add a comment explaining what the expected schedule is.

Also, I assume there was no behavior change in this test?


@pytest.mark.skipif(not USE_GPU, reason="Skipping GPU Test")
@pytest.mark.parametrize("overlap_gpu_communication", [False, True])
def test_torch_tensor_nccl_overlap_collective(
A Contributor commented:

Please add comments explaining what each test does.


@pytest.mark.skipif(not USE_GPU, reason="Skipping GPU Test")
@pytest.mark.parametrize("overlap_gpu_communication", [False, True])
def test_torch_tensor_nccl_overlap_send_future_across_actors(
A Contributor commented:

This test seems a bit complicated / unrelated compared to the stated goal? Is there a simpler test that can be run? Or a unit test?


@pytest.mark.skipif(not USE_GPU, reason="Skipping GPU Test")
@pytest.mark.parametrize("overlap_gpu_communication", [False, True])
def test_torch_tensor_nccl_overlap_same_future_multiple_waits(
A Contributor commented:

This test seems a bit complicated / unrelated compared to the stated goal? Is there a simpler test that can be run? Or a unit test?

stephanie-wang pushed a commit that referenced this pull request May 20, 2025
…tions (#53007)

Given an input DAG of SPMD training strategies such as DDP, after DAG
compilation, the first actor generates a different execution schedule
than the others. This is due to the current scheduling policy: when there
are multiple ready operation nodes such as `actor1.compute` (non-NCCL)
and `actor4.collective` (NCCL; for actor1-4, only one collective
operation node eventually becomes ready), the policy does not know that
actor1 has both the non-NCCL `actor1.compute` and the NCCL
`actor4.collective`. This leads to actor1 scheduling `actor1.compute`
first, and actor1-4 scheduling the `collective` next.

We update the policy to push all the collective operation nodes into
the candidates when the last of them becomes ready. In the previous
example, actor1 will have both `actor1.compute` and `actor1.collective`
as candidates. In a DAG of SPMD strategies, all the actors pop either
the `compute` or the `collective` together.

We also update the policy to simply prioritize NCCL operation nodes
over non-NCCL ones, so that NCCL operations are scheduled as soon as
possible. This is safe under the current settings of CUDA streams in
the system, because each NCCL read/write/collective stream only allows
one outstanding NCCL kernel at a time.

We add a test `test_collective_dag.py::test_exec_schedules_ddp` to
verify that the generated schedules are identical across workers for the
DDP strategy. Other tests are updated to reflect the change of
prioritizing NCCL operation nodes over non-NCCL ones.
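The admission rule above can be sketched as follows (names and structures are illustrative, not Ray's internals): a collective's nodes are pushed into every participant's candidate set only once the last member of the group is ready, so all actors reach the same decision point together.

```python
# Toy sketch of "push all collective nodes into candidates when the last
# of them is ready". Plain dicts/sets stand in for the scheduler state.

def admit_collective(group, ready, candidates_by_actor, actor_of):
    """Push every node of a collective `group` into its actor's candidate
    list, but only once ALL members of the group are ready.

    group: set of node ids forming one collective operation.
    ready: set of node ids whose dependencies are satisfied.
    candidates_by_actor: dict mapping actor name -> list of candidate ids.
    actor_of: dict mapping node id -> actor name."""
    if group <= ready:  # last member just became ready
        for node_id in sorted(group):
            candidates_by_actor.setdefault(actor_of[node_id], []).append(node_id)

actor_of = {0: "a1", 1: "a2", 2: "a3"}
cands = {}
admit_collective({0, 1, 2}, {0, 1}, cands, actor_of)
print(cands)  # {} -- not all members ready yet, nothing is admitted
admit_collective({0, 1, 2}, {0, 1, 2}, cands, actor_of)
print(cands)  # {'a1': [0], 'a2': [1], 'a3': [2]}
```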

## Related issue number


This PR is part of #48649 planning to be merged incrementally.

---------

Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
@stephanie-wang stephanie-wang changed the title (WIP) [core][compiled graphs] Unify code paths for NCCL P2P and collectives scheduling [core][compiled graphs] Unify code paths for NCCL P2P and collectives scheduling May 28, 2025
@stephanie-wang stephanie-wang changed the title [core][compiled graphs] Unify code paths for NCCL P2P and collectives scheduling (WIP) [core][compiled graphs] Unify code paths for NCCL P2P and collectives scheduling May 28, 2025
stephanie-wang pushed a commit that referenced this pull request May 29, 2025
…3111)

This PR unifies the scheduling implementation for the NCCL P2P and
collective operation nodes. The logic remains the same: (1) P2P case:
when a NCCL send node is selected, its downstream NCCL recv nodes are
also selected; (2) collective case: when a NCCL collective node is
selected, its corresponding NCCL collective nodes are also selected.
Previously, the NCCL P2P case was implemented by selecting the recv
nodes when a send node is detected, and the NCCL collective case was
implemented by maintaining a set of pending collective nodes.

We unify the implementation for both cases. Concretely, both maintain
a set of (pending) synchronous nodes, named `sync_idxs` and
`pending_sync_idxs`. The synchronous nodes denote the P2P send/recv
nodes or the collective nodes. The NCCL P2P/collective operation is
ready when `sync_idxs == pending_sync_idxs`.

Test cases are updated to reflect the use of synchronous nodes for both
NCCL P2P and collective nodes.
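A hedged sketch of the unified readiness check (the names `sync_idxs` and `pending_sync_idxs` come from the PR description; the surrounding class is illustrative): a group of synchronous nodes, whether a P2P send/recv pair or all participants of a collective, becomes ready only when every member has been marked pending.

```python
# Toy model of the unified sync-node readiness check.

class SyncGroup:
    def __init__(self, sync_idxs):
        self.sync_idxs = set(sync_idxs)   # all synchronous nodes in the group
        self.pending_sync_idxs = set()    # nodes marked pending so far

    def mark_pending(self, idx):
        assert idx in self.sync_idxs
        self.pending_sync_idxs.add(idx)

    @property
    def ready(self):
        # The NCCL P2P/collective operation is ready exactly when the
        # pending set equals the full set of synchronous nodes.
        return self.sync_idxs == self.pending_sync_idxs

group = SyncGroup([10, 11, 12])  # e.g. one allreduce across three actors
group.mark_pending(10)
group.mark_pending(11)
print(group.ready)  # False -- node 12 has not been seen yet
group.mark_pending(12)
print(group.ready)  # True -- all members pending, operation is ready
```

The same check covers the P2P case with a two-element group (one send node, one recv node), which is what removes the need for separate send/recv and collective bookkeeping.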

This PR is a follow-up to #53007. Both are parts of #48649, which is
planned to be merged incrementally.
---------

Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
github-actions bot commented

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jun 12, 2025
github-actions bot commented

This pull request has been automatically closed because there has been no more activity in the 14 days
since being marked stale.

Please feel free to reopen or open a new pull request if you'd still like this to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for your contribution!

@github-actions github-actions bot closed this Jun 26, 2025

Labels

  • community-contribution: Contributed by the community
  • core: Issues that should be addressed in Ray Core
  • go: add ONLY when ready to merge, run all tests
  • stale: The issue is stale. It will be closed within 7 days unless there is further conversation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants