[core][compiled graphs] Add CPU-based NCCL communicator for development #48440
stephanie-wang merged 49 commits into ray-project:master from tfsingh:py-ts/cpu-nccl
Conversation
…er of allreduce ops; minor change: use defaultdict on line 63 for simplicity; added line 94 for unsupported collective ops
Could you also add a test for this?
…e op of CPUNcclGroup
…ce self.get_communicator in CPUNcclGroup
Just a heads up that @anyadontfly and I are writing some e2e tests with the compiled dag API, and should have them done by tomorrow.
…rements in e2e run; reverted changes in torch_tensor_type.py; added 1 e2e test for p2p and 7 e2e tests for allreduce
stephanie-wang left a comment
Nice, this looks pretty good! Left some comments for cleanup/clarification but overall the structure looks good.
self.collective_data: Dict[int, List[torch.Tensor]] = defaultdict(list)
# Buffer for the number of actors seen, each entry is one p2p op.
self.num_actors_seen = defaultdict(int)
# Number of actors who have read the result, and are about the exit the function.
Suggested change:
- # Number of actors who have read the result, and are about the exit the function.
+ # Number of actors who have read the result, and are about to exit the function.
self.communicators.add(comm)
...
received_tensor = ray.get(comm.wait_p2p.remote(self.num_ops[comm_key]))
assert (
In this case you can probably just directly return the received_tensor (allocator is needed for cases where a receive buffer needs to be allocated before the recv happens).
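A minimal sketch of the distinction being made here; the function names are illustrative, not the actual channel API:

import torch

def recv_nccl_style(allocator, shape, dtype):
    # NCCL writes into a pre-allocated buffer, so an allocator is required
    # to create the destination tensor before the recv runs.
    buf = allocator(shape, dtype)
    # ... the transport then fills `buf` in place ...
    return buf

def recv_cpu_style(received_tensor: torch.Tensor) -> torch.Tensor:
    # The CPU transport already hands back a complete tensor object, so it
    # can be returned directly without allocating a destination buffer.
    return received_tensor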
assert (
    ray.get_gpu_ids()
), "Actors participating in NCCL group must have at least one GPU assigned"
if not custom_nccl_group or not isinstance(custom_nccl_group, CPUNcclGroup):
Suggested change:
- if not custom_nccl_group or not isinstance(custom_nccl_group, CPUNcclGroup):
+ if not (custom_nccl_group and isinstance(custom_nccl_group, CPUNcclGroup)):
nit, a bit more readable with a single not.
return result
...
class CPUCommunicator:
Can we make this inherit from GPUCommunicator?
One suggestion is to rename the GPUCommunicator into a generic Communicator or DeviceCommunicator, and you can have it return a string of the expected resource type that actors in the group should have.
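A rough sketch of what that rename could look like; the interface shown here is an assumption based on this thread, not the final API:

from abc import ABC, abstractmethod

class Communicator(ABC):
    """Generic communicator interface shared by the NCCL (GPU) and CPU transports."""

    @abstractmethod
    def get_device_type(self) -> str:
        """Return the resource type ("gpu" or "cpu") that actors in the group
        are expected to have."""
        raise NotImplementedError

class CPUCommunicator(Communicator):
    def get_device_type(self) -> str:
        return "cpu"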
)
...
def _do_init_cpu_group(
Can you see if we can reuse the existing _do_init_nccl_group instead?
python/ray/dag/collective_node.py (outdated)
recv_buf = torch.empty_like(send_buf)
nccl_group.allreduce(send_buf, recv_buf, self._op)
ctx = ChannelContext.get_current()
if ctx.nccl_groups:
Same here, I think you can restructure this to just take the "default group", whether it's NCCL or CPU, and then the code inside the if-else branches is the same for both.
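For example, the lookup could be factored out so the call site no longer branches on the transport; the helper name and lookup logic below are assumptions, not the merged code:

def _get_default_group(ctx, group_id):
    # Return whichever communicator registered this group ID, NCCL or CPU;
    # the allreduce call site is then identical for both transports.
    if group_id in ctx.nccl_groups:
        return ctx.nccl_groups[group_id]
    return ctx.cpu_groups[group_id]

# Call site, same for both transports (attribute names assumed):
# group = _get_default_group(ChannelContext.get_current(), group_id)
# group.allreduce(send_buf, recv_buf, op)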
self.nccl_groups: Dict[str, "GPUCommunicator"] = {}
# Used for the torch.Tensor CPU transport.
self.cpu_groups: Dict[str, "CPUCommunicator"] = {}
# group ID -> Communicator
Option 1: self.device_groups: Dict[str, Communicator]
# resource label -> group ID -> Communicator
Option 2: self.device_groups: Dict[str, Dict[str, Communicator]]
Can we keep the name self.nccl_groups?
…icator recv_stream and send_stream raise NotImplementedError
stephanie-wang left a comment
Thanks, this looks great! A couple of minor comments about naming, then we can merge it. Let's have get_device_type return "gpu" instead of "nccl".
@abstractmethod
def get_device_type() -> str:
    """
    Return the type of the communicator (nccl or cpu).
Suggested change:
- Return the type of the communicator (nccl or cpu).
+ Return the type of the communicator (gpu or cpu).
self._comm.destroy()
...
def get_device_type(self) -> str:
    return "nccl"
| return "nccl" | |
| return "gpu" |
"nccl" is the transport name but "gpu" is the device that each actor is expected to have.
You could add a get_transport_name and that can return NCCL instead.
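A small sketch of that split; get_transport_name is a hypothetical addition suggested here, not an existing method, and the class below is just a stand-in for the NCCL-backed communicator in the diff:

class _NcclGroup:
    def get_device_type(self) -> str:
        # Resource that each participating actor is expected to have.
        return "gpu"

    def get_transport_name(self) -> str:
        # Underlying transport used to move tensors between actors.
        return "nccl"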
python/ray/dag/compiled_dag_node.py (outdated)
# This is set to the specified custom nccl group
# if there exists a type hint of `transport=nccl_group`.
- self._custom_nccl_group_p2p: Optional[GPUCommunicator] = None
+ self._custom_nccl_group_p2p: Optional[Communicator] = None
For consistency, it would be good to replace-all nccl_group with communicator.
Got it. Also, we have functions for initializing and destroying CPUCommunicator in torch_tensor_nccl_channel.py. Do we have to change the file name of torch_tensor_nccl_channel.py, or should we put those functions in a separate file?
Signed-off-by: tfsingh <105320310+tfsingh@users.noreply.github.com>
Why are these changes needed?
This allows developers to debug DAGs with collective ops on CPU. Currently we use a Ray actor to perform the allreduce.
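A minimal sketch of the actor-based CPU allreduce described above, under the assumption that all ranks call into a single barrier-style actor; the class and method names are illustrative, not the PR's exact implementation:

import asyncio
from collections import defaultdict
from typing import Dict, List

import ray
import torch

@ray.remote
class CPUCommActor:
    """Barrier-style actor: every rank sends its tensor, all ranks get the sum."""

    def __init__(self, world_size: int):
        self.world_size = world_size
        self.data: Dict[int, List[torch.Tensor]] = defaultdict(list)
        self.results: Dict[int, torch.Tensor] = {}
        self.done: Dict[int, asyncio.Event] = defaultdict(asyncio.Event)

    async def allreduce(self, op_id: int, tensor: torch.Tensor) -> torch.Tensor:
        self.data[op_id].append(tensor)
        if len(self.data[op_id]) == self.world_size:
            # Last rank to arrive computes the reduction and releases the others.
            self.results[op_id] = torch.stack(self.data[op_id]).sum(dim=0)
            self.done[op_id].set()
        await self.done[op_id].wait()
        return self.results[op_id]

# Two "ranks" contribute tensors under the same op ID and both receive the sum.
comm = CPUCommActor.remote(world_size=2)
refs = [
    comm.allreduce.remote(0, torch.ones(3)),
    comm.allreduce.remote(0, torch.full((3,), 2.0)),
]
print(ray.get(refs))  # both entries are tensor([3., 3., 3.])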
Related issue number
Closes #47936