[core] Add single-controller API for ray.util.collective and torch gloo backend (#53319)
Conversation
Signed-off-by: Stephanie wang <smwang@cs.washington.edu>
    """
    Return all actor handles in this communicator.
    """
    return self._actors[:]
is there any reason to do shallow copy?
Lets the caller modify the list.
This can easily become a gotcha; prefer a more explicit pattern (caller gets the list, modifies it, calls a setter), or else make the copy semantics clear in the docstring.
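To make the tradeoff discussed above concrete, here is a minimal sketch (a hypothetical `Communicator` class with plain strings standing in for actor handles, not the actual implementation) of what the shallow copy does and does not protect against:

```python
class Communicator:
    """Hypothetical stand-in for the real communicator class."""

    def __init__(self, actors):
        self._actors = list(actors)

    @property
    def actors(self):
        # Shallow copy: the caller gets its own list object, so
        # list-level mutations (append/remove/sort) do not alter the
        # communicator's membership. The elements themselves are
        # still shared references.
        return self._actors[:]


comm = Communicator(["actor_a", "actor_b"])
view = comm.actors
view.append("actor_c")  # mutates only the caller's copy
print(comm.actors)      # ['actor_a', 'actor_b']
```

The gotcha the review points at: a caller may append to `view` expecting to change group membership, and silently won't.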
    @staticmethod
    def get() -> "RemoteCommunicatorManager":
        global _remote_communicator_manager
        with _remote_communicator_manager_lock:
Why do we need a lock here? If we want to avoid race conditions, should we also add a lock to add_remote_communicator and remove_remote_communicator?
yes it seems so. unclear if this interface is intended to be thread safe or not though
also, if it's always meant to be a global singleton, defining the interface as a set of functions is more natural
The lock prevents a race between checking whether _remote_communicator_manager is None and setting it. The other methods are already thread-safe because they go through Python's builtin dictionary operations.
also, if it's always meant to be a global singleton, defining the interface as a set of functions is more natural
The user-facing APIs are a set of functions. This singleton class is an implementation detail.
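As a sketch of the pattern being discussed (names simplified; this is not the actual implementation), the lock closes the window between the None check and the assignment:

```python
import threading

_manager = None
_manager_lock = threading.Lock()


class RemoteCommunicatorManager:
    """Simplified sketch of a lock-guarded lazy singleton."""

    @staticmethod
    def get() -> "RemoteCommunicatorManager":
        global _manager
        # Without the lock, two threads could both observe _manager is
        # None and each construct an instance; holding the lock makes
        # the check-then-assign sequence atomic.
        with _manager_lock:
            if _manager is None:
                _manager = RemoteCommunicatorManager()
            return _manager
```

Every caller then observes the same instance, which is the property the singleton relies on.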
        # Find all collective groups that the given actors are a subset
        # of, with the matching backend if provided.
        for collective in self._remote_communicators.values():
            if actors.issubset(set(collective.actors)):
Avoid converting the list to a set on every loop iteration.
We're not expecting to have many collectives to iterate through right now and it can be optimized later if it becomes a scalability issue.
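If this lookup ever does become hot, one low-effort fix is to cache the set form on the communicator so the loop performs no per-iteration conversion. A hypothetical sketch (plain strings standing in for actor handles; names are illustrative, not the real classes):

```python
class RemoteCommunicator:
    """Hypothetical stand-in; caches the set form of its actors."""

    def __init__(self, actors):
        self.actors = list(actors)
        # Built once at construction instead of once per lookup.
        self.actor_set = frozenset(actors)


def get_collective_groups(actors, communicators):
    # Convert the query to a set once, then test against cached sets.
    query = set(actors)
    return [c for c in communicators if query.issubset(c.actor_set)]


groups = [RemoteCommunicator(["a", "b", "c"]), RemoteCommunicator(["a", "b"])]
print([g.actors for g in get_collective_groups(["a", "b"], groups)])
# [['a', 'b', 'c'], ['a', 'b']]
```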
        popped = None
        return popped

    def get_collective_groups(
should we have an API:

    def get_collective_group_by_name(self, name: str):
        ...
We can add this later if we need it.
        self._backend = Backend(backend)

    def get_rank(self, actor: ray.actor.ActorHandle):
        for i, a in enumerate(self._actors):
Should we maintain an actor-to-index Dict[ActorHandle, int] instead of scanning the list?
We can add it later.
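For reference, the suggested lookup table is a small change: build the dict once alongside the list and fall back to -1 (or raise) for unknown actors. A hypothetical sketch with hashable placeholders standing in for actor handles:

```python
class Communicator:
    """Hypothetical sketch of the actor-to-rank dict suggestion."""

    def __init__(self, actors):
        self._actors = list(actors)
        # O(1) rank lookup built once; assumes handles are hashable.
        self._rank = {a: i for i, a in enumerate(self._actors)}

    def get_rank(self, actor) -> int:
        # Returns -1 for actors outside the group; raising KeyError
        # would be the stricter alternative.
        return self._rank.get(actor, -1)
```

This trades a little memory and construction time for constant-time `get_rank`, versus the linear scan in the current code.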
import ray


def find_free_port():
This may cause issues when users are using a Kubernetes service mesh, which requires knowing the communication ports in advance. It’s probably fine for now—we can wait to update it until users report problems.
Most users don't use service mesh.
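For context, the usual implementation of such a helper binds to port 0 and lets the OS pick an ephemeral port; the caveat (beyond the service-mesh point above) is that the port is only guaranteed free at bind time, so another process can grab it before the real server binds. A sketch of the common pattern, not necessarily the exact code in this PR:

```python
import socket


def find_free_port() -> int:
    # Port 0 asks the kernel for any free ephemeral port; we read the
    # assigned port back and release the socket immediately. A race is
    # possible between this call and the port's actual use.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]
```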
edoakes left a comment
The API/usage makes sense to me. Had many of the same questions as Kai-Hsun on the implementation.
python/ray/util/collective/collective_group/torch_gloo_collective_group.py
        dist.init_process_group(
            backend="gloo", init_method="env://", world_size=world_size, rank=rank
        )
Will torch.distributed give a useful error message if a user tries to instantiate two groups in the same process? (I'm assuming this would be an error because the process group is a global singleton)
It should, but we also prevent this from happening right now, as long as you use the top-level collective APIs.
    """
    Create a collective group on the given list of actors. If this function

Suggested change:

    """Create a collective group on the given list of actors.

    If this function
to follow the Google style guide (summary line on the opening quotes, blank line before the body). I think we have a linter that's supposed to tell you to do this now 🤔
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org> Signed-off-by: Stephanie Wang <smwang@cs.washington.edu>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Signed-off-by: Stephanie Wang <smwang@cs.washington.edu>
Thanks for the review, this is ready for another round. Not sure why my linter's not working...
[core] Add single-controller API for ray.util.collective and torch gloo backend (ray-project#53319)

Adds single-controller APIs (APIs that can be called from the driver) for creating collectives on a group of actors using `ray.util.collective`. These APIs are currently under `ray.experimental.collective` as they are experimental and to avoid potential conflicts with `ray.util.collective`. See test_experimental_collective::test_api_basic for API usage.

- create_collective_group
- destroy_collective_group
- get_collective_groups

Also adds a ray.util.collective backend based on torch.distributed gloo, for convenient testing on CPUs. While ray.util.collective has a pygloo backend, this backend requires pygloo to be installed, and pygloo doesn't seem to be supported on latest versions of Python.

Signed-off-by: Stephanie Wang <smwang@cs.washington.edu>
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…GPU objects (#53720)

Adds integration between the single-controller collective APIs introduced in #53319 and the GPU objects feature prototyped in #52938. Actor collectives created through `ray.experimental.collective.create_collective_group` will now be automatically used if a task declares a tensor transport other than the default OBJECT_STORE. This also adds support for allocating the torch tensors on the correct device (GPU for NCCL and CPU for GLOO). See updates in test_gpu_objects.py for examples.

Signed-off-by: Stephanie Wang <smwang@cs.washington.edu>
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Why are these changes needed?
Adds single-controller APIs (APIs that can be called from the driver) for creating collectives on a group of actors using `ray.util.collective`. These APIs are currently under `ray.experimental.collective` as they are experimental and to avoid potential conflicts with `ray.util.collective`. See test_experimental_collective::test_api_basic for API usage.

Also adds a ray.util.collective backend based on torch.distributed gloo, for convenient testing on CPUs. While ray.util.collective has a pygloo backend, this backend requires pygloo to be installed, and pygloo doesn't seem to be supported on latest versions of Python.
Checks

- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.