
[core][gpu objects] Integrate single-controller collective APIs with GPU objects#53720

Merged
stephanie-wang merged 40 commits into ray-project:master from stephanie-wang:gpu-object-collective-integration
Jun 16, 2025

Conversation

Contributor

@stephanie-wang commented Jun 10, 2025

Why are these changes needed?

Adds integration between the single-controller collective APIs introduced in #53319 and the GPU objects feature prototyped in #52938. Actor collectives created through ray.experimental.collective.create_collective_group will now be automatically used if a task declares a tensor transport other than the default OBJECT_STORE. This also adds support for allocating the torch tensors on the correct device (GPU for NCCL and CPU for GLOO).

See updates in test_gpu_objects.py for examples.
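The device-selection rule described above (GPU for NCCL, CPU for GLOO) can be sketched as a small lookup. This is an illustrative sketch only; `TensorTransport` and `resolve_device` are hypothetical names, not Ray's actual API.

```python
from enum import Enum


class TensorTransport(Enum):
    """Hypothetical enum mirroring the transports mentioned above."""
    OBJECT_STORE = "object_store"
    NCCL = "nccl"
    GLOO = "gloo"


def resolve_device(transport: TensorTransport) -> str:
    """Return the torch device string a tensor should be allocated on."""
    if transport == TensorTransport.NCCL:
        # NCCL communicates between GPU buffers.
        return "cuda"
    # GLOO and the default object-store path stay in host memory.
    return "cpu"
```

In the actual PR this decision is made for the task's declared `tensor_transport` when a collective group created via `ray.experimental.collective.create_collective_group` is in use.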

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

stephanie-wang and others added 27 commits May 23, 2025 16:32
Comment on lines +501 to +504
def gpu_object_manager(self) -> "ray._private.gpu_object_manager.GPUObjectManager":
    if self._gpu_object_manager is None:
        from ray._private.gpu_object_manager import GPUObjectManager

        self._gpu_object_manager = GPUObjectManager()
    return self._gpu_object_manager
Collaborator
why's this made to be lazy?

Contributor Author

Ah, this is to avoid pulling in dependencies needed by GPUObjectManager that aren't usually required by Ray (currently torch).

Collaborator

Got it. It would be nice if we came up with a more structured way to quarantine soft dependencies so we don't need lazy imports for first-party code. I'll play around with it at some point.
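The lazy-initialization pattern being discussed can be sketched as a property that defers the heavy import until first access, so importing the package itself never pulls in the optional dependency. The `Worker` class and the stand-in import below are illustrative, not Ray's real code.

```python
class Worker:
    """Minimal sketch of a lazily initialized manager attribute."""

    def __init__(self):
        self._gpu_object_manager = None

    @property
    def gpu_object_manager(self):
        if self._gpu_object_manager is None:
            # Deferred import: the module (and its heavy dependencies,
            # e.g. torch in the real GPUObjectManager) is loaded only
            # on first access. OrderedDict is a stand-in for the real class.
            from collections import OrderedDict

            self._gpu_object_manager = OrderedDict()
        return self._gpu_object_manager
```

Subsequent accesses return the same cached instance, so the import cost is paid at most once per process.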



def test_p2p(ray_start_regular):
# TODO(swang): Add tests for mocked NCCL that can run on CPU-only machines.
Collaborator

yes please!
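A mocked transport for CPU-only CI, as requested in the TODO above, might look roughly like this: a fake communicator that moves data through an in-process queue instead of issuing GPU collective calls. `FakeCommunicator` is a hypothetical sketch, not part of Ray.

```python
import queue


class FakeCommunicator:
    """Stands in for a NCCL p2p channel: send/recv over a local queue."""

    def __init__(self):
        self._chan = queue.Queue()

    def send(self, tensor):
        # A real transport would issue ncclSend on a CUDA stream here.
        self._chan.put(tensor)

    def recv(self):
        # A real transport would issue ncclRecv and synchronize.
        return self._chan.get()


comm = FakeCommunicator()
comm.send([1, 2, 3])  # a plain list stands in for a torch tensor
assert comm.recv() == [1, 2, 3]
```

Swapping such a mock in behind the transport interface would let `test_p2p` exercise the GPU-object plumbing on CPU-only machines.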

@stephanie-wang stephanie-wang enabled auto-merge (squash) June 11, 2025 23:44
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Jun 11, 2025
@github-actions github-actions bot disabled auto-merge June 12, 2025 17:39
Member

@kevin85421 left a comment

Looks great!


# Create test tensor
tensor = torch.tensor([1, 2, 3])
gpu_ref = src_actor.echo_cuda.remote(tensor)
Member

what is echo_cuda?

if not _TORCH_AVAILABLE:
    raise ImportError(
        "`tensor_transport` requires PyTorch. "
        "Please install torch with 'pip install torch' to use this feature."
    )
Member

As far as I remember, `pip install torch` installs the CPU-only build of PyTorch. Maybe we should just ask users to install torch without giving a specific install command.

@stephanie-wang stephanie-wang enabled auto-merge (squash) June 16, 2025 21:37
@stephanie-wang stephanie-wang merged commit 93acaf1 into ray-project:master Jun 16, 2025
5 of 6 checks passed
@stephanie-wang stephanie-wang deleted the gpu-object-collective-integration branch June 17, 2025 01:16
elliot-barn pushed a commit that referenced this pull request Jun 18, 2025
…GPU objects (#53720)

Adds integration between the single-controller collective APIs
introduced in #53319 and the GPU objects feature prototyped in #52938.
Actor collectives created through
`ray.experimental.collective.create_collective_group` will now be
automatically used if a task declares a tensor transport other than the
default OBJECT_STORE. This also adds support for allocating the torch
tensors on the correct device (GPU for NCCL and CPU for GLOO).

See updates in test_gpu_objects.py for examples.
---------

Signed-off-by: Stephanie wang <smwang@cs.washington.edu>
Signed-off-by: Stephanie Wang <smwang@cs.washington.edu>
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
minerharry pushed a commit to minerharry/ray that referenced this pull request Jun 27, 2025
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025

Labels

go add ONLY when ready to merge, run all tests


3 participants