
[core][gpu objects] Integrate single-controller collective APIs with GPU objects#53720

Merged
stephanie-wang merged 40 commits into ray-project:master from stephanie-wang:gpu-object-collective-integration
Jun 16, 2025

Conversation

Contributor

@stephanie-wang commented Jun 10, 2025

Why are these changes needed?

Adds integration between the single-controller collective APIs introduced in #53319 and the GPU objects feature prototyped in #52938. Actor collectives created through ray.experimental.collective.create_collective_group will now be automatically used if a task declares a tensor transport other than the default OBJECT_STORE. This also adds support for allocating the torch tensors on the correct device (GPU for NCCL and CPU for GLOO).

See updates in test_gpu_objects.py for examples.
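The device-selection rule described above (GPU for NCCL, CPU for GLOO) can be sketched as a small lookup. This is an illustrative sketch only; `TensorTransport` and `resolve_device` are hypothetical names, not Ray's actual API.

```python
from enum import Enum


class TensorTransport(Enum):
    """Hypothetical enum mirroring the transports mentioned above."""
    OBJECT_STORE = "object_store"
    NCCL = "nccl"
    GLOO = "gloo"


def resolve_device(transport: TensorTransport) -> str:
    """Return the torch device string a tensor should be allocated on."""
    if transport == TensorTransport.NCCL:
        # NCCL communicates between GPU buffers.
        return "cuda"
    # GLOO and the default object-store path stay in host memory.
    return "cpu"
```

In the actual PR this decision is made for the task's declared `tensor_transport` when a collective group created via `ray.experimental.collective.create_collective_group` is in use.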

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

stephanie-wang and others added 27 commits May 23, 2025 16:32
Comment on lines +501 to +504
def gpu_object_manager(self) -> "ray._private.gpu_object_manager.GPUObjectManager":
    if self._gpu_object_manager is None:
        from ray._private.gpu_object_manager import GPUObjectManager

        self._gpu_object_manager = GPUObjectManager()
    return self._gpu_object_manager
Collaborator
why's this made to be lazy?

Contributor Author

Ah, this is to avoid pulling in dependencies needed by GPUObjectManager that aren't usually required by Ray (currently torch).

Collaborator

Got it. It would be nice if we came up with a more structured way to quarantine soft dependencies so we don't need lazy imports for first-party code. I'll play around with it at some point.
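The lazy-initialization pattern being discussed can be sketched as a property that defers the heavy import until first access, so importing the package itself never pulls in the optional dependency. The `Worker` class and the stand-in import below are illustrative, not Ray's real code.

```python
class Worker:
    """Minimal sketch of a lazily initialized manager attribute."""

    def __init__(self):
        self._gpu_object_manager = None

    @property
    def gpu_object_manager(self):
        if self._gpu_object_manager is None:
            # Deferred import: the module (and its heavy dependencies,
            # e.g. torch in the real GPUObjectManager) is loaded only
            # on first access. OrderedDict is a stand-in for the real class.
            from collections import OrderedDict

            self._gpu_object_manager = OrderedDict()
        return self._gpu_object_manager
```

Subsequent accesses return the same cached instance, so the import cost is paid at most once per process.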



def test_p2p(ray_start_regular):
# TODO(swang): Add tests for mocked NCCL that can run on CPU-only machines.
Collaborator

yes please!
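A mocked transport for CPU-only CI, as requested in the TODO above, might look roughly like this: a fake communicator that moves data through an in-process queue instead of issuing GPU collective calls. `FakeCommunicator` is a hypothetical sketch, not part of Ray.

```python
import queue


class FakeCommunicator:
    """Stands in for a NCCL p2p channel: send/recv over a local queue."""

    def __init__(self):
        self._chan = queue.Queue()

    def send(self, tensor):
        # A real transport would issue ncclSend on a CUDA stream here.
        self._chan.put(tensor)

    def recv(self):
        # A real transport would issue ncclRecv and synchronize.
        return self._chan.get()


comm = FakeCommunicator()
comm.send([1, 2, 3])  # a plain list stands in for a torch tensor
assert comm.recv() == [1, 2, 3]
```

Swapping such a mock in behind the transport interface would let `test_p2p` exercise the GPU-object plumbing on CPU-only machines.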

@stephanie-wang stephanie-wang enabled auto-merge (squash) June 11, 2025 23:44
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Jun 11, 2025
@github-actions github-actions bot disabled auto-merge June 12, 2025 17:39
Member

@kevin85421 left a comment

Looks great!


# Create test tensor
tensor = torch.tensor([1, 2, 3])
gpu_ref = src_actor.echo_cuda.remote(tensor)
Member

what is echo_cuda?

if not _TORCH_AVAILABLE:
    raise ImportError(
        "`tensor_transport` requires PyTorch. "
        "Please install torch with 'pip install torch' to use this feature."
    )
Member

As far as I remember, `pip install torch` installs the CPU-only build of PyTorch. Maybe we should just ask users to install torch without giving a specific install command.

@stephanie-wang stephanie-wang enabled auto-merge (squash) June 16, 2025 21:37
@stephanie-wang stephanie-wang merged commit 93acaf1 into ray-project:master Jun 16, 2025
5 of 6 checks passed
@stephanie-wang stephanie-wang deleted the gpu-object-collective-integration branch June 17, 2025 01:16
elliot-barn pushed a commit that referenced this pull request Jun 18, 2025
…GPU objects (#53720)

Adds integration between the single-controller collective APIs
introduced in #53319 and the GPU objects feature prototyped in #52938.
Actor collectives created through
`ray.experimental.collective.create_collective_group` will now be
automatically used if a task declares a tensor transport other than the
default OBJECT_STORE. This also adds support for allocating the torch
tensors on the correct device (GPU for NCCL and CPU for GLOO).

See updates in test_gpu_objects.py for examples.
---------

Signed-off-by: Stephanie wang <smwang@cs.washington.edu>
Signed-off-by: Stephanie Wang <smwang@cs.washington.edu>
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
minerharry pushed a commit to minerharry/ray that referenced this pull request Jun 27, 2025
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025

Labels

go add ONLY when ready to merge, run all tests


3 participants