[core][gpu objects] Integrate single-controller collective APIs with GPU objects #53720
Merged
stephanie-wang merged 40 commits into ray-project:master on Jun 16, 2025
Signed-off-by: Stephanie wang <smwang@cs.washington.edu>
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org> Signed-off-by: Stephanie Wang <smwang@cs.washington.edu>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Signed-off-by: Stephanie Wang <smwang@cs.washington.edu>
…ct-collective-integration
edoakes
approved these changes
Jun 11, 2025
Comment on lines +501 to +504
    def gpu_object_manager(self) -> "ray._private.gpu_object_manager.GPUObjectManager":
        if self._gpu_object_manager is None:
            from ray._private.gpu_object_manager import GPUObjectManager
            self._gpu_object_manager = GPUObjectManager()
Collaborator
Why is this made lazy?
Contributor
Author
Ah, this is to avoid pulling in dependencies that GPUObjectManager needs but that Ray doesn't usually require (currently torch).
Collaborator
Got it. Would be nice if we came up with a more structured way to quarantine soft dependencies so we don't need lazy imports for first party code. I'll play around with it at some point.
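The lazy-initialization pattern discussed in this thread can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `LazyOwner` and its `decoder` property are hypothetical names, and the standard-library `json.JSONDecoder` stands in for a class such as `GPUObjectManager` whose module pulls in an optional dependency.

```python
from typing import Any, Optional


class LazyOwner:
    """Hypothetical owner class; stands in for whatever holds the manager."""

    def __init__(self) -> None:
        # Nothing heavy is imported or constructed at init time.
        self._decoder: Optional[Any] = None

    @property
    def decoder(self) -> Any:
        if self._decoder is None:
            # Deferred import: runs only on first access, so merely
            # creating a LazyOwner never triggers it. json is a stand-in
            # for a module with an optional dependency.
            from json import JSONDecoder

            self._decoder = JSONDecoder()
        return self._decoder


owner = LazyOwner()
assert owner._decoder is None          # not created yet
assert owner.decoder is owner.decoder  # created once, then reused
```

The trade-off is that an import failure surfaces on first use of the property rather than at startup, which is the behavior wanted for a soft dependency.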
def test_p2p(ray_start_regular):
    # TODO(swang): Add tests for mocked NCCL that can run on CPU-only machines.
kevin85421
approved these changes
Jun 13, 2025
# Create test tensor
tensor = torch.tensor([1, 2, 3])
gpu_ref = src_actor.echo_cuda.remote(tensor)

if not _TORCH_AVAILABLE:
    raise ImportError(
        "`tensor_transport` requires PyTorch. "
        "Please install torch with 'pip install torch' to use this feature."
Member
As I recall, `pip install torch` installs the CPU version of PyTorch. Maybe we can just ask users to install torch without giving a specific command.
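One way to act on this suggestion is a soft-dependency check whose error message names the missing package without prescribing an install command. The sketch below is an assumption about how that could look, not the code that was merged; `require_module` is a hypothetical helper, and it only handles top-level module names.

```python
import importlib.util


def require_module(module_name: str, feature: str) -> None:
    """Raise ImportError if a top-level module is not installed.

    Hypothetical helper: the message deliberately avoids a concrete
    install command, since the right build can depend on the hardware.
    """
    # find_spec returns None for a missing top-level module and does not
    # import the module itself.
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"`{feature}` requires the `{module_name}` package, which is "
            "not installed. Please install a build that matches your "
            "hardware to use this feature."
        )


# A module that exists passes silently.
require_module("json", "example_feature")
```

A caller would invoke something like `require_module("torch", "tensor_transport")` at the point where the feature is first used.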
elliot-barn
pushed a commit
that referenced
this pull request
Jun 18, 2025
…GPU objects (#53720)

Adds integration between the single-controller collective APIs introduced in #53319 and the GPU objects feature prototyped in #52938. Actor collectives created through `ray.experimental.collective.create_collective_group` will now be automatically used if a task declares a tensor transport other than the default OBJECT_STORE. This also adds support for allocating the torch tensors on the correct device (GPU for NCCL and CPU for GLOO).

See updates in test_gpu_objects.py for examples.

---------

Signed-off-by: Stephanie wang <smwang@cs.washington.edu>
Signed-off-by: Stephanie Wang <smwang@cs.washington.edu>
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
minerharry
pushed a commit
to minerharry/ray
that referenced
this pull request
Jun 27, 2025
…GPU objects (ray-project#53720)

Adds integration between the single-controller collective APIs introduced in ray-project#53319 and the GPU objects feature prototyped in ray-project#52938. Actor collectives created through `ray.experimental.collective.create_collective_group` will now be automatically used if a task declares a tensor transport other than the default OBJECT_STORE. This also adds support for allocating the torch tensors on the correct device (GPU for NCCL and CPU for GLOO).

See updates in test_gpu_objects.py for examples.

---------

Signed-off-by: Stephanie wang <smwang@cs.washington.edu>
Signed-off-by: Stephanie Wang <smwang@cs.washington.edu>
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
elliot-barn
pushed a commit
that referenced
this pull request
Jul 2, 2025
Why are these changes needed?
Adds integration between the single-controller collective APIs introduced in #53319 and the GPU objects feature prototyped in #52938. Actor collectives created through `ray.experimental.collective.create_collective_group` will now be automatically used if a task declares a tensor transport other than the default OBJECT_STORE. This also adds support for allocating the torch tensors on the correct device (GPU for NCCL and CPU for GLOO).

See updates in test_gpu_objects.py for examples.
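The rule described above (use a collective group only for non-default transports, and place tensors on GPU for NCCL and on CPU for GLOO) can be pictured with a small pure-Python sketch. Everything here is illustrative: the `TensorTransport` enum, `needs_collective_group`, and `device_for_transport` are hypothetical names chosen for this example and are not taken from Ray's code. The strings `"cuda"` and `"cpu"` are the usual PyTorch device type names.

```python
from enum import Enum


class TensorTransport(Enum):
    """Hypothetical enum mirroring the transports named in the description."""

    OBJECT_STORE = "object_store"  # default: no collective group involved
    NCCL = "nccl"
    GLOO = "gloo"


# Assumed mapping from transport to the device type tensors should live on.
_DEVICE_FOR_TRANSPORT = {
    TensorTransport.NCCL: "cuda",
    TensorTransport.GLOO: "cpu",
}


def needs_collective_group(transport: TensorTransport) -> bool:
    # Only non-default transports go through a collective group.
    return transport is not TensorTransport.OBJECT_STORE


def device_for_transport(transport: TensorTransport) -> str:
    # The default transport has no collective backend, so no device is implied.
    if not needs_collective_group(transport):
        raise ValueError("OBJECT_STORE does not use a collective backend")
    return _DEVICE_FOR_TRANSPORT[transport]


assert device_for_transport(TensorTransport.NCCL) == "cuda"
assert device_for_transport(TensorTransport.GLOO) == "cpu"
```

For the actual API usage, the tests in test_gpu_objects.py remain the authoritative examples.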