[core][gpu-objects] GPU Objects POC #52938
Merged
stephanie-wang merged 72 commits into ray-project:master on May 29, 2025
Signed-off-by: Stephanie wang <smwang@cs.washington.edu>
Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
This reverts commit 2d166c2.
Contributor
stephanie-wang
left a comment
Looking good so far!
…sion Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
Member
Author
Hi @edoakes, I’ve updated the type from a string to a Protobuf enum. Would you have a chance to take another look? I’ll be adding some tests today.
Member
Author
I added a test to cover both non-trivial public functions in
kevin85421
added a commit
to kevin85421/ray
that referenced
this pull request
Jun 6, 2025
This reverts commit 2ff7298.
jjyao
pushed a commit
that referenced
this pull request
Jun 6, 2025
8 tasks
stephanie-wang
added a commit
that referenced
this pull request
Jun 16, 2025
…GPU objects (#53720) Adds integration between the single-controller collective APIs introduced in #53319 and the GPU objects feature prototyped in #52938. Actor collectives created through `ray.experimental.collective.create_collective_group` will now be automatically used if a task declares a tensor transport other than the default OBJECT_STORE. This also adds support for allocating the torch tensors on the correct device (GPU for NCCL and CPU for GLOO). See updates in test_gpu_objects.py for examples. --------- Signed-off-by: Stephanie wang <smwang@cs.washington.edu> Signed-off-by: Stephanie Wang <smwang@cs.washington.edu> Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
elliot-barn
pushed a commit
that referenced
this pull request
Jun 18, 2025
minerharry
pushed a commit
to minerharry/ray
that referenced
this pull request
Jun 27, 2025
elliot-barn
pushed a commit
that referenced
this pull request
Jul 2, 2025
weiquanlee
pushed a commit
to antgroup/ant-ray
that referenced
this pull request
Aug 5, 2025
…-project#53602) This reverts commit 2ff7298.
crypdick
reviewed
Aug 14, 2025
@@ -0,0 +1,153 @@
import sys

Why are these changes needed?
High-level
This PR implements most parts of the diagram above.
Details
Step 1: Users annotate the sender’s actor method with `@ray.method(tensor_transport=...)`. The valid, case-insensitive values for `tensor_transport` are `nccl`, `gloo`, and `object_store` (default).
Step 2: Users create a communication group, such as an NCCL group, for the actors that need to communicate with each other. In addition, each actor needs to register the custom serializer.
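The case-insensitive handling of `tensor_transport` described in Step 1 could be sketched as follows. This is a minimal illustration only, not the code from this PR: the `TensorTransport` enum, `parse_tensor_transport`, and `uses_gpu_object_path` are hypothetical names, and the numeric values are arbitrary.

```python
from enum import Enum


class TensorTransport(Enum):
    """Hypothetical stand-in for the transport enum; member names follow the PR text."""

    OBJECT_STORE = 0
    NCCL = 1
    GLOO = 2


def parse_tensor_transport(value: str = "object_store") -> TensorTransport:
    """Map a case-insensitive string to a TensorTransport member."""
    try:
        return TensorTransport[value.strip().upper()]
    except KeyError:
        valid = ", ".join(m.name.lower() for m in TensorTransport)
        raise ValueError(
            f"Invalid tensor_transport {value!r}; expected one of: {valid}"
        ) from None


def uses_gpu_object_path(transport: TensorTransport) -> bool:
    """Assumption: only non-default transports take the GPU objects path."""
    return transport is not TensorTransport.OBJECT_STORE
```

Validating the string once at the API boundary and passing an enum afterwards keeps the later layers free of string comparisons.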
Step 3: Pass the `tensor_transport` information through the stack (Python → Cython → C++ → Cython) when submitting a task to the sender actor.
If `tensor_transport` is not `OBJECT_STORE`, `serialize_and_store_gpu_objects` will be called to extract tensors from the task output and store them in the `GPUObjectManager`.
Step 4: When the driver process resolves the dependencies of the receiver actor’s task argument, if that argument is an `ObjectRef` pointing to an object created by an actor method annotated with `@ray.method(tensor_transport="...")` (NCCL or GLOO), it submits a `__ray_send__` task to the sender actor to initiate the send operation (e.g., NCCL send) and a `__ray_recv__` task to the receiver actor to initiate the receive operation (e.g., NCCL recv).
Step 5: Pass the object ID through the stack: C++ (driver) → C++ (receiver actor) → Cython → Python (`def deserialize`).
In `def deserialize`, use the object ID to retrieve tensors from the in-actor object store, add them to the serialization context, and then deserialize to obtain the argument.
Related issue number
Checks
git commit -s) in this PR.
`scripts/format.sh` to lint the changes in this PR.
method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
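As a rough illustration of Steps 3 to 5 in the Details section, the toy model below shows the general idea of pulling tensors out of a task's output, keeping them in an in-actor store keyed by object ID, and rebuilding the argument on the receiver. It is a sketch under stated assumptions, not Ray's implementation: `FakeTensor`, `TensorPlaceholder`, `ToyActor`, and all method names are hypothetical, and a plain Python copy stands in for an NCCL or GLOO transfer.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FakeTensor:
    """Stand-in for a torch.Tensor so the sketch runs without a GPU."""

    data: List[float]


@dataclass(frozen=True)
class TensorPlaceholder:
    """Marks where a tensor was removed from the serialized value."""

    index: int


def extract_tensors(value: Any, out: List[FakeTensor]) -> Any:
    """Replace every FakeTensor in a nested list/dict with a placeholder."""
    if isinstance(value, FakeTensor):
        out.append(value)
        return TensorPlaceholder(len(out) - 1)
    if isinstance(value, list):
        return [extract_tensors(v, out) for v in value]
    if isinstance(value, dict):
        return {k: extract_tensors(v, out) for k, v in value.items()}
    return value


def restore_tensors(value: Any, tensors: List[FakeTensor]) -> Any:
    """Inverse of extract_tensors: put tensors back in place of placeholders."""
    if isinstance(value, TensorPlaceholder):
        return tensors[value.index]
    if isinstance(value, list):
        return [restore_tensors(v, tensors) for v in value]
    if isinstance(value, dict):
        return {k: restore_tensors(v, tensors) for k, v in value.items()}
    return value


@dataclass
class ToyActor:
    """Hypothetical actor with an in-actor tensor store keyed by object ID."""

    gpu_objects: Dict[str, List[FakeTensor]] = field(default_factory=dict)

    def serialize_and_store(self, object_id: str, output: Any) -> Any:
        # Step 3: keep tensors locally; only the skeleton leaves the actor.
        tensors: List[FakeTensor] = []
        skeleton = extract_tensors(output, tensors)
        self.gpu_objects[object_id] = tensors
        return skeleton

    def send(self, object_id: str) -> List[FakeTensor]:
        # Step 4, sender side: a real transport would issue e.g. an NCCL send.
        return self.gpu_objects[object_id]

    def recv(self, object_id: str, tensors: List[FakeTensor]) -> None:
        # Step 4, receiver side: copy to model a transfer between processes.
        self.gpu_objects[object_id] = [FakeTensor(list(t.data)) for t in tensors]

    def deserialize(self, object_id: str, skeleton: Any) -> Any:
        # Step 5: look up tensors by object ID and rebuild the argument.
        return restore_tensors(skeleton, self.gpu_objects[object_id])


sender, receiver = ToyActor(), ToyActor()
skeleton = sender.serialize_and_store("obj-1", {"w": FakeTensor([1.0, 2.0]), "step": 3})
receiver.recv("obj-1", sender.send("obj-1"))
argument = receiver.deserialize("obj-1", skeleton)
```

In this model the skeleton plays the role of the small value that travels through the normal task path, while the tensors move only between the two in-actor stores.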