[core][rdt] Reuse previous metadata if transferring the same tensor list with nixl by Qiaolin-Yu · Pull Request #58263 · ray-project/ray

Qiaolin-Yu · 2025-10-28T23:53:29Z

Description

For nixl, reuse the previous metadata when transferring the same tensor list. This avoids calling `register_memory` repeatedly on the same memory before `deregister_memory` has been called.
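The idea in the description can be sketched as a small reference-counted cache. This is a minimal illustration under stated assumptions, not the code in this PR: `FakeAgent` and `MetadataCache` are hypothetical names, the agent only counts calls, and only the method names `register_memory` and `deregister_memory` come from the PR text.

```python
from typing import Any, Dict, List, Tuple


class FakeAgent:
    """Stand-in for a NIXL agent; it only counts calls (not the real API)."""

    def __init__(self) -> None:
        self.register_calls = 0
        self.deregister_calls = 0

    def register_memory(self, tensors: List[Any]) -> Tuple[int, ...]:
        self.register_calls += 1
        return tuple(id(t) for t in tensors)

    def deregister_memory(self, descs: Tuple[int, ...]) -> None:
        self.deregister_calls += 1


class MetadataCache:
    """Hypothetical helper: one registration per distinct tensor list."""

    def __init__(self, agent: FakeAgent) -> None:
        self._agent = agent
        # key -> [descriptors, reference count]
        self._entries: Dict[Tuple[int, ...], list] = {}

    @staticmethod
    def _key(tensors: List[Any]) -> Tuple[int, ...]:
        # Identity-based key: the same objects in the same order.
        return tuple(id(t) for t in tensors)

    def acquire(self, tensors: List[Any]) -> Tuple[int, ...]:
        key = self._key(tensors)
        entry = self._entries.get(key)
        if entry is None:
            # First use of this tensor list: register exactly once.
            entry = [self._agent.register_memory(tensors), 0]
            self._entries[key] = entry
        entry[1] += 1
        return entry[0]

    def release(self, tensors: List[Any]) -> None:
        key = self._key(tensors)
        entry = self._entries[key]
        entry[1] -= 1
        if entry[1] == 0:
            # Last user is gone: deregister and forget the entry.
            self._agent.deregister_memory(entry[0])
            del self._entries[key]


agent = FakeAgent()
cache = MetadataCache(agent)
tensors = [bytearray(8), bytearray(16)]  # placeholders for GPU tensors
first = cache.acquire(tensors)
second = cache.acquire(tensors)  # reuses the first registration
cache.release(tensors)
cache.release(tensors)  # count reaches zero, so memory is deregistered
```

In this sketch the second `acquire` returns the descriptors produced by the first call, so the agent sees a single register/deregister pair regardless of how many times the same list is sent.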

dayshah · 2025-10-29T20:53:43Z

python/ray/util/collective/types.py

    nixl_agent_meta: Optional[bytes] = None

+    __eq__ = object.__eq__
+    __hash__ = object.__hash__


Without this, the equality operator wouldn't work?

I think it's needed so the object can be stored in a hash set.
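For context on why the two assignments matter: assuming the metadata class is a `@dataclass` (the PR excerpt shows only one field, so this is an assumption), the default `eq=True` generates a field-wise `__eq__` and sets `__hash__` to `None`, which makes instances unusable as dict keys or set members. The sketch below uses hypothetical class names to show the difference.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlainMeta:
    """Default dataclass: field-wise equality, unhashable."""

    nixl_agent_meta: Optional[bytes] = None


@dataclass
class IdentityMeta:
    """Same field, but opts back into identity-based equality and hashing."""

    nixl_agent_meta: Optional[bytes] = None

    # Explicit definitions in the class body are not overwritten by @dataclass.
    __eq__ = object.__eq__
    __hash__ = object.__hash__


def is_hashable(obj: object) -> bool:
    try:
        hash(obj)
    except TypeError:
        return False
    return True


a, b = IdentityMeta(b"x"), IdentityMeta(b"x")
counts = {a: 1}  # works because IdentityMeta instances are hashable
```

With identity semantics, two metadata objects with equal fields remain distinct keys, which is the behavior a per-object reference count would rely on.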

doc/source/ray-core/doc_code/direct_transport_nixl.py

dayshah · 2025-10-29T21:14:31Z

python/ray/experimental/collective/nixl_tensor_transport.py

        )
+        gpu_object_store._managed_meta_nixl[obj_id] = ret
+        gpu_object_store._managed_meta_counts_nixl[ret] = 1
+        return ret


I think the AI is right here. Aren't these two dicts supposed to be accessed under a lock?

doc/source/ray-core/direct-transport.rst

dayshah

Can we turn the code blocks that grab a lock into GPU object store functions? Also, if possible, let's have just one lock instead of adding a new nixl-specific one.
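One way to read this suggestion is sketched below: the store owns a single lock, and callers go through store methods instead of touching the dicts directly. The class and method names are hypothetical and this is not Ray's actual `GPUObjectStore`; only the attribute names `_managed_meta_nixl` and `_managed_meta_counts_nixl` are taken from the diff above.

```python
import threading
from typing import Dict, Hashable, Optional


class GPUObjectStoreSketch:
    """Hypothetical store illustrating lock encapsulation."""

    def __init__(self) -> None:
        # One lock guards all store state, including the nixl bookkeeping.
        self._lock = threading.RLock()
        self._managed_meta_nixl: Dict[str, Hashable] = {}
        self._managed_meta_counts_nixl: Dict[Hashable, int] = {}

    def record_managed_meta_nixl(self, obj_id: str, meta: Hashable) -> None:
        """Associate obj_id with meta and bump the reference count for meta."""
        with self._lock:
            self._managed_meta_nixl[obj_id] = meta
            count = self._managed_meta_counts_nixl.get(meta, 0)
            self._managed_meta_counts_nixl[meta] = count + 1

    def release_managed_meta_nixl(self, obj_id: str) -> Optional[Hashable]:
        """Drop obj_id; return meta if that was its last reference, else None."""
        with self._lock:
            meta = self._managed_meta_nixl.pop(obj_id)
            remaining = self._managed_meta_counts_nixl[meta] - 1
            if remaining == 0:
                del self._managed_meta_counts_nixl[meta]
                return meta
            self._managed_meta_counts_nixl[meta] = remaining
            return None


store = GPUObjectStoreSketch()
store.record_managed_meta_nixl("obj-1", "meta-A")
store.record_managed_meta_nixl("obj-2", "meta-A")
first_release = store.release_managed_meta_nixl("obj-1")
second_release = store.release_managed_meta_nixl("obj-2")
```

Keeping both dict updates inside one method means a caller cannot update one dict and forget the other, and the transport code never needs to know which lock protects them.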

python/ray/experimental/collective/nixl_tensor_transport.py

dayshah · 2025-10-30T00:09:37Z

python/ray/util/collective/collective_group/nixl_backend.py

                break

        nixl_agent.release_xfer_handle(xfer_handle)
+        nixl_agent.deregister_memory(local_descs)


Why do we need to deregister on recv now?

It's for the memory registered by the receiver.

ray/python/ray/util/collective/collective_group/nixl_backend.py

Line 62 in e0ecadb

local_descs = nixl_agent.register_memory(tensors)
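The pairing described in this reply can be sketched as follows. `FakeAgent` is a stand-in that only records calls and is not the real NIXL agent API; the method names are the three that appear in the diff, and the transfer itself is reduced to a placeholder. The `try`/`finally` is a choice made for this sketch and is not claimed to match the PR.

```python
from typing import Any, List


class FakeAgent:
    """Stand-in that records the order of calls (not the real NIXL API)."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def register_memory(self, tensors: List[Any]) -> str:
        self.calls.append("register_memory")
        return "local_descs"

    def release_xfer_handle(self, handle: str) -> None:
        self.calls.append("release_xfer_handle")

    def deregister_memory(self, descs: str) -> None:
        self.calls.append("deregister_memory")


def recv(agent: FakeAgent, tensors: List[Any]) -> None:
    # The receiver registers its own destination buffers...
    local_descs = agent.register_memory(tensors)
    try:
        # Placeholder for creating the transfer and polling until it is done.
        xfer_handle = "xfer_handle"
        agent.release_xfer_handle(xfer_handle)
    finally:
        # ...so the receiver is also the one that must deregister them.
        agent.deregister_memory(local_descs)


agent = FakeAgent()
recv(agent, [bytearray(4)])
```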

stephanie-wang · 2025-10-30T00:10:30Z

python/ray/experimental/gpu_object_manager/gpu_object_store.py

+                    )
+                    if not is_same_tensors:
+                        raise ValueError(
+                            f"The duplicate object {dst_obj_id} does not have the same tensors as the source object {src_obj_id}."


This error will get raised to users, right? Might want to say something a bit clearer, like:

"Some of the tensors in this object are still in scope as part of another RDT object. Ensure that ObjectRef({src_object_id}) is out of scope before creating this object."

stephanie-wang · 2025-10-30T00:12:52Z

python/ray/tests/gpu_objects/test_gpu_objects_nixl.py



+@pytest.mark.parametrize("ray_start_regular", [{"num_gpus": 2}], indirect=True)
+def test_send_duplicate_tensor(ray_start_regular):


Can you add a test that checks that we throw an error when a different tensor subset is passed?
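The behavior such a test would assert can be sketched without a Ray cluster. The helper below is hypothetical (the real check lives in the GPU object store and its exact comparison is not shown on this page); it compares tensors by identity and raises with the clearer message proposed earlier in this review.

```python
from typing import Any, List


def check_same_tensors(src_obj_id: str, src: List[Any], dst: List[Any]) -> None:
    """Hypothetical helper: raise unless dst holds exactly the tensors in src."""
    is_same_tensors = len(src) == len(dst) and all(
        s is d for s, d in zip(src, dst)
    )
    if not is_same_tensors:
        raise ValueError(
            "Some of the tensors in this object are still in scope as part of "
            f"another RDT object. Ensure that ObjectRef({src_obj_id}) is out of "
            "scope before creating this object."
        )


t1, t2 = bytearray(4), bytearray(4)  # placeholders for GPU tensors
check_same_tensors("abc", [t1, t2], [t1, t2])  # same list: accepted

try:
    check_same_tensors("abc", [t1, t2], [t1])  # different subset: rejected
    error_message = ""
except ValueError as exc:
    error_message = str(exc)
```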

doc/source/ray-core/direct-transport.rst

Co-authored-by: Stephanie Wang <smwang@cs.washington.edu> Signed-off-by: Dhyey Shah <dhyey2019@gmail.com>

python/ray/experimental/gpu_object_manager/gpu_object_store.py

Signed-off-by: Dhyey Shah <dhyey2019@gmail.com>

…ist with nixl (ray-project#58263) ## Description For nixl, reuse previous metadata if transferring the same tensor list. This is to avoid repeated `register_memory` before `deregister_memory` --------- Signed-off-by: Dhyey Shah <dhyey2019@gmail.com> Co-authored-by: Dhyey Shah <dhyey2019@gmail.com> Co-authored-by: Stephanie Wang <smwang@cs.washington.edu>

…same tensor list with nixl (#58309) Cherry-picking #58263 for 2.51.1 release. Signed-off-by: Dhyey Shah <dhyey2019@gmail.com> Co-authored-by: Qiaolin Yu <liin1211@outlook.com> Co-authored-by: Stephanie Wang <smwang@cs.washington.edu>

upd

504d2f9

Qiaolin-Yu force-pushed the fix_ql branch from df7249a to 504d2f9 Compare October 29, 2025 19:08

Qiaolin-Yu added 4 commits October 29, 2025 19:35

clean code

6c2a841

upd

6b847fb

fix

4fb4bc5

fix

1358212

Qiaolin-Yu changed the title ~~draft~~ [core][rdt] Reuse previous metadata if transferring the same tensor list with nixl Oct 29, 2025

Qiaolin-Yu marked this pull request as ready for review October 29, 2025 20:39

Qiaolin-Yu requested review from a team as code owners October 29, 2025 20:39

fix

7b12520

Qiaolin-Yu added 2 commits October 29, 2025 20:45

fix

d9511aa

fix

fad2cbd

Qiaolin-Yu requested a review from dayshah October 29, 2025 21:02

Qiaolin-Yu assigned dayshah Oct 29, 2025

Qiaolin-Yu added the `core` (Issues that should be addressed in Ray Core) and `rdt` (Ray Direct Transport) labels Oct 29, 2025

dayshah reviewed Oct 29, 2025

dayshah added the `go` label (add ONLY when ready to merge, run all tests) Oct 29, 2025

Qiaolin-Yu added 2 commits October 29, 2025 23:02

add lock

5f633d4

refine

96d3d77

Qiaolin-Yu requested a review from dayshah October 29, 2025 23:18

dayshah reviewed Oct 29, 2025

doc/source/ray-core/direct-transport.rst (outdated)

dayshah reviewed Oct 29, 2025

python/ray/experimental/collective/nixl_tensor_transport.py (outdated)

fix

07e3147

Qiaolin-Yu requested a review from dayshah October 29, 2025 23:59

dayshah approved these changes Oct 30, 2025

stephanie-wang approved these changes Oct 30, 2025

Update doc/source/ray-core/direct-transport.rst

acec046

Co-authored-by: Stephanie Wang <smwang@cs.washington.edu> Signed-off-by: Dhyey Shah <dhyey2019@gmail.com>

dayshah enabled auto-merge (squash) October 30, 2025 04:05

dayshah reviewed Oct 30, 2025

python/ray/experimental/gpu_object_manager/gpu_object_store.py (outdated)

Apply suggestion from @dayshah

9f51969

Signed-off-by: Dhyey Shah <dhyey2019@gmail.com>

github-actions bot disabled auto-merge October 30, 2025 04:07

dayshah reviewed Oct 30, 2025

python/ray/experimental/gpu_object_manager/gpu_object_store.py (outdated)

Apply suggestion from @dayshah

919d373

Signed-off-by: Dhyey Shah <dhyey2019@gmail.com>

dayshah enabled auto-merge (squash) October 30, 2025 04:08

dayshah reviewed Oct 30, 2025

python/ray/experimental/gpu_object_manager/gpu_object_store.py (outdated)

Apply suggestion from @dayshah

0b97ac7

Signed-off-by: Dhyey Shah <dhyey2019@gmail.com>

github-actions bot disabled auto-merge October 30, 2025 04:40

dayshah enabled auto-merge (squash) October 30, 2025 04:40

dayshah merged commit 6e72c84 into ray-project:master Oct 30, 2025
7 checks passed

dayshah mentioned this pull request Oct 30, 2025

[core][rdt][cherry-pick] Reuse previous metadata if transferring the same tensor list with nixl #58309

Merged


