[core][rdt] Reuse previous metadata if transferring the same tensor list with nixl #58263

Merged
dayshah merged 15 commits into ray-project:master from Qiaolin-Yu:fix_ql on Oct 30, 2025

Conversation

Member

@Qiaolin-Yu Qiaolin-Yu commented Oct 28, 2025

Description

For nixl, reuse the previous metadata when transferring the same tensor list. This avoids calling `register_memory` repeatedly before `deregister_memory`.
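
A minimal sketch of the idea, using a stand-in agent rather than the real nixl bindings: the registration result is cached per tensor list and reference counted, so the same list is registered once and deregistered only when its last user releases it. `FakeAgent`, `MetaCache`, and the identity-based key are hypothetical names for illustration; only the method names `register_memory` and `deregister_memory` come from this PR.

```python
from typing import Dict, List, Tuple


class FakeAgent:
    """Stand-in for a nixl agent (assumption); it only counts calls."""

    def __init__(self) -> None:
        self.register_calls = 0
        self.deregister_calls = 0

    def register_memory(self, tensors: List[object]) -> str:
        self.register_calls += 1
        return f"descs-{self.register_calls}"

    def deregister_memory(self, descs: str) -> None:
        self.deregister_calls += 1


class MetaCache:
    """Caches one registration per tensor list, with a use count."""

    def __init__(self, agent: FakeAgent) -> None:
        self._agent = agent
        # key -> [descs, use_count, tensors]; holding the tensors keeps them
        # alive, so their id() values cannot be reused while the entry exists.
        self._entries: Dict[Tuple[int, ...], list] = {}

    @staticmethod
    def _key(tensors: List[object]) -> Tuple[int, ...]:
        # Assumption: "the same tensor list" means the same objects, same order.
        return tuple(id(t) for t in tensors)

    def acquire(self, tensors: List[object]) -> str:
        key = self._key(tensors)
        entry = self._entries.get(key)
        if entry is None:
            # First use of this list: register it exactly once.
            entry = [self._agent.register_memory(tensors), 0, list(tensors)]
            self._entries[key] = entry
        entry[1] += 1
        return entry[0]

    def release(self, tensors: List[object]) -> None:
        key = self._key(tensors)
        entry = self._entries[key]
        entry[1] -= 1
        if entry[1] == 0:
            # Last user is gone: deregister and forget the entry.
            self._agent.deregister_memory(entry[0])
            del self._entries[key]
```

Calling `acquire` twice with the same list returns the same descriptors and triggers a single `register_memory` call; `deregister_memory` runs only on the second `release`.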

@Qiaolin-Yu Qiaolin-Yu changed the title from "draft" to "[core][rdt] Reuse previous metadata if transferring the same tensor list with nixl" Oct 29, 2025
@Qiaolin-Yu Qiaolin-Yu marked this pull request as ready for review October 29, 2025 20:39
@Qiaolin-Yu Qiaolin-Yu requested review from a team as code owners October 29, 2025 20:39
cursor[bot]

This comment was marked as outdated.

@Qiaolin-Yu Qiaolin-Yu requested a review from dayshah October 29, 2025 21:02
@Qiaolin-Yu Qiaolin-Yu added the core (Issues that should be addressed in Ray Core) and rdt (Ray Direct Transport) labels Oct 29, 2025
nixl_agent_meta: Optional[bytes] = None

__eq__ = object.__eq__
__hash__ = object.__hash__
Contributor

without this the equality operator wouldn't work?

Member Author

i think it's needed by the hashset
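
The exchange above turns on a standard-library behaviour: a `@dataclass` with the default `eq=True` sets `__hash__` to `None`, so its instances cannot be placed in a set or used as dict keys. A small sketch, assuming the metadata class is a dataclass (the excerpt shows only one field); `PlainMeta`, `IdentityMeta`, and `is_hashable` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlainMeta:
    # Default dataclass: field-wise __eq__, and __hash__ is set to None.
    nixl_agent_meta: Optional[bytes] = None


@dataclass
class IdentityMeta:
    nixl_agent_meta: Optional[bytes] = None

    # Restore identity semantics so instances can be dict keys or set members.
    # The dataclass decorator leaves explicitly defined methods untouched.
    __eq__ = object.__eq__
    __hash__ = object.__hash__


def is_hashable(obj: object) -> bool:
    """Return True if hash(obj) succeeds."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True
```

With identity semantics, two instances holding equal bytes remain distinct keys, which is what a count table keyed by metadata object needs.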

)
gpu_object_store._managed_meta_nixl[obj_id] = ret
gpu_object_store._managed_meta_counts_nixl[ret] = 1
return ret
Contributor

i think the ai is right here, aren't these 2 dicts under locks?

@dayshah dayshah added the go (add ONLY when ready to merge, run all tests) label Oct 29, 2025
cursor[bot]

This comment was marked as outdated.

@Qiaolin-Yu Qiaolin-Yu requested a review from dayshah October 29, 2025 23:18
Contributor

@dayshah dayshah left a comment

can we turn the code blocks that grab a lock into gpu obj store funcs, and also if possible just have one lock instead of having a new nixl one?
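
One way the suggestion could look, as a sketch only: the two dicts seen in the diff live behind a single lock, and callers go through store methods instead of taking the lock themselves. The class name, the method names `record_managed_meta` and `release_managed_meta`, and the counting scheme are assumptions; only the attribute names `_managed_meta_nixl` and `_managed_meta_counts_nixl` appear in this PR.

```python
import threading
from typing import Dict, Hashable, Optional


class GPUObjectStoreSketch:
    """Hypothetical store that guards both metadata dicts with one lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._managed_meta_nixl: Dict[str, Hashable] = {}
        self._managed_meta_counts_nixl: Dict[Hashable, int] = {}

    def record_managed_meta(self, obj_id: str, meta: Hashable) -> None:
        """Associate obj_id with meta and bump the metadata's use count."""
        with self._lock:
            self._managed_meta_nixl[obj_id] = meta
            self._managed_meta_counts_nixl[meta] = (
                self._managed_meta_counts_nixl.get(meta, 0) + 1
            )

    def release_managed_meta(self, obj_id: str) -> Optional[Hashable]:
        """Drop obj_id's reference; return the metadata if it is now unused."""
        with self._lock:
            meta = self._managed_meta_nixl.pop(obj_id)
            remaining = self._managed_meta_counts_nixl[meta] - 1
            if remaining == 0:
                del self._managed_meta_counts_nixl[meta]
                return meta
            self._managed_meta_counts_nixl[meta] = remaining
            return None
```

Because both dicts change inside the same critical section, a reader never sees an object mapped to metadata whose count has not been updated yet.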

@Qiaolin-Yu Qiaolin-Yu requested a review from dayshah October 29, 2025 23:59
cursor[bot]

This comment was marked as outdated.

break

nixl_agent.release_xfer_handle(xfer_handle)
nixl_agent.deregister_memory(local_descs)
Contributor

why do we need to deregister on recv now?

Member Author

@Qiaolin-Yu Qiaolin-Yu Oct 30, 2025

It's for the memory registered by the receiver.

local_descs = nixl_agent.register_memory(tensors)
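
To illustrate the pairing described in this reply, here is a sketch with a stand-in agent: the receiver registers its own buffers, so it is also the one that deregisters them once the transfer ends. `FakeNixlAgent`, `recv_sketch`, and `do_transfer` are hypothetical, and the `try`/`finally` is an addition for illustration, not something shown in the diff.

```python
from typing import Callable, List, Optional


class FakeNixlAgent:
    """Stand-in that records call order; not the real nixl API."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def register_memory(self, tensors: List[object]) -> tuple:
        self.calls.append("register_memory")
        return ("descs", len(tensors))

    def release_xfer_handle(self, xfer_handle: object) -> None:
        self.calls.append("release_xfer_handle")

    def deregister_memory(self, descs: tuple) -> None:
        self.calls.append("deregister_memory")


def recv_sketch(
    agent: FakeNixlAgent,
    tensors: List[object],
    do_transfer: Callable[[tuple], object],
) -> None:
    # The receiver registers its local buffers before the transfer...
    local_descs = agent.register_memory(tensors)
    xfer_handle: Optional[object] = None
    try:
        xfer_handle = do_transfer(local_descs)
    finally:
        # ...and undoes that registration afterwards, even on failure.
        if xfer_handle is not None:
            agent.release_xfer_handle(xfer_handle)
        agent.deregister_memory(local_descs)
```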

)
if not is_same_tensors:
raise ValueError(
f"The duplicate object {dst_obj_id} does not have the same tensors as the source object {src_obj_id}."
Contributor

This error will get raised to users, right? Might want to say something a bit clearer, like:

"Some of the tensors in this object are still in scope as part of another RDT object. Ensure that ObjectRef({src_object_id}) is out of scope before creating this object."

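A sketch of how the check could read with the suggested wording. The excerpt does not show how `is_same_tensors` is computed, so the identity comparison below is an assumption, as are the function name `check_same_tensors` and its signature.

```python
from typing import Sequence


def check_same_tensors(
    src_obj_id: str,
    dst_obj_id: str,
    src_tensors: Sequence[object],
    dst_tensors: Sequence[object],
) -> None:
    """Raise ValueError unless both objects hold exactly the same tensors."""
    # Assumption: "same" means the same objects in the same order.
    is_same_tensors = len(src_tensors) == len(dst_tensors) and all(
        a is b for a, b in zip(src_tensors, dst_tensors)
    )
    if not is_same_tensors:
        raise ValueError(
            f"Some of the tensors in object {dst_obj_id} are still in scope as "
            f"part of another RDT object. Ensure that ObjectRef({src_obj_id}) "
            "is out of scope before creating this object."
        )
```

Passing a subset of the original list, for example `tensors[:1]` against `tensors`, raises the error, which is the case the reviewers ask to cover in a test.
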


@pytest.mark.parametrize("ray_start_regular", [{"num_gpus": 2}], indirect=True)
def test_send_duplicate_tensor(ray_start_regular):
Contributor

Can you add a test that checks that we throw an error when a different tensor subset is passed?

Co-authored-by: Stephanie Wang <smwang@cs.washington.edu>
Signed-off-by: Dhyey Shah <dhyey2019@gmail.com>
@dayshah dayshah enabled auto-merge (squash) October 30, 2025 04:05
Signed-off-by: Dhyey Shah <dhyey2019@gmail.com>
@github-actions github-actions bot disabled auto-merge October 30, 2025 04:07
Signed-off-by: Dhyey Shah <dhyey2019@gmail.com>
@dayshah dayshah enabled auto-merge (squash) October 30, 2025 04:08
Signed-off-by: Dhyey Shah <dhyey2019@gmail.com>
@github-actions github-actions bot disabled auto-merge October 30, 2025 04:40
@dayshah dayshah enabled auto-merge (squash) October 30, 2025 04:40
@dayshah dayshah merged commit 6e72c84 into ray-project:master Oct 30, 2025
7 checks passed
dayshah added a commit to dayshah/ray that referenced this pull request Oct 30, 2025
…ist with nixl (ray-project#58263)

## Description
For nixl, reuse previous metadata if transferring the same tensor list.
This is to avoid repeated `register_memory` before `deregister_memory`

---------

Signed-off-by: Dhyey Shah <dhyey2019@gmail.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
Co-authored-by: Stephanie Wang <smwang@cs.washington.edu>
aslonnie pushed a commit that referenced this pull request Oct 30, 2025
…same tensor list with nixl (#58309)

Cherry-picking #58263 for 2.51.1 release.

Signed-off-by: Dhyey Shah <dhyey2019@gmail.com>
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
Co-authored-by: Stephanie Wang <smwang@cs.washington.edu>
YoussefEssDS pushed a commit to YoussefEssDS/ray that referenced this pull request Nov 8, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
weiquanlee pushed a commit to antgroup/ant-ray that referenced this pull request Dec 11, 2025
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
3 participants