Skip to content

[core][rdt][cherry-pick] Reuse previous metadata if transferring the same tensor list with nixl#58309

Merged
aslonnie merged 1 commit intoray-project:releases/2.51.1from
dayshah:2.51.1-cherrypick
Oct 30, 2025
Merged

[core][rdt][cherry-pick] Reuse previous metadata if transferring the same tensor list with nixl#58309
aslonnie merged 1 commit intoray-project:releases/2.51.1from
dayshah:2.51.1-cherrypick

Conversation

@dayshah
Copy link
Copy Markdown
Contributor

@dayshah dayshah commented Oct 30, 2025

Cherry-picking #58263 for 2.51.1 release.

…ist with nixl (ray-project#58263)

## Description
For nixl, reuse previous metadata if transferring the same tensor list.
This is to avoid repeated `register_memory` before `deregister_memory`

---------

Signed-off-by: Dhyey Shah <dhyey2019@gmail.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
Co-authored-by: Stephanie Wang <smwang@cs.washington.edu>
@dayshah dayshah requested review from a team as code owners October 30, 2025 07:00
@dayshah dayshah added the go add ONLY when ready to merge, run all tests label Oct 30, 2025
@ray-gardener ray-gardener bot added the core Issues that should be addressed in Ray Core label Oct 30, 2025
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces an optimization for NIXL tensor transport by reusing metadata for repeated transfers of the same tensor list. This is achieved by caching metadata and using tensor data pointers for identification, which also enables support for multiple objects containing the same tensor. The changes span documentation, core object management logic, and tests. While the overall approach is sound, I've identified a couple of critical syntax errors in the initialization of defaultdict that will cause TypeError at runtime. I've also included a medium-severity suggestion to improve code clarity and efficiency in one of the new methods.

Comment on lines +168 to +170
self._tensor_to_object_ids: Dict[int, Set[str]] = defaultdict[int, Set[str]](
set
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The syntax defaultdict[int, Set[str]](set) is incorrect. defaultdict[int, Set[str]] is a types.GenericAlias which is not callable in Python 3.9+. This will raise a TypeError at runtime. The correct way to initialize a defaultdict is by passing the factory function to its constructor, like defaultdict(set).

        self._tensor_to_object_ids: Dict[int, Set[str]] = defaultdict(set)

# Mapping from object ID to the NIXL managed meta.
self._managed_meta_nixl: Dict[str, Any] = {}
# Mapping from NIXL managed meta to the number of objects that contain it.
self._managed_meta_counts_nixl: Dict[Any, int] = defaultdict[Any, int](int)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The syntax defaultdict[Any, int](int) is incorrect. defaultdict[Any, int] is a types.GenericAlias which is not callable in Python 3.9+. This will raise a TypeError at runtime. The correct way to initialize a defaultdict is by passing the factory function to its constructor, like defaultdict(int).

        self._managed_meta_counts_nixl: Dict[Any, int] = defaultdict(int)

Comment on lines +304 to +307
self._managed_meta_counts_nixl[meta] -= 1
if self._managed_meta_counts_nixl[meta] == 0:
self._managed_meta_counts_nixl.pop(meta)
return self._managed_meta_counts_nixl[meta]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This implementation is a bit confusing and inefficient. It accesses self._managed_meta_counts_nixl[meta] multiple times. If the count becomes 0, the key is popped and then re-accessed on the return line, which relies on the defaultdict behavior to create a new entry and return 0. It would be clearer and more efficient to store the count in a local variable before the conditional pop.

            self._managed_meta_counts_nixl[meta] -= 1
            count = self._managed_meta_counts_nixl[meta]
            if count == 0:
                self._managed_meta_counts_nixl.pop(meta)
            return count

self._managed_meta_counts_nixl[meta] -= 1
if self._managed_meta_counts_nixl[meta] == 0:
self._managed_meta_counts_nixl.pop(meta)
return self._managed_meta_counts_nixl[meta]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: KeyError in Meta Count Removal

The remove_managed_meta_nixl method can raise a KeyError. This occurs when an entry's count in _managed_meta_counts_nixl reaches zero, causing it to be popped, but the method then tries to access that same key for its return value.

Fix in Cursor Fix in Web

@aslonnie aslonnie merged commit f86c8a0 into ray-project:releases/2.51.1 Oct 30, 2025
3 of 5 checks passed
@dayshah dayshah deleted the 2.51.1-cherrypick branch October 30, 2025 07:06
weiquanlee pushed a commit to antgroup/ant-ray that referenced this pull request Dec 11, 2025
…same tensor list with nixl (ray-project#58309)

Cherry-picking ray-project#58263 for 2.51.1 release.

Signed-off-by: Dhyey Shah <dhyey2019@gmail.com>
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
Co-authored-by: Stephanie Wang <smwang@cs.washington.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants