[TENT] Fix NVLink IPC address for sub-allocated GPU tensors#1831
alogfans merged 2 commits into kvcache-ai:main
Conversation
Code Review
This pull request introduces a caching mechanism for CUDA memory registration in the NVLink transport to handle sub-allocations from caching allocators like PyTorch. It uses cuMemGetAddressRange to identify base segments and tracks them to avoid redundant IPC handle creation. However, the current implementation contains a critical bug where the IPC handle (shm_path) is not populated for cached segments, which will cause failures on the receiving side. Additionally, the reviewer suggests replacing the unordered_set with an unordered_map to store and reuse serialized IPC handles, which also addresses a potential race condition in the registration logic.
{
    std::lock_guard<std::mutex> lock(register_mutex_);
    if (registered_base_addrs_.count((uint64_t)base_ptr)) {
        // Already registered this cudaMalloc block, just tag transport
        desc.addr = (uint64_t)base_ptr;
        desc.length = alloc_size;
        desc.transports.push_back(TransportType::NVLINK);
        return Status::OK();
    }
}

cudaIpcMemHandle_t handle;
CHECK_CUDA(cudaIpcGetMemHandle(&handle, (void*)base_ptr));
desc.addr = (uint64_t)base_ptr;
desc.length = alloc_size;
desc.shm_path =
    serializeBinaryData(&handle, sizeof(cudaIpcMemHandle_t));

{
    std::lock_guard<std::mutex> lock(register_mutex_);
    registered_base_addrs_.insert((uint64_t)base_ptr);
}
The current implementation has a critical bug: when a base address is already registered (cache hit), it fails to populate desc.shm_path. This will cause the receiving side to fail during address relocation because it expects a valid IPC handle. Additionally, the check-then-act pattern outside the lock creates a race condition where multiple threads might redundantly call cudaIpcGetMemHandle for the same segment. This refactoring fixes both issues by using a map to store and retrieve the serialized handle while holding the lock.
{
std::lock_guard<std::mutex> lock(register_mutex_);
auto it = registered_base_addrs_.find((uint64_t)base_ptr);
if (it != registered_base_addrs_.end()) {
desc.shm_path = it->second;
} else {
cudaIpcMemHandle_t handle;
CHECK_CUDA(cudaIpcGetMemHandle(&handle, (void*)base_ptr));
desc.shm_path = serializeBinaryData(&handle, sizeof(cudaIpcMemHandle_t));
registered_base_addrs_[(uint64_t)base_ptr] = desc.shm_path;
}
}
desc.addr = (uint64_t)base_ptr;
desc.length = alloc_size;

#include <mutex>
#include <queue>
#include <string>
#include <unordered_set>
bool host_register_;

std::mutex register_mutex_;
std::unordered_set<uint64_t> registered_base_addrs_;
Change the tracking set to a map to store the serialized IPC handle associated with each base address. This is necessary to correctly populate the shm_path for subsequent sub-allocations in the same segment.
- std::unordered_set<uint64_t> registered_base_addrs_;
+ std::unordered_map<uint64_t, std::string> registered_base_addrs_;
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Summary
Fixes #1829
Port of the IPC base-address fix from #1622 to the TENT NVLink transport code path.
Root cause: When `desc.addr` is a sub-allocation within a larger `cudaMalloc` segment (typical with PyTorch's caching allocator), `cudaIpcGetMemHandle()` returns a handle for the entire segment. But the metadata records the sub-allocation address, so the receiving side computes an incorrect offset during `relocateSharedMemoryAddress()`, leading to silent IPC failures in intra-node P/D disaggregation.

Fix: Call `cuMemGetAddressRange()` before `cudaIpcGetMemHandle()` to resolve the true cudaMalloc base address and allocation size. Register the IPC handle at segment granularity, and track already-registered bases via `registered_base_addrs_` to avoid duplicate registrations. Apply the same base-address resolution in `removeMemoryBuffer()`.

This mirrors the approach already merged in the non-TENT path (`IntraNodeNvlinkTransport::registerLocalMemory()`), adapted to TENT's `BufferDesc`-based API.

Changes
- `nvlink_transport.cpp`: `addMemoryBuffer()` now resolves the base address via `cuMemGetAddressRange`, uses the base for the IPC handle, and deduplicates via the tracking set
- `nvlink_transport.cpp`: `removeMemoryBuffer()` resolves the base address before clearing tracking state
- `nvlink_transport.h`: Added `std::mutex` + `std::unordered_set<uint64_t>` for `registered_base_addrs_` tracking