Fix EP buffer allocation for MNNVL clusters #1629
Conversation
On GB200 MNNVL clusters where the NVLink fabric is the only transport, `gdr_buffer` was allocated with `cudaMalloc`, which lacks fabric handles. Remote nodes could not resolve the address, causing EP to hang. When `supportFabricMem()` is true (MNNVL), allocate `gdr_buffer` via `cuMemCreate(CU_MEM_HANDLE_TYPE_FABRIC)` + `cuMemMap` + `cuMemSetAccess`, mirroring what `allocatePinnedLocalMemory()` does in `nvlink_transport.cpp`. The IPC handle exchange path is skipped, since fabric addresses are globally visible across the clique. On IB clusters or single-node setups, the original `cudaMalloc` path is preserved unchanged. Fixes kvcache-ai#1627
Activity
Code Review
This pull request aims to add support for MNNVL clusters by using fabric-aware memory allocation for the EP buffer. However, the implementation introduces critical denial of service (DoS) vulnerabilities due to incorrect assumptions about Hopper-based systems and skipped IPC handle exchanges, leading to null pointer dereferences and crashes on both MNNVL and standard Hopper clusters. Additionally, there's a critical issue in how remote peer memory is handled, as the current implementation incorrectly assumes automatic fabric memory accessibility across nodes without handle exchange. A minor maintainability concern regarding repetitive resource cleanup code was also noted. Addressing these critical issues is essential for the feature to function correctly and securely.
```cpp
for (int i = 0; i < num_ranks; ++i) {
    nvlink_array[i] = 1;
    // Each rank's gdr_buffer is directly accessible; the remote
    // addresses will be exchanged via the RDMA address sync path
    // (sync_ib / sync_roce) or via a separate fabric address exchange.
    // For local rank, point to our own buffer.
    ipc_peer_ptrs_host[i] = (i == rank) ? gdr_buffer : nullptr;
}
```
This code introduces a denial of service (DoS) vulnerability. In `MooncakeEpBuffer::sync_nvlink_ipc_handles`, when `use_fabric_mem_` is true, `ipc_peer_ptrs_host[i]` is incorrectly set to `nullptr` for remote ranks. Despite this, `nvlink_array[i]` is set to 1, indicating NVLink availability. GPU kernels (`dispatch` and `combine` in `mooncake-ep/src/mooncake_ep_kernel.cu`) will then attempt to access remote memory via these null pointers, causing a page fault and CUDA context crash, leading to DoS. For cross-node access with fabric memory, explicit handle exchange and mapping are required; `cuMemSetAccess` alone is insufficient. The correct procedure is to export a `CUmemFabricHandle` from the `CUmemGenericAllocationHandle` using `cuMemExportToShareableHandle`, exchange these fabric handles between ranks, import them using `cuMemImportFromShareableHandle`, map them into the VA space using `cuMemMap`, and store the resulting pointers in `ipc_peer_ptrs_host`. Refer to the `relocateSharedMemoryAddress` function in `mooncake-transfer-engine/src/transport/nvlink_transport/nvlink_transport.cpp` for a good example of this process.
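The exchange the reviewer outlines could look roughly like this (untested sketch; error handling is omitted, and `all_gather`, `local_device`, and `granularity` are stand-ins for whatever bootstrap channel and values EP already has, not code from the PR):

```cpp
// Export a fabric handle for the local allocation.
CUmemFabricHandle local_handle;
cuMemExportToShareableHandle(&local_handle, fabric_mem_handle_,
                             CU_MEM_HANDLE_TYPE_FABRIC, 0);

// Exchange handles over the existing bootstrap channel (hypothetical helper).
std::vector<CUmemFabricHandle> handles = all_gather(local_handle);

for (int i = 0; i < num_ranks; ++i) {
    if (i == rank) { ipc_peer_ptrs_host[i] = gdr_buffer; continue; }

    // Import the peer's handle and map it into our VA space.
    CUmemGenericAllocationHandle peer;
    cuMemImportFromShareableHandle(&peer, &handles[i],
                                   CU_MEM_HANDLE_TYPE_FABRIC);
    CUdeviceptr dptr = 0;
    cuMemAddressReserve(&dptr, fabric_alloc_size_, granularity, 0, 0);
    cuMemMap(dptr, fabric_alloc_size_, 0, peer, 0);

    // Grant read/write access on the local device before any kernel use.
    CUmemAccessDesc desc{};
    desc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    desc.location.id = local_device;
    desc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(dptr, fabric_alloc_size_, &desc, 1);

    ipc_peer_ptrs_host[i] = reinterpret_cast<void*>(dptr);
}
```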
```cpp
static bool supportFabricMem() {
    if (getenv("MC_USE_NVLINK_IPC")) return false;

    int num_devices = 0;
    cudaError_t err = cudaGetDeviceCount(&num_devices);
    if (err != cudaSuccess || num_devices == 0) return false;

    for (int dev = 0; dev < num_devices; ++dev) {
        int supported = 0;
        cuDeviceGetAttribute(&supported,
                             CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED,
                             dev);
        if (!supported) return false;
    }
    return true;
}
```
The `supportFabricMem()` function checks the `CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED` attribute on all GPUs. This attribute is true for all NVIDIA Hopper (e.g., H100) and later GPUs. However, the PR assumes that if this attribute is true, the system is a GB200 MNNVL cluster where all nodes are in a single NVLink fabric clique. On standard H100 clusters that use InfiniBand for cross-node communication, `supportFabricMem()` will return true, but these clusters are NOT a single NVLink domain across nodes. The code in `sync_nvlink_ipc_handles` will then incorrectly set `nvlink_available[i] = 1` for all ranks, including those on different nodes. This will cause the kernels to attempt NVLink access for cross-node communication, which will fail and crash the application. This is a significant regression that breaks Mooncake on any multi-node Hopper cluster that is not MNNVL.
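One way to address the reviewer's point is to make the gate stricter than the attribute alone. The sketch below abstracts the CUDA query behind a callback so the policy itself is testable; `fabric_opt_in` stands for a hypothetical explicit switch (e.g. an `MC_FORCE_FABRIC_MEM` environment variable, which is our invention, not part of the PR):

```cpp
#include <functional>

// Stricter gate (sketch): the fabric attribute is true on any Hopper+
// GPU, so it cannot by itself prove a multi-node NVLink domain. In real
// code fabric_supported(dev) would wrap cuDeviceGetAttribute(...,
// CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED, dev).
bool useFabricMem(int num_devices,
                  const std::function<bool(int)>& fabric_supported,
                  bool ipc_override,    // e.g. MC_USE_NVLINK_IPC is set
                  bool fabric_opt_in) { // hypothetical explicit MNNVL opt-in
    if (ipc_override) return false;    // user forced the IPC path
    if (!fabric_opt_in) return false;  // never guess MNNVL from the attribute
    if (num_devices <= 0) return false;
    for (int dev = 0; dev < num_devices; ++dev) {
        if (!fabric_supported(dev)) return false;
    }
    return true;
}
```

With an explicit opt-in, standard IB-connected Hopper clusters keep the `cudaMalloc` path by default, and only operators who know they are on a single NVLink fabric enable fabric allocation.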
```cpp
CUdeviceptr dptr = 0;
res = cuMemAddressReserve(&dptr, fabric_alloc_size_, granularity, 0, 0);
if (res != CUDA_SUCCESS) {
    cuMemRelease(fabric_mem_handle_);
    LOG(ERROR) << "[EP] cuMemAddressReserve failed: " << res;
    throw std::runtime_error("cuMemAddressReserve failed");
}

res = cuMemMap(dptr, fabric_alloc_size_, 0, fabric_mem_handle_, 0);
if (res != CUDA_SUCCESS) {
    cuMemAddressFree(dptr, fabric_alloc_size_);
    cuMemRelease(fabric_mem_handle_);
    LOG(ERROR) << "[EP] cuMemMap failed: " << res;
    throw std::runtime_error("cuMemMap failed");
}

// Grant read/write access to all devices in the fabric clique
int device_count = 0;
cudaGetDeviceCount(&device_count);
std::vector<CUmemAccessDesc> access(device_count);
for (int i = 0; i < device_count; ++i) {
    access[i].location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    access[i].location.id = i;
    access[i].flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
}
res = cuMemSetAccess(dptr, fabric_alloc_size_, access.data(),
                     device_count);
if (res != CUDA_SUCCESS) {
    cuMemUnmap(dptr, fabric_alloc_size_);
    cuMemAddressFree(dptr, fabric_alloc_size_);
    cuMemRelease(fabric_mem_handle_);
    LOG(ERROR) << "[EP] cuMemSetAccess failed: " << res;
    throw std::runtime_error("cuMemSetAccess failed");
}
```
The error handling in this block for fabric memory allocation duplicates a lot of resource cleanup code. For example, `cuMemRelease(fabric_mem_handle_)` is called in multiple failure paths. This makes the code harder to read and maintain.

Consider refactoring to reduce repetition. Common patterns for this in C++ are:

- RAII wrapper classes for the CUDA driver API resources (`CUmemGenericAllocationHandle`, `CUdeviceptr` for the reserved VA). The destructor of the wrapper would handle cleanup.
- A `goto`-based cleanup block at the end of the function, a common C-style pattern for resource management.

This would centralize the cleanup logic and make the allocation flow easier to follow.
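The RAII option can be as small as a scope guard per acquired resource. The sketch below is ours, not code from the PR: `ScopeGuard` and `simulate_alloc` stand in for the real CUDA calls, logging which cleanups fire so the LIFO unwind is visible.

```cpp
#include <functional>
#include <string>
#include <vector>

// Minimal scope guard: owns one cleanup step; runs it unless dismissed.
class ScopeGuard {
  public:
    explicit ScopeGuard(std::function<void()> fn) : fn_(std::move(fn)) {}
    ~ScopeGuard() {
        if (active_) fn_();
    }
    void dismiss() { active_ = false; }  // resource handed off; skip cleanup
    ScopeGuard(const ScopeGuard&) = delete;
    ScopeGuard& operator=(const ScopeGuard&) = delete;

  private:
    std::function<void()> fn_;
    bool active_ = true;
};

// Mirrors the cuMemCreate -> cuMemAddressReserve -> cuMemMap sequence.
// On early return, guards in scope fire in reverse declaration order.
void simulate_alloc(bool fail_at_map, std::vector<std::string>& log) {
    // step 1: cuMemCreate succeeded
    ScopeGuard release([&] { log.push_back("cuMemRelease"); });
    // step 2: cuMemAddressReserve succeeded
    ScopeGuard free_va([&] { log.push_back("cuMemAddressFree"); });
    // step 3: cuMemMap
    if (fail_at_map) return;  // unwind: cuMemAddressFree, then cuMemRelease
    // all steps succeeded: keep the resources alive
    free_va.dismiss();
    release.dismiss();
}
```

Each failure path then becomes a bare `throw` or `return`, and the cleanup ordering is enforced by scope rather than repeated by hand.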
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Thank you for your contribution! Please fix the CI lint error. |
When testing the wheel artifact in a Hopper environment, I encountered "undefined symbol: cuMemAddressFree". Is there anything I should adjust in my settings?
EP now calls `cuMemCreate`, `cuMemMap`, `cuMemSetAccess`, `cuMemAddressFree`, etc. from the CUDA driver API (`libcuda.so`). Previously only the runtime API was linked.
@UNIDY2002 Thanks for testing! The extension calls the CUDA driver API but was not linked against the driver library. Just pushed a fix: the build now links against the CUDA driver stub library when it exists, instead of listing the driver in the linked libraries.
- Removed `cuda` from the list of libraries in `setup.py`.
- Added support for linking against the CUDA driver stub library if it exists.
- Removed the CUDA driver from the linked libraries for `mooncake_ep`.
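Those commits amount to a build-time gate along these lines (a sketch; `cuda_driver_link_args` and the default path are our assumptions, not the actual `setup.py`):

```python
import os

def cuda_driver_link_args(cuda_home="/usr/local/cuda"):
    """Link against the CUDA driver stub at build time if present.

    At runtime the real libcuda.so comes from the display driver, so the
    stub only needs to satisfy the linker. If no stub is found, omit the
    explicit -lcuda, as the commits above do.
    """
    stub_dir = os.path.join(cuda_home, "lib64", "stubs")
    if os.path.isdir(stub_dir):
        return ["-L" + stub_dir, "-lcuda"]
    return []
```

Linking the stub rather than the real `libcuda.so` keeps the wheel buildable on machines without a display driver installed.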
UNIDY2002 left a comment
Thank you very much for your valuable contribution!
Summary
On GB200 MNNVL clusters where NVLink fabric is the only transport (0 HCAs), EP hangs permanently during the P2P handshake because `gdr_buffer` is allocated with `cudaMalloc`, which lacks fabric handles for cross-node NVLink access.

This PR adds conditional fabric-aware allocation:

- When `supportFabricMem()` is true (MNNVL), allocate `gdr_buffer` via `cuMemCreate(CU_MEM_HANDLE_TYPE_FABRIC)` + `cuMemMap` + `cuMemSetAccess`, mirroring what `allocatePinnedLocalMemory()` already does in `nvlink_transport.cpp`.
- `cuMemSetAccess(allDevices)` makes the buffer globally visible across the clique.
- Otherwise, the `cudaMalloc` + IPC path is preserved unchanged.

The two paths are mutually exclusive: MNNVL has 0 HCAs, so `init_ibgda()` never runs; IB clusters have `supportFabricMem() = false`, so `cudaMalloc` is unchanged. `ibv_reg_mr` is never called on a `cuMemCreate` buffer.

Fixes #1627
Changes
- `mooncake_ep_buffer.h`: add `cuda.h` include; `use_fabric_mem_` / `fabric_mem_handle_` / `fabric_alloc_size_` members
- `mooncake_ep_buffer.cpp`:
  - `supportFabricMem()` (same check as `nvlink_transport.cpp`)
  - `cuMemCreate(FABRIC)` vs `cudaMalloc`
  - `cuMemUnmap` + `cuMemRelease` vs `cudaFree`
  - `get_ipc_handle()`: return an empty vector under fabric mode
  - `sync_nvlink_ipc_handles()`: skip IPC exchange, mark all ranks NVLink-accessible

Test plan
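The symmetric free path described in the Changes list would look roughly like this (untested sketch; member names are taken from the list above, and the `cuMemAddressFree` of the reserved VA range is our assumption):

```cpp
if (use_fabric_mem_) {
    CUdeviceptr dptr = reinterpret_cast<CUdeviceptr>(gdr_buffer);
    cuMemUnmap(dptr, fabric_alloc_size_);        // undo cuMemMap
    cuMemAddressFree(dptr, fabric_alloc_size_);  // release the VA reservation
    cuMemRelease(fabric_mem_handle_);            // undo cuMemCreate
} else {
    cudaFree(gdr_buffer);                        // original runtime-API path
}
```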
I don't have MNNVL hardware; @tzulingk offered to help validate on GB200.