
Fix EP buffer allocation for MNNVL clusters #1629

Merged
UNIDY2002 merged 11 commits into kvcache-ai:main from he-yufeng:fix/ep-fabric-alloc-mnnvl
Mar 9, 2026

Conversation

@he-yufeng
Contributor

Summary

On GB200 MNNVL clusters where NVLink fabric is the only transport (0 HCAs), EP hangs permanently during P2P handshake because gdr_buffer is allocated with cudaMalloc, which lacks fabric handles for cross-node NVLink access.

This PR adds conditional fabric-aware allocation:

  • When supportFabricMem() is true (MNNVL), allocate gdr_buffer via cuMemCreate(CU_MEM_HANDLE_TYPE_FABRIC) + cuMemMap + cuMemSetAccess, mirroring what allocatePinnedLocalMemory() already does in nvlink_transport.cpp
  • Skip IPC handle exchange under fabric mode — cuMemSetAccess(allDevices) makes the buffer globally visible across the clique
  • On IB clusters or single-node setups, the original cudaMalloc + IPC path is preserved unchanged

The two paths are mutually exclusive: MNNVL has 0 HCAs so init_ibgda() never runs; IB clusters have supportFabricMem() = false so cudaMalloc is unchanged. ibv_reg_mr is never called on a cuMemCreate buffer.

Fixes #1627

Changes

  • mooncake_ep_buffer.h: add cuda.h include, use_fabric_mem_ / fabric_mem_handle_ / fabric_alloc_size_ members
  • mooncake_ep_buffer.cpp:
    • Add supportFabricMem() (same check as nvlink_transport.cpp)
    • Constructor: conditional cuMemCreate(FABRIC) vs cudaMalloc
    • Destructor: conditional cuMemUnmap + cuMemRelease vs cudaFree (see the sketch after this list)
    • get_ipc_handle(): return empty vector under fabric mode
    • sync_nvlink_ipc_handles(): skip IPC exchange, mark all ranks NVLink-accessible
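
For concreteness, a minimal sketch of the destructor and get_ipc_handle() changes above. The member names (use_fabric_mem_, fabric_mem_handle_, fabric_alloc_size_, gdr_buffer) come from this PR; the surrounding details (gdr_buffer as a raw device pointer, get_ipc_handle() returning a byte vector) are assumptions for illustration:

MooncakeEpBuffer::~MooncakeEpBuffer() {
    if (use_fabric_mem_) {
        // Fabric path: undo cuMemMap / cuMemAddressReserve / cuMemCreate
        // in reverse order of allocation.
        CUdeviceptr dptr = reinterpret_cast<CUdeviceptr>(gdr_buffer);
        cuMemUnmap(dptr, fabric_alloc_size_);
        cuMemAddressFree(dptr, fabric_alloc_size_);
        cuMemRelease(fabric_mem_handle_);
    } else {
        // Original path: runtime-API free, unchanged.
        cudaFree(gdr_buffer);
    }
}

std::vector<uint8_t> MooncakeEpBuffer::get_ipc_handle() {
    // Fabric addresses are clique-visible, so there is no handle to export.
    if (use_fabric_mem_) return {};
    // ... original cudaIpcGetMemHandle path, unchanged ...
}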

Test plan

  • Verify on GB200 MNNVL cluster (no HCAs) — EP should no longer hang at startup
  • Verify on IB cluster (H100/A100) — behavior unchanged, IBGDA path unaffected
  • Verify single-node multi-GPU — IPC path preserved

I don't have MNNVL hardware; @tzulingk offered to help validate on GB200.

On GB200 MNNVL clusters where NVLink fabric is the only transport,
gdr_buffer was allocated with cudaMalloc which lacks fabric handles.
Remote nodes could not resolve the address, causing EP to hang.

When supportFabricMem() is true (MNNVL), allocate gdr_buffer via
cuMemCreate(CU_MEM_HANDLE_TYPE_FABRIC) + cuMemMap + cuMemSetAccess,
mirroring what allocatePinnedLocalMemory() does in nvlink_transport.cpp.
The IPC handle exchange path is skipped since fabric addresses are
globally visible across the clique.

On IB clusters or single-node setups, the original cudaMalloc path is
preserved unchanged.

Fixes kvcache-ai#1627
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves an issue causing expert parallelism (EP) hangs on GB200 MNNVL clusters by enhancing the gdr_buffer allocation strategy. It introduces a fabric-aware memory allocation path that uses CUDA's cuMemCreate with fabric handles, enabling cross-node NVLink access. This new path bypasses the traditional IPC handle exchange, which is not suitable for MNNVL environments, while keeping the existing cudaMalloc and IPC-based mechanisms intact for other cluster configurations.

Highlights

  • Conditional Memory Allocation: Implemented conditional memory allocation for the gdr_buffer. On MNNVL clusters, cuMemCreate with fabric handles is used for cross-node NVLink access, while cudaMalloc is retained for IB clusters and single-node setups.
  • Fabric Memory Management: Introduced logic to detect fabric memory support and manage CUmemGenericAllocationHandle for fabric-aware allocations, including cuMemMap, cuMemSetAccess, and proper deallocation.
  • IPC Handle Optimization: Modified the IPC handle exchange mechanism to skip it when fabric memory is in use, as cuMemSetAccess makes the buffer globally visible across the fabric clique, eliminating the need for explicit IPC handles.


Changelog
  • mooncake-ep/include/mooncake_ep_buffer.h
    • Added cuda.h include for CUDA driver API types.
    • Introduced use_fabric_mem_, fabric_mem_handle_, and fabric_alloc_size_ members to the MooncakeEpBuffer struct to manage fabric memory.
  • mooncake-ep/src/mooncake_ep_buffer.cpp
    • Added a static helper function supportFabricMem() to determine if all GPUs support fabric memory handles (MNNVL).
    • Modified the MooncakeEpBuffer constructor to conditionally allocate gdr_buffer using cuMemCreate (for fabric memory) or cudaMalloc (for non-fabric memory), including error handling and memory mapping.
    • Updated the MooncakeEpBuffer destructor to conditionally deallocate gdr_buffer using cuMemUnmap and cuMemRelease for fabric memory, or cudaFree for standard CUDA memory.
    • Modified get_ipc_handle() to return an empty vector when fabric memory is active, signaling that IPC handle exchange is not required.
    • Updated sync_nvlink_ipc_handles() to skip IPC handle processing and mark all ranks as NVLink-accessible when fabric memory is enabled, relying on cuMemSetAccess for global visibility.
Activity
  • The author, he-yufeng, has outlined a comprehensive test plan to verify the fix on GB200 MNNVL clusters, IB clusters (H100/A100), and single-node multi-GPU setups.
  • The author noted that @tzulingk offered to assist with validation on GB200 hardware, as the author does not have access to MNNVL hardware.

@gemini-code-assist Bot left a comment

Code Review

This pull request aims to add support for MNNVL clusters by using fabric-aware memory allocation for the EP buffer. However, the implementation introduces critical denial of service (DoS) vulnerabilities due to incorrect assumptions about Hopper-based systems and skipped IPC handle exchanges, leading to null pointer dereferences and crashes on both MNNVL and standard Hopper clusters. Additionally, there's a critical issue in how remote peer memory is handled, as the current implementation incorrectly assumes automatic fabric memory accessibility across nodes without handle exchange. A minor maintainability concern regarding repetitive resource cleanup code was also noted. Addressing these critical issues is essential for the feature to function correctly and securely.

Comment on lines +673 to 680
for (int i = 0; i < num_ranks; ++i) {
    nvlink_array[i] = 1;
    // Each rank's gdr_buffer is directly accessible; the remote
    // addresses will be exchanged via the RDMA address sync path
    // (sync_ib / sync_roce) or via a separate fabric address exchange.
    // For local rank, point to our own buffer.
    ipc_peer_ptrs_host[i] = (i == rank) ? gdr_buffer : nullptr;
}

security-high

This code introduces a denial of service (DoS) vulnerability. In MooncakeEpBuffer::sync_nvlink_ipc_handles, when use_fabric_mem_ is true, ipc_peer_ptrs_host[i] is incorrectly set to nullptr for remote ranks. Despite this, nvlink_array[i] is set to 1, indicating NVLink availability. GPU kernels (dispatch and combine in mooncake-ep/src/mooncake_ep_kernel.cu) will then attempt to access remote memory via these null pointers, causing a page fault and CUDA context crash, leading to DoS. For cross-node access with fabric memory, explicit handle exchange and mapping are required, as cuMemSetAccess is insufficient. The correct procedure involves exporting a CUmemFabricHandle from the CUmemGenericAllocationHandle using cuMemExportToShareableHandle, exchanging these fabric handles between ranks, importing them using cuMemImportFromShareableHandle, mapping them into the VA space using cuMemMap, and storing the resulting pointers in ipc_peer_ptrs_host. Refer to the implementation in mooncake-transfer-engine/src/transport/nvlink_transport/nvlink_transport.cpp's relocateSharedMemoryAddress function for a good example of this process.
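
Concretely, the export/exchange/import flow described above might look like the following sketch. CU_CHECK and exchange_all_to_all() are hypothetical placeholders (the project would plug in its own error macro and rank-exchange path), and rank, num_ranks, granularity, and local_device are assumed from the surrounding code:

CUmemFabricHandle local_handle{};
CU_CHECK(cuMemExportToShareableHandle(&local_handle, fabric_mem_handle_,
                                      CU_MEM_HANDLE_TYPE_FABRIC, 0));

// All-to-all exchange so every rank holds every peer's fabric handle.
std::vector<CUmemFabricHandle> peer_handles(num_ranks);
exchange_all_to_all(local_handle, peer_handles);  // hypothetical helper

for (int i = 0; i < num_ranks; ++i) {
    if (i == rank) { ipc_peer_ptrs_host[i] = gdr_buffer; continue; }
    CUmemGenericAllocationHandle imported;
    CU_CHECK(cuMemImportFromShareableHandle(&imported, &peer_handles[i],
                                            CU_MEM_HANDLE_TYPE_FABRIC));
    CUdeviceptr dptr = 0;
    CU_CHECK(cuMemAddressReserve(&dptr, fabric_alloc_size_, granularity, 0, 0));
    CU_CHECK(cuMemMap(dptr, fabric_alloc_size_, 0, imported, 0));
    // Grant the local device read/write access to the imported mapping.
    CUmemAccessDesc desc{};
    desc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    desc.location.id = local_device;
    desc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    CU_CHECK(cuMemSetAccess(dptr, fabric_alloc_size_, &desc, 1));
    ipc_peer_ptrs_host[i] = reinterpret_cast<void *>(dptr);
}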

Comment on lines +9 to +24
static bool supportFabricMem() {
    if (getenv("MC_USE_NVLINK_IPC")) return false;

    int num_devices = 0;
    cudaError_t err = cudaGetDeviceCount(&num_devices);
    if (err != cudaSuccess || num_devices == 0) return false;

    for (int dev = 0; dev < num_devices; ++dev) {
        int supported = 0;
        cuDeviceGetAttribute(&supported,
                             CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED,
                             dev);
        if (!supported) return false;
    }
    return true;
}

security-high

The supportFabricMem() function checks for the CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED attribute on all GPUs. This attribute is true for all NVIDIA Hopper (e.g., H100) and later GPUs. However, the PR assumes that if this attribute is true, the system is a GB200 MNNVL cluster where all nodes are in a single NVLink fabric clique. On standard H100 clusters that use InfiniBand for cross-node communication, supportFabricMem() will return true, but these clusters are NOT a single NVLink domain across nodes. The code in sync_nvlink_ipc_handles will then incorrectly set nvlink_available[i] = 1 for all ranks, including those on different nodes. This will cause the kernels to attempt NVLink access for cross-node communication, which will fail and crash the application. This is a significant regression that breaks Mooncake on any multi-node Hopper cluster that is not MNNVL.
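
One way to tighten the detection, sketched under the assumption that a fabric allocation which cannot actually be created and exported means the fabric path is unusable on that node (e.g. no IMEX service running): probe a real allocation instead of trusting the device attribute alone. Ranks would still need to agree on the probe result (and, strictly, on clique membership) before enabling the fabric path.

static bool probeFabricMem(int dev) {
    CUmemAllocationProp prop{};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;
    prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_FABRIC;

    size_t granularity = 0;
    if (cuMemGetAllocationGranularity(&granularity, &prop,
            CU_MEM_ALLOC_GRANULARITY_MINIMUM) != CUDA_SUCCESS)
        return false;

    // Create and export a minimal fabric allocation; failure of either
    // step indicates the attribute was a false positive on this node.
    CUmemGenericAllocationHandle handle;
    if (cuMemCreate(&handle, granularity, &prop, 0) != CUDA_SUCCESS)
        return false;

    CUmemFabricHandle fh{};
    bool ok = cuMemExportToShareableHandle(&fh, handle,
                  CU_MEM_HANDLE_TYPE_FABRIC, 0) == CUDA_SUCCESS;
    cuMemRelease(handle);
    return ok;
}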

Comment on lines +111 to +144
CUdeviceptr dptr = 0;
res = cuMemAddressReserve(&dptr, fabric_alloc_size_, granularity, 0, 0);
if (res != CUDA_SUCCESS) {
    cuMemRelease(fabric_mem_handle_);
    LOG(ERROR) << "[EP] cuMemAddressReserve failed: " << res;
    throw std::runtime_error("cuMemAddressReserve failed");
}

res = cuMemMap(dptr, fabric_alloc_size_, 0, fabric_mem_handle_, 0);
if (res != CUDA_SUCCESS) {
    cuMemAddressFree(dptr, fabric_alloc_size_);
    cuMemRelease(fabric_mem_handle_);
    LOG(ERROR) << "[EP] cuMemMap failed: " << res;
    throw std::runtime_error("cuMemMap failed");
}

// Grant read/write access to all devices in the fabric clique
int device_count = 0;
cudaGetDeviceCount(&device_count);
std::vector<CUmemAccessDesc> access(device_count);
for (int i = 0; i < device_count; ++i) {
    access[i].location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    access[i].location.id = i;
    access[i].flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
}
res = cuMemSetAccess(dptr, fabric_alloc_size_, access.data(),
                     device_count);
if (res != CUDA_SUCCESS) {
    cuMemUnmap(dptr, fabric_alloc_size_);
    cuMemAddressFree(dptr, fabric_alloc_size_);
    cuMemRelease(fabric_mem_handle_);
    LOG(ERROR) << "[EP] cuMemSetAccess failed: " << res;
    throw std::runtime_error("cuMemSetAccess failed");
}

medium

The error handling logic in this block for fabric memory allocation has a lot of duplicated resource cleanup code. For example, cuMemRelease(fabric_mem_handle_) is called in multiple failure paths. This makes the code harder to read and maintain.

Consider refactoring this to reduce repetition. Common patterns for this in C++ are:

  • Using RAII wrapper classes for the CUDA driver API resources (CUmemGenericAllocationHandle, CUdeviceptr for reserved VA). The destructor of the wrapper would handle cleanup.
  • Using a goto-based cleanup block at the end of the function, which is a common C-style pattern for resource management.

This would centralize the cleanup logic and make the allocation flow easier to follow.
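
As an illustration of the first option, a small RAII guard (names are illustrative, not from the codebase) that releases exactly the resources acquired so far unless the allocation is handed off:

struct FabricAllocGuard {
    CUmemGenericAllocationHandle handle{};  // 0 until cuMemCreate succeeds
    CUdeviceptr dptr{};                     // 0 until cuMemAddressReserve succeeds
    size_t size{};
    bool mapped{false};                     // set after cuMemMap succeeds
    bool committed{false};                  // set on the success path

    ~FabricAllocGuard() {
        if (committed) return;
        if (mapped) cuMemUnmap(dptr, size);
        if (dptr) cuMemAddressFree(dptr, size);
        if (handle) cuMemRelease(handle);
    }
};

Each failure branch can then simply log and throw; the guard's destructor unwinds whatever was acquired, and setting committed = true after stashing the handle and pointer in the member fields replaces the success-path hand-off.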

@codecov-commenter

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.


@UNIDY2002
Collaborator

Thank you for your contribution! Please fix the CI lint error.

@UNIDY2002
Collaborator

When testing the wheel artifact in a Hopper environment, I encountered "undefined symbol: cuMemAddressFree". Is there anything I should adjust in my settings?

EP now calls cuMemCreate, cuMemMap, cuMemSetAccess, cuMemAddressFree
etc. from the CUDA driver API (libcuda.so). Previously only the runtime
API was linked.
@he-yufeng
Contributor Author

@UNIDY2002 Thanks for testing! The undefined symbol: cuMemAddressFree is a linker issue — the EP target was only linking the CUDA runtime API (libcudart), but the new allocation path calls driver API functions (cuMemCreate, cuMemMap, cuMemAddressFree, etc.) from libcuda.so.

Just pushed a fix: added CUDA::cuda_driver to the EP link dependencies in mooncake-ep/src/CMakeLists.txt. This is the same pattern used by the transport layer (e.g. tent_xport_nvlink, tent_xport_mnnvl).

@UNIDY2002 (Collaborator) left a comment

Thank you very much for your valuable contribution!

