
Fix EP buffer allocation for MNNVL clusters #1629

Merged
UNIDY2002 merged 11 commits into kvcache-ai:main from he-yufeng:fix/ep-fabric-alloc-mnnvl
Mar 9, 2026

Conversation

@he-yufeng
Contributor

Summary

On GB200 MNNVL clusters where NVLink fabric is the only transport (0 HCAs), EP hangs permanently during P2P handshake because gdr_buffer is allocated with cudaMalloc, which lacks fabric handles for cross-node NVLink access.

This PR adds conditional fabric-aware allocation:

  • When supportFabricMem() is true (MNNVL), allocate gdr_buffer via cuMemCreate(CU_MEM_HANDLE_TYPE_FABRIC) + cuMemMap + cuMemSetAccess, mirroring what allocatePinnedLocalMemory() already does in nvlink_transport.cpp
  • Skip IPC handle exchange under fabric mode — cuMemSetAccess(allDevices) makes the buffer globally visible across the clique
  • On IB clusters or single-node setups, the original cudaMalloc + IPC path is preserved unchanged

The two paths are mutually exclusive: MNNVL has 0 HCAs so init_ibgda() never runs; IB clusters have supportFabricMem() = false so cudaMalloc is unchanged. ibv_reg_mr is never called on a cuMemCreate buffer.

Fixes #1627

Changes

  • mooncake_ep_buffer.h: add cuda.h include, use_fabric_mem_ / fabric_mem_handle_ / fabric_alloc_size_ members
  • mooncake_ep_buffer.cpp:
    • Add supportFabricMem() (same check as nvlink_transport.cpp)
    • Constructor: conditional cuMemCreate(FABRIC) vs cudaMalloc
    • Destructor: conditional cuMemUnmap + cuMemRelease vs cudaFree (see the sketch after this list)
    • get_ipc_handle(): return empty vector under fabric mode
    • sync_nvlink_ipc_handles(): skip IPC exchange, mark all ranks NVLink-accessible
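
For concreteness, a minimal sketch of the destructor and get_ipc_handle() changes above. The member names (use_fabric_mem_, fabric_mem_handle_, fabric_alloc_size_, gdr_buffer) come from this PR; the surrounding details (gdr_buffer as a raw device pointer, get_ipc_handle() returning a byte vector) are assumptions for illustration:

MooncakeEpBuffer::~MooncakeEpBuffer() {
    if (use_fabric_mem_) {
        // Fabric path: undo cuMemMap / cuMemAddressReserve / cuMemCreate
        // in reverse order of allocation.
        CUdeviceptr dptr = reinterpret_cast<CUdeviceptr>(gdr_buffer);
        cuMemUnmap(dptr, fabric_alloc_size_);
        cuMemAddressFree(dptr, fabric_alloc_size_);
        cuMemRelease(fabric_mem_handle_);
    } else {
        // Original path: runtime-API free, unchanged.
        cudaFree(gdr_buffer);
    }
}

std::vector<uint8_t> MooncakeEpBuffer::get_ipc_handle() {
    // Fabric addresses are clique-visible, so there is no handle to export.
    if (use_fabric_mem_) return {};
    // ... original cudaIpcGetMemHandle path, unchanged ...
}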

Test plan

  • Verify on GB200 MNNVL cluster (no HCAs) — EP should no longer hang at startup
  • Verify on IB cluster (H100/A100) — behavior unchanged, IBGDA path unaffected
  • Verify single-node multi-GPU — IPC path preserved

I don't have MNNVL hardware; @tzulingk offered to help validate on GB200.

On GB200 MNNVL clusters where NVLink fabric is the only transport,
gdr_buffer was allocated with cudaMalloc which lacks fabric handles.
Remote nodes could not resolve the address, causing EP to hang.

When supportFabricMem() is true (MNNVL), allocate gdr_buffer via
cuMemCreate(CU_MEM_HANDLE_TYPE_FABRIC) + cuMemMap + cuMemSetAccess,
mirroring what allocatePinnedLocalMemory() does in nvlink_transport.cpp.
The IPC handle exchange path is skipped since fabric addresses are
globally visible across the clique.

On IB clusters or single-node setups, the original cudaMalloc path is
preserved unchanged.

Fixes kvcache-ai#1627
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves an issue causing expert parallelism (EP) hangs on GB200 MNNVL clusters by enhancing the gdr_buffer allocation strategy. It introduces a fabric-aware memory allocation path that uses CUDA's cuMemCreate with fabric handles, enabling cross-node NVLink access. This new path bypasses the traditional IPC handle exchange, which is not suitable for MNNVL environments, while keeping the existing cudaMalloc and IPC-based mechanisms intact for other cluster configurations.

Highlights

  • Conditional Memory Allocation: Implemented conditional memory allocation for the gdr_buffer. On MNNVL clusters, cuMemCreate with fabric handles is used for cross-node NVLink access, while cudaMalloc is retained for IB clusters and single-node setups.
  • Fabric Memory Management: Introduced logic to detect fabric memory support and manage CUmemGenericAllocationHandle for fabric-aware allocations, including cuMemMap, cuMemSetAccess, and proper deallocation.
  • IPC Handle Optimization: Modified the IPC handle exchange mechanism to skip it when fabric memory is in use, as cuMemSetAccess makes the buffer globally visible across the fabric clique, eliminating the need for explicit IPC handles.


Changelog
  • mooncake-ep/include/mooncake_ep_buffer.h
    • Added cuda.h include for CUDA driver API types.
    • Introduced use_fabric_mem_, fabric_mem_handle_, and fabric_alloc_size_ members to the MooncakeEpBuffer struct to manage fabric memory.
  • mooncake-ep/src/mooncake_ep_buffer.cpp
    • Added a static helper function supportFabricMem() to determine if all GPUs support fabric memory handles (MNNVL).
    • Modified the MooncakeEpBuffer constructor to conditionally allocate gdr_buffer using cuMemCreate (for fabric memory) or cudaMalloc (for non-fabric memory), including error handling and memory mapping.
    • Updated the MooncakeEpBuffer destructor to conditionally deallocate gdr_buffer using cuMemUnmap and cuMemRelease for fabric memory, or cudaFree for standard CUDA memory.
    • Modified get_ipc_handle() to return an empty vector when fabric memory is active, signaling that IPC handle exchange is not required.
    • Updated sync_nvlink_ipc_handles() to skip IPC handle processing and mark all ranks as NVLink-accessible when fabric memory is enabled, relying on cuMemSetAccess for global visibility.
Activity
  • The author, he-yufeng, has outlined a comprehensive test plan to verify the fix on GB200 MNNVL clusters, IB clusters (H100/A100), and single-node multi-GPU setups.
  • The author noted that @tzulingk offered to assist with validation on GB200 hardware, as the author does not have access to MNNVL hardware.

@gemini-code-assist Bot left a comment

Code Review

This pull request aims to add support for MNNVL clusters by using fabric-aware memory allocation for the EP buffer. However, the implementation introduces critical denial of service (DoS) vulnerabilities due to incorrect assumptions about Hopper-based systems and skipped IPC handle exchanges, leading to null pointer dereferences and crashes on both MNNVL and standard Hopper clusters. Additionally, there's a critical issue in how remote peer memory is handled, as the current implementation incorrectly assumes automatic fabric memory accessibility across nodes without handle exchange. A minor maintainability concern regarding repetitive resource cleanup code was also noted. Addressing these critical issues is essential for the feature to function correctly and securely.

Comment on lines +673 to 680
for (int i = 0; i < num_ranks; ++i) {
    nvlink_array[i] = 1;
    // Each rank's gdr_buffer is directly accessible; the remote
    // addresses will be exchanged via the RDMA address sync path
    // (sync_ib / sync_roce) or via a separate fabric address exchange.
    // For local rank, point to our own buffer.
    ipc_peer_ptrs_host[i] = (i == rank) ? gdr_buffer : nullptr;
}

security-high

This code introduces a denial of service (DoS) vulnerability. In MooncakeEpBuffer::sync_nvlink_ipc_handles, when use_fabric_mem_ is true, ipc_peer_ptrs_host[i] is incorrectly set to nullptr for remote ranks. Despite this, nvlink_array[i] is set to 1, indicating NVLink availability. GPU kernels (dispatch and combine in mooncake-ep/src/mooncake_ep_kernel.cu) will then attempt to access remote memory via these null pointers, causing a page fault and CUDA context crash, leading to DoS. For cross-node access with fabric memory, explicit handle exchange and mapping are required, as cuMemSetAccess is insufficient. The correct procedure involves exporting a CUmemFabricHandle from the CUmemGenericAllocationHandle using cuMemExportToShareableHandle, exchanging these fabric handles between ranks, importing them using cuMemImportFromShareableHandle, mapping them into the VA space using cuMemMap, and storing the resulting pointers in ipc_peer_ptrs_host. Refer to the implementation in mooncake-transfer-engine/src/transport/nvlink_transport/nvlink_transport.cpp's relocateSharedMemoryAddress function for a good example of this process.
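
Concretely, the export/exchange/import flow described above might look like the following sketch. CU_CHECK and exchange_all_to_all() are hypothetical placeholders (the project would plug in its own error macro and rank-exchange path), and rank, num_ranks, granularity, and local_device are assumed from the surrounding code:

CUmemFabricHandle local_handle{};
CU_CHECK(cuMemExportToShareableHandle(&local_handle, fabric_mem_handle_,
                                      CU_MEM_HANDLE_TYPE_FABRIC, 0));

// All-to-all exchange so every rank holds every peer's fabric handle.
std::vector<CUmemFabricHandle> peer_handles(num_ranks);
exchange_all_to_all(local_handle, peer_handles);  // hypothetical helper

for (int i = 0; i < num_ranks; ++i) {
    if (i == rank) { ipc_peer_ptrs_host[i] = gdr_buffer; continue; }
    CUmemGenericAllocationHandle imported;
    CU_CHECK(cuMemImportFromShareableHandle(&imported, &peer_handles[i],
                                            CU_MEM_HANDLE_TYPE_FABRIC));
    CUdeviceptr dptr = 0;
    CU_CHECK(cuMemAddressReserve(&dptr, fabric_alloc_size_, granularity, 0, 0));
    CU_CHECK(cuMemMap(dptr, fabric_alloc_size_, 0, imported, 0));
    // Grant the local device read/write access to the imported mapping.
    CUmemAccessDesc desc{};
    desc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    desc.location.id = local_device;
    desc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    CU_CHECK(cuMemSetAccess(dptr, fabric_alloc_size_, &desc, 1));
    ipc_peer_ptrs_host[i] = reinterpret_cast<void *>(dptr);
}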

Comment on lines +9 to +24
static bool supportFabricMem() {
    if (getenv("MC_USE_NVLINK_IPC")) return false;

    int num_devices = 0;
    cudaError_t err = cudaGetDeviceCount(&num_devices);
    if (err != cudaSuccess || num_devices == 0) return false;

    for (int dev = 0; dev < num_devices; ++dev) {
        int supported = 0;
        cuDeviceGetAttribute(&supported,
                             CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED,
                             dev);
        if (!supported) return false;
    }
    return true;
}

security-high

The supportFabricMem() function checks for the CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED attribute on all GPUs. This attribute is true for all NVIDIA Hopper (e.g., H100) and later GPUs. However, the PR assumes that if this attribute is true, the system is a GB200 MNNVL cluster where all nodes are in a single NVLink fabric clique. On standard H100 clusters that use InfiniBand for cross-node communication, supportFabricMem() will return true, but these clusters are NOT a single NVLink domain across nodes. The code in sync_nvlink_ipc_handles will then incorrectly set nvlink_available[i] = 1 for all ranks, including those on different nodes. This will cause the kernels to attempt NVLink access for cross-node communication, which will fail and crash the application. This is a significant regression that breaks Mooncake on any multi-node Hopper cluster that is not MNNVL.
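
One way to tighten the detection, sketched under the assumption that a fabric allocation which cannot actually be created and exported means the fabric path is unusable on that node (e.g. no IMEX service running): probe a real allocation instead of trusting the device attribute alone. Ranks would still need to agree on the probe result (and, strictly, on clique membership) before enabling the fabric path.

static bool probeFabricMem(int dev) {
    CUmemAllocationProp prop{};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;
    prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_FABRIC;

    size_t granularity = 0;
    if (cuMemGetAllocationGranularity(&granularity, &prop,
            CU_MEM_ALLOC_GRANULARITY_MINIMUM) != CUDA_SUCCESS)
        return false;

    // Create and export a minimal fabric allocation; failure of either
    // step indicates the attribute was a false positive on this node.
    CUmemGenericAllocationHandle handle;
    if (cuMemCreate(&handle, granularity, &prop, 0) != CUDA_SUCCESS)
        return false;

    CUmemFabricHandle fh{};
    bool ok = cuMemExportToShareableHandle(&fh, handle,
                  CU_MEM_HANDLE_TYPE_FABRIC, 0) == CUDA_SUCCESS;
    cuMemRelease(handle);
    return ok;
}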

Comment on lines +111 to +144
CUdeviceptr dptr = 0;
res = cuMemAddressReserve(&dptr, fabric_alloc_size_, granularity, 0, 0);
if (res != CUDA_SUCCESS) {
    cuMemRelease(fabric_mem_handle_);
    LOG(ERROR) << "[EP] cuMemAddressReserve failed: " << res;
    throw std::runtime_error("cuMemAddressReserve failed");
}

res = cuMemMap(dptr, fabric_alloc_size_, 0, fabric_mem_handle_, 0);
if (res != CUDA_SUCCESS) {
    cuMemAddressFree(dptr, fabric_alloc_size_);
    cuMemRelease(fabric_mem_handle_);
    LOG(ERROR) << "[EP] cuMemMap failed: " << res;
    throw std::runtime_error("cuMemMap failed");
}

// Grant read/write access to all devices in the fabric clique
int device_count = 0;
cudaGetDeviceCount(&device_count);
std::vector<CUmemAccessDesc> access(device_count);
for (int i = 0; i < device_count; ++i) {
    access[i].location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    access[i].location.id = i;
    access[i].flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
}
res = cuMemSetAccess(dptr, fabric_alloc_size_, access.data(),
                     device_count);
if (res != CUDA_SUCCESS) {
    cuMemUnmap(dptr, fabric_alloc_size_);
    cuMemAddressFree(dptr, fabric_alloc_size_);
    cuMemRelease(fabric_mem_handle_);
    LOG(ERROR) << "[EP] cuMemSetAccess failed: " << res;
    throw std::runtime_error("cuMemSetAccess failed");
}

medium

The error handling logic in this block for fabric memory allocation has a lot of duplicated resource cleanup code. For example, cuMemRelease(fabric_mem_handle_) is called in multiple failure paths. This makes the code harder to read and maintain.

Consider refactoring this to reduce repetition. Common patterns for this in C++ are:

  • Using RAII wrapper classes for the CUDA driver API resources (CUmemGenericAllocationHandle, CUdeviceptr for reserved VA). The destructor of the wrapper would handle cleanup.
  • Using a goto-based cleanup block at the end of the function, which is a common C-style pattern for resource management.

This would centralize the cleanup logic and make the allocation flow easier to follow.
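
As an illustration of the first option, a small RAII guard (names are illustrative, not from the codebase) that releases exactly the resources acquired so far unless the allocation is handed off:

struct FabricAllocGuard {
    CUmemGenericAllocationHandle handle{};  // 0 until cuMemCreate succeeds
    CUdeviceptr dptr{};                     // 0 until cuMemAddressReserve succeeds
    size_t size{};
    bool mapped{false};                     // set after cuMemMap succeeds
    bool committed{false};                  // set on the success path

    ~FabricAllocGuard() {
        if (committed) return;
        if (mapped) cuMemUnmap(dptr, size);
        if (dptr) cuMemAddressFree(dptr, size);
        if (handle) cuMemRelease(handle);
    }
};

Each failure branch can then simply log and throw; the guard's destructor unwinds whatever was acquired, and setting committed = true after stashing the handle and pointer in the member fields replaces the success-path hand-off.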

@codecov-commenter

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.


@UNIDY2002
Collaborator

Thank you for your contribution! Please fix the CI lint error.

@UNIDY2002
Collaborator

When testing the wheel artifact in a Hopper environment, I encountered "undefined symbol: cuMemAddressFree". Is there anything I should adjust in my settings?

EP now calls cuMemCreate, cuMemMap, cuMemSetAccess, cuMemAddressFree
etc. from the CUDA driver API (libcuda.so). Previously only the runtime
API was linked.
@he-yufeng
Contributor Author

@UNIDY2002 Thanks for testing! The undefined symbol: cuMemAddressFree is a linker issue — the EP target was only linking the CUDA runtime API (libcudart), but the new allocation path calls driver API functions (cuMemCreate, cuMemMap, cuMemAddressFree, etc.) from libcuda.so.

Just pushed a fix: added CUDA::cuda_driver to the EP link dependencies in mooncake-ep/src/CMakeLists.txt. This is the same pattern used by the transport layer (e.g. tent_xport_nvlink, tent_xport_mnnvl).

@UNIDY2002 (Collaborator) left a comment

Thank you very much for your valuable contribution!

