Fix MNNVL warmup hang: skip warmup when fabric mem is available #1644
UNIDY2002 merged 3 commits into kvcache-ai:main
Conversation
On GB200 MNNVL clusters, the warmup handshake in ConnectionContext allocates send/recv buffers from the CPU heap. The NVLink transport can only access cuMemCreate(CU_MEM_HANDLE_TYPE_FABRIC) memory cross-node, so remote writes to these heap buffers silently fail and the state machine retries forever. Since MNNVL fabric guarantees connectivity between all peers in a ComputeDomain, we can safely skip the warmup write entirely when supportFabricMem() is true. The store key exchange alone is sufficient proof of peer reachability.

Changes:
- Add supportFabricMem() to connection_poller.cpp (same check as nvlink_transport.cpp and mooncake_ep_buffer.cpp)
- Skip warmup buffer allocation and warmup write when on MNNVL, transition directly to CONNECTED after opening the segment
- Guard destructor against null warmup buffers
- Link CUDA driver library in setup.py for cuDeviceGetAttribute

Fixes kvcache-ai#1639
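As a rough illustration of the skip path described above (a hypothetical sketch, not the PR's actual diff; `num_ranks_` is a placeholder member name):

```cpp
// Hypothetical constructor fragment: decide once whether the warmup handshake
// is needed, and only allocate the heap regions when it is.
skip_warmup_ = supportFabricMem();
if (skip_warmup_) {
    // MNNVL: heap buffers would not be fabric-accessible anyway.
    warmup_send_region_ = nullptr;
    warmup_recv_region_ = nullptr;
} else {
    // num_ranks_ is a placeholder for however many peer slots the poller tracks.
    warmup_send_region_ = new int32_t[num_ranks_]();
    warmup_recv_region_ = new int32_t[num_ranks_]();
}
```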
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves a deadlock issue encountered on MNNVL clusters during the connection warmup handshake.
The CI environment seems to have issues with CUDA linking. You might want to take a look at how it's done in #1629.

Sent from UNIDY2002's OpenClaw
Code Review
This pull request aims to resolve a hang issue on MNNVL clusters by skipping the warmup handshake when fabric memory is available. However, a critical vulnerability has been identified: a null pointer dereference in the CONNECTED state handling within pollPeer. This occurs because the logic fails to account for null pointers in the connection failure recovery path, leading to an application crash if a peer disconnects when warmup is skipped. Additionally, the implementation could be improved by addressing code duplication in supportFabricMem and pollPeer, and by adding error checking for a CUDA driver call.
```cpp
// remote NVLink writes to them will fail. The fabric topology already
// guarantees connectivity, so we skip the warmup handshake entirely.
warmup_send_region_ = nullptr;
warmup_recv_region_ = nullptr;
```
A null pointer dereference occurs here when skip_warmup_ is true. In the constructor, warmup_recv_region_ is set to nullptr if skip_warmup_ is enabled. However, the CONNECTED state logic in pollPeer (lines 304-306) attempts to reset this region by dereferencing the pointer without a null check. If a connection failure is detected (e.g., a peer disconnects), the poller thread will crash with a segmentation fault, leading to a Denial of Service.
To fix this, ensure that the warmup region is only reset if it was actually allocated.
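A minimal sketch of the guard this implies, assuming the reset clears a per-rank slot in the warmup region (the indexing and reset value are illustrative, not the actual code):

```cpp
// Sketch only: skip the reset when warmup was skipped and the region was
// never allocated, so the skip-warmup path cannot dereference a null pointer.
if (warmup_recv_region_ != nullptr) {
    warmup_recv_region_[pollingRank] = 0;  // assumed per-rank warmup slot
}
```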
```cpp
static bool supportFabricMem() {
    if (getenv("MC_USE_NVLINK_IPC")) return false;

    int num_devices = 0;
    cudaError_t err = cudaGetDeviceCount(&num_devices);
    if (err != cudaSuccess || num_devices == 0) return false;

    for (int dev = 0; dev < num_devices; ++dev) {
        int supported = 0;
        cuDeviceGetAttribute(
            &supported, CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED, dev);
        if (!supported) return false;
    }
    return true;
}
```
This function is a good addition. I have two suggestions for improvement:
- Code Duplication: As noted in the PR description, this function is duplicated across a few files. To improve long-term maintainability, it would be ideal to move this to a shared utility file (e.g., `cuda_utils.h`/`.cpp`) to avoid having to update it in multiple places in the future.
- Error Handling: The return value from `cuDeviceGetAttribute` is not checked. It's good practice to check for errors from CUDA driver API calls to ensure robustness.
Here's a suggestion that adds the error check:
```cpp
static bool supportFabricMem() {
    if (getenv("MC_USE_NVLINK_IPC")) return false;
    int num_devices = 0;
    cudaError_t err = cudaGetDeviceCount(&num_devices);
    if (err != cudaSuccess || num_devices == 0) return false;
    for (int dev = 0; dev < num_devices; ++dev) {
        int supported = 0;
        CUresult cu_err = cuDeviceGetAttribute(
            &supported, CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED, dev);
        if (cu_err != CUDA_SUCCESS) {
            // Consider logging a warning here for easier debugging.
            return false;
        }
        if (!supported) return false;
    }
    return true;
}
```

```cpp
if (skip_warmup_) {
    // MNNVL: fabric guarantees connectivity, skip warmup write
    // since CPU heap buffers aren't fabric-accessible anyway.
    meta_->peerConnected[pollingRank] = true;
    global_peerConnected_[globalPollingRank] = true;
    peerState.state = PeerConnectionState::CONNECTED;
    {
        std::lock_guard<std::mutex> lock(backend_wakeup_mutex_);
        totalConnectedPeers_.fetch_add(1, std::memory_order_release);
        if (isAllPeerConnected()) backend_wakeup_cv_.notify_all();
    }
} else if (pollingRank <= rank_) {
```
This block of code to mark a peer as connected is also present in the state handlers for WAITING_WARMUP_TRANSFER (lines 218-231) and WAITING_PEER_WARMUP (lines 249-258). To reduce code duplication and improve readability, consider extracting this logic into a private helper method within the ConnectionContext class, for example void markPeerAsConnected(int pollingRank, uint64_t globalPollingRank).
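A rough sketch of what such a helper might look like, using only the member names visible in the diffs above; the exact class layout and signature details are assumptions:

```cpp
// Hypothetical helper consolidating the three places that mark a peer as
// connected. The caller would still update its local peerState.state.
void ConnectionContext::markPeerAsConnected(int pollingRank,
                                            uint64_t globalPollingRank) {
    meta_->peerConnected[pollingRank] = true;
    global_peerConnected_[globalPollingRank] = true;
    std::lock_guard<std::mutex> lock(backend_wakeup_mutex_);
    totalConnectedPeers_.fetch_add(1, std::memory_order_release);
    if (isAllPeerConnected()) backend_wakeup_cv_.notify_all();
}
```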
Use CUDA_HOME/lib64/stubs/libcuda.so stub path, same approach as mooncake-ep/setup.py. The CI environment doesn't have libcuda.so in the standard library search path.
Good catch, thanks! Fixed in 2761535 — now using the same stub path approach as mooncake-ep/setup.py.
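For reference, a minimal sketch of the stub-linking idea discussed above; the variable names and how the flags are wired into the extension are assumptions, not the actual setup.py change:

```python
# Hypothetical setup.py fragment: point the linker at the CUDA driver stub so
# CI hosts without a real libcuda.so can still link cuDeviceGetAttribute.
import os

cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda")
cuda_stub_dir = os.path.join(cuda_home, "lib64", "stubs")

extra_link_args = [f"-L{cuda_stub_dir}", "-lcuda"]
# These args would then be passed to the extension being built.
```

Linking against the stub only satisfies the symbol at build time; at runtime the real driver's libcuda.so is loaded instead.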
UNIDY2002 left a comment:
Thank you very much for your work!
Summary
Fixes #1639. Follow-up to #1629 (EP buffer fix).
On GB200 MNNVL clusters, `ConnectionContext` warmup buffers are allocated from CPU heap (`new int32_t[]`). The NVLink transport can only access `cuMemCreate(CU_MEM_HANDLE_TYPE_FABRIC)` memory cross-node, so remote writes to these heap buffers silently fail and `waitUntilAllConnected()` blocks forever.

Fix

When `supportFabricMem()` is true (MNNVL cluster), skip the warmup handshake entirely and transition directly to `CONNECTED` after the store key exchange + `openSegment()`. The fabric topology already guarantees connectivity between all peers in a ComputeDomain, so the warmup write is redundant.

This is Option A from the issue — simpler than allocating warmup buffers with `cuMemCreate(FABRIC)` (Option B), and avoids the complexity of fabric handle exchange for what is essentially a connectivity check that the fabric infrastructure already guarantees.
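For comparison, a rough sketch of what Option B's fabric-backed warmup buffers would involve (standard CUDA virtual memory management driver calls; the function and parameter names are placeholders, not code from this PR):

```cpp
#include <cuda.h>

// Sketch of Option B: allocate a fabric-shareable buffer instead of CPU heap.
// padded_size must be a multiple of the granularity reported by
// cuMemGetAllocationGranularity; error handling is omitted for brevity.
CUmemFabricHandle allocateFabricWarmupBuffer(int device_id, size_t padded_size) {
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = device_id;
    prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_FABRIC;

    CUmemGenericAllocationHandle handle;
    cuMemCreate(&handle, padded_size, &prop, 0);

    // The fabric handle still has to be exchanged via the store, imported on
    // every peer, and mapped with cuMemMap/cuMemSetAccess before any remote
    // write, which is the extra complexity Option A avoids.
    CUmemFabricHandle fabric_handle;
    cuMemExportToShareableHandle(&fabric_handle, handle, CU_MEM_HANDLE_TYPE_FABRIC, 0);
    return fabric_handle;
}
```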
Changes

- `connection_poller.cpp`: Add `supportFabricMem()` (same check as `nvlink_transport.cpp` and `mooncake_ep_buffer.cpp`)
- `connection_poller.cpp`: Skip buffer allocation and warmup write when `skip_warmup_` is true, go straight to `CONNECTED`
- `connection_poller.h`: Add `skip_warmup_` member
- `setup.py`: Link `cuda` driver library for `cuDeviceGetAttribute`

Testing

Needs verification on an MNNVL cluster (GB200 NVSwitch domain). IB clusters are unaffected — `supportFabricMem()` returns false, so the warmup path is unchanged.

cc @tzulingk @UNIDY2002