Summary
After applying #1629 (which fixed gdr_buffer allocation in mooncake-ep), we hit a new hang on AWS GB200 MNNVL clusters. The mooncake-pg connection poller warmup handshake uses CPU heap buffers as NVLink write targets, which the MNNVL transport cannot access cross-node.
Related: #1627 (original gdr_buffer bug), #1629 (partial fix).
cc @he-yufeng
Environment
- GPU: NVIDIA GB200 (4 GPUs/node × 4 nodes = 16 GPUs total)
- Cluster: Multi-node NVLink fabric (MNNVL) via Kubernetes ComputeDomain
  - CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED (attr=128) = 1 on all GPUs ✓ — ComputeDomain is correctly set up (verified with the query sketch after this list)
- No InfiniBand/EFA in pods → transfer engine auto-selects NVLink:
  ```
  Using cross-node NVLink transport (MC_FORCE_MNNVL or no HCA detected)
  ```
- sglang launch flags:
  ```
  --moe-a2a-backend mooncake --elastic-ep-backend mooncake --deepep-mode low_latency
  ```
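For reference, this is how fabric support was checked per GPU — a minimal sketch using the standard CUDA Driver API (requires CUDA 12.4+ headers for the attribute enum; not project code):

```cpp
#include <cuda.h>
#include <cstdio>

int main() {
    cuInit(0);
    int count = 0;
    cuDeviceGetCount(&count);
    for (int i = 0; i < count; ++i) {
        CUdevice dev;
        cuDeviceGet(&dev, i);
        int fabric = 0;
        // 1 means the device can allocate CU_MEM_HANDLE_TYPE_FABRIC memory.
        cuDeviceGetAttribute(&fabric,
                             CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED, dev);
        printf("GPU %d: CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED = %d\n",
               i, fabric);
    }
    return 0;
}
```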
Symptoms
Workers spam indefinitely (connection_poller.cpp:123):
```
Rank 4 got invalid buffer data from 12.
Rank 5 got invalid buffer data from 13.
...
```
Leader logs (nvlink_transport.cpp:497):
```
Memory region 0x15ba1e50 is not allocated by cuMemCreate, but it can be used as local buffer
Requested address 0x15ba2184 to 0x15ba2188 not found!
Requested address 0x1eb91128 to 0x1eb9112c not found!
```
waitUntilAllConnected() blocks forever → sglang never finishes initializing.
Root Cause
In mooncake-pg/src/connection_poller.cpp, ConnectionContext allocates warmup buffers from CPU heap:
```cpp
// connection_poller.cpp — ConnectionContext constructor
warmup_send_region_ = new int32_t[kMaxNumRanks];    // CPU heap!
warmup_recv_region_ = new int32_t[kMaxNumRanks]{};  // CPU heap!
engine_->registerLocalMemory(warmup_send_region_,
                             kMaxNumRanks * sizeof(int32_t), kWildcardLocation);
engine_->registerLocalMemory(warmup_recv_region_,
                             kMaxNumRanks * sizeof(int32_t), kWildcardLocation);
```
Their addresses are published in SegmentInfo via mooncake_backend.cpp:
```cpp
rank_info.warmup_buffer[0] = (uint64_t)connection_ctx_->warmup_send_region();
rank_info.warmup_buffer[1] = (uint64_t)connection_ctx_->warmup_recv_region();
```
During the warmup handshake (pollPeer, WAITING_WARMUP_TRANSFER state), the lower-ranked peer submits an NVLink WRITE to the remote rank's warmup_recv_region_:
```cpp
engine_->submitTransfer(batchID, {TransferRequest{
    .opcode = TransferRequest::WRITE,
    .source = warmup_send_region_,
    .target_id = meta_->segmentIDs[pollingRank],
    .target_offset = meta_->segmentInfos[pollingRank].warmup_buffer[1] + rank_ * sizeof(int32_t),
    .length = sizeof(int32_t),
}});
```
The MNNVL NVLink transport can only access cuMemCreate(CU_MEM_HANDLE_TYPE_FABRIC) GPU memory cross-node. CPU heap memory is registered as local-only by the transport, so any cross-node NVLink write to it fails with Requested address not found!.
The state machine then resets to WAITING_STORE and retries forever.
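For context, the transport's refusal matches what the CUDA virtual memory API reports for such pointers. A minimal diagnostic sketch (isFabricAccessible is a hypothetical helper, not a Mooncake API; the Driver API calls are real, CUDA 12.3+ for CU_MEM_HANDLE_TYPE_FABRIC):

```cpp
#include <cuda.h>

// Returns true iff addr is backed by a cuMemCreate() allocation that was
// requested with a FABRIC-shareable handle, which is the precondition for
// cross-node NVLink access. Plain CPU heap pointers fail the retain call,
// matching the "not allocated by cuMemCreate" log above.
bool isFabricAccessible(const void* addr) {
    CUmemGenericAllocationHandle handle;
    if (cuMemRetainAllocationHandle(&handle, const_cast<void*>(addr)) != CUDA_SUCCESS) {
        return false;  // not a cuMemCreate()/cuMemMap() mapping at all
    }
    CUmemAllocationProp prop{};
    cuMemGetAllocationPropertiesFromHandle(&prop, handle);
    cuMemRelease(handle);  // drop the reference taken by the retain
    return (prop.requestedHandleTypes & CU_MEM_HANDLE_TYPE_FABRIC) != 0;
}
```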
Proposed Fix
Option A — Skip cross-node warmup for MNNVL (simpler):
On a ComputeDomain cluster, NVLink connectivity between all ranks is guaranteed by the fabric infrastructure. The store key exchange alone is sufficient proof of peer reachability. The warmup write can be skipped.
```cpp
// In pollPeer, case WAITING_STORE, after successfully reading SegmentInfo:
auto segment_id = engine_->openSegment(peerServerName);
meta_->segmentIDs[pollingRank] = segment_id;
peerState.segmentId = segment_id;
memcpy(&meta_->segmentInfos[pollingRank], buffer_data.data(), sizeof(SegmentInfo));

if (supportFabricMem()) {
    // MNNVL: fabric connectivity guaranteed by ComputeDomain.
    // Skip warmup write — warmup_recv_region_ is CPU heap, not fabric-accessible.
    meta_->peerConnected[pollingRank] = true;
    global_peerConnected_[globalPollingRank] = true;
    peerState.state = PeerConnectionState::CONNECTED;
    totalConnectedPeers_.fetch_add(1, std::memory_order_release);
    if (isAllPeerConnected()) backend_wakeup_cv_.notify_all();
} else if (pollingRank <= rank_) {
    // Original IB/RoCE path: send warmup write
    ...
}
```
Option B — Allocate warmup buffers as FABRIC memory (more correct):
Allocate warmup_recv_region_ with cuMemCreate(CU_MEM_HANDLE_TYPE_FABRIC) + cuMemMap, export the fabric handle, share it alongside SegmentInfo in the store, and import/map it on the remote side before the warmup write. This is more involved but actually validates end-to-end NVLink connectivity.
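A minimal sketch of the allocation side, assuming hypothetical names (FabricWarmupRegion, allocFabricWarmupRegion) that don't exist in Mooncake today; error handling elided, CUDA 12.3+ required for fabric handles:

```cpp
#include <cuda.h>
#include <cstdint>

struct FabricWarmupRegion {
    CUdeviceptr ptr;                     // mapped VA of the warmup region
    CUmemGenericAllocationHandle handle; // owning allocation handle
    CUmemFabricHandle export_handle;     // published via the store next to SegmentInfo
    size_t size;                         // padded to allocation granularity
};

FabricWarmupRegion allocFabricWarmupRegion(int device, size_t num_ranks) {
    FabricWarmupRegion r{};
    CUmemAllocationProp prop{};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = device;
    prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_FABRIC;  // the key difference

    size_t gran = 0;
    cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    r.size = ((num_ranks * sizeof(int32_t)) + gran - 1) / gran * gran;

    cuMemCreate(&r.handle, r.size, &prop, 0);
    cuMemExportToShareableHandle(&r.export_handle, r.handle,
                                 CU_MEM_HANDLE_TYPE_FABRIC, 0);

    // Reserve a VA range, map the allocation into it, and grant RW access.
    cuMemAddressReserve(&r.ptr, r.size, gran, 0, 0);
    cuMemMap(r.ptr, r.size, /*offset=*/0, r.handle, 0);
    CUmemAccessDesc access{};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(r.ptr, r.size, &access, 1);
    return r;
}
```

The exported CUmemFabricHandle is a flat byte blob, so it can be serialized into the store entry alongside SegmentInfo; the peer then calls cuMemImportFromShareableHandle() and maps the region before issuing the warmup WRITE.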
We're happy to test a follow-up patch if helpful!