Summary
After applying #1629 (which fixed gdr_buffer allocation in mooncake-ep), we hit a new hang on AWS GB200 MNNVL clusters. The mooncake-pg connection poller warmup handshake uses CPU heap buffers as NVLink write targets, which the MNNVL transport cannot access cross-node.
Related: #1627 (original gdr_buffer bug), #1629 (partial fix).
cc @he-yufeng
Environment
- GPU: NVIDIA GB200 (4 GPUs/node × 4 nodes = 16 GPUs total)
- Cluster: Multi-node NVLink fabric (MNNVL) via Kubernetes ComputeDomain
  - CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED (attr=128) = 1 on all GPUs ✓ — ComputeDomain is correctly set up (verified with the query sketch after this list)
- No InfiniBand/EFA in pods → transfer engine auto-selects NVLink:
  ```
  Using cross-node NVLink transport (MC_FORCE_MNNVL or no HCA detected)
  ```
- sglang launch flags:
  ```
  --moe-a2a-backend mooncake --elastic-ep-backend mooncake --deepep-mode low_latency
  ```
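For reference, this is how fabric support was checked per GPU — a minimal sketch using the standard CUDA Driver API (requires CUDA 12.4+ headers for the attribute enum; not project code):

```cpp
#include <cuda.h>
#include <cstdio>

int main() {
    cuInit(0);
    int count = 0;
    cuDeviceGetCount(&count);
    for (int i = 0; i < count; ++i) {
        CUdevice dev;
        cuDeviceGet(&dev, i);
        int fabric = 0;
        // 1 means the device can allocate CU_MEM_HANDLE_TYPE_FABRIC memory.
        cuDeviceGetAttribute(&fabric,
                             CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED, dev);
        printf("GPU %d: CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED = %d\n",
               i, fabric);
    }
    return 0;
}
```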
Symptoms
Workers spam indefinitely (connection_poller.cpp:123):
```
Rank 4 got invalid buffer data from 12.
Rank 5 got invalid buffer data from 13.
...
```
Leader logs (nvlink_transport.cpp:497):
```
Memory region 0x15ba1e50 is not allocated by cuMemCreate, but it can be used as local buffer
Requested address 0x15ba2184 to 0x15ba2188 not found!
Requested address 0x1eb91128 to 0x1eb9112c not found!
```
waitUntilAllConnected() blocks forever → sglang never finishes initializing.
Root Cause
In mooncake-pg/src/connection_poller.cpp, ConnectionContext allocates warmup buffers from CPU heap:
```cpp
// connection_poller.cpp — ConnectionContext constructor
warmup_send_region_ = new int32_t[kMaxNumRanks];    // CPU heap!
warmup_recv_region_ = new int32_t[kMaxNumRanks]{};  // CPU heap!
engine_->registerLocalMemory(warmup_send_region_,
                             kMaxNumRanks * sizeof(int32_t), kWildcardLocation);
engine_->registerLocalMemory(warmup_recv_region_,
                             kMaxNumRanks * sizeof(int32_t), kWildcardLocation);
```
Their addresses are published in SegmentInfo via mooncake_backend.cpp:
```cpp
rank_info.warmup_buffer[0] = (uint64_t)connection_ctx_->warmup_send_region();
rank_info.warmup_buffer[1] = (uint64_t)connection_ctx_->warmup_recv_region();
```
During the warmup handshake (pollPeer, WAITING_WARMUP_TRANSFER state), the lower-ranked peer submits an NVLink WRITE to the remote rank's warmup_recv_region_:
```cpp
engine_->submitTransfer(batchID, {TransferRequest{
    .opcode = TransferRequest::WRITE,
    .source = warmup_send_region_,
    .target_id = meta_->segmentIDs[pollingRank],
    .target_offset = meta_->segmentInfos[pollingRank].warmup_buffer[1] + rank_ * sizeof(int32_t),
    .length = sizeof(int32_t),
}});
```
The MNNVL NVLink transport can only access cuMemCreate(CU_MEM_HANDLE_TYPE_FABRIC) GPU memory cross-node. CPU heap memory is registered as local-only by the transport, so any cross-node NVLink write to it fails with Requested address not found!.
The state machine then resets to WAITING_STORE and retries forever.
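For context, the transport's refusal matches what the CUDA virtual memory API reports for such pointers. A minimal diagnostic sketch (isFabricAccessible is a hypothetical helper, not a Mooncake API; the Driver API calls are real, CUDA 12.3+ for CU_MEM_HANDLE_TYPE_FABRIC):

```cpp
#include <cuda.h>

// Returns true iff addr is backed by a cuMemCreate() allocation that was
// requested with a FABRIC-shareable handle, which is the precondition for
// cross-node NVLink access. Plain CPU heap pointers fail the retain call,
// matching the "not allocated by cuMemCreate" log above.
bool isFabricAccessible(const void* addr) {
    CUmemGenericAllocationHandle handle;
    if (cuMemRetainAllocationHandle(&handle, const_cast<void*>(addr)) != CUDA_SUCCESS) {
        return false;  // not a cuMemCreate()/cuMemMap() mapping at all
    }
    CUmemAllocationProp prop{};
    cuMemGetAllocationPropertiesFromHandle(&prop, handle);
    cuMemRelease(handle);  // drop the reference taken by the retain
    return (prop.requestedHandleTypes & CU_MEM_HANDLE_TYPE_FABRIC) != 0;
}
```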
Proposed Fix
Option A — Skip cross-node warmup for MNNVL (simpler):
On a ComputeDomain cluster, NVLink connectivity between all ranks is guaranteed by the fabric infrastructure. The store key exchange alone is sufficient proof of peer reachability. The warmup write can be skipped.
```cpp
// In pollPeer, case WAITING_STORE, after successfully reading SegmentInfo:
auto segment_id = engine_->openSegment(peerServerName);
meta_->segmentIDs[pollingRank] = segment_id;
peerState.segmentId = segment_id;
memcpy(&meta_->segmentInfos[pollingRank], buffer_data.data(), sizeof(SegmentInfo));

if (supportFabricMem()) {
    // MNNVL: fabric connectivity guaranteed by ComputeDomain.
    // Skip warmup write — warmup_recv_region_ is CPU heap, not fabric-accessible.
    meta_->peerConnected[pollingRank] = true;
    global_peerConnected_[globalPollingRank] = true;
    peerState.state = PeerConnectionState::CONNECTED;
    totalConnectedPeers_.fetch_add(1, std::memory_order_release);
    if (isAllPeerConnected()) backend_wakeup_cv_.notify_all();
} else if (pollingRank <= rank_) {
    // Original IB/RoCE path: send warmup write
    ...
}
```
Option B — Allocate warmup buffers as FABRIC memory (more correct):
Allocate warmup_recv_region_ with cuMemCreate(CU_MEM_HANDLE_TYPE_FABRIC) + cuMemMap, export the fabric handle, share it alongside SegmentInfo in the store, and import/map it on the remote side before the warmup write. This is more involved but actually validates end-to-end NVLink connectivity.
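A minimal sketch of the allocation side, assuming hypothetical names (FabricWarmupRegion, allocFabricWarmupRegion) that don't exist in Mooncake today; error handling elided, CUDA 12.3+ required for fabric handles:

```cpp
#include <cuda.h>
#include <cstdint>

struct FabricWarmupRegion {
    CUdeviceptr ptr;                     // mapped VA of the warmup region
    CUmemGenericAllocationHandle handle; // owning allocation handle
    CUmemFabricHandle export_handle;     // published via the store next to SegmentInfo
    size_t size;                         // padded to allocation granularity
};

FabricWarmupRegion allocFabricWarmupRegion(int device, size_t num_ranks) {
    FabricWarmupRegion r{};
    CUmemAllocationProp prop{};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = device;
    prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_FABRIC;  // the key difference

    size_t gran = 0;
    cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    r.size = ((num_ranks * sizeof(int32_t)) + gran - 1) / gran * gran;

    cuMemCreate(&r.handle, r.size, &prop, 0);
    cuMemExportToShareableHandle(&r.export_handle, r.handle,
                                 CU_MEM_HANDLE_TYPE_FABRIC, 0);

    // Reserve a VA range, map the allocation into it, and grant RW access.
    cuMemAddressReserve(&r.ptr, r.size, gran, 0, 0);
    cuMemMap(r.ptr, r.size, /*offset=*/0, r.handle, 0);
    CUmemAccessDesc access{};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(r.ptr, r.size, &access, 1);
    return r;
}
```

The exported CUmemFabricHandle is a flat byte blob, so it can be serialized into the store entry alongside SegmentInfo; the peer then calls cuMemImportFromShareableHandle() and maps the region before issuing the warmup WRITE.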
We're happy to test a follow-up patch if helpful!