
[Bug] Mooncake Disaggregation + PP: Prefill Bootstrap Timeout causes AssertionError crash in pop_bootstrapped due to Decode KV Cache Saturation #20485

@Sispheqgj


Checklist

  • I searched related issues but found no solution.
  • The bug persists in the latest version.
  • Please use English.

Describe the bug

In a PD Disaggregation + Pipeline Parallel (PP) cluster using the Mooncake KV transfer backend (DeepSeek-R1 model), when Decode nodes reach high KV cache and pre-allocated memory occupancy, the system experiences:

  1. Prefill nodes stuck in KVPoll.Bootstrapping for 600s → bootstrap timeout → KVTransferError
  2. A crash triggered by an AssertionError in pop_bootstrapped of PrefillBootstrapQueue

The root cause is a race condition between PP ranks when bootstrap times out:

  • PP rank N (e.g., DP3/TP3) reaches its 600s timeout → marks the request as KVPoll.Failed
  • PP rank N+1 (e.g., DP5/TP5) receives the rid in consensus_bootstrapped_rids from rank N, but its own local poll still returns KVPoll.Bootstrapping (due to clock skew / slightly different init_time)
  • When rids_to_check is set, pop_bootstrapped does not apply the if poll == KVPoll.Bootstrapping: continue guard to that rid, so it reaches the assertion with poll still equal to KVPoll.Bootstrapping
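
The timing side of this race can be illustrated with a small model. This is only a sketch: KVPoll here is a stand-in enum, local_poll is a hypothetical helper that models one rank's timeout check, and the 50 ms skew between the two init_time values is an assumed number, not a measurement.

```python
from enum import Enum


class KVPoll(Enum):
    # Stand-in for sglang's KVPoll; only the states named in this issue.
    Bootstrapping = "Bootstrapping"
    WaitingForInput = "WaitingForInput"
    Failed = "Failed"


BOOTSTRAP_TIMEOUT_S = 600.0  # same value as SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600


def local_poll(init_time: float, now: float) -> KVPoll:
    """Hypothetical model of one rank's poll while the Decode side never responds."""
    if now - init_time >= BOOTSTRAP_TIMEOUT_S:
        return KVPoll.Failed
    return KVPoll.Bootstrapping


# Assumed skew: rank N+1 started tracking the request 50 ms after rank N.
rank_n_init, rank_n1_init = 0.0, 0.05
now = 600.0

rank_n_state = local_poll(rank_n_init, now)
rank_n1_state = local_poll(rank_n1_init, now)
print(rank_n_state.name, rank_n1_state.name)  # Failed Bootstrapping
```

For a short window around the timeout, the two ranks disagree about the same request, which is the window in which the assertion can fire.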

Error Logs

[DP3 TP3 EP3 PP1] Some requests timed out when bootstrapping...
  If a greater mean TTFT is acceptable, you can 'export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600'
[DP3 TP3 EP3 PP1] Prefill bootstrap failed for request rank=3 req.rid='28a38646...' 
  with exception KVTransferError(bootstrap_room=...): Request ... timed out after 600.0s in KVPoll.Bootstrapping

[DP5 TP5 EP5 PP1] Scheduler hit an exception:
  File "sglang/srt/managers/scheduler.py", line 2668, in run_scheduler_process
    scheduler.event_loop_pp_disagg_prefill()
  File "sglang/srt/disaggregation/prefill.py", line 230, in pop_bootstrapped
    assert poll == KVPoll.WaitingForInput or poll == KVPoll.Failed
AssertionError

Decode node metrics at time of failure (KV saturation evidence):

[DP0] token usage: 0.84, pre-allocated usage: 0.42
[DP3] token usage: 0.81, pre-allocated usage: 0.52
[DP6] token usage: 0.87, pre-allocated usage: 0.72  ← highest saturation

Root Cause Analysis

Step 1: Decode node KV Cache becomes saturated (token usage 0.81–0.87 and pre-allocated usage 0.42–0.72 in the metrics above). The Decode node cannot allocate new KV indices for incoming requests, which delays execution of MooncakeKVReceiver.init().

Step 2: Without init() being called, TransferInfo is never sent to the Prefill Bootstrap Server. The Prefill MooncakeKVSender.poll() stays in KVPoll.Bootstrapping indefinitely.

Step 3: After 600s, MooncakeKVSender.poll() times out and returns KVPoll.Failed. PP rank 0 marks the request as failed and includes the rid in consensus_bootstrapped_rids sent to rank 1.

Step 4: PP rank 1 receives the rid in rids_to_check, but its own local poll() still returns KVPoll.Bootstrapping because its init_time is slightly later, so its own 600s timeout has not yet elapsed. In pop_bootstrapped, because the rid is in rids_to_check, execution does not go through the if poll == KVPoll.Bootstrapping: continue guard and instead reaches the assertion:

# python/sglang/srt/disaggregation/prefill.py
assert poll == KVPoll.WaitingForInput or poll == KVPoll.Failed  # poll is actually KVPoll.Bootstrapping → CRASH
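
The failure in Step 4 can be reproduced in isolation with a toy check. This is a sketch under assumptions, not the real sglang code: KVPoll is a stand-in enum, check_consensus_rid and crashes are hypothetical helpers, and "req-a" is a made-up request id. It only shows that asserting on a consensus rid without first filtering KVPoll.Bootstrapping fails on a rank whose local state lags.

```python
from enum import Enum


class KVPoll(Enum):
    # Stand-in for sglang's KVPoll; only the states named in this issue.
    Bootstrapping = "Bootstrapping"
    WaitingForInput = "WaitingForInput"
    Failed = "Failed"


def check_consensus_rid(rid, poll, rids_to_check):
    """Toy version of the check described in Step 4 (not the real sglang code)."""
    if rids_to_check is not None and rid in rids_to_check:
        # The Bootstrapping guard is not applied on this path, so a rank
        # whose local poll lags behind fails here.
        assert poll == KVPoll.WaitingForInput or poll == KVPoll.Failed
    return poll


def crashes(rid, poll, rids_to_check):
    """Return True if the toy check raises AssertionError."""
    try:
        check_consensus_rid(rid, poll, rids_to_check)
    except AssertionError:
        return True
    return False


rid = "req-a"  # hypothetical request id
lagging_rank_crashes = crashes(rid, KVPoll.Bootstrapping, {rid})
in_sync_rank_crashes = crashes(rid, KVPoll.Failed, {rid})
print(lagging_rank_crashes, in_sync_rank_crashes)  # True False
```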

Expected Behavior

  • PP ranks should tolerate small state-propagation delays between each other. A request whose rid is in rids_to_check but whose local poll still returns KVPoll.Bootstrapping should be treated as "still waiting" rather than crashing the Scheduler.
  • The system should degrade gracefully (abort the request with a soft error) rather than crashing the entire Scheduler process.

Proposed Fix

Defensive fix in pop_bootstrapped (avoids crash while preserving correctness):

# python/sglang/srt/disaggregation/prefill.py, in pop_bootstrapped()

for i, (req, poll) in enumerate(zip(self.queue, polls)):
    # When a consensus list is given, only consider requests named in it.
    if rids_to_check is not None and req.rid not in rids_to_check:
        continue

    if poll == KVPoll.Bootstrapping:
        # PP rank state has not yet propagated; skip this round safely
        # instead of asserting (which crashes the Scheduler)
        continue
    elif poll == KVPoll.Failed:
        ...
        continue

    assert poll == KVPoll.WaitingForInput or poll == KVPoll.Failed
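
To show the intended effect of the guard ordering, here is a self-contained toy version of the loop run over two scheduling rounds. It is a sketch only: KVPoll is a stand-in enum, pop_bootstrapped_tolerant is a hypothetical function that works on plain rid strings rather than real request objects, and the request ids are made up.

```python
from enum import Enum


class KVPoll(Enum):
    # Stand-in for sglang's KVPoll; only the states named in this issue.
    Bootstrapping = "Bootstrapping"
    WaitingForInput = "WaitingForInput"
    Failed = "Failed"


def pop_bootstrapped_tolerant(queue, polls, rids_to_check=None):
    """Toy loop: returns (ready, failed, still_queued) lists of rids."""
    ready, failed, still_queued = [], [], []
    for rid, poll in zip(queue, polls):
        if rids_to_check is not None and rid not in rids_to_check:
            still_queued.append(rid)
            continue
        if poll == KVPoll.Bootstrapping:
            still_queued.append(rid)  # not propagated yet; retry next round
            continue
        if poll == KVPoll.Failed:
            failed.append(rid)
            continue
        assert poll == KVPoll.WaitingForInput
        ready.append(rid)
    return ready, failed, still_queued


queue = ["req-a", "req-b"]  # hypothetical request ids

# Round 1: the consensus list names req-a, but this rank still sees Bootstrapping.
round1 = pop_bootstrapped_tolerant(
    queue,
    [KVPoll.Bootstrapping, KVPoll.WaitingForInput],
    rids_to_check={"req-a", "req-b"},
)
# Round 2: the local timeout has fired, so req-a is now Failed.
round2 = pop_bootstrapped_tolerant(round1[2], [KVPoll.Failed], rids_to_check={"req-a"})
print(round1)  # (['req-b'], [], ['req-a'])
print(round2)  # ([], ['req-a'], [])
```

The toy does not model how the consensus list is rebuilt on the next round, so whether deferring a request on one rank keeps all PP ranks consistent still needs to be checked against the real scheduler loop.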

Longer-term fix: Add backpressure on Decode pre-alloc. When pre_allocated_usage exceeds a threshold, stop accepting new bootstrap requests rather than silently stalling for up to 600s.
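
A minimal sketch of such an admission check, assuming a simple fixed threshold: accept_bootstrap and PRE_ALLOC_THRESHOLD are hypothetical names, and 0.6 is an illustrative cutoff, not an existing sglang setting. The usage values are the pre-allocated figures from the Decode metrics in this report.

```python
PRE_ALLOC_THRESHOLD = 0.6  # hypothetical cutoff, not an sglang setting


def accept_bootstrap(pre_allocated_usage: float, threshold: float = PRE_ALLOC_THRESHOLD) -> bool:
    """Return True if a Decode rank should accept a new bootstrap request."""
    return pre_allocated_usage < threshold


# Pre-allocated usage values taken from the Decode metrics in this report.
decode_usage = {"DP0": 0.42, "DP3": 0.52, "DP6": 0.72}
decisions = {rank: accept_bootstrap(usage) for rank, usage in decode_usage.items()}
print(decisions)  # {'DP0': True, 'DP3': True, 'DP6': False}
```

With this cutoff only DP6 would refuse new bootstrap requests, so the Prefill side could get an immediate rejection to retry or reroute instead of waiting out the 600s timeout.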


Environment

  • sglang commit: e29305c120a9830538e52dac9faf3e584b675be8
  • Transfer backend: Mooncake
  • Model: DeepSeek-R1-0528
  • Parallelism: DP=8, TP=8, EP=8, PP=2
  • SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
