Checklist
Describe the bug
In a PD Disaggregation + Pipeline Parallel (PP) cluster using the Mooncake KV transfer backend (DeepSeek-R1 model), when Decode nodes reach high KV cache and pre-allocated memory occupancy, the system experiences:
- Prefill nodes stuck in `KVPoll.Bootstrapping` for 600s → bootstrap timeout → `KVTransferError`
- A crash triggered by an `AssertionError` in `pop_bootstrapped` of `PrefillBootstrapQueue`
The root cause is a race condition between PP ranks when bootstrap times out:
- PP rank N (e.g., DP3/TP3) reaches its 600s timeout → marks the request as `KVPoll.Failed`
- PP rank N+1 (e.g., DP5/TP5) receives the rid in `consensus_bootstrapped_rids` from rank N, but its own local poll still returns `KVPoll.Bootstrapping` (due to clock skew / a slightly different `init_time`)
- `pop_bootstrapped` with `rids_to_check` set skips the `if poll == KVPoll.Bootstrapping: continue` guard and hits the assertion
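The inter-rank disagreement can be illustrated with a toy model of the per-rank timeout check (hypothetical simplified names; the real state machine lives in `MooncakeKVSender.poll()`):

```python
import enum

class KVPoll(enum.Enum):
    Bootstrapping = 0
    WaitingForInput = 1
    Failed = 2

# Stand-in for SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT
BOOTSTRAP_TIMEOUT = 600.0

def local_poll(now: float, init_time: float) -> KVPoll:
    # Each PP rank judges the timeout against its *own* init_time, so ranks
    # whose init_time differs by a fraction of a second can disagree about
    # whether the same request has failed yet.
    if now - init_time >= BOOTSTRAP_TIMEOUT:
        return KVPoll.Failed
    return KVPoll.Bootstrapping

now = 1600.0
rank_n = local_poll(now, init_time=999.8)           # 600.2s elapsed -> Failed
rank_n_plus_1 = local_poll(now, init_time=1000.3)   # 599.7s elapsed -> Bootstrapping
```

Rank N then reports the rid as resolved while rank N+1 still sees it in flight, which is exactly the window the assertion falls into.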
Error Logs
```
[DP3 TP3 EP3 PP1] Some requests timed out when bootstrapping...
If a greater mean TTFT is acceptable, you can 'export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600'
[DP3 TP3 EP3 PP1] Prefill bootstrap failed for request rank=3 req.rid='28a38646...'
with exception KVTransferError(bootstrap_room=...): Request ... timed out after 600.0s in KVPoll.Bootstrapping
[DP5 TP5 EP5 PP1] Scheduler hit an exception:
  File "sglang/srt/managers/scheduler.py", line 2668, in run_scheduler_process
    scheduler.event_loop_pp_disagg_prefill()
  File "sglang/srt/disaggregation/prefill.py", line 230, in pop_bootstrapped
    assert poll == KVPoll.WaitingForInput or poll == KVPoll.Failed
AssertionError
```
Decode node metrics at time of failure (KV saturation evidence):
```
[DP0] token usage: 0.84, pre-allocated usage: 0.42
[DP3] token usage: 0.81, pre-allocated usage: 0.52
[DP6] token usage: 0.87, pre-allocated usage: 0.72  ← highest saturation
```
Root Cause Analysis
Step 1: The Decode node's KV cache becomes saturated (`token_usage` ≈ 0.8+, `pre_allocated_usage` ≈ 0.5–0.7). The Decode node cannot allocate new KV indices for incoming requests, delaying execution of `MooncakeKVReceiver.init()`.
Step 2: Without `init()` being called, `TransferInfo` is never sent to the Prefill Bootstrap Server, so the Prefill `MooncakeKVSender.poll()` stays in `KVPoll.Bootstrapping` indefinitely.
Step 3: After 600s, `MooncakeKVSender.poll()` times out and returns `KVPoll.Failed`. PP rank 0 marks the request as failed and includes the rid in the `consensus_bootstrapped_rids` sent to rank 1.
Step 4: PP rank 1 receives the rid in `rids_to_check`, but its own local `poll()` still returns `KVPoll.Bootstrapping` (its `init_time` is slightly later). In `pop_bootstrapped`, because the rid is in `rids_to_check`, execution skips the `if poll == KVPoll.Bootstrapping: continue` guard and reaches the assertion:
```python
# python/sglang/srt/disaggregation/prefill.py
assert poll == KVPoll.WaitingForInput or poll == KVPoll.Failed  # poll is actually KVPoll.Bootstrapping → CRASH
```
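The crash path can be reproduced with a stripped-down model of the current `pop_bootstrapped` control flow (simplified, hypothetical stand-ins for the real classes, not the actual SGLang implementation):

```python
import enum

class KVPoll(enum.Enum):
    Bootstrapping = 0
    WaitingForInput = 1
    Failed = 2

class Req:
    def __init__(self, rid: str):
        self.rid = rid

def pop_bootstrapped_unpatched(queue, polls, rids_to_check=None):
    done = []
    for req, poll in zip(queue, polls):
        if rids_to_check is not None:
            if req.rid not in rids_to_check:
                continue
            # Note: no Bootstrapping guard on this branch.
        else:
            if poll == KVPoll.Bootstrapping:
                continue
        assert poll in (KVPoll.WaitingForInput, KVPoll.Failed)
        done.append(req)
    return done

# Rank N+1's view: rank N already declared the rid resolved, but the local
# poll still says Bootstrapping -> AssertionError.
try:
    pop_bootstrapped_unpatched(
        [Req("28a38646")], [KVPoll.Bootstrapping], rids_to_check={"28a38646"}
    )
except AssertionError:
    print("crash reproduced")
```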
Expected Behavior
- PP ranks should tolerate small state-propagation delays between each other. A `KVPoll.Bootstrapping` state on a rank whose rid is in `rids_to_check` should be treated as "still waiting", not a crash.
- The system should degrade gracefully (abort the request with a soft error) rather than crashing the entire Scheduler process.
Proposed Fix
Defensive fix in pop_bootstrapped (avoids crash while preserving correctness):
```python
# python/sglang/srt/disaggregation/prefill.py, in pop_bootstrapped()
for i, (req, poll) in enumerate(zip(self.queue, polls)):
    if rids_to_check is not None:
        if req.rid not in rids_to_check:
            continue
        if poll == KVPoll.Bootstrapping:
            # PP rank state has not yet propagated; skip this round safely
            # instead of asserting (which crashes the Scheduler)
            continue
        elif poll == KVPoll.Failed:
            ...
            continue
    assert poll == KVPoll.WaitingForInput or poll == KVPoll.Failed
```
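With the `Bootstrapping` guard in place, a stripped-down sketch of the same loop (hypothetical simplified names, not the actual SGLang code) defers the request to the next poll round instead of crashing:

```python
import enum

class KVPoll(enum.Enum):
    Bootstrapping = 0
    WaitingForInput = 1
    Failed = 2

class Req:
    def __init__(self, rid: str):
        self.rid = rid

def pop_bootstrapped_patched(queue, polls, rids_to_check=None):
    done = []
    for req, poll in zip(queue, polls):
        if rids_to_check is not None and req.rid not in rids_to_check:
            continue
        if poll == KVPoll.Bootstrapping:
            # Upstream PP rank state has not propagated yet: leave the
            # request in the queue for a later round instead of asserting.
            continue
        assert poll in (KVPoll.WaitingForInput, KVPoll.Failed)
        done.append(req)
    return done

# The request that previously crashed the Scheduler is now just deferred.
deferred = pop_bootstrapped_patched(
    [Req("28a38646")], [KVPoll.Bootstrapping], rids_to_check={"28a38646"}
)
assert deferred == []
```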
Longer-term fix: Add backpressure on Decode pre-alloc. When pre_allocated_usage exceeds a threshold, stop accepting new bootstrap requests rather than silently stalling for up to 600s.
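A minimal sketch of that admission check, assuming a hypothetical threshold and helper name (the real decision point would live in the Decode scheduler's request-admission path):

```python
# Hypothetical tunable; the failure logs above showed pre-allocated usage
# of 0.42-0.72 at crash time.
PREALLOC_BACKPRESSURE_THRESHOLD = 0.5

def should_accept_bootstrap(pre_allocated_usage: float,
                            threshold: float = PREALLOC_BACKPRESSURE_THRESHOLD) -> bool:
    # Reject new bootstrap requests early and visibly, instead of letting
    # them stall in KVPoll.Bootstrapping for the full 600s timeout.
    return pre_allocated_usage < threshold
```

Under this sketch, DP0 (0.42) would still admit new bootstraps while DP6 (0.72) would shed them immediately.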
Environment
- sglang commit: `e29305c120a9830538e52dac9faf3e584b675be8`
- Transfer backend: Mooncake
- Model: DeepSeek-R1-0528
- Parallelism: DP=8, TP=8, EP=8, PP=2
- `SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600`
Related Code
- `python/sglang/srt/disaggregation/prefill.py#L230` — assertion crash site
- `python/sglang/srt/disaggregation/mooncake/conn.py#L1178` — `MooncakeKVSender.poll()` timeout logic
- `python/sglang/srt/managers/scheduler_pp_mixin.py#L145` — `event_loop_pp_disagg_prefill`