Fix PP is_fully_idle missing in-flight microbatches by fzyzcjy · Pull Request #27446 · sgl-project/sglang

fzyzcjy · 2026-06-06T13:52:44Z

Motivation

PR 2 of 2 (based on #27445). test/registered/chunked_prefill/test_scripted_core_4gpu.py is flaky in CI (example failure on an unrelated PR): the PP0 scheduler crashes with

File ".../sglang/srt/mem_cache/radix_cache.py", line 512, in cache_unfinished_req
    self.dec_lock_ref(req.last_node)
File ".../sglang/srt/mem_cache/radix_cache.py", line 594, in dec_lock_ref
    node is self.root_node
AssertionError: This request holds the node from another tree

via event_loop_pp -> _pp_process_batch_result -> process_batch_result_prefill -> maybe_cache_unfinished_req, right after a Cache flushed successfully! log, and the whole test file then times out (1200s).

Root cause: `is_fully_idle()` cannot see in-flight PP microbatches

Under pipeline parallelism, the scheduler keeps in-flight microbatches in self.mbs[slot]; a batch launched at slot i only has its result processed pp_size + pp_async_batch_depth - 1 iterations later. is_fully_idle() checked:

running_batch / running_mbs — only contains requests after their extend result has been processed and merged for decode;
chunked_req — cleared as soon as the last chunk is dispatched;
cur_batch / last_batch — rebound every iteration to the current slot only;
waiting_queue — the request left it when chunking started.

So after the last prefill chunk of a request is dispatched but before its batch results are processed, every structure is_fully_idle() looks at is empty while 1+ extend microbatches are still in flight. A /flush_cache processed inside that window passes the idle gate and runs tree_cache.reset() + req_to_token_pool.clear() + token_to_kv_pool_allocator.clear(). When the in-flight chunk result is processed afterwards, cache_unfinished_req calls dec_lock_ref(req.last_node) with a node from the pre-flush tree and hits the assertion above.

This is reachable outside tests: any PP + chunked-prefill deployment that calls /flush_cache right after traffic stops (e.g. RL weight-update flows) can land in the window. The same blind spot also gates on_idle() housekeeping (pool-leak / tree invariant checks) and HiCache attach/detach.

The scripted-runtime harness had the same blind spot (_get_all_reqs only scanned the current slot bindings, so ScriptedReqHandle.finished flipped to true spuriously while chunk results were in flight), and its inter-subtest reset drained on that partial view and silently proceeded to flush_cache even when the drain budget was exhausted — which is how CI landed the flush in exactly that window.

Modifications

The commits are ordered test-first so the unfixed failure is reproducible from this PR's own history:

ad55152a5dd09b95c560a87dbdcaca7d33892f6b / b565cc65d52a08eb15078d29a98c4eba9a212a3d — add test_pp_flush_cache_during_inflight_chunk_results: the script waits for the exact race window (queues and current slot bindings clear, scheduler.mbs still holding dispatched chunk batches) and posts flush_cache into it.
3fb6204e29827af3938542de9171771f01acb8d2 — Scheduler.is_fully_idle(): also require all in-flight PP microbatches (self.mbs) to be empty, so flush_cache and other destructive idle-gated operations wait for pending microbatch results.
b799d93cc0a13b22b6ed886323b6ebd273e2bc60 — scripted runtime harness: _get_all_reqs scans all PP microbatch slots; _reset_engine_state drains on scheduler.is_fully_idle() and raises if the engine does not quiesce instead of silently flushing.
0540cc7edd0577dc89c2d2270a16e3ba62d56861 / c99b622a72e7a99935aa31b78a503cd9438a0844 — extract _pp_microbatches_drained() and drop redundant comments.

Verification

Deterministic A/B on a 4-GPU machine (Qwen3-0.6B, tp=1 pp=4 pp_async_batch_depth=2, chunked_prefill_size=64), same command for both:

cd test/registered/chunked_prefill
python3 test_scripted_core_4gpu.py TestScriptedPpChunkSweep.test_pp_flush_cache_during_inflight_chunk_results

A — test only, fix absent (checkout b565cc65d52a08eb15078d29a98c4eba9a212a3d, the commit right before the fix): the flush posted into the window is accepted by the pre-fix idle gate and the scheduler crashes with the exact CI signature; the harness then hangs until timeout:

[PP0] Cache flushed successfully!            <- script's flush, accepted mid-flight (all 4 ranks)
...
AssertionError: This request holds the node from another tree
...
exit code 124 (timeout)

B — with the fix (checkout c99b622a72e7a99935aa31b78a503cd9438a0844, this PR's HEAD): the same flush is refused purely by the new in-flight-microbatch check (both legacy counters are 0), the request completes, and the whole file passes:

[PP0] Cache not flushed because there are pending requests. #queue-req: 0, #running-req: 0   (x4 ranks)
...
Ran 2 tests in 40.774s
OK

Full scripted-runtime matrix on the chain (all green): test_scripted_core_4gpu.py (sweep + new test, repeated x3), test_scripted_core_1gpu.py, test_scripted_swa_1gpu.py, test_scripted_runtime_core.py (42 tests), unit/scripted_runtime/test_http_server.py, unit/scripted_runtime/test_scheduler_hook.py, unit/scripted_runtime/test_tokenizer_recv_proxy.py.

Accuracy Tests

N/A (scheduler idle-gating and test harness only).

Speed Tests and Profiling

N/A.

CI States

Latest PR Test (Base): 🚫 Run #27088850686
Latest PR Test (Extra): ✅ Run #27088850584

Once the server reports Up, /health runs the generation-based health check (a real probe request through the scheduler), which cannot complete while the scheduler waits for scripts between RunScript commands; the readiness poll would then never get a response within its per-request timeout. /model_info is pure metadata and proves the same thing: the socket is bound and routes are registered.

In scripted mode the scheduler loop freezes between RunScript commands, so the launch_server warmup thread's /generate used to land in the scheduler queues at an arbitrary point during the first scripts as foreign traffic, and /health stayed 503 for the whole run because the warmup could never complete. The hook now runs the engine until the warmup request has been observed on the recv socket and the scheduler stays fully idle for two microbatch rotations, and only then sends HookReady. Scripts start with the server warmed up, Up, and quiesced, with no ambient traffic left.

gemini-code-assist

Code Review

This pull request improves pipeline parallelism (PP) handling in the scheduler and test harness. Specifically, it updates the scheduler's idle check to account for in-flight PP microbatches, ensures all microbatch slots are scanned when retrieving active requests under PP, and refactors the engine state reset logic to use the idle check. Additionally, a new test is added to verify cache flushing during in-flight chunk results. The reviewer feedback suggests using getattr with default fallbacks when accessing microbatch attributes (mbs, last_mbs, running_mbs) to prevent potential AttributeErrors during early initialization or in unit tests.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

fzyzcjy · 2026-06-06T14:20:57Z

/rerun-test test/registered/unit/scripted_runtime/test_http_server.py test/registered/unit/scripted_runtime/test_scheduler_hook.py test/registered/unit/scripted_runtime/test_tokenizer_recv_proxy.py test/registered/scripted_runtime/test_scripted_runtime_core.py test/registered/chunked_prefill/test_scripted_core_1gpu.py test/registered/chunked_prefill/test_scripted_swa_1gpu.py test/registered/chunked_prefill/test_scripted_core_4gpu.py

github-actions · 2026-06-06T14:21:18Z

Results for /rerun-test test/registered/unit/scripted_runtime/test_http_server.py test/registered/unit/scripted_runtime/test_scheduler_hook.py test/registered/unit/scripted_runtime/test_tokenizer_recv_proxy.py test/registered/scripted_runtime/test_scripted_runtime_core.py test/registered/chunked_prefill/test_scripted_core_1gpu.py test/registered/chunked_prefill/test_scripted_swa_1gpu.py test/registered/chunked_prefill/test_scripted_core_4gpu.py:

🚀 ubuntu-latest (3 tests): ✅ View workflow run

cd test/ && python3 registered/unit/scripted_runtime/test_http_server.py
cd test/ && python3 registered/unit/scripted_runtime/test_scheduler_hook.py
cd test/ && python3 registered/unit/scripted_runtime/test_tokenizer_recv_proxy.py

🚀 1-gpu-5090 (2 tests): ✅ View workflow run

cd test/ && python3 registered/scripted_runtime/test_scripted_runtime_core.py
cd test/ && python3 registered/chunked_prefill/test_scripted_core_1gpu.py

🚀 1-gpu-h100 (1 test): ✅ View workflow run

cd test/ && python3 registered/chunked_prefill/test_scripted_swa_1gpu.py

🚀 4-gpu-h100 (1 test): ✅ View workflow run

cd test/ && python3 registered/chunked_prefill/test_scripted_core_4gpu.py

fzyzcjy · 2026-06-06T14:37:41Z

/tag-and-rerun-ci

The script waits for the window where the queues and current microbatch slot bindings are clear while dispatched chunk batch results are still in flight, posts flush_cache into it, and asserts the request still finishes. Without the is_fully_idle fix this deterministically crashes the scheduler with "This request holds the node from another tree".

Under pipeline parallelism, is_fully_idle() only checked running_mbs and the current slot's cur_batch/last_batch bindings. A chunked-prefill request whose last chunk has been dispatched (chunked_req already cleared) but whose batch results are still in flight was invisible, so /flush_cache could reset the radix tree and memory pools mid-flight. Processing the in-flight chunk result then crashed the scheduler with "AssertionError: This request holds the node from another tree" in RadixCache.dec_lock_ref (seen as a flaky timeout of test/registered/chunked_prefill/test_scripted_core_4gpu.py in CI). Also check self.mbs so flush_cache and other destructive idle-gated operations wait for in-flight microbatch results.

- queries._get_all_reqs: under PP, scan all microbatch slots (mbs/last_mbs/running_mbs) instead of only the current slot bindings, so requests whose batch results are in flight stay visible and ScriptedReqHandle.finished does not flip to true spuriously through the _seen_rids fallback. - _reset_engine_state: drain on scheduler.is_fully_idle() (which now covers in-flight PP microbatches) and raise if the engine does not quiesce, instead of silently proceeding to flush_cache; drop the fixed PP pad.

fzyzcjy · 2026-06-06T14:49:01Z

/tag-and-rerun-ci extra

fzyzcjy added 2 commits June 6, 2026 21:46

fzyzcjy requested review from Ying1123, hnyls2002, merrymercy and xiezhq-hermann as code owners June 6, 2026 13:52

gemini-code-assist Bot reviewed Jun 6, 2026

View reviewed changes

Comment thread python/sglang/srt/managers/scheduler.py Outdated

Comment thread python/sglang/test/scripted_runtime/context/queries.py

Simplify warmup drive idle-streak tracking

41768cb

fzyzcjy force-pushed the tom/fix_pp_flush_inflight_mbs branch from dd8403c to 1c1ffd3 Compare June 6, 2026 13:58

Hoist server_args local in warmup drive

9d9df1e

fzyzcjy force-pushed the tom/fix_pp_flush_inflight_mbs branch from 1c1ffd3 to 7efdb50 Compare June 6, 2026 14:04

Log warmup drive start and completion

9f2dd4c

fzyzcjy force-pushed the tom/fix_pp_flush_inflight_mbs branch from 7efdb50 to 2e306d0 Compare June 6, 2026 14:20

github-actions Bot added the run-ci label Jun 6, 2026

fzyzcjy added 6 commits June 6, 2026 22:38

Group flush regression test with its script

b565cc6

Extract PP microbatch drain predicate in is_fully_idle

0540cc7

Drop redundant comments in PP drain predicate and request scan

c99b622

fzyzcjy force-pushed the tom/fix_pp_flush_inflight_mbs branch from 2e306d0 to c99b622 Compare June 6, 2026 14:39

github-actions Bot added the run-ci-extra label Jun 6, 2026

fzyzcjy mentioned this pull request Jun 7, 2026

Complete server warmup before scripted runtime scripts start #27445

Merged

Base automatically changed from tom/fix_pp_chunk_sweep_flush_race to main June 7, 2026 09:37

Merge branch 'main' into tom/fix_pp_flush_inflight_mbs

6a5e612

fzyzcjy merged commit 0a190d1 into main Jun 7, 2026
13 of 15 checks passed

fzyzcjy deleted the tom/fix_pp_flush_inflight_mbs branch June 7, 2026 09:38

fzyzcjy mentioned this pull request Jun 7, 2026

Sandbox: verify full main CI is green on latest main (do not merge) #25647

Closed

4 tasks

monkeyLoveding pushed a commit to monkeyLoveding/sglang_open that referenced this pull request Jun 9, 2026

Fix PP is_fully_idle missing in-flight microbatches (sgl-project#27446)

071fd56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix PP is_fully_idle missing in-flight microbatches#27446

Fix PP is_fully_idle missing in-flight microbatches#27446
fzyzcjy merged 12 commits into
mainfrom
tom/fix_pp_flush_inflight_mbs

fzyzcjy commented Jun 6, 2026 •

edited by github-actions Bot

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

fzyzcjy commented Jun 6, 2026

Uh oh!

github-actions Bot commented Jun 6, 2026 •

edited

Loading

Uh oh!

fzyzcjy commented Jun 6, 2026

Uh oh!

fzyzcjy commented Jun 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

fzyzcjy commented Jun 6, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Root cause: is_fully_idle() cannot see in-flight PP microbatches

Modifications

Verification

Accuracy Tests

Speed Tests and Profiling

CI States

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

fzyzcjy commented Jun 6, 2026

Uh oh!

github-actions Bot commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fzyzcjy commented Jun 6, 2026

Uh oh!

fzyzcjy commented Jun 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

fzyzcjy commented Jun 6, 2026 •

edited by github-actions Bot

Loading

Root cause: `is_fully_idle()` cannot see in-flight PP microbatches

github-actions Bot commented Jun 6, 2026 •

edited

Loading