Skip to content

Fix PP is_fully_idle missing in-flight microbatches#27446

Merged
fzyzcjy merged 12 commits into
mainfrom
tom/fix_pp_flush_inflight_mbs
Jun 7, 2026
Merged

Fix PP is_fully_idle missing in-flight microbatches#27446
fzyzcjy merged 12 commits into
mainfrom
tom/fix_pp_flush_inflight_mbs

Conversation

@fzyzcjy

@fzyzcjy fzyzcjy commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator

Motivation

PR 2 of 2 (based on #27445). test/registered/chunked_prefill/test_scripted_core_4gpu.py is flaky in CI (example failure on an unrelated PR): the PP0 scheduler crashes with

File ".../sglang/srt/mem_cache/radix_cache.py", line 512, in cache_unfinished_req
    self.dec_lock_ref(req.last_node)
File ".../sglang/srt/mem_cache/radix_cache.py", line 594, in dec_lock_ref
    node is self.root_node
AssertionError: This request holds the node from another tree

via event_loop_pp -> _pp_process_batch_result -> process_batch_result_prefill -> maybe_cache_unfinished_req, right after a Cache flushed successfully! log, and the whole test file then times out (1200s).

Root cause: is_fully_idle() cannot see in-flight PP microbatches

Under pipeline parallelism, the scheduler keeps in-flight microbatches in self.mbs[slot]; a batch launched at slot i only has its result processed pp_size + pp_async_batch_depth - 1 iterations later. is_fully_idle() checked:

  • running_batch / running_mbs — only contains requests after their extend result has been processed and merged for decode;
  • chunked_req — cleared as soon as the last chunk is dispatched;
  • cur_batch / last_batch — rebound every iteration to the current slot only;
  • waiting_queue — the request left it when chunking started.

So after the last prefill chunk of a request is dispatched but before its batch results are processed, every structure is_fully_idle() looks at is empty while 1+ extend microbatches are still in flight. A /flush_cache processed inside that window passes the idle gate and runs tree_cache.reset() + req_to_token_pool.clear() + token_to_kv_pool_allocator.clear(). When the in-flight chunk result is processed afterwards, cache_unfinished_req calls dec_lock_ref(req.last_node) with a node from the pre-flush tree and hits the assertion above.

This is reachable outside tests: any PP + chunked-prefill deployment that calls /flush_cache right after traffic stops (e.g. RL weight-update flows) can land in the window. The same blind spot also gates on_idle() housekeeping (pool-leak / tree invariant checks) and HiCache attach/detach.

The scripted-runtime harness had the same blind spot (_get_all_reqs only scanned the current slot bindings, so ScriptedReqHandle.finished flipped to true spuriously while chunk results were in flight), and its inter-subtest reset drained on that partial view and silently proceeded to flush_cache even when the drain budget was exhausted — which is how CI landed the flush in exactly that window.

Modifications

The commits are ordered test-first so the unfixed failure is reproducible from this PR's own history:

  1. ad55152a5dd09b95c560a87dbdcaca7d33892f6b / b565cc65d52a08eb15078d29a98c4eba9a212a3d — add test_pp_flush_cache_during_inflight_chunk_results: the script waits for the exact race window (queues and current slot bindings clear, scheduler.mbs still holding dispatched chunk batches) and posts flush_cache into it.
  2. 3fb6204e29827af3938542de9171771f01acb8d2 — Scheduler.is_fully_idle(): also require all in-flight PP microbatches (self.mbs) to be empty, so flush_cache and other destructive idle-gated operations wait for pending microbatch results.
  3. b799d93cc0a13b22b6ed886323b6ebd273e2bc60 — scripted runtime harness: _get_all_reqs scans all PP microbatch slots; _reset_engine_state drains on scheduler.is_fully_idle() and raises if the engine does not quiesce instead of silently flushing.
  4. 0540cc7edd0577dc89c2d2270a16e3ba62d56861 / c99b622a72e7a99935aa31b78a503cd9438a0844 — extract _pp_microbatches_drained() and drop redundant comments.

Verification

Deterministic A/B on a 4-GPU machine (Qwen3-0.6B, tp=1 pp=4 pp_async_batch_depth=2, chunked_prefill_size=64), same command for both:

cd test/registered/chunked_prefill
python3 test_scripted_core_4gpu.py TestScriptedPpChunkSweep.test_pp_flush_cache_during_inflight_chunk_results

A — test only, fix absent (checkout b565cc65d52a08eb15078d29a98c4eba9a212a3d, the commit right before the fix): the flush posted into the window is accepted by the pre-fix idle gate and the scheduler crashes with the exact CI signature; the harness then hangs until timeout:

[PP0] Cache flushed successfully!            <- script's flush, accepted mid-flight (all 4 ranks)
...
AssertionError: This request holds the node from another tree
...
exit code 124 (timeout)

B — with the fix (checkout c99b622a72e7a99935aa31b78a503cd9438a0844, this PR's HEAD): the same flush is refused purely by the new in-flight-microbatch check (both legacy counters are 0), the request completes, and the whole file passes:

[PP0] Cache not flushed because there are pending requests. #queue-req: 0, #running-req: 0   (x4 ranks)
...
Ran 2 tests in 40.774s
OK

Full scripted-runtime matrix on the chain (all green): test_scripted_core_4gpu.py (sweep + new test, repeated x3), test_scripted_core_1gpu.py, test_scripted_swa_1gpu.py, test_scripted_runtime_core.py (42 tests), unit/scripted_runtime/test_http_server.py, unit/scripted_runtime/test_scheduler_hook.py, unit/scripted_runtime/test_tokenizer_recv_proxy.py.

Accuracy Tests

N/A (scheduler idle-gating and test harness only).

Speed Tests and Profiling

N/A.


CI States

Latest PR Test (Base): 🚫 Run #27088850686
Latest PR Test (Extra): ✅ Run #27088850584

fzyzcjy added 2 commits June 6, 2026 21:46
Once the server reports Up, /health runs the generation-based health
check (a real probe request through the scheduler), which cannot
complete while the scheduler waits for scripts between RunScript
commands; the readiness poll would then never get a response within its
per-request timeout. /model_info is pure metadata and proves the same
thing: the socket is bound and routes are registered.
In scripted mode the scheduler loop freezes between RunScript commands,
so the launch_server warmup thread's /generate used to land in the
scheduler queues at an arbitrary point during the first scripts as
foreign traffic, and /health stayed 503 for the whole run because the
warmup could never complete.

The hook now runs the engine until the warmup request has been observed
on the recv socket and the scheduler stays fully idle for two microbatch
rotations, and only then sends HookReady. Scripts start with the server
warmed up, Up, and quiesced, with no ambient traffic left.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request improves pipeline parallelism (PP) handling in the scheduler and test harness. Specifically, it updates the scheduler's idle check to account for in-flight PP microbatches, ensures all microbatch slots are scanned when retrieving active requests under PP, and refactors the engine state reset logic to use the idle check. Additionally, a new test is added to verify cache flushing during in-flight chunk results. The reviewer feedback suggests using getattr with default fallbacks when accessing microbatch attributes (mbs, last_mbs, running_mbs) to prevent potential AttributeErrors during early initialization or in unit tests.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread python/sglang/srt/managers/scheduler.py Outdated
Comment thread python/sglang/test/scripted_runtime/context/queries.py
@fzyzcjy fzyzcjy force-pushed the tom/fix_pp_flush_inflight_mbs branch from dd8403c to 1c1ffd3 Compare June 6, 2026 13:58
@fzyzcjy fzyzcjy force-pushed the tom/fix_pp_flush_inflight_mbs branch from 1c1ffd3 to 7efdb50 Compare June 6, 2026 14:04
@fzyzcjy fzyzcjy force-pushed the tom/fix_pp_flush_inflight_mbs branch from 7efdb50 to 2e306d0 Compare June 6, 2026 14:20
@fzyzcjy

fzyzcjy commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator Author

/rerun-test test/registered/unit/scripted_runtime/test_http_server.py test/registered/unit/scripted_runtime/test_scheduler_hook.py test/registered/unit/scripted_runtime/test_tokenizer_recv_proxy.py test/registered/scripted_runtime/test_scripted_runtime_core.py test/registered/chunked_prefill/test_scripted_core_1gpu.py test/registered/chunked_prefill/test_scripted_swa_1gpu.py test/registered/chunked_prefill/test_scripted_core_4gpu.py

@github-actions

github-actions Bot commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

Results for /rerun-test test/registered/unit/scripted_runtime/test_http_server.py test/registered/unit/scripted_runtime/test_scheduler_hook.py test/registered/unit/scripted_runtime/test_tokenizer_recv_proxy.py test/registered/scripted_runtime/test_scripted_runtime_core.py test/registered/chunked_prefill/test_scripted_core_1gpu.py test/registered/chunked_prefill/test_scripted_swa_1gpu.py test/registered/chunked_prefill/test_scripted_core_4gpu.py:

🚀 ubuntu-latest (3 tests): ✅ View workflow run

cd test/ && python3 registered/unit/scripted_runtime/test_http_server.py
cd test/ && python3 registered/unit/scripted_runtime/test_scheduler_hook.py
cd test/ && python3 registered/unit/scripted_runtime/test_tokenizer_recv_proxy.py

🚀 1-gpu-5090 (2 tests): ✅ View workflow run

cd test/ && python3 registered/scripted_runtime/test_scripted_runtime_core.py
cd test/ && python3 registered/chunked_prefill/test_scripted_core_1gpu.py

🚀 1-gpu-h100 (1 test): ✅ View workflow run

cd test/ && python3 registered/chunked_prefill/test_scripted_swa_1gpu.py

🚀 4-gpu-h100 (1 test): ✅ View workflow run

cd test/ && python3 registered/chunked_prefill/test_scripted_core_4gpu.py

@fzyzcjy

fzyzcjy commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator Author

/tag-and-rerun-ci

@github-actions github-actions Bot added the run-ci label Jun 6, 2026
fzyzcjy added 6 commits June 6, 2026 22:38
The script waits for the window where the queues and current microbatch
slot bindings are clear while dispatched chunk batch results are still
in flight, posts flush_cache into it, and asserts the request still
finishes. Without the is_fully_idle fix this deterministically crashes
the scheduler with "This request holds the node from another tree".
Under pipeline parallelism, is_fully_idle() only checked running_mbs and
the current slot's cur_batch/last_batch bindings. A chunked-prefill
request whose last chunk has been dispatched (chunked_req already
cleared) but whose batch results are still in flight was invisible, so
/flush_cache could reset the radix tree and memory pools mid-flight.
Processing the in-flight chunk result then crashed the scheduler with
"AssertionError: This request holds the node from another tree" in
RadixCache.dec_lock_ref (seen as a flaky timeout of
test/registered/chunked_prefill/test_scripted_core_4gpu.py in CI).

Also check self.mbs so flush_cache and other destructive idle-gated
operations wait for in-flight microbatch results.
- queries._get_all_reqs: under PP, scan all microbatch slots
  (mbs/last_mbs/running_mbs) instead of only the current slot bindings,
  so requests whose batch results are in flight stay visible and
  ScriptedReqHandle.finished does not flip to true spuriously through
  the _seen_rids fallback.
- _reset_engine_state: drain on scheduler.is_fully_idle() (which now
  covers in-flight PP microbatches) and raise if the engine does not
  quiesce, instead of silently proceeding to flush_cache; drop the
  fixed PP pad.
@fzyzcjy fzyzcjy force-pushed the tom/fix_pp_flush_inflight_mbs branch from 2e306d0 to c99b622 Compare June 6, 2026 14:39
@fzyzcjy

fzyzcjy commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator Author

/tag-and-rerun-ci extra

Base automatically changed from tom/fix_pp_chunk_sweep_flush_race to main June 7, 2026 09:37
@fzyzcjy fzyzcjy merged commit 0a190d1 into main Jun 7, 2026
13 of 15 checks passed
@fzyzcjy fzyzcjy deleted the tom/fix_pp_flush_inflight_mbs branch June 7, 2026 09:38
monkeyLoveding pushed a commit to monkeyLoveding/sglang_open that referenced this pull request Jun 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant