Fix PP is_fully_idle missing in-flight microbatches#27446
Conversation
Once the server reports Up, /health runs the generation-based health check (a real probe request through the scheduler), which cannot complete while the scheduler waits for scripts between RunScript commands; the readiness poll would then never get a response within its per-request timeout. /model_info is pure metadata and proves the same thing: the socket is bound and routes are registered.
In scripted mode the scheduler loop freezes between RunScript commands, so the launch_server warmup thread's /generate used to land in the scheduler queues at an arbitrary point during the first scripts as foreign traffic, and /health stayed 503 for the whole run because the warmup could never complete. The hook now runs the engine until the warmup request has been observed on the recv socket and the scheduler stays fully idle for two microbatch rotations, and only then sends HookReady. Scripts start with the server warmed up, Up, and quiesced, with no ambient traffic left.
There was a problem hiding this comment.
Code Review
This pull request improves pipeline parallelism (PP) handling in the scheduler and test harness. Specifically, it updates the scheduler's idle check to account for in-flight PP microbatches, ensures all microbatch slots are scanned when retrieving active requests under PP, and refactors the engine state reset logic to use the idle check. Additionally, a new test is added to verify cache flushing during in-flight chunk results. The reviewer feedback suggests using getattr with default fallbacks when accessing microbatch attributes (mbs, last_mbs, running_mbs) to prevent potential AttributeErrors during early initialization or in unit tests.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
dd8403c to
1c1ffd3
Compare
1c1ffd3 to
7efdb50
Compare
7efdb50 to
2e306d0
Compare
|
/rerun-test test/registered/unit/scripted_runtime/test_http_server.py test/registered/unit/scripted_runtime/test_scheduler_hook.py test/registered/unit/scripted_runtime/test_tokenizer_recv_proxy.py test/registered/scripted_runtime/test_scripted_runtime_core.py test/registered/chunked_prefill/test_scripted_core_1gpu.py test/registered/chunked_prefill/test_scripted_swa_1gpu.py test/registered/chunked_prefill/test_scripted_core_4gpu.py |
|
Results for 🚀 🚀 🚀 🚀 |
|
/tag-and-rerun-ci |
The script waits for the window where the queues and current microbatch slot bindings are clear while dispatched chunk batch results are still in flight, posts flush_cache into it, and asserts the request still finishes. Without the is_fully_idle fix this deterministically crashes the scheduler with "This request holds the node from another tree".
Under pipeline parallelism, is_fully_idle() only checked running_mbs and the current slot's cur_batch/last_batch bindings. A chunked-prefill request whose last chunk has been dispatched (chunked_req already cleared) but whose batch results are still in flight was invisible, so /flush_cache could reset the radix tree and memory pools mid-flight. Processing the in-flight chunk result then crashed the scheduler with "AssertionError: This request holds the node from another tree" in RadixCache.dec_lock_ref (seen as a flaky timeout of test/registered/chunked_prefill/test_scripted_core_4gpu.py in CI). Also check self.mbs so flush_cache and other destructive idle-gated operations wait for in-flight microbatch results.
- queries._get_all_reqs: under PP, scan all microbatch slots (mbs/last_mbs/running_mbs) instead of only the current slot bindings, so requests whose batch results are in flight stay visible and ScriptedReqHandle.finished does not flip to true spuriously through the _seen_rids fallback. - _reset_engine_state: drain on scheduler.is_fully_idle() (which now covers in-flight PP microbatches) and raise if the engine does not quiesce, instead of silently proceeding to flush_cache; drop the fixed PP pad.
2e306d0 to
c99b622
Compare
|
/tag-and-rerun-ci extra |
Motivation
PR 2 of 2 (based on #27445).
test/registered/chunked_prefill/test_scripted_core_4gpu.pyis flaky in CI (example failure on an unrelated PR): the PP0 scheduler crashes withvia
event_loop_pp -> _pp_process_batch_result -> process_batch_result_prefill -> maybe_cache_unfinished_req, right after aCache flushed successfully!log, and the whole test file then times out (1200s).Root cause:
is_fully_idle()cannot see in-flight PP microbatchesUnder pipeline parallelism, the scheduler keeps in-flight microbatches in
self.mbs[slot]; a batch launched at slotionly has its result processedpp_size + pp_async_batch_depth - 1iterations later.is_fully_idle()checked:running_batch/running_mbs— only contains requests after their extend result has been processed and merged for decode;chunked_req— cleared as soon as the last chunk is dispatched;cur_batch/last_batch— rebound every iteration to the current slot only;waiting_queue— the request left it when chunking started.So after the last prefill chunk of a request is dispatched but before its batch results are processed, every structure
is_fully_idle()looks at is empty while 1+ extend microbatches are still in flight. A/flush_cacheprocessed inside that window passes the idle gate and runstree_cache.reset()+req_to_token_pool.clear()+token_to_kv_pool_allocator.clear(). When the in-flight chunk result is processed afterwards,cache_unfinished_reqcallsdec_lock_ref(req.last_node)with a node from the pre-flush tree and hits the assertion above.This is reachable outside tests: any PP + chunked-prefill deployment that calls
/flush_cacheright after traffic stops (e.g. RL weight-update flows) can land in the window. The same blind spot also gateson_idle()housekeeping (pool-leak / tree invariant checks) and HiCache attach/detach.The scripted-runtime harness had the same blind spot (
_get_all_reqsonly scanned the current slot bindings, soScriptedReqHandle.finishedflipped to true spuriously while chunk results were in flight), and its inter-subtest reset drained on that partial view and silently proceeded toflush_cacheeven when the drain budget was exhausted — which is how CI landed the flush in exactly that window.Modifications
The commits are ordered test-first so the unfixed failure is reproducible from this PR's own history:
test_pp_flush_cache_during_inflight_chunk_results: the script waits for the exact race window (queues and current slot bindings clear,scheduler.mbsstill holding dispatched chunk batches) and postsflush_cacheinto it.Scheduler.is_fully_idle(): also require all in-flight PP microbatches (self.mbs) to be empty, soflush_cacheand other destructive idle-gated operations wait for pending microbatch results._get_all_reqsscans all PP microbatch slots;_reset_engine_statedrains onscheduler.is_fully_idle()and raises if the engine does not quiesce instead of silently flushing._pp_microbatches_drained()and drop redundant comments.Verification
Deterministic A/B on a 4-GPU machine (Qwen3-0.6B, tp=1 pp=4
pp_async_batch_depth=2,chunked_prefill_size=64), same command for both:cd test/registered/chunked_prefill python3 test_scripted_core_4gpu.py TestScriptedPpChunkSweep.test_pp_flush_cache_during_inflight_chunk_resultsA — test only, fix absent (checkout b565cc65d52a08eb15078d29a98c4eba9a212a3d, the commit right before the fix): the flush posted into the window is accepted by the pre-fix idle gate and the scheduler crashes with the exact CI signature; the harness then hangs until timeout:
B — with the fix (checkout c99b622a72e7a99935aa31b78a503cd9438a0844, this PR's HEAD): the same flush is refused purely by the new in-flight-microbatch check (both legacy counters are 0), the request completes, and the whole file passes:
Full scripted-runtime matrix on the chain (all green):
test_scripted_core_4gpu.py(sweep + new test, repeated x3),test_scripted_core_1gpu.py,test_scripted_swa_1gpu.py,test_scripted_runtime_core.py(42 tests),unit/scripted_runtime/test_http_server.py,unit/scripted_runtime/test_scheduler_hook.py,unit/scripted_runtime/test_tokenizer_recv_proxy.py.Accuracy Tests
N/A (scheduler idle-gating and test harness only).
Speed Tests and Profiling
N/A.
CI States
Latest PR Test (Base): 🚫 Run #27088850686
Latest PR Test (Extra): ✅ Run #27088850584