Complete server warmup before scripted runtime scripts start#27445
Conversation
There was a problem hiding this comment.
Code Review
This pull request improves pipeline parallelism (PP) handling in the scripted runtime. It updates the scheduler's idle check to account for in-flight PP microbatches, ensures all microbatch slots are scanned when querying active requests under PP, disables server warmup during scripted tests to prevent race conditions, and adds a test case for flushing the cache during in-flight chunk results. Review feedback suggests making the is_fully_idle check more robust by using hasattr to verify the existence of mbs and running_mbs before accessing them, preventing potential AttributeErrors during early initialization.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| and (self.ps.pp_size == 1 or all(x.is_empty() for x in self.running_mbs)) | ||
| # In-flight PP microbatches (e.g. dispatched prefill chunks whose | ||
| # batch results are not processed yet) are not in running_mbs. | ||
| and ( | ||
| self.ps.pp_size == 1 | ||
| or all(mb is None or mb.is_empty() for mb in self.mbs) | ||
| ) |
There was a problem hiding this comment.
If is_fully_idle is called during early initialization or health checks before the event loop starts (which is when init_pp_loop_state is executed to initialize mbs and running_mbs), accessing self.mbs or self.running_mbs will raise an AttributeError when pp_size > 1.
To make this check robust and defensive, we should verify if these attributes exist using hasattr before accessing them.
and (
self.ps.pp_size == 1
or not hasattr(self, "running_mbs")
or all(x.is_empty() for x in self.running_mbs)
)
# In-flight PP microbatches (e.g. dispatched prefill chunks whose
# batch results are not processed yet) are not in running_mbs.
and (
self.ps.pp_size == 1
or not hasattr(self, "mbs")
or all(mb is None or mb.is_empty() for mb in self.mbs)
)There was a problem hiding this comment.
All is_fully_idle call sites (on_idle, the health-check gate in input processing, flush/attach/detach wrappers, weight updater) run inside the scheduler event loop, which calls init_pp_loop_state() first when pp_size > 1, so mbs/running_mbs are always initialized by then — same as the pre-existing running_mbs access on the line above. Keeping the access direct to avoid masking a real initialization-order bug with hasattr.
|
/rerun-test test/registered/chunked_prefill/test_scripted_core_4gpu.py |
|
Results for 🚀 |
Once the server reports Up, /health runs the generation-based health check (a real probe request through the scheduler), which cannot complete while the scheduler waits for scripts between RunScript commands; the readiness poll would then never get a response within its per-request timeout. /model_info is pure metadata and proves the same thing: the socket is bound and routes are registered.
In scripted mode the scheduler loop freezes between RunScript commands, so the launch_server warmup thread's /generate used to land in the scheduler queues at an arbitrary point during the first scripts as foreign traffic, and /health stayed 503 for the whole run because the warmup could never complete. The hook now runs the engine until the warmup request has been observed on the recv socket and the scheduler stays fully idle for two microbatch rotations, and only then sends HookReady. Scripts start with the server warmed up, Up, and quiesced, with no ambient traffic left.
74732b3 to
161d09c
Compare
|
/rerun-test test/registered/unit/scripted_runtime/test_http_server.py test/registered/unit/scripted_runtime/test_scheduler_hook.py test/registered/unit/scripted_runtime/test_tokenizer_recv_proxy.py test/registered/scripted_runtime/test_scripted_runtime_core.py test/registered/chunked_prefill/test_scripted_core_1gpu.py test/registered/chunked_prefill/test_scripted_swa_1gpu.py test/registered/chunked_prefill/test_scripted_core_4gpu.py |
|
Results for 🚀 🚀 🚀 🚀 |
CI status triage:
|
|
/rerun-failed-ci |
Classification: non-CUDA lane, unrelated to this PR — this PR only touches the scripted-runtime test harness ( |
Classification: non-CUDA lane (AMD mi325, stage-c), unrelated to this PR — this PR only touches the env-gated scripted-runtime test harness. Not chasing further per lane policy. |
|
All CUDA lanes remain green so far (118+ checks passed), including the 7 scripted-runtime test files this PR actually touches (each also passed a dedicated |
Classification: infra (H20 runner network/proxy, a chronically flaky lane). |
|
ci checked in #27446 |
Motivation
This is PR 1 of 2 extracted from the investigation of the flaky
test/registered/chunked_prefill/test_scripted_core_4gpu.py(example CI failure); PR 2 (based on this branch) fixes the actual scheduler bug.The scripted runtime promises scripts a deterministic, script-driven scheduler, but the server it launches still runs the standard
launch_serverwarmup thread. Since the scheduler loop freezes betweenRunScriptcommands, the warmup could never complete on its own; instead:/generate("The capital city of France is") landed in the scheduler queues at an arbitrary point during the first scripts as foreign traffic, perturbing queue contents, batch compositions and reset drains (one of the amplifiers behind the flaky CI failure above, and directly observed blocking a reproduction experiment withCache not flushed ... #queue-req: 1);GET /model_infoonce per second in the background;/healthstayed 503 (server_status == Starting) for the whole run.Modifications
scheduler_hook.py,tokenizer_recv_proxy.py): the hook now advances the real scheduler loop until the warmup request has been observed on the recv socket (the proxy counts inbound tokenized work requests) and the scheduler then stays fully idle for two microbatch rotations, and only then sendsHookReady. Scripts start with the server warmed up,Up, and quiesced, with no ambient traffic left. The two-rotation quiesce guard is needed becauseis_fully_idle()can transiently report idle while a PP microbatch result is still in flight (the scheduler-side bug fixed in PR 2). Honorsskip_server_warmup=Trueby skipping the drive. Bounded by a 120s wall-clock timeout that fails loudly./model_infoinstead of/healthfor readiness (http_server.py): with the server now genuinely reachingUp,/healthruns the generation-based health check (a real probe through the scheduler), which cannot complete while the scheduler waits for scripts between commands — the old readiness poll would hang./model_infois pure metadata and proves the same thing (socket bound, routes registered).A side benefit: the first KV-canary sweeps (triggered after each rank's first forward, ~100ms each, cascading across PP ranks) now also happen during the warmup drive instead of racing the first script.
Scope note: the warmup-completion detection targets the current scripted-runtime scope (no disaggregation, text models); PD-disagg/VLM warmups have different shapes and should be revisited if the scripted runtime grows there.
Accuracy Tests
N/A (test harness only).
Speed Tests and Profiling
N/A.
CI States
Latest PR Test (Base): ✅ Run #27064732059
Latest PR Test (Extra): ❌ Run #27065446238