Complete server warmup before scripted runtime scripts start by fzyzcjy · Pull Request #27445 · sgl-project/sglang

fzyzcjy · 2026-06-06T12:25:23Z

Motivation

This is PR 1 of 2 extracted from the investigation of the flaky test/registered/chunked_prefill/test_scripted_core_4gpu.py (example CI failure); PR 2 (based on this branch) fixes the actual scheduler bug.

The scripted runtime promises scripts a deterministic, script-driven scheduler, but the server it launches still runs the standard launch_server warmup thread. Since the scheduler loop freezes between RunScript commands, the warmup could never complete on its own; instead:

the warmup thread's /generate ("The capital city of France is") landed in the scheduler queues at an arbitrary point during the first scripts as foreign traffic, perturbing queue contents, batch compositions and reset drains (one of the amplifiers behind the flaky CI failure above, and directly observed blocking a reproduction experiment with Cache not flushed ... #queue-req: 1);
the warmup thread kept polling GET /model_info once per second in the background;
/health stayed 503 (server_status == Starting) for the whole run.

Modifications

Drive the engine through warmup before accepting scripts (scheduler_hook.py, tokenizer_recv_proxy.py): the hook now advances the real scheduler loop until the warmup request has been observed on the recv socket (the proxy counts inbound tokenized work requests) and the scheduler then stays fully idle for two microbatch rotations, and only then sends HookReady. Scripts start with the server warmed up, Up, and quiesced, with no ambient traffic left. The two-rotation quiesce guard is needed because is_fully_idle() can transiently report idle while a PP microbatch result is still in flight (the scheduler-side bug fixed in PR 2). Honors skip_server_warmup=True by skipping the drive. Bounded by a 120s wall-clock timeout that fails loudly.
Poll /model_info instead of /health for readiness (http_server.py): with the server now genuinely reaching Up, /health runs the generation-based health check (a real probe through the scheduler), which cannot complete while the scheduler waits for scripts between commands — the old readiness poll would hang. /model_info is pure metadata and proves the same thing (socket bound, routes registered).

A side benefit: the first KV-canary sweeps (triggered after each rank's first forward, ~100ms each, cascading across PP ranks) now also happen during the warmup drive instead of racing the first script.

Scope note: the warmup-completion detection targets the current scripted-runtime scope (no disaggregation, text models); PD-disagg/VLM warmups have different shapes and should be revisited if the scripted runtime grows there.

Accuracy Tests

N/A (test harness only).

Speed Tests and Profiling

N/A.

CI States

Latest PR Test (Base): ✅ Run #27064732059
Latest PR Test (Extra): ❌ Run #27065446238

gemini-code-assist

Code Review

This pull request improves pipeline parallelism (PP) handling in the scripted runtime. It updates the scheduler's idle check to account for in-flight PP microbatches, ensures all microbatch slots are scanned when querying active requests under PP, disables server warmup during scripted tests to prevent race conditions, and adds a test case for flushing the cache during in-flight chunk results. Review feedback suggests making the is_fully_idle check more robust by using hasattr to verify the existence of mbs and running_mbs before accessing them, preventing potential AttributeErrors during early initialization.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-06T12:26:27Z

            and (self.ps.pp_size == 1 or all(x.is_empty() for x in self.running_mbs))
+            # In-flight PP microbatches (e.g. dispatched prefill chunks whose
+            # batch results are not processed yet) are not in running_mbs.
+            and (
+                self.ps.pp_size == 1
+                or all(mb is None or mb.is_empty() for mb in self.mbs)
+            )


If is_fully_idle is called during early initialization or health checks before the event loop starts (which is when init_pp_loop_state is executed to initialize mbs and running_mbs), accessing self.mbs or self.running_mbs will raise an AttributeError when pp_size > 1.

To make this check robust and defensive, we should verify if these attributes exist using hasattr before accessing them.

and ( self.ps.pp_size == 1 or not hasattr(self, "running_mbs") or all(x.is_empty() for x in self.running_mbs) ) # In-flight PP microbatches (e.g. dispatched prefill chunks whose # batch results are not processed yet) are not in running_mbs. and ( self.ps.pp_size == 1 or not hasattr(self, "mbs") or all(mb is None or mb.is_empty() for mb in self.mbs) )

All is_fully_idle call sites (on_idle, the health-check gate in input processing, flush/attach/detach wrappers, weight updater) run inside the scheduler event loop, which calls init_pp_loop_state() first when pp_size > 1, so mbs/running_mbs are always initialized by then — same as the pre-existing running_mbs access on the line above. Keeping the access direct to avoid masking a real initialization-order bug with hasattr.

fzyzcjy · 2026-06-06T13:24:03Z

/rerun-test test/registered/chunked_prefill/test_scripted_core_4gpu.py

github-actions · 2026-06-06T13:24:23Z

Results for /rerun-test test/registered/chunked_prefill/test_scripted_core_4gpu.py:

🚀 4-gpu-h100 (1 test): ✅ View workflow run

cd test/ && python3 registered/chunked_prefill/test_scripted_core_4gpu.py

Once the server reports Up, /health runs the generation-based health check (a real probe request through the scheduler), which cannot complete while the scheduler waits for scripts between RunScript commands; the readiness poll would then never get a response within its per-request timeout. /model_info is pure metadata and proves the same thing: the socket is bound and routes are registered.

In scripted mode the scheduler loop freezes between RunScript commands, so the launch_server warmup thread's /generate used to land in the scheduler queues at an arbitrary point during the first scripts as foreign traffic, and /health stayed 503 for the whole run because the warmup could never complete. The hook now runs the engine until the warmup request has been observed on the recv socket and the scheduler stays fully idle for two microbatch rotations, and only then sends HookReady. Scripts start with the server warmed up, Up, and quiesced, with no ambient traffic left.

fzyzcjy · 2026-06-06T14:20:55Z

/rerun-test test/registered/unit/scripted_runtime/test_http_server.py test/registered/unit/scripted_runtime/test_scheduler_hook.py test/registered/unit/scripted_runtime/test_tokenizer_recv_proxy.py test/registered/scripted_runtime/test_scripted_runtime_core.py test/registered/chunked_prefill/test_scripted_core_1gpu.py test/registered/chunked_prefill/test_scripted_swa_1gpu.py test/registered/chunked_prefill/test_scripted_core_4gpu.py

github-actions · 2026-06-06T14:21:15Z

Results for /rerun-test test/registered/unit/scripted_runtime/test_http_server.py test/registered/unit/scripted_runtime/test_scheduler_hook.py test/registered/unit/scripted_runtime/test_tokenizer_recv_proxy.py test/registered/scripted_runtime/test_scripted_runtime_core.py test/registered/chunked_prefill/test_scripted_core_1gpu.py test/registered/chunked_prefill/test_scripted_swa_1gpu.py test/registered/chunked_prefill/test_scripted_core_4gpu.py:

🚀 ubuntu-latest (3 tests): ✅ View workflow run

cd test/ && python3 registered/unit/scripted_runtime/test_http_server.py
cd test/ && python3 registered/unit/scripted_runtime/test_scheduler_hook.py
cd test/ && python3 registered/unit/scripted_runtime/test_tokenizer_recv_proxy.py

🚀 1-gpu-5090 (2 tests): ✅ View workflow run

cd test/ && python3 registered/scripted_runtime/test_scripted_runtime_core.py
cd test/ && python3 registered/chunked_prefill/test_scripted_core_1gpu.py

🚀 1-gpu-h100 (1 test): ✅ View workflow run

cd test/ && python3 registered/chunked_prefill/test_scripted_swa_1gpu.py

🚀 4-gpu-h100 (1 test): ✅ View workflow run

cd test/ && python3 registered/chunked_prefill/test_scripted_core_4gpu.py

fzyzcjy · 2026-06-06T14:52:20Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. CI triage on the current round; decisions below per the user's direction to take this chain fully green. Please push back if any conclusion is off.

CI status triage:

stage-a-test-1-gpu-xpu (job): runner-level cleanup failure before any test ran — EACCES: permission denied, unlink .../sglang.egg-info/PKG-INFO. Classification: infra (dirty self-hosted XPU runner), unrelated to this PR (test-harness-only diff). Will rerun.
call-gate / pr-gate + pr-test-extra-finish: extra workflow gate failing only because this PR lacked run-ci-extra. Three of the affected scripted-runtime test files live in extra stages, so opting in now.
All 7 affected scripted-runtime files already passed dedicated /rerun-test dispatches on this PR (runs 27064745430, 27064747754, 27064750296, 27064752868).

fzyzcjy · 2026-06-06T14:52:26Z

/rerun-failed-ci

fzyzcjy · 2026-06-06T15:09:33Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Non-CUDA lane triage; please push back if any conclusion is off.

stage-b-test-large-8-gpu-mi35x-disaggregation-amd (job) failed in test/registered/amd/disaggregation/test_mori_transfer_engine_e2e.py::TestMoriTransferEngineHybridMambaE2E::test_generate_smoke_hybrid_mamba — the disaggregation server never came up:

ConnectionRefusedError: [Errno 111] Connect call failed ('127.0.0.1', 11120)

Classification: non-CUDA lane, unrelated to this PR — this PR only touches the scripted-runtime test harness (python/sglang/test/scripted_runtime/), which is env-gated and not imported by AMD disaggregation paths. Not chasing further per lane policy.

fzyzcjy · 2026-06-06T16:09:56Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Non-CUDA lane triage; please push back if any conclusion is off.

stage-c-test-large-8-gpu-amd (linux-mi325-8gpu-sglang, 1) (job) failed in test/registered/amd/test_deepseek_v32_basic.py — the server never came up:

urllib.error.URLError: <urlopen error [Errno 111] Connection refused>

Classification: non-CUDA lane (AMD mi325, stage-c), unrelated to this PR — this PR only touches the env-gated scripted-runtime test harness. Not chasing further per lane policy.

fzyzcjy · 2026-06-06T16:14:54Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Non-CUDA lane triage; please push back if any conclusion is off.

stage-c-test-large-8-gpu-amd (linux-mi325-8gpu-sglang, 3) (job) failed in test/registered/ops/test_aiter_allreduce_fusion_amd.py (torch.distributed.elastic ChildFailedError). Same classification as the other stage-c AMD partition: non-CUDA lane, unrelated to this PR (env-gated scripted-runtime harness diff only; aiter allreduce fusion is an AMD-only op path). Not chasing further per lane policy.

fzyzcjy · 2026-06-06T16:29:35Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Lane-policy triage of the two newest red checks; please push back if any conclusion is off.

base-c-test-8-gpu-h20 (1): H20 runners have chronic machine issues and are ignored by lane policy; not rerunning.
pr-test-amd-finish: aggregate finish job of the AMD workflow — red only because of the mi35x/mi325 lane failures already triaged above as unrelated to this harness-only PR.

All CUDA lanes remain green so far (118+ checks passed), including the 7 scripted-runtime test files this PR actually touches (each also passed a dedicated /rerun-test dispatch).

fzyzcjy · 2026-06-06T19:25:21Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. CI triage; please push back if any conclusion is off.

base-c-test-8-gpu-h20 (1) (job) failed in the install step before any test ran:

error: Request failed after 3 retries in 30.0s
  Caused by: Failed to fetch: https://docs.sglang.ai/whl/cu130/sglang-kernel/
  Caused by: tunnel error: unsuccessful

Classification: infra (H20 runner network/proxy, a chronically flaky lane). pr-test-finish is red only as a consequence. Re-running the failed jobs of that run once; if the tunnel error repeats it confirms the chronic H20 machine issue.

fzyzcjy · 2026-06-07T09:37:19Z

ci checked in #27446

…ject#27445)

fzyzcjy requested review from Ying1123, hnyls2002, merrymercy and xiezhq-hermann as code owners June 6, 2026 12:25

gemini-code-assist Bot reviewed Jun 6, 2026

View reviewed changes

fzyzcjy added the run-ci label Jun 6, 2026

fzyzcjy added 2 commits June 6, 2026 21:46

fzyzcjy force-pushed the tom/fix_pp_chunk_sweep_flush_race branch from 74732b3 to 161d09c Compare June 6, 2026 13:50

fzyzcjy changed the title ~~Fix PP is_fully_idle missing in-flight microbatches~~ Complete server warmup before scripted runtime scripts start Jun 6, 2026

fzyzcjy mentioned this pull request Jun 6, 2026

Fix PP is_fully_idle missing in-flight microbatches #27446

Merged

fzyzcjy added 3 commits June 6, 2026 21:58

Simplify warmup drive idle-streak tracking

41768cb

Hoist server_args local in warmup drive

9d9df1e

Log warmup drive start and completion

9f2dd4c

fzyzcjy added the run-ci-extra label Jun 6, 2026

fzyzcjy merged commit 14b8f98 into main Jun 7, 2026
309 of 343 checks passed

fzyzcjy deleted the tom/fix_pp_chunk_sweep_flush_race branch June 7, 2026 09:37

fzyzcjy mentioned this pull request Jun 7, 2026

Sandbox: verify full main CI is green on latest main (do not merge) #25647

Closed

4 tasks

monkeyLoveding pushed a commit to monkeyLoveding/sglang_open that referenced this pull request Jun 9, 2026

Complete server warmup before scripted runtime scripts start (sgl-pro…

56f9956

…ject#27445)

Conversation

fzyzcjy commented Jun 6, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Modifications

Accuracy Tests

Speed Tests and Profiling

CI States

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

fzyzcjy Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

fzyzcjy commented Jun 6, 2026

Uh oh!

github-actions Bot commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fzyzcjy commented Jun 6, 2026

Uh oh!

github-actions Bot commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fzyzcjy commented Jun 6, 2026

Uh oh!

fzyzcjy commented Jun 6, 2026

Uh oh!

fzyzcjy commented Jun 6, 2026

Uh oh!

fzyzcjy commented Jun 6, 2026

Uh oh!

fzyzcjy commented Jun 6, 2026

Uh oh!

fzyzcjy commented Jun 6, 2026

Uh oh!

fzyzcjy commented Jun 6, 2026

Uh oh!

fzyzcjy commented Jun 7, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

fzyzcjy commented Jun 6, 2026 •

edited by github-actions Bot

Loading

github-actions Bot commented Jun 6, 2026 •

edited

Loading

github-actions Bot commented Jun 6, 2026 •

edited

Loading