Skip to content

Complete server warmup before scripted runtime scripts start#27445

Merged
fzyzcjy merged 5 commits into
mainfrom
tom/fix_pp_chunk_sweep_flush_race
Jun 7, 2026
Merged

Complete server warmup before scripted runtime scripts start#27445
fzyzcjy merged 5 commits into
mainfrom
tom/fix_pp_chunk_sweep_flush_race

Conversation

@fzyzcjy

@fzyzcjy fzyzcjy commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator

Motivation

This is PR 1 of 2 extracted from the investigation of the flaky test/registered/chunked_prefill/test_scripted_core_4gpu.py (example CI failure); PR 2 (based on this branch) fixes the actual scheduler bug.

The scripted runtime promises scripts a deterministic, script-driven scheduler, but the server it launches still runs the standard launch_server warmup thread. Since the scheduler loop freezes between RunScript commands, the warmup could never complete on its own; instead:

  • the warmup thread's /generate ("The capital city of France is") landed in the scheduler queues at an arbitrary point during the first scripts as foreign traffic, perturbing queue contents, batch compositions and reset drains (one of the amplifiers behind the flaky CI failure above, and directly observed blocking a reproduction experiment with Cache not flushed ... #queue-req: 1);
  • the warmup thread kept polling GET /model_info once per second in the background;
  • /health stayed 503 (server_status == Starting) for the whole run.

Modifications

  • Drive the engine through warmup before accepting scripts (scheduler_hook.py, tokenizer_recv_proxy.py): the hook now advances the real scheduler loop until the warmup request has been observed on the recv socket (the proxy counts inbound tokenized work requests) and the scheduler then stays fully idle for two microbatch rotations, and only then sends HookReady. Scripts start with the server warmed up, Up, and quiesced, with no ambient traffic left. The two-rotation quiesce guard is needed because is_fully_idle() can transiently report idle while a PP microbatch result is still in flight (the scheduler-side bug fixed in PR 2). Honors skip_server_warmup=True by skipping the drive. Bounded by a 120s wall-clock timeout that fails loudly.
  • Poll /model_info instead of /health for readiness (http_server.py): with the server now genuinely reaching Up, /health runs the generation-based health check (a real probe through the scheduler), which cannot complete while the scheduler waits for scripts between commands — the old readiness poll would hang. /model_info is pure metadata and proves the same thing (socket bound, routes registered).

A side benefit: the first KV-canary sweeps (triggered after each rank's first forward, ~100ms each, cascading across PP ranks) now also happen during the warmup drive instead of racing the first script.

Scope note: the warmup-completion detection targets the current scripted-runtime scope (no disaggregation, text models); PD-disagg/VLM warmups have different shapes and should be revisited if the scripted runtime grows there.

Accuracy Tests

N/A (test harness only).

Speed Tests and Profiling

N/A.


CI States

Latest PR Test (Base): ✅ Run #27064732059
Latest PR Test (Extra): ❌ Run #27065446238

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request improves pipeline parallelism (PP) handling in the scripted runtime. It updates the scheduler's idle check to account for in-flight PP microbatches, ensures all microbatch slots are scanned when querying active requests under PP, disables server warmup during scripted tests to prevent race conditions, and adds a test case for flushing the cache during in-flight chunk results. Review feedback suggests making the is_fully_idle check more robust by using hasattr to verify the existence of mbs and running_mbs before accessing them, preventing potential AttributeErrors during early initialization.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread python/sglang/srt/managers/scheduler.py Outdated
Comment on lines +3327 to +3333
and (self.ps.pp_size == 1 or all(x.is_empty() for x in self.running_mbs))
# In-flight PP microbatches (e.g. dispatched prefill chunks whose
# batch results are not processed yet) are not in running_mbs.
and (
self.ps.pp_size == 1
or all(mb is None or mb.is_empty() for mb in self.mbs)
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If is_fully_idle is called during early initialization or health checks before the event loop starts (which is when init_pp_loop_state is executed to initialize mbs and running_mbs), accessing self.mbs or self.running_mbs will raise an AttributeError when pp_size > 1.

To make this check robust and defensive, we should verify if these attributes exist using hasattr before accessing them.

            and (
                self.ps.pp_size == 1
                or not hasattr(self, "running_mbs")
                or all(x.is_empty() for x in self.running_mbs)
            )
            # In-flight PP microbatches (e.g. dispatched prefill chunks whose
            # batch results are not processed yet) are not in running_mbs.
            and (
                self.ps.pp_size == 1
                or not hasattr(self, "mbs")
                or all(mb is None or mb.is_empty() for mb in self.mbs)
            )

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All is_fully_idle call sites (on_idle, the health-check gate in input processing, flush/attach/detach wrappers, weight updater) run inside the scheduler event loop, which calls init_pp_loop_state() first when pp_size > 1, so mbs/running_mbs are always initialized by then — same as the pre-existing running_mbs access on the line above. Keeping the access direct to avoid masking a real initialization-order bug with hasattr.

@fzyzcjy fzyzcjy added the run-ci label Jun 6, 2026
@fzyzcjy

fzyzcjy commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator Author

/rerun-test test/registered/chunked_prefill/test_scripted_core_4gpu.py

@github-actions

github-actions Bot commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

Results for /rerun-test test/registered/chunked_prefill/test_scripted_core_4gpu.py:

🚀 4-gpu-h100 (1 test): ✅ View workflow run

cd test/ && python3 registered/chunked_prefill/test_scripted_core_4gpu.py

fzyzcjy added 2 commits June 6, 2026 21:46
Once the server reports Up, /health runs the generation-based health
check (a real probe request through the scheduler), which cannot
complete while the scheduler waits for scripts between RunScript
commands; the readiness poll would then never get a response within its
per-request timeout. /model_info is pure metadata and proves the same
thing: the socket is bound and routes are registered.
In scripted mode the scheduler loop freezes between RunScript commands,
so the launch_server warmup thread's /generate used to land in the
scheduler queues at an arbitrary point during the first scripts as
foreign traffic, and /health stayed 503 for the whole run because the
warmup could never complete.

The hook now runs the engine until the warmup request has been observed
on the recv socket and the scheduler stays fully idle for two microbatch
rotations, and only then sends HookReady. Scripts start with the server
warmed up, Up, and quiesced, with no ambient traffic left.
@fzyzcjy fzyzcjy force-pushed the tom/fix_pp_chunk_sweep_flush_race branch from 74732b3 to 161d09c Compare June 6, 2026 13:50
@fzyzcjy fzyzcjy changed the title Fix PP is_fully_idle missing in-flight microbatches Complete server warmup before scripted runtime scripts start Jun 6, 2026
@fzyzcjy

fzyzcjy commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator Author

/rerun-test test/registered/unit/scripted_runtime/test_http_server.py test/registered/unit/scripted_runtime/test_scheduler_hook.py test/registered/unit/scripted_runtime/test_tokenizer_recv_proxy.py test/registered/scripted_runtime/test_scripted_runtime_core.py test/registered/chunked_prefill/test_scripted_core_1gpu.py test/registered/chunked_prefill/test_scripted_swa_1gpu.py test/registered/chunked_prefill/test_scripted_core_4gpu.py

@github-actions

github-actions Bot commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

Results for /rerun-test test/registered/unit/scripted_runtime/test_http_server.py test/registered/unit/scripted_runtime/test_scheduler_hook.py test/registered/unit/scripted_runtime/test_tokenizer_recv_proxy.py test/registered/scripted_runtime/test_scripted_runtime_core.py test/registered/chunked_prefill/test_scripted_core_1gpu.py test/registered/chunked_prefill/test_scripted_swa_1gpu.py test/registered/chunked_prefill/test_scripted_core_4gpu.py:

🚀 ubuntu-latest (3 tests): ✅ View workflow run

cd test/ && python3 registered/unit/scripted_runtime/test_http_server.py
cd test/ && python3 registered/unit/scripted_runtime/test_scheduler_hook.py
cd test/ && python3 registered/unit/scripted_runtime/test_tokenizer_recv_proxy.py

🚀 1-gpu-5090 (2 tests): ✅ View workflow run

cd test/ && python3 registered/scripted_runtime/test_scripted_runtime_core.py
cd test/ && python3 registered/chunked_prefill/test_scripted_core_1gpu.py

🚀 1-gpu-h100 (1 test): ✅ View workflow run

cd test/ && python3 registered/chunked_prefill/test_scripted_swa_1gpu.py

🚀 4-gpu-h100 (1 test): ✅ View workflow run

cd test/ && python3 registered/chunked_prefill/test_scripted_core_4gpu.py

@fzyzcjy

fzyzcjy commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator Author

🤖 Posted autonomously by Claude Code acting on the user's behalf. CI triage on the current round; decisions below per the user's direction to take this chain fully green. Please push back if any conclusion is off.

CI status triage:

  • stage-a-test-1-gpu-xpu (job): runner-level cleanup failure before any test ran — EACCES: permission denied, unlink .../sglang.egg-info/PKG-INFO. Classification: infra (dirty self-hosted XPU runner), unrelated to this PR (test-harness-only diff). Will rerun.
  • call-gate / pr-gate + pr-test-extra-finish: extra workflow gate failing only because this PR lacked run-ci-extra. Three of the affected scripted-runtime test files live in extra stages, so opting in now.
  • All 7 affected scripted-runtime files already passed dedicated /rerun-test dispatches on this PR (runs 27064745430, 27064747754, 27064750296, 27064752868).

@fzyzcjy

fzyzcjy commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator Author

/rerun-failed-ci

@fzyzcjy

fzyzcjy commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator Author

🤖 Posted autonomously by Claude Code acting on the user's behalf. Non-CUDA lane triage; please push back if any conclusion is off.

stage-b-test-large-8-gpu-mi35x-disaggregation-amd (job) failed in test/registered/amd/disaggregation/test_mori_transfer_engine_e2e.py::TestMoriTransferEngineHybridMambaE2E::test_generate_smoke_hybrid_mamba — the disaggregation server never came up:

ConnectionRefusedError: [Errno 111] Connect call failed ('127.0.0.1', 11120)

Classification: non-CUDA lane, unrelated to this PR — this PR only touches the scripted-runtime test harness (python/sglang/test/scripted_runtime/), which is env-gated and not imported by AMD disaggregation paths. Not chasing further per lane policy.

@fzyzcjy

fzyzcjy commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator Author

🤖 Posted autonomously by Claude Code acting on the user's behalf. Non-CUDA lane triage; please push back if any conclusion is off.

stage-c-test-large-8-gpu-amd (linux-mi325-8gpu-sglang, 1) (job) failed in test/registered/amd/test_deepseek_v32_basic.py — the server never came up:

urllib.error.URLError: <urlopen error [Errno 111] Connection refused>

Classification: non-CUDA lane (AMD mi325, stage-c), unrelated to this PR — this PR only touches the env-gated scripted-runtime test harness. Not chasing further per lane policy.

@fzyzcjy

fzyzcjy commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator Author

🤖 Posted autonomously by Claude Code acting on the user's behalf. Non-CUDA lane triage; please push back if any conclusion is off.

stage-c-test-large-8-gpu-amd (linux-mi325-8gpu-sglang, 3) (job) failed in test/registered/ops/test_aiter_allreduce_fusion_amd.py (torch.distributed.elastic ChildFailedError). Same classification as the other stage-c AMD partition: non-CUDA lane, unrelated to this PR (env-gated scripted-runtime harness diff only; aiter allreduce fusion is an AMD-only op path). Not chasing further per lane policy.

@fzyzcjy

fzyzcjy commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator Author

🤖 Posted autonomously by Claude Code acting on the user's behalf. Lane-policy triage of the two newest red checks; please push back if any conclusion is off.

  • base-c-test-8-gpu-h20 (1): H20 runners have chronic machine issues and are ignored by lane policy; not rerunning.
  • pr-test-amd-finish: aggregate finish job of the AMD workflow — red only because of the mi35x/mi325 lane failures already triaged above as unrelated to this harness-only PR.

All CUDA lanes remain green so far (118+ checks passed), including the 7 scripted-runtime test files this PR actually touches (each also passed a dedicated /rerun-test dispatch).

@fzyzcjy

fzyzcjy commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator Author

🤖 Posted autonomously by Claude Code acting on the user's behalf. CI triage; please push back if any conclusion is off.

base-c-test-8-gpu-h20 (1) (job) failed in the install step before any test ran:

error: Request failed after 3 retries in 30.0s
  Caused by: Failed to fetch: https://docs.sglang.ai/whl/cu130/sglang-kernel/
  Caused by: tunnel error: unsuccessful

Classification: infra (H20 runner network/proxy, a chronically flaky lane). pr-test-finish is red only as a consequence. Re-running the failed jobs of that run once; if the tunnel error repeats it confirms the chronic H20 machine issue.

@fzyzcjy

fzyzcjy commented Jun 7, 2026

Copy link
Copy Markdown
Collaborator Author

ci checked in #27446

@fzyzcjy fzyzcjy merged commit 14b8f98 into main Jun 7, 2026
309 of 343 checks passed
@fzyzcjy fzyzcjy deleted the tom/fix_pp_chunk_sweep_flush_race branch June 7, 2026 09:37
monkeyLoveding pushed a commit to monkeyLoveding/sglang_open that referenced this pull request Jun 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant