DO NOT MERGE - scripted runtime#26991
/tag-and-rerun-ci |
Triage
All 13 red checks are gate/aggregator jobs ( The
Next step
Close + reopen the PR to fire a
CUDA-lane triage (fresh run, gate now green)
Two distinct root causes, both introduced by this PR, both deterministic:
1. Scripted-runtime tests time out: HTTP endpoint not ready when reset POSTs
Affected jobs / tests (all the same failure mode):
Fingerprint:
Root cause: in
2.
|
Fixes pushed (
|
f43fe86 to 829e7d9
829e7d9 to 372ac76
CUDA status (tip
| Test | Result |
|---|---|
| base-a-test-cpu (pause_generation unit test) | ✅ |
| base-b test_scripted_runtime_core.py | ✅ |
| extra-b (4-gpu-h100) test_scripted_core_4gpu.py | ✅ |
| extra-a (1-gpu) test_scripted_core_1gpu.py | ❌ (real test bug, below) |
Remaining real failure — test_scripted_core_1gpu.py
test_pause_retract_at_lifecycle_points_then_resume[stage=last_decode]:
File ".../test_scripted_core_1gpu.py", line 81, in _script_pause_retract_at_stage
len(r.req.output_ids) == _LIFECYCLE_MAX_NEW_TOKENS
AttributeError: 'NoneType' object has no attribute 'output_ids'
At last_decode, pause_generation(retract) + one yield lets the req finish and be removed from all scheduler structures. r.finished stays True (tracked via _seen_rids), but r.req (find_req_by_rid) returns None, so r.req.output_ids raises. The assertion mixes a survives-removal signal (finished) with a gone-after-removal one (req). test_scripted_runtime_core.py passes because its stage/param mix never hits the finished-then-removed instant.
This needs a harness/test-semantics decision (a hook-derived output-token probe vs restructuring the assertion), so holding for the author rather than weakening the assertion.
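One way to picture the mismatch, and one possible None-safe form of the assertion, is the toy model below. It is a minimal sketch, not the harness code: FakeScheduler, FakeReq, ReqHandle, and output_len_ok are hypothetical stand-ins, and MAX_NEW_TOKENS only mimics _LIFECYCLE_MAX_NEW_TOKENS. Only the finished / req split and the find_req_by_rid lookup follow the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

MAX_NEW_TOKENS = 4  # hypothetical stand-in for _LIFECYCLE_MAX_NEW_TOKENS


@dataclass
class FakeReq:
    output_ids: List[int] = field(default_factory=list)


class FakeScheduler:
    """Toy scheduler: finished requests are dropped from the live table."""

    def __init__(self) -> None:
        self.live: Dict[str, FakeReq] = {}
        self.seen_finished: Set[str] = set()

    def find_req_by_rid(self, rid: str) -> Optional[FakeReq]:
        return self.live.get(rid)

    def finish(self, rid: str) -> None:
        # "finished" is remembered, but the request object itself is removed.
        self.seen_finished.add(rid)
        self.live.pop(rid, None)


class ReqHandle:
    def __init__(self, sched: FakeScheduler, rid: str) -> None:
        self.sched, self.rid = sched, rid

    @property
    def finished(self) -> bool:  # survives removal
        return self.rid in self.sched.seen_finished

    @property
    def req(self) -> Optional[FakeReq]:  # gone after removal
        return self.sched.find_req_by_rid(self.rid)


def output_len_ok(handle: ReqHandle) -> bool:
    """Check the output length only while the request is still observable."""
    req = handle.req
    if req is None:
        # output_ids is unobservable; accept only if the request really finished.
        return handle.finished
    return len(req.output_ids) == MAX_NEW_TOKENS
```

Whether "finished and removed" should count as a pass is exactly the test-semantics decision left to the author; the sketch only shows that the two signals have to be read separately.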
Not chasing
H20 (base-c), extra-b-8-gpu-h200 (test_disaggregation_hybrid_attention.py, untouched by this PR), NPU/XPU/AMD lanes — unrelated to this PR.
428763a to 3501f01
CI triage snapshot
Rollup: 34 SUCCESS / 23 SKIPPED / 89 running / 3 FAILURE. The CUDA hard-gate jobs are still running. The 3 current failures are all non-CUDA lanes:
Preliminary classification: likely unrelated to this PR. Reasoning from the diff:
Next step: REST API budget is currently exhausted (resets in ~45 min). Once it resets I will pull the XPU/NPU job logs to confirm the failure fingerprints, and continue watching the CUDA gate.
Non-CUDA failures confirmed unrelated to this PR
REST budget is back; I fetched the logs for all 5 current
None of these touch the scheduler/IPC hook code this PR adds. No fix is warranted on the PR side. Per non-CUDA-lane policy I am leaving them as-is and continuing to watch the CUDA gate; will report any CUDA-lane failure separately.
Addendum — 6th failure (AMD), also unrelated
Grepping the full AMD log, the only occurrence of
Status: all 6 current failures are non-CUDA-lane and confirmed unrelated to this PR. The CUDA hard gate (PR Test Base / Extra) is still running with 0 failures (~22 jobs left). Continuing to watch; will report any CUDA-lane failure separately.
CUDA-lane failure:
/rerun-test test/registered/hicache/test_qwen35_hicache.py
Results for 🚀
| Branch / commit | Run | Same file? | Outcome |
|---|---|---|---|
| PR #26991: extra-b-test-4-gpu-h100 | job 79655383617 | ✓ | FAIL: ✗ TIMEOUT after 1200s (hang after flush_cache, first eval scored 0.980) |
| main @ c6c1f1a29a | run 26992947929 | ✓ | FAIL: same shape: first eval Score: 0.980, flush_cache → 200 OK, then server hangs and is force-killed (Terminate orphan process: sglang/python) |
c6c1f1a29a is the exact base commit this PR is stacked on — i.e. the test hangs on the base tree without any of this PR's 4 commits applied. Combined with the earlier code-level analysis (every path this PR touches is gated behind SGLANG_TEST_SCRIPTED_RUNTIME / kv_canary, both off here; HiCache is untouched), this is conclusive: the HiCache timeout is a pre-existing main flake/hang, not introduced by this PR.
I had also kicked off a /rerun-test of this file on the PR branch (run 26997570249, still in progress) — its outcome doesn't change the verdict, but I'll note it once it finishes.
Overall status: all 8 current failures (6 non-CUDA infra/flake + the HiCache CUDA timeout + its pr-test-extra-finish cascade) are confirmed unrelated to this PR. No PR-side fix is warranted. 8 CUDA jobs still running with no other CUDA failures; continuing to watch.
What job
| Branch / commit | Run | Same file | Outcome |
|---|---|---|---|
| this PR: extra-b-test-4-gpu-h100 (1), both attempts | attempt 1 · attempt 2 | ✓ | FAIL: TIMEOUT 1200s |
| main @ c6c1f1a29a (= this PR's base commit) | run 26992947929 | ✓ | FAIL: same hang-after-flush_cache shape |
c6c1f1a29a is the exact commit this PR is stacked on, with none of this PR's 4 commits applied, and the test already hangs there. On top of that, the server_args for the failing run show kv_canary='none' and SGLANG_TEST_SCRIPTED_RUNTIME unset, so every code path this PR touches (the env-gated scheduler/IPC hooks, the kv_canary helper) is inert; HiCache is not touched by this PR at all.
Conclusion: the HiCache CUDA timeout is a pre-existing, consistent hang on main (most likely a wait_complete prefetch deadlock), not introduced by this PR. No fix belongs in this PR's commits.
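The inertness argument rests on the new code paths being gated behind an environment variable. A minimal sketch of that pattern follows; it is not the PR's code. scripted_runtime_enabled, install_hooks, the hook name, and the accepted truthy strings are all assumptions for illustration; only the variable name SGLANG_TEST_SCRIPTED_RUNTIME comes from the discussion above.

```python
import os
from typing import List, Mapping


def scripted_runtime_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    # Assumption: "1" / "true" enable the feature; the real parsing may differ.
    value = environ.get("SGLANG_TEST_SCRIPTED_RUNTIME", "")
    return value.strip().lower() in {"1", "true"}


def install_hooks(hooks: List[str], environ: Mapping[str, str] = os.environ) -> List[str]:
    """Return the hook list, extended only when the gate is on."""
    if not scripted_runtime_enabled(environ):
        # Gate off: the hook list is returned untouched, so behaviour is unchanged.
        return list(hooks)
    return list(hooks) + ["scripted_runtime_hook"]  # hypothetical hook name
```

With the variable unset, install_hooks returns its input unchanged, which is the sense in which a gated code path cannot explain a failure in a run where the gate was off.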
Root cause found — fixed by #27366
This is not something to fix in this PR. The hang is a known main bug in the HiMamba HiCache prefetch path, and a fix is already up:
- #27366 — "[BugFix]: Fix HiMamba HiCache prefetch hang after L3 sidecar transfer" (author @hzh0425)
- Its stated motivation is literally "Fix qwen3.5 hicache hanging", referencing a main PR Test Base failure (run 26985767961), another main-side hit of this same hang.
- It touches exactly one file, python/sglang/srt/mem_cache/hi_mamba_radix_cache.py (the HiMambaRadixCache path this test exercises), adding a guard in can_terminate_prefetch so a prefetch is not terminated while operation.pool_transfers are still in flight (not operation.pool_transfers_done). That matches the symptom exactly: --hicache-storage-prefetch-policy wait_complete waiting forever on an L3 sidecar pool transfer that was allowed to "complete" early.
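The guard described in the last bullet can be pictured with a toy model. This is a minimal sketch under stated assumptions, not the patch from #27366: the PrefetchOperation dataclass, its field types, and the storage_done flag are hypothetical; only the names can_terminate_prefetch, pool_transfers, and pool_transfers_done are taken from the description above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PrefetchOperation:
    # Hypothetical stand-in for the real prefetch operation object.
    storage_done: bool = False  # assumed: the main storage fetch has finished
    pool_transfers: List[str] = field(default_factory=list)  # assumed: sidecar transfers
    pool_transfers_done: bool = False  # assumed: set once all sidecar transfers land


def can_terminate_prefetch(operation: PrefetchOperation) -> bool:
    """Refuse to terminate while sidecar pool transfers are still in flight."""
    if operation.pool_transfers and not operation.pool_transfers_done:
        return False
    return operation.storage_done
```

In this model an operation whose main fetch is done but whose sidecar transfer is still pending stays non-terminable until pool_transfers_done flips, which is the ordering the bullet says the fix enforces.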
So the verdict stands and is now root-caused: the test_qwen35_hicache.py CUDA timeout is a pre-existing main HiMamba-prefetch hang, unrelated to this PR, and it will be resolved by #27366 (not by anything in this PR's commits). Once #27366 lands and this PR is rebased, the test should go green.
3501f01 to f958fe8
✅ HiCache timeout resolved by #27366, confirmed on this PR
After rebasing onto
Current round status: 124 SUCCESS / 2 still queued (
The only remaining failures are the chronic non-CUDA lanes (NPU
/rerun-failed-ci
✅ CUDA gate fully green: final round status
The round on the rebased head (
Remaining 5 failures are all chronic non-CUDA lanes (3 real + 2
(The earlier |
/rerun-failed-ci
/rerun-failed-ci
merged
…ain) into tom/scripted_runtime_and_chunked_testing

19 files conflicted (mostly add/add: both branches independently grew the scripted-runtime + kv_canary test infra). This branch is the developed superset; sgl-project#26991 is the extracted chain.

Resolution:
- 16 files -> took ours (this branch is a strict superset; sgl-project#26991 only stripped comments/code ours has). Includes req_handle.py: kept ours' SWA-aware, None-safe lock_refs; sgl-project#26991 still had the old node.lock_ref that crashes on SWA nodes / finished reqs.
- http_server.py -> took sgl-project#26991's _await_http_ready() (waits for the uvicorn port to bind before scripts POST control messages -- a real startup-race fix ours lacked) and re-inserted ours' overlap-scheduler MUST-run comment.
- scripted_runtime_chunked_helpers.py -> kept ours' comments; took sgl-project#26991's advance_to_nth_chunk using accumulated chunks_done instead of sampling is_chunking (which undercounts on the step the req leaves chunked_req).
- test/registered/.../test_scripted_core_1gpu.py -> kept ours' comments and the finished-case output-length assertion, but made it r.req-None-safe per sgl-project#26991's note that a finished req may already be removed (output_ids unobservable).

Verified: no conflict markers, all resolved files compile (py3.12+), no unmerged files. merge_diff_check.py reported 24 'lost' files -- all false positives (ours == theirs for those, so they auto-merged identically; confirmed present in the merged tree, e.g. environ.py keeps all 4 SCRIPTED_RUNTIME env vars).
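The startup race mentioned for http_server.py (control messages POSTed before the server port is bound) is usually closed by polling the port first. The following is a minimal sketch of that idea using only the standard library; it is not the body of _await_http_ready(), and wait_for_port with its parameters is a hypothetical helper.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 30.0, interval_s: float = 0.1) -> None:
    """Block until host:port accepts a TCP connection, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means the listener is bound and accepting.
            with socket.create_connection((host, port), timeout=interval_s):
                return
        except OSError:
            if time.monotonic() >= deadline:
                raise TimeoutError(
                    f"{host}:{port} did not accept connections within {timeout_s}s"
                )
            time.sleep(interval_s)
```

A script would call wait_for_port(host, port) once before its first POST. A bound port does not prove the application routes are ready, so a real readiness check might also probe an HTTP endpoint; that part is left out here.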
…sting Upstream landed the extracted scripted-runtime chain (sgl-project#27410-sgl-project#27413), which conflicts with this branch's further-developed copies of the same files. All 23 conflicts resolve to ours: theirs' only unique content is older forms this branch has since corrected on H200 (pre-SWA lock_refs, short-circuit run_until_all_finished, the finished-case-dropping core_1gpu variant, the narrower abort() signature, pre-stop_token_ids imports); rerere replayed the matching resolutions from the earlier sgl-project#26991 merge for the rest.

CI States
Latest PR Test (Base): ✅ Run #27017228482
Latest PR Test (Extra): ✅ Run #27017228308