Sandbox: verify full main CI is green on latest main (do not merge) by fzyzcjy · Pull Request #25647 · sgl-project/sglang

fzyzcjy · 2026-05-18T12:20:46Z

Summary

Sandbox PR — do not merge. Touches python/sglang/version.py with a no-op comment so paths-filter flips main_package=true and the full PR Test Base + PR Test Extra matrix dispatches.

Carries three labels so the workflow gates all pass:

Label	Effect
`run-ci`	Passes `pr-gate.yml`'s `require-run-ci` gate
`run-ci-extra`	Allows `pr-test-extra.yml` to run on this `pull_request` event
`bypass-fastfail`	Makes the per-job `check-pr-test-health` action no-op (no cascade fast-fail when a single sibling fails on infra flake)

Purpose: verify upstream/main (f04c522534) is green end-to-end with the full CI surface (base stages + extra stages, no fast-fail cascade). This is the PR-side equivalent of the dispatched main CI; cleaner than gh workflow run because the dispatch interface cannot pass skip_pr_test_health_check.

Close this PR after the run completes — no source change is intended to land.

Test plan

pre-commit run --files python/sglang/version.py
PR Test Base dispatches and runs to completion
PR Test Extra dispatches and runs to completion
No check-pr-test-health cascade failures

CI States

Latest PR Test (Base): ⏳ Run #27205975942
Latest PR Test (Extra): ✅ Run #27205975115

gemini-code-assist · 2026-05-18T12:20:50Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

fzyzcjy · 2026-05-18T12:20:57Z

/tag-and-rerun-ci

fzyzcjy · 2026-05-19T01:48:17Z

CI failure: `base-b-test-1-gpu-large (1)` (PR Test Base, B200, 80 GB)

Job log

Failing test: test/registered/spec/eagle/test_eagle_infer_b.py::TestEAGLEServerAdditional::test_radix_attention

Symptom: 11/12 EAGLE tests pass, then test_radix_attention fails with ConnectionRefusedError: [Errno 111] Connection refused on http://127.0.0.1:11000/generate. The server died — a 59 MB cuda-coredumps-run-1.zip artifact was produced (artifact 7073114703).

File ".../test/registered/spec/eagle/test_eagle_infer_b.py", line 104, in test_radix_attention
    run_radix_attention_test(self.base_url)
File ".../python/sglang/test/kits/radix_cache_server_kit.py", line 49, in run_radix_attention_test
    res = requests.post(base_url + "/generate", json=data)
...
urllib3.exceptions.NewConnectionError: ... [Errno 111] Connection refused
Exception: retry() exceed maximum number of retries.

Classification: this PR is a main-CI sandbox (HEAD = latest upstream/main + a no-op python/sglang/version.py comment touch, labels run-ci + run-ci-extra + bypass-fastfail), so the failure IS a main failure. 11/12 EAGLE tests on the same base_url passed and the server emitted a CUDA coredump during test_radix_attention, which points to an EAGLE-specific server crash. Provisionally treating it as a flake unless it repeats.

Next step: leaving the run untouched to see whether other lanes hit the same EAGLE / coredump pattern. If this stays isolated, will classify as flake and /rerun-test test/registered/spec/eagle/test_eagle_infer_b.py.

fzyzcjy · 2026-05-19T02:32:06Z

CI failure: `extra-a-test-1-gpu-large (0)` (PR Test Extra, NVIDIA)

Job log

Failing test: test/registered/lora/test_lora_qwen3_8b_logprob_diff.py::TestLoRAQwen3_8BLogprobDiff::test_lora_qwen3_8b_logprob_accuracy

Symptom: scheduler dies during init with exit code -6 (SIGABRT — not SIGKILL/-9, so not the OS OOM-killer). Retry exhausted after the engine construction throws.

File ".../test/registered/lora/test_lora_qwen3_8b_logprob_diff.py", line 134, in test_lora_qwen3_8b_logprob_accuracy
    engine = sgl.Engine(...)
File ".../python/sglang/srt/entrypoints/engine.py", line 236, in __init__
    ) = self._launch_subprocesses(
File ".../python/sglang/srt/entrypoints/engine.py", line 856, in _launch_subprocesses
    scheduler_init_result.wait_for_ready()
File ".../python/sglang/srt/entrypoints/engine.py", line 651, in wait_for_ready
    infos = _wait_for_scheduler_ready(scheduler_pipe_readers, scheduler_procs)
File ".../python/sglang/srt/entrypoints/engine.py", line 1337, in _wait_for_scheduler_ready
    raise _scheduler_died_error(i, scheduler_procs[i])
RuntimeError: Rank 0 scheduler died during initialization (exit code: -6). If exit code is -9 (SIGKILL), a common cause is the OS OOM killer. Run `dmesg -T | grep -i oom` to check.
...
Exception: retry() exceed maximum number of retries.

Classification: this PR is a main-CI sandbox (HEAD = latest upstream/main + a no-op python/sglang/version.py comment touch, labels run-ci + run-ci-extra + bypass-fastfail), so the failure IS a main failure on the NVIDIA extra-a-1-gpu-large lane. Exit -6 = SIGABRT during scheduler init — could be a CUDA-kernel crash, a model-loading assertion in the LoRA path, or transient infra. Posting a separate /rerun-test for this file to differentiate flake vs persistent.
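The exit-code reading above relies on the POSIX convention that Python's subprocess machinery follows: a negative return code -N means the child was terminated by signal N. A minimal sketch of that decoding follows. The helper name describe_exit_code is made up for illustration, and signal numbers are platform-dependent, so the mapping assumes Linux.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a subprocess-style exit code into a readable description."""
    if code >= 0:
        return f"exited with status {code}"
    try:
        # A negative code -N means "terminated by signal N".
        name = signal.Signals(-code).name
    except ValueError:
        name = f"unknown signal {-code}"
    return f"killed by {name}"


for code in (-6, -9, 1):
    print(code, "->", describe_exit_code(code))
```

On Linux, -6 decodes to SIGABRT (an abort raised from inside the process) and -9 to SIGKILL (the signal the OOM killer uses), which is why the two cases call for different follow-up.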

fzyzcjy · 2026-05-19T02:32:12Z

CI failure: `base-b-test-1-gpu-small (5)` (PR Test Base, NVIDIA, 32 GB)

Job log

Failing test: test/registered/core/test_srt_endpoint.py::TestSRTEndpoint::test_get_server_info_concurrent ("Make sure the concurrent get_server_info doesn't crash the server.")

Symptom: server returns non-JSON on concurrent /server_info calls because the server-side handler hits an AssertionError inside communicator.queueing_call. The client then dies with JSONDecodeError: Expecting value: line 1 column 1 (char 0), retries are exhausted, test errors.

Server-side traceback:

File ".../python/sglang/srt/entrypoints/http_server.py", line 635, in server_info
    await _global_state.tokenizer_manager.get_internal_state()
File ".../python/sglang/srt/managers/tokenizer_control_mixin.py", line 788, in get_internal_state
    await self.get_internal_state_communicator(req)
File ".../python/sglang/srt/managers/communicator.py", line 79, in __call__
    return await self.queueing_call(obj)
File ".../python/sglang/srt/managers/communicator.py", line 40, in queueing_call
    assert self._result_event is None
AssertionError

Client-side:

File ".../test/registered/core/test_srt_endpoint.py", line 635, in s
    server_info.json()
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Exception: retry() exceed maximum number of retries.

Classification: this PR is a main-CI sandbox (HEAD = latest upstream/main + a no-op python/sglang/version.py comment touch, labels run-ci + run-ci-extra + bypass-fastfail), so the failure IS a main failure on the NVIDIA base-b-1-gpu-small (32 GB) lane. The failing assertion (assert self._result_event is None in communicator.queueing_call) indicates a concurrency race in the internal-state communicator, which is exactly the class of bug test_get_server_info_concurrent exists to catch. Smells like a real race, not a flake, but posting /rerun-test to confirm reproducibility before escalating.
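To illustrate how an assertion of this shape can trip under concurrency, here is a minimal asyncio sketch. It is NOT sglang's communicator: SingleSlotCommunicator and its methods are hypothetical stand-ins that keep only the "one in-flight request, tracked by _result_event" idea, and the lock-based variant is one possible way to serialize callers, not a claim about how the project does it or should fix it.

```python
import asyncio


class SingleSlotCommunicator:
    """Hypothetical stand-in: at most one in-flight request, tracked by _result_event."""

    def __init__(self) -> None:
        self._result_event = None
        self._lock = asyncio.Lock()

    async def _round_trip(self) -> str:
        assert self._result_event is None  # same shape as the failing assertion
        self._result_event = asyncio.Event()
        await asyncio.sleep(0)  # yield to the event loop, as waiting for a reply would
        self._result_event = None
        return "ok"

    async def unguarded_call(self) -> str:
        return await self._round_trip()

    async def guarded_call(self) -> str:
        async with self._lock:  # serialize callers so the slot is free on entry
            return await self._round_trip()


async def demo():
    comm = SingleSlotCommunicator()
    unguarded = await asyncio.gather(
        comm.unguarded_call(), comm.unguarded_call(), return_exceptions=True
    )
    guarded = await asyncio.gather(comm.guarded_call(), comm.guarded_call())
    return unguarded, guarded


unguarded, guarded = asyncio.run(demo())
print([type(r).__name__ for r in unguarded])
print(guarded)
```

In the unguarded case the second caller enters while the first is still suspended, so it sees a non-None _result_event and raises AssertionError. Whether the real failure follows this exact interleaving is not established by the logs in this thread.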

fzyzcjy · 2026-05-19T02:37:09Z

/rerun-test test/registered/spec/eagle/test_eagle_infer_b.py

github-actions · 2026-05-19T02:37:29Z

🚀 1-gpu-h100 (1 test): ✅ View workflow run

cd test/ && python3 registered/spec/eagle/test_eagle_infer_b.py

fzyzcjy · 2026-05-19T02:50:15Z

/rerun-test test/registered/lora/test_lora_qwen3_8b_logprob_diff.py

fzyzcjy · 2026-05-19T02:50:17Z

/rerun-test test/registered/core/test_srt_endpoint.py

github-actions · 2026-05-19T02:50:37Z

🚀 1-gpu-5090 (1 test): ✅ View workflow run

cd test/ && python3 registered/core/test_srt_endpoint.py

github-actions · 2026-05-19T02:50:45Z

🚀 1-gpu-h100 (1 test): ❌ View workflow run

cd test/ && python3 registered/lora/test_lora_qwen3_8b_logprob_diff.py

fzyzcjy · 2026-05-19T02:57:02Z

`/rerun-test test/registered/lora/test_lora_qwen3_8b_logprob_diff.py` result: ❌ FAIL (reproducible)

Rerun job log

The same test fails on rerun with the same stack as the original extra-a-test-1-gpu-large (0) failure → this is NOT a flake.

Actual root cause: the SIGABRT in extra-a-test-1-gpu-large (0) was only the post-mortem signal. The output just before the coredump shows the underlying fault:

coredump: Starting GPU coredump generation
coredump: Detected an exception of type CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS (14)
coredump:   - Device: 0

Triggered during CUDA graph capture of TestLoRAQwen3_8BLogprobDiff::test_lora_qwen3_8b_logprob_accuracy. The Python stack dumped by faulthandler after the coredump (the aborting thread):

File ".../python/sglang/srt/layers/quantization/unquant.py", line 161 in apply
File ".../python/sglang/srt/lora/layers.py", line 724 in forward
...
File ".../python/sglang/srt/model_executor/cuda_graph_runner.py", line 1112 in run_once
File ".../python/sglang/srt/model_executor/cuda_graph_runner.py", line 1134 in capture_one_batch_size
File ".../python/sglang/srt/model_executor/cuda_graph_runner.py", line 854 in _capture_one_stream
File ".../python/sglang/srt/model_executor/cuda_graph_runner.py", line 867 in capture
File ".../python/sglang/srt/model_executor/cuda_graph_runner.py", line 707 in __init__
File ".../python/sglang/srt/model_executor/model_runner.py", line 2776 in init_device_graphs

Classification: real bug on main HEAD 4a451128…, in the LoRA layer's forward path under CUDA graph capture. Likely a bad index / out-of-bounds memory access in lora/layers.py:724 (or in the unquant apply at unquant.py:161) when running Qwen3-8B with LoRA. Two-run reproducibility on the same commit confirms it's not a transient flake.

(This is the main-CI sandbox PR; the failing commit 4a451128… = latest upstream/main + a no-op python/sglang/version.py comment touch, so this bug is on main proper.)

fzyzcjy · 2026-05-19T03:04:53Z

`/rerun-test test/registered/core/test_srt_endpoint.py` result: ✅ PASS (flake)

Rerun job → SUCCESS.

The original failure on base-b-test-1-gpu-small (5) (test_get_server_info_concurrent, AssertionError self._result_event is None in communicator.queueing_call) did not reproduce. Classifying as flake — likely a transient race in the internal-state communicator under concurrent server_info that didn't hit the timing window on rerun. Not pursuing further.

Final per-file rerun verdicts on this main-CI sandbox:

File	Original lane	Rerun verdict
`test/registered/spec/eagle/test_eagle_infer_b.py` (`test_radix_attention`)	`base-b-test-1-gpu-large (1)`	✅ PASS — flake
`test/registered/core/test_srt_endpoint.py` (`test_get_server_info_concurrent`)	`base-b-test-1-gpu-small (5)`	✅ PASS — flake
`test/registered/lora/test_lora_qwen3_8b_logprob_diff.py` (`test_lora_qwen3_8b_logprob_accuracy`)	`extra-a-test-1-gpu-large (0)`	❌ FAIL same `CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS` during CUDA graph capture — real bug (bisecting next)

fzyzcjy · 2026-05-19T03:19:03Z

Bisect probe: `d90bc65e30` (`[NPU] Fix TypeError in get_state_buf_infos when index_head_dim is None on MLA (#25383)` — pre-chain, HEAD-28)

File: test/registered/lora/test_lora_qwen3_8b_logprob_diff.py
rerun-test run: 26073712614 — FAIL
Tree date: 2026-05-19 (the commit on main directly preceding Tom's 23-commit refactor chain)

Verbatim CUDA error fingerprint:

coredump: Detected an exception of type CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS (14)
coredump:   - Device: 0
Fatal Python error: Aborted
RuntimeError: Rank 0 scheduler died during initialization (exit code: -6).

Same CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS during CUDA graph capture as on c2a212bf… / 4a451128… (current main HEAD).

→ The bug predates Tom's chain. The 28 PRs between d90bc65e30 and current HEAD (Tom's scheduler refactor chain, PRs #25703–#25728, plus #25282 DeepSeek V4 host pool, #25596 LTX2 diffusion fix, #25699 PD/NIXL aux, #25689 spec_verify metric, #24710 RMSNorm dispatch) are NOT the cause.

Bisect bound moves to last-good < d90bc65e30. Next probe: ba214ef3d3 (file-move point, 5 days ago) in flight; also dispatching 229cadec04 (midpoint of ba214ef3d3..d90bc65e30, 2026-05-16) to narrow in parallel.

fzyzcjy · 2026-05-19T03:50:01Z

Bisect probes: `ba214ef3d3` + `229cadec04`

PROBE B: ba214ef3d3 (ci: tag-gated nightly migration — foundation + 40 whole-file moves (#24725) — file-move point, 2026-05-14)

File: test/registered/lora/test_lora_qwen3_8b_logprob_diff.py
rerun-test run: 26073728329 — PASS ✅

PROBE C: 229cadec04 (Update logging for inplace setting in MoE layer (#25499) — midpoint of (ba214ef3d3..d90bc65e30), 2026-05-16)

rerun-test run: 26074082226 — PASS ✅

→ Bisect bound collapses to bug introduced in 229cadec04..d90bc65e30 (92 commits, 2026-05-16 → 2026-05-19).

Next probe: c58b47bc86 (Move PoolStats dataclass to scheduler_components.pool_stats_observer (#25618) — midpoint of the new range, 2026-05-18) — in flight as run 26075022728.

Bisect state so far:

SHA	Date	Subject	rerun-test verdict
`ba214ef3d3`	2026-05-14	tag-gated nightly migration — 40 whole-file moves	PASS
`229cadec04`	2026-05-16	logging update for inplace setting in MoE layer	PASS
`c58b47bc86`	2026-05-18	PoolStats dataclass move	(in flight)
`d90bc65e30`	2026-05-19	[NPU] Fix TypeError in MLA `index_head_dim`	FAIL
current HEAD	2026-05-19	(Tom's chain + 5 unrelated)	FAIL

fzyzcjy · 2026-05-19T04:01:39Z

Bisect probe: `c58b47bc86` (`Move PoolStats dataclass to scheduler_components.pool_stats_observer (#25618)` — 2026-05-18)

File: test/registered/lora/test_lora_qwen3_8b_logprob_diff.py
rerun-test run: 26075022728 — PASS ✅

→ Bisect bound collapses to bug introduced in c58b47bc86..d90bc65e30 (46 commits, 2026-05-18 → 2026-05-19).

Next probe: f04c522534 ([PD] Add conclude_state to fake KV backend (#25599) — midpoint of the new range, 2026-05-18).

Bisect state:

SHA	Date	Verdict
`ba214ef3d3`	2026-05-14	PASS
`229cadec04`	2026-05-16	PASS
`c58b47bc86`	2026-05-18	PASS ✅
`f04c522534`	2026-05-18	(in flight)
`d90bc65e30`	2026-05-19	FAIL

fzyzcjy · 2026-05-19T04:12:53Z

Bisect probe: `f04c522534` (`[PD] Add conclude_state to fake KV backend (#25599)` — 2026-05-18)

File: test/registered/lora/test_lora_qwen3_8b_logprob_diff.py
rerun-test run: 26075390772 — PASS ✅

→ Bisect bound collapses to bug introduced in f04c522534..d90bc65e30 (23 commits, 2026-05-18 → 2026-05-19).

Next probe: f5049709b3 (fix(eagle3): drop +1 offset on aux layer ids when first id != 1 (#25454) — midpoint, 2026-05-18).

Bisect state:

SHA	Date	Verdict
`ba214ef3d3`	2026-05-14	PASS
`229cadec04`	2026-05-16	PASS
`c58b47bc86`	2026-05-18	PASS
`f04c522534`	2026-05-18	PASS ✅
`f5049709b3`	2026-05-18	(in flight)
`d90bc65e30`	2026-05-19	FAIL

fzyzcjy · 2026-05-19T04:24:56Z

Bisect probe: `f5049709b3` (`fix(eagle3): drop +1 offset on aux layer ids when first id != 1 (#25454)` — 2026-05-18)

File: test/registered/lora/test_lora_qwen3_8b_logprob_diff.py
rerun-test run: 26075730388 — PASS ✅

→ Bisect bound: bug introduced in f5049709b3..d90bc65e30 (12 commits, 2026-05-18 → 2026-05-19).

Full range (suspicion-worthy commits highlighted):

1f185c6ba8 Support draft extend cuda graph for tokenspeed_mla attention backend (#25489)  ← CUDA graph
b7267e8fce [CI] Enable weight prefetch for 8-gpu-h200 basic tests (#25684)
9e3bb9a307 [Spec] fold can_run_cuda_graph into EagleVerifyOutput (#25566)                  ← CUDA graph
c904fdd20e ci: pr-states match renamed "PR Test Base" workflow_run (#25687)
6f892047ec [misc] Throw error when single batch overlap is enabled on Hopper (#25509)      ← Hopper
878e6b8886 [SP] Fix runtime_max_tokens_per_rank for sequence parallelism (#25685)          ← midpoint
745abd6cc0 Add no_combine support to cutlass_moe_fp4 (#25688)
314dedf7c6 Use SGLANG_CACHE_DIR env for gpu_p2p_access_cache path (#25686)
b79e4b1e68 [Fix] Try to fix error caused by latest cutedsl packages (#25690)                ← cutedsl
dbac464726 [Spec]: Make Triton standalone spec test deterministic (#25303)
d028697d17 [NPU][Docs] Add Kimi-K2.5-W4A8 instance doc on NPU (#25269)
d90bc65e30 [NPU] Fix TypeError in get_state_buf_infos when index_head_dim is None on MLA (#25383)

Next probe: 878e6b8886 (midpoint).

Bisect state:

SHA	Date	Verdict
`f5049709b3`	2026-05-18	PASS ✅ (last good lower bound)
`878e6b8886`	2026-05-18	(in flight)
`d90bc65e30`	2026-05-19	FAIL (first bad upper bound)

fzyzcjy · 2026-05-19T04:36:56Z

Bisect probe: `878e6b8886` (`[SP] Fix runtime_max_tokens_per_rank for sequence parallelism (#25685)` — 2026-05-18)

File: test/registered/lora/test_lora_qwen3_8b_logprob_diff.py
rerun-test run: 26076099919 — PASS ✅

→ Bisect bound: bug introduced in 878e6b8886..d90bc65e30 (6 commits).

Remaining range:

745abd6cc0 Add no_combine support to cutlass_moe_fp4 (#25688)
314dedf7c6 Use SGLANG_CACHE_DIR env for gpu_p2p_access_cache path (#25686)
b79e4b1e68 [Fix] Try to fix error caused by latest cutedsl packages (#25690)  ← prime suspect (CUDA-DSL packages)
dbac464726 [Spec]: Make Triton standalone spec test deterministic (#25303)
d028697d17 [NPU][Docs] Add Kimi-K2.5-W4A8 instance doc on NPU (#25269)
d90bc65e30 [NPU] Fix TypeError in get_state_buf_infos when index_head_dim is None on MLA (#25383)

Next probe (also the midpoint): b79e4b1e68 — the cutedsl-packages fix. This was the most suspicious commit in the wider range too (touches CUDA-DSL builds; LoRA forward → quant unquant.apply → cuBLAS path is a plausible blast radius).

Bisect state:

SHA	Date	Verdict
`878e6b8886`	2026-05-18	PASS ✅ (last good)
`b79e4b1e68`	2026-05-18	(in flight — prime suspect)
`d90bc65e30`	2026-05-19	FAIL (first bad)

fzyzcjy · 2026-05-19T04:48:53Z

Bisect probe: `b79e4b1e68` (`[Fix] Try to fix error caused by latest cutedsl packages (#25690)` — 2026-05-18)

File: test/registered/lora/test_lora_qwen3_8b_logprob_diff.py
rerun-test run: 26076486815 — FAIL ❌

Same CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS (14) fingerprint as on HEAD:

coredump: Detected an exception of type CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS (14)
RuntimeError: Rank 0 scheduler died during initialization (exit code: -6).

→ Bisect bound: bug introduced in 878e6b8886..b79e4b1e68 (3 commits inclusive of b79e4b1e68):

745abd6cc0 Add no_combine support to cutlass_moe_fp4 (#25688)
314dedf7c6 Use SGLANG_CACHE_DIR env for gpu_p2p_access_cache path (#25686)
b79e4b1e68 [Fix] Try to fix error caused by latest cutedsl packages (#25690)  ← FAIL ❌

Next probe: 314dedf7c6 (midpoint of the 3-commit range, 2026-05-18).

If PASS → offender is b79e4b1e68 itself (the cutedsl fix).
If FAIL → offender is 745abd6cc0 (cutlass_moe_fp4) or 314dedf7c6 (SGLANG_CACHE_DIR env path).

Bisect state:

SHA	Date	Verdict
`878e6b8886`	2026-05-18	PASS ✅ (last good)
`745abd6cc0`	2026-05-18	(untested)
`314dedf7c6`	2026-05-18	(in flight)
`b79e4b1e68`	2026-05-18	FAIL ❌ (first bad upper bound)
`d90bc65e30`	2026-05-19	FAIL

fzyzcjy · 2026-05-19T05:01:02Z

🤖 Posted autonomously by Claude Code acting on Tom's behalf. The 9-probe bisect (PROBE_A..I) below was driven by the agent — each probe pushed a temp branch on upstream, dispatched rerun-test.yml against it, classified the result, and narrowed the range. The @-mentions are programmatic, not Tom's personal request; please push back if anything is off.

Bisect result: `test_lora_qwen3_8b_logprob_diff.py` regressed at `b79e4b1e68` (PR #25690, `[Fix] Try to fix error caused by latest cutedsl packages`)

PROBE I (the deciding probe): 314dedf7c6 (Use SGLANG_CACHE_DIR env for gpu_p2p_access_cache path (#25686)) → rerun-test 26076870779 — PASS ✅

With 314dedf7c6 PASS and b79e4b1e68 FAIL on the immediately-following commit, the regression lands on b79e4b1e68 exactly.
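The narrowing procedure used across these probes is a standard binary search for the first bad commit. A minimal sketch is below, run against the 12-commit range listed earlier in this thread. The function name first_bad and the simulated is_bad predicate are illustrative assumptions: in the real bisect, each probe was a rerun-test dispatch rather than a local lookup.

```python
from typing import Callable, List, Sequence, Tuple


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> Tuple[str, List[str]]:
    """Binary-search the first bad commit.

    commits is ordered oldest to newest; the parent of commits[0] is known
    good and commits[-1] is known bad.
    """
    lo, hi = 0, len(commits) - 1  # invariant: the first bad commit is in [lo, hi]
    probed: List[str] = []
    while lo < hi:
        mid = (lo + hi) // 2
        probed.append(commits[mid])
        if is_bad(commits[mid]):
            hi = mid       # first bad is mid or earlier
        else:
            lo = mid + 1   # first bad is after mid
    return commits[lo], probed


# The range f5049709b3..d90bc65e30 as listed in this thread, oldest first.
RANGE = [
    "1f185c6ba8", "b7267e8fce", "9e3bb9a307", "c904fdd20e", "6f892047ec",
    "878e6b8886", "745abd6cc0", "314dedf7c6", "b79e4b1e68", "dbac464726",
    "d028697d17", "d90bc65e30",
]

# Simulated verdicts (stand-in for dispatching a CI run per probe):
# everything from b79e4b1e68 onward fails.
BAD_FROM = RANGE.index("b79e4b1e68")
culprit, probed = first_bad(RANGE, lambda sha: RANGE.index(sha) >= BAD_FROM)
print(culprit, probed)
```

With these inputs the sketch probes 878e6b8886, b79e4b1e68, and 314dedf7c6 in that order, the same sequence as the last three probes in this thread.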

Final bisect table

SHA	Date	Subject	Verdict
`ba214ef3d3`	2026-05-14	tag-gated nightly migration — 40 whole-file moves	PASS
`229cadec04`	2026-05-16	logging update for inplace setting in MoE layer	PASS
`c58b47bc86`	2026-05-18	PoolStats dataclass move	PASS
`f04c522534`	2026-05-18	[PD] Add conclude_state to fake KV backend	PASS
`f5049709b3`	2026-05-18	eagle3 aux-layer-ids +1 offset fix	PASS
`878e6b8886`	2026-05-18	[SP] Fix runtime_max_tokens_per_rank	PASS
`314dedf7c6`	2026-05-18	Use SGLANG_CACHE_DIR env for gpu_p2p_access_cache path	PASS ✅ (last good)
`b79e4b1e68`	2026-05-18	[Fix] Try to fix error caused by latest cutedsl packages (#25690)	FAIL ❌ (first bad)
`d90bc65e30`	2026-05-19	[NPU] Fix TypeError in MLA `index_head_dim`	FAIL
current HEAD	2026-05-19	(Tom's chain + a handful of unrelated)	FAIL

Offending change

PR: #25690 ([Fix] Try to fix error caused by latest cutedsl packages)
Author: @Fridge003 (Co-authored-by @hnyls2002)
Merged: 2026-05-18 23:51 UTC
Diff: 21 +, 4 -. Touches python/pyproject.toml (switches flashinfer_python and nvidia-cutlass-dsl to the [cu13] extras variant) and scripts/ci/cuda/ci_install_dependency.sh (regex-update for [extras] notation + new purge_cutlass_libs_base() step that uninstalls nvidia-cutlass-dsl-libs-base then force-reinstalls nvidia-cutlass-dsl-libs-cu13).

The PR's own commit message explains the original bug it was fixing:

nvidia-cutlass-dsl[cu13] extras are additive on PyPI: requires_dist always pulls -libs-base AND -libs-cu13 when [cu13] is requested. Both wheels write to the same site-packages paths with different content, leaving the wrapper (cutlass.py, cu13 style) mismatched with the binding (_gpu_ops_gen.py, base style) -> GPUModuleOp signature TypeError.

The fix correctly purges -libs-base in the install script, but the LoRA Qwen3-8B forward path with CUDA graph capture now hits a kernel-side illegal address — so either the cu13 wheel's compiled kernel is broken for this path, or the purge_cutlass_libs_base step doesn't actually win in all install orderings.

Failure fingerprint (every FAIL probe + current HEAD)

coredump: Detected an exception of type CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS (14)
Fatal Python error: Aborted
RuntimeError: Rank 0 scheduler died during initialization (exit code: -6).

Python call stack at the abort thread:
  File ".../python/sglang/srt/layers/quantization/unquant.py", line 161 in apply
  File ".../python/sglang/srt/lora/layers.py", line 724 in forward
  ...
  File ".../python/sglang/srt/model_executor/cuda_graph_runner.py", line 1112 in run_once
  File ".../python/sglang/srt/model_executor/cuda_graph_runner.py", line 1134 in capture_one_batch_size
  File ".../python/sglang/srt/model_executor/cuda_graph_runner.py", line 707 in __init__
  File ".../python/sglang/srt/model_executor/model_runner.py", line 2776 in init_device_graphs

Reproduce

# Probe last good (PASS):
git push upstream 314dedf7c6:refs/heads/tmp-good
gh workflow run rerun-test.yml --repo sgl-project/sglang --ref tmp-good \
  -f mode=cuda -f test_command="registered/lora/test_lora_qwen3_8b_logprob_diff.py" \
  -f runs_on="1-gpu-h100" -f install_script="scripts/ci/cuda/ci_install_dependency.sh"

# Probe first bad (FAIL):
git push upstream b79e4b1e68:refs/heads/tmp-bad
gh workflow run rerun-test.yml --repo sgl-project/sglang --ref tmp-bad \
  -f mode=cuda -f test_command="registered/lora/test_lora_qwen3_8b_logprob_diff.py" \
  -f runs_on="1-gpu-h100" -f install_script="scripts/ci/cuda/ci_install_dependency.sh"

cc @Fridge003 @hnyls2002 — could you take a look? This regression has been on main since 2026-05-18 and is currently surfacing as extra-a-test-1-gpu-large (0) on the main-CI sandbox.

Diagnostic revert PR opened for verification: #25743 — /rerun-test of the failing LoRA file is pending there.

fzyzcjy · 2026-05-19T05:22:57Z

🤖 Posted autonomously by Claude Code acting on Tom's behalf. Bidirectional confirmation of the bisect result via paired diagnostic PRs.

Bisect confirmed via paired diagnostic PRs

Two sibling PRs were opened to nail down b79e4b1e68 (#25690) as the root cause:

PR	What it does	`/rerun-test` LoRA file verdict	Run
#25743	Reverts `b79e4b1e68`	PASS ✅	26077407201
#25744	No revert; only a 1-line sentinel comment in `python/sglang/version.py` so the PR isn't auto-closed for 0-diff	FAIL ❌ (same `CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS (14)` fingerprint)	26077826917

Together with the per-commit bisect probes above, that's three independent lines of evidence:

Walking from a known-good 2026-05-14 down to b79e4b1e68 (9 probes, all consistent with PASS-then-FAIL at the exact commit boundary).
Revert-the-commit → PASS on the same test file.
Don't-revert (plain main + harmless touch) → FAIL on the same test file with identical fingerprint.

The regression is unambiguously b79e4b1e68 (#25690) — independent of Tom's #25703–#25728 chain.

cc @Fridge003 @hnyls2002 — could you take a look? Closing the two diagnostic PRs now.

fzyzcjy · 2026-05-31T03:11:44Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Confirming the earlier gemma4-MTP GSM8K failure was a flake via the /rerun-test result plus a cross-branch data point. Please push back if any conclusion is off.

Flake confirmed: `test_gemma4_mtp_31b_extra.py`

/rerun-test on this sandbox head (c6e27e08fb): ✅ PASS — https://github.com/sgl-project/sglang/actions/runs/26701330481
Cross-branch: the same test also passed on a fresh tip-of-main rerun (~02:42 UTC) and on an unrelated PR — so the 0.770 vs 0.775 miss is GSM8K run-to-run variance, not a regression.

Remaining red lanes are all non-CUDA / chronic and not gating for a main-verification sandbox: base-c-test-8-gpu-h20 (h20 — ignored per policy), XPU, NPU (stage-b-test-1-npu-a2, pr-test-npu-finish), AMD (amd-nondeterministic, mi35x-disaggregation-amd), and the finish cascade. The CUDA gate is otherwise green; ~20 CUDA jobs still finishing.

fzyzcjy · 2026-05-31T04:11:59Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Logging the one newly-surfaced non-CUDA lane failure (Xeon CPU) with its fingerprint, classified as unrelated to this main-verification sandbox. Please push back if any conclusion is off.

Non-CUDA lane: `build-test (xeon-gnr, base-b-test-cpu)` — not gating

Failing file: test/registered/cpu/test_cpu_graph.py (exit 1). All other files in the job passed (test_soft_watchdog.py passed — its watchdog-timeout / py-spy lines are that test's intentional behavior, not an error).
Fingerprint:

UserWarning: Triton is not supported on current platform, roll back to CPU.
RuntimeError: Failed to parse benchmark output. prefill_latency=None, decode_throughput=None, decode_latency=None
[rank0]: AttributeError: 'NoneType' object has no attribute 'shape'

Classification: CPU-backend (Xeon) issue, not CUDA, and this is the no-diff main-verification sandbox — unrelated to anything here. Per lane policy (non-CUDA, not caused by a diff) it is not chased.

CUDA gate so far: green except the already-confirmed gemma4 GSM8K flake; a couple of CUDA h200 jobs still finishing. Other reds remain h20 (ignored) / NPU / AMD-mi35x / cascade finish aggregations.

fzyzcjy · 2026-05-31T05:41:07Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Final summary of the full main-CI verification run (head = current upstream/main, i.e. the just-landed 24-PR KV-canary chain). Closing the sandbox afterward for reuse. Please push back if any conclusion is off.

✅ CUDA gate GREEN — main verification complete

Head SHA c6e27e08fb (= upstream/main + the no-op sentinel). All CUDA lanes passed: base-b, base-c (incl. 8-gpu h100 / h200 / b200), extra-a, extra-b, piecewise-cuda-graph, etc.

The only CUDA red was a confirmed flake:

extra-a-test-2-gpu-large → test_gemma4_mtp_31b_extra.py GSM8K 0.770 vs 0.775 (1 question / 200). /rerun-test on this exact SHA passed (run 26701330481); same test also green on tip-of-main.
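A quick back-of-the-envelope check of the "1 question / 200" reading, as a sketch. It assumes that 0.770 is the measured accuracy and 0.775 the pass threshold, and it models the 200 questions as independent Bernoulli trials, which is a simplification introduced here and not something the test itself states.

```python
import math

n_questions = 200   # from "1 question / 200" in the thread
measured = 0.770    # assumed: observed accuracy
threshold = 0.775   # assumed: pass threshold

# How many questions the miss corresponds to.
missed_by = round((threshold - measured) * n_questions)

# Binomial standard error of an accuracy near the threshold with n = 200.
std_err = math.sqrt(threshold * (1 - threshold) / n_questions)
gap_in_std_errs = (threshold - measured) / std_err

print(f"gap = {missed_by} question(s)")
print(f"standard error = {std_err:.4f}")
print(f"gap = {gap_in_std_errs:.2f} standard errors")
```

Under these assumptions the gap is well below one standard error, which is consistent with classifying the miss as run-to-run variance.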

Remaining red lanes are non-gating (non-CUDA / chronic / cascade), none related to the landed chain:

base-c-test-8-gpu-h20 → h20 (chronic, ignored per policy)
stage-a-test-1-gpu-xpu → XPU; stage-b-test-1-npu-a2 → NPU; stage-b/stage-c ...-amd, ...-mi35x-disaggregation-amd → AMD
build-test (xeon-gnr, base-b-test-cpu) → Xeon CPU test_cpu_graph.py benchmark-parse issue (CPU backend, unrelated)
finish / pr-test-finish / pr-test-extra-finish / pr-test-npu-finish → aggregation jobs cascading from the above

Conclusion: the KV-canary feature, landed on main via the 24-PR chain (#26798–#26821), is CUDA-CI green. Closing this sandbox PR (do not merge) so it's ready for the next reuse.

fzyzcjy · 2026-06-06T01:20:01Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Triaging this sandbox main-CI round; will follow up once logs are fetchable. Please push back if any conclusion is off.

Round status (head 96c5c6e1db = main bf4f2ccc78 + sentinel):

base-a-test-cpu (0) (PR Test Base): FAILURE, annotation only says Process completed with exit code 255. Log fetch pending REST rate-limit reset (~50 min). Note: the same content (the scripted-runtime chain, #27410 "Add kv_canary PP self-test fixture and SWA divergence coverage" through #27413 "Add scripted-runtime unit, core integration, and chunked-prefill tests") passed this job on #26991 ("DO NOT MERGE - scripted runtime"), so a flake or a new-main interaction is suspected. Will classify from the log, then /rerun-failed-ci after the full round completes.
stage-a-test-1-gpu-xpu + XPU finish: chronic XPU runner infra (checkout EACCES leftover-file pattern seen on previous rounds), non-CUDA lane, not a gate.

Remaining ~95 jobs still running; will batch any reruns after the round lands.

fzyzcjy · 2026-06-06T03:22:23Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Fetched the failing job log after the rate-limit reset and classified the failure. Please push back if any conclusion is off.

base-a-test-cpu (0) root cause: HF Hub rate-limit (infra flake, not code).

test/registered/unit/server_args/test_server_args.py failed because Hugging Face Hub returned 429 Too Many Requests for Qwen/Qwen2.5-1.5B-Instruct/resolve/main/config.json, and the retry also could not connect (job log):

httpx.HTTPStatusError: Client error '429 Too Many Requests' for url 'https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct/resolve/main/config.json'
OSError: We couldn't connect to 'https://huggingface.co' to load the files, and couldn't find them in the cached files.
✗ FAILED: test/registered/unit/server_args/test_server_args.py (exit code 1)

Other reds this round: AMD lane (27 jobs — ongoing repo-wide AMD outage), NPU a2 (recurring perf flake), XPU (chronic runner infra). None CUDA, none code-related.

Plan: wait for the ~13 still-running jobs to land, then /rerun-failed-ci once.

fzyzcjy · 2026-06-06T06:57:09Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Classified the second CUDA-lane failure of this round from the job log. Please push back if any conclusion is off.

base-c-test-4-gpu-h100 (3): marginal KL-divergence threshold exceedance (likely numeric flake).

test/registered/models_e2e/test_qwen3_next_models.py failed with (job log):

AssertionError: avg_kl_div=0.0015218577479656225 > threshold=0.001 for Qwen/Qwen3-Next-80B-A3B-Instruct test_input_output_logprobs_match_prefill_cache_hit_helper

Marginal exceedance (1.5e-3 vs 1e-3 threshold) on a logprob-consistency check.
The content under test (the scripted-runtime chain, #27410 "Add kv_canary PP self-test fixture and SWA divergence coverage" through #27413 "Add scripted-runtime unit, core integration, and chunked-prefill tests") is test-only / env-gated (SGLANG_TEST_SCRIPTED_RUNTIME default off) and does not touch qwen3-next or logprob numerics; the same content passed this suite on #26991 ("DO NOT MERGE - scripted runtime").
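For readers unfamiliar with this kind of check, here is a sketch of one common way to compute an average KL divergence between two sets of per-position log-probabilities and compare it to a threshold. The thread does not show how the test derives avg_kl_div, so the formulation, the helper names, and the toy distributions below are all assumptions for illustration; only the 0.001 threshold comes from the log.

```python
import math
from typing import List, Sequence


def kl_div_from_logprobs(p_logprobs: Sequence[float], q_logprobs: Sequence[float]) -> float:
    """KL(P || Q) for two distributions given as log-probabilities over the same support."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(p_logprobs, q_logprobs))


def avg_kl_div(p_rows: Sequence[Sequence[float]], q_rows: Sequence[Sequence[float]]) -> float:
    """Mean KL divergence across positions (one distribution per row)."""
    pairs = list(zip(p_rows, q_rows))
    return sum(kl_div_from_logprobs(p, q) for p, q in pairs) / len(pairs)


def to_logprobs(probs: Sequence[float]) -> List[float]:
    return [math.log(x) for x in probs]


# Made-up toy distributions over a 3-token vocabulary at two positions.
baseline = [to_logprobs([0.70, 0.20, 0.10]), to_logprobs([0.50, 0.30, 0.20])]
cache_hit = [to_logprobs([0.69, 0.21, 0.10]), to_logprobs([0.50, 0.30, 0.20])]

THRESHOLD = 1e-3  # threshold value quoted in the failing assertion
value = avg_kl_div(baseline, cache_hit)
print(f"avg_kl_div={value:.6g} within_threshold={value <= THRESHOLD}")
```

A check of this form is sensitive to small numeric differences between two code paths that should produce the same distribution, so a reading slightly above the threshold can come from nondeterministic kernels as well as from a real regression.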

Round summary (running=0): CUDA reds = this + base-a-test-cpu (0) (HF Hub 429, infra) + pr-test-finish cascade. Non-CUDA reds = AMD lane outage (27), NPU a2 perf flake, XPU chronic infra.

Next: one batched /rerun-failed-ci.

fzyzcjy · 2026-06-06T06:57:15Z

/rerun-failed-ci

fzyzcjy · 2026-06-07T09:47:42Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Triaged the first CI failures of this verify-main run (head ffbe2e8 = main 0a190d1); classification below. Please push back if any conclusion is off.

stage-a-test-1-gpu-xpu / finish (job): runner-level infra failure during workspace cleanup, before any test ran:

##[error]File was unable to be removed Error: EACCES: permission denied, unlink '.../python/sglang.egg-info/PKG-INFO'

Classification: infra (self-hosted XPU runner permission residue), non-CUDA lane, unrelated to main's code. Not chasing per babysit policy; CUDA lanes remain the hard gate.

fzyzcjy · 2026-06-07T10:46:20Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Triaged all non-CUDA failures on this verify-main run (head ffbe2e8 = main 0a190d1, which includes merged #27445 + #27446). All failures are non-CUDA lanes in code paths the merged PRs do not touch (the PRs only change scheduler PP idle-gating in is_fully_idle + the scripted-runtime test harness). CUDA lanes remain green. Please push back if any conclusion is off.

Non-CUDA failures (not chasing per babysit policy — none CUDA, none related to the merged code):

Lane	Job	Fingerprint	Class
XPU	79948782973	`EACCES: permission denied, unlink .../sglang.egg-info/PKG-INFO`	infra (runner cleanup)
Xeon CPU	79948776769	`decode: expect req_lens to be int64, got Int`; `--sampling-backend: invalid choice: 'token_oracle'`; exit -9	CPU-backend, pre-existing on main
NPU	79948784495	`AssertionError: 672.30 not greater than or equal to 700` (w8a8 throughput threshold)	perf-threshold flake
AMD mi325 (stage-c)	79948805518	registry pull timeout; `Residual accuracy check failed` (fused residual kernel)	chronic stage-c / infra
AMD mi35x (stage-c)	79948805511	registry pull timeout; `Fatal Python error: Aborted` (exit -6) + ConnectionRefused cascade	chronic stage-c / infra
finish / pr-test-npu-finish	—	rollup cascade of the above	cascade

The merged PRs touch no XPU/NPU/AMD/Xeon code, no sampling backends, no quantization or fused-residual kernels. Continuing to watch CUDA lanes (the hard gate) to completion.
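The fingerprint-to-class mapping in the table can be sketched as a small rule list. This is an illustration under assumptions: the `RULES` list and `classify` function are hypothetical and are not part of the babysit tooling; only the fingerprint strings come from the table.

```python
import re

# (pattern, class) pairs; the first matching rule wins.
# Patterns mirror fingerprints from the table; the rule list itself is illustrative.
RULES = [
    (re.compile(r"EACCES: permission denied, unlink"), "infra (runner cleanup)"),
    (re.compile(r"registry pull timeout"), "chronic stage-c / infra"),
    (re.compile(r"not greater than or equal to \d+"), "perf-threshold flake"),
    (re.compile(r"expect req_lens to be int64"), "CPU-backend, pre-existing on main"),
]


def classify(log_text: str) -> str:
    """Return the class of the first rule whose pattern appears in the log."""
    for pattern, label in RULES:
        if pattern.search(log_text):
            return label
    return "unclassified"


print(classify("AssertionError: 672.30 not greater than or equal to 700"))
print(classify("decode: expect req_lens to be int64, got Int"))
print(classify("some new failure"))
```

Anything that falls through to `unclassified` would still need a manual read of the job log.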

fzyzcjy · 2026-06-07T13:18:51Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Triaged the first (and so far only) CUDA-lane failure of this verify-main run; cross-branch evidence below shows it reproduces identically on an independent scheduled main run, and a pre-merge-main probe has been dispatched. Please push back if any conclusion is off.

CUDA failure: `base-c-test-8-gpu-h200 (2)` — `test/registered/models_e2e/test_mimo_v2.py`

Job 79948809861 — server for XiaomiMiMo/MiMo-V2.5 (tp=8, dp=2, EAGLE MTP, fp8) crashes 2s after becoming HTTP-ready, during the first warmup generate:

/pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:109: _assert_async_cuda_kernel:
Assertion `index >= 152576 (out of range): VocabParallelEmbedding input id` failed.   (x4 ranks)
coredump: Detected an exception of type CUDBG_EXCEPTION_WARP_ASSERT (12)
Fatal Python error: Aborted                                                             (x4)
...
TimeoutError: Server failed to start within the timeout period

Cross-branch evidence

Branch	Run	test_mimo_v2	Fingerprint
sandbox (main `0a190d1c9` + sentinel)	27088945685	✗ FAIL	VocabParallelEmbedding input id out of range
`main` scheduled (`a07d813ec`, independent runner)	27091400009	✗ FAIL	byte-identical
`main` pre-#27445/#27446 (`a39c428d3`)	27093698014 (probe dispatched)	pending	—

Classification

Pre-existing main regression, deterministic (2/2 independent runs), unrelated to #27445/#27446: the merged PRs touch only scripted-runtime test harness files and PP idle-gating in is_fully_idle (short-circuited at pp_size==1; this server is pp=1). The failing path is the model-side out-of-range-token-id async assert (same family as the tp=1 fix in #27482) on MiMo-V2.5's first warmup forward. Will report the pre-merge probe result when it completes.

fzyzcjy · 2026-06-08T07:04:32Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Triaged the two current CI failures on this main-verification sandbox run; classified both as non-code issues (one infra flake, one chronic-hardware lane). Please push back if any conclusion is off.

CI triage — head `d16481061d` (latest `upstream/main` + version.py sentinel only)

This sandbox PR's only diff is a one-line comment in python/sglang/version.py, so any failure here reflects main/infra, not a code change.

1. extra-a-test-1-gpu-small (PR Test Extra) — infra flake, will rerun

Job log

Failed in the dependency-install step, before any test ran:

× Failed to download `sglang-kernel==0.4.3+cu130`
├─▶ Request failed after 3 retries
╰─▶ HTTP status server error (504 Gateway Timeout) for url
    https://github.com/sgl-project/whl/releases/download/v0.4.3/sglang_kernel-0.4.3+cu130-...whl

Classification: infra (transient GitHub-releases 504). Plan: /rerun-failed-ci once the current round finishes (batching, not fix-by-fix).

2. base-c-test-8-gpu-h20 (PR Test Base) — h20 lane, ignored

Job log
Per chronic-H20 policy, H20 machine issues are ignored and not rerun.

~99 checks still running; will rerun the genuine infra flake after the round completes.

fzyzcjy · 2026-06-08T07:38:20Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Triaged this round's failures: 7 share a transient GitHub-releases 504 at install (infra), h20 is the chronic-ignore lane, 3 non-CUDA lanes are unrelated, and one B200 LoRA job surfaced a real, freshly-merged main regression traced to #27063. Please push back if any conclusion is off.

CI triage round 2 — head `d16481061d` (latest `upstream/main` + version.py sentinel only)

🔴 Real main regression (CUDA) — extra-b-test-4-gpu-b200

Test: test/registered/lora/test_lora_gpt_oss_20b_logprob_diff.py
Job log

File "python/sglang/srt/models/gpt_oss.py", line 294, in forward_normal
    hidden_dim_unpadded = self.experts.hidden_size
AttributeError: 'FusedMoEWithLoRA' object has no attribute 'hidden_size'

Root cause: #27063 "[AMD] Optimize gpt-oss-120B performance" (commit 1c73ff8ad3, merged today) changed gpt_oss.py to read self.experts.hidden_size. When LoRA is active, self.experts is wrapped by FusedMoEWithLoRA (python/sglang/srt/lora/layers.py:862), which copies many base-layer attrs in __init__ but not hidden_size, so the access raises. Deterministic (not a flake) and unrelated to this sandbox diff → pre-existing on main.
Fix direction: expose hidden_size on FusedMoEWithLoRA (proxy to base_layer), or read it from the base layer in gpt_oss.py.
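A minimal sketch of the two fix directions, using stand-in classes. Everything here is hypothetical: `BaseExperts`, `WrapperCopyingAttrs`, `WrapperWithProxy`, and `read_hidden_size` are illustrative names, not sglang's actual classes, and the value 128 is arbitrary.

```python
class BaseExperts:
    """Stand-in for the wrapped MoE layer (not sglang's real class)."""

    def __init__(self, hidden_size: int):
        self.hidden_size = hidden_size


class WrapperCopyingAttrs:
    """Mimics the failure mode: wraps a base layer but never exposes hidden_size."""

    def __init__(self, base_layer):
        self.base_layer = base_layer


class WrapperWithProxy(WrapperCopyingAttrs):
    """Option 1: expose hidden_size on the wrapper by proxying to base_layer."""

    @property
    def hidden_size(self) -> int:
        return self.base_layer.hidden_size


def read_hidden_size(experts) -> int:
    """Option 2: the caller unwraps base_layer when it is present."""
    return getattr(experts, "base_layer", experts).hidden_size


base = BaseExperts(hidden_size=128)

try:
    WrapperCopyingAttrs(base).hidden_size
except AttributeError as exc:
    print("reproduces the error class:", exc)

print(WrapperWithProxy(base).hidden_size)
print(read_hidden_size(WrapperCopyingAttrs(base)))
```

Option 1 keeps the model code unchanged and fixes the wrapper once; option 2 puts the knowledge of wrapping into every caller.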

🟡 Infra — transient GitHub-releases 504 at install (7 jobs, batch-rerun pending)

extra-a-test-1-gpu-small, base-b-test-1-gpu-small shards (0/4/5/7), base-c-test-4-gpu-h100, base-c-test-4-gpu-gb300
All died in the dependency-install step with the identical error:

× Failed to download `sglang-kernel==0.4.3+cu130`
╰─▶ HTTP status server error (504 Gateway Timeout)
    github.com/sgl-project/whl/releases/download/v0.4.3/sglang_kernel-0.4.3+cu130-...whl

Classification: transient CDN incident, not code. Will /rerun-failed-ci once the current round finishes.

⚪ h20 — ignored

base-c-test-8-gpu-h20: chronic H20 machine issues, ignored per policy.

🔵 Non-CUDA lanes (not this PR; noted, not chased)

PR Test (Xeon) base-b-test-cpu: RuntimeError: decode: expect req_lens to be int64, got Int in test_external_models.py — looks like a real CPU-path issue on main.
PR Test (XPU) stage-a-test-1-gpu-xpu: exit 137 (killed).
PR Test (NPU) stage-b-test-1-npu-a2: self-hosted-runner/container failure (infra).

~66 checks still running. Next: batch-rerun the 504 infra jobs after the round completes. The B200 regression needs a code fix (#27063); reruns will not clear it.

fzyzcjy · 2026-06-08T09:06:49Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Correcting an earlier guess: the 3 extra b200 shards + base-b-large are also the 504 install flake, not the LoRA bug — so exactly one real failure stands. Please push back if any conclusion is off.

Round 3 — head `d16481061d` (106 pass / 4 running / 21 fail)

With REST budget recovered I fingerprinted the remaining CUDA reds. Correction to round 2: extra-b-b200 (1), base-c-b200 (1), base-b-b200 (1), and base-b-large (8) are all the same 504 install flake (Failed to download sglang-kernel==0.4.3+cu130 → 504 Gateway Timeout), not the #27063 LoRA bug. My earlier "likely same bug" guess for b200 shard 1 was wrong.

Net CUDA picture:

1 real bug: extra-b-b200 (0) → #27063, FusedMoEWithLoRA missing hidden_size (reported on #27063).
11 transient 504 install flakes: extra-a, base-b-small (0/4/5/7), base-b-large (8), base-b-b200 (1), base-c-b200 (1), base-c-h100 (0), base-c-gb300 (0), extra-b-b200 (1).
h20: ignored (chronic).

4 jobs still running (AMD mi35x-disagg; base-c-h100 shards 3/4; base-c-b200 shard 3). Once they settle I'll /rerun-failed-ci to clear the 504 batch; the #27063 regression stays red until fixed.

fzyzcjy · 2026-06-08T09:37:58Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Update: a b200 shard (base-c-b200 (3)) actually ran instead of 504ing and exposed a SECOND #27063 regression on the core gpt-oss MXFP4 path — so it's 2 real bugs, not 1. Please push back if any conclusion is off.

Round 4 — head `d16481061d` (107 pass / 1 queued / 24 fail)

Correction to round 3's "exactly 1 real bug": 2 real B200 regressions, both from #27063:

MXFP4 (new): base-c-b200 (3) → test_gpt_oss_4gpu_mxfp4.py: RuntimeError: shape '[4096, 3072]' is invalid for input of size 11796480 at gpt_oss.py:320. The experts' output is 2880 wide, but hidden_dim_unpadded = self.experts.hidden_size resolves to the padded width, 3072. This is the core (non-LoRA) serving path. (Reported on #27063.)
LoRA: extra-b-b200 (0) → FusedMoEWithLoRA missing hidden_size at gpt_oss.py:294. (Reported on #27063.)
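The 2880-vs-3072 mismatch can be checked directly from the numbers in the RuntimeError. The sketch below is plain arithmetic on the values quoted above; the variable names are illustrative.

```python
numel = 11_796_480   # "input of size" from the RuntimeError
tokens = 4096        # first dimension of the requested shape
padded_width = 3072  # second dimension of the requested shape

# Width the tensor actually has, given its element count and the token dimension.
actual_width, remainder = divmod(numel, tokens)
print(actual_width, remainder)

# Element count the requested [4096, 3072] view would need.
required = tokens * padded_width
print(required, required - numel)
```

The element count divides evenly into 4096 rows of 2880, so the tensor is already at the unpadded width and a view to 3072 columns cannot succeed.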

Everything else unchanged: 11 transient 504 install flakes, h20 ignored, 9 non-CUDA (AMD/Xeon/XPU/NPU) noted-not-ours, plus the extra/npu/amd finish aggregates.

1 job still queued (base-c-h100 (4)). Once it settles I'll /rerun-failed-ci to clear the 504 batch; the two #27063 regressions stay red until fixed.

fzyzcjy · 2026-06-08T10:07:33Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. First full round complete: 2 real B200 regressions (both #27063), and 11 CUDA jobs that died at install on a transient GitHub-releases 504 — those never ran their tests, so I'm re-running the failed set to actually verify those suites on main. Please push back if any conclusion is off.

Round 1 complete — head `d16481061d` (107 pass / 25 fail / 25 skip / 2 cancelled)

🔴 Real main regressions (2, both from #27063):

Lane / test	Failure	Path
`base-c-b200 (3)` · `test_gpt_oss_4gpu_mxfp4.py`	`RuntimeError: shape '[4096, 3072]' is invalid for input of size 11796480` (`gpt_oss.py:320`)	core gpt-oss MXFP4
`extra-b-b200 (0)` · `test_lora_gpt_oss_20b_logprob_diff.py`	`AttributeError: 'FusedMoEWithLoRA' object has no attribute 'hidden_size'` (`gpt_oss.py:294`)	gpt-oss + LoRA

Both reported on #27063. They will re-fail on rerun (deterministic).

🟡 Transient install 504 — 11 CUDA jobs whose tests never ran: extra-a, base-b-small (0/4/5/7), base-b-large (8), base-b-b200 (1), base-c-b200 (1), base-c-h100 (0), base-c-gb300 (0), extra-b-b200 (1). Same Failed to download sglang-kernel==0.4.3+cu130 → 504 across all; 107 other jobs pulled the same wheel fine, so the CDN incident is over. Re-running these to actually exercise their suites.

⚪ h20 ignored. 🔵 Non-CUDA (not ours): Xeon req_lens int64, XPU exit-137, NPU container, AMD mi325 / mi35x / mi35x-disagg. ⚫ Cancelled (not failures): base-c-h100 (4), notify-pr-states.

Triggering /rerun-failed-ci next. After the rerun settles I'll confirm the 504 suites pass, leave the two #27063 regressions as the standing verdict, and close this sandbox PR (do not merge).

fzyzcjy · 2026-06-08T10:07:42Z

/rerun-failed-ci

fzyzcjy · 2026-06-08T12:36:12Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Final verdict for this tip-of-main CI verification: after clearing a transient GitHub-releases 504 install incident (all 11 affected CUDA jobs passed on rerun), the only real CUDA failures on main are 2 deterministic B200 regressions, both from #27063. Closing this sandbox PR (do not merge). Please push back if any conclusion is off.

Final verdict — head `d16481061d` (latest `upstream/main` + version.py sentinel)

123 pass / 25 skip / 11 fail after rerunning the transient-504 batch.

🔴 Real main regressions — 2, both from #27063 ("[AMD] Optimize gpt-oss-120B performance", 1c73ff8ad3):

Lane / test	Failure	Confirmed
`base-c-b200 (3)` · `test_gpt_oss_4gpu_mxfp4.py`	`RuntimeError: shape '[4096, 3072]' is invalid for input of size 11796480` (`gpt_oss.py:320`) — core gpt-oss MXFP4 path	failed 2/2 runs
`extra-b-b200 (0)` · `test_lora_gpt_oss_20b_logprob_diff.py`	`AttributeError: 'FusedMoEWithLoRA' object has no attribute 'hidden_size'` (`gpt_oss.py:294`) — gpt-oss + LoRA	failed 2/2 runs

Both reported on #27063. Root cause: #27063's hidden_dim_unpadded = self.experts.hidden_size assumes that attribute is the experts' unpadded output width, but it's the padded width for the B200 MXFP4 FusedMoE (3072 vs 2880), and the LoRA wrapper (FusedMoEWithLoRA) doesn't expose it at all. pr-test-finish (Base) and pr-test-extra-finish are red only because of these two.

✅ Transient infra, fully resolved: 11 CUDA jobs initially died at install with Failed to download sglang-kernel==0.4.3+cu130 → 504 Gateway Timeout (GitHub-releases CDN incident). All 11 passed on rerun (including base-c-h20), confirming their suites are green on main and the 504 was purely transient.

🔵 Non-CUDA lanes (pre-existing, not gpt-oss-related, not chased):

PR Test (Xeon) base-b-test-cpu: RuntimeError: decode: expect req_lens to be int64, got Int (test_external_models.py).
PR Test (NPU) stage-b-test-1-npu-a2: self-hosted-runner/container failure.
PR Test (AMD) stage-c mi325 + stage-c mi35x + stage-b mi35x-disaggregation (+ AMD/NPU finish aggregates).

Net: main CUDA is green except the 2 #27063 B200 regressions. Closing this sandbox PR — not merged.

fzyzcjy added run-ci bypass-fastfail run-ci-extra labels May 18, 2026

fzyzcjy closed this May 18, 2026

fzyzcjy reopened this May 19, 2026

This was referenced May 19, 2026

Revert #25690 to unblock LoRA Qwen3-8B CUDA graph capture on main #25743

Closed

Probe LoRA Qwen3-8B CUDA fail on plain main (negative control, NOT a fix) #25744

Closed

fzyzcjy mentioned this pull request May 19, 2026

[Fix] Try to fix error caused by latest cutedsl packages #25690

Merged

5 tasks

fzyzcjy closed this May 31, 2026

fzyzcjy reopened this Jun 6, 2026

fzyzcjy force-pushed the tom/sandbox-verify-main-ci branch from c6e27e0 to 96c5c6e Compare June 6, 2026 01:12

fzyzcjy closed this Jun 6, 2026

fzyzcjy reopened this Jun 7, 2026

fzyzcjy force-pushed the tom/sandbox-verify-main-ci branch from 96c5c6e to ffbe2e8 Compare June 7, 2026 09:42

fzyzcjy force-pushed the tom/sandbox-verify-main-ci branch from ffbe2e8 to d164810 Compare June 8, 2026 06:57

fzyzcjy mentioned this pull request Jun 8, 2026

[AMD] Optimize gpt-oss-120B performance #27063

Merged

5 tasks

fzyzcjy closed this Jun 8, 2026

Sandbox: verify full main CI on latest main (20260609T122134Z)

03cde0e

fzyzcjy reopened this Jun 9, 2026

fzyzcjy force-pushed the tom/sandbox-verify-main-ci branch from d164810 to 03cde0e Compare June 9, 2026 12:25

fzyzcjy closed this Jun 9, 2026
