[ci] split stage-c-test-4-gpu-b200 to enable a low-disk runner pool #23417
Merged
Kangyan-Zhou merged 1 commit into main on Apr 22, 2026
Splits the per-commit B200 stage-c suite into two:
- stage-c-test-4-gpu-b200: only the 3 DeepSeek V3/V3.2 FP4 tests (~800GB
cached models). Stays on existing large-disk runners via the
4-gpu-b200{,-kernel} label.
- stage-c-test-4-gpu-b200-small: the other 9 tests (Qwen3.5 FP4, gpt-oss-120B,
3 LoRA tests, cutedsl_moe, eagle, update-weights). Targets the new
4-gpu-b200{,-kernel}-low-disk label, which is also advertised by the
existing large-disk runners — so the small suite is eligible to land on
either pool, while DeepSeek tests are isolated to the large-disk pool.
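The routing above follows from how self-hosted runner labels are matched: a job is only assigned to a runner that carries every label listed in its runs-on. A minimal sketch of that rule, assuming hypothetical pool names (large_disk, low_disk) and covering only the non-kernel labels:

```python
# Hypothetical model of the two runner pools. The pool names are illustrative;
# the label strings are the ones named in this PR.
POOL_LABELS = {
    "large_disk": {"4-gpu-b200", "4-gpu-b200-low-disk"},
    "low_disk": {"4-gpu-b200-low-disk"},
}

# Label each suite requests via runs-on (non-kernel variant only).
SUITE_RUNS_ON = {
    "stage-c-test-4-gpu-b200": {"4-gpu-b200"},
    "stage-c-test-4-gpu-b200-small": {"4-gpu-b200-low-disk"},
}


def eligible_pools(suite: str) -> list[str]:
    """Return the pools whose labels include every label the suite requests."""
    wanted = SUITE_RUNS_ON[suite]
    return sorted(pool for pool, labels in POOL_LABELS.items() if wanted <= labels)


print(eligible_pools("stage-c-test-4-gpu-b200"))        # ['large_disk']
print(eligible_pools("stage-c-test-4-gpu-b200-small"))  # ['large_disk', 'low_disk']
```

The kernel variants (4-gpu-b200-kernel and 4-gpu-b200-kernel-low-disk) would follow the same pattern.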
Also moves three 4-gpu-b200 tests that aren't critical for per-commit signal
to the existing nightly-4-gpu-b200 suite to free up per-commit time:
- test_fp8_blockwise_gemm.py
- test_nvfp4_gemm.py
- test_nvidia_nemotron_3_super_nvfp4.py
Wiring updates:
- pr-test.yml: new b200_low_disk_runner output from set-runner; new
stage-c-test-4-gpu-b200-small job duplicated from the existing one,
partition size dropped 6→3 for both jobs to match the smaller per-suite
test counts.
- slash_command_handler.py: add the new suite to nvidia_stages and
CUDA_SUITE_TO_RUNNER so /rerun-stage and per-file slash dispatch work.
- test/run_suite.py: register the new suite name in PER_COMMIT_SUITES.
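The PR summary describes the 6→3 partitions as LPT-balanced. LPT (longest processing time first) is a greedy heuristic: sort tests by estimated duration, longest first, and give each one to the partition with the smallest load so far. The sketch below illustrates the heuristic only; it is not the repository's partitioner, and the file names and durations are made up for the example:

```python
import heapq


def lpt_partition(durations: dict[str, int], num_partitions: int):
    """Greedy LPT: assign each test, longest first, to the least-loaded partition."""
    # Min-heap of (current_load, partition_index).
    heap = [(0, i) for i in range(num_partitions)]
    heapq.heapify(heap)
    partitions = [[] for _ in range(num_partitions)]

    for name, seconds in sorted(durations.items(), key=lambda kv: kv[1], reverse=True):
        load, idx = heapq.heappop(heap)
        partitions[idx].append(name)
        heapq.heappush(heap, (load + seconds, idx))

    loads = [0] * num_partitions
    for load, idx in heap:
        loads[idx] = load
    return partitions, loads


# Hypothetical test files and estimated times in seconds.
example_durations = {
    "test_a.py": 900, "test_b.py": 700, "test_c.py": 600,
    "test_d.py": 500, "test_e.py": 400, "test_f.py": 300,
}
partitions, loads = lpt_partition(example_durations, 3)
print(partitions)
print(loads)  # [1200, 1100, 1100]
```

The largest entry in loads is the wall-clock bound for the suite, which is the kind of figure the per-partition limits quoted in the summary refer to.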
Runner-fleet change required to take full effect: tag the new low-disk
runner with 4-gpu-b200-low-disk (and 4-gpu-b200-kernel-low-disk if it is
also kernel-build-capable), and ensure existing large-disk runners
advertise the *-low-disk labels in addition to their current ones.
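The labeling requirement above could be checked mechanically before rollout. The following is a sketch under stated assumptions: the Runner record, its large_disk flag, and the runner names are hypothetical, and only the non-kernel label pair is checked.

```python
from dataclasses import dataclass

BASE = "4-gpu-b200"
LOW_DISK = "4-gpu-b200-low-disk"


@dataclass(frozen=True)
class Runner:
    name: str               # hypothetical runner name
    large_disk: bool        # True if the runner can hold the ~800GB DeepSeek cache
    labels: frozenset[str]  # labels the runner advertises


def label_problems(runner: Runner) -> list[str]:
    """List the ways a runner's labels violate the scheme described in this PR."""
    problems = []
    if LOW_DISK not in runner.labels:
        problems.append(f"{runner.name}: missing {LOW_DISK}")
    if runner.large_disk and BASE not in runner.labels:
        problems.append(f"{runner.name}: missing {BASE}")
    if not runner.large_disk and BASE in runner.labels:
        # This would let the DeepSeek suite land on a low-disk machine.
        problems.append(f"{runner.name}: must not advertise {BASE}")
    return problems


fleet = [
    Runner("big-0", True, frozenset({BASE, LOW_DISK})),
    Runner("small-0", False, frozenset({LOW_DISK})),
    Runner("small-1", False, frozenset({BASE, LOW_DISK})),  # mislabeled on purpose
]
for runner in fleet:
    print(runner.name, label_problems(runner))
```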
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Author (Collaborator) commented:
/rerun-stage stage-c-test-4-gpu-b200-small
Contributor replied:
✅ Triggered
zhangying098 pushed a commit to zhangying098/sglang that referenced this pull request on Apr 23, 2026:
…gl-project#23417) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
alisonshao pushed a commit that referenced this pull request on Apr 30, 2026:
The new LoRA logprob-diff tests for nemotron3-super and qwen3.5-35b were registered to stage-c-test-4-gpu-b200-small. Reverting #23417 deleted that suite, so validate_all_suites fails. Point them at stage-c-test-4-gpu-b200 to match the rest of the un-split fleet.
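The failure described in that commit message can be illustrated with a small sketch: each test file declares a suite, and validation rejects any suite name that is no longer registered. The function, file names, and data below are hypothetical stand-ins, not the real validate_all_suites from test/run_suite.py:

```python
def find_unregistered(registrations: dict[str, str], known_suites: set[str]) -> dict[str, str]:
    """Return {test_file: suite} for tests whose declared suite is not registered."""
    return {
        test_file: suite
        for test_file, suite in registrations.items()
        if suite not in known_suites
    }


# After the revert, the -small suite no longer exists in the registry.
known_suites = {"stage-c-test-4-gpu-b200", "nightly-4-gpu-b200"}

# Hypothetical file names; the suite strings are the ones from this PR.
registrations = {
    "test_lora_example_a.py": "stage-c-test-4-gpu-b200-small",
    "test_lora_example_b.py": "stage-c-test-4-gpu-b200",
}

print(find_unregistered(registrations, known_suites))
# {'test_lora_example_a.py': 'stage-c-test-4-gpu-b200-small'}
```

Pointing the first file at stage-c-test-4-gpu-b200 empties the result, which is the fix the commit applies.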
Summary
- stage-c-test-4-gpu-b200 (3 files, ~3262s): only the heavy DeepSeek V3/V3.2 FP4 tests (~800GB cached models). Stays on existing large-disk runners.
- stage-c-test-4-gpu-b200-small (9 files, ~3495s): everything else (Qwen3.5 FP4, gpt-oss-120B, 3 LoRA tests, cutedsl_moe, eagle, update-weights). Eligible for either pool via the *-low-disk label.
- Moves three tests (test_fp8_blockwise_gemm.py, test_nvfp4_gemm.py, test_nvidia_nemotron_3_super_nvfp4.py) to the existing nightly-4-gpu-b200 suite.
Runner-fleet labeling required
For this split to take full effect, the runner fleet needs:
- Existing large-disk runners: 4-gpu-b200, 4-gpu-b200-low-disk
- Existing large-disk kernel runners: 4-gpu-b200-kernel, 4-gpu-b200-kernel-low-disk
- New low-disk runner: 4-gpu-b200-low-disk (and 4-gpu-b200-kernel-low-disk if kernel-build-capable)
The *-low-disk suffix is used as an eligibility label, not a hardware descriptor: both pools advertise it, but only the new runner is identified exclusively by it. The big DeepSeek job uses runs-on: 4-gpu-b200{,-kernel}, which the new runner does not advertise, so DeepSeek tests can never land on the low-disk runner.
Wiring changes
- .github/workflows/pr-test.yml: new b200_low_disk_runner output from the set-runner step; new stage-c-test-4-gpu-b200-small job duplicated from the existing one; partition size dropped 6→3 for both jobs (LPT-balanced: big suite ≤1380s/partition, small suite ≤1196s/partition); aggregator pr-test-finish extended.
- scripts/ci/utils/slash_command_handler.py: added the new suite to nvidia_stages and CUDA_SUITE_TO_RUNNER so /rerun-stage stage-c-test-4-gpu-b200-small and per-file slash dispatch resolve correctly.
- test/run_suite.py: registered the new suite name in PER_COMMIT_SUITES[HWBackend.CUDA].
- 9 test files: re-registered from stage-c-test-4-gpu-b200 to stage-c-test-4-gpu-b200-small.
- 3 test files: moved from stage-c-test-4-gpu-b200 (per-commit) to nightly-4-gpu-b200 (with nightly=True).
Test plan
- validate_all_suites passes (every test's suite is registered): verified via AST parse pre-commit
- pre-commit run --all-files passes: verified locally
- Run /rerun-stage stage-c-test-4-gpu-b200-small and confirm it can land on either pool
- Confirm /rerun-stage stage-c-test-4-gpu-b200 continues to land only on existing large-disk runners
- Confirm the kernel path (b200_runner → 4-gpu-b200-kernel) still picks up both jobs correctly
🤖 Generated with Claude Code