[ci] split stage-c-test-4-gpu-b200 to enable a low-disk runner pool#23417

Merged
Kangyan-Zhou merged 1 commit into main from split_b200_runners
Apr 22, 2026

@Kangyan-Zhou
Collaborator

Summary

  • Splits the per-commit B200 stage-c suite into two so a new B200 runner with limited disk capacity can serve a subset of tests:
    • stage-c-test-4-gpu-b200 (3 files, ~3262s): only the heavy DeepSeek V3/V3.2 FP4 tests (~800GB cached models). Stays on existing large-disk runners.
    • stage-c-test-4-gpu-b200-small (9 files, ~3495s): everything else (Qwen3.5 FP4, gpt-oss-120B, 3 LoRA tests, cutedsl_moe, eagle, update-weights). Eligible for either pool via the *-low-disk label.
  • Frees per-commit time by moving 3 lower-priority B200 tests (test_fp8_blockwise_gemm.py, test_nvfp4_gemm.py, test_nvidia_nemotron_3_super_nvfp4.py) to the existing nightly-4-gpu-b200 suite.

Runner-fleet labeling required

For this split to take full effect, the runner fleet needs:

| Runner | Labels |
| --- | --- |
| Existing large-disk B200 | 4-gpu-b200, 4-gpu-b200-low-disk |
| Existing large-disk B200 (kernel pool) | 4-gpu-b200-kernel, 4-gpu-b200-kernel-low-disk |
| New low-disk runner | 4-gpu-b200-low-disk (and 4-gpu-b200-kernel-low-disk if kernel-build-capable) |

The *-low-disk suffix is used as an eligibility label, not a hardware descriptor — both pools advertise it, but only the new runner is exclusively identified by it. The big DeepSeek job uses runs-on: 4-gpu-b200{,-kernel}, which the new runner does not advertise — so DeepSeek tests can never land on the low-disk runner.
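
The eligibility rule above can be sketched in a few lines of Python. This is only an illustration of label matching (a job can run on a runner that advertises every label the job requests), not code from this PR. The runner names are hypothetical; the label sets are copied from the table above.

```python
# Hypothetical runner names; label sets are taken from the PR description.
RUNNER_LABELS = {
    "large-disk-b200": {"4-gpu-b200", "4-gpu-b200-low-disk"},
    "large-disk-b200-kernel": {"4-gpu-b200-kernel", "4-gpu-b200-kernel-low-disk"},
    "new-low-disk-b200": {"4-gpu-b200-low-disk"},
}


def eligible_runners(runs_on: set) -> list:
    """Return the runners that advertise every label the job requests."""
    return sorted(
        name for name, labels in RUNNER_LABELS.items() if runs_on <= labels
    )


# The DeepSeek job requests the plain label, which the new runner lacks.
deepseek_pool = eligible_runners({"4-gpu-b200"})
# The small suite requests the *-low-disk label, which both pools advertise.
small_pool = eligible_runners({"4-gpu-b200-low-disk"})

print(deepseek_pool)  # ['large-disk-b200']
print(small_pool)     # ['large-disk-b200', 'new-low-disk-b200']
```

Because eligibility is a subset check, adding the *-low-disk label to the existing runners widens the small suite's pool without making the new runner eligible for the DeepSeek job.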

Wiring changes

  • .github/workflows/pr-test.yml: new b200_low_disk_runner output from set-runner step; new stage-c-test-4-gpu-b200-small job duplicated from the existing one; partition size dropped 6→3 for both jobs (LPT-balanced: big suite ≤1380s/partition, small suite ≤1196s/partition); aggregator pr-test-finish extended.
  • scripts/ci/utils/slash_command_handler.py: added the new suite to nvidia_stages and CUDA_SUITE_TO_RUNNER so /rerun-stage stage-c-test-4-gpu-b200-small and per-file slash dispatch resolve correctly.
  • test/run_suite.py: registered the new suite name in PER_COMMIT_SUITES[HWBackend.CUDA].
  • 9 test files re-tagged from stage-c-test-4-gpu-b200 to stage-c-test-4-gpu-b200-small.
  • 3 test files re-tagged from stage-c-test-4-gpu-b200 (per-commit) to nightly-4-gpu-b200 (with nightly=True).
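
For readers unfamiliar with the "LPT-balanced" figure in the pr-test.yml bullet, here is a minimal sketch of longest-processing-time-first partitioning. It assumes the CI partitioner follows the textbook greedy algorithm; the repository's actual implementation may differ. The file names and durations are made up for illustration and are not the measured times from this PR.

```python
import heapq


def lpt_partition(durations: dict, num_partitions: int) -> list:
    """Greedy LPT: place each test, longest first, on the lightest partition."""
    # Heap entries are (current load in seconds, partition index).
    heap = [(0.0, i) for i in range(num_partitions)]
    heapq.heapify(heap)
    partitions = [[] for _ in range(num_partitions)]
    # Sort by descending duration; break ties by name for determinism.
    for name, seconds in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        load, idx = heapq.heappop(heap)
        partitions[idx].append(name)
        heapq.heappush(heap, (load + seconds, idx))
    return partitions


# Hypothetical estimated runtimes in seconds (NOT the PR's real numbers).
example_durations = {
    "test_a.py": 900,
    "test_b.py": 700,
    "test_c.py": 600,
    "test_d.py": 500,
    "test_e.py": 400,
    "test_f.py": 300,
}

parts = lpt_partition(example_durations, 3)
loads = [sum(example_durations[t] for t in p) for p in parts]
print(parts)       # [['test_a.py', 'test_f.py'], ['test_b.py', 'test_e.py'], ['test_c.py', 'test_d.py']]
print(max(loads))  # 1200
```

The maximum partition load is the number that bounds wall-clock time per job, which is why the description quotes a per-partition ceiling rather than the suite total.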

Test plan

  • validate_all_suites passes (every test's suite is registered) — verified via AST parse pre-commit
  • pre-commit run --all-files passes — verified locally
  • Once the new runner is online and labeled, run /rerun-stage stage-c-test-4-gpu-b200-small and confirm it can land on either pool
  • Confirm /rerun-stage stage-c-test-4-gpu-b200 continues to land only on existing large-disk runners
  • Confirm that an sgl-kernel PR (which switches b200_runner to 4-gpu-b200-kernel) still picks up both jobs correctly
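
The first item in the test plan (every test's suite is registered, checked via an AST parse) could look roughly like the sketch below. It is not the repository's validate_all_suites; the register_ci call name and the suite= keyword are hypothetical stand-ins for however test files declare their suite. Only the three suite names come from this PR.

```python
import ast

# Suite names mentioned in this PR; the real registry lives in test/run_suite.py.
REGISTERED_SUITES = {
    "stage-c-test-4-gpu-b200",
    "stage-c-test-4-gpu-b200-small",
    "nightly-4-gpu-b200",
}


def suites_in_source(source: str) -> set:
    """Collect string literals passed as a suite=... keyword in any call."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        for kw in node.keywords:
            value = kw.value
            if (
                kw.arg == "suite"
                and isinstance(value, ast.Constant)
                and isinstance(value.value, str)
            ):
                found.add(value.value)
    return found


def unregistered_suites(source: str, registered=REGISTERED_SUITES) -> set:
    """Return suite names used in the source that are not registered."""
    return suites_in_source(source) - set(registered)


# register_ci is a hypothetical name; ast.parse never executes the code.
EXAMPLE_TEST_FILE = 'register_ci(est_time=300, suite="stage-c-test-4-gpu-b200-small")\n'

print(unregistered_suites(EXAMPLE_TEST_FILE))  # set()
```

Parsing instead of importing keeps the check cheap enough for a pre-commit hook, since no test dependencies or GPUs are needed. Dropping a suite from the registered set while test files still reference it makes the same check fail.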

🤖 Generated with Claude Code

Splits the per-commit B200 stage-c suite into two:
- stage-c-test-4-gpu-b200: only the 3 DeepSeek V3/V3.2 FP4 tests (~800GB
  cached models). Stays on existing large-disk runners via the
  4-gpu-b200{,-kernel} label.
- stage-c-test-4-gpu-b200-small: the other 9 tests (Qwen3.5 FP4, gpt-oss-120B,
  3 LoRA tests, cutedsl_moe, eagle, update-weights). Targets the new
  4-gpu-b200{,-kernel}-low-disk label, which is also advertised by the
  existing large-disk runners — so the small suite is eligible to land on
  either pool, while DeepSeek tests are isolated to the large-disk pool.

Also moves three 4-gpu-b200 tests that aren't critical for per-commit signal
to the existing nightly-4-gpu-b200 suite to free up per-commit time:
- test_fp8_blockwise_gemm.py
- test_nvfp4_gemm.py
- test_nvidia_nemotron_3_super_nvfp4.py

Wiring updates:
- pr-test.yml: new b200_low_disk_runner output from set-runner; new
  stage-c-test-4-gpu-b200-small job duplicated from the existing one,
  partition size dropped 6→3 for both jobs to match the smaller per-suite
  test counts.
- slash_command_handler.py: add the new suite to nvidia_stages and
  CUDA_SUITE_TO_RUNNER so /rerun-stage and per-file slash dispatch work.
- test/run_suite.py: register the new suite name in PER_COMMIT_SUITES.

Runner-fleet change required to take full effect: tag the new low-disk
runner with 4-gpu-b200-low-disk (and 4-gpu-b200-kernel-low-disk if it is
also kernel-build-capable), and ensure existing large-disk runners
advertise the *-low-disk labels in addition to their current ones.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added lora blackwell SM100/SM120 labels Apr 22, 2026
@Kangyan-Zhou
Collaborator Author

/rerun-stage stage-c-test-4-gpu-b200-small

@github-actions
Contributor

✅ Triggered stage-c-test-4-gpu-b200-small to run independently (skipping dependencies). View workflow run
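
As a rough illustration of how a /rerun-stage comment might be resolved to a runner label, here is a small sketch. It is not the code in slash_command_handler.py. The SUITE_TO_RUNNER dictionary below is a hypothetical stand-in for CUDA_SUITE_TO_RUNNER, and its two entries are inferred from the PR description (the big suite on 4-gpu-b200, the small suite on 4-gpu-b200-low-disk).

```python
from typing import Optional, Tuple

# Hypothetical stand-in for CUDA_SUITE_TO_RUNNER; entries inferred from the PR text.
SUITE_TO_RUNNER = {
    "stage-c-test-4-gpu-b200": "4-gpu-b200",
    "stage-c-test-4-gpu-b200-small": "4-gpu-b200-low-disk",
}


def resolve_rerun_stage(comment: str) -> Optional[Tuple[str, str]]:
    """Parse '/rerun-stage <suite>' and return (suite, runner label), or None."""
    parts = comment.strip().split()
    if len(parts) != 2 or parts[0] != "/rerun-stage":
        return None
    suite = parts[1]
    runner = SUITE_TO_RUNNER.get(suite)
    if runner is None:
        # Unknown suite: nothing to dispatch.
        return None
    return suite, runner


print(resolve_rerun_stage("/rerun-stage stage-c-test-4-gpu-b200-small"))
# ('stage-c-test-4-gpu-b200-small', '4-gpu-b200-low-disk')
```

A suite missing from the mapping resolves to None here, which is the situation the wiring change avoids by registering the new suite name.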

@Kangyan-Zhou Kangyan-Zhou merged commit 77fd86f into main Apr 22, 2026
87 of 93 checks passed
@Kangyan-Zhou Kangyan-Zhou deleted the split_b200_runners branch April 22, 2026 01:33
zhangying098 pushed a commit to zhangying098/sglang that referenced this pull request Apr 23, 2026
…gl-project#23417)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
alisonshao pushed a commit that referenced this pull request Apr 30, 2026
The new LoRA logprob-diff tests for nemotron3-super and qwen3.5-35b
were registered to stage-c-test-4-gpu-b200-small. Reverting #23417
deleted that suite, so validate_all_suites fails. Point them at
stage-c-test-4-gpu-b200 to match the rest of the un-split fleet.