[CI] Update B200 est_times to prevent timeouts on slower machine#22609

Merged
hnyls2002 merged 2 commits into main from ci/update-b200-est-times
Apr 12, 2026

Conversation

@alisonshao
Collaborator

@alisonshao alisonshao commented Apr 12, 2026

Summary

  • Update est_time for 10 B200 tests based on actual elapsed times + 20% buffer
  • The second B200 machine runs ~1.8x slower than the first due to hardware differences:
    | Metric                | Machine A | Machine B | Ratio       |
    |-----------------------|-----------|-----------|-------------|
    | HBM bandwidth         | 5528 GB/s | 3518 GB/s | 1.6x slower |
    | Host-to-Device PCIe   | 56.9 GB/s | 55.2 GB/s | ~same       |
    | Matmul TFLOPS (bf16)  | 1543      | 1499      | ~same       |
    | Disk read             | 3667 MB/s | 1301 MB/s | 2.8x slower |
  • Tests pass on both machines but partitions time out on Machine B because est_times were calibrated on the faster Machine A
    | Test                                   | Old est | Machine B actual | New est             |
    |----------------------------------------|---------|------------------|---------------------|
    | test_nvfp4_gemm.py                     | 322     | 459              | 550                 |
    | test_gpt_oss_4gpu.py                   | 312     | 615              | 740                 |
    | test_fp8_blockwise_gemm.py             | 302     | 527              | 630                 |
    | test_eagle_infer_beta_dp_attention.py  | 68      | 113              | 136                 |
    | test_nvidia_nemotron_3_super_nvfp4.py  | 294     | 591              | 710                 |
    | test_cutedsl_moe.py                    | 13      | 491              | 590                 |
    | test_deepseek_v3_fp4_4gpu.py           | 1146    | 1149             | 1380                |
    | test_deepseek_v3_fp4_mtp_small.py      | 416     | 424              | 510                 |
    | test_flash_attention_4.py              | 259     | 276              | 332                 |
    | test_lora_qwen3_30b...py               | 160     | 87               | (no change, faster) |
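The "actual + 20% buffer" rule above can be sketched as a small helper. This is illustrative only, not code from the sglang repository; the PR's published values were also hand-rounded (e.g. 551 -> 550, 738 -> 740), so the helper reproduces the buffer, not the final rounding.

```python
def buffered_est_time(actual_seconds: int, buffer_pct: int = 20) -> int:
    """Return ceil(actual * (1 + buffer_pct/100)) using integer
    arithmetic to avoid float rounding surprises. Hypothetical helper
    mirroring this PR's 'actual elapsed time + 20% buffer' rule."""
    return -(-actual_seconds * (100 + buffer_pct) // 100)

# Machine B actuals from the table above:
print(buffered_est_time(113))  # 136, matches the new est for test_eagle_infer_beta_dp_attention.py
print(buffered_est_time(615))  # 738, hand-rounded to 740 in the PR
```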

Example timeout: https://github.com/sgl-project/sglang/actions/runs/24288516804/job/70933367476

Test plan

  • est_time changes only, no logic changes
  • Benchmarked both machines to confirm hardware difference (HBM bandwidth, disk I/O)

The Innomatrix B200 machine runs ~1.8x slower than the Novita B200
due to running 2 concurrent CI containers sharing CPU/memory bandwidth.
Update est_time for 6 B200 tests based on actual Innomatrix elapsed
times + 20% buffer to prevent partition timeouts.

Changes (old -> new est_time):
- test_nvfp4_gemm.py: 322 -> 550 (actual: 459s)
- test_gpt_oss_4gpu.py: 312 -> 740 (actual: 615s)
- test_fp8_blockwise_gemm.py: 302 -> 630 (actual: 527s)
- test_eagle_infer_beta_dp_attention.py: 68 -> 136 (actual: 113s)
- test_nvidia_nemotron_3_super_nvfp4.py: 294 -> 710 (actual: 591s)
- test_cutedsl_moe.py: 13 -> 322 (actual: 268s)

Example timeout: https://github.com/sgl-project/sglang/actions/runs/24288516804/job/70933367476
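The failure mode described above is that est_time drives how tests are packed into partitions, so a badly underestimated test (13s estimated vs. 491s actual) blows its partition's wall-clock budget even though every test passes. A minimal sketch of such packing, assuming a simple greedy scheduler (hypothetical; the real sglang CI scheduler may differ):

```python
def pack_partitions(est_times: dict, budget_s: int) -> list:
    """Greedily pack tests (longest first) into partitions whose summed
    est_time stays under a per-partition budget. Hypothetical sketch of
    why a wrong est_time causes a partition timeout: the scheduler
    budgets on the estimate, but the wall clock runs on the actual."""
    partitions, current, total = [], [], 0
    for name, est in sorted(est_times.items(), key=lambda kv: -kv[1]):
        if current and total + est > budget_s:
            partitions.append(current)
            current, total = [], 0
        current.append(name)
        total += est
    if current:
        partitions.append(current)
    return partitions
```

With est_time=13 for test_cutedsl_moe.py, the scheduler would happily co-locate it with other long tests; at an actual 491s the partition overruns its budget and the job is killed.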
@github-actions github-actions Bot added the blackwell SM100/SM120 label Apr 12, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates the estimated execution times (est_time) for several test files within the stage-c-test-4-gpu-b200 suite, including GPT OSS, Nemotron, CuteDSL MoE, and various quantization and speculative decoding tests. I have no feedback to provide.

Additional tests found from other Inno runs:
- test_cutedsl_moe.py: 322 -> 590 (worst case: 491s)
- test_deepseek_v3_fp4_4gpu.py: 1146 -> 1380 (actual: 1149s)
- test_deepseek_v3_fp4_mtp_small.py: 416 -> 510 (actual: 424s)
- test_flash_attention_4.py: 259 -> 332 (actual: 276s)
@Kangyan-Zhou
Collaborator

/rerun-stage stage-c-4-gpu-b200

@github-actions
Contributor

❌ Stage stage-c-4-gpu-b200 doesn't support isolated runs yet.

NVIDIA stages:

  • stage-a-test-1-gpu-small
  • stage-a-test-cpu
  • stage-b-test-1-gpu-small
  • stage-b-test-1-gpu-large
  • stage-b-test-2-gpu-large
  • stage-b-test-4-gpu-b200
  • stage-c-test-4-gpu-h100
  • stage-c-test-8-gpu-h200
  • stage-c-test-8-gpu-h20
  • stage-c-test-4-gpu-b200
  • stage-c-test-4-gpu-gb200
  • stage-c-test-deepep-4-gpu-h100
  • stage-c-test-deepep-8-gpu-h200
  • multimodal-gen-test-1-gpu
  • multimodal-gen-test-2-gpu
  • multimodal-gen-component-accuracy-1-gpu
  • multimodal-gen-component-accuracy-2-gpu
  • multimodal-gen-test-1-b200

AMD stages:

  • sgl-kernel-unit-test-amd
  • sgl-kernel-unit-test-2-gpu-amd
  • stage-a-test-1-gpu-small-amd
  • stage-b-test-1-gpu-small-amd
  • stage-b-test-1-gpu-small-amd-nondeterministic
  • stage-b-test-1-gpu-small-amd-mi35x
  • stage-b-test-1-gpu-large-amd
  • stage-b-test-2-gpu-large-amd
  • multimodal-gen-test-1-gpu-amd
  • multimodal-gen-test-2-gpu-amd
  • stage-c-test-large-8-gpu-amd
  • stage-c-test-large-8-gpu-amd-mi35x

Other stages will be added soon. For now, use /rerun-failed-ci for those stages.

@Kangyan-Zhou
Collaborator

/rerun-stage stage-c-test-4-gpu-b200

@github-actions
Contributor

✅ Triggered stage-c-test-4-gpu-b200 to run independently (skipping dependencies). View workflow run

@hnyls2002 hnyls2002 changed the title [CI] Update B200 est_times to prevent Innomatrix timeouts [CI] Update B200 est_times to prevent timeouts on slower machine Apr 12, 2026
@hnyls2002 hnyls2002 merged commit d6c9d91 into main Apr 12, 2026
104 of 115 checks passed
@hnyls2002 hnyls2002 deleted the ci/update-b200-est-times branch April 12, 2026 04:40
pyc96 pushed a commit to pyc96/sglang that referenced this pull request Apr 14, 2026
…-project#22609)

Co-authored-by: Alison Shao <alison.shao@MacBook-Pro-D2W773R9CD.local>
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
…-project#22609)

Co-authored-by: Alison Shao <alison.shao@MacBook-Pro-D2W773R9CD.local>

3 participants