
[AMD][CI] Add GLM-5-MXFP4 accuracy and perf nightly tests for MI35x#21773

Merged: HaiShaw merged 1 commit into main from feat/glm5-mxfp4-mi35x-nightly on Apr 15, 2026

Conversation

@michaelzhang-ai (Collaborator) commented Mar 31, 2026

Summary

  • Add nightly accuracy test (GSM8K 5-shot) and perf benchmark (`bench_one_batch`) for `amd/GLM-5-MXFP4` on MI35x 8-GPU
  • Remove obsolete base GLM-5 (BF16 NSA) CI jobs superseded by GLM-5.1 and GLM-5-MXFP4
  • Register combined Accuracy + Performance jobs in both workflow files

Model Details

| Property | Value |
| --- | --- |
| Model | amd/GLM-5-MXFP4 |
| Architecture | `GlmMoeDsaForCausalLM` (MoE, 408B) |
| Quantization | MoE-only OCP MXFP4 (Quark, auto-detected as `quant_method: "quark"`) |
| GSM8K accuracy | ~92-93% (threshold set to 0.90) |

Files Changed (4 files, +528/-130)

| File | Change |
| --- | --- |
| `test/registered/amd/accuracy/mi35x/test_glm5_mxfp4_eval_mi35x.py` | New: GSM8K accuracy test (threshold 0.90) |
| `test/registered/amd/perf/mi35x/test_glm5_mxfp4_perf_mi35x.py` | New: `bench_one_batch` (1024 in / 1024 out) |
| `nightly-test-amd.yml` | Add `nightly-8-gpu-mi35x-glm5-mxfp4`, remove base GLM-5 jobs |
| `nightly-test-amd-rocm720.yml` | Add `nightly-8-gpu-mi35x-glm5-mxfp4-rocm720`, remove base GLM-5 jobs |
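The accuracy side of the new nightly test comes down to a threshold gate on the GSM8K score. A minimal sketch of that gate, assuming a hypothetical result dict; the function name and dict shape are illustrative, not the actual test-harness API:

```python
# Hypothetical sketch of the accuracy gate used by the nightly test.
# The result-dict shape and function name are assumptions for illustration.

GSM8K_ACCURACY_THRESHOLD = 0.90  # amd/GLM-5-MXFP4 typically scores ~0.92-0.93


def passes_accuracy_gate(result: dict, threshold: float = GSM8K_ACCURACY_THRESHOLD) -> bool:
    """Return True when the measured GSM8K accuracy clears the nightly gate."""
    accuracy = float(result["accuracy"])
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError(f"accuracy out of range: {accuracy}")
    return accuracy >= threshold
```

Setting the threshold a couple of points below the typical score (0.90 vs. ~0.92-0.93) leaves headroom for run-to-run noise without masking a real regression.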

Test Runs

| Run | Status | Link |
| --- | --- | --- |
| Default ROCm MI35x | ✅ Passed | #24361489209 |
| ROCm 7.2 MI35x | ✅ Passed | #24361490385 |

Test Plan

  • CI job `nightly-8-gpu-mi35x-glm5-mxfp4` passes accuracy + perf on MI35x (default ROCm)
  • ROCm 7.2 variant passes accuracy + perf
  • YAML validation passes
  • `black`, `ruff`, `isort` pass

@gemini-code-assist (Bot, Contributor) left a comment


Code Review

This pull request adds GSM8K accuracy evaluation and performance benchmarking scripts for the GLM-5-MXFP4 model on AMD MI35x GPUs. The review feedback suggests moving module-level environment variable configurations to setUpClass or passing them directly to the runner to avoid side effects. Other improvements include replacing ast.literal_eval with int() for more robust numerical parsing and adding a safety check for zero division when calculating Inter-Token Latency (ITL).
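The three review suggestions can be sketched as follows; the class name, env-var name, and helper signatures are illustrative assumptions, not the actual test code:

```python
import os
import unittest


class GLM5MXFP4PerfTest(unittest.TestCase):
    """Sketch of the reviewer's suggestions; names are illustrative."""

    @classmethod
    def setUpClass(cls):
        # Suggestion 1: set env vars in setUpClass rather than at module
        # import time, so importing the test module has no side effects.
        cls._saved = os.environ.get("SGLANG_USE_AITER")  # hypothetical flag
        os.environ["SGLANG_USE_AITER"] = "1"

    @classmethod
    def tearDownClass(cls):
        # Restore the environment so other tests are unaffected.
        if cls._saved is None:
            os.environ.pop("SGLANG_USE_AITER", None)
        else:
            os.environ["SGLANG_USE_AITER"] = cls._saved


def parse_token_count(raw: str) -> int:
    # Suggestion 2: int() is stricter and simpler than ast.literal_eval
    # for pulling a numeric field out of benchmark output.
    return int(raw.strip())


def inter_token_latency_ms(total_latency_s: float, num_output_tokens: int) -> float:
    # Suggestion 3: guard against zero division when one or zero tokens
    # were produced (ITL is undefined in that case; report 0.0).
    if num_output_tokens <= 1:
        return 0.0
    return total_latency_s / (num_output_tokens - 1) * 1000.0
```

The setUpClass/tearDownClass pairing is the key point: module-level `os.environ[...] = ...` assignments leak into any process that merely imports the file, while class-scoped setup confines them to the test run.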

Comment threads:

  • `test/registered/amd/accuracy/mi35x/test_glm5_mxfp4_eval_mi35x.py` (×2)
  • `test/registered/amd/perf/mi35x/test_glm5_mxfp4_perf_mi35x.py` (×4, one outdated)
@michaelzhang-ai michaelzhang-ai changed the title [AMD][CI] Add GLM-5-MXFP4 accuracy and perf nightly tests for MI35x [AMD][CI][WIP] Add GLM-5-MXFP4 accuracy and perf nightly tests for MI35x Apr 2, 2026
@michaelzhang-ai michaelzhang-ai marked this pull request as draft April 3, 2026 21:27
@michaelzhang-ai michaelzhang-ai force-pushed the feat/glm5-mxfp4-mi35x-nightly branch from cb68ea7 to 899c460 Compare April 10, 2026 19:40
@michaelzhang-ai michaelzhang-ai force-pushed the feat/glm5-mxfp4-mi35x-nightly branch 4 times, most recently from ef0fc1b to 4cd9139 Compare April 11, 2026 07:53
@michaelzhang-ai michaelzhang-ai changed the title [AMD][CI][WIP] Add GLM-5-MXFP4 accuracy and perf nightly tests for MI35x [AMD][CI] Add GLM-5-MXFP4 accuracy and perf nightly tests for MI35x Apr 11, 2026
@michaelzhang-ai michaelzhang-ai marked this pull request as ready for review April 11, 2026 07:56
@michaelzhang-ai michaelzhang-ai force-pushed the feat/glm5-mxfp4-mi35x-nightly branch 2 times, most recently from 14ab3b2 to 8d43fd8 Compare April 11, 2026 08:01
@michaelzhang-ai (Collaborator, Author) commented Apr 11, 2026

Addressed @1am9trash's review: added `--reasoning-parser glm45 --tool-call-parser glm47` to the perf test configs (matching the GLM-5-FP8 and NVIDIA tests). Also changed the perf input/output lengths to 1024/1024 (previously 4096/512, which exceeded `context-length=4096`).
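The length change enforces a simple invariant: prompt tokens plus generated tokens must fit inside the server's context window. A hypothetical helper illustrating the check (not part of the real harness):

```python
# Hypothetical sanity check, not part of the real test harness:
# a bench config is valid only if input + output fit the context window.

def fits_context(input_len: int, output_len: int, context_length: int = 4096) -> bool:
    """Return True when a prompt/generation pair fits the context window."""
    return input_len + output_len <= context_length

# Old config: 4096 in + 512 out = 4608 > 4096, overflows.
# New config: 1024 in + 1024 out = 2048 <= 4096, fits.
```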

@michaelzhang-ai michaelzhang-ai marked this pull request as draft April 11, 2026 08:09
@michaelzhang-ai michaelzhang-ai force-pushed the feat/glm5-mxfp4-mi35x-nightly branch 4 times, most recently from a3b92d6 to 81524b0 Compare April 13, 2026 06:29
@michaelzhang-ai michaelzhang-ai marked this pull request as ready for review April 13, 2026 06:29
@gemini-code-assist (Bot) commented:

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@michaelzhang-ai michaelzhang-ai force-pushed the feat/glm5-mxfp4-mi35x-nightly branch from 81524b0 to b7c261c Compare April 13, 2026 19:02
@michaelzhang-ai michaelzhang-ai requested a review from HaiShaw April 13, 2026 20:40
@michaelzhang-ai michaelzhang-ai force-pushed the feat/glm5-mxfp4-mi35x-nightly branch from b7c261c to cfc8a76 Compare April 14, 2026 02:17
@HaiShaw (Collaborator) left a comment


@michaelzhang-ai check comment.

Comment thread test/registered/amd/perf/mi35x/test_glm5_mxfp4_perf_mi35x.py
@michaelzhang-ai (Collaborator, Author) commented Apr 14, 2026

@HaiShaw This follows the existing pattern used across all AMD perf tests (`test_deepseek_r1_mxfp4_perf_mi35x.py`, `test_grok2_perf_mi35x.py`, etc.).

@michaelzhang-ai michaelzhang-ai force-pushed the feat/glm5-mxfp4-mi35x-nightly branch 2 times, most recently from 874fa67 to 4542d8f Compare April 14, 2026 19:41
@michaelzhang-ai michaelzhang-ai requested a review from HaiShaw April 14, 2026 19:42
@michaelzhang-ai michaelzhang-ai force-pushed the feat/glm5-mxfp4-mi35x-nightly branch 2 times, most recently from 851996e to 2c9b9b5 Compare April 14, 2026 19:51
Add nightly CI tests for amd/GLM-5-MXFP4 (Quark MXFP4 quantized) on
MI35x GPUs with accuracy (GSM8K, threshold 0.90) and performance
(bench_one_batch, 1024 in / 1024 out) benchmarks.

Test files:
- test/registered/amd/accuracy/mi35x/test_glm5_mxfp4_eval_mi35x.py
- test/registered/amd/perf/mi35x/test_glm5_mxfp4_perf_mi35x.py

Note: Workflow entries and engine fixes already merged via earlier PRs.
@michaelzhang-ai michaelzhang-ai force-pushed the feat/glm5-mxfp4-mi35x-nightly branch from 2c9b9b5 to 206b3d3 Compare April 14, 2026 19:54
@michaelzhang-ai (Collaborator, Author) commented:

@amd-bot ci-status

@michaelzhang-ai (Collaborator, Author) commented:
@HaiShaw All 12 errors are pre-existing AMD CI issues, none related to this PR:

| Job | Error type | Our fault? |
| --- | --- | --- |
| stage-b-test-1-gpu-small-amd (partitions 2, 4, 10, 11, 12, 13) | Exit code 255 or 30-min timeout | No; 1-GPU unit tests on MI325, and our PR doesn't touch any tested code |
| stage-b-test-1-gpu-large-amd (partition 1) | 30-min timeout | No; same |
| stage-b-test-1-gpu-small-amd-mi35x | Exit code 255 | No; MI35x 1-GPU tests |
| stage-b-test-1-gpu-small-amd-nondeterministic | Exit code 255 | No; known flaky by definition |
| wait-for-stage-b-amd | Gate failed because upstream failed | No; cascading failure |
| pr-test-amd-finish | Gate failed | No; cascading |

The 4 warnings are all Node.js 20 deprecation notices on GitHub Actions runners; they are infrastructure-level and affect all PRs.

Our PR only adds 2 test files and modifies 2 workflow YAML files. It doesn't change any engine code, model code, or existing tests.

@amd-bot commented Apr 15, 2026

@michaelzhang-ai

CI Status for PR #21773

PR: [AMD][CI] Add GLM-5-MXFP4 accuracy and perf nightly tests for MI35x
Changed files: .github/workflows/nightly-test-amd-rocm720.yml (+30/-64), .github/workflows/nightly-test-amd.yml (+30/-66), test/registered/amd/accuracy/mi35x/test_glm5_mxfp4_eval_mi35x.py (+281/-0), test/registered/amd/perf/mi35x/test_glm5_mxfp4_perf_mi35x.py (+187/-0)

This PR only modifies nightly workflow definitions and adds new nightly test files. It does not change any runtime code, PR CI test files, or test infrastructure. None of the CI failures below are related to this PR.

AMD CI: 10 failures (0 likely related) | Others: 9 failures (0 related)

AMD CI Failures

| Job | Error | Related? | Explanation |
| --- | --- | --- | --- |
| stage-b-test-1-gpu-small-amd (2) | RuntimeError: Rank 0 scheduler died during initialization (exit code: -6) | 🟢 Unlikely | Scheduler crash in `test_lora_load_from_tensor.py`; unrelated to nightly workflow changes |
| stage-b-test-1-gpu-small-amd (4) | Server process exited with code -9 (OOM/SIGKILL) | 🟢 Unlikely | LLaDA2 model OOM in `test_llada2_mini_amd.py`; unrelated to this PR |
| stage-b-test-1-gpu-small-amd (8) | Timed out after 30 minutes (watchdog timeouts) | 🟢 Unlikely | Hang in `test_reasoning.py`; unrelated to this PR |
| stage-b-test-1-gpu-small-amd (10) | Timed out after 30 minutes (watchdog timeouts) | 🟢 Unlikely | Hang in `test_eval_accuracy_large.py`; unrelated to this PR |
| stage-b-test-1-gpu-small-amd (11) | HFRunner subprocess died with exit code 1 | 🟢 Unlikely | `test_multi_lora_backend.py` HFRunner crash; unrelated to this PR |
| stage-b-test-1-gpu-small-amd (12) | Server process exited with code -9 (OOM/SIGKILL) | 🟢 Unlikely | LLaDA2 model OOM in `test_llada2_mini.py`; unrelated to this PR |
| stage-b-test-1-gpu-small-amd (13) | Timed out after 30 minutes + HW Exception by GPU node-2, reason: GPU Hang | 🟢 Unlikely | GPU hardware hang; infrastructure issue, unrelated to this PR |
| stage-b-test-1-gpu-large-amd (1) | Timed out after 30 minutes (1800s test timeout) | 🟢 Unlikely | Hang in `test_bench_serving_1gpu_part2.py`; unrelated to this PR |
| stage-b-test-1-gpu-small-amd-nondeterministic | Memory access fault by GPU node-2 + Fatal Python error: Aborted | 🟢 Unlikely | GPU memory fault in `test_reward_models.py` during Qwen3 model; infrastructure/HW issue |
| stage-b-test-1-gpu-small-amd-mi35x | AssertionError: False is not true + timeout | 🟢 Unlikely | Streaming response empty in `test_gpt_oss_1gpu.py`; unrelated to nightly workflow changes |

Other CI Failures

| Job | Error | Related? | Explanation |
| --- | --- | --- | --- |
| stage-c-test-deepep-8-gpu-h200 | CUDA version mismatch (13.0 vs 12.8) | 🟢 Unlikely | CUDA version mismatch on runner during DeepEP install; infrastructure issue |
| stage-c-test-8-gpu-h200 (0) | Fast-fail: skipping; root cause: stage-c-test-deepep-8-gpu-h200 | 🟢 Unlikely | Cascade from deepep CUDA mismatch |
| stage-c-test-8-gpu-h200 (1) | Fast-fail: skipping; root cause: stage-c-test-deepep-8-gpu-h200 | 🟢 Unlikely | Cascade from deepep CUDA mismatch |
| stage-c-test-8-gpu-h200 (2) | Fast-fail: skipping; root cause: stage-c-test-deepep-8-gpu-h200 | 🟢 Unlikely | Cascade from deepep CUDA mismatch |
| stage-c-test-8-gpu-h200 (3) | Fast-fail: skipping; root cause: stage-c-test-deepep-8-gpu-h200 | 🟢 Unlikely | Cascade from deepep CUDA mismatch |
| stage-c-test-4-gpu-b200 (1) | Fast-fail: skipping; root cause: stage-c-test-deepep-8-gpu-h200 | 🟢 Unlikely | Cascade from deepep CUDA mismatch |
| stage-c-test-4-gpu-b200 (2) | Fast-fail: skipping; root cause: stage-c-test-deepep-8-gpu-h200 | 🟢 Unlikely | Cascade from deepep CUDA mismatch |
| stage-c-test-4-gpu-b200 (3) | Fast-fail: skipping; root cause: stage-c-test-deepep-8-gpu-h200 | 🟢 Unlikely | Cascade from deepep CUDA mismatch |
| build-test (all) | Failed to parse benchmark output in test_latency_fp8_qwen (Intel AMX) | 🟢 Unlikely | Intel AMX CPU backend test failure; unrelated to AMD nightly workflow changes |

Details

All 19 failures are unrelated to this PR. This PR only modifies nightly AMD workflow definitions (adding GLM-5-MXFP4 test jobs, removing old GLM-5 jobs, reorganizing GLM-5.1 jobs) and adds two new nightly test files. None of the changed files are executed during PR CI.

The failures fall into these pre-existing categories:

  • AMD OOM/crashes (shards 2, 4, 12): LLaDA2 and LoRA tests hitting OOM on MI325 runners
  • AMD timeouts/hangs (shards 8, 10, 13, large-1): Watchdog timeouts during reasoning, eval, perf, and observability tests
  • AMD hardware issues (shard 13, nondeterministic): GPU Hang and memory access fault — infrastructure problems
  • AMD MI35x (mi35x): Streaming response assertion failure in test_gpt_oss_1gpu.py
  • Nvidia CUDA mismatch (deepep + 7 cascades): Runner has CUDA 13.0 but PyTorch was compiled with CUDA 12.8
  • CPU backend (build-test): Intel AMX FP8 quantization benchmark parse failure

Verdict: No action needed from the PR author. All failures are pre-existing infrastructure or flaky test issues on main.

Generated by amd-bot using Claude Code CLI

@HaiShaw HaiShaw merged commit 39c6bf7 into main Apr 15, 2026
94 of 116 checks passed
@HaiShaw HaiShaw deleted the feat/glm5-mxfp4-mi35x-nightly branch April 15, 2026 01:55

4 participants