
[AMD][CI] Add GLM-5-MXFP4 accuracy and perf nightly tests for MI35x#21773

Merged: HaiShaw merged 1 commit into main from feat/glm5-mxfp4-mi35x-nightly on Apr 15, 2026

Conversation

@michaelzhang-ai (Collaborator) commented Mar 31, 2026

Summary

  • Add nightly accuracy test (GSM8K 5-shot) and perf benchmark (`bench_one_batch`) for `amd/GLM-5-MXFP4` on MI35x 8-GPU
  • Remove obsolete base GLM-5 (BF16 NSA) CI jobs superseded by GLM-5.1 and GLM-5-MXFP4
  • Register combined Accuracy + Performance jobs in both workflow files

Model Details

| Property | Value |
| --- | --- |
| Model | amd/GLM-5-MXFP4 |
| Architecture | `GlmMoeDsaForCausalLM` (MoE, 408B) |
| Quantization | MoE-only OCP MXFP4 (Quark, auto-detected as `quant_method: "quark"`) |
| GSM8K accuracy | ~92-93% (threshold set to 0.90) |

Files Changed (4 files, +528/-130)

| File | Change |
| --- | --- |
| `test/registered/amd/accuracy/mi35x/test_glm5_mxfp4_eval_mi35x.py` | New: GSM8K accuracy test (threshold 0.90) |
| `test/registered/amd/perf/mi35x/test_glm5_mxfp4_perf_mi35x.py` | New: `bench_one_batch` (1024 in / 1024 out) |
| `nightly-test-amd.yml` | Add `nightly-8-gpu-mi35x-glm5-mxfp4`, remove base GLM-5 jobs |
| `nightly-test-amd-rocm720.yml` | Add `nightly-8-gpu-mi35x-glm5-mxfp4-rocm720`, remove base GLM-5 jobs |
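The accuracy side of the new nightly test comes down to a threshold gate on the GSM8K score. A minimal sketch of that gate, assuming a hypothetical result dict; the function name and dict shape are illustrative, not the actual test-harness API:

```python
# Hypothetical sketch of the accuracy gate used by the nightly test.
# The result-dict shape and function name are assumptions for illustration.

GSM8K_ACCURACY_THRESHOLD = 0.90  # amd/GLM-5-MXFP4 typically scores ~0.92-0.93


def passes_accuracy_gate(result: dict, threshold: float = GSM8K_ACCURACY_THRESHOLD) -> bool:
    """Return True when the measured GSM8K accuracy clears the nightly gate."""
    accuracy = float(result["accuracy"])
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError(f"accuracy out of range: {accuracy}")
    return accuracy >= threshold
```

Setting the threshold a couple of points below the typical score (0.90 vs. ~0.92-0.93) leaves headroom for run-to-run noise without masking a real regression.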

Test Runs

| Run | Status | Link |
| --- | --- | --- |
| Default ROCm MI35x | ✅ Passed | #24361489209 |
| ROCm 7.2 MI35x | ✅ Passed | #24361490385 |

Test Plan

  • CI job `nightly-8-gpu-mi35x-glm5-mxfp4` passes accuracy + perf on MI35x (default ROCm)
  • ROCm 7.2 variant passes accuracy + perf
  • YAML validation passes
  • `black`, `ruff`, `isort` pass

@gemini-code-assist (Bot, Contributor) left a comment


Code Review

This pull request adds GSM8K accuracy evaluation and performance benchmarking scripts for the GLM-5-MXFP4 model on AMD MI35x GPUs. The review feedback suggests moving module-level environment variable configurations to setUpClass or passing them directly to the runner to avoid side effects. Other improvements include replacing ast.literal_eval with int() for more robust numerical parsing and adding a safety check for zero division when calculating Inter-Token Latency (ITL).
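The three review suggestions can be sketched as follows; the class name, env-var name, and helper signatures are illustrative assumptions, not the actual test code:

```python
import os
import unittest


class GLM5MXFP4PerfTest(unittest.TestCase):
    """Sketch of the reviewer's suggestions; names are illustrative."""

    @classmethod
    def setUpClass(cls):
        # Suggestion 1: set env vars in setUpClass rather than at module
        # import time, so importing the test module has no side effects.
        cls._saved = os.environ.get("SGLANG_USE_AITER")  # hypothetical flag
        os.environ["SGLANG_USE_AITER"] = "1"

    @classmethod
    def tearDownClass(cls):
        # Restore the environment so other tests are unaffected.
        if cls._saved is None:
            os.environ.pop("SGLANG_USE_AITER", None)
        else:
            os.environ["SGLANG_USE_AITER"] = cls._saved


def parse_token_count(raw: str) -> int:
    # Suggestion 2: int() is stricter and simpler than ast.literal_eval
    # for pulling a numeric field out of benchmark output.
    return int(raw.strip())


def inter_token_latency_ms(total_latency_s: float, num_output_tokens: int) -> float:
    # Suggestion 3: guard against zero division when one or zero tokens
    # were produced (ITL is undefined in that case; report 0.0).
    if num_output_tokens <= 1:
        return 0.0
    return total_latency_s / (num_output_tokens - 1) * 1000.0
```

The setUpClass/tearDownClass pairing is the key point: module-level `os.environ[...] = ...` assignments leak into any process that merely imports the file, while class-scoped setup confines them to the test run.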

Comment threads:

  • `test/registered/amd/accuracy/mi35x/test_glm5_mxfp4_eval_mi35x.py` (×2)
  • `test/registered/amd/perf/mi35x/test_glm5_mxfp4_perf_mi35x.py` (×4, one outdated)
@michaelzhang-ai michaelzhang-ai changed the title [AMD][CI] Add GLM-5-MXFP4 accuracy and perf nightly tests for MI35x [AMD][CI][WIP] Add GLM-5-MXFP4 accuracy and perf nightly tests for MI35x Apr 2, 2026
@michaelzhang-ai michaelzhang-ai marked this pull request as draft April 3, 2026 21:27
@michaelzhang-ai michaelzhang-ai force-pushed the feat/glm5-mxfp4-mi35x-nightly branch from cb68ea7 to 899c460 Compare April 10, 2026 19:40
@michaelzhang-ai michaelzhang-ai force-pushed the feat/glm5-mxfp4-mi35x-nightly branch 4 times, most recently from ef0fc1b to 4cd9139 Compare April 11, 2026 07:53
@michaelzhang-ai michaelzhang-ai changed the title [AMD][CI][WIP] Add GLM-5-MXFP4 accuracy and perf nightly tests for MI35x [AMD][CI] Add GLM-5-MXFP4 accuracy and perf nightly tests for MI35x Apr 11, 2026
@michaelzhang-ai michaelzhang-ai marked this pull request as ready for review April 11, 2026 07:56
@michaelzhang-ai michaelzhang-ai force-pushed the feat/glm5-mxfp4-mi35x-nightly branch 2 times, most recently from 14ab3b2 to 8d43fd8 Compare April 11, 2026 08:01
@michaelzhang-ai (Collaborator, Author) commented Apr 11, 2026

Addressed @1am9trash's review: added `--reasoning-parser glm45 --tool-call-parser glm47` to the perf test configs (matching the GLM-5-FP8 and NVIDIA tests). Also changed the perf input/output lengths to 1024/1024 (previously 4096/512, which exceeded `context-length=4096`).
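The length change enforces a simple invariant: prompt tokens plus generated tokens must fit inside the server's context window. A hypothetical helper illustrating the check (not part of the real harness):

```python
# Hypothetical sanity check, not part of the real test harness:
# a bench config is valid only if input + output fit the context window.

def fits_context(input_len: int, output_len: int, context_length: int = 4096) -> bool:
    """Return True when a prompt/generation pair fits the context window."""
    return input_len + output_len <= context_length

# Old config: 4096 in + 512 out = 4608 > 4096, overflows.
# New config: 1024 in + 1024 out = 2048 <= 4096, fits.
```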

@michaelzhang-ai michaelzhang-ai marked this pull request as draft April 11, 2026 08:09
@michaelzhang-ai michaelzhang-ai force-pushed the feat/glm5-mxfp4-mi35x-nightly branch 4 times, most recently from a3b92d6 to 81524b0 Compare April 13, 2026 06:29
@michaelzhang-ai michaelzhang-ai marked this pull request as ready for review April 13, 2026 06:29
@gemini-code-assist (Bot) commented:

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@michaelzhang-ai michaelzhang-ai force-pushed the feat/glm5-mxfp4-mi35x-nightly branch from 81524b0 to b7c261c Compare April 13, 2026 19:02
@michaelzhang-ai michaelzhang-ai requested a review from HaiShaw April 13, 2026 20:40
@michaelzhang-ai michaelzhang-ai force-pushed the feat/glm5-mxfp4-mi35x-nightly branch from b7c261c to cfc8a76 Compare April 14, 2026 02:17
@HaiShaw (Collaborator) left a comment


@michaelzhang-ai check comment.

Comment thread test/registered/amd/perf/mi35x/test_glm5_mxfp4_perf_mi35x.py
@michaelzhang-ai (Collaborator, Author) commented Apr 14, 2026

@HaiShaw This follows the existing pattern used across all AMD perf tests (`test_deepseek_r1_mxfp4_perf_mi35x.py`, `test_grok2_perf_mi35x.py`, etc.).

@michaelzhang-ai michaelzhang-ai force-pushed the feat/glm5-mxfp4-mi35x-nightly branch 2 times, most recently from 874fa67 to 4542d8f Compare April 14, 2026 19:41
@michaelzhang-ai michaelzhang-ai requested a review from HaiShaw April 14, 2026 19:42
@michaelzhang-ai michaelzhang-ai force-pushed the feat/glm5-mxfp4-mi35x-nightly branch 2 times, most recently from 851996e to 2c9b9b5 Compare April 14, 2026 19:51
Add nightly CI tests for amd/GLM-5-MXFP4 (Quark MXFP4 quantized) on
MI35x GPUs with accuracy (GSM8K, threshold 0.90) and performance
(bench_one_batch, 1024 in / 1024 out) benchmarks.

Test files:
- test/registered/amd/accuracy/mi35x/test_glm5_mxfp4_eval_mi35x.py
- test/registered/amd/perf/mi35x/test_glm5_mxfp4_perf_mi35x.py

Note: Workflow entries and engine fixes already merged via earlier PRs.
@michaelzhang-ai michaelzhang-ai force-pushed the feat/glm5-mxfp4-mi35x-nightly branch from 2c9b9b5 to 206b3d3 Compare April 14, 2026 19:54
@michaelzhang-ai (Collaborator, Author) commented:

@amd-bot ci-status

@michaelzhang-ai (Collaborator, Author) commented:
@HaiShaw All 12 errors are pre-existing AMD CI issues, none related to this PR:

| Job | Error type | Our fault? |
| --- | --- | --- |
| stage-b-test-1-gpu-small-amd (partitions 2, 4, 10, 11, 12, 13) | Exit code 255 or 30-min timeout | No; 1-GPU unit tests on MI325, and our PR doesn't touch any tested code |
| stage-b-test-1-gpu-large-amd (partition 1) | 30-min timeout | No; same |
| stage-b-test-1-gpu-small-amd-mi35x | Exit code 255 | No; MI35x 1-GPU tests |
| stage-b-test-1-gpu-small-amd-nondeterministic | Exit code 255 | No; known flaky by definition |
| wait-for-stage-b-amd | Gate failed because upstream failed | No; cascading failure |
| pr-test-amd-finish | Gate failed | No; cascading |

The 4 warnings are all Node.js 20 deprecation notices on GitHub Actions runners; they are infrastructure-level and affect all PRs.

Our PR only adds 2 test files and modifies 2 workflow YAML files. It doesn't change any engine code, model code, or existing tests.

@amd-bot commented Apr 15, 2026

@michaelzhang-ai

CI Status for PR #21773

PR: [AMD][CI] Add GLM-5-MXFP4 accuracy and perf nightly tests for MI35x
Changed files: .github/workflows/nightly-test-amd-rocm720.yml (+30/-64), .github/workflows/nightly-test-amd.yml (+30/-66), test/registered/amd/accuracy/mi35x/test_glm5_mxfp4_eval_mi35x.py (+281/-0), test/registered/amd/perf/mi35x/test_glm5_mxfp4_perf_mi35x.py (+187/-0)

This PR only modifies nightly workflow definitions and adds new nightly test files. It does not change any runtime code, PR CI test files, or test infrastructure. None of the CI failures below are related to this PR.

AMD CI: 10 failures (0 likely related) | Others: 9 failures (0 related)

AMD CI Failures

| Job | Error | Related? | Explanation |
| --- | --- | --- | --- |
| stage-b-test-1-gpu-small-amd (2) | RuntimeError: Rank 0 scheduler died during initialization (exit code: -6) | 🟢 Unlikely | Scheduler crash in `test_lora_load_from_tensor.py`; unrelated to nightly workflow changes |
| stage-b-test-1-gpu-small-amd (4) | Server process exited with code -9 (OOM/SIGKILL) | 🟢 Unlikely | LLaDA2 model OOM in `test_llada2_mini_amd.py`; unrelated to this PR |
| stage-b-test-1-gpu-small-amd (8) | Timed out after 30 minutes (watchdog timeouts) | 🟢 Unlikely | Hang in `test_reasoning.py`; unrelated to this PR |
| stage-b-test-1-gpu-small-amd (10) | Timed out after 30 minutes (watchdog timeouts) | 🟢 Unlikely | Hang in `test_eval_accuracy_large.py`; unrelated to this PR |
| stage-b-test-1-gpu-small-amd (11) | HFRunner subprocess died with exit code 1 | 🟢 Unlikely | `test_multi_lora_backend.py` HFRunner crash; unrelated to this PR |
| stage-b-test-1-gpu-small-amd (12) | Server process exited with code -9 (OOM/SIGKILL) | 🟢 Unlikely | LLaDA2 model OOM in `test_llada2_mini.py`; unrelated to this PR |
| stage-b-test-1-gpu-small-amd (13) | Timed out after 30 minutes + HW Exception by GPU node-2, reason: GPU Hang | 🟢 Unlikely | GPU hardware hang; infrastructure issue, unrelated to this PR |
| stage-b-test-1-gpu-large-amd (1) | Timed out after 30 minutes (1800s test timeout) | 🟢 Unlikely | Hang in `test_bench_serving_1gpu_part2.py`; unrelated to this PR |
| stage-b-test-1-gpu-small-amd-nondeterministic | Memory access fault by GPU node-2 + Fatal Python error: Aborted | 🟢 Unlikely | GPU memory fault in `test_reward_models.py` during Qwen3 model; infrastructure/HW issue |
| stage-b-test-1-gpu-small-amd-mi35x | AssertionError: False is not true + timeout | 🟢 Unlikely | Streaming response empty in `test_gpt_oss_1gpu.py`; unrelated to nightly workflow changes |

Other CI Failures

| Job | Error | Related? | Explanation |
| --- | --- | --- | --- |
| stage-c-test-deepep-8-gpu-h200 | CUDA version mismatch (13.0 vs 12.8) | 🟢 Unlikely | CUDA version mismatch on runner during DeepEP install; infrastructure issue |
| stage-c-test-8-gpu-h200 (0) | Fast-fail: skipping; root cause: stage-c-test-deepep-8-gpu-h200 | 🟢 Unlikely | Cascade from deepep CUDA mismatch |
| stage-c-test-8-gpu-h200 (1) | Fast-fail: skipping; root cause: stage-c-test-deepep-8-gpu-h200 | 🟢 Unlikely | Cascade from deepep CUDA mismatch |
| stage-c-test-8-gpu-h200 (2) | Fast-fail: skipping; root cause: stage-c-test-deepep-8-gpu-h200 | 🟢 Unlikely | Cascade from deepep CUDA mismatch |
| stage-c-test-8-gpu-h200 (3) | Fast-fail: skipping; root cause: stage-c-test-deepep-8-gpu-h200 | 🟢 Unlikely | Cascade from deepep CUDA mismatch |
| stage-c-test-4-gpu-b200 (1) | Fast-fail: skipping; root cause: stage-c-test-deepep-8-gpu-h200 | 🟢 Unlikely | Cascade from deepep CUDA mismatch |
| stage-c-test-4-gpu-b200 (2) | Fast-fail: skipping; root cause: stage-c-test-deepep-8-gpu-h200 | 🟢 Unlikely | Cascade from deepep CUDA mismatch |
| stage-c-test-4-gpu-b200 (3) | Fast-fail: skipping; root cause: stage-c-test-deepep-8-gpu-h200 | 🟢 Unlikely | Cascade from deepep CUDA mismatch |
| build-test (all) | Failed to parse benchmark output in test_latency_fp8_qwen (Intel AMX) | 🟢 Unlikely | Intel AMX CPU backend test failure; unrelated to AMD nightly workflow changes |

Details

All 19 failures are unrelated to this PR. This PR only modifies nightly AMD workflow definitions (adding GLM-5-MXFP4 test jobs, removing old GLM-5 jobs, reorganizing GLM-5.1 jobs) and adds two new nightly test files. None of the changed files are executed during PR CI.

The failures fall into these pre-existing categories:

  • AMD OOM/crashes (shards 2, 4, 12): LLaDA2 and LoRA tests hitting OOM on MI325 runners
  • AMD timeouts/hangs (shards 8, 10, 13, large-1): Watchdog timeouts during reasoning, eval, perf, and observability tests
  • AMD hardware issues (shard 13, nondeterministic): GPU Hang and memory access fault — infrastructure problems
  • AMD MI35x (mi35x): Streaming response assertion failure in test_gpt_oss_1gpu.py
  • Nvidia CUDA mismatch (deepep + 7 cascades): Runner has CUDA 13.0 but PyTorch was compiled with CUDA 12.8
  • CPU backend (build-test): Intel AMX FP8 quantization benchmark parse failure

Verdict: No action needed from the PR author. All failures are pre-existing infrastructure or flaky test issues on main.

Generated by amd-bot using Claude Code CLI

@HaiShaw HaiShaw merged commit 39c6bf7 into main Apr 15, 2026
94 of 116 checks passed
@HaiShaw HaiShaw deleted the feat/glm5-mxfp4-mi35x-nightly branch April 15, 2026 01:55

4 participants