
GLM-5/5.1 MXFP4 Checkpoint Inference Compatibility Fix#22543

Merged
HaiShaw merged 3 commits into sgl-project:main from ColinZ22:GLM-5-Fix
Apr 14, 2026

Conversation

@ColinZ22
Contributor

Motivation

Addresses an issue with AMD Quark-quantized GLM-5 and GLM-5.1 MXFP4 checkpoints when used with SGLang (exclude-layer names don't match SGLang internal names, and weight shapes mismatch during MoE loading).

Modifications

  • Added packed_modules_mapping for DeepseekV2ForCausalLM
  • Added a guard to ensure only models with the "DeepseekV3ForCausalLM" architecture are fed through the quark_post_load_weights function (previously, GlmMoeDsaForCausalLM models broke here)
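
The guard described above can be sketched roughly as follows. This is an illustrative stand-in, not the actual SGLang code: the function name `maybe_quark_post_load` and its signature are hypothetical, and the real check lives inside deepseek_weight_loader.py.

```python
# Hypothetical sketch of the architecture guard; the real SGLang call site differs.
def maybe_quark_post_load(config, quant_method, run_post_load):
    """Run Quark MXFP4 post-load weight processing only for DeepseekV3.

    GLM checkpoints reuse the DeepSeek weight loader but declare a
    different architecture (e.g. GlmMoeDsaForCausalLM), so the DeepSeek
    specific weight transformation must be skipped for them.
    """
    architectures = getattr(config, "architectures", None) or []
    if quant_method == "quark" and "DeepseekV3ForCausalLM" in architectures:
        run_post_load()
        return True
    return False
```

With this shape, a GLM config simply falls through without touching the DeepSeek-specific weight transformation path.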

Accuracy Tests

Using lm_eval in SGLang

(Fixes amd/Quark#25)

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces several updates to the DeepSeek model implementation and server argument handling. Specifically, it adds a check to ensure only 'DeepseekV3ForCausalLM' models are processed during certain quantization steps, updates the packed module mapping for DeepSeek V2, and normalizes device strings by stripping indices. A review comment suggests adding a safety check when accessing the architectures list in the weight loader to prevent a potential IndexError.
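
The safety check the review suggests could look like the following sketch. The helper name `first_architecture` is hypothetical; the point is only to show how guarding the list access avoids an IndexError when `architectures` is missing or empty.

```python
# Illustrative only: guard against a missing or empty architectures list
# before indexing, to avoid a potential IndexError in the weight loader.
def first_architecture(hf_config):
    architectures = getattr(hf_config, "architectures", None)
    if not architectures:  # covers both None and an empty list
        return None
    return architectures[0]
```

Returning `None` (rather than raising) lets the caller fall back to a default code path for checkpoints that omit the field.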

Comment thread python/sglang/srt/models/deepseek_common/deepseek_weight_loader.py
michaelzhang-ai added a commit that referenced this pull request Apr 10, 2026
…uard

Cherry-pick critical fixes from PR #22543 (ColinZ22):

1. Add packed_modules_mapping for DeepseekV2ForCausalLM so Quark's
   should_ignore_layer() can resolve fused gate_up_proj -> [gate_proj,
   up_proj] against the exclude list

2. Guard quark_post_load_weights to only run on DeepseekV3ForCausalLM,
   preventing GlmMoeDsaForCausalLM from hitting the wrong weight
   transformation path

Co-authored-by: ColinZ22 <ColinZ22@users.noreply.github.com>
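
The fused-name resolution described in point 1 can be illustrated with a simplified stand-in for Quark's should_ignore_layer(); the real implementation lives in the Quark integration code and differs in detail.

```python
# Simplified stand-in for Quark's should_ignore_layer(); illustrative only.
PACKED_MODULES_MAPPING = {"gate_up_proj": ["gate_proj", "up_proj"]}

def should_ignore_layer(layer_name, exclude_list,
                        packed_mapping=PACKED_MODULES_MAPPING):
    """Return True if a (possibly fused) layer matches the exclude list.

    A fused name such as 'mlp.gate_up_proj' is expanded into its
    components ('mlp.gate_proj', 'mlp.up_proj') so checkpoint exclude
    lists written against unfused names still match.
    """
    prefix, _, leaf = layer_name.rpartition(".")
    candidates = [layer_name]
    for component in packed_mapping.get(leaf, []):
        candidates.append(f"{prefix}.{component}" if prefix else component)
    return any(name in exclude_list for name in candidates)
```

Without the mapping, SGLang's fused `gate_up_proj` name never matches a checkpoint exclude list written against the unfused `gate_proj`/`up_proj` names, which is exactly the mismatch this PR fixes.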
Comment thread python/sglang/srt/server_args.py Outdated
Comment thread python/sglang/srt/models/deepseek_common/deepseek_weight_loader.py
Comment thread python/sglang/srt/models/deepseek_v2.py Outdated
- packed_modules_mapping = {}
+ packed_modules_mapping = {
+     "gate_up_proj": ["gate_proj", "up_proj"],
+ }
Collaborator


This is the wrong place to introduce Quark-specific changes.
Please refer to _get_quantization_config() in model_loader/loader.py for the proper code change.

Contributor Author


Fixed

Collaborator


To be fair, this is not Quark-specific, as can be seen in various other models:

# Mapping from fused module names to their component weight names.
# Required for quantization configs (e.g., ModelOpt FP4) to correctly identify
# which layers should be skipped based on the exclude_modules/ignore list.
packed_modules_mapping = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}

class Llama4ForConditionalGeneration(nn.Module):
    packed_modules_mapping = {
        "qkv_proj": ["q_proj", "k_proj", "v_proj"],
        "gate_up_proj": ["gate_proj", "up_proj"],
    }

michaelzhang-ai added a commit that referenced this pull request Apr 11, 2026
Add nightly CI tests for amd/GLM-5-MXFP4 (Quark MXFP4 quantized) on
MI35x GPUs with accuracy (GSM8K) and performance (bench_one_batch)
benchmarks, plus engine fixes to enable Quark MXFP4 on GlmMoeDsaForCausalLM.

Engine fixes (cherry-picked from PR #22543 by ColinZ22):
- Add packed_modules_mapping to DeepseekV2ForCausalLM for Quark
  exclude-layer name resolution (gate_up_proj -> [gate_proj, up_proj])
- Guard quark_post_load_weights to only run on DeepseekV3ForCausalLM

Test files:
- test/registered/amd/accuracy/mi35x/test_glm5_mxfp4_eval_mi35x.py
- test/registered/amd/perf/mi35x/test_glm5_mxfp4_perf_mi35x.py

Workflow: combined accuracy+perf jobs in nightly-test-amd.yml and
nightly-test-amd-rocm720.yml

Verified: GSM8K accuracy 0.93+ on MI35x (run #7 passed)
https://github.com/sgl-project/sglang/actions/runs/24268460251

Co-authored-by: ColinZ22 <ColinZ22@users.noreply.github.com>
@michaelzhang-ai michaelzhang-ai requested a review from HaiShaw April 13, 2026 18:56
michaelzhang-ai added a commit that referenced this pull request Apr 13, 2026
Add nightly CI tests for amd/GLM-5-MXFP4 (Quark MXFP4 quantized) on
MI35x GPUs with accuracy (GSM8K) and performance (bench_one_batch)
benchmarks, plus engine fixes to enable Quark MXFP4 on GlmMoeDsaForCausalLM.

Engine fixes (aligned with PR #22543 by ColinZ22, per HaiShaw review):
- loader.py: Add packed_modules_mapping for Quark in _get_quantization_config()
- deepseek_weight_loader.py: Guard quark_post_load_weights to DeepseekV3 only

Test files:
- test/registered/amd/accuracy/mi35x/test_glm5_mxfp4_eval_mi35x.py
- test/registered/amd/perf/mi35x/test_glm5_mxfp4_perf_mi35x.py

Workflow: combined accuracy+perf jobs in nightly-test-amd.yml and
nightly-test-amd-rocm720.yml

Co-authored-by: ColinZ22 <ColinZ22@users.noreply.github.com>
@michaelzhang-ai
Collaborator

LGTM

michaelzhang-ai added a commit that referenced this pull request Apr 14, 2026
…ks for MI30x and MI35x

Add nightly CI tests for amd/GLM-5.1-MXFP4 (408B MoE, Quark MXFP4) on
both MI30x and MI35x GPUs. Includes pre-download step for the 425GB model.

Depends on PR #22543 for engine fixes (cherry-picked above):
- packed_modules_mapping for Quark exclude-layer resolution
- Guard quark_post_load_weights to only run on DeepseekV3ForCausalLM
@HaiShaw
Collaborator

HaiShaw commented Apr 14, 2026

/tag-and-rerun-ci

@HaiShaw
Collaborator

HaiShaw commented Apr 14, 2026

@amd-bot ci-status

@amd-bot

amd-bot commented Apr 14, 2026

@HaiShaw

CI Status for PR #22543

PR: GLM-5/5.1 MXFP4 Checkpoint Inference Compatibility Fix
Changed files: python/sglang/srt/model_loader/loader.py (+3/-0), python/sglang/srt/models/deepseek_common/deepseek_weight_loader.py (+3/-0), python/sglang/srt/server_args.py (+2/-0)

| Job | Error | Related? | Explanation |
| --- | --- | --- | --- |
| stage-b-test-1-gpu-large (3) | libcudart.so.13: cannot open shared object file in flashinfer JIT cache | 🟢 Unlikely | CUDA runtime version mismatch in the CI runner image; flashinfer's quantization.so was compiled against a newer CUDA than installed. Unrelated to PR changes. |
| stage-b-test-1-gpu-large (4,5,6,7,8,9,10,11,12,13) | Fast-fail: skipping; root-cause job: stage-b-test-1-gpu-large (3) | 🟢 Unlikely | Cascade fast-fail triggered by shard 3's failure above. No tests ran. |
| stage-b-test-1-gpu-small (3,4,5,6,7) | Fast-fail: skipping; root-cause job: stage-b-test-1-gpu-large (3) | 🟢 Unlikely | Cascade fast-fail triggered by shard 3's failure above. No tests ran. |
| wait-for-stage-b | Upstream stage-b jobs failed | 🟢 Unlikely | Gate job; fails because upstream jobs failed. |
| pr-test-finish | Upstream jobs failed | 🟢 Unlikely | Aggregation gate job. |
| build-and-test (XPU) | TimeoutError: Server failed to start within the timeout period in test_deepseek_ocr_triton.py | 🟢 Unlikely | XPU triton attention backend timed out after 600s loading DeepSeek-OCR. The non-triton variant passed. Unrelated to the PR's quark/GLM changes. |
| finish (XPU) | Upstream build-and-test failed | 🟢 Unlikely | Aggregation gate job. |
| stage-b-test-1-gpu-small-amd (shard 9) | Step timeout (30 min) exceeded by ~6s; all 8/8 tests passed | 🟢 Unlikely | All tests passed, but JIT kernel compilation overhead on MI325 pushed wall time to 1806s vs the 1800s limit. |
| stage-b-test-2-gpu-large-amd (shard 1) | 429 Too Many Requests from the HuggingFace API in test_hicache_storage_file_backend.py | 🟢 Unlikely | HuggingFace rate limit (3000 req/5min) hit during tokenizer download. Infrastructure issue. |
| stage-b-test-1-gpu-small-amd-mi35x | TIMEOUT: test_gpt_oss_1gpu.py after 1200 seconds | 🟢 Unlikely | MI35X hardware too slow for MXFP4 GPT-OSS 20B eval; the test was still generating tokens when killed. |
| pr-test-amd-finish | Upstream AMD jobs failed | 🟢 Unlikely | Aggregation gate job. |

Details

This PR makes 3 small, targeted changes:

  1. loader.py: Adds gate_up_proj to packed_modules_mapping when quantization is "quark" (GLM model compatibility)
  2. deepseek_weight_loader.py: Adds architecture check (DeepseekV3ForCausalLM) to skip quark MXFP4 post-load weight processing for non-DeepSeek models like GLM
  3. server_args.py: Strips device index from --device argument (e.g., "cuda:0" -> "cuda")
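
Change 3 (stripping the device index) might look like the following sketch; the helper name `normalize_device` is hypothetical and the actual server_args.py code is not shown in this thread.

```python
# Illustrative sketch of device-string normalization; the actual
# server_args.py implementation may differ.
def normalize_device(device: str) -> str:
    """Strip an index suffix from a device string, e.g. 'cuda:0' -> 'cuda'."""
    return device.split(":", 1)[0]
```

This keeps downstream code that compares against bare device types ("cuda", "hip", "cpu") working even when the user passes an indexed form.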

None of the 5 distinct root-cause failures are related to this PR:

  1. NVIDIA flashinfer libcudart.so.13 crash: the CI runner image has a CUDA runtime mismatch. The failing test (test_eagle_infer_b.py) tests EAGLE speculative decoding with Llama-2, which has nothing to do with quark quantization or GLM models. This failure cascaded to 13+ other NVIDIA jobs via fast-fail.

  2. XPU triton timeout: test_deepseek_ocr_triton.py timed out starting a DeepSeek-OCR server on XPU with the triton backend. The PR doesn't touch XPU or triton attention code.

  3. AMD MI325 step timeout: all 8 tests passed, but the 30-minute step timeout was exceeded by 6 seconds due to JIT compilation overhead. Not a code issue.

  4. AMD HuggingFace 429 rate limit: test_hicache_storage_file_backend.py failed because CI hit HuggingFace's API rate limit. Not a code issue.

  5. AMD MI35X test timeout: test_gpt_oss_1gpu.py was still actively processing when the 1200s per-file timeout hit. Hardware speed issue on MI35X.

Verdict: All failures are pre-existing infrastructure issues. No action needed from the PR author.

Generated by amd-bot using Claude Code CLI

@HaiShaw HaiShaw merged commit b10f852 into sgl-project:main Apr 14, 2026
152 of 192 checks passed
michaelzhang-ai added a commit that referenced this pull request Apr 14, 2026
Add test files for amd/GLM-5.1-MXFP4 (408B MoE, Quark MXFP4) on MI30x
and MI35x. Workflow jobs and pre-download steps are already on main.
Engine fixes from PR #22543 are already merged.

Test suites:
- nightly-amd-accuracy-8-gpu-glm51-mxfp4 (MI30x accuracy)
- nightly-amd-8-gpu-mi35x-glm51-mxfp4 (MI35x accuracy)
- nightly-perf-8-gpu-glm51-mxfp4 (MI30x perf)
- nightly-perf-8-gpu-mi35x-glm51-mxfp4 (MI35x perf)
michaelzhang-ai added a commit that referenced this pull request Apr 18, 2026
…ks for MI30x and MI35x

Add nightly CI tests for amd/GLM-5.1-MXFP4 (408B MoE, Quark MXFP4) on
MI30x and MI35x. Includes pre-download step (120min timeout) to cache
the 425GB model on persistent runner storage before server start.

Test files:
- test/registered/amd/accuracy/mi30x/test_glm51_mxfp4_eval_amd.py
- test/registered/amd/accuracy/mi35x/test_glm51_mxfp4_eval_mi35x.py
- test/registered/amd/perf/mi30x/test_glm51_mxfp4_perf_amd.py
- test/registered/amd/perf/mi35x/test_glm51_mxfp4_perf_mi35x.py

Workflow: MI30x + MI35x jobs for default ROCm and ROCm 7.2 with
accuracy + perf steps, dropdown entries, and check-all-jobs needs.

Engine fixes already merged via PR #22543.
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026

Development

Successfully merging this pull request may close these issues.

GLM-5-MXFP4 checkpoint incompatible with SGLang: exclude-layer naming and shared expert fusion

5 participants