
GLM-5/5.1 MXFP4 Checkpoint Inference Compatibility Fix#22543

Merged
HaiShaw merged 3 commits into sgl-project:main from ColinZ22:GLM-5-Fix
Apr 14, 2026

Conversation

@ColinZ22
Contributor

Motivation

Addresses an issue with AMD Quark-quantized GLM-5 and GLM-5.1 MXFP4 checkpoints when used with SGLang (exclude-layer names don't match SGLang internal names, and weight shapes mismatch during MoE loading).

Modifications

  • Added packed_modules_mapping for DeepseekV2ForCausalLM
  • Added a guard to ensure only models with the "DeepseekV3ForCausalLM" architecture are fed through the quark_post_load_weights function (previously, GlmMoeDsaForCausalLM models broke here)
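
The guard described above can be sketched roughly as follows. This is an illustrative stand-in, not the actual SGLang code: the function name `maybe_quark_post_load` and its signature are hypothetical, and the real check lives inside deepseek_weight_loader.py.

```python
# Hypothetical sketch of the architecture guard; the real SGLang call site differs.
def maybe_quark_post_load(config, quant_method, run_post_load):
    """Run Quark MXFP4 post-load weight processing only for DeepseekV3.

    GLM checkpoints reuse the DeepSeek weight loader but declare a
    different architecture (e.g. GlmMoeDsaForCausalLM), so the DeepSeek
    specific weight transformation must be skipped for them.
    """
    architectures = getattr(config, "architectures", None) or []
    if quant_method == "quark" and "DeepseekV3ForCausalLM" in architectures:
        run_post_load()
        return True
    return False
```

With this shape, a GLM config simply falls through without touching the DeepSeek-specific weight transformation path.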

Accuracy Tests

Using lm_eval in SGLang

(Fixes amd/Quark#25)

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces several updates to the DeepSeek model implementation and server argument handling. Specifically, it adds a check to ensure only 'DeepseekV3ForCausalLM' models are processed during certain quantization steps, updates the packed module mapping for DeepSeek V2, and normalizes device strings by stripping indices. A review comment suggests adding a safety check when accessing the architectures list in the weight loader to prevent a potential IndexError.
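
The safety check the review suggests could look like the following sketch. The helper name `first_architecture` is hypothetical; the point is only to show how guarding the list access avoids an IndexError when `architectures` is missing or empty.

```python
# Illustrative only: guard against a missing or empty architectures list
# before indexing, to avoid a potential IndexError in the weight loader.
def first_architecture(hf_config):
    architectures = getattr(hf_config, "architectures", None)
    if not architectures:  # covers both None and an empty list
        return None
    return architectures[0]
```

Returning `None` (rather than raising) lets the caller fall back to a default code path for checkpoints that omit the field.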

Comment thread python/sglang/srt/models/deepseek_common/deepseek_weight_loader.py
michaelzhang-ai added a commit that referenced this pull request Apr 10, 2026
…uard

Cherry-pick critical fixes from PR #22543 (ColinZ22):

1. Add packed_modules_mapping for DeepseekV2ForCausalLM so Quark's
   should_ignore_layer() can resolve fused gate_up_proj -> [gate_proj,
   up_proj] against the exclude list

2. Guard quark_post_load_weights to only run on DeepseekV3ForCausalLM,
   preventing GlmMoeDsaForCausalLM from hitting the wrong weight
   transformation path

Co-authored-by: ColinZ22 <ColinZ22@users.noreply.github.com>
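
The fused-name resolution described in point 1 can be illustrated with a simplified stand-in for Quark's should_ignore_layer(); the real implementation lives in the Quark integration code and differs in detail.

```python
# Simplified stand-in for Quark's should_ignore_layer(); illustrative only.
PACKED_MODULES_MAPPING = {"gate_up_proj": ["gate_proj", "up_proj"]}

def should_ignore_layer(layer_name, exclude_list,
                        packed_mapping=PACKED_MODULES_MAPPING):
    """Return True if a (possibly fused) layer matches the exclude list.

    A fused name such as 'mlp.gate_up_proj' is expanded into its
    components ('mlp.gate_proj', 'mlp.up_proj') so checkpoint exclude
    lists written against unfused names still match.
    """
    prefix, _, leaf = layer_name.rpartition(".")
    candidates = [layer_name]
    for component in packed_mapping.get(leaf, []):
        candidates.append(f"{prefix}.{component}" if prefix else component)
    return any(name in exclude_list for name in candidates)
```

Without the mapping, SGLang's fused `gate_up_proj` name never matches a checkpoint exclude list written against the unfused `gate_proj`/`up_proj` names, which is exactly the mismatch this PR fixes.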
Comment thread python/sglang/srt/server_args.py Outdated
Comment thread python/sglang/srt/models/deepseek_common/deepseek_weight_loader.py
Comment thread python/sglang/srt/models/deepseek_v2.py Outdated
- packed_modules_mapping = {}
+ packed_modules_mapping = {
+     "gate_up_proj": ["gate_proj", "up_proj"],
+ }
Collaborator


This is the wrong place to introduce Quark-specific changes.
Please refer to _get_quantization_config() in model_loader/loader.py for the proper code change.

Contributor Author


Fixed

Collaborator


To be fair, this is not Quark-specific, as can be seen in various other models:

# Mapping from fused module names to their component weight names.
# Required for quantization configs (e.g., ModelOpt FP4) to correctly identify
# which layers should be skipped based on the exclude_modules/ignore list.
packed_modules_mapping = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}

class Llama4ForConditionalGeneration(nn.Module):
    packed_modules_mapping = {
        "qkv_proj": ["q_proj", "k_proj", "v_proj"],
        "gate_up_proj": ["gate_proj", "up_proj"],
    }

michaelzhang-ai added a commit that referenced this pull request Apr 11, 2026
Add nightly CI tests for amd/GLM-5-MXFP4 (Quark MXFP4 quantized) on
MI35x GPUs with accuracy (GSM8K) and performance (bench_one_batch)
benchmarks, plus engine fixes to enable Quark MXFP4 on GlmMoeDsaForCausalLM.

Engine fixes (cherry-picked from PR #22543 by ColinZ22):
- Add packed_modules_mapping to DeepseekV2ForCausalLM for Quark
  exclude-layer name resolution (gate_up_proj -> [gate_proj, up_proj])
- Guard quark_post_load_weights to only run on DeepseekV3ForCausalLM

Test files:
- test/registered/amd/accuracy/mi35x/test_glm5_mxfp4_eval_mi35x.py
- test/registered/amd/perf/mi35x/test_glm5_mxfp4_perf_mi35x.py

Workflow: combined accuracy+perf jobs in nightly-test-amd.yml and
nightly-test-amd-rocm720.yml

Verified: GSM8K accuracy 0.93+ on MI35x (run #7 passed)
https://github.com/sgl-project/sglang/actions/runs/24268460251

Co-authored-by: ColinZ22 <ColinZ22@users.noreply.github.com>
@michaelzhang-ai michaelzhang-ai requested a review from HaiShaw April 13, 2026 18:56
michaelzhang-ai added a commit that referenced this pull request Apr 13, 2026
Add nightly CI tests for amd/GLM-5-MXFP4 (Quark MXFP4 quantized) on
MI35x GPUs with accuracy (GSM8K) and performance (bench_one_batch)
benchmarks, plus engine fixes to enable Quark MXFP4 on GlmMoeDsaForCausalLM.

Engine fixes (aligned with PR #22543 by ColinZ22, per HaiShaw review):
- loader.py: Add packed_modules_mapping for Quark in _get_quantization_config()
- deepseek_weight_loader.py: Guard quark_post_load_weights to DeepseekV3 only

Test files:
- test/registered/amd/accuracy/mi35x/test_glm5_mxfp4_eval_mi35x.py
- test/registered/amd/perf/mi35x/test_glm5_mxfp4_perf_mi35x.py

Workflow: combined accuracy+perf jobs in nightly-test-amd.yml and
nightly-test-amd-rocm720.yml

Co-authored-by: ColinZ22 <ColinZ22@users.noreply.github.com>
@michaelzhang-ai
Collaborator

LGTM

michaelzhang-ai added a commit that referenced this pull request Apr 14, 2026
…ks for MI30x and MI35x

Add nightly CI tests for amd/GLM-5.1-MXFP4 (408B MoE, Quark MXFP4) on
both MI30x and MI35x GPUs. Includes pre-download step for the 425GB model.

Depends on PR #22543 for engine fixes (cherry-picked above):
- packed_modules_mapping for Quark exclude-layer resolution
- Guard quark_post_load_weights to only run on DeepseekV3ForCausalLM
@HaiShaw
Collaborator

HaiShaw commented Apr 14, 2026

/tag-and-rerun-ci

@HaiShaw
Collaborator

HaiShaw commented Apr 14, 2026

@amd-bot ci-status

@amd-bot

amd-bot commented Apr 14, 2026

@HaiShaw

CI Status for PR #22543

PR: GLM-5/5.1 MXFP4 Checkpoint Inference Compatibility Fix
Changed files: python/sglang/srt/model_loader/loader.py (+3/-0), python/sglang/srt/models/deepseek_common/deepseek_weight_loader.py (+3/-0), python/sglang/srt/server_args.py (+2/-0)

| Job | Error | Related? | Explanation |
| --- | --- | --- | --- |
| stage-b-test-1-gpu-large (3) | libcudart.so.13: cannot open shared object file in flashinfer JIT cache | 🟢 Unlikely | CUDA runtime version mismatch in the CI runner image; flashinfer's quantization.so was compiled against a newer CUDA than installed. Unrelated to PR changes. |
| stage-b-test-1-gpu-large (4,5,6,7,8,9,10,11,12,13) | Fast-fail: skipping; root-cause job: stage-b-test-1-gpu-large (3) | 🟢 Unlikely | Cascade fast-fail triggered by shard 3's failure above. No tests ran. |
| stage-b-test-1-gpu-small (3,4,5,6,7) | Fast-fail: skipping; root-cause job: stage-b-test-1-gpu-large (3) | 🟢 Unlikely | Cascade fast-fail triggered by shard 3's failure above. No tests ran. |
| wait-for-stage-b | Upstream stage-b jobs failed | 🟢 Unlikely | Gate job; fails because upstream jobs failed. |
| pr-test-finish | Upstream jobs failed | 🟢 Unlikely | Aggregation gate job. |
| build-and-test (XPU) | TimeoutError: Server failed to start within the timeout period in test_deepseek_ocr_triton.py | 🟢 Unlikely | XPU triton attention backend timed out after 600s loading DeepSeek-OCR. The non-triton variant passed. Unrelated to the PR's quark/GLM changes. |
| finish (XPU) | Upstream build-and-test failed | 🟢 Unlikely | Aggregation gate job. |
| stage-b-test-1-gpu-small-amd (shard 9) | Step timeout (30 min) exceeded by ~6s; all 8/8 tests passed | 🟢 Unlikely | All tests passed, but JIT kernel compilation overhead on MI325 pushed wall time to 1806s vs the 1800s limit. |
| stage-b-test-2-gpu-large-amd (shard 1) | 429 Too Many Requests from the HuggingFace API in test_hicache_storage_file_backend.py | 🟢 Unlikely | HuggingFace rate limit (3000 req/5min) hit during tokenizer download. Infrastructure issue. |
| stage-b-test-1-gpu-small-amd-mi35x | TIMEOUT: test_gpt_oss_1gpu.py after 1200 seconds | 🟢 Unlikely | MI35X hardware too slow for MXFP4 GPT-OSS 20B eval; the test was still generating tokens when killed. |
| pr-test-amd-finish | Upstream AMD jobs failed | 🟢 Unlikely | Aggregation gate job. |

Details

This PR makes 3 small, targeted changes:

  1. loader.py: Adds gate_up_proj to packed_modules_mapping when quantization is "quark" (GLM model compatibility)
  2. deepseek_weight_loader.py: Adds architecture check (DeepseekV3ForCausalLM) to skip quark MXFP4 post-load weight processing for non-DeepSeek models like GLM
  3. server_args.py: Strips device index from --device argument (e.g., "cuda:0" -> "cuda")
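
Change 3 (stripping the device index) might look like the following sketch; the helper name `normalize_device` is hypothetical and the actual server_args.py code is not shown in this thread.

```python
# Illustrative sketch of device-string normalization; the actual
# server_args.py implementation may differ.
def normalize_device(device: str) -> str:
    """Strip an index suffix from a device string, e.g. 'cuda:0' -> 'cuda'."""
    return device.split(":", 1)[0]
```

This keeps downstream code that compares against bare device types ("cuda", "hip", "cpu") working even when the user passes an indexed form.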

None of the 5 distinct root-cause failures are related to this PR:

  1. NVIDIA flashinfer libcudart.so.13 crash: the CI runner image has a CUDA runtime mismatch. The failing test (test_eagle_infer_b.py) tests EAGLE speculative decoding with Llama-2, which has nothing to do with quark quantization or GLM models. This failure cascaded to 13+ other NVIDIA jobs via fast-fail.

  2. XPU triton timeout: test_deepseek_ocr_triton.py timed out starting a DeepSeek-OCR server on XPU with the triton backend. The PR doesn't touch XPU or triton attention code.

  3. AMD MI325 step timeout: all 8 tests passed, but the 30-minute step timeout was exceeded by 6 seconds due to JIT compilation overhead. Not a code issue.

  4. AMD HuggingFace 429 rate limit: test_hicache_storage_file_backend.py failed because CI hit HuggingFace's API rate limit. Not a code issue.

  5. AMD MI35X test timeout: test_gpt_oss_1gpu.py was still actively processing when the 1200s per-file timeout hit. Hardware speed issue on MI35X.

Verdict: All failures are pre-existing infrastructure issues. No action needed from the PR author.

Generated by amd-bot using Claude Code CLI

@HaiShaw HaiShaw merged commit b10f852 into sgl-project:main Apr 14, 2026
152 of 192 checks passed
michaelzhang-ai added a commit that referenced this pull request Apr 14, 2026
Add test files for amd/GLM-5.1-MXFP4 (408B MoE, Quark MXFP4) on MI30x
and MI35x. Workflow jobs and pre-download steps are already on main.
Engine fixes from PR #22543 are already merged.

Test suites:
- nightly-amd-accuracy-8-gpu-glm51-mxfp4 (MI30x accuracy)
- nightly-amd-8-gpu-mi35x-glm51-mxfp4 (MI35x accuracy)
- nightly-perf-8-gpu-glm51-mxfp4 (MI30x perf)
- nightly-perf-8-gpu-mi35x-glm51-mxfp4 (MI35x perf)
michaelzhang-ai added a commit that referenced this pull request Apr 18, 2026
…ks for MI30x and MI35x

Add nightly CI tests for amd/GLM-5.1-MXFP4 (408B MoE, Quark MXFP4) on
MI30x and MI35x. Includes pre-download step (120min timeout) to cache
the 425GB model on persistent runner storage before server start.

Test files:
- test/registered/amd/accuracy/mi30x/test_glm51_mxfp4_eval_amd.py
- test/registered/amd/accuracy/mi35x/test_glm51_mxfp4_eval_mi35x.py
- test/registered/amd/perf/mi30x/test_glm51_mxfp4_perf_amd.py
- test/registered/amd/perf/mi35x/test_glm51_mxfp4_perf_mi35x.py

Workflow: MI30x + MI35x jobs for default ROCm and ROCm 7.2 with
accuracy + perf steps, dropdown entries, and check-all-jobs needs.

Engine fixes already merged via PR #22543.
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026

Development

Successfully merging this pull request may close these issues.

GLM-5-MXFP4 checkpoint incompatible with SGLang: exclude-layer naming and shared expert fusion

5 participants