
[AMD] Enable MTP for GLM-5-mxfp4 model #23219

Merged
HaiShaw merged 2 commits into sgl-project:main from 1am9trash:fix-glm-5-mxfp4-mtp on Apr 20, 2026

Conversation

@1am9trash (Collaborator) commented Apr 20, 2026

Motivation

Fixes #23142.

Quark-quantized GLM-5-MXFP4 checkpoints store MTP (NextN) weights — including eh_proj — in FP4-packed format. The existing code always creates eh_proj as nn.Linear, causing a shape mismatch ([6144, 6144] vs [6144, 12288]) during weight loading.

Modifications

  • In DeepseekV3ForCausalLMNextN.__init__, check whether the MTP layer is listed in quark's exclude_layers. If so, set quant_config to None so the NextN model falls back to bf16 parameters.
  • In DeepseekModelNextN.__init__, use ReplicatedLinear instead of nn.Linear for eh_proj when the quant config is quark, so that QuarkLinearMethod creates parameters matching the FP4-packed checkpoint format.
  • Update forward() to handle the (output, output_bias) tuple returned by ReplicatedLinear.
  • The existing modelopt_fp4 (NVFP4) code path is unchanged.
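The three modifications above can be sketched as follows. This is an illustrative, self-contained sketch — not the actual sglang code: `QuantConfig`, `ReplicatedLinear`, and the helper names below are simplified stand-ins for the real sglang/quark types, and the real `ReplicatedLinear` runs a (possibly quantized) GEMM rather than echoing its input.

```python
class QuantConfig:
    """Minimal stand-in for a quark quantization config."""
    def __init__(self, name, exclude_layers):
        self._name = name
        self.exclude_layers = exclude_layers

    def get_name(self):
        return self._name


class ReplicatedLinear:
    """Stand-in for sglang's ReplicatedLinear; its forward returns a
    (output, output_bias) tuple, unlike nn.Linear."""
    def __init__(self, in_features, out_features, quant_config=None):
        self.in_features = in_features
        self.out_features = out_features
        self.quant_config = quant_config

    def __call__(self, x):
        # The real layer would apply a quantized linear; here we echo input.
        return x, None  # (output, output_bias)


def resolve_nextn_quant_config(quant_config, mtp_layer_name):
    """Change 1: if quark lists the MTP layer in exclude_layers, drop the
    quant config so NextN parameters fall back to bf16."""
    if (quant_config is not None
            and quant_config.get_name() == "quark"
            and mtp_layer_name in quant_config.exclude_layers):
        return None
    return quant_config


def build_eh_proj(hidden_size, quant_config):
    """Change 2: under quark, build eh_proj as ReplicatedLinear so the
    quantized linear method can create FP4-packed parameters.
    eh_proj projects 2*hidden_size -> hidden_size (e.g. 12288 -> 6144)."""
    if quant_config is not None and quant_config.get_name() == "quark":
        return ReplicatedLinear(2 * hidden_size, hidden_size,
                                quant_config=quant_config)
    # nn.Linear path elided; the sketch reuses ReplicatedLinear for brevity.
    return ReplicatedLinear(2 * hidden_size, hidden_size)


def forward_eh_proj(eh_proj, hidden_states):
    """Change 3: unpack the (output, output_bias) tuple from ReplicatedLinear."""
    output, _output_bias = eh_proj(hidden_states)
    return output
```

Under this sketch, an excluded MTP layer yields `quant_config=None` (bf16 fallback), while a quark config routes `eh_proj` through the tuple-returning linear, matching the `[6144, 12288]` checkpoint shape mentioned above.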

Accuracy Tests

GLM-5.1-mxfp4 launch cmd:

```shell
python3 -m sglang.launch_server \
    --model-path /data/models/GLM-5-mxfp4/ \
    --port 9000 --trust-remote-code --tp 8 --chunked-prefill-size 131072 \
    --disable-radix-cache \
    --mem-fraction-static 0.85 \
    --model-loader-extra-config '{"enable_multithread_load": true}' \
    --watchdog-timeout 1200 \
    --reasoning-parser glm45 \
    --tool-call-parser glm47 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4
```
| Model | Accept Len | GSM8K acc |
| --- | --- | --- |
| glm-5.1-mxfp4 (mtp) | 2.813, 2.926, 2.960 | 0.941 |
| glm-5-fp8 (mtp) | 3.180, 3.122, 3.084 | 0.954 |
| dpsk-r1-0528-mxfp4 (mtp) | 2.738, 2.860, 2.813 | 0.942 |

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.


@1am9trash 1am9trash changed the title [AMD][DO-NOT-MERGE] Enable MTP for GLM-5-mxfp4 model [AMD] Enable MTP for GLM-5-mxfp4 model Apr 20, 2026

HaiShaw commented Apr 20, 2026

/tag-and-rerun-ci


HaiShaw commented Apr 20, 2026

@amd-bot ci-status


amd-bot commented Apr 20, 2026

@HaiShaw

CI Status for PR #23219

PR: [AMD] Enable MTP for GLM-5-mxfp4 model
Changed files: python/sglang/srt/models/deepseek_nextn.py (+41/-15)

AMD: 4 failures (0 likely related) | Others: 10 failures (0 likely related)

This PR only modifies deepseek_nextn.py to support quark quantization in the NextN MTP layer (using ReplicatedLinear instead of nn.Linear for eh_proj, and checking quark's exclude_layers). None of the failing tests exercise DeepSeek NextN with quark quantization.

AMD CI Failures

| Job | Test File | Test Function | Error | Related? | Explanation |
| --- | --- | --- | --- | --- | --- |
| stage-b-test-1-gpu-small-amd (5) | test/registered/scheduler/test_mixed_chunked_prefill.py | setUpClass | Server exit -9 (model download failure) | 🟢 Unlikely | Llama-3.1-8B server crash; unrelated model/codepath |
| stage-b-test-1-gpu-small-amd (10) | test/registered/tokenizer/test_multi_tokenizer.py | test_multi_tokenizer_ttft | 88.07 not less than 86 (TTFT threshold) | 🟢 Unlikely | Marginal perf flake on tokenizer TTFT; unrelated to model code |
| stage-b-test-1-gpu-small-amd-mi35x | test/registered/core/test_gpt_oss_1gpu.py | test_mxfp4_20b | Empty streaming response for MXFP4 model | 🟢 Unlikely | Tests GPT-OSS-20B (not DeepSeek NextN); different model and quant path |
| stage-b-test-1-gpu-small-amd-nondeterministic | test/registered/models/test_vlm_models.py | test_vlm_mmmu_benchmark | Accuracy 0.336 < 0.400 threshold | 🟢 Unlikely | VLM accuracy benchmark for MiniCPM-V-2_6; unrelated model |

Other CI Failures

| Job | Test File | Test Function | Error | Related? | Explanation |
| --- | --- | --- | --- | --- | --- |
| stage-b-test-1-gpu-large (3) | test/registered/dllm/test_llada2_mini.py | setUpClass | tvm/ffi/container/tensor.h: No such file (flashinfer JIT fail) | 🟢 Unlikely | Flashinfer JIT compilation env issue; unrelated to model code |
| stage-b-test-1-gpu-large (7,8,9,10,11,12,13) | N/A | N/A | Fast-fail cascade from shard 3 | 🟢 Unlikely | Did not run any tests; killed by CI fast-fail gate |
| stage-b-test-4-gpu-b200 | N/A | N/A | Fast-fail cascade from shard 3 | 🟢 Unlikely | Did not run any tests; killed by CI fast-fail gate |
| build-and-test (XPU) | test/srt/xpu/test_deepseek_ocr.py | setUpClass | ImportError: matplotlib not installed | 🟢 Unlikely | XPU Docker image missing dependency; unrelated to this PR |
| multimodal-gen-test-8-npu-a3 | test/server/ascend/test_server_8_npu.py | test_diffusion_generation[wan2_2_t2v_14b_w8a8_8npu] | DecodingStage 1205ms vs 233ms expected | 🟢 Unlikely | NPU diffusion perf regression; unrelated to DeepSeek NextN model |

Details

All failures are unrelated to this PR. The PR makes a narrowly-scoped change to deepseek_nextn.py that only activates when quant_config.get_name() == "quark". No failing test loads a DeepSeek model with quark quantization. The failures break down as:

  • Infrastructure issues: flashinfer missing TVM headers (shard 3 + 7 cascading fast-fails), XPU image missing matplotlib, server crash on model download (shard 5)
  • Performance flakes: TTFT threshold barely missed (88ms vs 86ms limit), VLM accuracy below threshold (33.6% vs 40%), NPU diffusion decoding stage 5x slower than expected
  • Pre-existing AMD issue: MXFP4 GPT-OSS-20B returning empty streaming response on MI35X

These are safe to ignore for this PR — they are pre-existing CI issues or environment problems on main.

Generated by amd-bot using Claude Code CLI

@HaiShaw HaiShaw merged commit 57ecce9 into sgl-project:main Apr 20, 2026
139 of 169 checks passed
zhangying098 pushed a commit to zhangying098/sglang that referenced this pull request Apr 23, 2026
kyx1999 pushed a commit to KMSorSMS/sglang that referenced this pull request Apr 27, 2026

Development

Successfully merging this pull request may close these issues.

[Bug] glm5 mxfp4 mtp is broken

3 participants