[AMD] Enable MTP for GLM-5-mxfp4 model#23219
Conversation
> **Warning** You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
`/tag-and-rerun-ci`

`@amd-bot ci-status`
CI Status for PR #23219
PR: [AMD] Enable MTP for GLM-5-mxfp4 model
AMD: 4 failures (0 likely related) | Others: 10 failures (0 likely related)
This PR only modifies AMD …

AMD CI Failures / Other CI Failures (details): All failures are unrelated to this PR. The PR makes a narrowly-scoped change to … These are safe to ignore for this PR; they are pre-existing CI issues or environment problems on …
Motivation
Fix #23142.
Quark-quantized GLM-5-MXFP4 checkpoints store MTP (NextN) weights, including `eh_proj`, in FP4-packed format. The existing code always creates `eh_proj` as `nn.Linear`, causing a shape mismatch (`[6144, 6144]` vs `[6144, 12288]`) during weight loading.

Modifications
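The modifications described below boil down to a quant-config fallback for the MTP layer. As a hedged, self-contained sketch (the helper name `resolve_nextn_quant_config` and the prefix-matching rule are assumptions based on this description, not the actual sglang code):

```python
# Illustrative sketch only: the real check lives in sglang's quark integration.
def resolve_nextn_quant_config(quant_config, mtp_layer_prefix):
    """Return None (bf16 fallback) when the MTP layer is excluded
    from quark quantization via exclude_layers."""
    if quant_config is None:
        return None
    excluded = getattr(quant_config, "exclude_layers", None) or []
    if any(mtp_layer_prefix.startswith(pattern) for pattern in excluded):
        return None  # NextN model falls back to bf16 parameters
    return quant_config
```

When the MTP layer is not excluded, the quark config is kept and `eh_proj` is built as `ReplicatedLinear`, so `QuarkLinearMethod` allocates FP4-packed parameters; its forward pass then returns an `(output, output_bias)` tuple that the caller unpacks.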
- In `DeepseekV3ForCausalLMNextN.__init__`, check whether the MTP layer is listed in quark's `exclude_layers`. If so, set `quant_config` to `None` so the NextN model falls back to bf16 parameters.
- In `DeepseekModelNextN.__init__`, use `ReplicatedLinear` instead of `nn.Linear` for `eh_proj` when the quant config is quark, so that `QuarkLinearMethod` creates parameters matching the FP4-packed checkpoint format.
- Update `forward()` to handle the `(output, output_bias)` tuple returned by `ReplicatedLinear`.
- The `modelopt_fp4` (NVFP4) code path is unchanged.

Accuracy Tests
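As a unit-level sanity check on the shape mismatch described above: FP4 stores two 4-bit codes per `uint8` byte, so a packed tensor has half the columns of its unpacked counterpart (12288 → 6144). A minimal illustration follows; it mimics generic pairwise 4-bit packing, not Quark's exact on-disk layout.

```python
import numpy as np

def pack_fp4(codes):
    """Pack 4-bit codes (values 0..15) pairwise along the last axis."""
    assert codes.shape[-1] % 2 == 0
    return ((codes[..., 0::2] << 4) | codes[..., 1::2]).astype(np.uint8)

def unpack_fp4(packed):
    """Inverse of pack_fp4: split each byte back into two 4-bit codes."""
    out = np.empty(packed.shape[:-1] + (packed.shape[-1] * 2,), dtype=np.uint8)
    out[..., 0::2] = (packed >> 4) & 0xF
    out[..., 1::2] = packed & 0xF
    return out

codes = np.random.randint(0, 16, size=(4, 12288), dtype=np.uint8)
packed = pack_fp4(codes)
assert packed.shape == (4, 6144)  # column count halves, as in the reported mismatch
assert np.array_equal(unpack_fp4(packed), codes)
```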
GLM-5.1-mxfp4 launch cmd:

```shell
python3 -m sglang.launch_server \
  --model-path /data/models/GLM-5-mxfp4/ \
  --port 9000 --trust-remote-code --tp 8 \
  --chunked-prefill-size 131072 \
  --disable-radix-cache \
  --mem-fraction-static 0.85 \
  --model-loader-extra-config '{"enable_multithread_load": true}' \
  --watchdog-timeout 1200 \
  --reasoning-parser glm45 \
  --tool-call-parser glm47 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4
```

Speed Tests and Profiling
Checklist
Review and Merge Process
`/tag-and-rerun-ci`, `/tag-run-ci-label`, `/rerun-failed-ci`