[AMD] Enable MTP for GLM-5-mxfp4 model#23219
Conversation
> **Warning** You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
`/tag-and-rerun-ci`

`@amd-bot ci-status`
CI Status for PR #23219
PR: [AMD] Enable MTP for GLM-5-mxfp4 model
AMD: 4 failures (0 likely related) | Others: 10 failures (0 likely related)
This PR only modifies AMD …

AMD CI Failures / Other CI Failures (details): All failures are unrelated to this PR. The PR makes a narrowly-scoped change to … These are safe to ignore for this PR; they are pre-existing CI issues or environment problems on …
Motivation
Fix #23142.
Quark-quantized GLM-5-MXFP4 checkpoints store MTP (NextN) weights, including `eh_proj`, in FP4-packed format. The existing code always creates `eh_proj` as `nn.Linear`, causing a shape mismatch (`[6144, 6144]` vs `[6144, 12288]`) during weight loading.

Modifications
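The modifications described below boil down to a quant-config fallback for the MTP layer. As a hedged, self-contained sketch (the helper name `resolve_nextn_quant_config` and the prefix-matching rule are assumptions based on this description, not the actual sglang code):

```python
# Illustrative sketch only: the real check lives in sglang's quark integration.
def resolve_nextn_quant_config(quant_config, mtp_layer_prefix):
    """Return None (bf16 fallback) when the MTP layer is excluded
    from quark quantization via exclude_layers."""
    if quant_config is None:
        return None
    excluded = getattr(quant_config, "exclude_layers", None) or []
    if any(mtp_layer_prefix.startswith(pattern) for pattern in excluded):
        return None  # NextN model falls back to bf16 parameters
    return quant_config
```

When the MTP layer is not excluded, the quark config is kept and `eh_proj` is built as `ReplicatedLinear`, so `QuarkLinearMethod` allocates FP4-packed parameters; its forward pass then returns an `(output, output_bias)` tuple that the caller unpacks.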
- In `DeepseekV3ForCausalLMNextN.__init__`, check whether the MTP layer is listed in quark's `exclude_layers`. If so, set `quant_config` to `None` so the NextN model falls back to bf16 parameters.
- In `DeepseekModelNextN.__init__`, use `ReplicatedLinear` instead of `nn.Linear` for `eh_proj` when the quant config is quark, so that `QuarkLinearMethod` creates parameters matching the FP4-packed checkpoint format.
- Update `forward()` to handle the `(output, output_bias)` tuple returned by `ReplicatedLinear`.
- The `modelopt_fp4` (NVFP4) code path is unchanged.

Accuracy Tests
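As a unit-level sanity check on the shape mismatch described above: FP4 stores two 4-bit codes per `uint8` byte, so a packed tensor has half the columns of its unpacked counterpart (12288 → 6144). A minimal illustration follows; it mimics generic pairwise 4-bit packing, not Quark's exact on-disk layout.

```python
import numpy as np

def pack_fp4(codes):
    """Pack 4-bit codes (values 0..15) pairwise along the last axis."""
    assert codes.shape[-1] % 2 == 0
    return ((codes[..., 0::2] << 4) | codes[..., 1::2]).astype(np.uint8)

def unpack_fp4(packed):
    """Inverse of pack_fp4: split each byte back into two 4-bit codes."""
    out = np.empty(packed.shape[:-1] + (packed.shape[-1] * 2,), dtype=np.uint8)
    out[..., 0::2] = (packed >> 4) & 0xF
    out[..., 1::2] = packed & 0xF
    return out

codes = np.random.randint(0, 16, size=(4, 12288), dtype=np.uint8)
packed = pack_fp4(codes)
assert packed.shape == (4, 6144)  # column count halves, as in the reported mismatch
assert np.array_equal(unpack_fp4(packed), codes)
```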
GLM-5.1-mxfp4 launch cmd:

```shell
python3 -m sglang.launch_server \
  --model-path /data/models/GLM-5-mxfp4/ \
  --port 9000 --trust-remote-code --tp 8 \
  --chunked-prefill-size 131072 \
  --disable-radix-cache \
  --mem-fraction-static 0.85 \
  --model-loader-extra-config '{"enable_multithread_load": true}' \
  --watchdog-timeout 1200 \
  --reasoning-parser glm45 \
  --tool-call-parser glm47 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4
```

Speed Tests and Profiling
Checklist
Review and Merge Process
`/tag-and-rerun-ci`, `/tag-run-ci-label`, `/rerun-failed-ci`