Checklist
Describe the bug
Currently, when using MTP with the FP4 DeepSeek model, the server crashes with:
File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 381, in load_model
self.load_weights_and_postprocess(
File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 389, in load_weights_and_postprocess
model.load_weights(weights)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_nextn.py", line 161, in load_weights
super().load_weights(weights, is_nextn=True)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 2099, in load_weights
weight_loader(
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 594, in weight_loader
self._load_model_weight_or_group_weight_scale(
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 401, in _load_model_weight_or_group_weight_scale
self._load_w13(
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 455, in _load_w13
expert_data.copy_(loaded_weight)
RuntimeError: The size of tensor a (3584) must match the size of tensor b (7168) at non-singleton dimension 1
I think the reason is that the MTP module in the FP4 model is not quantized, or at least not stored in the same format, while the same quant config is reused for both the MTP module and the main model. The 2x mismatch (3584 vs. 7168) is consistent with that: the parameter is allocated as packed FP4, two 4-bit values per byte, which halves the storage width, while the loaded MTP weight still has the full width of 7168.
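If that reading is right, the mismatch is easy to reproduce in isolation. A minimal illustration (my interpretation of the packed-FP4 layout, not code taken from SGLang):

```python
# The param is allocated as packed FP4 (uint8, two 4-bit values per byte),
# so a weight of width 7168 becomes a uint8 tensor of width 3584. Copying an
# unquantized checkpoint tensor of width 7168 into it fails exactly as in the
# traceback above. The shapes here are illustrative.
import torch

hidden = 7168
fp4_param = torch.empty(16, hidden // 2, dtype=torch.uint8)   # packed FP4 storage
mtp_weight = torch.empty(16, hidden, dtype=torch.bfloat16)    # unquantized MTP weight

fp4_param.copy_(mtp_weight)
# RuntimeError: The size of tensor a (3584) must match the size of tensor b (7168)
# at non-singleton dimension 1
```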
After switching the linear and MoE quant methods to UnquantizedLinearMethod and UnquantizedFusedMoEMethod for the MTP module, the model loads. Something like https://github.com/pyc96/sglang/blob/7c8ce7870ab4a7ab918b288e661ad182e9e21e13/python/sglang/srt/layers/quantization/modelopt_quant.py#L348 (sketched below).
I am not sure this is the right solution; it may be worth consulting the NVIDIA folks on the exact format of the MTP module in the FP4 checkpoint.
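A minimal sketch of the workaround idea, assuming the ModelOpt FP4 config can be subclassed and that the checkpoint stores the MTP block as layer 61 (DeepSeek-R1 has decoder layers 0..60); the class name, the `MTP_PREFIX` constant, and the exact import paths are illustrative and not verified against this SGLang version:

```python
# Sketch only, not the actual fix: route MTP (nextn) layers to unquantized
# methods inside the quant config's get_quant_method, and keep FP4 methods
# for the main model. Import paths follow the sglang layout seen in the
# traceback but are unverified for this version.
from sglang.srt.layers.linear import LinearBase, UnquantizedLinearMethod
from sglang.srt.layers.moe.fused_moe_triton.layer import (
    FusedMoE,
    UnquantizedFusedMoEMethod,
)
from sglang.srt.layers.quantization.modelopt_quant import ModelOptFp4Config

# Assumption about this checkpoint's layout: MTP weights live under layer 61.
MTP_PREFIX = "model.layers.61."

class ModelOptFp4ConfigWithMtp(ModelOptFp4Config):  # hypothetical subclass
    def get_quant_method(self, layer, prefix: str):
        if prefix.startswith(MTP_PREFIX):
            # MTP weights appear to be stored unquantized in the FP4 ckpt,
            # so skip the FP4 machinery for these layers.
            if isinstance(layer, LinearBase):
                return UnquantizedLinearMethod()
            if isinstance(layer, FusedMoE):
                return UnquantizedFusedMoEMethod()
        return super().get_quant_method(layer, prefix)
```

This only shows the dispatch idea; the linked commit wires the equivalent logic into modelopt_quant.py directly.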
Reproduction
python3 -m sglang.launch_server --port=7080 --model-path=/root/.cache/huggingface/hub/models--nvidia--DeepSeek-R1-0528-FP4/snapshots/91cfc7c35acd8ecfc769205989310208b8b81c9c/ --trust-remote-code --tp=4 --host=0.0.0.0 --speculative-algorithm=EAGLE --speculative-num-steps=3 --speculative-eagle-topk=1 --speculative-num-draft-tokens=4
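To pin down the format question above, one could dump the dtypes and shapes of the nextn tensors straight from the checkpoint; the `model.layers.61.` prefix and the safetensors layout are assumptions about this particular checkpoint:

```python
# Diagnostic sketch: list dtype/shape of the MTP (nextn) tensors in the FP4
# checkpoint to see whether they are stored quantized or in full precision.
import glob

from safetensors import safe_open

CKPT = ("/root/.cache/huggingface/hub/models--nvidia--DeepSeek-R1-0528-FP4/"
        "snapshots/91cfc7c35acd8ecfc769205989310208b8b81c9c")

for shard in sorted(glob.glob(f"{CKPT}/*.safetensors")):
    with safe_open(shard, framework="pt") as f:
        for name in f.keys():
            if name.startswith("model.layers.61."):  # assumed MTP prefix
                t = f.get_tensor(name)
                print(name, t.dtype, tuple(t.shape))
```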
Environment
root@predictor-resource-pool-1907436070400688128-bcc64d749-kccns:/sgl-workspace/sglang# python3 -m sglang.check_env
Python: 3.12.3 (main, Feb 4 2025, 14:48:35) [GCC 13.3.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA B200
GPU 0,1,2,3,4,5,6,7 Compute Capability: 10.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.9, V12.9.41
CUDA Driver Version: 570.124.06
PyTorch: 2.7.1+cu128
sglang: 0.4.7.post1
sgl_kernel: 0.1.9
flashinfer_python: 0.2.6.post1
triton: 3.3.1
transformers: 4.52.3
torchao: 0.9.0
numpy: 2.1.2
aiohttp: 3.12.12
fastapi: 0.115.12
hf_transfer: 0.1.9
huggingface_hub: 0.33.0
interegular: 0.3.3
modelscope: 1.27.0
orjson: 3.10.18
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.5
python-multipart: 0.0.20
pyzmq: 26.4.0
uvicorn: 0.34.3
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.19
openai: 1.88.0
tiktoken: 0.9.0
anthropic: Module Not Found
litellm: Module Not Found
decord: Module Not Found
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 0-55,112-167 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 0-55,112-167 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 0-55,112-167 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 0-55,112-167 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 56-111,168-223 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 56-111,168-223 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 56-111,168-223 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X 56-111,168-223 1 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
Hypervisor vendor: KVM
ulimit soft: 1048576