[Bug] Deepseek R1 FP4 model quality drop #7166

@pyc96

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

I believe this issue applies to both R1 FP4 and R1-0528 FP4.

For R1 FP4, the GSM8K score is only 0.886. I am not trying to reproduce the official result exactly, but it should be around 0.95. NVIDIA also reports a much higher GSM8K score with TRT-LLM here.

Any help is really appreciated!

Reproduction

To reproduce:

python3 -m sglang.launch_server --port=7080 --model-path=nvidia/DeepSeek-R1-FP4  --trust-remote-code --tp=8  --host=0.0.0.0 --quantization=modelopt_fp4 --kv-cache-dtype=auto

python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319 --port=7080
100%|██████████| 1319/1319 [00:42<00:00, 31.20it/s]
Accuracy: 0.886
Invalid: 0.001
Latency: 42.603 s
Output throughput: 3734.827 token/s
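
As a rough sanity check that the gap is larger than sampling noise, here is a minimal sketch that puts a 95% Wilson score interval around the observed accuracy for the 1319 questions above. The 0.95 value is only the approximate expectation stated in the description, not an official figure, and the helper name `wilson_interval` is just an illustration, not part of sglang.

```python
import math


def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


n = 1319          # --num-questions from the benchmark command
observed = 0.886  # "Accuracy" line from the benchmark output
expected = 0.95   # assumption: rough expectation from the description

low, high = wilson_interval(observed, n)
wrong_observed = round(n * (1 - observed))
wrong_expected = round(n * (1 - expected))

print(f"95% interval for observed accuracy: [{low:.3f}, {high:.3f}]")
print(f"expected accuracy inside interval: {low <= expected <= high}")
print(f"approx. wrong answers: observed {wrong_observed}, expected {wrong_expected}")
```

Under these assumptions, 0.95 falls well above the upper bound of the interval, and the run gets roughly 150 questions wrong where about 66 would be expected, so the drop is unlikely to be run-to-run variance alone.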

Environment

python3 -m sglang.check_env
Python: 3.12.3 (main, Feb  4 2025, 14:48:35) [GCC 13.3.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA B200
GPU 0,1,2,3,4,5,6,7 Compute Capability: 10.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.9, V12.9.41
CUDA Driver Version: 570.124.06
PyTorch: 2.7.1+cu128
sglang: 0.4.7
sgl_kernel: 0.1.8
flashinfer_python: 0.2.6.post1
triton: 3.3.1
transformers: 4.52.3
torchao: 0.9.0
numpy: 2.1.2
aiohttp: 3.12.12
fastapi: 0.115.12
hf_transfer: 0.1.9
huggingface_hub: 0.33.0
interegular: 0.3.3
modelscope: 1.27.0
orjson: 3.10.18
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.5
python-multipart: 0.0.20
pyzmq: 26.4.0
uvicorn: 0.34.3
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.19
openai: 1.86.0
tiktoken: 0.9.0
anthropic: Module Not Found
litellm: Module Not Found
decord: Module Not Found
NVIDIA Topology: 
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    0-55,112-167    0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    0-55,112-167    0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    0-55,112-167    0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    0-55,112-167    0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    56-111,168-223  1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    56-111,168-223  1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    56-111,168-223  1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      56-111,168-223  1               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Hypervisor vendor: KVM
ulimit soft: 1048576
