[Bug] GLM-5 accuracy drop on B200 with flash_mla_with_kvcache kernel #21291

@Fridge003

Description

Describe the bug

As the title says: GLM-5 shows an accuracy drop on B200 when running with the flash_mla_with_kvcache kernel.

Reproduction

# Launch
sglang serve --model-path zai-org/GLM-5-FP8 --tp 8 --trust-remote-code --dp 8 --enable-dp-attention --kv-cache-dtype fp8_e4m3 --nsa-prefill-backend flashmla_sparse --nsa-decode-backend flashmla_kv

# Benchmark: 20-shots gsm8k
python3 benchmark/gsm8k/bench_sglang.py --num-shots 20 --num-questions 1319 --parallel 1319
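To narrow down the cause, a useful ablation (a sketch; only flags already present in the launch command above are used, and whether the defaults restore accuracy is exactly the open question) is to rerun the same benchmark with one variable removed at a time:

```shell
# Ablation sketch: same launch as above, but vary one setting per run.
# Flag names are taken verbatim from the reproduction command.

# Run A: flashmla NSA backends, but without the FP8 KV cache
sglang serve --model-path zai-org/GLM-5-FP8 --tp 8 --trust-remote-code \
  --dp 8 --enable-dp-attention \
  --nsa-prefill-backend flashmla_sparse --nsa-decode-backend flashmla_kv

# Run B: FP8 KV cache, but with the default NSA backends
sglang serve --model-path zai-org/GLM-5-FP8 --tp 8 --trust-remote-code \
  --dp 8 --enable-dp-attention --kv-cache-dtype fp8_e4m3

# Benchmark each run identically and compare accuracies
python3 benchmark/gsm8k/bench_sglang.py --num-shots 20 --num-questions 1319 --parallel 1319
```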

The accuracy result is:

Accuracy: 0.919
Invalid: 0.000
Latency: 29.930 s
Output throughput: 4294.924 token/s

However, the expected accuracy should be about 0.95.
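For reference, a quick back-of-the-envelope check (plain Python, using only the numbers reported above) confirms the gap is far outside sampling noise for 1319 questions:

```python
import math

n = 1319           # gsm8k questions benchmarked
observed = 0.919   # measured accuracy with the flashmla backends
expected = 0.95    # approximate expected accuracy

# Standard error of the accuracy estimate under the expected rate,
# treating each question as an independent Bernoulli trial.
se = math.sqrt(expected * (1 - expected) / n)
z = (expected - observed) / se
print(f"standard error: {se:.4f}, z-score: {z:.1f}")
# The drop is roughly 5 standard errors, so it is not run-to-run noise.
```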

Environment

Latest main branch, 8x B200 GPUs

Metadata

Assignees

No one assigned

Labels

bug (Something isn't working), deepseek
