[Bug] GLM-5 accuracy drop on B200 with flash_mla_with_kvcache kernel #21291

@Fridge003

Description

Describe the bug

As the title says: GLM-5 shows an accuracy drop on B200 when running with the flash_mla_with_kvcache kernel.

Reproduction

# Launch
sglang serve --model-path zai-org/GLM-5-FP8 --tp 8 --trust-remote-code --dp 8 --enable-dp-attention --kv-cache-dtype fp8_e4m3 --nsa-prefill-backend flashmla_sparse --nsa-decode-backend flashmla_kv

# Benchmark: 20-shots gsm8k
python3 benchmark/gsm8k/bench_sglang.py --num-shots 20 --num-questions 1319 --parallel 1319
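To narrow down the cause, a useful ablation (a sketch; only flags already present in the launch command above are used, and whether the defaults restore accuracy is exactly the open question) is to rerun the same benchmark with one variable removed at a time:

```shell
# Ablation sketch: same launch as above, but vary one setting per run.
# Flag names are taken verbatim from the reproduction command.

# Run A: flashmla NSA backends, but without the FP8 KV cache
sglang serve --model-path zai-org/GLM-5-FP8 --tp 8 --trust-remote-code \
  --dp 8 --enable-dp-attention \
  --nsa-prefill-backend flashmla_sparse --nsa-decode-backend flashmla_kv

# Run B: FP8 KV cache, but with the default NSA backends
sglang serve --model-path zai-org/GLM-5-FP8 --tp 8 --trust-remote-code \
  --dp 8 --enable-dp-attention --kv-cache-dtype fp8_e4m3

# Benchmark each run identically and compare accuracies
python3 benchmark/gsm8k/bench_sglang.py --num-shots 20 --num-questions 1319 --parallel 1319
```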

The accuracy result is:

Accuracy: 0.919
Invalid: 0.000
Latency: 29.930 s
Output throughput: 4294.924 token/s

However, the expected accuracy should be about 0.95.
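For reference, a quick back-of-the-envelope check (plain Python, using only the numbers reported above) confirms the gap is far outside sampling noise for 1319 questions:

```python
import math

n = 1319           # gsm8k questions benchmarked
observed = 0.919   # measured accuracy with the flashmla backends
expected = 0.95    # approximate expected accuracy

# Standard error of the accuracy estimate under the expected rate,
# treating each question as an independent Bernoulli trial.
se = math.sqrt(expected * (1 - expected) / n)
z = (expected - observed) / se
print(f"standard error: {se:.4f}, z-score: {z:.1f}")
# The drop is roughly 5 standard errors, so it is not run-to-run noise.
```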

Environment

Latest main branch, 8x B200 GPUs

Metadata

Assignees

No one assigned

Labels

bug (Something isn't working), deepseek
