[Bug] DeepSeek v3.1 Performance Regression: 0.5.1 vs 0.5.9 #21012

@gogongxt

Description

Checklist

  • I searched related issues but found no solution.
  • The bug persists in the latest version.
  • Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
  • If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
  • Please use English. Otherwise, it will be closed.

Describe the bug

DeepSeek v3.1 Performance Regression: 0.5.1 vs 0.5.9

Tag 0.5.9 and the latest commit show the same (regressed) performance for DeepSeek v3.1.

Machine: 8× H20 GPUs, TP8


Tag 0.5.1 launch command:

python3 -m sglang.launch_server --mem-fraction-static 0.8 --model-path /data/jianshu-models/DeepSeek-V3.1 --attention-backend fa3 --log-level info --log-requests --log-requests-level 0 --collect-tokens-histogram --enable-metrics --enable-cache-report --tp-size 8 --watchdog-timeout 3600 --host 0.0.0.0 --port 64100 --trust-remote-code --served-model-name deepseek-v31 --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 32}'

Tag 0.5.9 launch command:

python3 -m sglang.launch_server --mem-fraction-static 0.8 --model-path /data/jianshu-models/DeepSeek-V3.1 --attention-backend fa3 --log-level info --log-requests --log-requests-level 0 --collect-tokens-histogram --enable-metrics --enable-cache-report --tp-size 8 --watchdog-timeout 3600 --host 0.0.0.0 --port 64100 --trust-remote-code --served-model-name deepseek-v31 --disable-piecewise-cuda-graph --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 32}'

In practice, piecewise CUDA graph (PCG) has little influence on the 0.5.9 benchmark result, so it is simply kept disabled here.


Benchmark command (note the JSON argument must be quoted for the shell):

python3 bench_serving.py --backend sglang-oai-chat --port 64100 --host 127.0.0.1 --model /data/jianshu-models/DeepSeek-V3.1 --dataset-path /nfs/ofs-llm-ssd/user/gogongxt/datasets/ShareGPT_V3_unfiltered_cleaned_split.json --seed 1234 --extra-request-body '{"stream_options": {"include_usage": true}}' --output-details --dataset-name random --random-input-len 8192 --random-output-len 1 --random-range-ratio 1 --max-concurrency 16 --num-prompts 160

Tag 0.5.1 benchmark result:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 16
Successful requests:                     160
Benchmark duration (s):                  120.58
Total input tokens:                      1310720
Total cached tokens:                     8597
Cache hit rate:                          0.01
Total generated tokens:                  160
Total generated tokens (retokenized):    160
Request throughput (req/s):              1.33
Input token throughput (tok/s):          10870.00
Output token throughput (tok/s):         1.33
Total token throughput (tok/s):          10871.33
Concurrency:                             15.33
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   11551.31
Median E2E Latency (ms):                 12062.75
---------------Time to First Token----------------
Mean TTFT (ms):                          11551.27
Median TTFT (ms):                        12062.72
P95 TTFT (ms):                           12130.80
P99 TTFT (ms):                           12133.51
--------------Time Per Output Token---------------
Mean TPOT (ms):                          0.00
Median TPOT (ms):                        0.00
P95 TPOT (ms):                           0.00
P99 TPOT (ms):                           0.00
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00
Median ITL (ms):                         0.00
P95 ITL (ms):                            0.00
P99 ITL (ms):                            0.00
Max ITL (ms):                            0.00
==================================================

Tag 0.5.9 benchmark result:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 16
Successful requests:                     160
Benchmark duration (s):                  168.39
Total input tokens:                      1310720
Total cached tokens:                     8598
Cache hit rate:                          0.01
Total generated tokens:                  160
Total generated tokens (retokenized):    160
Request throughput (req/s):              0.95
Input token throughput (tok/s):          7783.69
Output token throughput (tok/s):         0.95
Total token throughput (tok/s):          7784.64
Concurrency:                             15.30
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   16104.57
Median E2E Latency (ms):                 17085.06
---------------Time to First Token----------------
Mean TTFT (ms):                          16104.36
Median TTFT (ms):                        17084.87
P95 TTFT (ms):                           17192.83
P99 TTFT (ms):                           17202.63
--------------Time Per Output Token---------------
Mean TPOT (ms):                          0.00
Median TPOT (ms):                        0.00
P95 TPOT (ms):                           0.00
P99 TPOT (ms):                           0.00
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00
Median ITL (ms):                         0.00
P95 ITL (ms):                            0.00
P99 ITL (ms):                            0.00
Max ITL (ms):                            0.00
==================================================

Input token throughput drops from 10870.00 tok/s to 7783.69 tok/s, i.e. roughly a 28% performance regression in 0.5.9.
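For reference, the regression figure follows directly from the two input-token throughputs reported in the benchmark tables above; a minimal sketch of the arithmetic:

```python
# Throughputs copied from the two benchmark tables above (tok/s).
tp_051 = 10870.00  # tag 0.5.1, input token throughput
tp_059 = 7783.69   # tag 0.5.9, input token throughput

# Relative drop in throughput going from 0.5.1 to 0.5.9.
drop = (tp_051 - tp_059) / tp_051
print(f"throughput regression: {drop:.1%}")  # -> throughput regression: 28.4%
```

The same ratio falls out of the request throughputs (0.95 / 1.33) or the benchmark durations (120.58 s vs 168.39 s), since input size and request count are identical across the two runs.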

Reproduction

none

Environment

none
