[Bug] DeepSeek v3.1 Performance Regression: 0.5.1 vs 0.5.9 #21012

@gogongxt

Description

Checklist

  • I searched related issues but found no solution.
  • The bug persists in the latest version.
  • Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
  • If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
  • Please use English. Otherwise, it will be closed.

Describe the bug

DeepSeek v3.1 Performance Regression: 0.5.1 vs 0.5.9

Tag 0.5.9 and the latest commit show the same (regressed) performance for DeepSeek v3.1.

Machine: 8× H20 GPUs, TP8


Tag 0.5.1 launch command:

python3 -m sglang.launch_server --mem-fraction-static 0.8 --model-path /data/jianshu-models/DeepSeek-V3.1 --attention-backend fa3 --log-level info --log-requests --log-requests-level 0 --collect-tokens-histogram --enable-metrics --enable-cache-report --tp-size 8 --watchdog-timeout 3600 --host 0.0.0.0 --port 64100 --trust-remote-code --served-model-name deepseek-v31 --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 32}'

Tag 0.5.9 launch command:

python3 -m sglang.launch_server --mem-fraction-static 0.8 --model-path /data/jianshu-models/DeepSeek-V3.1 --attention-backend fa3 --log-level info --log-requests --log-requests-level 0 --collect-tokens-histogram --enable-metrics --enable-cache-report --tp-size 8 --watchdog-timeout 3600 --host 0.0.0.0 --port 64100 --trust-remote-code --served-model-name deepseek-v31 --disable-piecewise-cuda-graph --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 32}'

In practice, piecewise CUDA graph (PCG) has little influence on the 0.5.9 benchmark result, so it is simply kept disabled here.


Benchmark command (note the JSON argument must be quoted for the shell):

python3 bench_serving.py --backend sglang-oai-chat --port 64100 --host 127.0.0.1 --model /data/jianshu-models/DeepSeek-V3.1 --dataset-path /nfs/ofs-llm-ssd/user/gogongxt/datasets/ShareGPT_V3_unfiltered_cleaned_split.json --seed 1234 --extra-request-body '{"stream_options": {"include_usage": true}}' --output-details --dataset-name random --random-input-len 8192 --random-output-len 1 --random-range-ratio 1 --max-concurrency 16 --num-prompts 160

Tag 0.5.1 benchmark result:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 16
Successful requests:                     160
Benchmark duration (s):                  120.58
Total input tokens:                      1310720
Total cached tokens:                     8597
Cache hit rate:                          0.01
Total generated tokens:                  160
Total generated tokens (retokenized):    160
Request throughput (req/s):              1.33
Input token throughput (tok/s):          10870.00
Output token throughput (tok/s):         1.33
Total token throughput (tok/s):          10871.33
Concurrency:                             15.33
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   11551.31
Median E2E Latency (ms):                 12062.75
---------------Time to First Token----------------
Mean TTFT (ms):                          11551.27
Median TTFT (ms):                        12062.72
P95 TTFT (ms):                           12130.80
P99 TTFT (ms):                           12133.51
--------------Time Per Output Token---------------
Mean TPOT (ms):                          0.00
Median TPOT (ms):                        0.00
P95 TPOT (ms):                           0.00
P99 TPOT (ms):                           0.00
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00
Median ITL (ms):                         0.00
P95 ITL (ms):                            0.00
P99 ITL (ms):                            0.00
Max ITL (ms):                            0.00
==================================================

Tag 0.5.9 benchmark result:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 16
Successful requests:                     160
Benchmark duration (s):                  168.39
Total input tokens:                      1310720
Total cached tokens:                     8598
Cache hit rate:                          0.01
Total generated tokens:                  160
Total generated tokens (retokenized):    160
Request throughput (req/s):              0.95
Input token throughput (tok/s):          7783.69
Output token throughput (tok/s):         0.95
Total token throughput (tok/s):          7784.64
Concurrency:                             15.30
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   16104.57
Median E2E Latency (ms):                 17085.06
---------------Time to First Token----------------
Mean TTFT (ms):                          16104.36
Median TTFT (ms):                        17084.87
P95 TTFT (ms):                           17192.83
P99 TTFT (ms):                           17202.63
--------------Time Per Output Token---------------
Mean TPOT (ms):                          0.00
Median TPOT (ms):                        0.00
P95 TPOT (ms):                           0.00
P99 TPOT (ms):                           0.00
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00
Median ITL (ms):                         0.00
P95 ITL (ms):                            0.00
P99 ITL (ms):                            0.00
Max ITL (ms):                            0.00
==================================================

Input token throughput drops from 10870.00 tok/s to 7783.69 tok/s, i.e. roughly a 28% performance regression in 0.5.9.
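For reference, the regression figure follows directly from the two input-token throughputs reported in the benchmark tables above; a minimal sketch of the arithmetic:

```python
# Throughputs copied from the two benchmark tables above (tok/s).
tp_051 = 10870.00  # tag 0.5.1, input token throughput
tp_059 = 7783.69   # tag 0.5.9, input token throughput

# Relative drop in throughput going from 0.5.1 to 0.5.9.
drop = (tp_051 - tp_059) / tp_051
print(f"throughput regression: {drop:.1%}")  # -> throughput regression: 28.4%
```

The same ratio falls out of the request throughputs (0.95 / 1.33) or the benchmark durations (120.58 s vs 168.39 s), since input size and request count are identical across the two runs.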

Reproduction

none

Environment

none
