Checklist
Describe the bug
DeepSeek v3.1 Performance Regression: 0.5.1 vs 0.5.9
Tag 0.5.9 and the latest commit show the same (regressed) performance for DeepSeek v3.1.
Machine: H20 × 8, TP8
Tag 0.5.1 command:
python3 -m sglang.launch_server --mem-fraction-static 0.8 --model-path /data/jianshu-models/DeepSeek-V3.1 --attention-backend fa3 --log-level info --log-requests --log-requests-level 0 --collect-tokens-histogram --enable-metrics --enable-cache-report --tp-size 8 --watchdog-timeout 3600 --host 0.0.0.0 --port 64100 --trust-remote-code --served-model-name deepseek-v31 --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 32}'
Tag 0.5.9 command:
python3 -m sglang.launch_server --mem-fraction-static 0.8 --model-path /data/jianshu-models/DeepSeek-V3.1 --attention-backend fa3 --log-level info --log-requests --log-requests-level 0 --collect-tokens-histogram --enable-metrics --enable-cache-report --tp-size 8 --watchdog-timeout 3600 --host 0.0.0.0 --port 64100 --trust-remote-code --served-model-name deepseek-v31 --disable-piecewise-cuda-graph --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 32}'
Piecewise CUDA graph (PCG) actually has little influence on the 0.5.9 benchmark result, so it is kept disabled here.
Benchmark command (note the JSON argument to --extra-request-body must be single-quoted so the shell passes it through intact):
python3 bench_serving.py --backend sglang-oai-chat --port 64100 --host 127.0.0.1 --model /data/jianshu-models/DeepSeek-V3.1 --dataset-path /nfs/ofs-llm-ssd/user/gogongxt/datasets/ShareGPT_V3_unfiltered_cleaned_split.json --seed 1234 --extra-request-body '{"stream_options": {"include_usage": true}}' --output-details --dataset-name random --random-input-len 8192 --random-output-len 1 --random-range-ratio 1 --max-concurrency 16 --num-prompts 160
0.5.1 benchmark:
============ Serving Benchmark Result ============
Backend: sglang-oai-chat
Traffic request rate: inf
Max request concurrency: 16
Successful requests: 160
Benchmark duration (s): 120.58
Total input tokens: 1310720
Total cached tokens: 8597
Cache hit rate: 0.01
Total generated tokens: 160
Total generated tokens (retokenized): 160
Request throughput (req/s): 1.33
Input token throughput (tok/s): 10870.00
Output token throughput (tok/s): 1.33
Total token throughput (tok/s): 10871.33
Concurrency: 15.33
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 11551.31
Median E2E Latency (ms): 12062.75
---------------Time to First Token----------------
Mean TTFT (ms): 11551.27
Median TTFT (ms): 12062.72
P95 TTFT (ms): 12130.80
P99 TTFT (ms): 12133.51
--------------Time Per Output Token---------------
Mean TPOT (ms): 0.00
Median TPOT (ms): 0.00
P95 TPOT (ms): 0.00
P99 TPOT (ms): 0.00
---------------Inter-Token Latency----------------
Mean ITL (ms): 0.00
Median ITL (ms): 0.00
P95 ITL (ms): 0.00
P99 ITL (ms): 0.00
Max ITL (ms): 0.00
==================================================
0.5.9 benchmark:
============ Serving Benchmark Result ============
Backend: sglang-oai-chat
Traffic request rate: inf
Max request concurrency: 16
Successful requests: 160
Benchmark duration (s): 168.39
Total input tokens: 1310720
Total cached tokens: 8598
Cache hit rate: 0.01
Total generated tokens: 160
Total generated tokens (retokenized): 160
Request throughput (req/s): 0.95
Input token throughput (tok/s): 7783.69
Output token throughput (tok/s): 0.95
Total token throughput (tok/s): 7784.64
Concurrency: 15.30
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 16104.57
Median E2E Latency (ms): 17085.06
---------------Time to First Token----------------
Mean TTFT (ms): 16104.36
Median TTFT (ms): 17084.87
P95 TTFT (ms): 17192.83
P99 TTFT (ms): 17202.63
--------------Time Per Output Token---------------
Mean TPOT (ms): 0.00
Median TPOT (ms): 0.00
P95 TPOT (ms): 0.00
P99 TPOT (ms): 0.00
---------------Inter-Token Latency----------------
Mean ITL (ms): 0.00
Median ITL (ms): 0.00
P95 ITL (ms): 0.00
P99 ITL (ms): 0.00
Max ITL (ms): 0.00
==================================================
Input token throughput drops from 10870.00 to 7783.69 tok/s, roughly a 28% performance regression in 0.5.9.
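For the record, the regression figure can be computed directly from the input-token throughput values reported in the two benchmark runs above:

```python
# Input token throughput figures copied from the benchmark output above.
v051_tps = 10870.00  # tag 0.5.1 (tok/s)
v059_tps = 7783.69   # tag 0.5.9 (tok/s)

# Relative drop in throughput going from 0.5.1 to 0.5.9.
regression = (v051_tps - v059_tps) / v051_tps
print(f"throughput regression: {regression:.1%}")  # → throughput regression: 28.4%
```

The end-to-end latency numbers tell the same story: mean E2E latency rises from 11551.31 ms to 16104.57 ms.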
Reproduction
none
Environment
none