
[Piecewise CUDA Graph] Support W4A8#13179

Merged
ispobock merged 4 commits into sgl-project:main from bzhng-development:support-per-tensor-fp8-piecewise
Nov 16, 2025


@b8zhong
Collaborator

@b8zhong b8zhong commented Nov 13, 2025

python -m sglang.launch_server --model-path novita/Deepseek-V3.1-W4AFP8 --tp 8 --trust-remote-code --enable-piecewise-cuda-graph --piecewise-cuda-graph-max-tokens 8192
curl http://127.0.0.1:30000/flush_cache
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input-len 1024 --random-output-len 16 --random-range-ratio 1.0 --num-prompts 16   --max-concurrency 1  --output-file res_before.jsonl

curl http://127.0.0.1:30000/flush_cache
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input-len 1024 --random-output-len 16 --random-range-ratio 1.0 --num-prompts 128  --max-concurrency 4  --output-file res_before.jsonl

curl http://127.0.0.1:30000/flush_cache
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input-len 1024 --random-output-len 16 --random-range-ratio 1.0 --num-prompts 256 --max-concurrency 16 --output-file res_before.jsonl

curl http://127.0.0.1:30000/flush_cache
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input-len 1024 --random-output-len 16 --random-range-ratio 1.0 --num-prompts 512 --max-concurrency 32 --output-file res_before.jsonl
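
The four benchmark runs above differ only in `--num-prompts` and `--max-concurrency`. As a minimal sketch (not part of this PR), the sweep could be generated from a small table of pairs. The helper names `bench_command` and `sweep_commands` are hypothetical; the flags and values are copied from the commands above, including the shared `res_before.jsonl` output file.

```python
import shlex

# (num_prompts, max_concurrency) pairs taken from the four runs above.
SWEEP = [(16, 1), (128, 4), (256, 16), (512, 32)]

# Cache flush issued before every run, as in the commands above.
FLUSH = ["curl", "http://127.0.0.1:30000/flush_cache"]


def bench_command(num_prompts, max_concurrency, output_file="res_before.jsonl"):
    """Return the bench_serving argument list for one point of the sweep."""
    return [
        "python3", "-m", "sglang.bench_serving",
        "--backend", "sglang",
        "--dataset-name", "random",
        "--random-input-len", "1024",
        "--random-output-len", "16",
        "--random-range-ratio", "1.0",
        "--num-prompts", str(num_prompts),
        "--max-concurrency", str(max_concurrency),
        "--output-file", output_file,
    ]


def sweep_commands(output_file="res_before.jsonl"):
    """Return the full command list: one flush followed by one benchmark per pair."""
    commands = []
    for num_prompts, max_concurrency in SWEEP:
        commands.append(FLUSH)
        commands.append(bench_command(num_prompts, max_concurrency, output_file))
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    for command in sweep_commands():
        print(shlex.join(command))
```

Passing a different `output_file` for the second configuration would keep the two result sets in separate files.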

With:


+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|    |   max_concurrency |   input_throughput |   output_throughput |   mean_ttft_ms |   median_ttft_ms |   p99_ttft_ms |   mean_tpot_ms |   median_tpot_ms |   p99_tpot_ms |   per_user_throughput |
+====+===================+====================+=====================+================+==================+===============+================+==================+===============+=======================+
|  0 |             1.000 |           3219.288 |              50.301 |        100.573 |          102.870 |       109.065 |         14.399 |           14.401 |        14.430 |                50.301 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  1 |             4.000 |           7960.947 |             124.390 |        233.179 |          231.712 |       342.188 |         18.635 |           17.810 |        26.559 |                31.097 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  2 |            16.000 |          13965.303 |             218.208 |        576.722 |          568.496 |       834.307 |         39.458 |           37.841 |        65.370 |                13.638 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  3 |            32.000 |          16320.776 |             255.012 |        962.902 |          935.258 |      1558.532 |         69.244 |           72.061 |       124.247 |                 7.969 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+

Without:

+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|    |   max_concurrency |   input_throughput |   output_throughput |   mean_ttft_ms |   median_ttft_ms |   p99_ttft_ms |   mean_tpot_ms |   median_tpot_ms |   p99_tpot_ms |   per_user_throughput |
+====+===================+====================+=====================+================+==================+===============+================+==================+===============+=======================+
|  0 |             1.000 |           2628.782 |              41.075 |        170.988 |          164.228 |       229.640 |         14.474 |           14.478 |        14.565 |                41.075 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  1 |             4.000 |           8173.645 |             127.713 |        231.960 |          226.408 |       383.227 |         17.849 |           17.810 |        17.952 |                31.928 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  2 |            16.000 |          13898.583 |             217.165 |        715.059 |          759.297 |       921.404 |         30.729 |           27.112 |        44.104 |                13.573 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  3 |            32.000 |          16325.506 |             255.086 |       1165.311 |         1197.516 |      1642.109 |         55.758 |           52.371 |       103.868 |                 7.971 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+

| Metric | Concurrency | Before | After | Δ (%) |
|---|---|---|---|---|
| Input Throughput | 1 | 2628.782 | 3219.288 | +22.5% |
| | 4 | 8173.645 | 7960.947 | -2.6% |
| | 16 | 13898.583 | 13965.303 | +0.48% |
| | 32 | 16325.506 | 16320.776 | -0.03% |
| Output Throughput | 1 | 41.075 | 50.301 | +22.5% |
| | 4 | 127.713 | 124.390 | -2.6% |
| | 16 | 217.165 | 218.208 | +0.48% |
| | 32 | 255.086 | 255.012 | -0.03% |
| Mean TTFT (ms) | 1 | 170.988 | 100.573 | -41.2% |
| | 4 | 231.960 | 233.179 | +0.5% |
| | 16 | 715.059 | 576.722 | -19.3% |
| | 32 | 1165.311 | 962.902 | -17.4% |
| Mean TPOT (ms) | 1 | 14.474 | 14.399 | -0.5% |
| | 4 | 17.849 | 18.635 | +4.4% |
| | 16 | 30.729 | 39.458 | +28.4% |
| | 32 | 55.758 | 69.244 | +24.2% |
| Per-user Throughput | 1 | 41.075 | 50.301 | +22.5% |
| | 4 | 31.928 | 31.097 | -2.6% |
| | 16 | 13.573 | 13.638 | +0.48% |
| | 32 | 7.971 | 7.969 | -0.03% |
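
The Δ (%) column is the relative change from "Before" (the "Without" table) to "After" (the "With" table). A minimal sketch of that calculation, using the input throughput rows as sample data; the function name `pct_change` is illustrative only. For throughput a positive Δ is an improvement, while for TTFT and TPOT a negative Δ is an improvement.

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Input throughput per concurrency level: (before, after), from the tables above.
INPUT_THROUGHPUT = {
    1: (2628.782, 3219.288),
    4: (8173.645, 7960.947),
    16: (13898.583, 13965.303),
    32: (16325.506, 16320.776),
}

if __name__ == "__main__":
    for concurrency, (before, after) in INPUT_THROUGHPUT.items():
        delta = pct_change(before, after)
        print(f"{concurrency:>2}  {before:10.3f}  {after:10.3f}  {delta:+.2f}%")
```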

Accuracy (GSM8K):

Before:

root@ip-10-40-12-14:/sgl-workspace/sglang# python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 200
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [01:40<00:00, 13.16it/s]
Accuracy: 0.958
Invalid: 0.000
Latency: 100.401 s
Output throughput: 1359.047 token/s

After:

root@ip-10-40-12-14:/sgl-workspace/sglang# python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 200
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [01:38<00:00, 13.44it/s]
Accuracy: 0.956
Invalid: 0.000
Latency: 98.303 s
Output throughput: 1387.402 token/s
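
To judge whether the 0.958 vs. 0.956 accuracy gap is meaningful, a rough check is to compare it with the sampling noise expected for 1319 questions. The sketch below assumes each question is an independent Bernoulli trial and treats the two runs as independent samples. Both are simplifications, since the runs share the same questions and a paired test would be more appropriate. The function name `accuracy_gap_z` is hypothetical.

```python
import math


def accuracy_gap_z(acc_a, acc_b, n):
    """Return (z, standard_error) for the difference of two accuracies.

    Assumes n independent Bernoulli trials per run and independent runs,
    which is only an approximation for two runs over the same questions.
    """
    var_a = acc_a * (1.0 - acc_a) / n
    var_b = acc_b * (1.0 - acc_b) / n
    standard_error = math.sqrt(var_a + var_b)
    return (acc_a - acc_b) / standard_error, standard_error


if __name__ == "__main__":
    # Accuracies and question count from the two GSM8K runs above.
    z, se = accuracy_gap_z(0.958, 0.956, 1319)
    print(f"standard error of the gap: {se:.4f}, z = {z:.2f}")
```

Under these assumptions the gap of 0.002 is well under one standard error, so it is consistent with run-to-run noise rather than an accuracy regression.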


@FlamingoPg
Collaborator

Could you also add a W4A8 test here?

@b8zhong
Collaborator Author

b8zhong commented Nov 13, 2025

@FlamingoPg This DeepSeek model is still prohibitively large for CI, so I don't think we can add that one. However, I'm pretty sure sgl_per_tensor_quant_fp8 is used by some other W8A8 per-tensor quant models that could also be made compatible. I can try to test one later.

@FlamingoPg
Collaborator

> @FlamingoPg This DeepSeek model is still prohibitively large for CI, so I don't think we can add that one. However, I'm pretty sure sgl_per_tensor_quant_fp8 is used by some other W8A8 per-tensor quant models that could also be made compatible. I can try to test one later.

I see, never mind.

@ispobock ispobock merged commit f35f7f1 into sgl-project:main Nov 16, 2025
67 of 78 checks passed
@b8zhong b8zhong deleted the support-per-tensor-fp8-piecewise branch November 21, 2025 04:43