[Piecewise CUDA Graph] Support W4A8 by b8zhong · Pull Request #13179 · sgl-project/sglang

b8zhong · 2025-11-13T03:29:28Z

python -m sglang.launch_server --model-path novita/Deepseek-V3.1-W4AFP8 --tp 8 --trust-remote-code --enable-piecewise-cuda-graph --piecewise-cuda-graph-max-tokens 8192

curl http://127.0.0.1:30000/flush_cache
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input-len 1024 --random-output-len 16 --random-range-ratio 1.0 --num-prompts 16   --max-concurrency 1  --output-file res_before.jsonl

curl http://127.0.0.1:30000/flush_cache
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input-len 1024 --random-output-len 16 --random-range-ratio 1.0 --num-prompts 128  --max-concurrency 4  --output-file res_before.jsonl

curl http://127.0.0.1:30000/flush_cache
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input-len 1024 --random-output-len 16 --random-range-ratio 1.0 --num-prompts 256 --max-concurrency 16 --output-file res_before.jsonl

curl http://127.0.0.1:30000/flush_cache
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input-len 1024 --random-output-len 16 --random-range-ratio 1.0 --num-prompts 512 --max-concurrency 32 --output-file res_before.jsonl
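The four runs above differ only in `--num-prompts` and `--max-concurrency`, so the sweep can be scripted. This is a minimal sketch, not part of the PR: the helper names `build_bench_command` and `run_sweep` are hypothetical, and the flags are copied verbatim from the commands above.

```python
import subprocess
import urllib.request

BASE_URL = "http://127.0.0.1:30000"  # server address used in the commands above

# (num_prompts, max_concurrency) pairs taken from the four commands above.
SWEEP = [(16, 1), (128, 4), (256, 16), (512, 32)]


def build_bench_command(num_prompts, max_concurrency, output_file):
    """Return the bench_serving argv for one sweep point (hypothetical helper)."""
    return [
        "python3", "-m", "sglang.bench_serving",
        "--backend", "sglang",
        "--dataset-name", "random",
        "--random-input-len", "1024",
        "--random-output-len", "16",
        "--random-range-ratio", "1.0",
        "--num-prompts", str(num_prompts),
        "--max-concurrency", str(max_concurrency),
        "--output-file", output_file,
    ]


def run_sweep(output_file, dry_run=True):
    """Build every sweep command; when dry_run is False, also flush the cache and run it."""
    commands = []
    for num_prompts, max_concurrency in SWEEP:
        cmd = build_bench_command(num_prompts, max_concurrency, output_file)
        commands.append(cmd)
        if not dry_run:
            # Same request as `curl http://127.0.0.1:30000/flush_cache` before each run.
            urllib.request.urlopen(f"{BASE_URL}/flush_cache").close()
            subprocess.run(cmd, check=True)
    return commands
```

Calling `run_sweep("res_before.jsonl")` only returns the four command lists; pass `dry_run=False` against a running server to execute them.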

With (After):


+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|    |   max_concurrency |   input_throughput |   output_throughput |   mean_ttft_ms |   median_ttft_ms |   p99_ttft_ms |   mean_tpot_ms |   median_tpot_ms |   p99_tpot_ms |   per_user_throughput |
+====+===================+====================+=====================+================+==================+===============+================+==================+===============+=======================+
|  0 |             1.000 |           3219.288 |              50.301 |        100.573 |          102.870 |       109.065 |         14.399 |           14.401 |        14.430 |                50.301 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  1 |             4.000 |           7960.947 |             124.390 |        233.179 |          231.712 |       342.188 |         18.635 |           17.810 |        26.559 |                31.097 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  2 |            16.000 |          13965.303 |             218.208 |        576.722 |          568.496 |       834.307 |         39.458 |           37.841 |        65.370 |                13.638 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  3 |            32.000 |          16320.776 |             255.012 |        962.902 |          935.258 |      1558.532 |         69.244 |           72.061 |       124.247 |                 7.969 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+

Without (Before):

+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|    |   max_concurrency |   input_throughput |   output_throughput |   mean_ttft_ms |   median_ttft_ms |   p99_ttft_ms |   mean_tpot_ms |   median_tpot_ms |   p99_tpot_ms |   per_user_throughput |
+====+===================+====================+=====================+================+==================+===============+================+==================+===============+=======================+
|  0 |             1.000 |           2628.782 |              41.075 |        170.988 |          164.228 |       229.640 |         14.474 |           14.478 |        14.565 |                41.075 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  1 |             4.000 |           8173.645 |             127.713 |        231.960 |          226.408 |       383.227 |         17.849 |           17.810 |        17.952 |                31.928 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  2 |            16.000 |          13898.583 |             217.165 |        715.059 |          759.297 |       921.404 |         30.729 |           27.112 |        44.104 |                13.573 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  3 |            32.000 |          16325.506 |             255.086 |       1165.311 |         1197.516 |      1642.109 |         55.758 |           52.371 |       103.868 |                 7.971 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
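In both tables, per_user_throughput appears to equal output_throughput divided by max_concurrency. That relation is inferred from the numbers, not taken from the benchmark script, so treat the function below as an assumed definition. A minimal sketch that checks it against the rows above:

```python
# Rows copied from the tables above:
# (max_concurrency, output_throughput, reported per_user_throughput)
WITH_ROWS = [
    (1, 50.301, 50.301),
    (4, 124.390, 31.097),
    (16, 218.208, 13.638),
    (32, 255.012, 7.969),
]
WITHOUT_ROWS = [
    (1, 41.075, 41.075),
    (4, 127.713, 31.928),
    (16, 217.165, 13.573),
    (32, 255.086, 7.971),
]


def per_user_throughput(output_throughput, max_concurrency):
    """Assumed definition: aggregate output throughput shared evenly across users."""
    return output_throughput / max_concurrency


def max_abs_error(rows):
    """Largest gap between the assumed formula and the reported column."""
    return max(
        abs(per_user_throughput(out, conc) - reported)
        for conc, out, reported in rows
    )


# Both gaps stay within the rounding of the three-decimal table values.
print(max_abs_error(WITH_ROWS), max_abs_error(WITHOUT_ROWS))
```

Under that reading, the per-user column adds no information beyond output throughput; it only rescales it by the concurrency level.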

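| Metric | Concurrency | Before | After | Δ (%) |
|---|---|---|---|---|
| Input Throughput | 1 | 2628.782 | 3219.288 | +22.5% |
| Input Throughput | 4 | 8173.645 | 7960.947 | -2.6% |
| Input Throughput | 16 | 13898.583 | 13965.303 | +0.48% |
| Input Throughput | 32 | 16325.506 | 16320.776 | -0.03% |
| Output Throughput | 1 | 41.075 | 50.301 | +22.5% |
| Output Throughput | 4 | 127.713 | 124.390 | -2.6% |
| Output Throughput | 16 | 217.165 | 218.208 | +0.48% |
| Output Throughput | 32 | 255.086 | 255.012 | -0.03% |
| Mean TTFT (ms) | 1 | 170.988 | 100.573 | -41.2% |
| Mean TTFT (ms) | 4 | 231.960 | 233.179 | +0.5% |
| Mean TTFT (ms) | 16 | 715.059 | 576.722 | -19.3% |
| Mean TTFT (ms) | 32 | 1165.311 | 962.902 | -17.4% |
| Mean TPOT (ms) | 1 | 14.474 | 14.399 | -0.5% |
| Mean TPOT (ms) | 4 | 17.849 | 18.635 | +4.4% |
| Mean TPOT (ms) | 16 | 30.729 | 39.458 | +28.4% |
| Mean TPOT (ms) | 32 | 55.758 | 69.244 | +24.2% |
| Per-user Throughput | 1 | 41.075 | 50.301 | +22.5% |
| Per-user Throughput | 4 | 31.928 | 31.097 | -2.6% |
| Per-user Throughput | 16 | 13.573 | 13.638 | +0.48% |
| Per-user Throughput | 32 | 7.971 | 7.969 | -0.03% |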
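The Δ (%) column is the relative change from Before to After. A minimal sketch of that arithmetic, with `pct_change` as a hypothetical helper name and the sample values copied from the concurrency-1 rows:

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# Concurrency 1, input throughput: higher is better, so a positive delta is a gain.
input_tp_delta = pct_change(2628.782, 3219.288)

# Concurrency 1, mean TTFT (ms): lower is better, so a negative delta is a gain.
ttft_delta = pct_change(170.988, 100.573)

print(f"{input_tp_delta:+.1f}% {ttft_delta:+.1f}%")
```

The sign convention matters when reading the table: for the throughput rows a positive Δ is an improvement, while for TTFT and TPOT a positive Δ is a regression.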

Accuracy (GSM8K):

Before:

root@ip-10-40-12-14:/sgl-workspace/sglang# python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 200
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [01:40<00:00, 13.16it/s]
Accuracy: 0.958
Invalid: 0.000
Latency: 100.401 s
Output throughput: 1359.047 token/s

After:

root@ip-10-40-12-14:/sgl-workspace/sglang# python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 200
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [01:38<00:00, 13.44it/s]
Accuracy: 0.956
Invalid: 0.000
Latency: 98.303 s
Output throughput: 1387.402 token/s
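The accuracy moves from 0.958 to 0.956 on 1319 questions, which is roughly three questions. A rough way to judge whether that is noise is the binomial standard error of a single run. This is a sketch under the assumption that each question is an independent pass/fail trial; it is not something the PR itself computes.

```python
import math

NUM_QUESTIONS = 1319  # from --num-questions in the commands above
ACC_BEFORE = 0.958
ACC_AFTER = 0.956


def binomial_std_error(accuracy, num_questions):
    """Standard error of an accuracy estimate, assuming independent questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / num_questions)


std_error = binomial_std_error(ACC_BEFORE, NUM_QUESTIONS)
difference = ACC_AFTER - ACC_BEFORE

# The difference is smaller than one standard error of a single run.
print(f"std_error={std_error:.4f} difference={difference:+.3f}")
```

Under that assumption the 0.002 gap is smaller than one standard error (about 0.0055), so the two runs are consistent with unchanged accuracy.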

FlamingoPg · 2025-11-13T07:06:17Z

Could also add w4a8 test here?

b8zhong · 2025-11-13T07:18:11Z

@FlamingoPg This Deepseek model is still prohibitively large for CI, so I don't think we can add that one. However, I'm pretty sure sgl_per_tensor_quant_fp8 is used by some other W8A8 per-tensor quant models that could also be made compatible. I can try testing one later.

FlamingoPg · 2025-11-13T07:19:52Z

> @FlamingoPg This Deepseek model is still prohibitively large for CI, so I don't think we can add that one. However, I'm pretty sure sgl_per_tensor_quant_fp8 is used by some other W8A8 per-tensor quant models that could also be made compatible. I can try testing one later.

I see, never mind.

b8zhong added 2 commits November 12, 2025 16:51

more

2bbbd1d

more

067bb84

b8zhong added the piecewise-cuda-graph label Nov 13, 2025

b8zhong requested review from BBuf, Edwardf0t1, FlamingoPg and ch-wan as code owners November 13, 2025 03:29

sglang-bot added the run-ci label Nov 13, 2025

b8zhong mentioned this pull request Nov 13, 2025

[Feature] Roadmap for Prefill (Piecewise) CUDA Graph #11490

Closed

34 tasks

b8zhong added 2 commits November 14, 2025 10:18

Merge branch 'main' into support-per-tensor-fp8-piecewise

578a857

Merge branch 'main' into support-per-tensor-fp8-piecewise

e7b62e6

ispobock approved these changes Nov 16, 2025

ispobock merged commit f35f7f1 into sgl-project:main Nov 16, 2025
67 of 78 checks passed

b8zhong deleted the support-per-tensor-fp8-piecewise branch November 21, 2025 04:43

ispobock merged 4 commits into sgl-project:main from bzhng-development:support-per-tensor-fp8-piecewise
