[Piecewise CUDA Graph] Support INT8 by b8zhong · Pull Request #14918 · sgl-project/sglang

b8zhong · 2025-12-11T20:39:17Z

python -m sglang.launch_server --model-path /opt/dlami/nvme/models/meituan-DeepSeek-R1-Channel-INT8 --tp 8 --trust-remote-code --enable-piecewise-cuda-graph --piecewise-cuda-graph-max-tokens 8192 --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}' --quantization w8a8_int8
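When scripting this launch, the JSON value passed to --model-loader-extra-config is the part most prone to shell-quoting mistakes. The sketch below rebuilds the same command as an argument list, serializing the loader config with json.dumps. The helper name build_launch_argv and its parameters are hypothetical and used only for illustration; the flags and values are copied from the command above.

```python
import json
import shlex


def build_launch_argv(model_path: str, tp: int = 8, max_tokens: int = 8192) -> list[str]:
    """Assemble the launch command from the PR description as an argv list.

    Hypothetical helper: only the flags and values come from the PR text.
    """
    # Serialize the loader config once so the quoting is always valid JSON.
    loader_config = {"enable_multithread_load": True, "num_threads": 8}
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tp", str(tp),
        "--trust-remote-code",
        "--enable-piecewise-cuda-graph",
        "--piecewise-cuda-graph-max-tokens", str(max_tokens),
        "--model-loader-extra-config", json.dumps(loader_config),
        "--quantization", "w8a8_int8",
    ]


if __name__ == "__main__":
    argv = build_launch_argv("/opt/dlami/nvme/models/meituan-DeepSeek-R1-Channel-INT8")
    # shlex.join produces a copy-pasteable, correctly quoted shell line.
    print(shlex.join(argv))
```

Passing the list directly to subprocess.Popen avoids the shell entirely, so the JSON needs no extra escaping.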

python test/srt/parse_results.py res_before.jsonl

Saved summary to: res_before_summary.csv

+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|    |   max_concurrency |   input_throughput |   output_throughput |   mean_ttft_ms |   median_ttft_ms |   p99_ttft_ms |   mean_tpot_ms |   median_tpot_ms |   p99_tpot_ms |   per_user_throughput |
+====+===================+====================+=====================+================+==================+===============+================+==================+===============+=======================+
|  0 |             1.000 |           3612.799 |              56.450 |        114.991 |          108.471 |       185.450 |         11.150 |           11.142 |        11.222 |                56.450 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  1 |             4.000 |           9106.442 |             142.288 |        231.972 |          233.104 |       314.956 |         14.437 |           13.144 |        18.080 |                35.572 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  2 |            16.000 |          18268.669 |             285.448 |        493.734 |          417.657 |      1285.484 |         26.635 |           21.658 |        51.481 |                17.840 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  3 |            32.000 |          23181.802 |             362.216 |        673.587 |          391.989 |      1606.803 |         48.652 |           34.264 |       129.803 |                11.319 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+

python test/srt/parse_results.py res_after.jsonl

Saved summary to: res_after_summary.csv

+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|    |   max_concurrency |   input_throughput |   output_throughput |   mean_ttft_ms |   median_ttft_ms |   p99_ttft_ms |   mean_tpot_ms |   median_tpot_ms |   p99_tpot_ms |   per_user_throughput |
+====+===================+====================+=====================+================+==================+===============+================+==================+===============+=======================+
|  0 |             1.000 |           3818.573 |              59.665 |         99.747 |          104.391 |       106.430 |         11.143 |           11.144 |        11.167 |                59.665 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  1 |             4.000 |           9048.437 |             141.382 |        230.229 |          242.815 |       283.490 |         14.737 |           15.660 |        18.005 |                35.345 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  2 |            16.000 |          20859.118 |             325.924 |        389.773 |          257.043 |       875.292 |         26.141 |           22.239 |        50.873 |                20.370 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  3 |            32.000 |          24257.005 |             379.016 |        664.484 |          424.061 |      1673.387 |         45.111 |           32.186 |       114.908 |                11.844 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
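To make the two summaries easier to compare, the sketch below computes the relative change per concurrency level for three of the columns. The values are transcribed by hand from the before and after tables above; the names BEFORE, AFTER, pct_change, and compare are illustrative and not part of parse_results.py.

```python
# Values transcribed from the before/after summary tables, keyed by max_concurrency.
BEFORE = {
    1: {"output_throughput": 56.450, "mean_ttft_ms": 114.991, "mean_tpot_ms": 11.150},
    4: {"output_throughput": 142.288, "mean_ttft_ms": 231.972, "mean_tpot_ms": 14.437},
    16: {"output_throughput": 285.448, "mean_ttft_ms": 493.734, "mean_tpot_ms": 26.635},
    32: {"output_throughput": 362.216, "mean_ttft_ms": 673.587, "mean_tpot_ms": 48.652},
}
AFTER = {
    1: {"output_throughput": 59.665, "mean_ttft_ms": 99.747, "mean_tpot_ms": 11.143},
    4: {"output_throughput": 141.382, "mean_ttft_ms": 230.229, "mean_tpot_ms": 14.737},
    16: {"output_throughput": 325.924, "mean_ttft_ms": 389.773, "mean_tpot_ms": 26.141},
    32: {"output_throughput": 379.016, "mean_ttft_ms": 664.484, "mean_tpot_ms": 45.111},
}


def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


def compare(before: dict, after: dict) -> dict:
    """Per-concurrency, per-metric percentage change."""
    return {
        conc: {m: pct_change(before[conc][m], after[conc][m]) for m in before[conc]}
        for conc in before
    }


if __name__ == "__main__":
    for conc, metrics in compare(BEFORE, AFTER).items():
        row = ", ".join(f"{name}: {delta:+.2f}%" for name, delta in metrics.items())
        print(f"concurrency {conc}: {row}")
```

On these numbers the largest shift is at concurrency 16 (output throughput up by roughly 14%, mean TTFT down by roughly 21%), while concurrency 4 is essentially flat. Each table reflects a single run, so small differences may be noise.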

python3 test_piecewise_cuda_graph.py TestPiecewiseCudaGraphW8A8Int8
Writing report to /tmp/mgsm_en_RedHatAI_Llama-3.2-1B-Instruct-quantized.w8a8.html
{'en': 0.416, 'en:std': 0.4928934976239796, 'group_latin': 0.416, 'group_latin:std': 0.4928934976239796, 'score:std': 0.4928934976239796, 'score': 0.416}
Writing results to /tmp/mgsm_en_RedHatAI_Llama-3.2-1B-Instruct-quantized.w8a8.json
Total latency: 3.873 s
Score: 0.416
MGSM Accuracy: 0.416
.
----------------------------------------------------------------------
Ran 1 test in 36.229s

OK
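The std value in the result dictionary matches what one would get if every problem were scored 0 or 1 and the population standard deviation were taken, i.e. sqrt(p * (1 - p)) for accuracy p. That reading is an inference from the numbers, not something the log states. A minimal check, with the sample count left as a caller-supplied assumption because the log does not report it:

```python
import math


def bernoulli_std(accuracy: float) -> float:
    """Population standard deviation of 0/1 scores with the given mean."""
    return math.sqrt(accuracy * (1.0 - accuracy))


def standard_error(accuracy: float, num_samples: int) -> float:
    """Standard error of the mean; num_samples is NOT given in the log above."""
    return bernoulli_std(accuracy) / math.sqrt(num_samples)


if __name__ == "__main__":
    reported_std = 0.4928934976239796  # 'score:std' from the test output
    print(abs(bernoulli_std(0.416) - reported_std) < 1e-6)
```

With only a few hundred problems the standard error would be a few percentage points, which is worth keeping in mind when a test asserts against a fixed accuracy threshold.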

gemini-code-assist · 2025-12-11T20:39:21Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

b8zhong · 2025-12-11T21:12:12Z

/tag-and-rerun-ci again

ispobock · 2025-12-16T02:57:53Z

The w8a8 quantization test case seems to have failed: https://github.com/sgl-project/sglang/actions/runs/20211842450/job/58051125653?pr=14918

b8zhong · 2025-12-16T03:04:57Z

@ispobock
> AssertionError: 0.88 not greater than 0.88

This is probably flakiness; we will see after the rerun.
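The wording of the error matches the message produced by unittest's assertGreater, which is a strict comparison: a score exactly equal to the threshold fails. The sketch below reproduces that behavior in isolation. The helper name is hypothetical, and the actual assertion in the sglang test suite is not shown in this thread.

```python
import unittest
from typing import Optional


def strict_threshold_failure(score: float, threshold: float) -> Optional[str]:
    """Return the assertGreater failure message, or None if the check passes."""
    case = unittest.TestCase()
    try:
        # assertGreater requires score > threshold; equality is a failure.
        case.assertGreater(score, threshold)
    except AssertionError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(strict_threshold_failure(0.88, 0.88))  # equal to threshold: fails
    print(strict_threshold_failure(0.89, 0.88))  # above threshold: passes
```

A run that lands exactly on the threshold is therefore enough to fail the check, which is consistent with the flakiness explanation.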

b8zhong requested review from AniZpZ, BBuf, Edwardf0t1, FlamingoPg and ch-wan as code owners December 11, 2025 20:39

github-actions Bot added the run-ci label Dec 11, 2025

ispobock approved these changes Dec 16, 2025

b8zhong enabled auto-merge (squash) December 16, 2025 22:06

b8zhong mentioned this pull request Dec 16, 2025

[Feature] Roadmap for Prefill (Piecewise) CUDA Graph #11490 (closed, 34 tasks)

dbd950e

b8zhong force-pushed the brayden/int8-piecewise branch from c206793 to dbd950e Compare December 17, 2025 04:11

ispobock disabled auto-merge December 17, 2025 10:20

ispobock merged commit ffa7e03 into sgl-project:main Dec 17, 2025
83 of 90 checks passed

Prozac614 pushed a commit to Prozac614/sglang that referenced this pull request Dec 23, 2025

[Piecewise CUDA Graph] Support INT8 (sgl-project#14918)

e771524

jiaming1130 pushed a commit to zhuyijie88/sglang that referenced this pull request Dec 25, 2025

[Piecewise CUDA Graph] Support INT8 (sgl-project#14918)

33acf47

YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026

[Piecewise CUDA Graph] Support INT8 (sgl-project#14918)

18d3055

b8zhong deleted the brayden/int8-piecewise branch February 6, 2026 21:30

ispobock merged 1 commit into sgl-project:main from bzhng-development:brayden/int8-piecewise
