[Piecewise CUDA Graph] Support INT8 #14918

Merged
ispobock merged 1 commit into sgl-project:main from bzhng-development:brayden/int8-piecewise on Dec 17, 2025

@b8zhong
Collaborator

@b8zhong b8zhong commented Dec 11, 2025

python -m sglang.launch_server --model-path /opt/dlami/nvme/models/meituan-DeepSeek-R1-Channel-INT8 --tp 8 --trust-remote-code --enable-piecewise-cuda-graph --piecewise-cuda-graph-max-tokens 8192 --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}' --quantization w8a8_int8 
python test/srt/parse_results.py res_before.jsonl

Saved summary to: res_before_summary.csv

+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|    |   max_concurrency |   input_throughput |   output_throughput |   mean_ttft_ms |   median_ttft_ms |   p99_ttft_ms |   mean_tpot_ms |   median_tpot_ms |   p99_tpot_ms |   per_user_throughput |
+====+===================+====================+=====================+================+==================+===============+================+==================+===============+=======================+
|  0 |             1.000 |           3612.799 |              56.450 |        114.991 |          108.471 |       185.450 |         11.150 |           11.142 |        11.222 |                56.450 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  1 |             4.000 |           9106.442 |             142.288 |        231.972 |          233.104 |       314.956 |         14.437 |           13.144 |        18.080 |                35.572 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  2 |            16.000 |          18268.669 |             285.448 |        493.734 |          417.657 |      1285.484 |         26.635 |           21.658 |        51.481 |                17.840 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  3 |            32.000 |          23181.802 |             362.216 |        673.587 |          391.989 |      1606.803 |         48.652 |           34.264 |       129.803 |                11.319 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
python test/srt/parse_results.py res_after.jsonl

Saved summary to: res_after_summary.csv

+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|    |   max_concurrency |   input_throughput |   output_throughput |   mean_ttft_ms |   median_ttft_ms |   p99_ttft_ms |   mean_tpot_ms |   median_tpot_ms |   p99_tpot_ms |   per_user_throughput |
+====+===================+====================+=====================+================+==================+===============+================+==================+===============+=======================+
|  0 |             1.000 |           3818.573 |              59.665 |         99.747 |          104.391 |       106.430 |         11.143 |           11.144 |        11.167 |                59.665 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  1 |             4.000 |           9048.437 |             141.382 |        230.229 |          242.815 |       283.490 |         14.737 |           15.660 |        18.005 |                35.345 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  2 |            16.000 |          20859.118 |             325.924 |        389.773 |          257.043 |       875.292 |         26.141 |           22.239 |        50.873 |                20.370 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  3 |            32.000 |          24257.005 |             379.016 |        664.484 |          424.061 |      1673.387 |         45.111 |           32.186 |       114.908 |                11.844 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
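To make the two summaries easier to compare, here is a minimal sketch that computes the relative change between the `res_before` and `res_after` runs. The numbers are copied by hand from the tables above (only `output_throughput` and `mean_ttft_ms` are included); the helper names `pct_change` and `compare` are hypothetical and are not part of `parse_results.py`.

```python
# max_concurrency -> (output_throughput, mean_ttft_ms), copied from the tables above.
before = {
    1: (56.450, 114.991),
    4: (142.288, 231.972),
    16: (285.448, 493.734),
    32: (362.216, 673.587),
}
after = {
    1: (59.665, 99.747),
    4: (141.382, 230.229),
    16: (325.924, 389.773),
    32: (379.016, 664.484),
}


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def compare(before, after):
    """Return (concurrency, throughput change %, mean TTFT change %) per row."""
    rows = []
    for concurrency in sorted(before):
        old_tput, old_ttft = before[concurrency]
        new_tput, new_ttft = after[concurrency]
        rows.append(
            (
                concurrency,
                pct_change(old_tput, new_tput),
                pct_change(old_ttft, new_ttft),
            )
        )
    return rows


for concurrency, d_tput, d_ttft in compare(before, after):
    print(
        f"concurrency={concurrency:>2}  "
        f"output_throughput {d_tput:+.1f}%  mean_ttft {d_ttft:+.1f}%"
    )
```

On these numbers the largest difference is at concurrency 16, where output throughput rises by roughly 14% and mean TTFT falls by roughly 21%; concurrency 4 is essentially flat. These are single runs, so small differences may be noise.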
python3 test_piecewise_cuda_graph.py TestPiecewiseCudaGraphW8A8Int8
Writing report to /tmp/mgsm_en_RedHatAI_Llama-3.2-1B-Instruct-quantized.w8a8.html
{'en': 0.416, 'en:std': 0.4928934976239796, 'group_latin': 0.416, 'group_latin:std': 0.4928934976239796, 'score:std': 0.4928934976239796, 'score': 0.416}
Writing results to /tmp/mgsm_en_RedHatAI_Llama-3.2-1B-Instruct-quantized.w8a8.json
Total latency: 3.873 s
Score: 0.416
MGSM Accuracy: 0.416
.
----------------------------------------------------------------------
Ran 1 test in 36.229s

OK
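As a side note on reading the MGSM output above: the reported `score:std` is the per-example standard deviation of a 0/1 correctness score, which for accuracy `p` equals `sqrt(p * (1 - p))`. The sketch below checks that against the logged values. The sample count `n = 250` used for the standard error is an assumption (the number of evaluated examples is not shown in the log), and the function names are hypothetical.

```python
import math


def bernoulli_std(p):
    """Standard deviation of a 0/1 outcome with success probability p."""
    return math.sqrt(p * (1.0 - p))


def standard_error(p, n):
    """Standard error of the mean accuracy over n independent examples."""
    return bernoulli_std(p) / math.sqrt(n)


score = 0.416                      # from the log above
reported_std = 0.4928934976239796  # 'score:std' from the log above

# The logged std matches the Bernoulli formula.
print(abs(bernoulli_std(score) - reported_std) < 1e-6)

# ASSUMPTION: n = 250 examples; the log does not state the sample count.
n = 250
print(f"standard error at n={n}: {standard_error(score, n):.3f}")
```

A standard error of a few percentage points at this sample size means small run-to-run differences in the score should be expected.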

@b8zhong
Collaborator Author

b8zhong commented Dec 11, 2025

/tag-and-rerun-ci again

@ispobock
Collaborator

The w8a8 quantization test case seems to have failed: https://github.com/sgl-project/sglang/actions/runs/20211842450/job/58051125653?pr=14918

@b8zhong
Collaborator Author

b8zhong commented Dec 16, 2025

@ispobock
> AssertionError: 0.88 not greater than 0.88

I think this is probably flakiness; we will see after the rerun.
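For context on the quoted error: the message format matches what Python's `unittest.assertGreater` raises, and that check is strict, so a score that lands exactly on the threshold fails. The sketch below reproduces the message with the values from the error. The class name `ThresholdDemo` is hypothetical, and the actual assertion and threshold used in the failing CI test are not shown in this thread.

```python
import unittest


class ThresholdDemo(unittest.TestCase):
    def test_strict_comparison_fails_on_equality(self):
        # assertGreater requires first > second, so equal values fail.
        with self.assertRaises(AssertionError) as ctx:
            self.assertGreater(0.88, 0.88)
        self.assertIn("0.88 not greater than 0.88", str(ctx.exception))

    def test_inclusive_comparison_passes_on_equality(self):
        # assertGreaterEqual accepts a score equal to the threshold.
        self.assertGreaterEqual(0.88, 0.88)


def run_demo():
    """Run both demo tests and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ThresholdDemo)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_demo()
```

A score sitting exactly on the boundary is consistent with run-to-run variation around the threshold, which is why a rerun is a reasonable first check.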

@b8zhong b8zhong force-pushed the brayden/int8-piecewise branch from c206793 to dbd950e Compare December 17, 2025 04:11
@ispobock ispobock disabled auto-merge December 17, 2025 10:20
@ispobock ispobock merged commit ffa7e03 into sgl-project:main Dec 17, 2025
83 of 90 checks passed
Prozac614 pushed a commit to Prozac614/sglang that referenced this pull request Dec 23, 2025
jiaming1130 pushed a commit to zhuyijie88/sglang that referenced this pull request Dec 25, 2025
YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026
@b8zhong b8zhong deleted the brayden/int8-piecewise branch February 6, 2026 21:30