Apply sgl w8a8 fp8 kernel by HandH1998 · Pull Request #3148 · sgl-project/sglang

HandH1998 · 2025-01-26T10:55:41Z

Following #3047, we replace w8a8 fp8 vllm kernel with sgl-kernel. Generally, the w8a8 fp8 sgl-kernel yields higher accuracy on gsm8k. On sm89-L40, the w8a8 fp8 sgl-kernel delivers a 14% higher throughput than the vllm kernel. On sm90-H100, both kernels exhibit similar performance.

Benchmark

model: neuralmagic/Meta-Llama-3-8B-Instruct-FP8

sm89-L40

gsm8k

# w8a8 fp8 vllm-kernel

python3 -m sglang.launch_server --model-path neuralmagic/Meta-Llama-3-8B-Instruct-FP8  --trust-remote-code --disable-radix --port 33333
python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 500 --num-shots 8 --port 33333

Accuracy: 0.752
Invalid: 0.002
Latency: 225.308 s
Output throughput: 616.437 token/s

# w8a8 fp8 sgl-kernel

python3 -m sglang.launch_server --model-path neuralmagic/Meta-Llama-3-8B-Instruct-FP8  --trust-remote-code --disable-radix --port 33333
python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 500 --num-shots 8 --port 33333

Accuracy: 0.761
Invalid: 0.001
Latency: 190.474 s
Output throughput: 702.757 token/s

throughput under various request rates

tok/s

request rate	8	16	32	inf
vllm kernel	1319.32	1942.67	2120.59	2122.10
sgl kernel	1347.00	2132.91	2416.50	2422.71

sm90-H100

gsm8k

# w8a8 fp8 vllm-kernel

python3 -m sglang.launch_server --model-path neuralmagic/Meta-Llama-3-8B-Instruct-FP8  --trust-remote-code --disable-radix --port 33333
python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 8 --port 33333

Accuracy: 0.754
Invalid: 0.000
Latency: 49.751 s
Output throughput: 2801.533 token/s

# w8a8 fp8 sgl-kernel

python3 -m sglang.launch_server --model-path neuralmagic/Meta-Llama-3-8B-Instruct-FP8  --trust-remote-code --disable-radix --port 33333
python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 8 --port 33333

Accuracy: 0.759
Invalid: 0.001
Latency: 49.257 s
Output throughput: 2805.524 token/s

throughput under various request rates

tok/s

request rate	8	16	32	inf
vllm kernel	1468.51	2775.83	4209.25	7121.02
sgl kernel	1468.41	2767.79	4236.97	7168.12

model: neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic

activation dynamic quantization

sm89-L40

gsm8k

# w8a8 fp8 vllm-kernel

python3 -m sglang.launch_server --model-path neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic  --trust-remote-code --disable-radix --port 33333
python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 500 --num-shots 8 --port 33333

Accuracy: 0.779
Invalid: 0.001
Latency: 226.505 s
Output throughput: 590.980 token/s

# w8a8 fp8 sgl-kernel

python3 -m sglang.launch_server --model-path neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic --quantization w8a8_fp8 --trust-remote-code --disable-radix --port 33333
python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 500 --num-shots 8 --port 33333

Accuracy: 0.795
Invalid: 0.002
Latency: 194.464 s
Output throughput: 694.344 token/s

throughput under various request rates

tok/s

request rate	8	16	32	inf
vllm kernel	1317.03	1928.51	2098.10	2104.52
sgl kernel	1344.21	2124.57	2420.53	2445.45

sm90-H100

gsm8k

# w8a8 fp8 vllm-kernel

python3 -m sglang.launch_server --model-path neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic  --trust-remote-code --disable-radix --port 33333
python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 8 --port 33333

Accuracy: 0.754
Invalid: 0.000
Latency: 49.751 s
Output throughput: 2801.533 token/s

# w8a8 fp8 sgl-kernel

python3 -m sglang.launch_server --model-path neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic --quantization w8a8_fp8  --trust-remote-code --disable-radix --port 33333
python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 8 --port 33333

Accuracy: 0.796
Invalid: 0.001
Latency: 47.991 s
Output throughput: 2801.793 token/s

throughput under various request rates

tok/s

request rate	8	16	32	inf
vllm kernel	1454.58	2692.00	4231.36	6664.01
sgl kernel	1454.70	2692.19	4217.78	6693.59

zhyncs · 2025-01-26T10:57:58Z

Let me bump a new sgl-kernel version to unblock this PR.

merrymercy · 2025-01-26T12:37:19Z

Can we still provide a flag to fallback to vllm's implementation? Similar to the custom allreduce kernel

sglang/python/sglang/srt/_custom_ops.py

Line 15 in 7e09761

use_vllm_custom_allreduce = os.environ.get("USE_VLLM_CUSTOM_ALLREDUCE", default=True)

. We need this at the beginning to quickly debug possible regression and AMD compatibility.
How is the performance on dynamic quantization?

…rnel

zhyncs · 2025-02-10T05:03:43Z

@HandH1998 What is the progress of this PR? Please let me know when it is ready.

HandH1998 · 2025-02-10T06:26:54Z

@zhyncs two days later

HandH1998 · 2025-02-12T07:07:00Z

@merrymercy @zhyncs
I have added support for falling back to the vLLM cutlass w8a8 fp8 kernel and have benchmarked dynamic quantization. The benchmark results of dynamic quantization are appended to the static quantization benchmark results, showing similar performance. However, as the batch size increases, the end-to-end performance using the sgl_kernel is worse than using the vllm_kernel. I believe the bottleneck is at the Triton kernel per_token_group_quant_fp8 as @BBuf mentioned in #3493. If we replace the Triton implementation with the CUDA implementation, the performance gap should decrease further.

HandH1998 · 2025-02-12T07:18:03Z

I also added a quantization config w8a8_fp8 to support the inference of quantized model underactivation dynamic per-token quantization, weight static per-channel quantization following #2881. You can use --quantization w8a8_fp8 to load the quantized checkpoint then perform the inference directly without any modification to config.json.

zhyncs · 2025-02-12T10:41:46Z

#3493 @HandH1998 this has been merged

…rnel

HandH1998 · 2025-03-07T09:18:54Z

@zhyncs

This PR depends on the latest sgl-kernel, but the CI doesn't use the latest sgl-kernel.

zhyncs · 2025-03-07T09:23:06Z

@HandH1998 wait for this https://github.com/sgl-project/sglang/actions/runs/13717498755/job/38365392684

zhyncs · 2025-03-07T09:24:24Z

update this to 0.0.3.post7

sglang/python/pyproject.toml

Line 47 in 96263f2

"sgl-kernel==0.0.3.post6",

HandH1998 · 2025-03-07T09:45:20Z

update this to 0.0.3.post7

sglang/python/pyproject.toml

Line 47 in 96263f2

"sgl-kernel==0.0.3.post6",

need to upload to pypi?

zhyncs · 2025-03-07T10:23:23Z

update this to 0.0.3.post7

sglang/python/pyproject.toml

Line 47 in 96263f2

"sgl-kernel==0.0.3.post6",

need to upload to pypi?

done https://pypi.org/project/sgl-kernel/0.0.3.post7/

HandH1998 · 2025-03-07T15:20:15Z

@zhyncs

The two falied CIs seems are related with DSv3. I tried to reproduce them locally. But I can't find lmsys/sglang-ci-dsv3-test model in huggingface.

zhyncs · 2025-03-07T18:44:14Z

@HandH1998 You can give me the HF user name or use DeepSeek V3/R1 for testing. I have also updated this, so if you wish to upgrade, please update this as well.

sglang/scripts/ci_install_dependency.sh

Line 29 in 70866b6

pip install sgl-kernel==0.0.3.post6 --force-reinstall --no-deps

hebiao064 · 2025-03-07T19:03:49Z

@HandH1998 Do you think we should support similar api like scaled_fp8_quant
https://github.com/vllm-project/vllm/blob/8ca7a71df787ad711ad3ac70a5bd2eb2bb398938/vllm/_custom_ops.py#L842

HandH1998 · 2025-03-08T04:45:57Z

@HandH1998 Do you think we should support similar api like scaled_fp8_quant https://github.com/vllm-project/vllm/blob/8ca7a71df787ad711ad3ac70a5bd2eb2bb398938/vllm/_custom_ops.py#L842

The cutlass w8a8 fp8 kernel only support per-channel activation scales, so I only apply per_token_quant. The scaled_fp8_quant also also per_tensor_quant, but it is not compatible with the cutlass w8a8 fp8 kernel and also bring a worse accuracy than per_token_quant and only a little better speed. So I don't think it is necessary.

…rnel

HandH1998 · 2025-03-08T05:28:29Z

sglang/scripts/ci_install_dependency.sh

My HF user name is HandH1998.

HandH1998 · 2025-03-08T14:10:32Z

The CI failures are caused by 0.0.3.post7 sgl-kernel. Ref to #4214. @zhyncs

…rnel

HandH1998 added 2 commits January 26, 2025 08:52

apply w8a8 fp8 sgl-kernel

3047c51

fix import

a5d3130

HandH1998 requested review from ByronHsu, Ying1123, hnyls2002, ispobock, merrymercy and zhyncs as code owners January 26, 2025 10:55

zhyncs mentioned this pull request Jan 26, 2025

chore: bump 0.0.2.post18 for sgl-kernel #3149

Merged

4 tasks

zhyncs added the high priority label Jan 26, 2025

zhyncs assigned ispobock and zhyncs Jan 26, 2025

HaiShaw self-requested a review January 26, 2025 11:12

zhyncs added 2 commits January 26, 2025 19:28

Merge branch 'main' into apply_sgl_w8a8_fp8_kernel

eb7aa84

Merge branch 'main' into apply_sgl_w8a8_fp8_kernel

c0f805c

Merge remote-tracking branch 'origin/main' into apply_sgl_w8a8_fp8_ke…

2a01f12

…rnel

Merge branch 'main' into apply_sgl_w8a8_fp8_kernel

0cda3af

HandH1998 added 4 commits February 11, 2025 19:32

support falling back to vllm w8a8 fp8 kernel

5048341

support w8a8 fp8 config

6cbe4cc

fix bug

f59d06d

format

e560b2f

zhyncs added 2 commits February 12, 2025 18:41

Merge branch 'main' into apply_sgl_w8a8_fp8_kernel

193a92b

Merge branch 'main' into apply_sgl_w8a8_fp8_kernel

bf36e77

HandH1998 added 2 commits March 7, 2025 17:07

use per_token_quant_fp8 of sgl_kernel

8aa28b1

Merge remote-tracking branch 'origin/main' into apply_sgl_w8a8_fp8_ke…

37dd9c3

…rnel

HandH1998 added 2 commits March 7, 2025 17:29

update sgl-kernel to 0.0.3.post7

d51f952

Merge branch 'main' into apply_sgl_w8a8_fp8_kernel

0d7a620

zhyncs reviewed Mar 7, 2025

View reviewed changes

Comment thread python/sglang/test/test_block_fp8.py

Comment thread python/pyproject.toml Outdated

fix

345eda1

Merge branch 'main' into apply_sgl_w8a8_fp8_kernel

7ad2666

fix

c4695dc

Merge remote-tracking branch 'origin/main' into apply_sgl_w8a8_fp8_ke…

ef72590

…rnel

HandH1998 and others added 3 commits March 8, 2025 05:29

fix

ed6a8fe

Merge branch 'main' into apply_sgl_w8a8_fp8_kernel

5b21c8c

Merge branch 'main' into apply_sgl_w8a8_fp8_kernel

443a3b9

HandH1998 and others added 3 commits March 9, 2025 06:03

Merge remote-tracking branch 'origin/main' into apply_sgl_w8a8_fp8_ke…

eb58d72

…rnel

Merge remote-tracking branch 'origin/main' into apply_sgl_w8a8_fp8_ke…

7c8ef91

…rnel

Merge branch 'main' into apply_sgl_w8a8_fp8_kernel

9e7e3b9

zhyncs approved these changes Mar 9, 2025

View reviewed changes

zhyncs merged commit 0dd6cda into sgl-project:main Mar 9, 2025

hebiao064 mentioned this pull request Mar 15, 2025

[Accuracy] [Online Quantization] Llama 1B FP16/FP8/W8A8_FP8 accuracy #4434

Closed

merrymercy mentioned this pull request Apr 26, 2025

Development Roadmap (2025 H1) #4042

Closed

67 tasks

Conversation

HandH1998 commented Jan 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmark

model: neuralmagic/Meta-Llama-3-8B-Instruct-FP8

sm89-L40

gsm8k

throughput under various request rates

sm90-H100

gsm8k

throughput under various request rates

model: neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic

sm89-L40

gsm8k

throughput under various request rates

sm90-H100

gsm8k

throughput under various request rates

Uh oh!

zhyncs commented Jan 26, 2025

Uh oh!

merrymercy commented Jan 26, 2025

Uh oh!

zhyncs commented Feb 10, 2025

Uh oh!

HandH1998 commented Feb 10, 2025

Uh oh!

HandH1998 commented Feb 12, 2025

Uh oh!

HandH1998 commented Feb 12, 2025

Uh oh!

zhyncs commented Feb 12, 2025

Uh oh!

HandH1998 commented Mar 7, 2025

Uh oh!

zhyncs commented Mar 7, 2025

Uh oh!

zhyncs commented Mar 7, 2025

Uh oh!

Uh oh!

Uh oh!

HandH1998 commented Mar 7, 2025

Uh oh!

zhyncs commented Mar 7, 2025

Uh oh!

HandH1998 commented Mar 7, 2025

Uh oh!

zhyncs commented Mar 7, 2025

Uh oh!

hebiao064 commented Mar 7, 2025

Uh oh!

HandH1998 commented Mar 8, 2025

Uh oh!

HandH1998 commented Mar 8, 2025

Uh oh!

HandH1998 commented Mar 8, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

HandH1998 commented Jan 26, 2025 •

edited

Loading