[quant kernel] sgl-kernel support per_tensor_quant fp8 #3786
Merged
hebiao064 reviewed Mar 6, 2025
This PR adds a per-tensor FP8 quantization kernel to sgl-kernel, replacing vLLM's ops.scaled_fp8_quant and removing a dependency on a vLLM kernel; the reduction uses cub::Reduce. The new kernel is 10%-60% faster on H100 and 20%-100% faster on H200. It also fixes a launch failure hit when the number of tokens is very large:

Program hit cudaErrorInvalidConfiguration (error 9) due to "invalid configuration argument" on CUDA API call to cudaLaunchKernel.
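For context, per-tensor FP8 quantization computes one scale from the global absolute maximum of the tensor and divides every element by it. The sketch below is a minimal PyTorch reference for those semantics only; it is not the fused CUDA kernel this PR adds (which does the abs-max reduction and the cast on-device), and the function name and the zero-input guard are illustrative assumptions.

```python
import torch

# Largest finite value representable in float8_e4m3fn (448.0).
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

def per_tensor_quant_fp8_ref(x: torch.Tensor):
    """Reference semantics: one scale shared by the whole tensor.

    scale = max(|x|) / FP8_MAX, then q = clamp(x / scale) cast to FP8.
    Hypothetical helper for illustration, not the sgl-kernel API.
    """
    amax = x.abs().max().to(torch.float32)
    # Guard against an all-zero tensor so the division below is defined.
    scale = torch.clamp(amax / FP8_MAX, min=torch.finfo(torch.float32).tiny)
    q = (x.to(torch.float32) / scale).clamp(-FP8_MAX, FP8_MAX)
    return q.to(torch.float8_e4m3fn), scale
```

Dequantization is `q.float() * scale`; "per-tensor" (as opposed to per-token or per-block) means this single scale covers every element.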
H100:

✅ All implementations match

per-tensor-quant-fp8-performance:

| batch_size | seq_len | VLLM | SGL Kernel |
| ---: | ---: | ---: | ---: |
| 16 | 64 | 37.919998 | 27.840000 |
| 16 | 128 | 61.888002 | 40.959999 |
| 16 | 256 | 121.023998 | 74.400000 |
| 16 | 512 | 233.967990 | 148.079991 |
| 16 | 1024 | 444.415987 | 273.887992 |
| 16 | 2048 | 862.720013 | 523.808002 |
| 32 | 64 | 62.080000 | 41.152000 |
| 32 | 128 | 120.768003 | 74.239999 |
| 32 | 256 | 233.855993 | 148.240000 |
| 32 | 512 | 444.512010 | 273.983985 |
| 32 | 1024 | 862.768054 | 523.775995 |
| 32 | 2048 | 1704.031944 | 1022.799969 |
| 64 | 64 | 120.576002 | 74.207999 |
| 64 | 128 | 233.088002 | 148.256004 |
| 64 | 256 | 443.136007 | 274.623990 |
| 64 | 512 | 860.224009 | 522.783995 |
| 64 | 1024 | 1700.991988 | 1021.952033 |
| 64 | 2048 | 3366.015911 | 2000.895977 |
| 128 | 64 | 233.375996 | 148.128003 |
| 128 | 128 | 443.071991 | 274.015993 |
| 128 | 256 | 860.239983 | 522.144020 |
| 128 | 512 | 1700.688004 | 1022.207975 |
| 128 | 1024 | 3364.768028 | 2003.328085 |
| 128 | 2048 | 6704.607964 | 3968.960047 |

H200:
✅ All implementations match

per-tensor-quant-fp8-performance:

| batch_size | seq_len | VLLM | SGL Kernel |
| ---: | ---: | ---: | ---: |
| 16 | 64 | 35.200000 | 25.024001 |
| 16 | 128 | 57.312001 | 36.352001 |
| 16 | 256 | 110.239998 | 63.263997 |
| 16 | 512 | 220.384002 | 116.544001 |
| 16 | 1024 | 423.103988 | 211.935997 |
| 16 | 2048 | 822.607994 | 399.199992 |
| 32 | 64 | 57.567999 | 36.320001 |
| 32 | 128 | 110.335998 | 63.295998 |
| 32 | 256 | 220.223993 | 116.448000 |
| 32 | 512 | 423.119992 | 211.888000 |
| 32 | 1024 | 821.503997 | 399.183989 |
| 32 | 2048 | 1627.104044 | 775.839984 |
| 64 | 64 | 110.431999 | 63.327998 |
| 64 | 128 | 220.543995 | 116.608001 |
| 64 | 256 | 423.168004 | 211.935997 |
| 64 | 512 | 821.695983 | 399.071991 |
| 64 | 1024 | 1625.263929 | 776.368022 |
| 64 | 2048 | 3222.768068 | 1522.783995 |
| 128 | 64 | 220.128000 | 116.576001 |
| 128 | 128 | 423.135996 | 211.840004 |
| 128 | 256 | 821.407974 | 399.183989 |
| 128 | 512 | 1625.616074 | 775.776029 |
| 128 | 1024 | 3221.407890 | 1522.495985 |
| 128 | 2048 | 6415.455818 | 3015.967846 |
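The tables compare per-call latency of the vLLM op against the new kernel across the (batch_size, seq_len) grid; lower is better (units are not stated above). A minimal timing harness in that spirit, assuming a CUDA device, Triton's do_bench for timing, an illustrative hidden size of 4096 (the actual benchmark's hidden size is not shown), and reusing per_tensor_quant_fp8_ref from the sketch earlier:

```python
import torch
import triton.testing

# per_tensor_quant_fp8_ref is the reference sketch defined above; the
# same harness could time vLLM's ops.scaled_fp8_quant or the new
# sgl-kernel op instead.

def bench_quant(fn, batch_size: int, seq_len: int, hidden: int = 4096) -> float:
    """Average latency of one quantization call, in microseconds.

    `fn` takes a (num_tokens, hidden) activation tensor; hidden=4096 is
    an assumed placeholder, not necessarily what the tables used.
    """
    x = torch.randn(batch_size * seq_len, hidden,
                    device="cuda", dtype=torch.float16)
    ms = triton.testing.do_bench(lambda: fn(x))  # do_bench reports ms
    return ms * 1e3

if __name__ == "__main__":
    # Sweep the same (batch_size, seq_len) grid as the tables above.
    for bs in (16, 32, 64, 128):
        for seq in (64, 128, 256, 512, 1024, 2048):
            us = bench_quant(per_tensor_quant_fp8_ref, bs, seq)
            print(f"batch_size={bs:<4} seq_len={seq:<5} {us:10.3f} us")
```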