Add awq dequantize kernel to sgl with 1x to 3x speedup by zcnrex · Pull Request #4104 · sgl-project/sglang

zcnrex · 2025-03-05T19:27:39Z

Motivation

Achieves a 1x to 4x speedup over the vLLM awq_dequantize CUDA implementation.

The speedup comes from removing the final copy and for loop used when assigning the output.
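For readers unfamiliar with the operation, here is a minimal pure-Python sketch of what 4-bit AWQ dequantization computes. It assumes the AutoAWQ-style layout: eight 4-bit values packed into each 32-bit word in interleaved order, with one scale and one zero point per group of rows. The names `pack_word`, `unpack_word`, and `awq_dequantize_ref` are hypothetical and chosen for illustration. This is a slow reference, not the CUDA kernel added in this PR.

```python
# Assumption: AutoAWQ-style packing, where nibble i of a 32-bit word
# holds logical column AWQ_ORDER[i].
AWQ_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]


def pack_word(vals):
    """Pack eight 4-bit values (logical column order) into one 32-bit word."""
    word = 0
    for i, col in enumerate(AWQ_ORDER):
        word |= (vals[col] & 0xF) << (4 * i)
    return word


def unpack_word(word):
    """Unpack one 32-bit word into eight 4-bit values in logical column order."""
    out = [0] * 8
    for i, col in enumerate(AWQ_ORDER):
        out[col] = (word >> (4 * i)) & 0xF
    return out


def awq_dequantize_ref(qweight, scales, qzeros, group_size):
    """Reference dequantization: (weight - zero) * scale per element.

    qweight: rows x packed_cols list of packed words
    scales:  groups x (packed_cols * 8) list of floats
    qzeros:  groups x packed_cols list of packed words
    """
    out = []
    for r, row in enumerate(qweight):
        g = r // group_size  # all rows in a group share scales and zeros
        out_row = []
        for c, word in enumerate(row):
            w = unpack_word(word)
            z = unpack_word(qzeros[g][c])
            for k in range(8):
                out_row.append((w[k] - z[k]) * scales[g][c * 8 + k])
        out.append(out_row)
    return out


# Tiny example: one row, one packed column, group size 1.
weights = [3, 7, 1, 15, 0, 8, 4, 9]
qweight = [[pack_word(weights)]]
qzeros = [[pack_word([8] * 8)]]
scales = [[0.5] * 8]
result = awq_dequantize_ref(qweight, scales, qzeros, group_size=1)
print(result)
```

Each output element is written directly to its final position, so the sketch needs no separate reordering or copy pass after the arithmetic.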

✅ All implementations match
awq-dequantize-performance:
    qweight_row  qweight_col         VLLM  SGL Kernel
0        3584.0        448.0    80.863997   27.519999
1        3584.0        576.0    96.832000   32.736000
2        3584.0       4736.0   669.727981  182.528004
3        3584.0         16.0    15.584000   13.296000
4        3584.0         32.0    17.983999   13.888000
5        3584.0         64.0    20.800000   14.784000
6        3584.0        128.0    29.472001   17.759999
7       18944.0        448.0   356.128007   99.232003
8       18944.0        576.0   441.632003  123.007998
9       18944.0       4736.0  3465.471983  897.119999
10      18944.0         16.0    24.992000   16.640000
11      18944.0         32.0    33.567999   19.040000
12      18944.0         64.0    49.791999   23.871999
13      18944.0        128.0    82.176000   36.511999
14        128.0        448.0    15.823999   13.584000
15        128.0        576.0    16.783999   13.824000
16        128.0       4736.0    39.584000   19.616000
17        128.0         16.0    12.992000   12.800000
18        128.0         32.0    12.896000   12.704000
19        128.0         64.0    12.992000   12.896000
20        128.0        128.0    13.504000   13.200000
21        256.0        448.0    19.040000   14.240000
22        256.0        576.0    20.352000   14.640000
23        256.0       4736.0    63.391998   24.480000
24        256.0         16.0    12.832000   12.720000
25        256.0         32.0    12.912000   12.832000
26        256.0         64.0    13.376000   13.056000
27        256.0        128.0    14.144000   13.312000
28        512.0        448.0    23.871999   15.328000
29        512.0        576.0    28.031999   16.832000
30        512.0       4736.0   110.048003   37.023999
31        512.0         16.0    12.896000   12.704000
32        512.0         32.0    13.264000   12.992000
33        512.0         64.0    14.144000   13.312000
34        512.0        128.0    15.552000   13.328000
35       1024.0        448.0    35.007998   18.224001
36       1024.0        576.0    39.391998   19.328000
37       1024.0       4736.0   203.455999   62.015999
38       1024.0         16.0    13.280000   13.024000
39       1024.0         32.0    14.464000   13.280000
40       1024.0         64.0    15.488000   13.216000
41       1024.0        128.0    18.271999   14.368000
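The per-shape speedup can be read off the table as the ratio of the VLLM column to the SGL Kernel column. A small sketch of that arithmetic follows, using three rows copied from the table above. The helper name `speedup` is hypothetical, and the time unit is whatever the benchmark reports, since the ratio does not depend on it.

```python
# (qweight_row, qweight_col, vllm_time, sgl_time), copied from the table above.
rows = [
    (18944, 4736, 3465.471983, 897.119999),  # largest shape
    (3584, 448, 80.863997, 27.519999),       # mid-sized shape
    (128, 16, 12.992000, 12.800000),         # smallest shape
]


def speedup(baseline, candidate):
    """Ratio of baseline time to candidate time; values above 1 favor the candidate."""
    return baseline / candidate


for n_rows, n_cols, vllm_time, sgl_time in rows:
    print(f"{n_rows} x {n_cols}: {speedup(vllm_time, sgl_time):.2f}x")
```

The largest shape lands just under 4x and the smallest just above 1x, which is the range quoted in the motivation.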

There is also a Triton implementation that we could port later.

The SGL kernel performs slightly better than the Triton implementation on H100:

$ export VLLM_USE_TRITON_AWQ=1

✅ All implementations match
awq-dequantize-performance:
    qweight_row  qweight_col        VLLM  SGL Kernel
0        3584.0        448.0   29.200001   28.160000
1        3584.0        576.0   33.696000   33.376001
2        3584.0       4736.0  183.824003  183.295995
3        3584.0         16.0   14.816000   13.920000
4        3584.0         32.0   15.424000   14.848000
5        3584.0         64.0   16.319999   15.520000
6        3584.0        128.0   19.743999   18.880000
7       18944.0        448.0   99.583998   99.808000
8       18944.0        576.0  123.680003  123.712003
9       18944.0       4736.0  898.576021  897.840023
10      18944.0         16.0   19.007999   17.664000
11      18944.0         32.0   20.608000   19.776000
12      18944.0         64.0   26.303999   24.607999
13      18944.0        128.0   37.344001   37.280001
14        128.0        448.0   15.040000   14.144000
15        128.0        576.0   15.200000   14.304000
16        128.0       4736.0   21.183999   20.288000
17        128.0         16.0   14.272000   13.472000
18        128.0         32.0   14.432000   13.488000
19        128.0         64.0   14.592000   13.664000
20        128.0        128.0   14.784000   13.728000
21        256.0        448.0   15.840000   14.976000
22        256.0        576.0   16.031999   15.168000
23        256.0       4736.0   26.880000   25.119999
24        256.0         16.0   14.368000   13.472000
25        256.0         32.0   14.560000   13.600000
26        256.0         64.0   14.752000   13.664000
27        256.0        128.0   14.976000   13.920000
28        512.0        448.0   16.480001   15.807999
29        512.0        576.0   18.656000   17.759999
30        512.0       4736.0   37.888002   37.664000
31        512.0         16.0   14.464000   13.568000
32        512.0         32.0   14.752000   13.664000
33        512.0         64.0   14.976000   13.888000
34        512.0        128.0   15.008000   14.016000
35       1024.0        448.0   19.904001   19.168001
36       1024.0        576.0   20.864001   19.936001
37       1024.0       4736.0   63.167997   62.816001
38       1024.0         16.0   14.560000   13.664000
39       1024.0         32.0   14.976000   13.920000
40       1024.0         64.0   14.848000   13.952000
41       1024.0        128.0   15.680000   14.944000

Modifications

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

BBuf · 2025-03-07T01:46:16Z

Rebase main please. @zcnrex

zcnrex · 2025-03-07T06:07:00Z

@BBuf Rebased main, ready for review. Thanks.

BBuf · 2025-03-07T09:14:28Z

> @BBuf Rebased main, ready for review. Thanks.

Thanks, I'll review it when I am free.

BBuf mentioned this pull request Mar 5, 2025

[Feature] remove vllm _custom_ops #2965

Closed

18 tasks

zcnrex marked this pull request as ready for review March 6, 2025 19:20

zcnrex requested review from BBuf, HandH1998, ispobock, merrymercy, yizhang2077 and zhyncs as code owners March 6, 2025 19:20

zcnrex changed the title ~~Add awq dequantize kernel to sql~~ Add awq dequantize kernel to sgl Mar 6, 2025

zcnrex changed the title ~~Add awq dequantize kernel to sgl~~ Add awq dequantize kernel to sgl with 1x to 3x speedup Mar 6, 2025

zcnrex force-pushed the awq-dequantize branch 4 times, most recently from a77798e to 5a96f94 Compare March 7, 2025 06:05

zcnrex force-pushed the awq-dequantize branch 2 times, most recently from d78b4e0 to dde3583 Compare March 8, 2025 01:04

hebiao064 reviewed Mar 8, 2025

View reviewed changes

Comment thread sgl-kernel/benchmark/bench_awq_dequant.py Outdated

zcnrex force-pushed the awq-dequantize branch 2 times, most recently from 76656f2 to 3c6eb12 Compare March 8, 2025 06:01

merrymercy force-pushed the main branch from e7dd5ee to d4017a6 Compare March 8, 2025 06:12

merrymercy requested review from ByronHsu, HaiShaw, Ying1123, hnyls2002, kssteven418 and rkooo567 as code owners March 8, 2025 06:12

zcnrex force-pushed the awq-dequantize branch from 3c6eb12 to c2fad64 Compare March 8, 2025 07:21

zcnrex force-pushed the awq-dequantize branch 5 times, most recently from 850c9b0 to 6b2dffb Compare March 9, 2025 05:26

BBuf reviewed Mar 9, 2025

View reviewed changes

Comment thread sgl-kernel/src/sgl-kernel/csrc/gemm/awq_kernel.cu Outdated

zcnrex force-pushed the awq-dequantize branch from 3d32f23 to 58f0e70 Compare March 9, 2025 20:05

BBuf approved these changes Mar 10, 2025

View reviewed changes

zcnrex and others added 5 commits March 9, 2025 20:03

Add awq dequantize kernel to sgl with 1x to 3x speedup

5af67cb

add unit tests

2b00d91

formated

b20fe9e

pre-commit

ed5369e

update comments

8ae573e

zcnrex force-pushed the awq-dequantize branch from 58f0e70 to 8ae573e Compare March 10, 2025 03:03

zcnrex and others added 2 commits March 9, 2025 21:54

Merge branch 'sgl-project:main' into awq-dequantize

69aaf2e

Merge branch 'main' into awq-dequantize

e5d6fa2

zhyncs added the high priority label Mar 10, 2025

zhyncs merged commit 07f9446 into sgl-project:main Mar 12, 2025

hebiao064 pushed a commit to hebiao064/sglang that referenced this pull request Mar 13, 2025

Add awq dequantize kernel to sgl with 1x to 3x speedup (sgl-project#4104)

db9e704

zcnrex mentioned this pull request Mar 18, 2025

AWQ dequant bf16 support #4537

Closed

6 tasks

zhyncs merged 7 commits into sgl-project:main from zcnrex:awq-dequantize