Add awq dequantize kernel to sgl with 1x to 3x speedup#4104

Merged
zhyncs merged 7 commits into sgl-project:main from zcnrex:awq-dequantize
Mar 12, 2025
Conversation

@zcnrex zcnrex commented Mar 5, 2025

Motivation

#2965

Adapted from vLLM's awq_dequantize.

Achieved a 1x-4x speedup compared to the vLLM awq_dequantize CUDA implementation.

The performance improvement comes from removing the final copy and for-loop when assigning the output.

✅ All implementations match
awq-dequantize-performance:
    qweight_row  qweight_col         VLLM  SGL Kernel
0        3584.0        448.0    80.863997   27.519999
1        3584.0        576.0    96.832000   32.736000
2        3584.0       4736.0   669.727981  182.528004
3        3584.0         16.0    15.584000   13.296000
4        3584.0         32.0    17.983999   13.888000
5        3584.0         64.0    20.800000   14.784000
6        3584.0        128.0    29.472001   17.759999
7       18944.0        448.0   356.128007   99.232003
8       18944.0        576.0   441.632003  123.007998
9       18944.0       4736.0  3465.471983  897.119999
10      18944.0         16.0    24.992000   16.640000
11      18944.0         32.0    33.567999   19.040000
12      18944.0         64.0    49.791999   23.871999
13      18944.0        128.0    82.176000   36.511999
14        128.0        448.0    15.823999   13.584000
15        128.0        576.0    16.783999   13.824000
16        128.0       4736.0    39.584000   19.616000
17        128.0         16.0    12.992000   12.800000
18        128.0         32.0    12.896000   12.704000
19        128.0         64.0    12.992000   12.896000
20        128.0        128.0    13.504000   13.200000
21        256.0        448.0    19.040000   14.240000
22        256.0        576.0    20.352000   14.640000
23        256.0       4736.0    63.391998   24.480000
24        256.0         16.0    12.832000   12.720000
25        256.0         32.0    12.912000   12.832000
26        256.0         64.0    13.376000   13.056000
27        256.0        128.0    14.144000   13.312000
28        512.0        448.0    23.871999   15.328000
29        512.0        576.0    28.031999   16.832000
30        512.0       4736.0   110.048003   37.023999
31        512.0         16.0    12.896000   12.704000
32        512.0         32.0    13.264000   12.992000
33        512.0         64.0    14.144000   13.312000
34        512.0        128.0    15.552000   13.328000
35       1024.0        448.0    35.007998   18.224001
36       1024.0        576.0    39.391998   19.328000
37       1024.0       4736.0   203.455999   62.015999
38       1024.0         16.0    13.280000   13.024000
39       1024.0         32.0    14.464000   13.280000
40       1024.0         64.0    15.488000   13.216000
41       1024.0        128.0    18.271999   14.368000
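For context on what the kernel computes: AWQ packs eight 4-bit weights into each int32 with an interleaved nibble order, and dequantization unpacks each nibble, subtracts its per-group zero-point, and multiplies by the per-group scale. A minimal NumPy reference sketch of that math (the `ORDER` interleave pattern and the function name here are illustrative assumptions, not the kernel's exact layout):

```python
import numpy as np

PACK = 8                           # eight 4-bit weights packed into one int32
# Hypothetical interleaved nibble order (assumption; AWQ uses a similar
# interleave, but the exact pattern lives in the CUDA kernel).
ORDER = [0, 4, 1, 5, 2, 6, 3, 7]

def awq_dequant_ref(qweight, scales, qzeros, group_size=128):
    """Dequantize packed 4-bit weights: out = (w - zero) * scale.

    qweight: (rows, cols // 8) packed int weights
    scales:  (rows // group_size, cols) per-group scales
    qzeros:  (rows // group_size, cols // 8) packed per-group zero-points
    """
    rows, pcols = qweight.shape
    out = np.empty((rows, pcols * PACK), dtype=np.float32)
    g = np.arange(rows) // group_size          # scale/zero group per row
    for j in range(PACK):
        w = (qweight >> (4 * j)) & 0xF         # unpack nibble j of the weights
        z = (qzeros >> (4 * j)) & 0xF          # matching zero-point nibble
        out[:, ORDER[j]::PACK] = (w - z[g]) * scales[g, ORDER[j]::PACK]
    return out
```

The CUDA kernel presumably fuses this unpack/subtract/scale into a single pass that writes straight to the output tensor, which is where dropping the extra copy and loop pays off.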

There's also a Triton implementation that we could port later.

Our kernel also performs slightly better than the Triton implementation on an H100:

$ export VLLM_USE_TRITON_AWQ=1

✅ All implementations match
awq-dequantize-performance:
    qweight_row  qweight_col        VLLM  SGL Kernel
0        3584.0        448.0   29.200001   28.160000
1        3584.0        576.0   33.696000   33.376001
2        3584.0       4736.0  183.824003  183.295995
3        3584.0         16.0   14.816000   13.920000
4        3584.0         32.0   15.424000   14.848000
5        3584.0         64.0   16.319999   15.520000
6        3584.0        128.0   19.743999   18.880000
7       18944.0        448.0   99.583998   99.808000
8       18944.0        576.0  123.680003  123.712003
9       18944.0       4736.0  898.576021  897.840023
10      18944.0         16.0   19.007999   17.664000
11      18944.0         32.0   20.608000   19.776000
12      18944.0         64.0   26.303999   24.607999
13      18944.0        128.0   37.344001   37.280001
14        128.0        448.0   15.040000   14.144000
15        128.0        576.0   15.200000   14.304000
16        128.0       4736.0   21.183999   20.288000
17        128.0         16.0   14.272000   13.472000
18        128.0         32.0   14.432000   13.488000
19        128.0         64.0   14.592000   13.664000
20        128.0        128.0   14.784000   13.728000
21        256.0        448.0   15.840000   14.976000
22        256.0        576.0   16.031999   15.168000
23        256.0       4736.0   26.880000   25.119999
24        256.0         16.0   14.368000   13.472000
25        256.0         32.0   14.560000   13.600000
26        256.0         64.0   14.752000   13.664000
27        256.0        128.0   14.976000   13.920000
28        512.0        448.0   16.480001   15.807999
29        512.0        576.0   18.656000   17.759999
30        512.0       4736.0   37.888002   37.664000
31        512.0         16.0   14.464000   13.568000
32        512.0         32.0   14.752000   13.664000
33        512.0         64.0   14.976000   13.888000
34        512.0        128.0   15.008000   14.016000
35       1024.0        448.0   19.904001   19.168001
36       1024.0        576.0   20.864001   19.936001
37       1024.0       4736.0   63.167997   62.816001
38       1024.0         16.0   14.560000   13.664000
39       1024.0         32.0   14.976000   13.920000
40       1024.0         64.0   14.848000   13.952000
41       1024.0        128.0   15.680000   14.944000

Modifications

Checklist

@BBuf BBuf mentioned this pull request Mar 5, 2025
@zcnrex zcnrex marked this pull request as ready for review March 6, 2025 19:20
@zcnrex zcnrex changed the title from "Add awq dequantize kernel to sql" to "Add awq dequantize kernel to sgl" Mar 6, 2025
@zcnrex zcnrex changed the title from "Add awq dequantize kernel to sgl" to "Add awq dequantize kernel to sgl with 1x to 3x speedup" Mar 6, 2025
BBuf commented Mar 7, 2025

Rebase main please. @zcnrex

@zcnrex zcnrex force-pushed the awq-dequantize branch 4 times, most recently from a77798e to 5a96f94 Compare March 7, 2025 06:05
zcnrex commented Mar 7, 2025

@BBuf Rebased main, ready for review. Thanks.

BBuf commented Mar 7, 2025

> @BBuf Rebased main, ready for review. Thanks.

Thanks, I'll review it when I am free.

@zcnrex zcnrex force-pushed the awq-dequantize branch 2 times, most recently from d78b4e0 to dde3583 Compare March 8, 2025 01:04
Review comment thread on sgl-kernel/benchmark/bench_awq_dequant.py (outdated)
@zcnrex zcnrex force-pushed the awq-dequantize branch 5 times, most recently from 850c9b0 to 6b2dffb Compare March 9, 2025 05:26
Review comment thread on sgl-kernel/src/sgl-kernel/csrc/gemm/awq_kernel.cu (outdated)
@zhyncs zhyncs merged commit 07f9446 into sgl-project:main Mar 12, 2025
hebiao064 pushed a commit to hebiao064/sglang that referenced this pull request Mar 13, 2025
@zcnrex zcnrex mentioned this pull request Mar 18, 2025
4 participants