[ROCm] Enable silu_and_mul, gelu_and_mul, gelu_tanh_and_mul in amd platform V2 #4432
@yiakwy-xpu-ml-framework-team would you please share some key learnings about how you sped up the kernel?
Primarily with flashinfer::vec_t for 128-bit vectorized loads. Note that vLLM does not do this (https://github.com/vllm-project/vllm/blob/977a16772c9d9717c4224fe7bd5b7d8699595449/csrc/activation_kernels.cu#L28). Note that accesses to contiguous elements may be coalesced automatically by the compiler. Instead, they use a technique that caches HBM accesses in L2/Tex, but apparently each element is used only once, so I don't see any benefit to using it in an elementwise operation.
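To make the 128-bit idea concrete, here is a minimal host-side C++ sketch of silu_and_mul that moves four floats (16 bytes) per step. It is only an illustration: `Vec128` and `silu_and_mul_row` are hypothetical names, this is not the flashinfer::vec_t API or the kernel in this PR, and it assumes the usual layout where each input row is `[gate | up]` with `d` elements each.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a 128-bit vector type (NOT flashinfer::vec_t):
// four floats, 16 bytes, loaded and stored as one unit.
struct alignas(16) Vec128 {
  float lane[4];
};

inline float silu(float x) { return x / (1.0f + std::exp(-x)); }

// Assumed layout: `in` holds 2*d floats, [gate (d) | up (d)]; `out` holds d floats.
// Computes out[i] = silu(gate[i]) * up[i].
void silu_and_mul_row(const float* in, float* out, std::size_t d) {
  assert(in != nullptr && out != nullptr);
  constexpr std::size_t kLanes = sizeof(Vec128) / sizeof(float);  // 4 floats per 128 bits
  const float* gate = in;
  const float* up = in + d;

  std::size_t i = 0;
  // Main loop: one 16-byte load per operand, one 16-byte store per result.
  for (; i + kLanes <= d; i += kLanes) {
    Vec128 g, u, r;
    std::memcpy(&g, gate + i, sizeof(Vec128));
    std::memcpy(&u, up + i, sizeof(Vec128));
    for (std::size_t k = 0; k < kLanes; ++k) {
      r.lane[k] = silu(g.lane[k]) * u.lane[k];
    }
    std::memcpy(out + i, &r, sizeof(Vec128));
  }
  // Scalar tail for the elements that do not fill a whole vector.
  for (; i < d; ++i) {
    out[i] = silu(gate[i]) * up[i];
  }
}
```

On a GPU the same pattern would have each thread issue one 16-byte load per operand instead of four 4-byte loads; whether that actually helps depends on the kernel, so it is worth benchmarking.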
62d466e to
13b8d72
Thanks, good to know! @zcnrex and I noticed that flashinfer::vec_t doesn't help in the per_xxx_fp8_quant kernels; maybe we need more data points.
860893d to
0c4a1d7
Great job! Once you have addressed the review comments, I think it can be merged. cc @zhyncs
a7a9f7d to
a1d8eaa
@zhyncs could you have a look at this?
@BruceXcluding @HaiShaw could you have a look?
@HaiShaw could you review it?
a1d8eaa to
1fa4413
@HaiShaw Rebased. The CI is not stable: the test_per_token_xx_quant problem (NV GPU) was introduced in previous PRs.
Will fix it later. cc @zcnrex
b79a97c to
8cbabe2
- add silu_and_mul support in amd platform
- add activation support in amd platform
- apply clang-format16.0.0 manually
- add rocm support for blockwise reduction
- rebase on main
- add activation benchmark
- apply clang16 manually
- add castFrom cuda symbols
- apply clang-format16 manually
- fix clang format
- add back SGLANG_SHFL_XOR_SYNC, SGLANG_SHFL_XOR_SYNC_WIDTH
- fix review issue
- remove flashinfer namespace
8cbabe2 to
d98452e
Fixing new build problems introduced in #4706 after the rebase. CI log: https://github.com/sgl-project/sglang/actions/runs/14103841740/job/39505761635?pr=4432
- fix rebase error
d98452e to
ecc6084
HaiShaw
left a comment
Please remove flashinfer namespace and related code.
This pull request has been automatically closed due to inactivity. Please feel free to reopen it if needed.




Motivation
This is a follow-up to #4150
Modifications
Verified on both ROCm and CUDA:
Benchmark
ROCM
CUDA
Checklist