
[ROCm] Enable silu_and_mul, gelu_and_mul, gelu_tanh_and_mul in amd platform V2 #4432

Closed

yiakwy-xpu-ml-framework-team wants to merge 3 commits into sgl-project:main from yiakwy-xpu-ml-framework-team:enable_silu_and_mul_in_amd_v2

Conversation

yiakwy-xpu-ml-framework-team (Contributor) commented Mar 14, 2025

Motivation

This is a follow-up to #4150.

Modifications

Verified on both ROCm and CUDA:

[Screenshot 2025-03-14 17:54:32]
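
For reference, these fused kernels take an input of shape [num_tokens, 2 * d], apply the activation to the first half of the last dimension, and multiply it elementwise by the second half. A minimal scalar CUDA sketch of the silu_and_mul semantics (names and launch configuration are illustrative, not the PR's actual kernel):

```cpp
#include <cuda_runtime.h>

// Reference semantics only: out[t, i] = silu(in[t, i]) * in[t, d + i],
// where `in` is [num_tokens, 2 * d] and `out` is [num_tokens, d].
__device__ __forceinline__ float silu(float x) {
  return x / (1.0f + expf(-x));
}

__global__ void silu_and_mul_ref(float* __restrict__ out,
                                 const float* __restrict__ in,
                                 int d) {
  const int token = blockIdx.x;  // one block per token
  for (int i = threadIdx.x; i < d; i += blockDim.x) {
    const float gate = in[token * 2 * d + i];
    const float up   = in[token * 2 * d + d + i];
    out[token * d + i] = silu(gate) * up;
  }
}
```

gelu_and_mul and gelu_tanh_and_mul follow the same shape convention, swapping silu for the exact or tanh-approximated GELU.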

Benchmark

ROCm

  • silu_and_mul (~44% faster) [Screenshot 2025-03-15 00:35:37]
  • gelu_and_mul (~34% faster) [Screenshot 2025-03-15 00:36:29]
  • gelu_tanh_and_mul (~34% faster) [Screenshot 2025-03-15 00:37:09]

CUDA

  • silu_and_mul (~57% faster) [Screenshot 2025-03-15 02:39:14]
  • gelu_and_mul (~47% faster) [Screenshot 2025-03-15 02:43:44]
  • gelu_tanh_and_mul (~49% faster) [Screenshot 2025-03-15 02:39:47]


hebiao064 (Collaborator) commented:

@yiakwy-xpu-ml-framework-team would you please share some key learnings about how you sped the kernel up?

yiakwy-xpu-ml-framework-team (Contributor, Author) replied:

> @yiakwy-xpu-ml-framework-team would you please share some key learnings about how you sped the kernel up?

Primarily with flashinfer::vec_t for 128-bit vectorized loads. Note that vLLM does not do this (https://github.com/vllm-project/vllm/blob/977a16772c9d9717c4224fe7bd5b7d8699595449/csrc/activation_kernels.cu#L28).

Note that accesses to contiguous elements may be coalesced automatically by the compiler.

vLLM instead uses a technique to cache HBM accesses in L2/Tex, but apparently each element is used only once, so I don't see any benefit to it in an elementwise operation.
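
For illustration, a minimal sketch of this 128-bit vectorized-load pattern using CUDA's built-in float4 rather than flashinfer::vec_t (the general technique, not the PR's exact code); it assumes d is a multiple of 4 and 16-byte-aligned pointers:

```cpp
#include <cuda_runtime.h>

__device__ __forceinline__ float silu(float x) {
  return x / (1.0f + expf(-x));
}

// Each thread moves 4 floats per access: one 128-bit load for the gate
// half, one for the up half, and one 128-bit store for the result.
__global__ void silu_and_mul_vec4(float* __restrict__ out,
                                  const float* __restrict__ in,
                                  int d) {  // requires d % 4 == 0
  const int token = blockIdx.x;
  const float4* gate4 = reinterpret_cast<const float4*>(in + token * 2 * d);
  const float4* up4   = reinterpret_cast<const float4*>(in + token * 2 * d + d);
  float4* out4        = reinterpret_cast<float4*>(out + token * d);
  for (int i = threadIdx.x; i < d / 4; i += blockDim.x) {
    const float4 g = gate4[i];
    const float4 u = up4[i];
    float4 r;
    r.x = silu(g.x) * u.x;
    r.y = silu(g.y) * u.y;
    r.z = silu(g.z) * u.z;
    r.w = silu(g.w) * u.w;
    out4[i] = r;
  }
}
```

With fp16/bf16 inputs, a 128-bit access covers 8 elements instead of 4, which is where a generic vector type like flashinfer::vec_t is more convenient than hand-picked float4/uint4 casts.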

@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team marked this pull request as ready for review March 14, 2025 18:15
hebiao064 (Collaborator) replied:

> > @yiakwy-xpu-ml-framework-team would you please share some key learnings about how you sped the kernel up?
>
> Primarily with flashinfer::vec_t for 128-bit vectorized loads. Note that vLLM does not do this (https://github.com/vllm-project/vllm/blob/977a16772c9d9717c4224fe7bd5b7d8699595449/csrc/activation_kernels.cu#L28).
>
> Note that accesses to contiguous elements may be coalesced automatically by the compiler.
>
> vLLM instead uses a technique to cache HBM accesses in L2/Tex, but apparently each element is used only once, so I don't see any benefit to it in an elementwise operation.

Thanks! Good to know!

@zcnrex and I noticed that flashinfer::vec_t doesn't help in the per_xxx_fp8_quant kernels; maybe we need more data points.

Review threads (outdated):

  • sgl-kernel/benchmark/bench_activation.py
  • sgl-kernel/csrc/elementwise/activation.cu
  • sgl-kernel/include/hip_math_def.h (×2)
  • sgl-kernel/include/hip_vec_dtypes.h
  • sgl-kernel/include/utils.h (×2)
BBuf (Collaborator) commented Mar 15, 2025

Great job! Once you've resolved the review comments, I think it can be merged. cc @zhyncs

@yiakwy-xpu-ml-framework-team force-pushed the enable_silu_and_mul_in_amd_v2 branch 2 times, most recently from a7a9f7d to a1d8eaa on March 21, 2025 17:01
yiakwy-xpu-ml-framework-team (Contributor, Author) commented:

@zhyncs could you have a look at this?

yiakwy-xpu-ml-framework-team (Contributor, Author) commented:

@BruceXcluding @HaiShaw could you have a look?

BBuf (Collaborator) left a review:

I have no extra advice; good work. cc @zhyncs

BruceXcluding (Contributor) replied:

> @BruceXcluding @HaiShaw could you have a look?

@HaiShaw Could you review it?

yiakwy-xpu-ml-framework-team (Contributor, Author) commented Mar 25, 2025

@HaiShaw Rebased.

The CI is not stable: a test_per_token_xx_quant problem (NV GPU) was introduced in previous PRs.

https://github.com/sgl-project/sglang/actions/runs/14065157148/job/39386727361

Will fix it later.

cc @zcnrex

Commits:

  • add silu_and_mul support in amd platform
  • add activation support in amd platform
  • apply clang-format16.0.0 manually
  • add rocm support for blockwise reduction
  • rebase on main
  • add activation benchmark
  • apply clang16 manually
  • add castFrom cuda symbols
  • apply clang-format16 manually
  • fix clang format
  • add back SGLANG_SHFL_XOR_SYNC, SGLANG_SHFL_XOR_SYNC_WIDTH
  • fix review issue
  • remove flashinfer namespace
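
For context on the "add back SGLANG_SHFL_XOR_SYNC, SGLANG_SHFL_XOR_SYNC_WIDTH" commit: such macros typically paper over the CUDA/HIP warp-shuffle API difference (CUDA's __shfl_xor_sync takes an explicit lane mask, HIP's __shfl_xor does not). A hedged sketch of the usual pattern; the PR's actual definitions in sgl-kernel/include/utils.h may differ:

```cpp
// Build with nvcc (CUDA) or hipcc (ROCm, with -DUSE_ROCM).
#ifdef USE_ROCM
  #include <hip/hip_runtime.h>
  #define SGLANG_SHFL_XOR_SYNC(var, lane_mask) __shfl_xor((var), (lane_mask))
  #define SGLANG_SHFL_XOR_SYNC_WIDTH(var, lane_mask, width) \
    __shfl_xor((var), (lane_mask), (width))
#else
  #include <cuda_runtime.h>
  #define SGLANG_SHFL_XOR_SYNC(var, lane_mask) \
    __shfl_xor_sync(0xffffffffu, (var), (lane_mask))
  #define SGLANG_SHFL_XOR_SYNC_WIDTH(var, lane_mask, width) \
    __shfl_xor_sync(0xffffffffu, (var), (lane_mask), (width))
#endif

// Example use: butterfly sum across 32 lanes (a CUDA warp; ROCm
// wavefronts are 64 lanes wide, so the first mask would be 32 there).
__device__ float warp_reduce_sum(float v) {
  for (int mask = 16; mask > 0; mask >>= 1)
    v += SGLANG_SHFL_XOR_SYNC(v, mask);
  return v;
}
```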
yiakwy-xpu-ml-framework-team (Contributor, Author) commented Mar 27, 2025

Fixing new build problems introduced in #4706 after the rebase.

cc @zhyncs @HaiShaw

CI log: https://github.com/sgl-project/sglang/actions/runs/14103841740/job/39505761635?pr=4432

- fix rebase error
yiakwy-xpu-ml-framework-team (Contributor, Author) commented Mar 27, 2025

Local test:

ROCm

Correctness

[Screenshot 2025-03-27 20:21:09]

Benchmark

[Screenshot 2025-03-27 20:29:53]

CUDA

Correctness

[Screenshot 2025-03-27 22:25:13]

Benchmark

[Screenshot 2025-03-27 22:39:21]

HaiShaw (Collaborator) left a review:

Please remove the flashinfer namespace and related code.

github-actions (bot) commented:

This pull request has been automatically closed due to inactivity. Please feel free to reopen it if needed.

