
[ROCm] Enable silu_and_mul, gelu_and_mul, gelu_tanh_and_mul in amd platform V2 #4432

Closed

yiakwy-xpu-ml-framework-team wants to merge 3 commits into sgl-project:main from yiakwy-xpu-ml-framework-team:enable_silu_and_mul_in_amd_v2

Conversation

yiakwy-xpu-ml-framework-team (Contributor) commented Mar 14, 2025

Motivation

This is a follow-up to #4150.

Modifications

Verified on both ROCm and CUDA:

[Screenshot 2025-03-14 17:54:32]
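
For reference, these fused kernels take an input of shape [num_tokens, 2 * d], apply the activation to the first half of the last dimension, and multiply it elementwise by the second half. A minimal scalar CUDA sketch of the silu_and_mul semantics (names and launch configuration are illustrative, not the PR's actual kernel):

```cpp
#include <cuda_runtime.h>

// Reference semantics only: out[t, i] = silu(in[t, i]) * in[t, d + i],
// where `in` is [num_tokens, 2 * d] and `out` is [num_tokens, d].
__device__ __forceinline__ float silu(float x) {
  return x / (1.0f + expf(-x));
}

__global__ void silu_and_mul_ref(float* __restrict__ out,
                                 const float* __restrict__ in,
                                 int d) {
  const int token = blockIdx.x;  // one block per token
  for (int i = threadIdx.x; i < d; i += blockDim.x) {
    const float gate = in[token * 2 * d + i];
    const float up   = in[token * 2 * d + d + i];
    out[token * d + i] = silu(gate) * up;
  }
}
```

gelu_and_mul and gelu_tanh_and_mul follow the same shape convention, swapping silu for the exact or tanh-approximated GELU.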

Benchmark

ROCm

  • silu_and_mul (~44% faster) [Screenshot 2025-03-15 00:35:37]
  • gelu_and_mul (~34% faster) [Screenshot 2025-03-15 00:36:29]
  • gelu_tanh_and_mul (~34% faster) [Screenshot 2025-03-15 00:37:09]

CUDA

  • silu_and_mul (~57% faster) [Screenshot 2025-03-15 02:39:14]
  • gelu_and_mul (~47% faster) [Screenshot 2025-03-15 02:43:44]
  • gelu_tanh_and_mul (~49% faster) [Screenshot 2025-03-15 02:39:47]


hebiao064 (Collaborator) commented:

@yiakwy-xpu-ml-framework-team would you please share some key learnings about how you sped the kernel up?

yiakwy-xpu-ml-framework-team (Contributor, Author) replied:

> @yiakwy-xpu-ml-framework-team would you please share some key learnings about how you sped the kernel up?

Primarily with flashinfer::vec_t for 128-bit vectorized loads. Note that vLLM does not do this (https://github.com/vllm-project/vllm/blob/977a16772c9d9717c4224fe7bd5b7d8699595449/csrc/activation_kernels.cu#L28).

Note that accesses to contiguous elements may be coalesced automatically by the compiler.

vLLM instead uses a technique to cache HBM accesses in L2/Tex, but apparently each element is used only once, so I don't see any benefit to it in an elementwise operation.
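
For illustration, a minimal sketch of this 128-bit vectorized-load pattern using CUDA's built-in float4 rather than flashinfer::vec_t (the general technique, not the PR's exact code); it assumes d is a multiple of 4 and 16-byte-aligned pointers:

```cpp
#include <cuda_runtime.h>

__device__ __forceinline__ float silu(float x) {
  return x / (1.0f + expf(-x));
}

// Each thread moves 4 floats per access: one 128-bit load for the gate
// half, one for the up half, and one 128-bit store for the result.
__global__ void silu_and_mul_vec4(float* __restrict__ out,
                                  const float* __restrict__ in,
                                  int d) {  // requires d % 4 == 0
  const int token = blockIdx.x;
  const float4* gate4 = reinterpret_cast<const float4*>(in + token * 2 * d);
  const float4* up4   = reinterpret_cast<const float4*>(in + token * 2 * d + d);
  float4* out4        = reinterpret_cast<float4*>(out + token * d);
  for (int i = threadIdx.x; i < d / 4; i += blockDim.x) {
    const float4 g = gate4[i];
    const float4 u = up4[i];
    float4 r;
    r.x = silu(g.x) * u.x;
    r.y = silu(g.y) * u.y;
    r.z = silu(g.z) * u.z;
    r.w = silu(g.w) * u.w;
    out4[i] = r;
  }
}
```

With fp16/bf16 inputs, a 128-bit access covers 8 elements instead of 4, which is where a generic vector type like flashinfer::vec_t is more convenient than hand-picked float4/uint4 casts.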

@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team marked this pull request as ready for review March 14, 2025 18:15
hebiao064 (Collaborator) replied:

> > @yiakwy-xpu-ml-framework-team would you please share some key learnings about how you sped the kernel up?
>
> Primarily with flashinfer::vec_t for 128-bit vectorized loads. Note that vLLM does not do this (https://github.com/vllm-project/vllm/blob/977a16772c9d9717c4224fe7bd5b7d8699595449/csrc/activation_kernels.cu#L28).
>
> Note that accesses to contiguous elements may be coalesced automatically by the compiler.
>
> vLLM instead uses a technique to cache HBM accesses in L2/Tex, but apparently each element is used only once, so I don't see any benefit to it in an elementwise operation.

Thanks! Good to know!

@zcnrex and I noticed that flashinfer::vec_t doesn't help in the per_xxx_fp8_quant kernels; maybe we need more data points.

Review threads (outdated):

  • sgl-kernel/benchmark/bench_activation.py
  • sgl-kernel/csrc/elementwise/activation.cu
  • sgl-kernel/include/hip_math_def.h (×2)
  • sgl-kernel/include/hip_vec_dtypes.h
  • sgl-kernel/include/utils.h (×2)
BBuf (Collaborator) commented Mar 15, 2025

Great job! Once you've resolved the review comments, I think it can be merged. cc @zhyncs

@yiakwy-xpu-ml-framework-team force-pushed the enable_silu_and_mul_in_amd_v2 branch 2 times, most recently from a7a9f7d to a1d8eaa on March 21, 2025 17:01
yiakwy-xpu-ml-framework-team (Contributor, Author) commented:

@zhyncs could you have a look at this?

yiakwy-xpu-ml-framework-team (Contributor, Author) commented:

@BruceXcluding @HaiShaw could you have a look?

BBuf (Collaborator) left a review:

I have no extra advice; good work. cc @zhyncs

BruceXcluding (Contributor) replied:

> @BruceXcluding @HaiShaw could you have a look?

@HaiShaw Could you review it?

yiakwy-xpu-ml-framework-team (Contributor, Author) commented Mar 25, 2025

@HaiShaw Rebased.

The CI is not stable: a test_per_token_xx_quant problem (NV GPU) was introduced in previous PRs.

https://github.com/sgl-project/sglang/actions/runs/14065157148/job/39386727361

Will fix it later.

cc @zcnrex

Commits:

  • add silu_and_mul support in amd platform
  • add activation support in amd platform
  • apply clang-format16.0.0 manually
  • add rocm support for blockwise reduction
  • rebase on main
  • add activation benchmark
  • apply clang16 manually
  • add castFrom cuda symbols
  • apply clang-format16 manually
  • fix clang format
  • add back SGLANG_SHFL_XOR_SYNC, SGLANG_SHFL_XOR_SYNC_WIDTH
  • fix review issue
  • remove flashinfer namespace
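
For context on the "add back SGLANG_SHFL_XOR_SYNC, SGLANG_SHFL_XOR_SYNC_WIDTH" commit: such macros typically paper over the CUDA/HIP warp-shuffle API difference (CUDA's __shfl_xor_sync takes an explicit lane mask, HIP's __shfl_xor does not). A hedged sketch of the usual pattern; the PR's actual definitions in sgl-kernel/include/utils.h may differ:

```cpp
// Build with nvcc (CUDA) or hipcc (ROCm, with -DUSE_ROCM).
#ifdef USE_ROCM
  #include <hip/hip_runtime.h>
  #define SGLANG_SHFL_XOR_SYNC(var, lane_mask) __shfl_xor((var), (lane_mask))
  #define SGLANG_SHFL_XOR_SYNC_WIDTH(var, lane_mask, width) \
    __shfl_xor((var), (lane_mask), (width))
#else
  #include <cuda_runtime.h>
  #define SGLANG_SHFL_XOR_SYNC(var, lane_mask) \
    __shfl_xor_sync(0xffffffffu, (var), (lane_mask))
  #define SGLANG_SHFL_XOR_SYNC_WIDTH(var, lane_mask, width) \
    __shfl_xor_sync(0xffffffffu, (var), (lane_mask), (width))
#endif

// Example use: butterfly sum across 32 lanes (a CUDA warp; ROCm
// wavefronts are 64 lanes wide, so the first mask would be 32 there).
__device__ float warp_reduce_sum(float v) {
  for (int mask = 16; mask > 0; mask >>= 1)
    v += SGLANG_SHFL_XOR_SYNC(v, mask);
  return v;
}
```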
yiakwy-xpu-ml-framework-team (Contributor, Author) commented Mar 27, 2025

Fixing new build problems introduced in #4706 after the rebase.

cc @zhyncs @HaiShaw

CI log: https://github.com/sgl-project/sglang/actions/runs/14103841740/job/39505761635?pr=4432

- fix rebase error
yiakwy-xpu-ml-framework-team (Contributor, Author) commented Mar 27, 2025

Local test:

ROCm

Correctness

[Screenshot 2025-03-27 20:21:09]

Benchmark

[Screenshot 2025-03-27 20:29:53]

CUDA

Correctness

[Screenshot 2025-03-27 22:25:13]

Benchmark

[Screenshot 2025-03-27 22:39:21]

HaiShaw (Collaborator) left a review:

Please remove the flashinfer namespace and related code.

github-actions (bot) commented:

This pull request has been automatically closed due to inactivity. Please feel free to reopen it if needed.

