[AMD] fused qk gemma norm kernels to reduce four kernels by kkHuang-amd · Pull Request #23575 · sgl-project/sglang

kkHuang-amd · 2026-04-23T14:18:50Z

Motivation

Profiling shows that the apply_qk_norm function launches four kernels on the ROCm platform, compared with two overlapped kernels on the CUDA platform. To reduce end-to-end (e2e) time, this PR fuses the four kernels into one Triton kernel.

Modifications

models/utils.py
Add a Triton implementation of the fused kernel.
models/qwen3_5.py
Check for the HIP path and apply the fused Triton kernel there.
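For readers who want the intended math spelled out, here is a minimal pure-Python reference, not the Triton kernel from this PR. It assumes the Gemma-style RMSNorm convention, `x / sqrt(mean(x^2) + eps) * (1 + weight)`, and the function names, the `eps` default, and the list-of-lists layout are all hypothetical. The single loop over q and k rows stands in for one kernel launch covering both tensors.

```python
import math


def gemma_rmsnorm_row(row, weight, eps=1e-6):
    """Gemma-style RMSNorm of one head vector (assumed convention):
    x / sqrt(mean(x^2) + eps) * (1 + w)."""
    mean_sq = sum(v * v for v in row) / len(row)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * (1.0 + w) for v, w in zip(row, weight)]


def fused_qk_norm_reference(q_rows, k_rows, q_weight, k_weight, eps=1e-6):
    """Normalize all q and k head vectors in one pass.

    Row indices below len(q_rows) use the q weight, the rest use the k weight,
    mirroring how a single launch could cover both tensors.
    """
    num_q = len(q_rows)
    out = []
    for idx in range(num_q + len(k_rows)):
        if idx < num_q:
            out.append(gemma_rmsnorm_row(q_rows[idx], q_weight, eps))
        else:
            out.append(gemma_rmsnorm_row(k_rows[idx - num_q], k_weight, eps))
    return out[:num_q], out[num_q:]


# Hypothetical toy input: one q head vector and one k head vector of size 2.
q_out, k_out = fused_qk_norm_reference(
    [[3.0, 4.0]], [[1.0, 1.0]], [0.0, 0.0], [1.0, 1.0], eps=0.0
)
print(q_out, k_out)
```

A reference like this is mainly useful as an oracle when checking a fused kernel's output on small inputs.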

Accuracy Tests

Server launch command

SGLANG_USE_AITER_UNIFIED_ATTN=1 SGLANG_USE_AITER=1 \
python3 -m sglang.launch_server \
  --model-path /dockerx/data/models/Qwen3.5-397B-A17B-FP8/ --tp 8 \
  --attention-backend aiter --trust-remote-code \
  --chunked-prefill-size 32768 \
  --model-loader-extra-config '{"enable_multithread_load": true}' \
  --watchdog-timeout 1200 --mem-fraction-static 0.9 \
  --host 0.0.0.0 --port 8000 --disable-radix-cache \
  --enable-aiter-allreduce-fusion --max-running-requests 128 \
  --page-size 16

Speed Tests and Profiling

Concurrency	Total token throughput (before)	Total token throughput (after)	Change
4	864.11	883.77	+2.2%
8	1582.05	1624.43	+2.6%
16	2888.91	2945.39	+1.9%
32	5005.01	5021.38	+0.3%
64	7609.9	7634.51	+0.3%
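As a quick check on the last column, the relative gain can be recomputed from the before and after throughput values. The sketch below uses only the numbers in the table; the helper name is hypothetical. Recomputing suggests the reported percentages are truncated to one decimal place rather than rounded.

```python
import math

# (concurrency, throughput before, throughput after), copied from the table.
rows = [
    (4, 864.11, 883.77),
    (8, 1582.05, 1624.43),
    (16, 2888.91, 2945.39),
    (32, 5005.01, 5021.38),
    (64, 7609.9, 7634.51),
]


def relative_gain_percent(before, after):
    """Percentage change of `after` relative to `before`."""
    return (after - before) / before * 100.0


gains = {conc: relative_gain_percent(before, after) for conc, before, after in rows}

for conc, gain in gains.items():
    # Show two decimals next to the value truncated to one decimal.
    truncated = math.floor(gain * 10) / 10
    print(f"concurrency {conc:>2}: {gain:+.2f}% (truncated: +{truncated:.1f}%)")
```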

Checklist

Format your code according to "Format code with pre-commit".
Add unit tests according to "Run and add unit tests".
Update documentation according to "Write documentations".
Provide accuracy and speed benchmark results according to "Test the accuracy" and "Benchmark the speed".
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

gemini-code-assist

Code Review

This pull request introduces a fused Triton kernel for Gemma RMSNorm to optimize the QK normalization process in Qwen 3.5 models on HIP-supported hardware. The changes aim to improve efficiency by handling both query and key normalization in a single pass. Feedback indicates that the kernel's hardcoded output types may cause data corruption with float32 inputs and that the use of reshape in the wrapper function might lead to unnecessary memory copies, contradicting the performance goals mentioned in the documentation.

gemini-code-assist · 2026-04-23T14:24:45Z

+    Passes input strides to the kernel so non-contiguous tensors (e.g. from
+    qkv.split()) are read correctly without an extra .contiguous() copy.
+    """
+    q_flat = q.reshape(-1, head_dim)
+    k_flat = k.reshape(-1, head_dim)


The docstring claims to avoid an extra .contiguous() copy by passing strides, but q.reshape(-1, head_dim) will internally trigger a copy if the tensor is non-contiguous (which is common for slices from qkv.split()). To truly avoid a copy, the kernel should be designed to accept the original multi-dimensional tensor and its strides, or you should use view and handle potential contiguity errors explicitly.
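The stride arithmetic behind this comment can be sketched without any tensor library. The example assumes that q is a column slice of a packed qkv buffer, viewed as (tokens, heads, head_dim); the function name and the sizes are hypothetical. Two adjacent dimensions can be merged into one strided dimension only when the outer stride equals the inner size times the inner stride, and when that fails, flattening to (-1, head_dim) needs a copy.

```python
def can_flatten_leading_dims_as_view(shape, strides):
    """Return True if all dims except the last can be merged without copying.

    Adjacent dims i and i+1 merge into one strided dim only when
    strides[i] == shape[i + 1] * strides[i + 1]. Size-1 dims never block a merge.
    """
    dims = [(n, s) for n, s in zip(shape[:-1], strides[:-1]) if n != 1]
    return all(
        outer_stride == inner_size * inner_stride
        for (_, outer_stride), (inner_size, inner_stride) in zip(dims, dims[1:])
    )


# Hypothetical sizes for illustration only.
tokens, q_heads, kv_heads, head_dim = 4, 8, 2, 16
row_width = (q_heads + 2 * kv_heads) * head_dim  # elements per packed qkv row

shape = (tokens, q_heads, head_dim)
# q taken as a column slice of the packed buffer: the token stride is the full row.
q_slice_strides = (row_width, head_dim, 1)
# q stored on its own, fully contiguous.
q_contig_strides = (q_heads * head_dim, head_dim, 1)

print(can_flatten_leading_dims_as_view(shape, q_slice_strides))   # False
print(can_flatten_leading_dims_as_view(shape, q_contig_strides))  # True
```

Under these assumptions, a kernel that receives the token stride and the head stride separately can address each head vector at `token * token_stride + head * head_stride`, so no flattening and no copy are required.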

gemini-code-assist · 2026-04-24T01:53:15Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

HaiShaw · 2026-04-24T08:34:35Z

/tag-and-rerun-ci

HaiShaw · 2026-04-25T04:57:07Z

@amd-bot ci-status

[Opt] fused qk gemma norm kernels to reduce four kernels (2 elementwise + 2 norm) into one triton kernel

112c3c2

gemini-code-assist Bot reviewed Apr 23, 2026

kkHuang-amd marked this pull request as ready for review April 24, 2026 01:53

kkHuang-amd changed the title ~~[Opt] fused qk gemma norm kernels to reduce four kernels~~ [AMD] fused qk gemma norm kernels to reduce four kernels Apr 24, 2026

github-actions Bot added the run-ci label Apr 24, 2026

HaiShaw approved these changes Apr 25, 2026

HaiShaw merged commit 393252f into sgl-project:main Apr 25, 2026
141 of 175 checks passed

vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request May 2, 2026

[AMD] fused qk gemma norm kernels to reduce four kernels (sgl-project…

c500379

…#23575) Co-authored-by: root <root@smci355-ccs-aus-g12-26.cs-aus.dcgpu>

HaiShaw merged 1 commit into sgl-project:main from HaiShaw:fuse/qk_norm_for_qwen3_5
