
[Diffusion] Add qknorm rope fuse kernel #21440

Merged
BBuf merged 16 commits into main from add_qknorm_rope_fuse_kernel on Mar 27, 2026
Conversation

@BBuf (Collaborator) commented Mar 26, 2026

Summary

Made with Codex and this skill: https://github.com/sgl-project/sglang/tree/main/python/sglang/multimodal_gen/.claude/skills/sglang-diffusion-ako4all-kernel

$Radixark03 SGLang $SGLang AKO4ALL Kernel $Sglang Diffusion Benchmark Profile Help me continue optimizing diffusion kernels in sglang diffusion on top of the AKO4ALL framework; before running any model or benchmark, make sure the GPU in use is completely idle. I need you to optimize a common pattern in diffusion models: QK norm + RoPE fusion. If you look at the diffusion model implementations, this pattern already appears extensively, but it currently calls the jit_kernel QK norm and the FlashInfer RoPE implementation separately, with no fusion. Implement this fused kernel in jit_kernel.

This PR adds a new JIT CUDA kernel that fuses QK RMSNorm and RoPE into a single in-place kernel for diffusion models.

It also wires the fused path into the main diffusion DiT implementations that already use the QK norm + RoPE pattern, while keeping the existing split path as a fallback.

(Attached screenshots: qknorm_rope_flux, qknorm_rope_flux2, qknorm_rope_qwen_edit, qknorm_rope_qwen, qknorm_rope_zimage)

What Changed

  • Added a new fused JIT kernel for QK RMSNorm + RoPE:

    • python/sglang/jit_kernel/csrc/elementwise/qknorm_rope.cuh
    • python/sglang/jit_kernel/qknorm_rope.py
  • Added a shared runtime helper:

    • python/sglang/multimodal_gen/runtime/layers/layernorm.py
    • apply_qk_norm_rope(...) uses the fused kernel when the shape/dtype/layout is supported, and falls back to split QK norm + FlashInfer RoPE otherwise.
  • Integrated the fused path into diffusion model implementations:

    • python/sglang/multimodal_gen/runtime/models/dits/qwen_image.py
    • python/sglang/multimodal_gen/runtime/models/dits/flux.py
    • python/sglang/multimodal_gen/runtime/models/dits/flux_2.py
    • python/sglang/multimodal_gen/runtime/models/dits/zimage.py
  • Added correctness coverage and a dedicated micro benchmark:

    • python/sglang/jit_kernel/tests/test_qknorm_rope.py
    • python/sglang/jit_kernel/benchmark/bench_qknorm_rope.py

Key Optimization Points

  • Fuse QK RMSNorm and RoPE into a single kernel to remove an extra read/write pass over Q and K.
  • Keep the operation fully in-place on supported CUDA paths.
  • Reuse a shared runtime entry point so model code does not need model-specific kernel handling.
  • Add segmented position offset support so the fused path also works for FLUX / FLUX.2 dual-stream attention blocks, where text and image tokens use different RoPE position ranges.
  • Keep a safe fallback to the existing split implementation for unsupported cases.
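As a hedged sketch of the fallback behavior (the predicate name and exact conditions below are illustrative, not the actual API in `layernorm.py`), the dispatch amounts to a support check before choosing the fused or split path:

```python
def fused_qknorm_rope_supported(head_dim, rope_dim, dtype, contiguous):
    """Illustrative gating predicate (hypothetical, for exposition).

    The fused kernel is JIT-specialized on head_dim/rope_dim/dtype, so
    any shape/dtype/layout outside the supported set is routed to the
    existing split QK norm + FlashInfer RoPE implementation instead.
    """
    return (
        contiguous                      # kernel operates in place on contiguous q/k
        and head_dim % 2 == 0           # RoPE rotates pairs of elements
        and 0 < rope_dim <= head_dim    # rotary dim cannot exceed head dim
        and rope_dim % 2 == 0
        and dtype in ("float16", "bfloat16")
    )
```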

Fused QKNorm+RoPE Kernel Design

The new kernel fuses Q/K RMSNorm and RoPE into a single warp-level in-place CUDA kernel.

Implementation highlights:

  • Each warp processes one (token, head) work item.
  • Input values and RMSNorm weights are loaded with vectorized packed loads.
  • RMSNorm is computed fully within a warp using warp-level reduction, without shared memory.
  • The normalized values stay in registers and are immediately consumed by RoPE.
  • RoPE is applied in-register:
    • pairwise rotation for the standard layout
    • __shfl_xor_sync-based lane exchange for the Neox layout
  • Results are packed back and written in place.
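For reference, the per-(token, head) math the highlights above describe can be sketched in NumPy. This is a simplified model of the computation, not the CUDA code; `eps` and the interleaved/Neox layout conventions follow common RoPE implementations and are assumptions here:

```python
import numpy as np

def rmsnorm(x, weight, eps=1e-6):
    # fp32 accumulation mirrors the kernel's numerically stable reduction
    x32 = x.astype(np.float32)
    inv_rms = 1.0 / np.sqrt(np.mean(x32 * x32, axis=-1, keepdims=True) + eps)
    return x32 * inv_rms * weight

def rope(x, cos, sin, is_neox):
    # cos/sin hold rope_dim // 2 angles; dims beyond rope_dim pass through
    d = 2 * cos.shape[-1]
    rot, rest = x[..., :d], x[..., d:]
    if is_neox:   # halves are paired (the kernel exchanges lanes via __shfl_xor_sync)
        x1, x2 = rot[..., : d // 2], rot[..., d // 2:]
    else:         # standard layout: pairwise (even, odd) rotation
        x1, x2 = rot[..., 0::2], rot[..., 1::2]
    o1, o2 = x1 * cos - x2 * sin, x2 * cos + x1 * sin
    if is_neox:
        out = np.concatenate([o1, o2], axis=-1)
    else:
        out = np.empty_like(rot)
        out[..., 0::2], out[..., 1::2] = o1, o2
    return np.concatenate([out, rest], axis=-1)

def fused_qknorm_rope_reference(x, weight, cos, sin, is_neox=False):
    # the fused kernel keeps the normalized values in registers between
    # these two steps instead of round-tripping through global memory
    return rope(rmsnorm(x, weight), cos, sin, is_neox)
```

A reference like this is also the natural oracle for the correctness test against the in-place CUDA kernel.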

Key optimizations:

  • Eliminates the extra global memory round trip between split QKNorm and RoPE.
  • Merges Q and K processing into one kernel launch.
  • Uses vectorized loads/stores to reduce memory instructions.
  • Uses fp32 accumulation for RMSNorm for numerical stability.
  • Uses occupancy-aware launch sizing and JIT specialization on
    head_dim, rope_dim, is_neox, and dtype.

This PR also adds position_offset support in the shared runtime helper so the fused path can be used for segmented RoPE ranges in FLUX / FLUX.2 dual-stream attention blocks.
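To illustrate what position_offset enables (the helper name below is hypothetical, not the runtime API): in a dual-stream block, text and image tokens index disjoint RoPE position ranges, so each segment can be handled by the fused kernel with an offset that selects the right rows of the shared cos/sin cache:

```python
import numpy as np

def segment_positions(num_text, num_image, image_offset):
    # Hypothetical sketch: text tokens take positions [0, num_text),
    # image tokens start at image_offset; a per-segment position_offset
    # lets one fused-kernel launch per segment read the correct
    # cos/sin cache rows without materializing a merged position tensor.
    text = np.arange(num_text)
    image = image_offset + np.arange(num_image)
    return text, image
```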

Micro Benchmark

All numbers below compare the split path (jit qknorm + flashinfer rope) vs the new fused JIT kernel.

Shape notation:

  • q/k shape = [B*T, H, D]
  • rope_dim is the applied rotary dimension
| Case | Shape | Split (ms) | Fused (ms) | Speedup |
| --- | --- | --- | --- | --- |
| flux_1024 | B=1, T=4096, H=24, D=128, rope_dim=128 | 0.059520 | 0.043072 | 1.3819x |
| qwen_image_1024 | B=1, T=4096, H=32, D=128, rope_dim=128 | 0.081152 | 0.055008 | 1.4753x |
| qwen_image_partial | B=1, T=4096, H=32, D=128, rope_dim=64 | 0.079680 | 0.054560 | 1.4604x |
| zimage_1024 | B=1, T=4096, H=30, D=128, rope_dim=128 | 0.074112 | 0.051008 | 1.4529x |
| batch2_medium | B=2, T=2048, H=24, D=128, rope_dim=128 | 0.059488 | 0.043232 | 1.3760x |

Weighted micro benchmark speedup: 1.4387x
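The per-case speedups in the table follow directly from the reported timings (split divided by fused); the weighting used for the 1.4387x aggregate is not stated here, so only the per-case ratios are reproduced:

```python
# Timings (ms) copied from the micro benchmark table above
split = {
    "flux_1024": 0.059520, "qwen_image_1024": 0.081152,
    "qwen_image_partial": 0.079680, "zimage_1024": 0.074112,
    "batch2_medium": 0.059488,
}
fused = {
    "flux_1024": 0.043072, "qwen_image_1024": 0.055008,
    "qwen_image_partial": 0.054560, "zimage_1024": 0.051008,
    "batch2_medium": 0.043232,
}
speedup = {case: split[case] / fused[case] for case in split}
```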

End-to-End Denoise Stage

| Model | Split Denoise (s) | Fused Denoise (s) | Delta | Speedup |
| --- | --- | --- | --- | --- |
| qwen | 14.43 | 12.36 | -14.35% | 1.1675x |
| qwen-edit | 28.62 | 28.26 | -1.26% | 1.0127x |
| flux | 6.495 | 6.421 | -1.14% | 1.0116x |
| flux2 | 22.314 | 22.311 | -0.01% | 1.0001x |
| zimage | 0.723 | 0.712 | -1.47% | 1.0149x |

Commands used for the end-to-end denoise benchmark:

```shell
# Qwen Image (split)
sglang generate \
  --model-path=Qwen/Qwen-Image-2512 \
  --prompt="A futuristic cyberpunk city at night, neon lights reflecting on wet streets, highly detailed, 8k" \
  --negative-prompt=" " \
  --log-level=info \
  --seed=42 \
  --width=1024 \
  --height=1024 \
  --num-inference-steps=50 \
  --guidance-scale=4.0 \
  --dit-cpu-offload false \
  --text-encoder-cpu-offload false \
  --save-output \
  --warmup \
  --enable-torch-compile \
  --perf-dump-path outputs/qknorm_rope_pr/qwen_split.json

# Qwen Image (fused)
sglang generate \
  --model-path=Qwen/Qwen-Image-2512 \
  --prompt="A futuristic cyberpunk city at night, neon lights reflecting on wet streets, highly detailed, 8k" \
  --negative-prompt=" " \
  --log-level=info \
  --seed=42 \
  --width=1024 \
  --height=1024 \
  --num-inference-steps=50 \
  --guidance-scale=4.0 \
  --dit-cpu-offload false \
  --text-encoder-cpu-offload false \
  --save-output \
  --warmup \
  --enable-torch-compile \
  --perf-dump-path outputs/qknorm_rope_pr/qwen_fused.json

# Qwen Image Edit (split)
sglang generate \
  --model-path=Qwen/Qwen-Image-Edit-2511 \
  --prompt="Transform into anime style" \
  --negative-prompt=" " \
  --image-path=<ASSET_DIR>/cat.png \
  --log-level=info \
  --seed=42 \
  --width=1024 \
  --height=1024 \
  --num-inference-steps=50 \
  --guidance-scale=4.0 \
  --dit-cpu-offload false \
  --text-encoder-cpu-offload false \
  --save-output \
  --warmup \
  --enable-torch-compile \
  --perf-dump-path outputs/qknorm_rope_pr/qwen_edit_split.json

# Qwen Image Edit (fused)
sglang generate \
  --model-path=Qwen/Qwen-Image-Edit-2511 \
  --prompt="Transform into anime style" \
  --negative-prompt=" " \
  --image-path=<ASSET_DIR>/cat.png \
  --log-level=info \
  --seed=42 \
  --width=1024 \
  --height=1024 \
  --num-inference-steps=50 \
  --guidance-scale=4.0 \
  --dit-cpu-offload false \
  --text-encoder-cpu-offload false \
  --save-output \
  --warmup \
  --enable-torch-compile \
  --perf-dump-path outputs/qknorm_rope_pr/qwen_edit_fused.json

# FLUX.1-dev (split)
sglang generate \
  --model-path=black-forest-labs/FLUX.1-dev \
  --prompt="A futuristic cyberpunk city at night, neon lights reflecting on wet streets, highly detailed, 8k" \
  --log-level=info \
  --seed=42 \
  --width=1024 \
  --height=1024 \
  --num-inference-steps=50 \
  --guidance-scale=4.0 \
  --save-output \
  --warmup \
  --enable-torch-compile \
  --perf-dump-path outputs/qknorm_rope_pr/flux_split.json

# FLUX.1-dev (fused)
sglang generate \
  --model-path=black-forest-labs/FLUX.1-dev \
  --prompt="A futuristic cyberpunk city at night, neon lights reflecting on wet streets, highly detailed, 8k" \
  --log-level=info \
  --seed=42 \
  --width=1024 \
  --height=1024 \
  --num-inference-steps=50 \
  --guidance-scale=4.0 \
  --save-output \
  --warmup \
  --enable-torch-compile \
  --perf-dump-path outputs/qknorm_rope_pr/flux_fused.json

# FLUX.2-dev (split)
sglang generate \
  --model-path=black-forest-labs/FLUX.2-dev \
  --prompt="A Logo With Bold Large Text: SGL Diffusion" \
  --log-level=info \
  --seed=42 \
  --width=1024 \
  --height=1024 \
  --dit-layerwise-offload false \
  --dit-cpu-offload false \
  --text-encoder-cpu-offload true \
  --vae-cpu-offload false \
  --save-output \
  --warmup \
  --enable-torch-compile \
  --perf-dump-path outputs/qknorm_rope_pr/flux2_split.json

# FLUX.2-dev (fused)
sglang generate \
  --model-path=black-forest-labs/FLUX.2-dev \
  --prompt="A Logo With Bold Large Text: SGL Diffusion" \
  --log-level=info \
  --seed=42 \
  --width=1024 \
  --height=1024 \
  --dit-layerwise-offload false \
  --dit-cpu-offload false \
  --text-encoder-cpu-offload true \
  --vae-cpu-offload false \
  --save-output \
  --warmup \
  --enable-torch-compile \
  --perf-dump-path outputs/qknorm_rope_pr/flux2_fused.json

# Z-Image-Turbo (split)
sglang generate \
  --model-path=Tongyi-MAI/Z-Image-Turbo \
  --prompt="A fantasy landscape with mountains and a river, detailed, vibrant colors" \
  --log-level=info \
  --seed=42 \
  --width=1024 \
  --height=1024 \
  --num-inference-steps=9 \
  --guidance-scale=0.0 \
  --dit-cpu-offload false \
  --text-encoder-cpu-offload false \
  --save-output \
  --warmup \
  --enable-torch-compile \
  --perf-dump-path outputs/qknorm_rope_pr/zimage_split.json

# Z-Image-Turbo (fused)
sglang generate \
  --model-path=Tongyi-MAI/Z-Image-Turbo \
  --prompt="A fantasy landscape with mountains and a river, detailed, vibrant colors" \
  --log-level=info \
  --seed=42 \
  --width=1024 \
  --height=1024 \
  --num-inference-steps=9 \
  --guidance-scale=0.0 \
  --dit-cpu-offload false \
  --text-encoder-cpu-offload false \
  --save-output \
  --warmup \
  --enable-torch-compile \
  --perf-dump-path outputs/qknorm_rope_pr/zimage_fused.json
```

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@github-actions github-actions Bot added diffusion SGLang Diffusion jit-kernel labels Mar 26, 2026
BBuf commented Mar 26, 2026

/tag-and-rerun-ci

BBuf commented Mar 26, 2026

@mickqian @yingluosanqian It's ready now.

@BBuf BBuf changed the title Add qknorm rope fuse kernel [Diffusion] Add qknorm rope fuse kernel Mar 26, 2026
Comment thread python/sglang/jit_kernel/csrc/elementwise/qknorm_rope.cuh Outdated
Comment thread python/sglang/jit_kernel/csrc/diffusion/qknorm_rope.cuh
Comment thread python/sglang/jit_kernel/tests/test_qknorm_rope.py Outdated
Comment thread python/sglang/jit_kernel/csrc/elementwise/qknorm_rope.cuh Outdated
```python
    head_dim=self.head_dim,
    allow_inplace=True,
)
if cos_sin_cache is not None:
```
Collaborator: could we use a helper function to generalize this logic and put it in layernorm.py?
Collaborator Author: done
@mickqian (Collaborator) left a comment: excellent

@yhyang201 (Collaborator)

/tag-and-rerun-ci

@BBuf BBuf merged commit d633ab7 into main Mar 27, 2026
143 of 209 checks passed
@BBuf BBuf deleted the add_qknorm_rope_fuse_kernel branch March 27, 2026 06:27
satyamk7054 pushed a commit to satyamk7054/sglang that referenced this pull request Apr 3, 2026
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026