Merged
Conversation
Author: /tag-and-rerun-ci
Author: @mickqian @yingluosanqian It's ready now.
mickqian reviewed on Mar 26, 2026:
```python
    head_dim=self.head_dim,
    allow_inplace=True,
)
if cos_sin_cache is not None:
```
mickqian: Could we use a helper function to generalize this logic and put it in layernorm.py?
Collaborator: /tag-and-rerun-ci
satyamk7054 pushed a commit to satyamk7054/sglang that referenced this pull request on Apr 3, 2026.
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request on Apr 7, 2026.
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request on Apr 22, 2026.
Summary
Made with Codex and this skill: https://github.com/sgl-project/sglang/tree/main/python/sglang/multimodal_gen/.claude/skills/sglang-diffusion-ako4all-kernel
This PR adds a new JIT CUDA kernel that fuses QK RMSNorm and RoPE into a single in-place kernel for diffusion models.
It also wires the fused path into the main diffusion DiT implementations that already use the QK norm + RoPE pattern, while keeping the existing split path as a fallback.
What Changed
Added a new fused JIT kernel for QK RMSNorm + RoPE:
- python/sglang/jit_kernel/csrc/elementwise/qknorm_rope.cuh
- python/sglang/jit_kernel/qknorm_rope.py

Added a shared runtime helper:
- python/sglang/multimodal_gen/runtime/layers/layernorm.py: apply_qk_norm_rope(...) uses the fused kernel when the shape/dtype/layout is supported, and falls back to split QK norm + FlashInfer RoPE otherwise (see the reference sketch after this list).

Integrated the fused path into diffusion model implementations:
- python/sglang/multimodal_gen/runtime/models/dits/qwen_image.py
- python/sglang/multimodal_gen/runtime/models/dits/flux.py
- python/sglang/multimodal_gen/runtime/models/dits/flux_2.py
- python/sglang/multimodal_gen/runtime/models/dits/zimage.py

Added correctness coverage and a dedicated micro benchmark:
- python/sglang/jit_kernel/tests/test_qknorm_rope.py
- python/sglang/jit_kernel/benchmark/bench_qknorm_rope.py
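For orientation, here is a minimal PyTorch sketch of the operation the fused kernel and apply_qk_norm_rope implement: per-head RMSNorm on Q and K, followed by Neox-style RoPE on the first rope_dim dimensions. The function and tensor names below are illustrative assumptions; only the op itself comes from this PR.

```python
# Reference-semantics sketch only (an illustration, not the PR's code):
# per-head RMSNorm on Q and K, then Neox-layout RoPE on the first
# rope_dim dims of each head. The fused kernel computes the same thing
# in place in a single pass.
import torch

def rms_norm(x, weight, eps=1e-6):
    # Normalize each head vector by its root-mean-square, then scale.
    var = x.float().pow(2).mean(dim=-1, keepdim=True)
    return (x.float() * torch.rsqrt(var + eps)).to(x.dtype) * weight

def qk_norm_rope_ref(q, k, q_weight, k_weight, cos, sin, rope_dim):
    # q, k: [num_tokens, num_heads, head_dim]
    # cos, sin: [num_tokens, rope_dim // 2]
    q, k = rms_norm(q, q_weight), rms_norm(k, k_weight)

    def rope(x):
        rot, rest = x[..., :rope_dim], x[..., rope_dim:]
        # Neox layout: dim i is paired with dim i + rope_dim // 2
        # (the pairing the CUDA kernel realizes via lane exchange).
        x1, x2 = rot.chunk(2, dim=-1)
        c, s = cos[:, None, :].to(x.dtype), sin[:, None, :].to(x.dtype)
        rotated = torch.cat([x1 * c - x2 * s, x2 * c + x1 * s], dim=-1)
        return torch.cat([rotated, rest], dim=-1)

    return rope(q), rope(k)
```

The split path materializes the normalized Q/K before a separate RoPE kernel reads them back; fusing the two steps removes that intermediate round trip through memory.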
Key Optimization Points

Fused QKNorm+RoPE Kernel Design
The new kernel fuses Q/K RMSNorm and RoPE into a single warp-level in-place CUDA kernel.
Implementation highlights:
- Each warp processes one (token, head) work item.
- __shfl_xor_sync-based lane exchange for the Neox layout.

Key optimizations:
- Compile-time specialization on head_dim, rope_dim, is_neox, and dtype.

This PR also adds position_offset support in the shared runtime helper so the fused path can be used for segmented RoPE ranges in FLUX / FLUX.2 dual-stream attention blocks.
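To make the position_offset use concrete, here is a hypothetical usage sketch built on qk_norm_rope_ref from above. The segment lengths, frequencies, and shapes are made up for illustration; the point is that a segment's cos/sin rows start at an offset rather than at position 0.

```python
# Hypothetical segmented-RoPE usage (all numbers are made up).
# In a dual-stream block, one stream occupies positions
# [txt_len, txt_len + img_len), so its cos/sin rows start at an
# offset -- the role position_offset plays in the shared helper.
txt_len, img_len, H, D, rope_dim = 77, 4096, 24, 128, 128

inv_freq = 1.0 / (10000.0 ** (torch.arange(0, rope_dim, 2).float() / rope_dim))
pos = torch.arange(txt_len + img_len).float()
angles = pos[:, None] * inv_freq[None, :]   # [T_total, rope_dim // 2]
cos, sin = angles.cos(), angles.sin()

q_img = torch.randn(img_len, H, D)
k_img = torch.randn(img_len, H, D)
w = torch.ones(D)

# position_offset = txt_len selects this segment's cos/sin rows.
q_out, k_out = qk_norm_rope_ref(
    q_img, k_img, w, w,
    cos[txt_len:], sin[txt_len:], rope_dim,
)
```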
Micro Benchmark

All numbers below compare the split path (jit qknorm + flashinfer rope) vs the new fused JIT kernel.

Shape notation:
- q/k shape = [B*T, H, D]
- rope_dim is the applied rotary dimension

| Config | B | T | H | D | rope_dim |
| --- | --- | --- | --- | --- | --- |
| flux_1024 | 1 | 4096 | 24 | 128 | 128 |
| qwen_image_1024 | 1 | 4096 | 32 | 128 | 128 |
| qwen_image_partial | 1 | 4096 | 32 | 128 | 64 |
| zimage_1024 | 1 | 4096 | 30 | 128 | 128 |
| batch2_medium | 2 | 2048 | 24 | 128 | 128 |

Weighted micro benchmark speedup: 1.4387x
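For readers who want to reproduce a similar comparison, here is a generic CUDA timing harness sketch. This is an assumption for illustration, not the PR's bench_qknorm_rope.py; bench and the placeholder callables in the last comment are invented names.

```python
# Generic CUDA timing sketch (not the actual bench_qknorm_rope.py).
import torch

def bench(fn, iters=100, warmup=10):
    # Warm up to exclude JIT compilation and allocator effects.
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # mean ms per call

# speedup = bench(lambda: split_path(...)) / bench(lambda: fused_path(...))
```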
End-to-End Denoise Stage

Models benchmarked: qwen, qwen-edit, flux, flux2, zimage.

Commands used for the end-to-end denoise benchmark:
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci