[Diffusion] Opt qwen-image-edit with fuse_residual_layernorm_scale_shift_gate_select01_kernel#20395
Merged
The logic could be clearer: we only need to consider two cases, `index` is None or not None. Each case just calls `scale_residual_layernorm_scale_shift` or `layernorm_scale_shift`.
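Below is a minimal pure-Python sketch of the two-case structure suggested above. It assumes that "select01" means each token picks modulation set 0 or set 1 through `index`, and that the residual step is `x + gate * update` applied before the LayerNorm. `ref_layernorm_scale_shift` and `ref_scale_residual_layernorm_scale_shift` are hypothetical reference helpers, not the sglang functions named in the comment, and the list-of-rows layout is illustrative only.

```python
import math

# Hypothetical reference helpers, NOT sglang's kernels. Assumptions:
#   x     : list of token rows, each a list of D floats (one sequence)
#   mods  : list of (shift, scale, gate) triples, each a list of D floats
#   index : None, or one 0/1 entry per token ("select01")


def _layer_norm(row, eps=1e-6):
    """LayerNorm over one row, without affine weight/bias."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in row]


def ref_layernorm_scale_shift(x, mods, index=None):
    """Return (norm(x) * (1 + scale) + shift, gate) for every token."""
    if index is None:
        # Case 1: all tokens share the single parameter set mods[0].
        chosen = [mods[0]] * len(x)
    else:
        # Case 2: token t uses parameter set mods[index[t]], index[t] in {0, 1}.
        chosen = [mods[i] for i in index]
    out, gates = [], []
    for row, (shift, scale, gate) in zip(x, chosen):
        normed = _layer_norm(row)
        out.append([v * (1.0 + s) + b for v, s, b in zip(normed, scale, shift)])
        gates.append(gate)
    return out, gates


def ref_scale_residual_layernorm_scale_shift(x, update, gate_prev, mods, index=None):
    """Gated residual add, then the same two-case modulation as above."""
    # Assumed residual form: x_new = x + gate_prev * update (elementwise).
    x_new = [
        [a + g * u for a, g, u in zip(row, g_row, u_row)]
        for row, g_row, u_row in zip(x, gate_prev, update)
    ]
    out, gates = ref_layernorm_scale_shift(x_new, mods, index)
    return x_new, out, gates
```

Keeping the `index` branch in one helper lets the residual and non-residual paths share it. The sketch is plain Python only so it runs without PyTorch or a GPU; the PR's fused Triton kernel does this work in one kernel instead of the split kernels it is benchmarked against.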
yingluosanqian approved these changes (Mar 12, 2026)
Author (Collaborator): `/tag-and-rerun-ci`
Author (Collaborator): `/rerun-failed-ci` (posted more than once)
Commits referencing this pull request, each with the message "…ift_gate_select01_kernel (sgl-project#20395) Co-authored-by: Yihan Chen <yingluosanqian@gmail.com>", were pushed to forks:
- liubiyongge to liubiyongge/sglang (Mar 13, 2026)
- yhyang201 to yhyang201/sglang (Mar 15, 2026)
- Wangzheee to Wangzheee/sglang (Mar 21, 2026)
- 0-693 to 0-693/sglang (Mar 25, 2026)
- JustinTong0323 to JustinTong0323/sglang (Apr 7, 2026)
- yhyang201 to yhyang201/sglang (Apr 22, 2026)
Motivation
main / pr: [screenshots not preserved]
Per-step time: 0.6383 s -> 0.6256 s, a 2% end-to-end improvement.
The generated result is also normal.
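The "2%" figure follows from the two per-step timings above. A quick check of the arithmetic (the variable names are illustrative):

```python
# Per-step latency reported in this PR, in seconds.
main_step_s = 0.6383
pr_step_s = 0.6256

saved_s = main_step_s - pr_step_s            # time saved per step
reduction_pct = saved_s / main_step_s * 100  # relative latency reduction

print(f"saved {saved_s * 1000:.1f} ms per step ({reduction_pct:.2f}% lower latency)")
```

This prints `saved 12.7 ms per step (1.99% lower latency)`, so the rounded 2% corresponds to about 12.7 ms per step.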
test
micro benchmark
`python python/sglang/jit_kernel/benchmark/bench_qwen_image_modulation.py`

Benchmark: qwen_image layernorm + scale_shift_gate_select01 (`qwen_image_layernorm_scale_shift_gate_select01`)

| B | S | D | Split Kernels | Fused Triton |
|---|---|---|---|---|
| 1 | 128 | 1024 | 38.304001 | 7.904000 |
| 1 | 128 | 1536 | 37.071999 | 8.608000 |
| 1 | 128 | 3072 | 36.352001 | 9.792000 |
| 1 | 512 | 1024 | 35.424002 | 8.000000 |
| 1 | 512 | 1536 | 35.583999 | 8.096000 |
| 1 | 512 | 3072 | 35.264000 | 11.008000 |
| 1 | 2048 | 1024 | 34.976002 | 11.200000 |
| 1 | 2048 | 1536 | 34.464002 | 14.560000 |
| 1 | 2048 | 3072 | 37.439998 | 22.816001 |
| 2 | 128 | 1024 | 36.127999 | 7.872000 |
| 2 | 128 | 1536 | 36.031999 | 7.904000 |
| 2 | 128 | 3072 | 35.680000 | 9.824000 |
| 2 | 512 | 1024 | 35.615999 | 8.416000 |
| 2 | 512 | 1536 | 35.392001 | 10.752000 |
| 2 | 512 | 3072 | 34.784000 | 15.776001 |
| 2 | 2048 | 1024 | 34.944002 | 15.104000 |
| 2 | 2048 | 1536 | 37.344001 | 21.504000 |
| 2 | 2048 | 3072 | 49.791999 | 37.120000 |

Benchmark: qwen_image residual + layernorm + scale_shift_gate_select01 (`qwen_image_residual_layernorm_scale_shift_gate_select01`)

| B | S | D | Split Kernels | Fused Triton |
|---|---|---|---|---|
| 1 | 128 | 1024 | 49.823999 | 17.247999 |
| 1 | 128 | 1536 | 49.120001 | 17.440001 |
| 1 | 128 | 3072 | 50.271999 | 17.632000 |
| 1 | 512 | 1024 | 48.767999 | 17.535999 |
| 1 | 512 | 1536 | 48.831999 | 16.960001 |
| 1 | 512 | 3072 | 47.936000 | 17.344000 |
| 1 | 2048 | 1024 | 47.807999 | 17.535999 |
| 1 | 2048 | 1536 | 47.839999 | 18.560000 |
| 1 | 2048 | 3072 | 50.816000 | 29.247999 |
| 2 | 128 | 1024 | 48.976000 | 17.279999 |
| 2 | 128 | 1536 | 49.247999 | 17.376000 |
| 2 | 128 | 3072 | 49.024001 | 16.736001 |
| 2 | 512 | 1024 | 48.255999 | 16.543999 |
| 2 | 512 | 1536 | 47.968000 | 17.440001 |
| 2 | 512 | 3072 | 47.807999 | 19.904001 |
| 2 | 2048 | 1024 | 48.032001 | 20.864001 |
| 2 | 2048 | 1536 | 52.288000 | 28.672000 |
| 2 | 2048 | 3072 | 90.272002 | 49.472000 |

torch profiler
167us->72us
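Speedup ratios make the micro-benchmark and profiler numbers easier to compare. The sketch below copies the smallest and largest shapes from the two tables, plus the torch profiler figure. Each pair is measured in the same unit, so the ratio is unit-free. The dictionary keys are only labels.

```python
def speedup(before, after):
    """Ratio of old to new time; > 1 means the fused path is faster."""
    return before / after


# (split kernels, fused Triton) pairs copied from the micro benchmark tables.
timings = {
    "layernorm, B=1 S=128 D=1024": (38.304001, 7.904000),
    "layernorm, B=2 S=2048 D=3072": (49.791999, 37.120000),
    "residual+layernorm, B=1 S=128 D=1024": (49.823999, 17.247999),
    "residual+layernorm, B=2 S=2048 D=3072": (90.272002, 49.472000),
    # torch profiler kernel time in microseconds (before, after).
    "torch profiler": (167.0, 72.0),
}

speedups = {name: speedup(b, a) for name, (b, a) in timings.items()}
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```

The ratios come out to roughly 4.85x, 1.34x, 2.89x, 1.82x and 2.32x. Across both tables the split-kernel time stays between about 34 and 53 for every shape except the largest residual case (90.27). The fused time rises mainly at the larger shapes, so the relative gain is largest at small shapes.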