[Diffusion] Opt qwen-image-edit with fuse_residual_layernorm_scale_shift_gate_select01_kernel by BBuf · Pull Request #20395 · sgl-project/sglang

BBuf · 2026-03-12T01:30:27Z

Motivation

main:

sglang generate --model-path=Qwen/Qwen-Image-Edit-2511 '--prompt=Transform into anime style' '--negative-prompt= ' --image-path=/workspace/gen_benchmark/figs/cat.png --width=1024 --height=1024 --num-inference-steps=50  --guidance-scale=4.0 --seed=42 --save-output --warmup --dit-cpu-offload false --text-encoder-cpu-offload false --enable-torch-compile

[03-12 01:07:24] [DenoisingStage] started...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:31<00:00,  1.57it/s]
[03-12 01:07:56] [DenoisingStage] average time per step: 0.6383 seconds
[03-12 01:07:56] [DenoisingStage] finished in 31.9185 seconds
[03-12 01:07:56] [DecodingStage] started...
[03-12 01:07:56] [DecodingStage] finished in 0.1417 seconds
[03-12 01:07:56] Peak GPU memory: 64.44 GB, Peak allocated: 62.37 GB, Memory pool overhead: 2.07 GB (3.2%), Remaining GPU memory at peak: 75.96 GB. Components that could stay resident (based on the last request workload): []. Related offload server args to disable: None
[03-12 01:07:56] Output saved to outputs/Transform_into_anime_style_20260312-010712_73d9a86a.png
[03-12 01:07:56] Pixel data generated successfully in 44.47 seconds
[03-12 01:07:56] Completed batch processing. Generated 1 outputs in 44.47 seconds
[03-12 01:07:56] Warmed-up request processed in 32.37 seconds (with warmup excluded)
[03-12 01:07:56] Memory usage - Max peak: 65988.00 MB, Avg peak: 65988.00 MB

pr:

sglang generate --model-path=Qwen/Qwen-Image-Edit-2511 '--prompt=Transform into anime style' '--negative-prompt= ' --image-path=/workspace/gen_benchmark/figs/cat.png --width=1024 --height=1024 --num-inference-steps=50  --guidance-scale=4.0 --seed=42 --save-output --warmup --dit-cpu-offload false --text-encoder-cpu-offload false --enable-torch-compile

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:31<00:00,  1.60it/s]
[03-12 01:13:46] [DenoisingStage] average time per step: 0.6256 seconds
[03-12 01:13:46] [DenoisingStage] finished in 31.2824 seconds
[03-12 01:13:46] [DecodingStage] started...
[03-12 01:13:46] [DecodingStage] finished in 0.1477 seconds
[03-12 01:13:46] Peak GPU memory: 64.44 GB, Peak allocated: 62.37 GB, Memory pool overhead: 2.07 GB (3.2%), Remaining GPU memory at peak: 75.96 GB. Components that could stay resident (based on the last request workload): []. Related offload server args to disable: None
[03-12 01:13:47] Output saved to outputs/Transform_into_anime_style_20260312-011258_fa2fd09c.png
[03-12 01:13:47] Pixel data generated successfully in 49.25 seconds
[03-12 01:13:47] Completed batch processing. Generated 1 outputs in 49.25 seconds
[03-12 01:13:47] Warmed-up request processed in 31.73 seconds (with warmup excluded)
[03-12 01:13:47] Memory usage - Max peak: 65988.00 MB, Avg peak: 65988.00 MB

Average time per denoising step:

0.6383s -> 0.6256s, roughly a 2% end-to-end improvement.
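The 2% figure can be checked directly from the two logs above. The snippet below is only a small arithmetic sketch; the variable names are illustrative, and the step count of 50 comes from `--num-inference-steps=50` in the command.

```python
# Per-step averages copied from the DenoisingStage log lines above.
main_step_s = 0.6383
pr_step_s = 0.6256
num_steps = 50  # from --num-inference-steps=50

saved_per_step_s = main_step_s - pr_step_s
relative_gain = saved_per_step_s / main_step_s
saved_over_run_s = saved_per_step_s * num_steps

print(f"saved per step: {saved_per_step_s * 1000:.1f} ms")
print(f"relative gain:  {relative_gain:.2%}")
print(f"saved over {num_steps} steps: {saved_over_run_s:.3f} s")
```

The saving over 50 steps is consistent with the difference between the two reported denoising totals (31.9185 s and 31.2824 s).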

The generated image looks correct as well.

test

micro benchmark

python python/sglang/jit_kernel/benchmark/bench_qwen_image_modulation.py

================================================================================
Benchmark: qwen_image layernorm + scale_shift_gate_select01
================================================================================

qwen_image_layernorm_scale_shift_gate_select01:
      B       S       D  Split Kernels  Fused Triton
0   1.0   128.0  1024.0      38.304001      7.904000
1   1.0   128.0  1536.0      37.071999      8.608000
2   1.0   128.0  3072.0      36.352001      9.792000
3   1.0   512.0  1024.0      35.424002      8.000000
4   1.0   512.0  1536.0      35.583999      8.096000
5   1.0   512.0  3072.0      35.264000     11.008000
6   1.0  2048.0  1024.0      34.976002     11.200000
7   1.0  2048.0  1536.0      34.464002     14.560000
8   1.0  2048.0  3072.0      37.439998     22.816001
9   2.0   128.0  1024.0      36.127999      7.872000
10  2.0   128.0  1536.0      36.031999      7.904000
11  2.0   128.0  3072.0      35.680000      9.824000
12  2.0   512.0  1024.0      35.615999      8.416000
13  2.0   512.0  1536.0      35.392001     10.752000
14  2.0   512.0  3072.0      34.784000     15.776001
15  2.0  2048.0  1024.0      34.944002     15.104000
16  2.0  2048.0  1536.0      37.344001     21.504000
17  2.0  2048.0  3072.0      49.791999     37.120000

================================================================================
Benchmark: qwen_image residual + layernorm + scale_shift_gate_select01
================================================================================

qwen_image_residual_layernorm_scale_shift_gate_select01:
      B       S       D  Split Kernels  Fused Triton
0   1.0   128.0  1024.0      49.823999     17.247999
1   1.0   128.0  1536.0      49.120001     17.440001
2   1.0   128.0  3072.0      50.271999     17.632000
3   1.0   512.0  1024.0      48.767999     17.535999
4   1.0   512.0  1536.0      48.831999     16.960001
5   1.0   512.0  3072.0      47.936000     17.344000
6   1.0  2048.0  1024.0      47.807999     17.535999
7   1.0  2048.0  1536.0      47.839999     18.560000
8   1.0  2048.0  3072.0      50.816000     29.247999
9   2.0   128.0  1024.0      48.976000     17.279999
10  2.0   128.0  1536.0      49.247999     17.376000
11  2.0   128.0  3072.0      49.024001     16.736001
12  2.0   512.0  1024.0      48.255999     16.543999
13  2.0   512.0  1536.0      47.968000     17.440001
14  2.0   512.0  3072.0      47.807999     19.904001
15  2.0  2048.0  1024.0      48.032001     20.864001
16  2.0  2048.0  1536.0      52.288000     28.672000
17  2.0  2048.0  3072.0      90.272002     49.472000
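To make the tables easier to read, the speedup of the fused kernel can be computed as the ratio of the two columns. The sketch below does this for the smallest and largest shapes of each benchmark, using values copied from the tables above. The benchmark output does not state the time unit, so only ratios are computed; the dictionary layout and names are illustrative.

```python
# (B, S, D) -> (split kernels, fused Triton), values copied from the tables above.
layernorm_only = {
    (1, 128, 1024): (38.304001, 7.904000),
    (2, 2048, 3072): (49.791999, 37.120000),
}
residual_layernorm = {
    (1, 128, 1024): (49.823999, 17.247999),
    (2, 2048, 3072): (90.272002, 49.472000),
}

def speedups(rows):
    """Return split/fused time ratio for each shape."""
    return {shape: split / fused for shape, (split, fused) in rows.items()}

ln_speedups = speedups(layernorm_only)
res_speedups = speedups(residual_layernorm)

for name, table in (("layernorm", ln_speedups), ("residual+layernorm", res_speedups)):
    for shape, ratio in table.items():
        print(f"{name:>20} B,S,D={shape}: {ratio:.2f}x")
```

In these rows the fused kernel is faster at both ends of the shape range, with the largest relative gain at the small shapes.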

torch profiler

[Profiler screenshots: main, pr]

167us -> 72us (main -> pr)

gemini-code-assist · 2026-03-12T01:30:31Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

The logic could be clearer: we only need to consider two cases, index is None or index is not None. Each case just calls `scale_residual_layernorm_scale_shift` or `layernorm_scale_shift`.
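As a rough illustration of the two cases in the comment above, here is an unfused, pure-Python reference sketch. It is not the code from this PR. It assumes the usual adaLN-style modulation `LN(x) * (1 + scale) + shift`, and it assumes that "select01" means picking one of two parameter sets per token from a 0/1 index. The names `layer_norm` and `modulate` and their signatures are hypothetical.

```python
import math

def layer_norm(row, eps=1e-6):
    """LayerNorm over one token's features, without affine parameters."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in row]

def modulate(tokens, params0, params1=None, index=None):
    """Hypothetical reference: each params tuple is (shift, scale, gate).

    index is None     -> every token uses params0.
    index is not None -> token i uses params1 if index[i] == 1, else params0.
    """
    if index is None:
        chosen = [params0] * len(tokens)
    else:
        chosen = [params1 if i == 1 else params0 for i in index]

    out, gates = [], []
    for row, (shift, scale, gate) in zip(tokens, chosen):
        normed = layer_norm(row)
        out.append([n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)])
        gates.append(gate)
    return out, gates

tokens = [[1.0, 3.0], [1.0, 3.0]]
p0 = ([0.0, 0.0], [0.0, 0.0], [1.0, 1.0])    # shift, scale, gate for index 0
p1 = ([10.0, 10.0], [1.0, 1.0], [2.0, 2.0])  # shift, scale, gate for index 1

plain_out, plain_gates = modulate(tokens, p0)
sel_out, sel_gates = modulate(tokens, p0, p1, index=[0, 1])
print(plain_out, sel_out, sel_gates)
```

With only these two branches, the per-token selection is confined to one place, which is the simplification the comment asks for.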

BBuf · 2026-03-12T05:29:24Z

/tag-and-rerun-ci

BBuf · 2026-03-12T08:51:43Z

/rerun-failed-ci

BBuf · 2026-03-12T09:51:26Z

/rerun-failed-ci

BBuf · 2026-03-13T05:15:18Z

https://github.com/sgl-project/sglang/actions/runs/23032175198/job/66892663929?pr=20395

…ift_gate_select01_kernel (sgl-project#20395) Co-authored-by: Yihan Chen <yingluosanqian@gmail.com>

BBuf added 2 commits March 11, 2026 22:25

ud

6d52fb1

ud

65b7095

BBuf requested review from DarkSharpness, HydraQYH, celve, mickqian, ping1jing2, yhyang201, yingluosanqian and yuan-luo as code owners March 12, 2026 01:30

github-actions Bot added the diffusion SGLang Diffusion label Mar 12, 2026

yingluosanqian reviewed Mar 12, 2026

Comment thread python/sglang/multimodal_gen/runtime/models/dits/qwen_image.py Outdated

yingluosanqian reviewed Mar 12, 2026

Comment thread python/sglang/multimodal_gen/runtime/models/dits/qwen_image.py Outdated

BBuf and others added 2 commits March 12, 2026 12:11

ud

49c4bcb

Update qwen_image.py

608dbae

yingluosanqian approved these changes Mar 12, 2026

github-actions Bot added the run-ci label Mar 12, 2026

BBuf mentioned this pull request Mar 12, 2026

[Diffusion][Qwen-Image] Kernel fusion on layernorm and fuse_scale_shift_gate_select01 #20429

Open

5 tasks

BBuf added 3 commits March 12, 2026 17:51

Merge branch 'main' into qwen_image_opt

f7ba842

Merge branch 'main' into qwen_image_opt

b8ee25d

Merge branch 'main' into qwen_image_opt

abce9fc

github-actions Bot added the jit-kernel label Mar 13, 2026

BBuf merged commit e00328d into main Mar 13, 2026
95 of 110 checks passed

BBuf deleted the qwen_image_opt branch March 13, 2026 05:15

liubiyongge pushed a commit to liubiyongge/sglang that referenced this pull request Mar 13, 2026

[Diffusion] Opt qwen-image-edit with fuse_residual_layernorm_scale_sh…

cc1a0a1

…ift_gate_select01_kernel (sgl-project#20395) Co-authored-by: Yihan Chen <yingluosanqian@gmail.com>

yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Mar 15, 2026

[Diffusion] Opt qwen-image-edit with fuse_residual_layernorm_scale_sh…

4ff1fff

…ift_gate_select01_kernel (sgl-project#20395) Co-authored-by: Yihan Chen <yingluosanqian@gmail.com>

Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026

[Diffusion] Opt qwen-image-edit with fuse_residual_layernorm_scale_sh…

25e03ab

…ift_gate_select01_kernel (sgl-project#20395) Co-authored-by: Yihan Chen <yingluosanqian@gmail.com>

0-693 pushed a commit to 0-693/sglang that referenced this pull request Mar 25, 2026

[Diffusion] Opt qwen-image-edit with fuse_residual_layernorm_scale_sh…

d8d9ceb

…ift_gate_select01_kernel (sgl-project#20395) Co-authored-by: Yihan Chen <yingluosanqian@gmail.com>

JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026

[Diffusion] Opt qwen-image-edit with fuse_residual_layernorm_scale_sh…

07676bd

…ift_gate_select01_kernel (sgl-project#20395) Co-authored-by: Yihan Chen <yingluosanqian@gmail.com>

yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026

[Diffusion] Opt qwen-image-edit with fuse_residual_layernorm_scale_sh…

ee0febc

…ift_gate_select01_kernel (sgl-project#20395) Co-authored-by: Yihan Chen <yingluosanqian@gmail.com>

BBuf mentioned this pull request Apr 29, 2026

SGLang AI Agent Performance Optimization PRs (2026-01-29 to 2026-04-29) BBuf/AI-Infra-Auto-Driven-SKILLS#46

Open

BBuf merged 7 commits into main from qwen_image_opt
