[Diffusion] Speed up Qwen select01 Triton modulation kernels #21318
Conversation
Summary of Changes (Gemini Code Assist): This pull request improves the performance of Qwen-Image denoise operations by refining the underlying Triton modulation kernels. The changes reduce redundant memory operations and standardize kernel launch parameters, yielding a measurable speedup in both microbenchmarks and end-to-end denoising stages. The improvements are validated through correctness tests and detailed performance profiling, demonstrating more efficient use of GPU resources.
/tag-and-rerun-ci
Code Review
This pull request implements a significant performance optimization for the Qwen select01 Triton modulation kernels. By switching to pointer-select for modulation loads, the kernels now only load the necessary scale/shift/gate branch, avoiding redundant memory accesses. Pinning num_warps=4 and num_stages=4 further refines the kernel launch configuration. The provided Nsight Compute analysis and benchmarks clearly demonstrate the positive impact of these changes, showing reduced GPU time, lower register usage, and improved throughput. The changes are well-justified and directly address the goal of speeding up these kernels.
/tag-and-rerun-ci

/rerun-failed-ci
Summary
This PR keeps the Qwen select01 Triton kernel version that showed a stable end-to-end win in Qwen-Image denoise.
The final change set:
- Loads only the selected scale/shift/gate branch instead of both
- Pins num_warps=4, num_stages=4
- Removes the 8w1s / residual-only experiments from the active code path because they did not produce stable model-level gains

Made with Codex (AKO4ALL framework and SGLang Diffusion SKILL).
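The core access-pattern change — selecting the branch before the load, rather than loading both branches and masking — can be illustrated with a small NumPy analogy. This is not the Triton kernel itself; the array names, shapes, and per-row loop are made up for illustration:

```python
import numpy as np

def modulate_load_both(x, scale0, scale1, idx):
    # Baseline pattern: read BOTH modulation branches, then mask.
    # In the Triton kernel this meant redundant tl.load traffic.
    s = np.where(idx[:, None], scale1, scale0)
    return x * (1 + s)

def modulate_pointer_select(x, scale0, scale1, idx):
    # Optimized pattern: pick the source buffer ("pointer") first,
    # so only ONE branch is ever read per row.
    out = np.empty_like(x)
    for row, use_branch1 in enumerate(idx):
        src = scale1 if use_branch1 else scale0  # select before the load
        out[row] = x[row] * (1 + src[row])
    return out
```

In Triton the same idea is a select on the base pointers followed by a single `tl.load`, which is why the executed-instruction and memory-traffic counts drop in the profile.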
Implementation
The optimized kernels are:
- fuse_layernorm_scale_shift_gate_select01_kernel
- fuse_residual_layernorm_scale_shift_gate_select01_kernel

Main code changes:
- Select between scale0/1, shift0/1, and gate0/1 by choosing the load pointer up front, instead of loading both branches and combining them with tl.where(idx, ...)
- Pin the 4w4s launch config for both kernels

Validation
Correctness:
- python -m py_compile python/sglang/jit_kernel/diffusion/triton/scale_shift.py
- pytest -q python/sglang/jit_kernel/tests/test_qwen_image_modulation.py

Performance:
- Microbenchmarked both kernels at input shape (2,2048,3072)

Nsight Compute
A representative ncu check on the layernorm select01 kernel at (2,2048,3072) shows:

- gpu__time_duration.avg: 35.744 us -> 28.704 us
- launch__registers_per_thread: 96 -> 72

The optimized kernel cuts single-launch latency by 19.7% versus the baseline (35.744 us -> 28.704 us), while also increasing L2 and DRAM throughput. This indicates that the win is not just a launch-parameter artifact: the kernel is doing less wasted work and using the memory hierarchy more effectively.
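For reference, the quoted percentage follows directly from the two ncu timing samples; this is plain arithmetic on the numbers above, not a new measurement:

```python
baseline_us = 35.744   # gpu__time_duration.avg, baseline kernel
optimized_us = 28.704  # gpu__time_duration.avg, pointer-select kernel

reduction = (baseline_us - optimized_us) / baseline_us
speedup = baseline_us / optimized_us
print(f"latency reduction: {reduction:.1%}, speedup: {speedup:.2f}x")
# latency reduction: 19.7%, speedup: 1.25x
```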
Executed and issued instruction counts both drop by about 18.5%, which is consistent with the kernel rewrite: the pointer-select path avoids loading and computing both modulation branches before selecting one.
This kernel is register-limited, and lowering register pressure improves occupancy materially. In this tuned version, registers per thread drop from 96 to 72, which raises both theoretical and achieved occupancy and improves latency hiding.
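To make the occupancy claim concrete, here is a back-of-the-envelope register-limit calculation. The 64 K-register file and 64-warp SM limit are assumptions about a typical recent NVIDIA datacenter GPU, not values from the profile, and real hardware rounds register allocation, so treat the numbers as a sketch rather than the profiled occupancy:

```python
REGS_PER_SM = 64 * 1024   # assumed SM register file size
WARP_SIZE = 32
WARPS_PER_BLOCK = 4       # the pinned num_warps=4
MAX_WARPS_PER_SM = 64     # assumed hardware warp limit

def resident_warps(regs_per_thread):
    # Warps that fit on one SM when registers are the limiting resource
    # (register-allocation granularity is ignored for simplicity).
    regs_per_block = regs_per_thread * WARP_SIZE * WARPS_PER_BLOCK
    blocks = REGS_PER_SM // regs_per_block
    return min(blocks * WARPS_PER_BLOCK, MAX_WARPS_PER_SM)

for regs in (96, 72):
    warps = resident_warps(regs)
    print(f"{regs} regs/thread -> {warps} resident warps "
          f"({warps / MAX_WARPS_PER_SM:.0%} theoretical occupancy)")
# 96 regs/thread -> 20 resident warps (31% theoretical occupancy)
# 72 regs/thread -> 28 resident warps (44% theoretical occupancy)
```

The direction matches the profiled result — dropping from 96 to 72 registers per thread lets more warps co-reside, which is what raises achieved occupancy and latency hiding — even though the absolute figures depend on the actual GPU and Triton's block sizing.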
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist
Review Process
- /tag-run-ci-label
- /rerun-failed-ci
- /tag-and-rerun-ci