
[Diffusion] Move diffusion time embedding to jit kernel#16879

Merged
BBuf merged 15 commits into main from move_diffusion_time_embedding_to_jit_kernel on Jan 17, 2026

Conversation

@BBuf (Collaborator) commented Jan 11, 2026

Tests

[Image: test results]

Performance

main:

sglang generate --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers --prompt "A curious raccoon peers through a vibrant field of yellow sunflowers, its eyes wide with interest." --warmup --perf-dump-path main.json

pr:

sglang generate --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers --prompt "A curious raccoon peers through a vibrant field of yellow sunflowers, its eyes wide with interest." --warmup --perf-dump-path pr.json

compare:

python3 /home/lmsys/bbuf/sglang/python/sglang/multimodal_gen/benchmarks/compare_perf.py main.json pr.json

Performance Comparison Report

1. High-level Summary

| Metric | Baseline | New | Diff | Status |
| --- | --- | --- | --- | --- |
| E2E Latency | 84927.21 ms | 81948.12 ms | -2979.09 ms (-3.5%) | |
| Throughput | 0.01 req/s | 0.01 req/s | - | - |

2. Stage Breakdown

| Stage Name | Baseline (ms) | New (ms) | Diff (ms) | Diff (%) | Status |
| --- | --- | --- | --- | --- | --- |
| InputValidationStage | 0.04 | 0.04 | -0.01 | -14.9% | ⚪️ |
| TextEncodingStage | 1310.39 | 1309.37 | -1.02 | -0.1% | ⚪️ |
| ConditioningStage | 0.01 | 0.01 | +0.00 | +15.5% | ⚪️ |
| TimestepPreparationStage | 0.34 | 0.25 | -0.08 | -24.9% | ⚪️ |
| LatentPreparationStage | 0.13 | 0.09 | -0.04 | -30.7% | ⚪️ |
| DenoisingStage | 77232.59 | 77706.47 | +473.87 | +0.6% | ⚪️ |
| DecodingStage | 6381.87 | 2930.39 | -3451.48 | -54.1% | 🟢 |

TextEncodingStage takes almost the same time, which is also expected: this PR does not modify any CUDA code on that path.
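For context, the stage comparison above can be reproduced from the two perf dumps. The sketch below shows only the arithmetic (per-stage diff and percentage); it assumes a hypothetical dump schema rather than the actual format written by --perf-dump-path, and it is not the real compare_perf.py.

```python
# Minimal sketch of a per-stage perf comparison. Assumes a hypothetical
# dump schema {"stages": {"DecodingStage": 6381.87, ...}}; the real files
# written by --perf-dump-path (and compare_perf.py itself) may differ.
import json
import sys

def load_stages(path: str) -> dict:
    with open(path) as f:
        return json.load(f)["stages"]  # hypothetical schema

def compare(baseline_path: str, new_path: str) -> None:
    base = load_stages(baseline_path)
    new = load_stages(new_path)
    for name, base_ms in base.items():
        if name not in new:
            continue
        diff_ms = new[name] - base_ms
        pct = 100.0 * diff_ms / base_ms if base_ms else 0.0
        print(f"{name}: {base_ms:.2f} ms -> {new[name]:.2f} ms "
              f"({diff_ms:+.2f} ms, {pct:+.1f}%)")

if __name__ == "__main__":
    compare(sys.argv[1], sys.argv[2])  # e.g. main.json pr.json
```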

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@github-actions bot added the quant (LLM Quantization), sgl-kernel, and diffusion (SGLang Diffusion) labels on Jan 11, 2026
@BBuf changed the title from "Move diffusion time embedding to jit kernel" to "[Diffusion] Move diffusion time embedding to jit kernel" on Jan 11, 2026
@gemini-code-assist (Contributor)

Summary of Changes

Hello @BBuf, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request migrates the CUDA implementation of the timestep_embedding operation to a JIT-compiled kernel. The refactoring improves the modularity of the codebase and yields notable performance gains, particularly in the DecodingStage of multimodal generation, contributing to a faster overall execution flow.

Highlights

  • JIT Kernel Implementation: The timestep_embedding function has been re-implemented as a JIT-compiled CUDA kernel in python/sglang/jit_kernel/csrc/diffusion/timestep_embedding.cuh, supporting fp16, bf16, and fp32 input types (a reference sketch of the embedding's semantics follows this list).
  • Python Integration: A new Python module python/sglang/jit_kernel/timestep_embedding.py is introduced to load and expose this JIT kernel, replacing the previous PyTorch extension.
  • Old Implementation Removal: The previous C++/CUDA implementation (sgl-kernel/csrc/sgl_diffusion/elementwise/timestep_embedding.cu) and its associated Python bindings and build configurations have been removed to streamline the codebase.
  • Dynamic Kernel Usage: The Timesteps class in python/sglang/multimodal_gen/runtime/layers/visual_embedding.py now dynamically attempts to use the new JIT kernel, falling back to the default implementation if it's unavailable.
  • Performance Improvement: Benchmarks indicate a significant 54.1% reduction in DecodingStage latency and an overall 3.5% improvement in End-to-End Latency due to this change.
  • Test Suite Updates: Test cases for timestep_embedding have been updated to reflect the new module structure and to support float32 input types, ensuring correctness with the new kernel.
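
For readers unfamiliar with the operation, here is a minimal pure-PyTorch sketch of the standard sinusoidal timestep embedding that a kernel like this computes. It is an illustration only: the exact conventions (cos/sin ordering, the max_period default, odd-dim handling) are assumptions based on the common diffusers-style formulation, not taken from this PR's source.

```python
# Hedged reference sketch (not the PR's actual kernel): the standard
# sinusoidal timestep embedding used in diffusion models, in pure PyTorch.
# Assumes cos-then-sin concatenation, max_period=10000, and an even dim;
# the JIT kernel's exact conventions may differ.
import torch

def timestep_embedding_ref(
    timesteps: torch.Tensor,      # 1-D tensor of timesteps, shape [N]
    dim: int,                     # embedding dimension (assumed even)
    max_period: float = 10000.0,
) -> torch.Tensor:
    half = dim // 2
    # Per-channel frequencies: exp(-ln(max_period) * i / half), i in [0, half)
    freqs = torch.exp(
        -torch.log(torch.tensor(max_period, device=timesteps.device))
        * torch.arange(half, dtype=torch.float32, device=timesteps.device)
        / half
    )
    args = timesteps.float()[:, None] * freqs[None, :]   # [N, half]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)  # [N, dim]
```

A JIT CUDA version would typically compute the same values with one thread per output element, doing the math in fp32 and casting to fp16/bf16 on store.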



@gemini-code-assist (Bot) left a comment


Code Review

This pull request successfully moves the diffusion time embedding to a JIT CUDA kernel, leading to significant performance improvements as shown in the benchmarks. The implementation is clean and the integration with the existing codebase is well-handled, including a fallback mechanism for robustness. I've provided a few suggestions to further enhance performance, improve API clarity, and increase code robustness. Overall, this is a great contribution.

Comment threads:
  • python/sglang/jit_kernel/csrc/diffusion/timestep_embedding.cuh
  • python/sglang/jit_kernel/tests/test_timestep_embedding.py
  • python/sglang/jit_kernel/timestep_embedding.py
  • python/sglang/multimodal_gen/runtime/layers/visual_embedding.py (outdated)
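
The review above highlights the fallback mechanism in visual_embedding.py. As a hedged illustration of what such a try-JIT-then-fallback pattern generally looks like (the import path and function names below are assumptions for illustration, not the PR's actual code):

```python
# Hedged sketch of a JIT-kernel-with-fallback pattern like the one the
# review describes for Timesteps in visual_embedding.py. The import path
# and names are assumptions, not the PR's actual code.
import torch

try:
    # Hypothetical export; this PR adds python/sglang/jit_kernel/timestep_embedding.py,
    # but the exact symbol it exposes is not shown in this conversation.
    from sglang.jit_kernel.timestep_embedding import timestep_embedding as _jit_embed
except ImportError:
    _jit_embed = None  # JIT kernel unavailable; use the default path

def embed_timesteps(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Use the JIT kernel on CUDA tensors when available, otherwise fall
    back to a pure-PyTorch implementation (e.g. timestep_embedding_ref
    from the sketch earlier in this thread)."""
    if _jit_embed is not None and t.is_cuda:
        return _jit_embed(t, dim)
    return timestep_embedding_ref(t, dim)
```

The value of this pattern, as the reviewer notes, is robustness: environments where the JIT kernel cannot compile or load still produce correct results through the default implementation.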

BBuf commented Jan 11, 2026

/tag-and-rerun-ci

mickqian (Collaborator) commented:

/rerun-failed-ci


BBuf commented Jan 11, 2026

/rerun-failed-ci


BBuf commented Jan 11, 2026

/rerun-failed-ci


BBuf commented Jan 12, 2026

/rerun-failed-ci


BBuf commented Jan 12, 2026

/tag-and-rerun-ci


BBuf commented Jan 12, 2026

/rerun-failed-ci

@BBuf requested a review from Fridge003 as a code owner on January 12, 2026 13:10
@github-actions bot added the dependencies (Pull requests that update a dependency file) label on Jan 12, 2026

BBuf commented Jan 12, 2026

/rerun-failed-ci


BBuf commented Jan 13, 2026

/tag-and-rerun-ci


BBuf commented Jan 14, 2026

/tag-and-rerun-ci


BBuf commented Jan 14, 2026

/rerun-failed-ci


BBuf commented Jan 14, 2026

/rerun-failed-ci


BBuf commented Jan 15, 2026

/rerun-failed-ci


BBuf commented Jan 15, 2026

/tag-and-rerun-ci

Updated version constraints for dependencies.
mickqian (Collaborator) commented:

/rerun-failed-ci


BBuf commented Jan 16, 2026

/rerun-failed-ci


BBuf commented Jan 16, 2026

/rerun-failed-ci


BBuf commented Jan 17, 2026

https://github.com/sgl-project/sglang/actions/runs/21016169961/job/60596188450?pr=16879 — all SGL Diffusion-related tests are passing, except for two tests on B200 that failed due to a cutedsl version issue. Since this change safely removes the time_embed kernel and its tests from sgl-kernel, we decided to merge it.

@BBuf merged commit 2cdd437 into main on Jan 17, 2026
252 of 304 checks passed
@BBuf deleted the move_diffusion_time_embedding_to_jit_kernel branch on January 17, 2026 04:21
michaelzhang-ai added a commit to michaelzhang-ai/sglang that referenced this pull request Jan 17, 2026

Labels

dependencies (Pull requests that update a dependency file), diffusion (SGLang Diffusion), quant (LLM Quantization), run-ci, sgl-kernel


3 participants