
[Diffusion] Move diffusion time embedding to jit kernel#16879

Merged
BBuf merged 15 commits into main from move_diffusion_time_embedding_to_jit_kernel on Jan 17, 2026

Conversation

@BBuf (Collaborator) commented Jan 11, 2026

Tests

[Image: test results]

Performance

main:

sglang generate --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers --prompt "A curious raccoon peers through a vibrant field of yellow sunflowers, its eyes wide with interest." --warmup --perf-dump-path main.json

pr:

sglang generate --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers --prompt "A curious raccoon peers through a vibrant field of yellow sunflowers, its eyes wide with interest." --warmup --perf-dump-path pr.json

compare:

python3 /home/lmsys/bbuf/sglang/python/sglang/multimodal_gen/benchmarks/compare_perf.py main.json pr.json

Performance Comparison Report

1. High-level Summary

| Metric | Baseline | New | Diff | Status |
| --- | --- | --- | --- | --- |
| E2E Latency | 84927.21 ms | 81948.12 ms | -2979.09 ms (-3.5%) | |
| Throughput | 0.01 req/s | 0.01 req/s | - | - |

2. Stage Breakdown

| Stage Name | Baseline (ms) | New (ms) | Diff (ms) | Diff (%) | Status |
| --- | --- | --- | --- | --- | --- |
| InputValidationStage | 0.04 | 0.04 | -0.01 | -14.9% | ⚪️ |
| TextEncodingStage | 1310.39 | 1309.37 | -1.02 | -0.1% | ⚪️ |
| ConditioningStage | 0.01 | 0.01 | +0.00 | +15.5% | ⚪️ |
| TimestepPreparationStage | 0.34 | 0.25 | -0.08 | -24.9% | ⚪️ |
| LatentPreparationStage | 0.13 | 0.09 | -0.04 | -30.7% | ⚪️ |
| DenoisingStage | 77232.59 | 77706.47 | +473.87 | +0.6% | ⚪️ |
| DecodingStage | 6381.87 | 2930.39 | -3451.48 | -54.1% | 🟢 |

TextEncodingStage takes almost the same time, which is also expected: this PR does not modify any CUDA code on that path.
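For context, the stage comparison above can be reproduced from the two perf dumps. The sketch below shows only the arithmetic (per-stage diff and percentage); it assumes a hypothetical dump schema rather than the actual format written by --perf-dump-path, and it is not the real compare_perf.py.

```python
# Minimal sketch of a per-stage perf comparison. Assumes a hypothetical
# dump schema {"stages": {"DecodingStage": 6381.87, ...}}; the real files
# written by --perf-dump-path (and compare_perf.py itself) may differ.
import json
import sys

def load_stages(path: str) -> dict:
    with open(path) as f:
        return json.load(f)["stages"]  # hypothetical schema

def compare(baseline_path: str, new_path: str) -> None:
    base = load_stages(baseline_path)
    new = load_stages(new_path)
    for name, base_ms in base.items():
        if name not in new:
            continue
        diff_ms = new[name] - base_ms
        pct = 100.0 * diff_ms / base_ms if base_ms else 0.0
        print(f"{name}: {base_ms:.2f} ms -> {new[name]:.2f} ms "
              f"({diff_ms:+.2f} ms, {pct:+.1f}%)")

if __name__ == "__main__":
    compare(sys.argv[1], sys.argv[2])  # e.g. main.json pr.json
```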

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@github-actions bot added the quant (LLM Quantization), sgl-kernel, and diffusion (SGLang Diffusion) labels on Jan 11, 2026
@BBuf changed the title from "Move diffusion time embedding to jit kernel" to "[Diffusion] Move diffusion time embedding to jit kernel" on Jan 11, 2026
@gemini-code-assist (Contributor)

Summary of Changes

Hello @BBuf, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request migrates the CUDA implementation of the timestep_embedding operation to a JIT-compiled kernel. The refactoring improves the modularity of the codebase and yields notable performance gains, particularly in the DecodingStage of multimodal generation, contributing to a faster overall execution flow.

Highlights

  • JIT Kernel Implementation: The timestep_embedding function has been re-implemented as a JIT-compiled CUDA kernel in python/sglang/jit_kernel/csrc/diffusion/timestep_embedding.cuh, supporting fp16, bf16, and fp32 input types (a reference sketch of the embedding's semantics follows this list).
  • Python Integration: A new Python module python/sglang/jit_kernel/timestep_embedding.py is introduced to load and expose this JIT kernel, replacing the previous PyTorch extension.
  • Old Implementation Removal: The previous C++/CUDA implementation (sgl-kernel/csrc/sgl_diffusion/elementwise/timestep_embedding.cu) and its associated Python bindings and build configurations have been removed to streamline the codebase.
  • Dynamic Kernel Usage: The Timesteps class in python/sglang/multimodal_gen/runtime/layers/visual_embedding.py now dynamically attempts to use the new JIT kernel, falling back to the default implementation if it's unavailable.
  • Performance Improvement: Benchmarks indicate a significant 54.1% reduction in DecodingStage latency and an overall 3.5% improvement in End-to-End Latency due to this change.
  • Test Suite Updates: Test cases for timestep_embedding have been updated to reflect the new module structure and to support float32 input types, ensuring correctness with the new kernel.
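
For readers unfamiliar with the operation, here is a minimal pure-PyTorch sketch of the standard sinusoidal timestep embedding that a kernel like this computes. It is an illustration only: the exact conventions (cos/sin ordering, the max_period default, odd-dim handling) are assumptions based on the common diffusers-style formulation, not taken from this PR's source.

```python
# Hedged reference sketch (not the PR's actual kernel): the standard
# sinusoidal timestep embedding used in diffusion models, in pure PyTorch.
# Assumes cos-then-sin concatenation, max_period=10000, and an even dim;
# the JIT kernel's exact conventions may differ.
import torch

def timestep_embedding_ref(
    timesteps: torch.Tensor,      # 1-D tensor of timesteps, shape [N]
    dim: int,                     # embedding dimension (assumed even)
    max_period: float = 10000.0,
) -> torch.Tensor:
    half = dim // 2
    # Per-channel frequencies: exp(-ln(max_period) * i / half), i in [0, half)
    freqs = torch.exp(
        -torch.log(torch.tensor(max_period, device=timesteps.device))
        * torch.arange(half, dtype=torch.float32, device=timesteps.device)
        / half
    )
    args = timesteps.float()[:, None] * freqs[None, :]   # [N, half]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)  # [N, dim]
```

A JIT CUDA version would typically compute the same values with one thread per output element, doing the math in fp32 and casting to fp16/bf16 on store.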



@gemini-code-assist (Bot) left a comment


Code Review

This pull request successfully moves the diffusion time embedding to a JIT CUDA kernel, leading to significant performance improvements as shown in the benchmarks. The implementation is clean and the integration with the existing codebase is well-handled, including a fallback mechanism for robustness. I've provided a few suggestions to further enhance performance, improve API clarity, and increase code robustness. Overall, this is a great contribution.

Comment threads:
  • python/sglang/jit_kernel/csrc/diffusion/timestep_embedding.cuh
  • python/sglang/jit_kernel/tests/test_timestep_embedding.py
  • python/sglang/jit_kernel/timestep_embedding.py
  • python/sglang/multimodal_gen/runtime/layers/visual_embedding.py (outdated)
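
The review above highlights the fallback mechanism in visual_embedding.py. As a hedged illustration of what such a try-JIT-then-fallback pattern generally looks like (the import path and function names below are assumptions for illustration, not the PR's actual code):

```python
# Hedged sketch of a JIT-kernel-with-fallback pattern like the one the
# review describes for Timesteps in visual_embedding.py. The import path
# and names are assumptions, not the PR's actual code.
import torch

try:
    # Hypothetical export; this PR adds python/sglang/jit_kernel/timestep_embedding.py,
    # but the exact symbol it exposes is not shown in this conversation.
    from sglang.jit_kernel.timestep_embedding import timestep_embedding as _jit_embed
except ImportError:
    _jit_embed = None  # JIT kernel unavailable; use the default path

def embed_timesteps(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Use the JIT kernel on CUDA tensors when available, otherwise fall
    back to a pure-PyTorch implementation (e.g. timestep_embedding_ref
    from the sketch earlier in this thread)."""
    if _jit_embed is not None and t.is_cuda:
        return _jit_embed(t, dim)
    return timestep_embedding_ref(t, dim)
```

The value of this pattern, as the reviewer notes, is robustness: environments where the JIT kernel cannot compile or load still produce correct results through the default implementation.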

BBuf commented Jan 11, 2026

/tag-and-rerun-ci

mickqian (Collaborator) commented:

/rerun-failed-ci


BBuf commented Jan 11, 2026

/rerun-failed-ci


BBuf commented Jan 11, 2026

/rerun-failed-ci


BBuf commented Jan 12, 2026

/rerun-failed-ci


BBuf commented Jan 12, 2026

/tag-and-rerun-ci


BBuf commented Jan 12, 2026

/rerun-failed-ci

@BBuf requested a review from Fridge003 as a code owner on January 12, 2026 13:10
@github-actions bot added the dependencies (Pull requests that update a dependency file) label on Jan 12, 2026

BBuf commented Jan 12, 2026

/rerun-failed-ci


BBuf commented Jan 13, 2026

/tag-and-rerun-ci


BBuf commented Jan 14, 2026

/tag-and-rerun-ci


BBuf commented Jan 14, 2026

/rerun-failed-ci


BBuf commented Jan 14, 2026

/rerun-failed-ci


BBuf commented Jan 15, 2026

/rerun-failed-ci


BBuf commented Jan 15, 2026

/tag-and-rerun-ci

Updated version constraints for dependencies.
mickqian (Collaborator) commented:

/rerun-failed-ci


BBuf commented Jan 16, 2026

/rerun-failed-ci


BBuf commented Jan 16, 2026

/rerun-failed-ci


BBuf commented Jan 17, 2026

https://github.com/sgl-project/sglang/actions/runs/21016169961/job/60596188450?pr=16879 — all SGL Diffusion-related tests are passing, except for two tests on B200 that failed due to a cutedsl version issue. Since this change safely removes the time_embed kernel and its tests from sgl-kernel, we decided to merge it.

@BBuf merged commit 2cdd437 into main on Jan 17, 2026
252 of 304 checks passed
@BBuf deleted the move_diffusion_time_embedding_to_jit_kernel branch on January 17, 2026 04:21
michaelzhang-ai added a commit to michaelzhang-ai/sglang that referenced this pull request Jan 17, 2026

Labels

dependencies (Pull requests that update a dependency file), diffusion (SGLang Diffusion), quant (LLM Quantization), run-ci, sgl-kernel


3 participants