
[diffusion] Remove Default post dit offload in local mode #15573

Merged
mickqian merged 8 commits into sgl-project:main from ryang-max:diffusion-release-dit-transformer
Dec 22, 2025

Conversation

@ryang-max
Contributor

@ryang-max ryang-max commented Dec 21, 2025

Motivation

Remove the default DiT offload introduced in #15382, as it significantly increases generation time with little benefit in memory savings.

Modifications

Remove the default post-denoising DiT offload to CPU in local mode.

Accuracy Tests

Benchmarking and Profiling

Several seconds of generation-time improvement for Wan2.1-T2V-1.3B and models with a similarly sized DiT.

Checklist

@github-actions github-actions Bot added the diffusion SGLang Diffusion label Dec 21, 2025
@gemini-code-assist
Contributor

Summary of Changes

Hello @ryang-max, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request streamlines the execution flow for diffusion models in local mode by removing a previously implemented optimization. This optimization involved offloading transformer weights to the CPU after the denoising phase to conserve VRAM during subsequent VAE decoding. The removal of this step suggests a re-evaluation of its necessity or impact on performance and resource management.

Highlights

  • Removed Default Post-Denoising Offload: The pull request eliminates the default mechanism that offloaded diffusion model transformer weights to the CPU after the denoising step when running in local mode.
  • VRAM Optimization Removal: The removed code block was originally intended to reduce peak VRAM usage during VAE decoding by moving the transformer and transformer_2 models to CPU and clearing the CUDA cache (a minimal sketch of this removed block follows below).
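
For reference, a minimal sketch of what the removed step did, reconstructed from the description above (the helper name and standalone-function form are illustrative, not the actual code in denoising.py):

    import torch

    def offload_dit_after_denoising(transformer, transformer_2=None):
        """Sketch of the removed default offload: park DiT weights on CPU before VAE decoding."""
        for model in (transformer, transformer_2):
            if model is not None:
                model.to("cpu")  # free the GPU memory held by the denoiser weights
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # return freed allocator blocks to the driver

Dropping this step keeps the DiT resident on the GPU during VAE decoding, which avoids the CPU round-trip but raises peak VRAM, exactly the trade-off discussed in the review below.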


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request removes the default behavior of offloading the DiT model to the CPU after the denoising loop when running in local mode. This was a memory optimization to reduce peak VRAM usage for the subsequent VAE decoding stage.

While this change may improve latency by avoiding the model transfer overhead, it will increase peak memory consumption, which could lead to out-of-memory errors for users on systems with limited VRAM.

I've suggested making this behavior optional via a new configuration flag rather than removing it entirely. This would provide flexibility for users to choose between lower latency or lower peak VRAM usage based on their hardware constraints.

Additionally, the pull request description is currently empty. It would be beneficial to add details about the motivation for this change and its expected impact on performance and memory usage.

I am having trouble creating individual review comments, so my feedback is included inline below.

python/sglang/multimodal_gen/runtime/pipelines_core/stages/denoising.py (lines 757-769), severity: medium

Instead of completely removing this functionality, consider making it optional via a new flag in ServerArgs (e.g., post_denoising_offload). This would align with the PR title of removing the default offload, while still providing a valuable VRAM optimization for users in memory-constrained environments.

I've also improved the exception handling for torch.cuda.empty_cache() to log any errors instead of silently passing.

        # In offline local mode (`sglang generate`), offload transformer weights to CPU
        # after denoising to reduce peak VRAM during VAE decoding.
        if (
            getattr(server_args, "post_denoising_offload", False)
            and current_platform.is_cuda_alike()
            and server_args.is_local_mode
        ):
            for model in (self.transformer, self.transformer_2):
                if model is not None:
                    model.to("cpu")
            logger.info(
                "Offloaded denoiser transformer weights to CPU after denoising to reduce peak VRAM during VAE decoding."
            )
            try:
                torch.cuda.empty_cache()
            except Exception as e:
                logger.warning("Failed to empty CUDA cache after offloading: %s", e)
References
  1. Avoid catching broad exceptions like Exception and then silently passing. If an exception must be caught, it should be logged to aid in debugging.
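
If the flag route suggested in the review were taken, the wiring might look roughly like this; the field name post_denoising_offload comes from the suggestion above, and the ServerArgs shape shown is an illustrative subset rather than sglang's actual definition:

    import argparse
    from dataclasses import dataclass

    @dataclass
    class ServerArgs:  # illustrative subset; the real ServerArgs carries many more fields
        post_denoising_offload: bool = False  # hypothetical flag name from the review suggestion

        @staticmethod
        def add_cli_args(parser: argparse.ArgumentParser) -> None:
            parser.add_argument(
                "--post-denoising-offload",
                action="store_true",
                help="Offload DiT weights to CPU after denoising to lower peak VRAM "
                "during VAE decoding, at the cost of extra transfer latency.",
            )

With such a flag, the denoising stage could read server_args.post_denoising_offload and only offload when users opt in on memory-constrained hardware.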

ryang-max and others added 5 commits December 21, 2025 23:42
  • Change return type from int to float for the get_current_available_memory method.
  • Update the get_current_available_memory method to provide a more accurate calculation of available GPU memory by considering both the CUDA driver view and the PyTorch allocator view (a sketch of this calculation follows below).
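
A rough sketch of the calculation described in the second commit above; this is an assumption about how the driver and allocator views might be combined, not the actual get_current_available_memory implementation:

    import torch

    def get_current_available_memory() -> float:
        """Approximate free GPU memory in GiB from both the CUDA driver and the PyTorch allocator."""
        free_driver, _total = torch.cuda.mem_get_info()  # bytes reported free by the driver
        reserved = torch.cuda.memory_reserved()          # bytes cached by the PyTorch allocator
        allocated = torch.cuda.memory_allocated()        # bytes currently used by live tensors
        # Memory the allocator has cached but not handed out is reusable without asking the driver.
        return (free_driver + (reserved - allocated)) / (1 << 30)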
@mickqian
Collaborator

is it ready for review?

@ryang-max
Contributor Author

is it ready for review?

Yes

@mickqian
Collaborator

@ryang-max great, but we definitely want to follow up on this

@mickqian mickqian merged commit 575a49d into sgl-project:main Dec 22, 2025
60 of 64 checks passed
Liwansi added a commit to iforgetmyname/sglang that referenced this pull request Dec 23, 2025
…n_eagle3_dp

* 'main' of https://github.com/sgl-project/sglang: (208 commits)
  MoE: Skip SiLU/GELU activation for masked experts (sgl-project#15539)
  [GLM-ASR] GLM-ASR Support  (sgl-project#15570)
  Improve engine customization interface (sgl-project#15635)
  chore: bump sgl-kernel version to 0.3.20 (sgl-project#15590)
  bugfix[schedule]: Refactor sort method and add related UT (sgl-project#13576)
  Adjust wrong `mtp` meaning introduce by mimo (sgl-project#15632)
  Tiny add back missing router per attempt response metric (sgl-project#15621)
  Fix router gRPC mode launch error caused by async loading (sgl-project#15368)
  [model-gateway] return 503 when all workers are circuit-broken (sgl-project#15611)
  [Diffusion] Support peak memory record in offline generate and serving (sgl-project#15610)
  [VLM] Tiny: Unify VLM environment variables (sgl-project#15572)
  [diffusion] chore: remove default post-denoising dit offload in local mode (sgl-project#15573)
  Tiny enable soft watchdog in CI for stuck without logs (sgl-project#15616)
  Tiny add stuck simulation (sgl-project#15613)
  Support soft watchdog for tokenizer/detokenizer/dp-controller processes (sgl-project#15607)
  Tiny avoid EnvField misuse (sgl-project#15612)
  add decode round robin policy (sgl-project#15164)
  Add glm-4.6-fp8 with/without mtp in nightly ci (sgl-project#15566)
  Adapt fixture-kit to gsm8k mixin (sgl-project#15599)
  [model-gateway] add retry support to OpenAI router chat endpoint (sgl-project#15589)
  ...
jiaming1130 pushed a commit to zhuyijie88/sglang that referenced this pull request Dec 25, 2025
YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026
