
[diffusion] Remove Default post dit offload in local mode #15573

Merged
mickqian merged 8 commits into sgl-project:main from ryang-max:diffusion-release-dit-transformer
Dec 22, 2025

Conversation

@ryang-max
Contributor

@ryang-max ryang-max commented Dec 21, 2025

Motivation

Remove the default DiT offload introduced in #15382, as it significantly increases generation time with little benefit in memory savings.

Modifications

Remove the default post-denoising DiT offload to CPU in local mode.

Accuracy Tests

Benchmarking and Profiling

Several seconds of generation-time improvement for Wan2.1-T2V-1.3B and models with a similarly sized DiT.

Checklist

@github-actions github-actions Bot added the diffusion SGLang Diffusion label Dec 21, 2025
@gemini-code-assist
Contributor

Summary of Changes

Hello @ryang-max, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request streamlines the execution flow for diffusion models in local mode by removing a previously implemented optimization. This optimization involved offloading transformer weights to the CPU after the denoising phase to conserve VRAM during subsequent VAE decoding. The removal of this step suggests a re-evaluation of its necessity or impact on performance and resource management.

Highlights

  • Removed Default Post-Denoising Offload: The pull request eliminates the default mechanism that offloaded diffusion model transformer weights to the CPU after the denoising step when running in local mode.
  • VRAM Optimization Removal: The removed code block was originally intended to reduce peak VRAM usage during VAE decoding by moving the transformer and transformer_2 models to CPU and clearing the CUDA cache (a minimal sketch of this removed block follows below).
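
For reference, a minimal sketch of what the removed step did, reconstructed from the description above (the helper name and standalone-function form are illustrative, not the actual code in denoising.py):

    import torch

    def offload_dit_after_denoising(transformer, transformer_2=None):
        """Sketch of the removed default offload: park DiT weights on CPU before VAE decoding."""
        for model in (transformer, transformer_2):
            if model is not None:
                model.to("cpu")  # free the GPU memory held by the denoiser weights
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # return freed allocator blocks to the driver

Dropping this step keeps the DiT resident on the GPU during VAE decoding, which avoids the CPU round-trip but raises peak VRAM, exactly the trade-off discussed in the review below.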


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request removes the default behavior of offloading the DiT model to the CPU after the denoising loop when running in local mode. This was a memory optimization to reduce peak VRAM usage for the subsequent VAE decoding stage.

While this change may improve latency by avoiding the model transfer overhead, it will increase peak memory consumption, which could lead to out-of-memory errors for users on systems with limited VRAM.

I've suggested making this behavior optional via a new configuration flag rather than removing it entirely. This would provide flexibility for users to choose between lower latency or lower peak VRAM usage based on their hardware constraints.

Additionally, the pull request description is currently empty. It would be beneficial to add details about the motivation for this change and its expected impact on performance and memory usage.

I am having trouble creating individual review comments, so my feedback is included inline below.

python/sglang/multimodal_gen/runtime/pipelines_core/stages/denoising.py (lines 757-769), severity: medium

Instead of completely removing this functionality, consider making it optional via a new flag in ServerArgs (e.g., post_denoising_offload). This would align with the PR title of removing the default offload, while still providing a valuable VRAM optimization for users in memory-constrained environments.

I've also improved the exception handling for torch.cuda.empty_cache() to log any errors instead of silently passing.

        # In offline local mode (`sglang generate`), offload transformer weights to CPU
        # after denoising to reduce peak VRAM during VAE decoding.
        if (
            getattr(server_args, "post_denoising_offload", False)
            and current_platform.is_cuda_alike()
            and server_args.is_local_mode
        ):
            for model in (self.transformer, self.transformer_2):
                if model is not None:
                    model.to("cpu")
            logger.info(
                "Offloaded denoiser transformer weights to CPU after denoising to reduce peak VRAM during VAE decoding."
            )
            try:
                torch.cuda.empty_cache()
            except Exception as e:
                logger.warning("Failed to empty CUDA cache after offloading: %s", e)
References
  1. Avoid catching broad exceptions like Exception and then silently passing. If an exception must be caught, it should be logged to aid in debugging.
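
If the flag route suggested in the review were taken, the wiring might look roughly like this; the field name post_denoising_offload comes from the suggestion above, and the ServerArgs shape shown is an illustrative subset rather than sglang's actual definition:

    import argparse
    from dataclasses import dataclass

    @dataclass
    class ServerArgs:  # illustrative subset; the real ServerArgs carries many more fields
        post_denoising_offload: bool = False  # hypothetical flag name from the review suggestion

        @staticmethod
        def add_cli_args(parser: argparse.ArgumentParser) -> None:
            parser.add_argument(
                "--post-denoising-offload",
                action="store_true",
                help="Offload DiT weights to CPU after denoising to lower peak VRAM "
                "during VAE decoding, at the cost of extra transfer latency.",
            )

With such a flag, the denoising stage could read server_args.post_denoising_offload and only offload when users opt in on memory-constrained hardware.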

ryang-max and others added 5 commits December 21, 2025 23:42
  • Change return type from int to float for the get_current_available_memory method.
  • Update the get_current_available_memory method to provide a more accurate calculation of available GPU memory by considering both the CUDA driver view and the PyTorch allocator view (a sketch of this calculation follows below).
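
A rough sketch of the calculation described in the second commit above; this is an assumption about how the driver and allocator views might be combined, not the actual get_current_available_memory implementation:

    import torch

    def get_current_available_memory() -> float:
        """Approximate free GPU memory in GiB from both the CUDA driver and the PyTorch allocator."""
        free_driver, _total = torch.cuda.mem_get_info()  # bytes reported free by the driver
        reserved = torch.cuda.memory_reserved()          # bytes cached by the PyTorch allocator
        allocated = torch.cuda.memory_allocated()        # bytes currently used by live tensors
        # Memory the allocator has cached but not handed out is reusable without asking the driver.
        return (free_driver + (reserved - allocated)) / (1 << 30)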
@mickqian
Collaborator

is it ready for review?

@ryang-max
Contributor Author

is it ready for review?

Yes

@mickqian
Collaborator

@ryang-max great, but we definitely want to follow up on this

@mickqian mickqian merged commit 575a49d into sgl-project:main Dec 22, 2025
60 of 64 checks passed
Liwansi added a commit to iforgetmyname/sglang that referenced this pull request Dec 23, 2025
…n_eagle3_dp

* 'main' of https://github.com/sgl-project/sglang: (208 commits)
  MoE: Skip SiLU/GELU activation for masked experts (sgl-project#15539)
  [GLM-ASR] GLM-ASR Support  (sgl-project#15570)
  Improve engine customization interface (sgl-project#15635)
  chore: bump sgl-kernel version to 0.3.20 (sgl-project#15590)
  bugfix[schedule]: Refactor sort method and add related UT (sgl-project#13576)
  Adjust wrong `mtp` meaning introduce by mimo (sgl-project#15632)
  Tiny add back missing router per attempt response metric (sgl-project#15621)
  Fix router gRPC mode launch error caused by async loading (sgl-project#15368)
  [model-gateway] return 503 when all workers are circuit-broken (sgl-project#15611)
  [Diffusion] Support peak memory record in offline generate and serving (sgl-project#15610)
  [VLM] Tiny: Unify VLM environment variables (sgl-project#15572)
  [diffusion] chore: remove default post-denoising dit offload in local mode (sgl-project#15573)
  Tiny enable soft watchdog in CI for stuck without logs (sgl-project#15616)
  Tiny add stuck simulation (sgl-project#15613)
  Support soft watchdog for tokenizer/detokenizer/dp-controller processes (sgl-project#15607)
  Tiny avoid EnvField misuse (sgl-project#15612)
  add decode round robin policy (sgl-project#15164)
  Add glm-4.6-fp8 with/without mtp in nightly ci (sgl-project#15566)
  Adapt fixture-kit to gsm8k mixin (sgl-project#15599)
  [model-gateway] add retry support to OpenAI router chat endpoint (sgl-project#15589)
  ...
jiaming1130 pushed a commit to zhuyijie88/sglang that referenced this pull request Dec 25, 2025
YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026
