[diffusion] Remove default post-DiT offload in local mode #15573
mickqian merged 8 commits into sgl-project:main from
Conversation
Summary of Changes

Hello @ryang-max, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request streamlines the execution flow for diffusion models in local mode by removing a previously implemented optimization. That optimization offloaded transformer weights to the CPU after the denoising phase to conserve VRAM during subsequent VAE decoding. Its removal suggests a re-evaluation of its necessity and of its impact on performance and resource management.

Highlights
Code Review
This pull request removes the default behavior of offloading the DiT model to the CPU after the denoising loop when running in local mode. This was a memory optimization to reduce peak VRAM usage for the subsequent VAE decoding stage.
While this change may improve latency by avoiding the model transfer overhead, it will increase peak memory consumption, which could lead to out-of-memory errors for users on systems with limited VRAM.
I've suggested making this behavior optional via a new configuration flag rather than removing it entirely. This would provide flexibility for users to choose between lower latency or lower peak VRAM usage based on their hardware constraints.
Additionally, the pull request description is currently empty. It would be beneficial to add details about the motivation for this change and its expected impact on performance and memory usage.
python/sglang/multimodal_gen/runtime/pipelines_core/stages/denoising.py (757-769)
Instead of completely removing this functionality, consider making it optional via a new flag in ServerArgs (e.g., post_denoising_offload). This would align with the PR title of removing the default offload, while still providing a valuable VRAM optimization for users in memory-constrained environments.
I've also improved the exception handling for torch.cuda.empty_cache() to log any errors instead of silently passing.
```python
# In offline local mode (`sglang generate`), offload transformer weights to CPU
# after denoising to reduce peak VRAM during VAE decoding.
if (
    getattr(server_args, "post_denoising_offload", False)
    and current_platform.is_cuda_alike()
    and server_args.is_local_mode
):
    for model in (self.transformer, self.transformer_2):
        if model is not None:
            model.to("cpu")
    logger.info(
        "Offloaded denoiser transformer weights to CPU after denoising "
        "to reduce peak VRAM during VAE decoding."
    )
    try:
        torch.cuda.empty_cache()
    except Exception as e:
        logger.warning("Failed to empty CUDA cache after offloading: %s", e)
```
References

- Avoid catching broad exceptions like `Exception` and then silently passing. If an exception must be caught, it should be logged to aid in debugging.
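The guidance above can be sketched as a small reusable pattern. This is an illustrative example, not code from the repository; `run_cleanup` and the logger name are hypothetical:

```python
import logging

logger = logging.getLogger(__name__)


def run_cleanup(cleanup_fn) -> bool:
    """Run a best-effort cleanup step, logging failures instead of
    silently swallowing them. Returns True if cleanup succeeded."""
    try:
        cleanup_fn()
        return True
    except RuntimeError as e:  # prefer a narrow exception type over bare Exception
        logger.warning("Cleanup step failed: %s", e)
        return False
```

The caller can still proceed after a failed cleanup, but the warning leaves a trace in the logs for debugging.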
- Change return type from int to float for get_current_available_memory method.
- Update the get_current_available_memory method to provide a more accurate calculation of available GPU memory by considering both the CUDA driver view and the PyTorch allocator view.
- This reverts commit 05f3d45.
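The `get_current_available_memory` change combines two views of GPU memory: what the CUDA driver reports as free, and what PyTorch's caching allocator has reserved but is not currently using. A hypothetical pure-Python sketch of that combined calculation (the real method presumably reads these values via `torch.cuda.mem_get_info`, `torch.cuda.memory_reserved`, and `torch.cuda.memory_allocated`):

```python
def available_memory_gib(
    free_driver_bytes: int,
    reserved_bytes: int,
    allocated_bytes: int,
) -> float:
    """Estimate available GPU memory in GiB.

    Hypothetical helper illustrating the idea: memory the driver reports
    as free, plus memory the caching allocator has reserved but not
    allocated (reusable by PyTorch without asking the driver again).
    """
    reusable = reserved_bytes - allocated_bytes
    return (free_driver_bytes + reusable) / (1 << 30)
```

For example, with 2 GiB free at the driver level, 6 GiB reserved, and 5 GiB allocated, the estimate is 3 GiB, since 1 GiB of the reservation is reusable. Returning a float also preserves sub-GiB precision, matching the int-to-float change noted above.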
is it ready for review?

Yes

@ryang-max great, but we definitely want to follow up on this
…n_eagle3_dp * 'main' of https://github.com/sgl-project/sglang: (208 commits)

- MoE: Skip SiLU/GELU activation for masked experts (sgl-project#15539)
- [GLM-ASR] GLM-ASR Support (sgl-project#15570)
- Improve engine customization interface (sgl-project#15635)
- chore: bump sgl-kernel version to 0.3.20 (sgl-project#15590)
- bugfix[schedule]: Refactor sort method and add related UT (sgl-project#13576)
- Adjust wrong `mtp` meaning introduce by mimo (sgl-project#15632)
- Tiny add back missing router per attempt response metric (sgl-project#15621)
- Fix router gRPC mode launch error caused by async loading (sgl-project#15368)
- [model-gateway] return 503 when all workers are circuit-broken (sgl-project#15611)
- [Diffusion] Support peak memory record in offline generate and serving (sgl-project#15610)
- [VLM] Tiny: Unify VLM environment variables (sgl-project#15572)
- [diffusion] chore: remove default post-denoising dit offload in local mode (sgl-project#15573)
- Tiny enable soft watchdog in CI for stuck without logs (sgl-project#15616)
- Tiny add stuck simulation (sgl-project#15613)
- Support soft watchdog for tokenizer/detokenizer/dp-controller processes (sgl-project#15607)
- Tiny avoid EnvField misuse (sgl-project#15612)
- add decode round robin policy (sgl-project#15164)
- Add glm-4.6-fp8 with/without mtp in nightly ci (sgl-project#15566)
- Adapt fixture-kit to gsm8k mixin (sgl-project#15599)
- [model-gateway] add retry support to OpenAI router chat endpoint (sgl-project#15589)
- ...
Motivation

Remove the default DiT offload introduced in #15382, as it significantly increases generation time with little benefit in memory savings.
Modifications

Remove the default post-denoising DiT offload in local mode.
Accuracy Tests

Benchmarking and Profiling

Several seconds of generation-time improvement on Wan2.1-T2V-1.3B and models with a similarly sized DiT.

Checklist