[diffusion][mps] fix vae model offload#20607

Merged
mickqian merged 2 commits into sgl-project:main from yeahdongcn:xd/mps_fix
Mar 18, 2026
Conversation

@yeahdongcn
Collaborator

Motivation

When running:

uv run sglang generate --model-path /home/dist/FLUX.1-dev \
    --prompt "A logo With Bold Large text: SGL Diffusion" \
    --save-output --warmup

the --warmup option does not work correctly on MPS.

Modifications

  1. Avoid offloading the VAE model after warmup
  2. Fix the VAE offload/reload logic
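
The two fixes can be sketched roughly as follows. This is a minimal, self-contained illustration with hypothetical names (`DecodingStageSketch`, `batch_is_warmup`), not the actual implementation in `stages/decoding.py`; the real code calls `torch.mps.synchronize()` where the comment indicates.

```python
# Sketch (hypothetical names) of the two fixes:
# 1) keep the VAE resident when the batch is a warmup run,
# 2) track loaded state so a later reload after a real offload works.

class DecodingStageSketch:
    def __init__(self, component_name="vae"):
        self.component_name = component_name
        self.model = None
        self.model_loaded = False

    def load_model(self):
        if not self.model_loaded:
            self.model = object()      # stand-in for VAELoader.load(...)
            self.model_loaded = True

    def offload_model(self):
        # On MPS, pending kernels must finish before the weights are freed,
        # otherwise deleting the model can hang the process.
        # (real code: torch.mps.synchronize())
        self.model = None
        self.model_loaded = False

    def forward(self, batch_is_warmup: bool):
        self.load_model()
        result = "decoded"             # stand-in for the VAE decode step
        if not batch_is_warmup:        # fix: skip offload during warmup
            self.offload_model()
        return result


stage = DecodingStageSketch()
stage.forward(batch_is_warmup=True)
assert stage.model_loaded              # VAE stays resident after warmup
stage.forward(batch_is_warmup=False)
assert not stage.model_loaded          # offloaded after a real request
```

With this shape, the first real request after warmup reuses the already-loaded VAE instead of paying a reload.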

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
@github-actions github-actions Bot added the diffusion SGLang Diffusion label Mar 15, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves an issue where the VAE model's offloading behavior was incorrect, particularly on MPS devices and during the warmup phase of diffusion pipelines. The changes ensure that the VAE model is properly managed, staying loaded when needed (like during warmup) and correctly offloaded/reloaded to prevent system hangs and optimize memory usage, thereby improving the stability and performance of the diffusion generation process.

Highlights

  • VAE Offloading Logic: The VAE model will no longer be offloaded immediately after warmup, ensuring it remains resident for subsequent real requests.
  • MPS Device Compatibility: Improved the VAE offload/reload mechanism specifically for MPS (Apple Silicon) devices, including a synchronization step to prevent hangs.
  • Dynamic Component Naming: The VAE component name is now dynamically passed and used within the DecodingStage for more flexible model management.


Changelog
  • python/sglang/multimodal_gen/runtime/pipelines_core/composed_pipeline_base.py
    • Modified add_standard_decoding_stage to pass the component_name to the DecodingStage constructor.
  • python/sglang/multimodal_gen/runtime/pipelines_core/stages/decoding.py
    • Added component_name parameter to the DecodingStage constructor and stored it as an instance variable.
    • Updated load_model to use the component_name for accessing model paths and tracking loaded status, and adjusted the VAELoader.load call to include component_name and transformers_or_diffusers arguments.
    • Modified offload_model to include torch.mps.synchronize() for MPS devices before deleting the VAE, and used self.component_name for pipeline module management and model loaded status.
    • Introduced a conditional check in the forward method to prevent offloading the VAE model if the current batch is identified as a warmup run.
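
The offload path the changelog describes can be sketched as below. All names here are illustrative, and `mps_synchronize` is a recording stand-in for where the real code calls `torch.mps.synchronize()`; the point is the ordering (synchronize first, then drop the component) and the use of a dynamic `component_name` instead of a hardcoded "vae".

```python
# Hypothetical sketch of offload_model after this PR: the component name is
# passed in rather than hardcoded, and on MPS a synchronize runs before the
# model is deleted.

synchronized = []

def mps_synchronize():
    # Stand-in for torch.mps.synchronize(); records that it was called.
    synchronized.append(True)

def offload_model(pipeline_modules, model_loaded, component_name, device):
    if device == "mps":
        # Wait for queued MPS kernels to finish before freeing the weights;
        # deleting tensors still referenced by in-flight work can hang.
        mps_synchronize()
    pipeline_modules.pop(component_name, None)   # drop only this component
    model_loaded[component_name] = False         # mark it for lazy reload


modules = {"vae": object(), "transformer": object()}
loaded = {"vae": True, "transformer": True}
offload_model(modules, loaded, component_name="vae", device="mps")
assert "vae" not in modules and not loaded["vae"]
assert loaded["transformer"]    # other components are untouched
assert synchronized             # synchronize ran before the deletion
```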

@yeahdongcn yeahdongcn marked this pull request as ready for review March 15, 2026 00:57

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request effectively addresses an issue with VAE model offloading on MPS devices by adding torch.mps.synchronize() before model deletion, a crucial step to prevent hangs. The changes also introduce a sensible optimization to skip offloading during warmup, and generalize the component handling by replacing a hardcoded "vae" string with a dynamic component_name. The implementation is solid. I have one minor suggestion to enhance maintainability.

Comment thread python/sglang/multimodal_gen/runtime/pipelines_core/stages/decoding.py Outdated
…oding.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

@mickqian mickqian left a comment


Some things feel strange to me:

  1. model_loaded stays in server_args
  2. do we really need this lazy load and offload?

@yhyang201
Collaborator

/tag-and-rerun-ci

@yeahdongcn
Collaborator Author

yeahdongcn commented Mar 15, 2026

  1. model_loaded stays in server_args

Yes, I noticed that as well.

  2. do we really need this lazy load and offload?

FastVideo appears to use the same pattern here: https://github.com/hao-ai-lab/FastVideo/blob/main/fastvideo/pipelines/stages/decoding.py#L242

Edit: However, FastVideo does not have a warmup process. After one forward pass, the model is offloaded.

@yhyang201
Collaborator

/rerun-failed-ci

2 similar comments
@yhyang201
Collaborator

/rerun-failed-ci

@yhyang201
Collaborator

/rerun-failed-ci

@yhyang201
Collaborator

@mickqian All CI (NVIDIA + AMD) passed and the PR is approved; ready for merge.

— SGLDHelper bot

@mickqian mickqian merged commit ead9d7a into sgl-project:main Mar 18, 2026
135 of 146 checks passed
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
0-693 pushed a commit to 0-693/sglang that referenced this pull request Mar 25, 2026
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
dutsc pushed a commit to dutsc/sglang that referenced this pull request Mar 30, 2026
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Labels

diffusion SGLang Diffusion run-ci


3 participants