[diffusion][mps] fix vae model offload#20607

Merged
mickqian merged 2 commits into sgl-project:main from yeahdongcn:xd/mps_fix
Mar 18, 2026
Conversation

@yeahdongcn
Collaborator

Motivation

When running:

uv run sglang generate --model-path /home/dist/FLUX.1-dev \
    --prompt "A logo With Bold Large text: SGL Diffusion" \
    --save-output --warmup

the --warmup option does not work correctly on MPS.

Modifications

  1. Avoid offloading the VAE model after warmup
  2. Fix the VAE offload/reload logic
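
The two fixes can be sketched roughly as follows. This is a minimal, self-contained illustration with hypothetical names (`DecodingStageSketch`, `batch_is_warmup`), not the actual implementation in `stages/decoding.py`; the real code calls `torch.mps.synchronize()` where the comment indicates.

```python
# Sketch (hypothetical names) of the two fixes:
# 1) keep the VAE resident when the batch is a warmup run,
# 2) track loaded state so a later reload after a real offload works.

class DecodingStageSketch:
    def __init__(self, component_name="vae"):
        self.component_name = component_name
        self.model = None
        self.model_loaded = False

    def load_model(self):
        if not self.model_loaded:
            self.model = object()      # stand-in for VAELoader.load(...)
            self.model_loaded = True

    def offload_model(self):
        # On MPS, pending kernels must finish before the weights are freed,
        # otherwise deleting the model can hang the process.
        # (real code: torch.mps.synchronize())
        self.model = None
        self.model_loaded = False

    def forward(self, batch_is_warmup: bool):
        self.load_model()
        result = "decoded"             # stand-in for the VAE decode step
        if not batch_is_warmup:        # fix: skip offload during warmup
            self.offload_model()
        return result


stage = DecodingStageSketch()
stage.forward(batch_is_warmup=True)
assert stage.model_loaded              # VAE stays resident after warmup
stage.forward(batch_is_warmup=False)
assert not stage.model_loaded          # offloaded after a real request
```

With this shape, the first real request after warmup reuses the already-loaded VAE instead of paying a reload.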

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
@github-actions github-actions Bot added the diffusion SGLang Diffusion label Mar 15, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves an issue where the VAE model's offloading behavior was incorrect, particularly on MPS devices and during the warmup phase of diffusion pipelines. The changes ensure that the VAE model is properly managed, staying loaded when needed (like during warmup) and correctly offloaded/reloaded to prevent system hangs and optimize memory usage, thereby improving the stability and performance of the diffusion generation process.

Highlights

  • VAE Offloading Logic: The VAE model will no longer be offloaded immediately after warmup, ensuring it remains resident for subsequent real requests.
  • MPS Device Compatibility: Improved the VAE offload/reload mechanism specifically for MPS (Apple Silicon) devices, including a synchronization step to prevent hangs.
  • Dynamic Component Naming: The VAE component name is now dynamically passed and used within the DecodingStage for more flexible model management.


Changelog
  • python/sglang/multimodal_gen/runtime/pipelines_core/composed_pipeline_base.py
    • Modified add_standard_decoding_stage to pass the component_name to the DecodingStage constructor.
  • python/sglang/multimodal_gen/runtime/pipelines_core/stages/decoding.py
    • Added component_name parameter to the DecodingStage constructor and stored it as an instance variable.
    • Updated load_model to use the component_name for accessing model paths and tracking loaded status, and adjusted the VAELoader.load call to include component_name and transformers_or_diffusers arguments.
    • Modified offload_model to include torch.mps.synchronize() for MPS devices before deleting the VAE, and used self.component_name for pipeline module management and model loaded status.
    • Introduced a conditional check in the forward method to prevent offloading the VAE model if the current batch is identified as a warmup run.
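
The offload path the changelog describes can be sketched as below. All names here are illustrative, and `mps_synchronize` is a recording stand-in for where the real code calls `torch.mps.synchronize()`; the point is the ordering (synchronize first, then drop the component) and the use of a dynamic `component_name` instead of a hardcoded "vae".

```python
# Hypothetical sketch of offload_model after this PR: the component name is
# passed in rather than hardcoded, and on MPS a synchronize runs before the
# model is deleted.

synchronized = []

def mps_synchronize():
    # Stand-in for torch.mps.synchronize(); records that it was called.
    synchronized.append(True)

def offload_model(pipeline_modules, model_loaded, component_name, device):
    if device == "mps":
        # Wait for queued MPS kernels to finish before freeing the weights;
        # deleting tensors still referenced by in-flight work can hang.
        mps_synchronize()
    pipeline_modules.pop(component_name, None)   # drop only this component
    model_loaded[component_name] = False         # mark it for lazy reload


modules = {"vae": object(), "transformer": object()}
loaded = {"vae": True, "transformer": True}
offload_model(modules, loaded, component_name="vae", device="mps")
assert "vae" not in modules and not loaded["vae"]
assert loaded["transformer"]    # other components are untouched
assert synchronized             # synchronize ran before the deletion
```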

@yeahdongcn yeahdongcn marked this pull request as ready for review March 15, 2026 00:57

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request effectively addresses an issue with VAE model offloading on MPS devices by adding torch.mps.synchronize() before model deletion, a crucial step to prevent hangs. The changes also introduce a sensible optimization to skip offloading during warmup, and generalize the component handling by replacing a hardcoded "vae" string with a dynamic component_name. The implementation is solid. I have one minor suggestion to enhance maintainability.

Comment thread python/sglang/multimodal_gen/runtime/pipelines_core/stages/decoding.py Outdated
…oding.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

@mickqian mickqian left a comment


Some things feel strange to me:

  1. model_loaded stays in server_args
  2. do we really need this lazy load and offload?

@yhyang201
Collaborator

/tag-and-rerun-ci

@yeahdongcn
Collaborator Author

yeahdongcn commented Mar 15, 2026

  1. model_loaded stays in server_args

Yes, I noticed that as well.

  2. do we really need this lazy load and offload?

FastVideo appears to use the same pattern here: https://github.com/hao-ai-lab/FastVideo/blob/main/fastvideo/pipelines/stages/decoding.py#L242

Edit: However, FastVideo does not have a warmup process. After one forward pass, the model is offloaded.

@yhyang201
Collaborator

/rerun-failed-ci

2 similar comments
@yhyang201
Collaborator

/rerun-failed-ci

@yhyang201
Collaborator

/rerun-failed-ci

@yhyang201
Collaborator

@mickqian All CI (NVIDIA + AMD) passed and the PR is approved; ready for merge.

— SGLDHelper bot

@mickqian mickqian merged commit ead9d7a into sgl-project:main Mar 18, 2026
135 of 146 checks passed
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
0-693 pushed a commit to 0-693/sglang that referenced this pull request Mar 25, 2026
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
dutsc pushed a commit to dutsc/sglang that referenced this pull request Mar 30, 2026
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Labels

diffusion SGLang Diffusion run-ci


3 participants