[Diffusion] Support MOVA model#17704

Merged
mickqian merged 33 commits into sgl-project:main from CloudRipple:feat/mova
Jan 29, 2026

Conversation

@CloudRipple
Contributor

Motivation

This PR integrates MOVA (MOSS Video and Audio Synthesis), a foundation model for synchronized video-audio generation. MOVA features an asymmetric dual-tower architecture with a bidirectional cross-attention mechanism for modality fusion, enabling high-quality multilingual lip-sync and precise audio-visual alignment.

Modifications

MOVA Model and VAE Configurations:

  • Added MOVAVideoConfig and MOVAAudioConfig classes for the MOVA video and audio DiT models, including architecture details and parameter mappings (mova_video.py, mova_audio.py, __init__.py).
  • Introduced DacVAEConfig for the DAC audio VAE, with encoder/decoder and quantizer settings (dac.py, __init__.py).
  • Added MOVADualTowerConfig for dual-tower bridge model support (mova_dual_tower.py, __init__.py).

Pipeline and Sampling Configuration:

  • Implemented MOVAPipelineConfig for the MOVA pipeline, handling preprocessing, latent shape calculation, and normalization/denormalization for video and audio, with 360P and 720P variants (mova.py, __init__.py).
  • Added MOVASamplingParams, MOVA_360P_SamplingParams, and MOVA_720P_SamplingParams for MOVA-specific sampling parameters and supported resolutions (mova.py).
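
To illustrate how per-resolution sampling parameter variants might be organized, here is a minimal sketch using plain dataclasses. The class names, field names, and resolution values below are hypothetical placeholders chosen for illustration; they are not taken from mova.py and may differ from what the PR actually ships.

```python
from dataclasses import dataclass, field


@dataclass
class SketchSamplingParams:
    """Hypothetical base class; names and defaults are illustrative, not MOVA's."""

    height: int = 360
    width: int = 640
    # (width, height) pairs this variant accepts; placeholder values.
    supported_resolutions: list[tuple[int, int]] = field(default_factory=list)

    def validate(self) -> None:
        """Raise ValueError if the requested size is not supported."""
        if (self.width, self.height) not in self.supported_resolutions:
            raise ValueError(
                f"{self.width}x{self.height} not in {self.supported_resolutions}"
            )


@dataclass
class Sketch360PSamplingParams(SketchSamplingParams):
    # Assumed landscape and portrait sizes for a 360P variant.
    height: int = 360
    width: int = 640
    supported_resolutions: list[tuple[int, int]] = field(
        default_factory=lambda: [(640, 360), (360, 640)]
    )


@dataclass
class Sketch720PSamplingParams(SketchSamplingParams):
    # Assumed landscape and portrait sizes for a 720P variant.
    height: int = 720
    width: int = 1280
    supported_resolutions: list[tuple[int, int]] = field(
        default_factory=lambda: [(1280, 720), (720, 1280)]
    )
```

In this sketch, calling `validate()` on a variant rejects a size that belongs to the other variant, which is one plausible reason to keep the supported resolutions next to the defaults.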

Integration and Registry Updates:

  • Registered MOVA pipeline and sampling configs in the main registry, enabling automatic selection based on model identifiers (registry.py).
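
As a rough picture of identifier-based selection, the sketch below keeps a list of detector functions and returns the first config name whose detector matches the model identifier. Everything here (`register`, `resolve`, the config names, and the example identifiers) is hypothetical and is not the API of registry.py.

```python
from typing import Callable

# Hypothetical registry: (detector, config name) pairs, checked in order.
_REGISTRY: list[tuple[Callable[[str], bool], str]] = []


def register(detector: Callable[[str], bool], config_name: str) -> None:
    """Associate a detector on the lowercased model id with a config name."""
    _REGISTRY.append((detector, config_name))


def resolve(model_id: str) -> str:
    """Return the first registered config whose detector accepts model_id."""
    lowered = model_id.lower()
    for detector, config_name in _REGISTRY:
        if detector(lowered):
            return config_name
    raise KeyError(f"no config registered for {model_id!r}")


# Illustrative registrations; the matching rules are assumptions.
register(lambda m: "mova" in m and "720p" in m, "Sketch720PConfig")
register(lambda m: "mova" in m and "360p" in m, "Sketch360PConfig")
```

With these registrations, a made-up identifier such as `example-org/MOVA-720p` resolves to the 720P entry, while an unrelated identifier raises `KeyError`.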

Runtime Enhancements:

  • Updated the runtime generator to handle audio output and sample rate in the result objects, supporting MOVA's video+audio outputs (diffusion_generator.py).
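
One way to picture a result object that carries optional audio alongside video is sketched below. The class and field names are hypothetical and do not mirror the actual result types in diffusion_generator.py; the consistency check is an assumption about what a caller might want.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class SketchGenerationResult:
    """Hypothetical container for video frames plus optional audio."""

    frames: Sequence          # decoded video frames, in display order
    fps: int                  # video frame rate
    audio: Optional[Sequence[float]] = None   # mono samples, if any
    audio_sample_rate: Optional[int] = None   # samples per second

    def __post_init__(self) -> None:
        # Audio and its sample rate only make sense together.
        if (self.audio is None) != (self.audio_sample_rate is None):
            raise ValueError("audio and audio_sample_rate must be set together")

    def video_duration_seconds(self) -> float:
        return len(self.frames) / self.fps

    def audio_duration_seconds(self) -> Optional[float]:
        if self.audio is None:
            return None
        return len(self.audio) / self.audio_sample_rate
```

Keeping the sample rate on the result lets downstream code compare audio and video durations before writing a file, without needing access to the audio VAE config.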

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

gaoyang07 and others added 16 commits January 24, 2026 17:04
update mossVA pipeline

Support Open-Veo3 with SGLang Native SP (sgl-project#14)

* refactor: deduplicate MoVA code by reusing Attn/LN and enable context parallel

* fix: linting and code deduplication in MoVA models

feat: support serve in single node

fix: update adjust_frames parameter to False for improved multi-GPU compatibility

fix: expand negative prompt in MovaSamplingParams for improved image generation quality

feat: add additional parameters for MoVA configuration in server_sglang_mova.sh

refactor: deduplicate MoVA code by reusing Attn/LN and enable context parallel

fix: linting and code deduplication in MoVA models

update: expand negative_prompt in MovaSamplingParams for improved content filtering

refactor: replace AttentionModule with USPAttention and streamline MoVA pipeline stages

refactor: replace USPAttention with LocalAttention for cross-modal efficiency and enhance SP support in MoVA pipeline
…and revert necessary changes (sgl-project#20)

* refactor: enhance KL divergence method in DiagonalGaussianDistribution for flexible dimension handling and clean up DAC class by removing unused code

* refactor: update DacVAE architecture, configuration and its customized loader.

* Revert "fix: update adjust_frames parameter to False for improved multi-GPU compatibility"

* revert changes in base pipeline configs

* revert changes in configs/sample/__init__.py

* [Feature] Remove weight norm in DAC

* [Fix] Use legacy weight norm, which can be removed

* [Fix] remove weight norm at the right place

* [Chore] update test script

* Revert "[Fix] remove weight norm at the right place"

This reverts commit 3a0accbae41650e926c5828025323a12454827a4.

* Revert "[Fix] Use legacy weight norm, which can be removed"

This reverts commit eb93f20f134888adba4a5124fa1d167b93d180e7.

* Revert "[Feature] Remove weight norm in DAC"

This reverts commit aaa64abbc25112a706bf3d3604ffeac390a1d8a8.

* [Feature] Remove all weight norm from DAC modeling

---------

Co-authored-by: CloudRipple <yiyangzhang25@m.fudan.edu.cn>
* feat: add logging for missing and unexpected keys in Audio VAE loading

* feat: refactor MovaPipelineConfig with enhanced image preprocessing

- Change task_type from TI2V to T2V to reflect text-to-video focus
- Simplify model configs by replacing video_dit_config and video_dit2_config with single dit_config
- Add torch.nn.functional import for interpolation operations
- Refactor _center_crop_and_resize to support both PIL.Image and torch.Tensor
- Add proper tensor format handling, dtype conversion, and channel dimension management
- Replace PIL resize with F.interpolate for consistent tensor operations
- Update get_latent_shape to use dit_config instead of video_dit_config
- Add new Mova360PConfig class for 360p resolution configuration

* revert: remove redundant '--pipeline-class-name' argument
* [Feature] Replace custom sequential and mlp with sglang MLP

* [Fix] param mapping after using sglang MLP implementation

* [Fix] [TODO] a temporary fix for some weird sglang error

* [Fix] modeling fix after changing MLP implementation

* [Feature] modify audio tower implementation to use sglang-native MLP

* [Revert] revert: commit e095f8c4c
This reverts commit e095f8c4c.
The issue addressed by this commit has already been resolved elsewhere.

---------

Co-authored-by: Ruixiao Li <cgruixiao@outlook.com>
- Replace MovaDiTLoader with TransformerLoader and AudioTransformerLoader for video and audio DIT modules.
- Simplify initialization of frequency parameters in WanAudioModel and WanModel classes.
- Remove unused imports and redundant code in model definitions.
- Update test script to reflect changes in model paths and configurations.
* [Fix] [TODO] a temporary fix for some weird sglang error

* [Revert] revert: commit e095f8c4c
This reverts commit e095f8c4c.
The issue addressed by this commit has already been resolved elsewhere.

* [Feat] add basic TP functionality

* [Feat] add tp parameter in the mova test script

* [Chore] unify code style

* scripts: update tp test scripts

* chore: format code

---------

Co-authored-by: cms42 <c@cms42.top>
* feat: mossva bridge & TP support

* chore: remove dead code & format code
Co-authored-by: gaoyang07 <Gary1546308416AL@gmail.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello @CloudRipple, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the system's capabilities by integrating the MOVA model, enabling the generation of synchronized video and audio content. The changes encompass the addition of new model and VAE configurations, the development of a specialized pipeline to orchestrate the generation process, and enhancements to the runtime to manage and output multimodal results. The implementation also includes advanced features like Sequence Parallelism for improved performance and a flexible component loading mechanism to accommodate MOVA's complex architecture.

Highlights

  • MOVA Model Integration: Introduced comprehensive support for the MOVA (MOSS Video and Audio Synthesis) model, a foundation model for synchronized video-audio generation, including its asymmetric dual-tower architecture and bidirectional cross-attention mechanism.
  • New Configurations: Added dedicated configuration classes for MOVA's video and audio Diffusion Transformer (DiT) models (MOVAVideoConfig, MOVAAudioConfig), the DAC audio VAE (DacVAEConfig), and the dual-tower bridge model (MOVADualTowerConfig).
  • Pipeline and Sampling: Implemented MOVAPipelineConfig to manage preprocessing, latent shape calculation, and normalization/denormalization for video and audio, along with MOVASamplingParams for MOVA-specific sampling, including 360P and 720P variants.
  • Runtime Enhancements: Updated the runtime generator and utility functions to correctly handle and save both video and audio outputs, including muxing video with generated audio using FFmpeg.
  • Modular Component Loading: Refactored component loaders (VAELoader, TransformerLoader) into more granular base classes (BaseVAELoader, BaseTransformerLoader) and specialized loaders (AudioVAELoader, AudioTransformerLoader, BridgeLoader) to support MOVA's distinct video, audio, and bridge components.
  • Sequence Parallelism Support: Integrated Sequence Parallelism (SP) into the MOVA denoising stage, allowing video and audio latents to be sharded across SP ranks for distributed attention and efficient processing.
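
The FFmpeg muxing step mentioned in the highlights can be sketched as a small command builder. The helper name and file paths are hypothetical, and the PR's own invocation may use different options; the flags themselves (`-map`, `-c:v copy`, `-c:a aac`, `-shortest`) are standard FFmpeg options.

```python
import shlex


def build_mux_command(video_path: str, audio_path: str, output_path: str) -> list[str]:
    """Build an ffmpeg argv that combines one video and one audio stream."""
    return [
        "ffmpeg",
        "-y",                 # overwrite the output file if it exists
        "-i", video_path,     # input 0: silent video
        "-i", audio_path,     # input 1: generated audio
        "-map", "0:v:0",      # take the first video stream from input 0
        "-map", "1:a:0",      # take the first audio stream from input 1
        "-c:v", "copy",       # do not re-encode the video
        "-c:a", "aac",        # encode audio as AAC for MP4 containers
        "-shortest",          # stop at the end of the shorter stream
        output_path,
    ]


if __name__ == "__main__":
    cmd = build_mux_command("video.mp4", "audio.wav", "muxed.mp4")
    print(shlex.join(cmd))
```

On a machine with FFmpeg installed, the returned list can be passed to `subprocess.run(cmd, check=True)`; building the argv as a list avoids shell quoting issues with unusual file names.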

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces comprehensive support for the MOVA model, including its dual-tower architecture for synchronized video and audio generation. The changes are extensive, adding new model configurations, pipeline stages, and runtime implementations for the video DiT, audio DiT, and the interaction bridge. The integration also includes necessary updates to the component loader, schedulers, and data structures to handle audio output.

The overall structure is well-designed, leveraging existing patterns in the codebase while introducing new abstractions like BaseVAELoader and BaseTransformerLoader to handle the new model components. The implementation of sequence parallelism for the DiT models is a key feature for performance.
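
The idea of sharding a latent sequence across sequence-parallel ranks can be shown with a toy example on plain Python lists. This is a conceptual sketch only: the function names are hypothetical, and the padding strategy is an assumption, since the PR description does not say whether the MOVA denoising stage pads the sequence or requires it to divide evenly.

```python
def shard_sequence(tokens: list, world_size: int, rank: int, pad_value=None) -> list:
    """Return the contiguous chunk of `tokens` owned by `rank`.

    The sequence is padded at the end so every rank gets the same length.
    """
    chunk = -(-len(tokens) // world_size)  # ceiling division
    padded = tokens + [pad_value] * (chunk * world_size - len(tokens))
    return padded[rank * chunk:(rank + 1) * chunk]


def gather_sequence(shards: list, original_len: int) -> list:
    """Concatenate per-rank shards in rank order and drop the padding."""
    flat = [token for shard in shards for token in shard]
    return flat[:original_len]


if __name__ == "__main__":
    latents = list(range(10))  # stand-in for 10 latent tokens
    shards = [shard_sequence(latents, world_size=4, rank=r) for r in range(4)]
    assert gather_sequence(shards, len(latents)) == latents
```

In a real sequence-parallel setup each rank would hold only its own shard and attention would involve communication between ranks; the sketch covers just the split and gather bookkeeping.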

My review focuses on a few areas for improvement:

  • A bug in image preprocessing logic.
  • A removed assertion that could impact robustness.
  • Code clarity and maintainability, particularly regarding non-English comments and opportunities for refactoring.
  • Encapsulation of debugging/profiling code.

The changes are substantial and well-executed. Addressing these points will further enhance the quality and maintainability of this new feature.

Comment thread python/sglang/multimodal_gen/configs/pipeline_configs/mova.py Outdated
Comment thread python/sglang/multimodal_gen/runtime/loader/component_loader.py
Comment thread python/sglang/multimodal_gen/runtime/models/dits/mova_audio_dit.py Outdated
Comment thread python/sglang/multimodal_gen/runtime/models/dits/mova_audio_dit.py Outdated
Comment thread python/sglang/multimodal_gen/runtime/models/model_stages/mova.py Outdated
Comment thread python/sglang/multimodal_gen/runtime/models/dits/mova_video_dit.py Outdated
@github-actions github-actions Bot added the documentation, quant, dependencies, Multi-modal, deepseek, and model-gateway labels on Jan 27, 2026
Comment thread python/sglang/multimodal_gen/configs/models/bridges/mova_dual_tower.py Outdated
Comment thread python/sglang/multimodal_gen/runtime/loader/component_loader.py Outdated
Comment thread python/sglang/multimodal_gen/runtime/loader/component_loader.py
Comment thread python/sglang/multimodal_gen/runtime/models/bridges/mova_dual_tower.py Outdated
Comment thread python/sglang/multimodal_gen/registry.py Outdated
@mickqian
Collaborator

/tag-and-rerun-ci

@mickqian mickqian merged commit 09a9147 into sgl-project:main Jan 29, 2026
170 of 179 checks passed
charlesHsuGG pushed a commit to charlesHsuGG/sglang that referenced this pull request Jan 30, 2026
Co-authored-by: gaoyang07 <Gary1546308416AL@gmail.com>
Co-authored-by: cms42 <c@cms42.top>
Co-authored-by: cms42 <44895820+cms42@users.noreply.github.com>
Co-authored-by: Ruixiao Li <cgruixiao@outlook.com>
Co-authored-by: Li Ruixiao(SII) <80368770+Li-dongyang@users.noreply.github.com>
raise RuntimeError("Scheduler is not initialized; call set_timesteps() first")
self._refresh_pair_cache()

def set_pair_postprocess_by_name(self, name: str | None, **kwargs):
Collaborator

Please clean up the code.

Labels

deepseek, dependencies, diffusion, documentation, model-gateway, Multi-modal, npu, quant, run-ci

7 participants