[Diffusion] Support MOVA model by CloudRipple · Pull Request #17704 · sgl-project/sglang

CloudRipple · 2026-01-25T09:50:52Z

Motivation

This PR integrates MOVA (MOSS Video and Audio Synthesis), a foundation model for synchronized video-audio generation. MOVA features an asymmetric dual-tower architecture with a bidirectional cross-attention mechanism for modality fusion, enabling high-quality multilingual lip-sync and precise audio-visual alignment.

Modifications

MOVA Model and VAE Configurations:

Added MOVAVideoConfig and MOVAAudioConfig classes for MOVA video and audio DiT models, including architecture details and parameter mappings (mova_video.py, mova_audio.py, __init__.py).
Introduced DacVAEConfig for the DAC audio VAE, with encoder/decoder and quantizer settings (dac.py, __init__.py).
Added MOVADualTowerConfig for dual-tower bridge model support (mova_dual_tower.py, __init__.py).
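
For readers unfamiliar with the "parameter mappings" mentioned above: in configs of this kind they are commonly regex rules that rename checkpoint keys to the names the runtime modules expect. The following is a minimal sketch under that assumption. The rule patterns, the target names (`to_q`, `fc_in`, `fc_out`), and the helper functions are hypothetical and are not taken from this PR.

```python
import re

# Hypothetical mapping rules: regex pattern -> replacement template.
# The real MOVA configs define their own rules; these are illustrative only.
PARAM_NAMES_MAPPING = {
    r"^blocks\.(\d+)\.self_attn\.q\.(weight|bias)$": r"blocks.\1.to_q.\2",
    r"^blocks\.(\d+)\.ffn\.0\.(weight|bias)$": r"blocks.\1.ffn.fc_in.\2",
    r"^blocks\.(\d+)\.ffn\.2\.(weight|bias)$": r"blocks.\1.ffn.fc_out.\2",
}


def map_param_name(name: str, mapping: dict[str, str]) -> str:
    """Return the renamed key for the first matching rule, else the name unchanged."""
    for pattern, replacement in mapping.items():
        if re.match(pattern, name):
            return re.sub(pattern, replacement, name)
    return name


def remap_state_dict(state_dict: dict, mapping: dict[str, str]) -> dict:
    """Rename every key of a checkpoint-like dict; values are passed through."""
    return {map_param_name(k, mapping): v for k, v in state_dict.items()}
```

Keys that match no rule are left untouched, so unrelated weights load under their original names.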

Pipeline and Sampling Configuration:

Implemented MOVAPipelineConfig for the MOVA pipeline, handling preprocessing, latent shape calculation, and normalization/denormalization for video and audio, with 360P and 720P variants (mova.py, __init__.py).
Added MOVASamplingParams, MOVA_360P_SamplingParams, and MOVA_720P_SamplingParams for MOVA-specific sampling parameters and supported resolutions (mova.py).
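
To make "latent shape calculation" concrete, here is a rough sketch of how such a calculation usually looks for a causal video VAE paired with a hop-based audio VAE. Every number below (channel counts, temporal and spatial strides, hop length) is an assumption chosen for illustration, and `LatentGeometry` and both functions are hypothetical names. The actual values live in the MOVA configs.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentGeometry:
    # All values are illustrative assumptions, not MOVA's actual settings.
    video_channels: int = 16
    temporal_stride: int = 4
    spatial_stride: int = 8
    audio_channels: int = 128
    audio_hop_length: int = 2048


def video_latent_shape(g: LatentGeometry, num_frames: int, height: int, width: int):
    """(C, T, H, W) for a causal VAE that keeps the first frame and packs the rest."""
    if (num_frames - 1) % g.temporal_stride != 0:
        raise ValueError("num_frames must be of the form k * temporal_stride + 1")
    if height % g.spatial_stride or width % g.spatial_stride:
        raise ValueError("height and width must be multiples of spatial_stride")
    t = (num_frames - 1) // g.temporal_stride + 1
    return (g.video_channels, t, height // g.spatial_stride, width // g.spatial_stride)


def audio_latent_shape(g: LatentGeometry, num_samples: int):
    """(C, L): one latent frame per hop, rounding up to cover the tail."""
    return (g.audio_channels, math.ceil(num_samples / g.audio_hop_length))
```

Under these assumed strides, 81 frames at 352x640 map to a latent of shape (16, 21, 44, 80).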

Integration and Registry Updates:

Registered MOVA pipeline and sampling configs in the main registry, enabling automatic selection based on model identifiers (registry.py).
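
Selection by model identifier can be pictured as a list of detector functions that are tried in order. This is only a sketch of the general pattern: `PipelineEntry`, `register`, `resolve`, the entry names, and the example model identifiers are all hypothetical and do not reflect the actual contents of registry.py.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineEntry:
    name: str
    detector: Callable[[str], bool]  # decides whether a model identifier matches


_REGISTRY: list[PipelineEntry] = []


def register(name: str, detector: Callable[[str], bool]) -> None:
    """Append an entry; earlier registrations take priority during lookup."""
    _REGISTRY.append(PipelineEntry(name, detector))


def resolve(model_id: str) -> str:
    """Return the name of the first registered entry whose detector accepts model_id."""
    for entry in _REGISTRY:
        if entry.detector(model_id):
            return entry.name
    raise KeyError(f"no pipeline config registered for {model_id!r}")


# Hypothetical detectors keyed on substrings of the model identifier.
register("mova-720p", lambda mid: "mova" in mid.lower() and "720p" in mid.lower())
register("mova-360p", lambda mid: "mova" in mid.lower() and "360p" in mid.lower())
```

Matching on a lowercase substring keeps the lookup tolerant of differences in capitalization between local paths and hub identifiers.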

Runtime Enhancements:

Updated the runtime generator to handle audio output and sample rate in the result objects, supporting MOVA's video+audio outputs (diffusion_generator.py).
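
As an illustration of what carrying audio and a sample rate in a result object implies, the sketch below pairs an optional audio track with its sample rate and encodes it as 16-bit PCM WAV using only the standard library. `GenerationResult`, its field names, and `audio_to_wav_bytes` are hypothetical. The real result type in diffusion_generator.py may be shaped differently.

```python
import io
import struct
import wave
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class GenerationResult:
    # Hypothetical result object: field names are illustrative, not SGLang's.
    frames: list
    audio: Optional[Sequence[float]] = None   # mono samples in [-1.0, 1.0]
    audio_sample_rate: Optional[int] = None   # Hz; required whenever audio is set


def audio_to_wav_bytes(result: GenerationResult) -> Optional[bytes]:
    """Encode the audio track as 16-bit PCM WAV, or return None for video-only results."""
    if result.audio is None:
        return None
    if not result.audio_sample_rate:
        raise ValueError("audio is present but audio_sample_rate is missing")
    # Clamp to [-1, 1] before scaling so out-of-range samples do not wrap around.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in result.audio]
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(result.audio_sample_rate)
        wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return buf.getvalue()
```

Keeping the sample rate next to the samples means downstream code that saves or muxes the audio does not have to guess it.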

Accuracy Tests

Benchmarking and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review Process

Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
After green CI and required approvals, ask Merge Oncalls to merge.

update mossVA pipeline

Support Open-Veo3 with SGLang Native SP (sgl-project#14)
* refactor: deduplicate MoVA code by reusing Attn/LN and enable context parallel
* fix: linting and code deduplication in MoVA models

feat: support serve in single node
fix: update adjust_frames parameter to False for improved multi-GPU compatibility
fix: expand negative prompt in MovaSamplingParams for improved image generation quality
feat: add additional parameters for MoVA configuration in server_sglang_mova.sh
refactor: deduplicate MoVA code by reusing Attn/LN and enable context parallel
fix: linting and code deduplication in MoVA models
update: expand negative_prompt in MovaSamplingParams for improved content filtering
refactor: replace AttentionModule with USPAttention and streamline MoVA pipeline stages
refactor: replace USPAttention with LocalAttention for cross-modal efficiency and enhance SP support in MoVA pipeline

…and revert necessary changes (sgl-project#20)
* refactor: enhance KL divergence method in DiagonalGaussianDistribution for flexible dimension handling and clean up DAC class by removing unused code
* refactor: update DacVAE architecture, configuration and its customized loader.
* Revert "fix: update adjust_frames parameter to False for improved multi-GPU compatibility"
* revert changes in base pipeline configs
* revert changes in configs/sample/__init__.py
* [Feature] Remove weight norm in DAC
* [Fix] Use legacy weight norm, which can be removed
* [Fix] remove weight norm at the right place
* [Chore] update test script
* Revert "[Fix] remove weight norm at the right place" This reverts commit 3a0accbae41650e926c5828025323a12454827a4.
* Revert "[Fix] Use legacy weight norm, which can be removed" This reverts commit eb93f20f134888adba4a5124fa1d167b93d180e7.
* Revert "[Feature] Remove weight norm in DAC" This reverts commit aaa64abbc25112a706bf3d3604ffeac390a1d8a8.
* [Feature] Remove all weight norm from DAC modeling
---------
Co-authored-by: CloudRipple <yiyangzhang25@m.fudan.edu.cn>

* feat: add logging for missing and unexpected keys in Audio VAE loading
* feat: refactor MovaPipelineConfig with enhanced image preprocessing
  - Change task_type from TI2V to T2V to reflect text-to-video focus
  - Simplify model configs by replacing video_dit_config and video_dit2_config with single dit_config
  - Add torch.nn.functional import for interpolation operations
  - Refactor _center_crop_and_resize to support both PIL.Image and torch.Tensor
  - Add proper tensor format handling, dtype conversion, and channel dimension management
  - Replace PIL resize with F.interpolate for consistent tensor operations
  - Update get_latent_shape to use dit_config instead of video_dit_config
  - Add new Mova360PConfig class for 360p resolution configuration
* revert: remove redundant '--pipeline-class-name' argument

* [Feature] Replace custom sequential and mlp with sglang MLP
* [Fix] param mapping after using sglang MLP implementation
* [Fix] [TODO] a temporary fix for some weird sglang error
* [Fix] modeling fix after changing MLP implementation
* [Feature] modify audio tower implementation to use sglang-native MLP
* [Revert] revert: commit e095f8c4c This reverts commit e095f8c4c. The issue addressed by this commit has already been resolved elsewhere.
---------
Co-authored-by: Ruixiao Li <cgruixiao@outlook.com>

- Replace MovaDiTLoader with TransformerLoader and AudioTransformerLoader for video and audio DIT modules.
- Simplify initialization of frequency parameters in WanAudioModel and WanModel classes.
- Remove unused imports and redundant code in model definitions.
- Update test script to reflect changes in model paths and configurations.

* [Fix] [TODO] a temporary fix for some weird sglang error
* [Revert] revert: commit e095f8c4c This reverts commit e095f8c4c. The issue addressed by this commit has already been resolved elsewhere.
* [Feat] add basic TP functionality
* [Feat] add tp parameter in the mova test script
* [Chore] unify code style
* scripts: update tp test scripts
* chore: format code
---------
Co-authored-by: cms42 <c@cms42.top>

* feat: mossva bridge & TP support
* chore: remove dead code & format code

Co-authored-by: gaoyang07 <Gary1546308416AL@gmail.com>

gemini-code-assist · 2026-01-25T09:52:01Z

Summary of Changes

Hello @CloudRipple, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the system's capabilities by integrating the MOVA model, enabling the generation of synchronized video and audio content. The changes encompass the addition of new model and VAE configurations, the development of a specialized pipeline to orchestrate the generation process, and enhancements to the runtime to manage and output multimodal results. The implementation also includes advanced features like Sequence Parallelism for improved performance and a flexible component loading mechanism to accommodate MOVA's complex architecture.

Highlights

MOVA Model Integration: Introduced comprehensive support for the MOVA (MOSS Video and Audio Synthesis) model, a foundation model for synchronized video-audio generation, including its asymmetric dual-tower architecture and bidirectional cross-attention mechanism.
New Configurations: Added dedicated configuration classes for MOVA's video and audio Diffusion Transformer (DiT) models (MOVAVideoConfig, MOVAAudioConfig), the DAC audio VAE (DacVAEConfig), and the dual-tower bridge model (MOVADualTowerConfig).
Pipeline and Sampling: Implemented MOVAPipelineConfig to manage preprocessing, latent shape calculation, and normalization/denormalization for video and audio, along with MOVASamplingParams for MOVA-specific sampling, including 360P and 720P variants.
Runtime Enhancements: Updated the runtime generator and utility functions to correctly handle and save both video and audio outputs, including muxing video with generated audio using FFmpeg.
Modular Component Loading: Refactored component loaders (VAELoader, TransformerLoader) into more granular base classes (BaseVAELoader, BaseTransformerLoader) and specialized loaders (AudioVAELoader, AudioTransformerLoader, BridgeLoader) to support MOVA's distinct video, audio, and bridge components.
Sequence Parallelism Support: Integrated Sequence Parallelism (SP) into the MOVA denoising stage, allowing video and audio latents to be sharded across SP ranks for distributed attention and efficient processing.
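
The sharding idea in the last highlight can be sketched without any distributed runtime: each SP rank takes one contiguous slice of the token sequence, and the slices are concatenated again after attention. The helpers below are hypothetical and use plain Python lists in place of tensors. In a real setup the gather step is a collective operation, and whether MOVA pads the sequence or requires its length to be divisible by the SP world size is not stated in this PR.

```python
def shard_sequence(tokens: list, world_size: int, rank: int, pad_value=0):
    """Return (this rank's contiguous slice, pad count) after padding to a multiple of world_size."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    pad = (-len(tokens)) % world_size
    padded = tokens + [pad_value] * pad
    chunk = len(padded) // world_size
    return padded[rank * chunk:(rank + 1) * chunk], pad


def gather_sequence(shards: list, pad: int) -> list:
    """Concatenate shards in rank order and drop the padding added by shard_sequence."""
    full = [tok for shard in shards for tok in shard]
    return full[:len(full) - pad] if pad else full


if __name__ == "__main__":
    seq = list(range(10))
    shards = [shard_sequence(seq, world_size=4, rank=r)[0] for r in range(4)]
    print(shards)
    print(gather_sequence(shards, pad=2))
```

Video and audio latents have different sequence lengths, so under this scheme each modality would be sharded independently with its own padding count.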

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces comprehensive support for the MOVA model, including its dual-tower architecture for synchronized video and audio generation. The changes are extensive, adding new model configurations, pipeline stages, and runtime implementations for the video DiT, audio DiT, and the interaction bridge. The integration also includes necessary updates to the component loader, schedulers, and data structures to handle audio output.

The overall structure is well-designed, leveraging existing patterns in the codebase while introducing new abstractions like BaseVAELoader and BaseTransformerLoader to handle the new model components. The implementation of sequence parallelism for the DiT models is a key feature for performance.

My review focuses on a few areas for improvement:

A bug in image preprocessing logic.
A removed assertion that could impact robustness.
Code clarity and maintainability, particularly regarding non-English comments and opportunities for refactoring.
Encapsulation of debugging/profiling code.

The changes are substantial and well-executed. Addressing these points will further enhance the quality and maintainability of this new feature.

mickqian · 2026-01-27T11:35:58Z

/tag-and-rerun-ci

Co-authored-by: gaoyang07 <Gary1546308416AL@gmail.com> Co-authored-by: cms42 <c@cms42.top> Co-authored-by: cms42 <44895820+cms42@users.noreply.github.com> Co-authored-by: Ruixiao Li <cgruixiao@outlook.com> Co-authored-by: Li Ruixiao(SII) <80368770+Li-dongyang@users.noreply.github.com>

BBuf · 2026-02-01T08:55:09Z

+            raise RuntimeError("调度器未初始化，请先调用 set_timesteps()")  # English: "Scheduler is not initialized; call set_timesteps() first"
+        self._refresh_pair_cache()
+
+    def set_pair_postprocess_by_name(self, name: str | None, **kwargs):


Please clean code

gaoyang07 and others added 16 commits January 24, 2026 17:04

support Open-Veo3 by OpenMOSS

d82f806

[Chore] update testing script

32dbb5d

fix: Correct output ratio (sgl-project#23)

c6f5510

feat: use native loader for mossva vae (sgl-project#24)

5b9a207

refactor: move MoVA stages from pipeline cores to models

3fe782d

chore: rename _Base** into Base**

49f2db5

feat: mossva bridge & TP support (sgl-project#26)

19f2973

* feat: mossva bridge & TP support * chore: remove dead code & format code

fix: adapt upstream changes

90202f9

feat: add torch compile and nsys profiling support

a030957

Co-authored-by: gaoyang07 <Gary1546308416AL@gmail.com>

refactor: standardize MOVA class and configuration naming

4cbbcaf

CloudRipple requested review from BBuf, mickqian and yhyang201 as code owners January 25, 2026 09:50

github-actions Bot added the npu and diffusion labels Jan 25, 2026

gemini-code-assist Bot reviewed Jan 25, 2026

Li-dongyang and others added 7 commits January 25, 2026 18:42

fix: image type fix

4590be9

fix: remove dead code(deprecated vae type) & format code

2414e21

refactor: remove temporary profiling code

9201773

refactor: extract shared logic between wan and mova in input_validation

f8caac0

fix: format code

745e706

fix: enhance input verification to ensure reference image is provided

413bf9d

fix: update MOVA model configurations to ensure has_image_input is False and remove related code

7e20134

- It's a config key used by Diffsynth Studio, where True means we use CLIP as image embedding model
- We don't, so assert it's False, and remove code path for True

CloudRipple requested review from DarkSharpness, Kangyan-Zhou and ishandhanani as code owners January 27, 2026 05:33

github-actions Bot added the documentation, quant, dependencies, Multi-modal, deepseek, and model-gateway labels Jan 27, 2026

CloudRipple force-pushed the feat/mova branch from 3729f93 to 067b68f Compare January 27, 2026 05:43

CloudRipple added 2 commits January 27, 2026 06:45

Merge remote-tracking branch 'upstream/main' into feat/mova

bd49e29

refactor: remove redundant fields

73b80a4

mickqian requested changes Jan 27, 2026

CloudRipple added 6 commits January 27, 2026 09:27

refactor: remove unused parameter mappings from MOVADualTowerArchConfig

a1bfeef

refactor: update model detectors for MOVA configurations to include 'mova' identifier

6923e7b

fix: correct USPAttention usage and remove redundant RoPE cos/sin construction in dual tower

3b735af

refactor: replace rotary position embedding implementation with FlashInfer optimized version

5451d18

refactor: restore TransformerLoader and implement multi-DiT loading via module_name passing

cf4b76e

refactor: reuse VaeLoader for audio by adding compatibility logic instead of a new loader

9577839

github-actions Bot added the run-ci label Jan 27, 2026

yhyang201 approved these changes Jan 28, 2026

mickqian approved these changes Jan 29, 2026

mickqian merged commit 09a9147 into sgl-project:main Jan 29, 2026
170 of 179 checks passed

BBuf reviewed Feb 1, 2026

L4-1024 mentioned this pull request Feb 2, 2026

[NPU][diffusion] model: support WAN/FLUX/Qwen-Image/Qwen-Image-edit on Ascend #13662

Merged

6 tasks

mickqian merged 33 commits into sgl-project:main from CloudRipple:feat/mova

7 participants
