[Diffusion] Support MOVA model #17704
update mossVA pipeline
Support Open-Veo3 with SGLang Native SP (sgl-project#14)
* refactor: deduplicate MoVA code by reusing Attn/LN and enable context parallel
* fix: linting and code deduplication in MoVA models
* feat: support serve in single node
* fix: update adjust_frames parameter to False for improved multi-GPU compatibility
* fix: expand negative prompt in MovaSamplingParams for improved image generation quality
* feat: add additional parameters for MoVA configuration in server_sglang_mova.sh
* refactor: deduplicate MoVA code by reusing Attn/LN and enable context parallel
* fix: linting and code deduplication in MoVA models
* update: expand negative_prompt in MovaSamplingParams for improved content filtering
* refactor: replace AttentionModule with USPAttention and streamline MoVA pipeline stages
* refactor: replace USPAttention with LocalAttention for cross-modal efficiency and enhance SP support in MoVA pipeline
…and revert necessary changes (sgl-project#20)
* refactor: enhance KL divergence method in DiagonalGaussianDistribution for flexible dimension handling and clean up DAC class by removing unused code
* refactor: update DacVAE architecture, configuration and its customized loader.
* Revert "fix: update adjust_frames parameter to False for improved multi-GPU compatibility"
* revert changes in base pipeline configs
* revert changes in configs/sample/__init__.py
* [Feature] Remove weight norm in DAC
* [Fix] Use legacy weight norm, which can be removed
* [Fix] remove weight norm at the right place
* [Chore] update test script
* Revert "[Fix] remove weight norm at the right place" This reverts commit 3a0accbae41650e926c5828025323a12454827a4.
* Revert "[Fix] Use legacy weight norm, which can be removed" This reverts commit eb93f20f134888adba4a5124fa1d167b93d180e7.
* Revert "[Feature] Remove weight norm in DAC" This reverts commit aaa64abbc25112a706bf3d3604ffeac390a1d8a8.
* [Feature] Remove all weight norm from DAC modeling
---------
Co-authored-by: CloudRipple <yiyangzhang25@m.fudan.edu.cn>
* feat: add logging for missing and unexpected keys in Audio VAE loading
* feat: refactor MovaPipelineConfig with enhanced image preprocessing
  - Change task_type from TI2V to T2V to reflect text-to-video focus
  - Simplify model configs by replacing video_dit_config and video_dit2_config with single dit_config
  - Add torch.nn.functional import for interpolation operations
  - Refactor _center_crop_and_resize to support both PIL.Image and torch.Tensor
  - Add proper tensor format handling, dtype conversion, and channel dimension management
  - Replace PIL resize with F.interpolate for consistent tensor operations
  - Update get_latent_shape to use dit_config instead of video_dit_config
  - Add new Mova360PConfig class for 360p resolution configuration
* revert: remove redundant '--pipeline-class-name' argument
* [Feature] Replace custom sequential and mlp with sglang MLP
* [Fix] param mapping after using sglang MLP implementation
* [Fix] [TODO] a temporary fix for some weird sglang error
* [Fix] modeling fix after changing MLP implementation
* [Feature] modify audio tower implementation to use sglang-native MLP
* [Revert] revert: commit e095f8c4c This reverts commit e095f8c4c. The issue addressed by this commit has already been resolved elsewhere.
---------
Co-authored-by: Ruixiao Li <cgruixiao@outlook.com>
- Replace MovaDiTLoader with TransformerLoader and AudioTransformerLoader for video and audio DiT modules.
- Simplify initialization of frequency parameters in WanAudioModel and WanModel classes.
- Remove unused imports and redundant code in model definitions.
- Update test script to reflect changes in model paths and configurations.
* [Fix] [TODO] a temporary fix for some weird sglang error
* [Revert] revert: commit e095f8c4c This reverts commit e095f8c4c. The issue addressed by this commit has already been resolved elsewhere.
* [Feat] add basic TP functionality
* [Feat] add tp parameter in the mova test script
* [Chore] unify code style
* scripts: update tp test scripts
* chore: format code
---------
Co-authored-by: cms42 <c@cms42.top>
* feat: mossva bridge & TP support
* chore: remove dead code & format code
Co-authored-by: gaoyang07 <Gary1546308416AL@gmail.com>
Summary of Changes
This pull request significantly expands the system's capabilities by integrating the MOVA model, enabling the generation of synchronized video and audio content. The changes encompass the addition of new model and VAE configurations, the development of a specialized pipeline to orchestrate the generation process, and enhancements to the runtime to manage and output multimodal results. The implementation also includes advanced features like Sequence Parallelism for improved performance and a flexible component loading mechanism to accommodate MOVA's complex architecture.
Code Review
This pull request introduces comprehensive support for the MOVA model, including its dual-tower architecture for synchronized video and audio generation. The changes are extensive, adding new model configurations, pipeline stages, and runtime implementations for the video DiT, audio DiT, and the interaction bridge. The integration also includes necessary updates to the component loader, schedulers, and data structures to handle audio output.
The overall structure is well-designed, leveraging existing patterns in the codebase while introducing new abstractions like BaseVAELoader and BaseTransformerLoader to handle the new model components. The implementation of sequence parallelism for the DiT models is a key feature for performance.
My review focuses on a few areas for improvement:
- A bug in image preprocessing logic.
- A removed assertion that could impact robustness.
- Code clarity and maintainability, particularly regarding non-English comments and opportunities for refactoring.
- Encapsulation of debugging/profiling code.
The changes are substantial and well-executed. Addressing these points will further enhance the quality and maintainability of this new feature.
…lse and remove related code - It's a config key used by Diffsynth Studio, where True means we use CLIP as image embedding model - We don't, so assert it's False, and remove code path for True
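The approach in this commit (reject the unsupported flag value up front instead of keeping a dead code path) could look roughly like the sketch below. The field name `use_clip_image_embedding` and the class `DiTConfigSketch` are hypothetical, since the real config key is truncated in the commit title, and the sketch raises `ValueError` where the commit message speaks of an assert.

```python
from dataclasses import dataclass


@dataclass
class DiTConfigSketch:
    # Hypothetical field name: the real key is truncated in the commit title.
    # True would mean "use CLIP as the image embedding model".
    use_clip_image_embedding: bool = False

    def __post_init__(self) -> None:
        # Fail fast at config construction instead of carrying an
        # untested code path for the True case.
        if self.use_clip_image_embedding:
            raise ValueError(
                "use_clip_image_embedding=True is not supported; expected False"
            )


# A default config constructs fine; an unsupported one is rejected early.
ok_config = DiTConfigSketch()
```

An explicit exception is used here rather than `assert` because assertions are skipped when Python runs with `-O`; whether that matters depends on how the config is loaded.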
3729f93 to 067b68f
…struction in dual tower
…Infer optimized version
…ia module_name passing
…tead of a new loader
/tag-and-rerun-ci
Co-authored-by: gaoyang07 <Gary1546308416AL@gmail.com> Co-authored-by: cms42 <c@cms42.top> Co-authored-by: cms42 <44895820+cms42@users.noreply.github.com> Co-authored-by: Ruixiao Li <cgruixiao@outlook.com> Co-authored-by: Li Ruixiao(SII) <80368770+Li-dongyang@users.noreply.github.com>
            raise RuntimeError("Scheduler is not initialized; call set_timesteps() first")
        self._refresh_pair_cache()

    def set_pair_postprocess_by_name(self, name: str | None, **kwargs):
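For context on the guard in the snippet above, here is a minimal self-contained sketch of the same pattern: state created by `set_timesteps()` is checked before any method uses it. The class `PairedSchedulerSketch`, its linear schedule, and the contents of the pair cache are assumptions for illustration and are not taken from the PR.

```python
from __future__ import annotations


class PairedSchedulerSketch:
    """Hypothetical scheduler that guards state created by set_timesteps()."""

    def __init__(self) -> None:
        self.timesteps: list[float] | None = None  # populated by set_timesteps()
        self._pair_cache: list[tuple[float, float]] = []

    def set_timesteps(self, num_steps: int) -> None:
        # Illustrative schedule only: evenly spaced values from 1.0 down to 0.0.
        self.timesteps = [1.0 - i / num_steps for i in range(num_steps + 1)]
        self._refresh_pair_cache()

    def _refresh_pair_cache(self) -> None:
        if self.timesteps is None:
            raise RuntimeError(
                "Scheduler is not initialized; call set_timesteps() first"
            )
        # Cache consecutive (current, next) timestep pairs.
        self._pair_cache = list(zip(self.timesteps[:-1], self.timesteps[1:]))
```

Calling `_refresh_pair_cache()` on a fresh instance raises `RuntimeError`; after `set_timesteps(4)` the cache holds four `(current, next)` pairs.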
Motivation
This PR integrates MOVA (MOSS Video and Audio Synthesis), a foundation model for synchronized video-audio generation. MOVA features an asymmetric dual-tower architecture with a bidirectional cross-attention mechanism for modality fusion, enabling high-quality multilingual lip-sync and precise audio-visual alignment.
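To make "bidirectional cross-attention" concrete, here is a toy pure-Python sketch in which each stream attends to the other and receives a residual update. It is not the MOVA bridge: there are no learned projections, heads, or normalization, and both streams are assumed to share one feature width, whereas an asymmetric dual tower would need projections between widths. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Each query token takes a softmax-weighted average of the context tokens."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        # Scaled dot-product scores against every context token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        attended.append(
            [sum(w * k[d] for w, k in zip(weights, context)) for d in range(dim)]
        )
    return attended


def bidirectional_bridge(video, audio):
    """Residual update of both streams; each attends to the other's input tokens."""
    video_from_audio = cross_attend(video, audio)
    audio_from_video = cross_attend(audio, video)
    new_video = [[x + y for x, y in zip(v, u)] for v, u in zip(video, video_from_audio)]
    new_audio = [[x + y for x, y in zip(a, u)] for a, u in zip(audio, audio_from_video)]
    return new_video, new_audio


# Three video tokens and one audio token, feature width 2.
video_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_tokens = [[0.5, 0.5]]
fused_video, fused_audio = bidirectional_bridge(video_tokens, audio_tokens)
```

Both directions read the pre-update tokens, so the result does not depend on which stream is updated first; that ordering is a choice made for this sketch only.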
Modifications
MOVA Model and VAE Configurations:
- MOVAVideoConfig and MOVAAudioConfig classes for MOVA video and audio DiT models, including architecture details and parameter mappings (mova_video.py, mova_audio.py, __init__.py).
- DacVAEConfig for the DAC audio VAE, with encoder/decoder and quantizer settings (dac.py, __init__.py).
- MOVADualTowerConfig for dual-tower bridge model support (mova_dual_tower.py, __init__.py).
Pipeline and Sampling Configuration:
- MOVAPipelineConfig for the MOVA pipeline, handling preprocessing, latent shape calculation, and normalization/denormalization for video and audio, with 360P and 720P variants (mova.py, __init__.py).
- MOVASamplingParams, MOVA_360P_SamplingParams, and MOVA_720P_SamplingParams for MOVA-specific sampling parameters and supported resolutions (mova.py).
Integration and Registry Updates:
- registry.py).
Runtime Enhancements:
- diffusion_generator.py).
Accuracy Tests
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci