[Diffusion] [NPU] Wan2.2-T2V-A14B-Diffusers modelslim quantization support#17996
Conversation
Summary of Changes

This pull request introduces several key enhancements and new features to the SGLang project. It expands hardware support by adding Ascend NPU compatibility and integrating MORI-EP for AMD GPUs. Performance is improved with a new fused add RMSNorm kernel. Initial support for MOVA pipelines is added, enabling text- and image-to-video-and-audio generation. The PR also includes dependency updates, bug fixes, and code improvements, enhancing the overall stability and functionality of the SGLang framework.
Code Review
This pull request introduces significant new features, primarily adding support for NPU devices, including w8a8 quantization via a new "modelslim" method. It also adds comprehensive support for the MOVA audio-video generation pipeline, including new model architectures (DiTs, VAEs, Bridge), pipeline stages, and a custom scheduler. Additionally, it integrates the MORI backend for expert parallelism on AMD GPUs. The changes include platform-specific optimizations, major refactoring of the logits processor for better readability, and updates to documentation and dependencies. My review found a couple of leftover debug print statements that should be removed. Overall, this is a substantial and well-structured contribution.
Force-pushed 6e5d194 to c971852
/gemini review
/gemini summary
Code Review
This pull request introduces support for NPU devices, focusing on w8a8 quantized diffusion models. The changes are comprehensive, covering platform abstraction, a new modelslim quantization method, NPU-specific kernels with fallbacks, and updates to the model loading and execution pipeline. A utility script for converting quantized models and new NPU-specific tests are also included. The implementation effectively abstracts platform differences. I've identified a minor issue with leftover debugging print statements that should be addressed.
Summary of Changes

This pull request significantly extends the SGLang framework's hardware compatibility by introducing comprehensive support for Ascend NPU, particularly for the Wan2.2-I2V-A14B-Diffusers model with w8a8 quantization via modelslim.

Highlights
Changelog
Activity
/rerun-failed-ci

/tag-and-rerun-ci
I merged it because bbuf had already approved and most of the CI runs passed. The analysis results for the failed CI runs:

Please let me know if there are any other issues.
…pport (sgl-project#17996) Co-authored-by: ronnie_zheng <zl19940307@163.com>
Motivation
Support the w8a8 (and w4a4) quantized Wan2.2-I2V-A14B-Diffusers model produced by modelslim.
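For reviewers unfamiliar with the scheme: w8a8 means both weights and activations are quantized to 8-bit integers, with matmuls accumulated in int32 and dequantized afterwards. This is a minimal NumPy sketch of symmetric per-tensor int8 quantization to illustrate the idea, not the modelslim or NPU-kernel implementation; all function names here are illustrative.

```python
import numpy as np

def quantize_sym_int8(x: np.ndarray):
    """Symmetric per-tensor int8 quantization: x ~= scale * q."""
    amax = float(np.abs(x).max())
    scale = amax / 127.0 if amax > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def w8a8_matmul(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """int8 x int8 matmul accumulated in int32, then dequantized to float."""
    qx, sx = quantize_sym_int8(x)
    qw, sw = quantize_sym_int8(w)
    acc = qx.astype(np.int32) @ qw.astype(np.int32)
    return acc.astype(np.float32) * (sx * sw)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)
w = rng.standard_normal((64, 32)).astype(np.float32)
err = np.abs(w8a8_matmul(x, w) - x @ w).max()
print(f"max abs error vs fp32: {err:.3f}")
```

Real implementations typically use per-channel weight scales and fused dequantization, but the arithmetic is the same.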
Modifications
Accuracy Tests
To run the quantized model in SGLang, convert the msmodelslim w8a8 model using:

```
python wan_repack.py --input-path *quantized_model_path* --output-path *original Wan2.2-I2V-A14B-Diffusers model without the transformer and transformer_2 folders*
```

Warning: SGLang does not support quantized embeddings at the moment; let me know if this functionality is needed.
Then copy `config.json` from the original `transformer`/`transformer_2` folder to the quantized `transformer`/`transformer_2` folder.

Original:
Two_anthropomorphic_cats_in_comfy_boxing_gear_and_bright_gloves_fight_intensely_on_a_spotlighted_sta_20260213-181134_a9332f65_fp16.mp4
W8A8:
Two_anthropomorphic_cats_in_comfy_boxing_gear_and_bright_gloves_fight_intensely_on_a_spotlighted_sta_20260213-180423_9f8f8811_w8a8.mp4
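Putting the conversion steps above together, the workflow might look like the sketch below. All paths are hypothetical, and the tree is mocked here so the copy step is runnable; in practice the directories come from msmodelslim and the original checkpoint, and the `wan_repack.py` call would be uncommented with real paths.

```shell
set -e
WORK=$(mktemp -d)
QUANT="$WORK/msmodelslim-w8a8"              # msmodelslim output (hypothetical path)
OUT="$WORK/wan2.2-w8a8-sglang"              # repacked model for SGLang (hypothetical path)
ORIG="$WORK/Wan2.2-I2V-A14B-Diffusers"      # original Diffusers checkpoint (hypothetical path)

# Mock tree so the copy step below is runnable in this sketch.
mkdir -p "$OUT/transformer" "$OUT/transformer_2" \
         "$ORIG/transformer" "$ORIG/transformer_2"
echo '{}' > "$ORIG/transformer/config.json"
echo '{}' > "$ORIG/transformer_2/config.json"

# 1. Repack the msmodelslim checkpoint into the SGLang layout
#    (uncomment with real paths):
# python wan_repack.py --input-path "$QUANT" --output-path "$OUT"

# 2. Copy config.json from the original transformer folders into the
#    quantized ones, as described above.
for d in transformer transformer_2; do
    cp "$ORIG/$d/config.json" "$OUT/$d/config.json"
done
ls "$OUT/transformer/config.json" "$OUT/transformer_2/config.json"
```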
Benchmarking and Profiling
Command:
```
SGLANG_CACHE_DIT_FN=2 SGLANG_CACHE_DIT_BN=1 SGLANG_CACHE_DIT_WARMUP=4 SGLANG_CACHE_DIT_RDT=0.4 SGLANG_CACHE_DIT_MC=4 SGLANG_CACHE_DIT_TAYLORSEER=true SGLANG_CACHE_DIT_TS_ORDER=2 SGLANG_CACHE_DIT_ENABLED=true \
  sglang generate --model-path *model_path* \
  --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
  --height 720 --width 1280 --tp-size 4 --sp-degree 2 --num-gpus 8 \
  --num-frames 81 --num-inference-steps 40
```

w8a8 gives a 7% acceleration compared to FP16 (325.47 s vs 350.30 s).
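The quoted 7% figure is consistent with the reported wall-clock times; a quick sanity check:

```python
fp16_s = 350.30   # reported FP16 generation time (seconds)
w8a8_s = 325.47   # reported w8a8 generation time (seconds)

# Relative reduction in wall-clock time.
speedup = (fp16_s - w8a8_s) / fp16_s
print(f"{speedup:.1%}")  # → 7.1%
```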
Checklist
Review Process
`/tag-run-ci-label`, `/rerun-failed-ci`, `/tag-and-rerun-ci`