
[Diffusion] [NPU] Wan2.2-T2V-A14B-Diffusers modelslim quantization support #17996

Merged
sglang-npu-bot merged 119 commits into sgl-project:main from OrangeRedeng:diffusion_npu_w8a8_support
Mar 7, 2026

Conversation

Contributor

@OrangeRedeng OrangeRedeng commented Jan 30, 2026

Motivation

Support the w8a8 (and w4a4) quantized Wan2.2-I2V-A14B-Diffusers model via modelslim.

Modifications

  • Add modelslim support for diffusion models (full w4a4 dynamic and w8a8 static/dynamic linear support)
  • Modify wanvideo and fix prefix naming
  • Add a wan_repack utility that converts original Wan weights from msmodelslim to the Diffusers format (see the sketch after this list)
  • Fix model-name resolution for quantized models
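
A minimal sketch of the repack idea, assuming a hypothetical two-entry subset of the real TRANSFORMER_KEYS_RENAME_DICT defined in wan_repack.py (the actual mapping and tensor handling are more involved):

```python
from typing import Dict

import torch

# Hypothetical subset of the rename mapping; the real
# TRANSFORMER_KEYS_RENAME_DICT in wan_repack.py is larger.
TRANSFORMER_KEYS_RENAME_DICT: Dict[str, str] = {
    "time_embedding.0.": "condition_embedder.time_embedder.linear_1.",
    "time_embedding.2.": "condition_embedder.time_embedder.linear_2.",
}


def repack_state_dict(sd: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
    """Rename msmodelslim checkpoint keys to their Diffusers equivalents."""
    out: Dict[str, torch.Tensor] = {}
    for key, tensor in sd.items():
        for src, dst in TRANSFORMER_KEYS_RENAME_DICT.items():
            if key.startswith(src):
                key = dst + key[len(src):]
                break
        out[key] = tensor
    return out
```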

Accuracy Tests

To run a quantized model in SGLang, convert the msmodelslim w8a8 model using:

wan_repack.py --input-path <quantized_model_path> --output-path <path to the original Wan2.2-I2V-A14B-Diffusers model, without the transformer and transformer_2 folders>

Warning: SGLang does not support quantized embeddings at the moment; let me know if this functionality is needed.

Then copy config.json from the original transformer and transformer_2 folders into the corresponding quantized transformer and transformer_2 folders.
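
For example, with hypothetical local paths, the post-conversion fix-up might look like this:

```python
import shutil
from pathlib import Path

# Hypothetical paths; substitute your own checkout locations.
original = Path("/models/Wan2.2-I2V-A14B-Diffusers")        # unquantized reference
quantized = Path("/models/Wan2.2-I2V-A14B-Diffusers-w8a8")  # wan_repack.py output

# msmodelslim does not emit the Diffusers transformer configs,
# so copy them over from the original model.
for sub in ("transformer", "transformer_2"):
    shutil.copy(original / sub / "config.json", quantized / sub / "config.json")
```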

Original:

Two_anthropomorphic_cats_in_comfy_boxing_gear_and_bright_gloves_fight_intensely_on_a_spotlighted_sta_20260213-181134_a9332f65_fp16.mp4

W8A8:

Two_anthropomorphic_cats_in_comfy_boxing_gear_and_bright_gloves_fight_intensely_on_a_spotlighted_sta_20260213-180423_9f8f8811_w8a8.mp4

Benchmarking and Profiling

Command:

SGLANG_CACHE_DIT_FN=2 SGLANG_CACHE_DIT_BN=1 SGLANG_CACHE_DIT_WARMUP=4 \
SGLANG_CACHE_DIT_RDT=0.4 SGLANG_CACHE_DIT_MC=4 SGLANG_CACHE_DIT_TAYLORSEER=true \
SGLANG_CACHE_DIT_TS_ORDER=2 SGLANG_CACHE_DIT_ENABLED=true \
sglang generate --model-path <model_path> \
  --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
  --height 720 --width 1280 --tp-size 4 --sp-degree 2 --num-gpus 8 \
  --num-frames 81 --num-inference-steps 40

w8a8 gives a ~7% speedup compared to FP16 (325.47 vs. 350.30).
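
As a sanity check on those numbers (units are not stated here; lower is better), the implied gains:

```python
# Reported timings from the benchmark above.
fp16, w8a8 = 350.30, 325.47
print(f"latency reduction: {(fp16 - w8a8) / fp16:.1%}")  # ~7.1%
print(f"throughput gain:   {fp16 / w8a8 - 1:.1%}")       # ~7.6%
```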

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@github-actions github-actions bot added the following labels on Jan 30, 2026: documentation, quant, amd, dependencies, Multi-modal, deepseek, blackwell, npu, diffusion, model-gateway
@gemini-code-assist
Contributor

Summary of Changes

Hello @OrangeRedeng, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces several key enhancements and new features to the SGLang project. It expands hardware support by adding Ascend NPU compatibility and integrating MORI-EP for AMD GPUs. Performance is improved with a new fused add RMSNorm kernel. Initial support for MOVA pipelines is added, enabling text and image to video and audio generation. The PR also includes dependency updates, bug fixes, and code improvements, enhancing the overall stability and functionality of the SGLang framework.

Highlights

  • NPU Support: Adds support for Ascend NPU, enabling SGLang to run on Huawei's neural processing units.
  • MORI-EP Integration: Integrates MORI-EP, AMD's native all-to-all communication implementation optimized for ROCm, enhancing performance on AMD GPUs.
  • Fused Add RMSNorm Kernel: Introduces a new fused add RMSNorm JIT kernel for improved performance, along with benchmarking and testing.
  • MOVA Model Support: Adds initial support for MOVA (text+image -> video+audio) pipelines, including model configurations and sampling parameters.
  • Bug Fixes and Enhancements: Includes various bug fixes, dependency updates, and code improvements across multiple modules.


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces significant new features, primarily adding support for NPU devices, including w8a8 quantization via a new "modelslim" method. It also adds comprehensive support for the MOVA audio-video generation pipeline, including new model architectures (DiTs, VAEs, Bridge), pipeline stages, and a custom scheduler. Additionally, it integrates the MORI backend for expert parallelism on AMD GPUs. The changes include platform-specific optimizations, major refactoring of the logits processor for better readability, and updates to documentation and dependencies. My review found a couple of leftover debug print statements that should be removed. Overall, this is a substantial and well-structured contribution.

@OrangeRedeng OrangeRedeng force-pushed the diffusion_npu_w8a8_support branch from 6e5d194 to c971852 on February 2, 2026 15:03
@OrangeRedeng OrangeRedeng reopened this Feb 2, 2026
@OrangeRedeng OrangeRedeng marked this pull request as ready for review February 2, 2026 15:39
@OrangeRedeng
Contributor Author

/gemini review

@OrangeRedeng
Contributor Author

/gemini summary

@OrangeRedeng OrangeRedeng changed the title [Diffusion] [NPU] Diffusion npu w8a8 support [Diffusion] [NPU] Diffusion w8a8 modelslim quantization support Feb 2, 2026
@OrangeRedeng OrangeRedeng changed the title [Diffusion] [NPU] Diffusion w8a8 modelslim quantization support [Diffusion] [NPU] w8a8 Wan2.2-I2V-A14B-Diffusers modelslim quantization support Feb 2, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces support for NPU devices, focusing on w8a8 quantized diffusion models. The changes are comprehensive, covering platform abstraction, a new modelslim quantization method, NPU-specific kernels with fallbacks, and updates to the model loading and execution pipeline. A utility script for converting quantized models and new NPU-specific tests are also included. The implementation effectively abstracts platform differences. I've identified a minor issue with leftover debugging print statements that should be addressed.

@gemini-code-assist
Contributor

Summary of Changes

This pull request significantly extends the SGLang framework's hardware compatibility by introducing comprehensive support for Ascend NPU, particularly for the Wan2.2-I2V-A14B-Diffusers model with modelslim quantization. The changes involve deep integration at the platform level, adapting core functionalities like device management, distributed operations, and custom kernels to leverage NPU capabilities. A new utility for model format conversion and dedicated NPU performance tests are also part of this update, ensuring robust and efficient execution on the new hardware.

Highlights

  • NPU Support: Introduced comprehensive support for Ascend NPU, enabling the SGLang framework to run on Huawei's neural processing units. This includes platform-specific adaptations for device handling, distributed communication, and custom operation implementations.
  • ModelSlim Quantization Integration: Integrated modelslim quantization, specifically supporting w8a8 and w4a4 static/dynamic linear quantization, and MoE. This allows for efficient execution of quantized models on NPU hardware.
  • Wan2.2-I2V-A14B-Diffusers Model Support: Added support for the w8a8 quantized Wan2.2-I2V-A14B-Diffusers model, including modifications to the wanvideo module to handle prefix naming and quantization for ColumnParallelLinear and RowParallelLinear layers.
  • Model Conversion Utility: A new utility script, wan_repack.py, was added to convert modelslim quantized Wan models into the Diffusers format, facilitating easier integration and use within the SGLang ecosystem.
  • Platform-Aware Optimizations: Implemented platform-aware changes across various modules, including conditional torch.compile disabling for NPU, NPU-specific implementations for activation functions (e.g., silu_and_mul using npu_swiglu) and layer normalization (npu_add_rms_norm, npu_rms_norm), and updates to Triton operations for NPU compatibility (see the activation sketch after this list).
  • Distributed Environment Enhancements: Updated the distributed environment initialization to properly configure for NPU, including adding NPU to platforms that don't require device_id and setting the distributed backend to 'hccl' for NPU.
  • NPU Performance Testing: Introduced a new test suite and performance baselines specifically for NPU, allowing for systematic testing and benchmarking of NPU-enabled models and configurations.
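
To make the activation change above concrete, here is a hedged sketch of the silu-and-mul routing: the native reference plus the torch_npu path. The npu_swiglu call and its gating convention are assumed from the description above, not copied from the PR:

```python
import torch
import torch.nn.functional as F


def silu_and_mul_native(x: torch.Tensor) -> torch.Tensor:
    # Reference path: SiLU-gate the first half of the last
    # dimension with the second half.
    gate, up = x.chunk(2, dim=-1)
    return F.silu(gate) * up


def silu_and_mul(x: torch.Tensor) -> torch.Tensor:
    try:
        import torch_npu  # only present on Ascend installs
    except ImportError:
        return silu_and_mul_native(x)
    # forward_npu path described above; exact signature assumed.
    return torch_npu.npu_swiglu(x)
```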


Changelog
  • python/sglang/multimodal_gen/runtime/distributed/group_coordinator.py
    • Removed unused import of sglang.multimodal_gen.envs.
    • Updated get_local_torch_device to use current_platform.get_local_torch_device() for platform-agnostic device retrieval.
    • Removed redundant import of current_platform within __init__ and graph_capture methods.
  • python/sglang/multimodal_gen/runtime/distributed/parallel_state.py
    • Extended init_process_group to not pass device_id for NPU, similar to MPS and MUSA.
    • Passed backend=current_platform.get_torch_distributed_backend_str() to init_distributed_environment for platform-specific backend selection.
  • python/sglang/multimodal_gen/runtime/layers/activation.py
    • Implemented conditional import of sgl_kernel.silu_and_mul or torch_npu based on NPU platform detection.
    • Added forward_npu method to SiluAndMul to utilize torch_npu.npu_swiglu for NPU-optimized activation.
  • python/sglang/multimodal_gen/runtime/layers/custom_op.py
    • Added a default forward_npu method that falls back to forward_native for NPU compatibility (see the sketch after this changelog).
  • python/sglang/multimodal_gen/runtime/layers/layernorm.py
    • Implemented conditional import of sgl_kernel.fused_add_rmsnorm, rmsnorm or torch_npu based on NPU platform detection.
    • Added forward_npu method to RMSNorm to use torch_npu.npu_add_rms_norm and torch_npu.npu_rms_norm for NPU-optimized layer normalization.
    • Disabled torch.compile for forward_native when running on NPU.
  • python/sglang/multimodal_gen/runtime/layers/lora/linear.py
    • Conditionally disabled torch._dynamo.config.recompile_limit for NPU to avoid potential issues.
  • python/sglang/multimodal_gen/runtime/layers/mlp.py
    • Imported QuantizationConfig to support quantization configurations.
    • Added quant_config and prefix parameters to MLP.__init__ and propagated them to ColumnParallelLinear and RowParallelLinear for quantization awareness.
  • python/sglang/multimodal_gen/runtime/layers/quantization/__init__.py
    • Added "modelslim" to the QuantizationMethods literal type.
    • Registered ModelSlimConfig as the handler for "modelslim" quantization.
  • python/sglang/multimodal_gen/runtime/layers/quantization/modelslim.py
    • Added new file modelslim.py to define ModelSlimConfig for NPU-specific quantization.
    • Implemented ModelSlimConfig to handle modelslim quantization schemes, including layer skipping and linear method retrieval.
    • Introduced ModelSlimLinearMethod to process and apply modelslim quantized weights.
  • python/sglang/multimodal_gen/runtime/layers/triton_ops.py
    • Imported current_platform for platform-specific logic.
    • Updated triton_autotune_configs and _layer_norm_fwd_impl to use torch.get_device_module() for device property access and context management.
    • Added NPU-specific native implementations for fuse_scale_shift_kernel and apply_rotary_embedding as a workaround for Triton Ascend bugs.
  • python/sglang/multimodal_gen/runtime/layers/vocab_parallel_embedding.py
    • Disabled torch.compile for get_masked_input_and_mask when on NPU.
  • python/sglang/multimodal_gen/runtime/loader/fsdp_load.py
    • Ensured that the dtype for full_tensor and sharded_tensor is derived from meta_sharded_param.dtype during model loading.
    • Set requires_grad=False for newly created nn.Parameter instances in sharded_sd.
  • python/sglang/multimodal_gen/runtime/managers/gpu_worker.py
    • Replaced torch.cuda.set_device with torch.get_device_module().set_device for platform-agnostic device setting.
    • Added distributed_init_method to maybe_init_distributed_environment_and_model_parallel call.
    • Replaced torch.cuda.reset_peak_memory_stats with torch.get_device_module().reset_peak_memory_stats.
    • Conditionally called set_cuda_arch() only if the current platform is CUDA.
  • python/sglang/multimodal_gen/runtime/models/dits/wanvideo.py
    • Imported QuantizationConfig and add_prefix for quantization and prefix management.
    • Added quant_config and prefix parameters to Attention.__init__, CrossAttention.__init__, DiTBlock.__init__, UlyssesDiTBlock.__init__, and MLP initializations to support quantization.
    • Updated encoder_hidden_states casting to orig_dtype to include NPU.
    • Corrected prefix naming for DiTBlock and UlyssesDiTBlock initialization.
  • python/sglang/multimodal_gen/runtime/models/encoders/clip.py
    • Corrected is_causal logic for scaled_dot_product_attention based on the presence of an attention_mask.
  • python/sglang/multimodal_gen/runtime/pipelines_core/stages/denoising.py
    • Removed assertion that attn_metadata cannot be None, allowing it to be None for the SDPA attention backend.
  • python/sglang/multimodal_gen/runtime/platforms/__init__.py
    • Added npu_platform_plugin to detect NPU availability and register NPUPlatformBase.
    • Included "npu" in the PLATFORM_PLUGINS dictionary.
    • Prioritized NPU fallback in resolve_current_platform_cls_qualname for platform detection.
  • python/sglang/multimodal_gen/runtime/platforms/cuda.py
    • Imported envs.
    • Added get_local_torch_device to return a CUDA device based on envs.LOCAL_RANK.
  • python/sglang/multimodal_gen/runtime/platforms/interface.py
    • Added NPU to the PlatformEnum.
    • Introduced is_npu() method for platform identification.
    • Added abstract get_local_torch_device method to the Platform interface.
    • Implemented NPU-specific logic for get_device and get_torch_distributed_backend_str (returning "hccl").
  • python/sglang/multimodal_gen/runtime/platforms/mps.py
    • Added get_local_torch_device to return an MPS device.
  • python/sglang/multimodal_gen/runtime/platforms/npu.py
    • Added new file npu.py defining NPUPlatformBase for Ascend NPU.
    • Implemented NPU-specific methods for device control, memory management, and attention backend selection (SDPA).
  • python/sglang/multimodal_gen/runtime/platforms/rocm.py
    • Imported envs.
    • Added get_local_torch_device to return a CUDA device based on envs.LOCAL_RANK for ROCm.
  • python/sglang/multimodal_gen/runtime/utils/hf_diffusers_utils.py
    • Imported Dict and List for type hinting.
    • Imported QuantizationConfig and get_quantization_config for quantization support.
    • Added replace_prefix function for remapping keys in quantization configurations.
    • Introduced find_quant_modelslim_config to detect and load modelslim quantization configurations.
    • Modified get_quant_config to support modelslim and handle various quantization config loading scenarios.
  • python/sglang/multimodal_gen/test/run_suite.py
    • Added suites_ascend dictionary to include NPU-specific test files.
    • Updated the main SUITES dictionary to incorporate suites_ascend.
  • python/sglang/multimodal_gen/test/server/ascend/perf_baselines_npu.json
    • Added new file perf_baselines_npu.json containing NPU-specific performance baselines for the wan2_1_t2v_1.3b_1_npu model.
  • python/sglang/multimodal_gen/test/server/ascend/test_server_1_npu.py
    • Added new file test_server_1_npu.py introducing TestDiffusionServerOneNpu for performance testing on a single NPU.
    • Utilizes ONE_NPU_CASES for test parametrization.
  • python/sglang/multimodal_gen/test/server/ascend/testcase_configs_npu.py
    • Added new file testcase_configs_npu.py defining ONE_NPU_CASES for NPU-specific diffusion test scenarios, including a Text-to-Video case.
  • python/sglang/multimodal_gen/test/server/testcase_configs.py
    • Added an update method to BaselineConfig to allow loading and merging additional baseline configurations, specifically for NPU.
    • Updated BASELINE_CONFIG to load and merge NPU baselines from perf_baselines_npu.json.
  • python/sglang/multimodal_gen/wan_repack.py
    • Added new file wan_repack.py as a utility script to convert modelslim quantized Wan models to the Diffusers format.
    • Includes TRANSFORMER_KEYS_RENAME_DICT for key mapping and functions for loading, converting, and saving model weights and quantization configurations.
  • python/sglang/srt/hardware_backend/npu/quantization/linear_method_npu.py
    • Modified pertoken_scale to be dynamic_scale.flatten() in the apply method for improved quantization handling.
  • test/srt/ep/test_moriep_small.py
    • Added new file test_moriep_small.py containing TestPureDP and TestMTP classes.
    • These classes are designed for testing distributed processing and speculative decoding with the MORI-EP backend, including setup, teardown, and GSM8k evaluation.
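
To illustrate the custom_op.py item above, here is a minimal sketch of the platform-dispatch pattern with the new default forward_npu fallback. Everything apart from the forward_npu/forward_native names is a simplified stand-in for sglang's real current_platform abstraction:

```python
import importlib.util

import torch
import torch.nn as nn


class CustomOp(nn.Module):
    """Minimal sketch: dispatch forward() to a platform-specific impl."""

    def forward(self, *args, **kwargs):
        # Simplified platform check; the real code queries current_platform.
        if importlib.util.find_spec("torch_npu") is not None:
            return self.forward_npu(*args, **kwargs)
        if torch.cuda.is_available():
            return self.forward_cuda(*args, **kwargs)
        return self.forward_native(*args, **kwargs)

    def forward_native(self, *args, **kwargs):
        raise NotImplementedError

    def forward_cuda(self, *args, **kwargs):
        return self.forward_native(*args, **kwargs)

    def forward_npu(self, *args, **kwargs):
        # Default added by this PR: fall back to the native path
        # unless a subclass supplies a dedicated NPU kernel.
        return self.forward_native(*args, **kwargs)
```
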
Activity
  • A bot provided an initial summary of the changes.
  • The pull request author requested a review.
  • The pull request author requested a summary.
  • A bot provided review comments suggesting the removal of debugging print statements.
  • The pull request author addressed the review comments by removing the print statements.

@ping1jing2 ping1jing2 self-assigned this Feb 2, 2026
@ping1jing2
Collaborator

/rerun-failed-ci

@ping1jing2
Collaborator

/tag-and-rerun-ci

@ping1jing2
Collaborator

ping1jing2 commented Mar 7, 2026

I merged it because bbuf already approved and most of the CIs passed. Analysis of the failed CIs:

  1. AMD-related CIs: these two issues are known CI errors, confirmed by Bingxu Chen.
  2. XPU-related CIs: AttributeError: module 'torch.xpu' has no attribute 'graph_pool_handle'; this error is unrelated to this PR.
  3. GPU CI: HttpError in wait-for-stage-b; I also believe this error is unrelated to this PR.

Please let me know if there are any other issues.

@sglang-npu-bot sglang-npu-bot merged commit 5297b02 into sgl-project:main Mar 7, 2026
111 of 119 checks passed
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
…pport (sgl-project#17996)

Co-authored-by: ronnie_zheng <zl19940307@163.com>
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
…pport (sgl-project#17996)

Co-authored-by: ronnie_zheng <zl19940307@163.com>
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
…pport (sgl-project#17996)

Co-authored-by: ronnie_zheng <zl19940307@163.com>
