
[Diffusion] [NPU] Wan2.2-T2V-A14B-Diffusers modelslim quantization support #17996

Merged
sglang-npu-bot merged 119 commits into sgl-project:main from OrangeRedeng:diffusion_npu_w8a8_support
Mar 7, 2026

Conversation

Contributor

@OrangeRedeng OrangeRedeng commented Jan 30, 2026

Motivation

Support the w8a8 (and w4a4) quantized Wan2.2-I2V-A14B-Diffusers model via modelslim.

Modifications

  • Add modelslim support for diffusion models (full w4a4 dynamic and w8a8 static/dynamic linear support)
  • Modify wanvideo and fix prefix naming
  • Add a wan_repack utility that converts original Wan weights from msmodelslim to the Diffusers format (see the sketch after this list)
  • Fix model-name resolution for quantized models
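
A minimal sketch of the repack idea, assuming a hypothetical two-entry subset of the real TRANSFORMER_KEYS_RENAME_DICT defined in wan_repack.py (the actual mapping and tensor handling are more involved):

```python
from typing import Dict

import torch

# Hypothetical subset of the rename mapping; the real
# TRANSFORMER_KEYS_RENAME_DICT in wan_repack.py is larger.
TRANSFORMER_KEYS_RENAME_DICT: Dict[str, str] = {
    "time_embedding.0.": "condition_embedder.time_embedder.linear_1.",
    "time_embedding.2.": "condition_embedder.time_embedder.linear_2.",
}


def repack_state_dict(sd: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
    """Rename msmodelslim checkpoint keys to their Diffusers equivalents."""
    out: Dict[str, torch.Tensor] = {}
    for key, tensor in sd.items():
        for src, dst in TRANSFORMER_KEYS_RENAME_DICT.items():
            if key.startswith(src):
                key = dst + key[len(src):]
                break
        out[key] = tensor
    return out
```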

Accuracy Tests

To run a quantized model in SGLang, convert the msmodelslim w8a8 model using:

wan_repack.py --input-path <quantized_model_path> --output-path <path to the original Wan2.2-I2V-A14B-Diffusers model, without the transformer and transformer_2 folders>

Warning: SGLang does not support quantized embeddings at the moment; let me know if this functionality is needed.

Then copy config.json from the original transformer and transformer_2 folders into the corresponding quantized transformer and transformer_2 folders.
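
For example, with hypothetical local paths, the post-conversion fix-up might look like this:

```python
import shutil
from pathlib import Path

# Hypothetical paths; substitute your own checkout locations.
original = Path("/models/Wan2.2-I2V-A14B-Diffusers")        # unquantized reference
quantized = Path("/models/Wan2.2-I2V-A14B-Diffusers-w8a8")  # wan_repack.py output

# msmodelslim does not emit the Diffusers transformer configs,
# so copy them over from the original model.
for sub in ("transformer", "transformer_2"):
    shutil.copy(original / sub / "config.json", quantized / sub / "config.json")
```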

Original:

Two_anthropomorphic_cats_in_comfy_boxing_gear_and_bright_gloves_fight_intensely_on_a_spotlighted_sta_20260213-181134_a9332f65_fp16.mp4

W8A8:

Two_anthropomorphic_cats_in_comfy_boxing_gear_and_bright_gloves_fight_intensely_on_a_spotlighted_sta_20260213-180423_9f8f8811_w8a8.mp4

Benchmarking and Profiling

Command:

SGLANG_CACHE_DIT_FN=2 SGLANG_CACHE_DIT_BN=1 SGLANG_CACHE_DIT_WARMUP=4 \
SGLANG_CACHE_DIT_RDT=0.4 SGLANG_CACHE_DIT_MC=4 SGLANG_CACHE_DIT_TAYLORSEER=true \
SGLANG_CACHE_DIT_TS_ORDER=2 SGLANG_CACHE_DIT_ENABLED=true \
sglang generate --model-path <model_path> \
  --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
  --height 720 --width 1280 --tp-size 4 --sp-degree 2 --num-gpus 8 \
  --num-frames 81 --num-inference-steps 40

w8a8 gives a ~7% speedup compared to FP16 (325.47 vs. 350.30).
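
As a sanity check on those numbers (units are not stated here; lower is better), the implied gains:

```python
# Reported timings from the benchmark above.
fp16, w8a8 = 350.30, 325.47
print(f"latency reduction: {(fp16 - w8a8) / fp16:.1%}")  # ~7.1%
print(f"throughput gain:   {fp16 / w8a8 - 1:.1%}")       # ~7.6%
```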

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@github-actions github-actions bot added the following labels on Jan 30, 2026: documentation, quant, amd, dependencies, Multi-modal, deepseek, blackwell, npu, diffusion, model-gateway
@gemini-code-assist
Contributor

Summary of Changes

Hello @OrangeRedeng, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces several key enhancements and new features to the SGLang project. It expands hardware support by adding Ascend NPU compatibility and integrating MORI-EP for AMD GPUs. Performance is improved with a new fused add RMSNorm kernel. Initial support for MOVA pipelines is added, enabling text and image to video and audio generation. The PR also includes dependency updates, bug fixes, and code improvements, enhancing the overall stability and functionality of the SGLang framework.

Highlights

  • NPU Support: Adds support for Ascend NPU, enabling SGLang to run on Huawei's neural processing units.
  • MORI-EP Integration: Integrates MORI-EP, AMD's native all-to-all communication implementation optimized for ROCm, enhancing performance on AMD GPUs.
  • Fused Add RMSNorm Kernel: Introduces a new fused add RMSNorm JIT kernel for improved performance, along with benchmarking and testing.
  • MOVA Model Support: Adds initial support for MOVA (text+image -> video+audio) pipelines, including model configurations and sampling parameters.
  • Bug Fixes and Enhancements: Includes various bug fixes, dependency updates, and code improvements across multiple modules.


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces significant new features, primarily adding support for NPU devices, including w8a8 quantization via a new "modelslim" method. It also adds comprehensive support for the MOVA audio-video generation pipeline, including new model architectures (DiTs, VAEs, Bridge), pipeline stages, and a custom scheduler. Additionally, it integrates the MORI backend for expert parallelism on AMD GPUs. The changes include platform-specific optimizations, major refactoring of the logits processor for better readability, and updates to documentation and dependencies. My review found a couple of leftover debug print statements that should be removed. Overall, this is a substantial and well-structured contribution.

@OrangeRedeng OrangeRedeng force-pushed the diffusion_npu_w8a8_support branch from 6e5d194 to c971852 on February 2, 2026 15:03
@OrangeRedeng OrangeRedeng reopened this Feb 2, 2026
@OrangeRedeng OrangeRedeng marked this pull request as ready for review February 2, 2026 15:39
@OrangeRedeng
Contributor Author

/gemini review

@OrangeRedeng
Contributor Author

/gemini summary

@OrangeRedeng OrangeRedeng changed the title [Diffusion] [NPU] Diffusion npu w8a8 support [Diffusion] [NPU] Diffusion w8a8 modelslim quantization support Feb 2, 2026
@OrangeRedeng OrangeRedeng changed the title [Diffusion] [NPU] Diffusion w8a8 modelslim quantization support [Diffusion] [NPU] w8a8 Wan2.2-I2V-A14B-Diffusers modelslim quantization support Feb 2, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces support for NPU devices, focusing on w8a8 quantized diffusion models. The changes are comprehensive, covering platform abstraction, a new modelslim quantization method, NPU-specific kernels with fallbacks, and updates to the model loading and execution pipeline. A utility script for converting quantized models and new NPU-specific tests are also included. The implementation effectively abstracts platform differences. I've identified a minor issue with leftover debugging print statements that should be addressed.

@gemini-code-assist
Contributor

Summary of Changes

This pull request significantly extends the SGLang framework's hardware compatibility by introducing comprehensive support for Ascend NPU, particularly for the Wan2.2-I2V-A14B-Diffusers model with modelslim quantization. The changes involve deep integration at the platform level, adapting core functionalities like device management, distributed operations, and custom kernels to leverage NPU capabilities. A new utility for model format conversion and dedicated NPU performance tests are also part of this update, ensuring robust and efficient execution on the new hardware.

Highlights

  • NPU Support: Introduced comprehensive support for Ascend NPU, enabling the SGLang framework to run on Huawei's neural processing units. This includes platform-specific adaptations for device handling, distributed communication, and custom operation implementations.
  • ModelSlim Quantization Integration: Integrated modelslim quantization, specifically supporting w8a8 and w4a4 static/dynamic linear quantization, and MoE. This allows for efficient execution of quantized models on NPU hardware.
  • Wan2.2-I2V-A14B-Diffusers Model Support: Added support for the w8a8 quantized Wan2.2-I2V-A14B-Diffusers model, including modifications to the wanvideo module to handle prefix naming and quantization for ColumnParallelLinear and RowParallelLinear layers.
  • Model Conversion Utility: A new utility script, wan_repack.py, was added to convert modelslim quantized Wan models into the Diffusers format, facilitating easier integration and use within the SGLang ecosystem.
  • Platform-Aware Optimizations: Implemented platform-aware changes across various modules, including conditional torch.compile disabling for NPU, NPU-specific implementations for activation functions (e.g., silu_and_mul using npu_swiglu) and layer normalization (npu_add_rms_norm, npu_rms_norm), and updates to Triton operations for NPU compatibility (see the activation sketch after this list).
  • Distributed Environment Enhancements: Updated the distributed environment initialization to properly configure for NPU, including adding NPU to platforms that don't require device_id and setting the distributed backend to 'hccl' for NPU.
  • NPU Performance Testing: Introduced a new test suite and performance baselines specifically for NPU, allowing for systematic testing and benchmarking of NPU-enabled models and configurations.
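
To make the activation change above concrete, here is a hedged sketch of the silu-and-mul routing: the native reference plus the torch_npu path. The npu_swiglu call and its gating convention are assumed from the description above, not copied from the PR:

```python
import torch
import torch.nn.functional as F


def silu_and_mul_native(x: torch.Tensor) -> torch.Tensor:
    # Reference path: SiLU-gate the first half of the last
    # dimension with the second half.
    gate, up = x.chunk(2, dim=-1)
    return F.silu(gate) * up


def silu_and_mul(x: torch.Tensor) -> torch.Tensor:
    try:
        import torch_npu  # only present on Ascend installs
    except ImportError:
        return silu_and_mul_native(x)
    # forward_npu path described above; exact signature assumed.
    return torch_npu.npu_swiglu(x)
```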


Changelog
  • python/sglang/multimodal_gen/runtime/distributed/group_coordinator.py
    • Removed unused import of sglang.multimodal_gen.envs.
    • Updated get_local_torch_device to use current_platform.get_local_torch_device() for platform-agnostic device retrieval.
    • Removed redundant import of current_platform within __init__ and graph_capture methods.
  • python/sglang/multimodal_gen/runtime/distributed/parallel_state.py
    • Extended init_process_group to not pass device_id for NPU, similar to MPS and MUSA.
    • Passed backend=current_platform.get_torch_distributed_backend_str() to init_distributed_environment for platform-specific backend selection.
  • python/sglang/multimodal_gen/runtime/layers/activation.py
    • Implemented conditional import of sgl_kernel.silu_and_mul or torch_npu based on NPU platform detection.
    • Added forward_npu method to SiluAndMul to utilize torch_npu.npu_swiglu for NPU-optimized activation.
  • python/sglang/multimodal_gen/runtime/layers/custom_op.py
    • Added a default forward_npu method that falls back to forward_native for NPU compatibility (see the sketch after this changelog).
  • python/sglang/multimodal_gen/runtime/layers/layernorm.py
    • Implemented conditional import of sgl_kernel.fused_add_rmsnorm, rmsnorm or torch_npu based on NPU platform detection.
    • Added forward_npu method to RMSNorm to use torch_npu.npu_add_rms_norm and torch_npu.npu_rms_norm for NPU-optimized layer normalization.
    • Disabled torch.compile for forward_native when running on NPU.
  • python/sglang/multimodal_gen/runtime/layers/lora/linear.py
    • Conditionally disabled torch._dynamo.config.recompile_limit for NPU to avoid potential issues.
  • python/sglang/multimodal_gen/runtime/layers/mlp.py
    • Imported QuantizationConfig to support quantization configurations.
    • Added quant_config and prefix parameters to MLP.__init__ and propagated them to ColumnParallelLinear and RowParallelLinear for quantization awareness.
  • python/sglang/multimodal_gen/runtime/layers/quantization/__init__.py
    • Added "modelslim" to the QuantizationMethods literal type.
    • Registered ModelSlimConfig as the handler for "modelslim" quantization.
  • python/sglang/multimodal_gen/runtime/layers/quantization/modelslim.py
    • Added new file modelslim.py to define ModelSlimConfig for NPU-specific quantization.
    • Implemented ModelSlimConfig to handle modelslim quantization schemes, including layer skipping and linear method retrieval.
    • Introduced ModelSlimLinearMethod to process and apply modelslim quantized weights.
  • python/sglang/multimodal_gen/runtime/layers/triton_ops.py
    • Imported current_platform for platform-specific logic.
    • Updated triton_autotune_configs and _layer_norm_fwd_impl to use torch.get_device_module() for device property access and context management.
    • Added NPU-specific native implementations for fuse_scale_shift_kernel and apply_rotary_embedding as a workaround for Triton Ascend bugs.
  • python/sglang/multimodal_gen/runtime/layers/vocab_parallel_embedding.py
    • Disabled torch.compile for get_masked_input_and_mask when on NPU.
  • python/sglang/multimodal_gen/runtime/loader/fsdp_load.py
    • Ensured that the dtype for full_tensor and sharded_tensor is derived from meta_sharded_param.dtype during model loading.
    • Set requires_grad=False for newly created nn.Parameter instances in sharded_sd.
  • python/sglang/multimodal_gen/runtime/managers/gpu_worker.py
    • Replaced torch.cuda.set_device with torch.get_device_module().set_device for platform-agnostic device setting.
    • Added distributed_init_method to maybe_init_distributed_environment_and_model_parallel call.
    • Replaced torch.cuda.reset_peak_memory_stats with torch.get_device_module().reset_peak_memory_stats.
    • Conditionally called set_cuda_arch() only if the current platform is CUDA.
  • python/sglang/multimodal_gen/runtime/models/dits/wanvideo.py
    • Imported QuantizationConfig and add_prefix for quantization and prefix management.
    • Added quant_config and prefix parameters to Attention.__init__, CrossAttention.__init__, DiTBlock.__init__, UlyssesDiTBlock.__init__, and MLP initializations to support quantization.
    • Updated encoder_hidden_states casting to orig_dtype to include NPU.
    • Corrected prefix naming for DiTBlock and UlyssesDiTBlock initialization.
  • python/sglang/multimodal_gen/runtime/models/encoders/clip.py
    • Corrected is_causal logic for scaled_dot_product_attention based on the presence of an attention_mask.
  • python/sglang/multimodal_gen/runtime/pipelines_core/stages/denoising.py
    • Removed assertion that attn_metadata cannot be None, allowing it to be None for the SDPA attention backend.
  • python/sglang/multimodal_gen/runtime/platforms/__init__.py
    • Added npu_platform_plugin to detect NPU availability and register NPUPlatformBase.
    • Included "npu" in the PLATFORM_PLUGINS dictionary.
    • Prioritized NPU fallback in resolve_current_platform_cls_qualname for platform detection.
  • python/sglang/multimodal_gen/runtime/platforms/cuda.py
    • Imported envs.
    • Added get_local_torch_device to return a CUDA device based on envs.LOCAL_RANK.
  • python/sglang/multimodal_gen/runtime/platforms/interface.py
    • Added NPU to the PlatformEnum.
    • Introduced is_npu() method for platform identification.
    • Added abstract get_local_torch_device method to the Platform interface.
    • Implemented NPU-specific logic for get_device and get_torch_distributed_backend_str (returning "hccl").
  • python/sglang/multimodal_gen/runtime/platforms/mps.py
    • Added get_local_torch_device to return an MPS device.
  • python/sglang/multimodal_gen/runtime/platforms/npu.py
    • Added new file npu.py defining NPUPlatformBase for Ascend NPU.
    • Implemented NPU-specific methods for device control, memory management, and attention backend selection (SDPA).
  • python/sglang/multimodal_gen/runtime/platforms/rocm.py
    • Imported envs.
    • Added get_local_torch_device to return a CUDA device based on envs.LOCAL_RANK for ROCm.
  • python/sglang/multimodal_gen/runtime/utils/hf_diffusers_utils.py
    • Imported Dict and List for type hinting.
    • Imported QuantizationConfig and get_quantization_config for quantization support.
    • Added replace_prefix function for remapping keys in quantization configurations.
    • Introduced find_quant_modelslim_config to detect and load modelslim quantization configurations.
    • Modified get_quant_config to support modelslim and handle various quantization config loading scenarios.
  • python/sglang/multimodal_gen/test/run_suite.py
    • Added suites_ascend dictionary to include NPU-specific test files.
    • Updated the main SUITES dictionary to incorporate suites_ascend.
  • python/sglang/multimodal_gen/test/server/ascend/perf_baselines_npu.json
    • Added new file perf_baselines_npu.json containing NPU-specific performance baselines for the wan2_1_t2v_1.3b_1_npu model.
  • python/sglang/multimodal_gen/test/server/ascend/test_server_1_npu.py
    • Added new file test_server_1_npu.py introducing TestDiffusionServerOneNpu for performance testing on a single NPU.
    • Utilizes ONE_NPU_CASES for test parametrization.
  • python/sglang/multimodal_gen/test/server/ascend/testcase_configs_npu.py
    • Added new file testcase_configs_npu.py defining ONE_NPU_CASES for NPU-specific diffusion test scenarios, including a Text-to-Video case.
  • python/sglang/multimodal_gen/test/server/testcase_configs.py
    • Added an update method to BaselineConfig to allow loading and merging additional baseline configurations, specifically for NPU.
    • Updated BASELINE_CONFIG to load and merge NPU baselines from perf_baselines_npu.json.
  • python/sglang/multimodal_gen/wan_repack.py
    • Added new file wan_repack.py as a utility script to convert modelslim quantized Wan models to the Diffusers format.
    • Includes TRANSFORMER_KEYS_RENAME_DICT for key mapping and functions for loading, converting, and saving model weights and quantization configurations.
  • python/sglang/srt/hardware_backend/npu/quantization/linear_method_npu.py
    • Modified pertoken_scale to be dynamic_scale.flatten() in the apply method for improved quantization handling.
  • test/srt/ep/test_moriep_small.py
    • Added new file test_moriep_small.py containing TestPureDP and TestMTP classes.
    • These classes are designed for testing distributed processing and speculative decoding with the MORI-EP backend, including setup, teardown, and GSM8k evaluation.
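
To illustrate the custom_op.py item above, here is a minimal sketch of the platform-dispatch pattern with the new default forward_npu fallback. Everything apart from the forward_npu/forward_native names is a simplified stand-in for sglang's real current_platform abstraction:

```python
import importlib.util

import torch
import torch.nn as nn


class CustomOp(nn.Module):
    """Minimal sketch: dispatch forward() to a platform-specific impl."""

    def forward(self, *args, **kwargs):
        # Simplified platform check; the real code queries current_platform.
        if importlib.util.find_spec("torch_npu") is not None:
            return self.forward_npu(*args, **kwargs)
        if torch.cuda.is_available():
            return self.forward_cuda(*args, **kwargs)
        return self.forward_native(*args, **kwargs)

    def forward_native(self, *args, **kwargs):
        raise NotImplementedError

    def forward_cuda(self, *args, **kwargs):
        return self.forward_native(*args, **kwargs)

    def forward_npu(self, *args, **kwargs):
        # Default added by this PR: fall back to the native path
        # unless a subclass supplies a dedicated NPU kernel.
        return self.forward_native(*args, **kwargs)
```
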
Activity
  • A bot provided an initial summary of the changes.
  • The pull request author requested a review.
  • The pull request author requested a summary.
  • A bot provided review comments suggesting the removal of debugging print statements.
  • The pull request author addressed the review comments by removing the print statements.

@ping1jing2 ping1jing2 self-assigned this Feb 2, 2026
@ping1jing2
Collaborator

/rerun-failed-ci

@ping1jing2
Collaborator

/tag-and-rerun-ci

@ping1jing2
Collaborator

ping1jing2 commented Mar 7, 2026

I merged it because bbuf already approved and most of the CIs passed. Analysis of the failed CIs:

  1. AMD-related CIs: these two issues are known CI errors, confirmed by Bingxu Chen.
  2. XPU-related CIs: AttributeError: module 'torch.xpu' has no attribute 'graph_pool_handle'; this error is unrelated to this PR.
  3. GPU CI: HttpError in wait-for-stage-b; I also believe this error is unrelated to this PR.

Please let me know if there are any other issues.

@sglang-npu-bot sglang-npu-bot merged commit 5297b02 into sgl-project:main Mar 7, 2026
111 of 119 checks passed
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
…pport (sgl-project#17996)

Co-authored-by: ronnie_zheng <zl19940307@163.com>
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
…pport (sgl-project#17996)

Co-authored-by: ronnie_zheng <zl19940307@163.com>
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
…pport (sgl-project#17996)

Co-authored-by: ronnie_zheng <zl19940307@163.com>
