
[Feature] [NPU] Support for Diffusion VAE Parallel Decoding #20764

@gxxx-hum

Description


Motivation

This issue tracks the validation and support of VAE Parallel Decoding for diffusion models on NPU platforms.

Target Models

The following diffusion models will be validated:

  • Wan2.1
  • Wan2.2
  • Qwen-Image
  • Qwen-Image-Edit

Support Status

✅: already supported (PR attached)
⏳: not yet supported (PR raised)

| Model | VAE Parallel Decoding Status | Related PR | Note |
| --- | --- | --- | --- |
| Wan2.1 | ✅ Verified on NPU | #18179 | GPU implementation works without modification |
| Wan2.2 | ✅ Verified on NPU | #18179 | GPU implementation works without modification |
| Qwen-Image | ⏳ Under continuous optimization | #20757 | Not yet supported; PR raised |
| Qwen-Image-Edit | ⏳ Under continuous optimization | #20974 | Not yet supported; PR raised |

Validation Results

Benchmark and validation results will be updated here.

Test Environment

  • Hardware: Ascend NPU (910B3)
  • Driver: 25.5.0
  • CANN: 8.5.0

Wan2.1

Wan2.1-T2V-1.3B-Diffusers was used as a test case.

Test script:

```shell
curl -sS -X POST "http://localhost:8080/v1/videos" \
  -H "Content-Type: application/json" \
  -H "Authorization: Empty" \
  -d '{
    "prompt": "A calico cat playing a piano on stage",
    "size": "1280x720",
    "num_inference_steps": 50,
    "num_frames": 160
  }'
```
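For scripting the benchmark, the same request can be issued from Python with only the standard library. This is a minimal sketch mirroring the curl call above (endpoint, payload fields, and the `Authorization: Empty` header are all taken verbatim from the test script):

```python
import json
import urllib.request

def build_video_request(base_url="http://localhost:8080"):
    """Build the POST /v1/videos request used in the Wan2.1 test script."""
    payload = {
        "prompt": "A calico cat playing a piano on stage",
        "size": "1280x720",
        "num_inference_steps": 50,
        "num_frames": 160,
    }
    return urllib.request.Request(
        f"{base_url}/v1/videos",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": "Empty"},
        method="POST",
    )

req = build_video_request()
# urllib.request.urlopen(req)  # uncomment once the server below is running
```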

Run command:

```shell
nohup sglang serve \
  --model-path /nas/disk1/Wan2.1-T2V-1.3B-Diffusers \
  --num-gpus 2 \
  --sp-degree 2 \
  --port 8080 \
  --host 0.0.0.0 > sglang.log 2>&1 &
```

Test Results:

| Mode | decoding time (s) | peak memory (GB) | peak allocated memory (GB) | remaining memory at peak (GB) |
| --- | --- | --- | --- | --- |
| use_parallel_decode=true | 34.6884 | 33.67 | 16.15 | 27.29 |
| use_parallel_decode=false | 48.8579 | 40.26 | 24.60 | 20.70 |

Conclusions:

  • Decoding latency reduced by ~29%
  • Peak NPU memory reduced by ~16%
  • Peak allocated memory reduced by ~34%
  • Available memory at peak increased by ~31%
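The percentages above follow directly from the table; a quick arithmetic check (all numbers copied from the Wan2.1 test results):

```python
# Sanity-check the reported Wan2.1 improvements against the raw table rows.
parallel = {"time": 34.6884, "peak": 33.67, "alloc": 16.15, "free": 27.29}
serial   = {"time": 48.8579, "peak": 40.26, "alloc": 24.60, "free": 20.70}

def pct_drop(before, after):
    """Relative reduction going from serial (before) to parallel (after)."""
    return (before - after) / before * 100

latency_drop = pct_drop(serial["time"], parallel["time"])    # ≈29.0%
peak_drop    = pct_drop(serial["peak"], parallel["peak"])    # ≈16.4%
alloc_drop   = pct_drop(serial["alloc"], parallel["alloc"])  # ≈34.4%
# Headroom gain, relative to the serial baseline ("~31%" above, ≈31.8% exactly).
free_gain    = (parallel["free"] - serial["free"]) / serial["free"] * 100
```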

Wan2.2

Test script:

```shell
curl -sS -X POST "http://localhost:8080/v1/videos" \
  -H "Content-Type: application/json" \
  -H "Authorization: Empty" \
  -d '{
    "prompt": "A calico cat playing a piano on stage",
    "size": "256x256"
  }'
```

Run command:

```shell
nohup sglang serve \
  --model-path /nas/disk1/Wan2.2-T2V-A14B-Diffusers \
  --port 8080 \
  --num-gpus 4 \
  --sp-degree 4 \
  --host 0.0.0.0 > sglang.log 2>&1 &
```

Test Results:

| Mode | decoding time (s) | peak memory (GB) | peak allocated memory (GB) | remaining memory at peak (GB) |
| --- | --- | --- | --- | --- |
| use_parallel_decode=true | 1.2839 | 59.31 | 55.39 | 1.64 |
| use_parallel_decode=false | 1.6009 | 58.88 | 55.37 | 2.08 |

Conclusions:

  • Decoding latency reduced by ~20%
  • Memory usage shows little improvement

Qwen-Image

Test Results:

| Branch | decoding time (s) | peak GPU memory (GB) | peak allocated memory (GB) | memory pool overhead (GB) | remaining GPU memory at peak (GB) |
| --- | --- | --- | --- | --- | --- |
| main | 0.2859 | 58.39 | 43.22 | 15.17 | 2.57 |
| #20757 | 0.4124 | 54.57 | 42.62 | 11.95 | 6.39 |

Conclusions:

  • Peak NPU memory reduced by ~6.5%
  • Peak allocated memory reduced by ~1.4%
  • Available memory at peak increased by ~148.6%
  • Decoding latency increases (0.2859 s → 0.4124 s), though the absolute impact on end-to-end generation latency is small
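The Qwen-Image table is internally consistent: the memory pool overhead column is exactly peak GPU memory minus peak allocated memory, and the headroom figure follows from the last column:

```python
# Cross-check the Qwen-Image table rows (numbers copied from above).
main_branch = {"peak": 58.39, "alloc": 43.22, "pool": 15.17, "free": 2.57}
pr_20757    = {"peak": 54.57, "alloc": 42.62, "pool": 11.95, "free": 6.39}

# Memory pool overhead = peak GPU memory - peak allocated memory.
for row in (main_branch, pr_20757):
    assert abs(row["peak"] - row["alloc"] - row["pool"]) < 0.01

# Available memory at peak: 2.57 GB -> 6.39 GB, the reported ~148.6% gain.
headroom_gain = (pr_20757["free"] - main_branch["free"]) / main_branch["free"] * 100
```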

Qwen-Image-Edit

The main branch still has an issue with Qwen-Image-Edit on NPU when sequence parallelism (SP) is enabled; a PR with the functional fix has already been submitted (#20974).

Test Results:

| Branch | decoding time (s) | peak GPU memory (GB) | peak allocated memory (GB) | memory pool overhead (GB) | remaining GPU memory at peak (GB) |
| --- | --- | --- | --- | --- | --- |
| main | 0.3905 | 58.97 | 45.30 | 13.67 | 1.99 |
| #20974 + #20757 | 0.6066 | 58.36 | 45.30 | 13.06 | 2.60 |

Conclusions:

  • No clear memory improvement yet, but the fix avoids OOM during 1024×1024 image generation.

Planned Optimizations

The following optimizations may be explored after initial support is validated:

  1. Add multi-batch support for the decoding stage to improve throughput and resource utilization. #18764
  2. Introduce a threshold-based strategy to selectively enable parallel VAE decoding, and consider a VAE patch-parallel design to further improve concurrency. #23248
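As an illustration of the patch-parallel idea in item 2 — this is a sketch only, not the sglang implementation: `fake_vae_decode`, the 8× scale factor, and the 1-row halo are all stand-in assumptions. The latent is split along height, each patch gets a halo equal to the decoder's receptive field, patches are decoded independently (in a real system, each on its own NPU), the halo is cropped in pixel space, and the results are stitched back together:

```python
import numpy as np

SCALE = 8  # latent -> pixel upsampling factor (stand-in value)
HALO = 1   # receptive field of the stand-in decoder, in latent rows

def blur_rows(x):
    """3-row box filter along height, zero-padded (stand-in for conv layers)."""
    padded = np.pad(x, ((1, 1), (0, 0)))
    return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0

def fake_vae_decode(latent):
    """Stand-in VAE decoder: local mixing + nearest-neighbour 8x upsample."""
    return np.repeat(np.repeat(blur_rows(latent), SCALE, axis=0), SCALE, axis=1)

def patch_parallel_decode(latent, num_patches):
    """Split the latent along height, decode each patch with a HALO-row
    overlap, crop the halo in pixel space, and stitch the outputs."""
    h = latent.shape[0]
    bounds = np.linspace(0, h, num_patches + 1, dtype=int)
    outputs = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        lo, hi = max(a - HALO, 0), min(b + HALO, h)
        decoded = fake_vae_decode(latent[lo:hi])  # would run on its own NPU
        outputs.append(decoded[(a - lo) * SCALE:(b - lo) * SCALE])
    return np.concatenate(outputs, axis=0)
```

Because the halo covers the decoder's receptive field, the stitched result matches a single full decode exactly. A threshold-based strategy (item 2) would simply fall back to the full decode whenever the latent height is below some minimum, since patch overhead dominates for small inputs.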
