## Motivation
This issue tracks the validation and support of VAE Parallel Decoding for diffusion models on NPU platforms.
## Target Models
The following diffusion models will be validated:
- Wan2.1
- Wan2.2
- Qwen-Image
- Qwen-Image-Edit
## Support Status
✅: already supported, PR attached
⏳: not supported yet, with PR raised
| Model | VAE Parallel Decoding | Status | Related PR | Note |
| --- | --- | --- | --- | --- |
| Wan2.1 | ✅ | Verified on NPU | #18179 | GPU implementation works without modification |
| Wan2.2 | ✅ | Verified on NPU | #18179 | GPU implementation works without modification |
| Qwen-Image | ⏳ | Under continuous optimization | #20757 | not supported yet, with PR raised |
| Qwen-Image-Edit | ⏳ | Under continuous optimization | #20974 | not supported yet, with PR raised |
## Validation Results
Benchmark and validation results will be updated here.
### Test Environment
- Hardware: Ascend NPU (910B3)
- Driver: 25.5.0
- CANN: 8.5.0
### Wan2.1
Wan2.1-T2V-1.3B-Diffusers was used as a test case.
test script:

```shell
curl -sS -X POST "http://localhost:8080/v1/videos" \
  -H "Content-Type: application/json" \
  -H "Authorization: Empty" \
  -d '{
    "prompt": "A calico cat playing a piano on stage",
    "size": "1280x720",
    "num_inference_steps": 50,
    "num_frames": 160
  }'
```
run command:

```shell
nohup sglang serve \
  --model-path /nas/disk1/Wan2.1-T2V-1.3B-Diffusers \
  --num-gpus 2 \
  --sp-degree 2 \
  --port 8080 \
  --host 0.0.0.0 > sglang.log 2>&1 &
```
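The idea behind parallel VAE decoding can be illustrated with a simplified, framework-free sketch: the latent frames are split into contiguous chunks, each device decodes its own chunk, and the results are concatenated. The function names (`split_frames`, `decode_chunk`) are illustrative only, not sglang APIs.

```python
def split_frames(num_frames: int, world_size: int) -> list[range]:
    """Assign a contiguous range of frame indices to each rank."""
    base, rem = divmod(num_frames, world_size)
    ranges, start = [], 0
    for rank in range(world_size):
        size = base + (1 if rank < rem else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges

def decode_chunk(frames: range) -> list[str]:
    """Stand-in for the per-device VAE decode of one chunk."""
    return [f"frame_{i}" for i in frames]

# With --sp-degree 2 and 160 frames, the work is split 80/80
# across the two devices, then gathered in order.
chunks = split_frames(160, 2)
decoded = [f for r in chunks for f in decode_chunk(r)]
```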
Test Results:
| Mode | decoding time (s) | peak memory (GB) | peak allocated memory (GB) | remaining memory at peak (GB) |
| --- | --- | --- | --- | --- |
| use_parallel_decode=true | 34.6884 | 33.67 | 16.15 | 27.29 |
| use_parallel_decode=false | 48.8579 | 40.26 | 24.60 | 20.70 |
Conclusions:
- Decoding latency reduced by ~29%
- Peak NPU memory reduced by ~16%
- Peak allocated memory reduced by ~34%
- Available memory at peak increased by ~31%
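The percentages above can be recomputed directly from the raw table numbers; a small sanity-check sketch:

```python
def pct_change(parallel: float, baseline: float) -> float:
    """Signed percentage change of the parallel run vs. the baseline."""
    return (parallel - baseline) / baseline * 100

latency = pct_change(34.6884, 48.8579)   # decoding latency, ~ -29%
peak_mem = pct_change(33.67, 40.26)      # peak NPU memory, ~ -16%
peak_alloc = pct_change(16.15, 24.60)    # peak allocated memory, ~ -34%
remaining = pct_change(27.29, 20.70)     # available memory at peak, ~ +32%
```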
### Wan2.2
test script:

```shell
curl -sS -X POST "http://localhost:8080/v1/videos" \
  -H "Content-Type: application/json" \
  -H "Authorization: Empty" \
  -d '{
    "prompt": "A calico cat playing a piano on stage",
    "size": "256x256"
  }'
```
run command:

```shell
nohup sglang serve \
  --model-path /nas/disk1/Wan2.2-T2V-A14B-Diffusers \
  --port 8080 \
  --num-gpus 4 \
  --sp-degree 4 \
  --host 0.0.0.0 > sglang.log 2>&1 &
```
Test Results:
| Mode | decoding time (s) | peak memory (GB) | peak allocated memory (GB) | remaining memory at peak (GB) |
| --- | --- | --- | --- | --- |
| use_parallel_decode=true | 1.2839 | 59.31 | 55.39 | 1.64 |
| use_parallel_decode=false | 1.6009 | 58.88 | 55.37 | 2.08 |
Conclusions:
- Decoding latency reduced by ~20%
- Memory usage shows no meaningful improvement (peak memory is marginally higher with parallel decoding)
### Qwen-Image
| Branch | decoding time (s) | peak NPU memory (GB) | peak allocated memory (GB) | memory pool overhead (GB) | remaining NPU memory at peak (GB) |
| --- | --- | --- | --- | --- | --- |
| main | 0.2859 | 58.39 | 43.22 | 15.17 | 2.57 |
| #20757 | 0.4124 | 54.57 | 42.62 | 11.95 | 6.39 |
Conclusions:
- Peak NPU memory reduced by ~6.5%
- Peak allocated memory reduced by ~1.4%
- Available memory at peak increased by ~148.6%
- Decoding latency increases (0.29 s → 0.41 s), but the absolute overhead is small
### Qwen-Image-Edit
The main branch still has an issue with Qwen-Image-Edit on NPU when sequence parallelism (sp) is enabled; a PR with the functional fix has already been submitted (#20974).
| Branch | decoding time (s) | peak NPU memory (GB) | peak allocated memory (GB) | memory pool overhead (GB) | remaining NPU memory at peak (GB) |
| --- | --- | --- | --- | --- | --- |
| main | 0.3905 | 58.97 | 45.30 | 13.67 | 1.99 |
| #20974 + #20757 | 0.6066 | 58.36 | 45.30 | 13.06 | 2.60 |
Conclusions:
- No clear improvement in the metrics yet, but the change avoids OOM during 1024×1024 image generation.
## Planned Optimizations
The following optimizations may be explored after initial support is validated:
- Add multi-batch support for the decoding stage to improve throughput and resource utilization. #18764
- Introduce a threshold-based strategy to selectively enable parallel VAE decoding, and consider a VAE patch-parallel design to further improve concurrency. #23248
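As a rough illustration of the threshold-based strategy, parallel decoding could be gated on the total decode workload so that small requests avoid the communication overhead. The function name and threshold value below are hypothetical assumptions for illustration, not sglang code.

```python
def should_parallel_decode(width: int, height: int, num_frames: int = 1,
                           pixel_threshold: int = 512 * 512) -> bool:
    """Enable parallel VAE decoding only when the total pixel workload
    is large enough to amortize the cross-device communication cost.
    The default threshold is an illustrative placeholder."""
    return width * height * num_frames >= pixel_threshold

# A small single image would fall back to the serial decoder,
# while a long high-resolution video would take the parallel path.
small = should_parallel_decode(256, 256)
large = should_parallel_decode(1280, 720, num_frames=160)
```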