
[Feature] [NPU] Support for Diffusion VAE Parallel Decoding #20764

@gxxx-hum

Description


Motivation

This issue tracks the validation and support of VAE Parallel Decoding for diffusion models on NPU platforms.

Target Models

The following diffusion models will be validated:

  • Wan2.1
  • Wan2.2
  • Qwen-Image
  • Qwen-Image-Edit

Support Status

✅: already supported (PR attached)
⏳: not yet supported (PR raised)

| Model | VAE Parallel Decoding Status | Related PR | Note |
| --- | --- | --- | --- |
| Wan2.1 | ✅ Verified on NPU | #18179 | GPU implementation works without modification |
| Wan2.2 | ✅ Verified on NPU | #18179 | GPU implementation works without modification |
| Qwen-Image | ⏳ Under continuous optimization | #20757 | Not yet supported; PR raised |
| Qwen-Image-Edit | ⏳ Under continuous optimization | #20974 | Not yet supported; PR raised |

Validation Results

Benchmark and validation results will be updated here.

Test Environment

  • Hardware: Ascend NPU (910B3)
  • Driver: 25.5.0
  • CANN: 8.5.0

Wan2.1

Wan2.1-T2V-1.3B-Diffusers was used as a test case.

Test script:

```shell
curl -sS -X POST "http://localhost:8080/v1/videos" \
  -H "Content-Type: application/json" \
  -H "Authorization: Empty" \
  -d '{
    "prompt": "A calico cat playing a piano on stage",
    "size": "1280x720",
    "num_inference_steps": 50,
    "num_frames": 160
  }'
```
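For scripting the benchmark, the same request can be issued from Python with only the standard library. This is a minimal sketch mirroring the curl call above (endpoint, payload fields, and the `Authorization: Empty` header are all taken verbatim from the test script):

```python
import json
import urllib.request

def build_video_request(base_url="http://localhost:8080"):
    """Build the POST /v1/videos request used in the Wan2.1 test script."""
    payload = {
        "prompt": "A calico cat playing a piano on stage",
        "size": "1280x720",
        "num_inference_steps": 50,
        "num_frames": 160,
    }
    return urllib.request.Request(
        f"{base_url}/v1/videos",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": "Empty"},
        method="POST",
    )

req = build_video_request()
# urllib.request.urlopen(req)  # uncomment once the server below is running
```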

Run command:

```shell
nohup sglang serve \
  --model-path /nas/disk1/Wan2.1-T2V-1.3B-Diffusers \
  --num-gpus 2 \
  --sp-degree 2 \
  --port 8080 \
  --host 0.0.0.0 > sglang.log 2>&1 &
```

Test Results:

| Mode | decoding time (s) | peak memory (GB) | peak allocated memory (GB) | remaining memory at peak (GB) |
| --- | --- | --- | --- | --- |
| use_parallel_decode=true | 34.6884 | 33.67 | 16.15 | 27.29 |
| use_parallel_decode=false | 48.8579 | 40.26 | 24.60 | 20.70 |

Conclusions:

  • Decoding latency reduced by ~29%
  • Peak NPU memory reduced by ~16%
  • Peak allocated memory reduced by ~34%
  • Available memory at peak increased by ~31%
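The percentages above follow directly from the table; a quick arithmetic check (all numbers copied from the Wan2.1 test results):

```python
# Sanity-check the reported Wan2.1 improvements against the raw table rows.
parallel = {"time": 34.6884, "peak": 33.67, "alloc": 16.15, "free": 27.29}
serial   = {"time": 48.8579, "peak": 40.26, "alloc": 24.60, "free": 20.70}

def pct_drop(before, after):
    """Relative reduction going from serial (before) to parallel (after)."""
    return (before - after) / before * 100

latency_drop = pct_drop(serial["time"], parallel["time"])    # ≈29.0%
peak_drop    = pct_drop(serial["peak"], parallel["peak"])    # ≈16.4%
alloc_drop   = pct_drop(serial["alloc"], parallel["alloc"])  # ≈34.4%
# Headroom gain, relative to the serial baseline ("~31%" above, ≈31.8% exactly).
free_gain    = (parallel["free"] - serial["free"]) / serial["free"] * 100
```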

Wan2.2

Test script:

```shell
curl -sS -X POST "http://localhost:8080/v1/videos" \
  -H "Content-Type: application/json" \
  -H "Authorization: Empty" \
  -d '{
    "prompt": "A calico cat playing a piano on stage",
    "size": "256x256"
  }'
```

Run command:

```shell
nohup sglang serve \
  --model-path /nas/disk1/Wan2.2-T2V-A14B-Diffusers \
  --port 8080 \
  --num-gpus 4 \
  --sp-degree 4 \
  --host 0.0.0.0 > sglang.log 2>&1 &
```

Test Results:

| Mode | decoding time (s) | peak memory (GB) | peak allocated memory (GB) | remaining memory at peak (GB) |
| --- | --- | --- | --- | --- |
| use_parallel_decode=true | 1.2839 | 59.31 | 55.39 | 1.64 |
| use_parallel_decode=false | 1.6009 | 58.88 | 55.37 | 2.08 |

Conclusions:

  • Decoding latency reduced by ~20%
  • Memory usage shows little improvement

Qwen-Image

Test Results:

| Branch | decoding time (s) | peak GPU memory (GB) | peak allocated memory (GB) | memory pool overhead (GB) | remaining GPU memory at peak (GB) |
| --- | --- | --- | --- | --- | --- |
| main | 0.2859 | 58.39 | 43.22 | 15.17 | 2.57 |
| #20757 | 0.4124 | 54.57 | 42.62 | 11.95 | 6.39 |

Conclusions:

  • Peak NPU memory reduced by ~6.5%
  • Peak allocated memory reduced by ~1.4%
  • Available memory at peak increased by ~148.6%
  • Decoding latency increases (0.2859 s → 0.4124 s), though the absolute impact on end-to-end generation latency is small
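The Qwen-Image table is internally consistent: the memory pool overhead column is exactly peak GPU memory minus peak allocated memory, and the headroom figure follows from the last column:

```python
# Cross-check the Qwen-Image table rows (numbers copied from above).
main_branch = {"peak": 58.39, "alloc": 43.22, "pool": 15.17, "free": 2.57}
pr_20757    = {"peak": 54.57, "alloc": 42.62, "pool": 11.95, "free": 6.39}

# Memory pool overhead = peak GPU memory - peak allocated memory.
for row in (main_branch, pr_20757):
    assert abs(row["peak"] - row["alloc"] - row["pool"]) < 0.01

# Available memory at peak: 2.57 GB -> 6.39 GB, the reported ~148.6% gain.
headroom_gain = (pr_20757["free"] - main_branch["free"]) / main_branch["free"] * 100
```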

Qwen-Image-Edit

The main branch still has an issue with Qwen-Image-Edit on NPU when sequence parallelism (SP) is enabled; a PR with the functional fix has already been submitted (#20974).

Test Results:

| Branch | decoding time (s) | peak GPU memory (GB) | peak allocated memory (GB) | memory pool overhead (GB) | remaining GPU memory at peak (GB) |
| --- | --- | --- | --- | --- | --- |
| main | 0.3905 | 58.97 | 45.30 | 13.67 | 1.99 |
| #20974 + #20757 | 0.6066 | 58.36 | 45.30 | 13.06 | 2.60 |

Conclusions:

  • No clear memory improvement yet, but the fix avoids OOM during 1024×1024 image generation.

Planned Optimizations

The following optimizations may be explored after initial support is validated:

  1. Add multi-batch support for the decoding stage to improve throughput and resource utilization. #18764
  2. Introduce a threshold-based strategy to selectively enable parallel VAE decoding, and consider a VAE patch-parallel design to further improve concurrency. #23248
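As an illustration of the patch-parallel idea in item 2 — this is a sketch only, not the sglang implementation: `fake_vae_decode`, the 8× scale factor, and the 1-row halo are all stand-in assumptions. The latent is split along height, each patch gets a halo equal to the decoder's receptive field, patches are decoded independently (in a real system, each on its own NPU), the halo is cropped in pixel space, and the results are stitched back together:

```python
import numpy as np

SCALE = 8  # latent -> pixel upsampling factor (stand-in value)
HALO = 1   # receptive field of the stand-in decoder, in latent rows

def blur_rows(x):
    """3-row box filter along height, zero-padded (stand-in for conv layers)."""
    padded = np.pad(x, ((1, 1), (0, 0)))
    return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0

def fake_vae_decode(latent):
    """Stand-in VAE decoder: local mixing + nearest-neighbour 8x upsample."""
    return np.repeat(np.repeat(blur_rows(latent), SCALE, axis=0), SCALE, axis=1)

def patch_parallel_decode(latent, num_patches):
    """Split the latent along height, decode each patch with a HALO-row
    overlap, crop the halo in pixel space, and stitch the outputs."""
    h = latent.shape[0]
    bounds = np.linspace(0, h, num_patches + 1, dtype=int)
    outputs = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        lo, hi = max(a - HALO, 0), min(b + HALO, h)
        decoded = fake_vae_decode(latent[lo:hi])  # would run on its own NPU
        outputs.append(decoded[(a - lo) * SCALE:(b - lo) * SCALE])
    return np.concatenate(outputs, axis=0)
```

Because the halo covers the decoder's receptive field, the stitched result matches a single full decode exactly. A threshold-based strategy (item 2) would simply fall back to the full decode whenever the latent height is below some minimum, since patch overhead dominates for small inputs.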
