Teammates: @Makcum888e @OrangeRedeng @e-martirosian @ssshinigami

## Foundational Capability Development

Initial NPU support for SGLang-Diffusion was added in #13662.

- Support NPU
  - [x] support `NPUPlatformBase` (`python/sglang/multimodal_gen/runtime/platforms/npu.py`)
  - [ ] support models
    - [x] Wan/Qwen-Image/FLUX #13662
    - [x] MOVA @LLThomas #21633
    - [x] Hunyuan3D @e-martirosian #20352
      - [ ] rasterize_image_npu
    - [ ] BAGEL
    - [x] GLM-Image
    - [ ] ERNIE-Image
    - [ ] Hunyuan-Image/Video
    - [ ] GPT-Image 2.0
    - [ ] Happy Horse 1.0
- Tools
  - [x] profiling #17807
  - [x] benchmark #18907
- Documentation
  - [x] get started #18894
  - [x] basic usage
  - [ ] advanced features
  - [ ] developer guide
  - [ ] cookbook
- CI/CD & Observability
  - [ ] PR-test design (including test cases and their run time)
  - [ ] nightly-test design (including test cases and their run time)
  - [ ] Improve the [CI monitor](https://github.com/sgl-project/sglang/actions/workflows/ci-monitor.yml) workflow
  - [ ] More tests
    - [ ] Different parallelism
      - [ ] ulysses+ring+tp
      - [ ] ulysses+ring
      - [ ] tp+ring
    - [ ] Models
      - [ ] accuracy test
      - [ ] module test
      - [ ] v-bench

## Performance Optimization

> P0 means high priority, P1 means medium priority, P2 means low priority.

- Graph optimization
  - [ ] [P1] support ACLGraph [in progress] @Alisehen
  - [x] support torch.compile @Alisehen #20687
- Advanced attention
  - [x] USPAttention
  - [x] LocalAttention
  - [x] UlyssesAttention
  - [ ] [P2] UlyssesAttention_VSA
  - [ ] [P2] Video Sparse Attention
  - [x] MinimalA2AAttnOp
  - [ ] [P2] Sparse Linear attention backend
  - [ ] [P2] Sage Sparse Linear attention backend
  - [ ] [P2] SAGEAttention
  - [ ] [P0] RainFusion Attention @Napkin-AI @Svoloch2940194
  - [ ] Block Sparse Attention
    - [ ] [P0] Initial support @Napkin-AI @Svoloch2940194
    - [ ] [P1] Improved version support
  - [ ] Laser Attention
    - [ ] [P0] Initial support @Napkin-AI @Svoloch2940194
    - [ ] [P1] Improved version support
  - [ ] Ulysses Anything Attention
    - [ ] Support cross-attention skipping for Ulysses SP
- Cache
  - [x] Cache-DiT
  - [x] TeaCache
  - [ ] [P1] AttentionCache
- Memory optimization
  - [x] CPU offload (needs support for quantized models)
- Parallelism
  - [x] TP
  - [x] Sequence Parallel (SP)
  - [x] DP
  - [ ] Image-model SP (with padding)
  - [x] VAE decoding parallelism @gxxx-hum #20764
  - [ ] VAE tiled parallel encode
  - [ ] Pipeline-stage parallelism
  - [x] Ring Sequence Parallel #20013 #21383 #20998
  - [x] Ulysses Sequence Parallel
  - [x] CFG parallelism
  - [ ] Expert Parallelism
- Other optimizations
  - [ ] Support fast_layernorm from MindIE-SD @Napkin-AI @Svoloch2940194
  - [ ] Support BF16 datatype in the VAE
  - [ ] NZ format support for MoE
- Quantization and compression @OrangeRedeng
  - [x] [modelslim](https://gitcode.com/Ascend/msmodelslim/tree/master/example/multimodal_sd)
    - [x] Wan2.2 #17996
    - [ ] Wan2.1 (needs testing with #17996)
    - [ ] SD3
    - [ ] Open-Sora-Plan v1.2
    - [ ] Flux.1
    - [ ] HunyuanVideo
    - [ ] Qwen-Image-Edit
  - [ ] [nunchaku](https://github.com/nunchaku-ai/nunchaku)
  - [ ] [TurboDiffusion](https://github.com/thu-ml/TurboDiffusion)
- Scheduling & Serving
  - [x] ComfyUI web serving
  - [x] LoRA support
  - [ ] Batch scheduler

## Advanced features

- [ ] Real-time diffusion
- [x] Postprocessing
  - [x] Upscaling #18327
  - [x] Frame interpolation #19384

## Refactoring

- [ ] Refactor custom ops
  - [ ] Refactor rotary embedding: custom op for RoPE
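For contributors picking up the `NPUPlatformBase` item, a minimal sketch of the kind of device-platform abstraction involved may help. All class and method names below are illustrative assumptions, not the actual SGLang API; a real implementation lives in `python/sglang/multimodal_gen/runtime/platforms/npu.py`:

```python
from abc import ABC, abstractmethod


class PlatformBase(ABC):
    """Hypothetical device-platform base class (names are illustrative,
    not the real SGLang interface)."""

    device_type: str = "cpu"

    @abstractmethod
    def is_available(self) -> bool:
        """Return True if this device backend can be used."""

    def device_name(self, index: int = 0) -> str:
        # e.g. "npu:0", mirroring torch-style device strings
        return f"{self.device_type}:{index}"


class NPUPlatformBase(PlatformBase):
    device_type = "npu"

    def is_available(self) -> bool:
        # A real implementation would probe the Ascend runtime;
        # here we just check whether torch_npu is importable.
        try:
            import torch_npu  # noqa: F401
            return True
        except ImportError:
            return False


npu = NPUPlatformBase()
print(npu.device_name(0))  # "npu:0"
```

The point of the abstraction is that model and runtime code asks the platform object for device strings and capability checks instead of hard-coding `cuda`, so adding a backend reduces to subclassing the base.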
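The caching items above (TeaCache, AttentionCache) share a common idea: skip the expensive transformer call at a diffusion timestep and reuse a cached residual while the accumulated relative change of the inputs stays small. A hedged sketch of that reuse policy, with made-up names and a plain-Python L1 metric standing in for the real tensor math:

```python
class ResidualCache:
    """Illustrative TeaCache-style skip policy (not the SGLang code):
    reuse the cached residual until accumulated input drift exceeds
    a threshold, then force a recompute."""

    def __init__(self, threshold: float = 0.1):
        self.threshold = threshold
        self.accumulated = 0.0          # drift accumulated since last recompute
        self.prev_input = None
        self.cached_residual = None     # set by the caller after a real forward

    def should_reuse(self, x: list[float]) -> bool:
        if self.prev_input is None or self.cached_residual is None:
            # First step, or nothing cached yet: must compute.
            self.prev_input = x
            return False
        # Relative L1 change between consecutive timestep inputs.
        num = sum(abs(a - b) for a, b in zip(x, self.prev_input))
        den = sum(abs(b) for b in self.prev_input) or 1.0
        self.accumulated += num / den
        self.prev_input = x
        if self.accumulated < self.threshold:
            return True                 # drift still small: reuse cache
        self.accumulated = 0.0          # drift too large: recompute
        return False


cache = ResidualCache(threshold=0.05)
print(cache.should_reuse([1.0, 2.0]))      # False: first step, must compute
cache.cached_residual = [0.1, 0.2]         # caller stores residual after forward
print(cache.should_reuse([1.001, 2.001]))  # True: tiny change, reuse cache
```

Tuning the threshold trades speed for fidelity: a larger value skips more timesteps but lets more drift accumulate before recomputing.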