
[diffusion] Add Sage Attention 3 Support for sm 120 (RTX5090)#15382

Merged
mickqian merged 15 commits into sgl-project:main from ryang-max:diffusion5090_1
Dec 19, 2025

Conversation

@ryang-max
Contributor

@ryang-max ryang-max commented Dec 18, 2025

Motivation

Support SGLang Diffusion in RTX 5090.

Modifications

Since Sage Attention 3 already supports the RTX 5090, we select it as the default attention backend on sm120 devices.
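For context, sm120 is the compute capability reported by Blackwell consumer GPUs such as the RTX 5090. A minimal sketch of such a check, assuming the helper names for illustration (the real detection lives in SGLang's platform layer, and a live run would feed it `torch.cuda.get_device_capability()`):

```python
# Hedged sketch: these helper names are illustrative, not SGLang's exact API.
def sm_arch(capability: tuple) -> str:
    """Format a (major, minor) compute capability as an sm_XY arch string."""
    major, minor = capability
    return f"sm_{major}{minor}"

def is_sm120(capability: tuple) -> bool:
    """An RTX 5090 reports compute capability (12, 0), i.e. sm_120."""
    return tuple(capability) == (12, 0)
```

For example, `is_sm120((12, 0))` is True for an RTX 5090, while an H100 reporting `(9, 0)` would fall through to the other backends.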

Accuracy Tests

Tested with:

  • black-forest-labs/FLUX.1-dev
  • Wan-AI/Wan2.1-I2V-14B-480P-Diffusers

Benchmarking and Profiling

For image models, there is no significant acceleration compared with torch_sdpa.

For video models:

```shell
sglang generate --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
    --prompt "A curious raccoon" \
    --save-output
```

Comparison

| Metric | sage_attn_3 (default) | torch_sdpa |
| --- | --- | --- |
| Average time per step (s/step) | 1.8552 | 3.2210 |
| Total DenoisingStage time (s) | 94.1848 | 162.5129 |
| Speedup (torch_sdpa / sage_attn_3) | 1.74× | 1.00× |
| Time reduction (vs. torch_sdpa) | 42.4% | 0% |
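The derived rows follow directly from the measured per-step times; as a quick sanity check:

```python
# Recompute the table's speedup and time-reduction rows from the measured
# per-step times of the Wan2.1-T2V-1.3B run above.
sage_step = 1.8552  # s/step with sage_attn_3
sdpa_step = 3.2210  # s/step with torch_sdpa

speedup = sdpa_step / sage_step        # wall-time speedup of sage_attn_3
reduction = 1 - sage_step / sdpa_step  # fraction of per-step time saved
print(f"speedup={speedup:.2f}x, reduction={reduction:.1%}")
# → speedup=1.74x, reduction=42.4%
```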

Limitation
As mentioned in the SageAttention official repo, running all steps with SageAttn3 may introduce some accuracy loss in inference. I also observed this in my experiments; it will be fixed with hybrid attention in a follow-up PR.

Checklist


@github-actions github-actions Bot added the diffusion SGLang Diffusion label Dec 18, 2025
Comment thread python/sglang/multimodal_gen/runtime/server_args.py Outdated
```python
if model is not None:
    model.to("cpu")
    logger.info(
        "Offloaded denoiser transformer weights to CPU after denoising to reduce peak VRAM during VAE decoding."
    )
```
Collaborator


should we set dit_cpu_offload?

Contributor Author


dit_cpu_offload was used to swap the transformer in and out during the denoising process, which is useful for all serving modes. Here, though, the goal is to offload all transformers after the denoising stage (they won't be used again in offline mode), so I think is_local_mode would be better.
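A minimal sketch of the post-denoising offload under discussion, assuming a torch module and a local-mode flag (the function and flag names are illustrative, not SGLang's exact API):

```python
import torch

def offload_after_denoising(model, is_local_mode: bool) -> None:
    """Move DiT weights to CPU after denoising so VAE decoding gets the VRAM.

    Only applied in local (offline) mode, where the transformer will not be
    needed again for a subsequent request.
    """
    if is_local_mode and model is not None:
        model.to("cpu")
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # return the freed cached blocks to the driver
```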

Comment thread python/sglang/multimodal_gen/runtime/platforms/cuda.py Outdated
@mickqian
Collaborator

/rerun-failed-ci

@mickqian
Collaborator

/tag-and-rerun-ci

@mickqian mickqian merged commit 1e58248 into sgl-project:main Dec 19, 2025
164 of 174 checks passed
@IPostYellow
Contributor

Hi @ryang-max, could you share your launch cmd?

@ryang-max
Contributor Author

ryang-max commented Dec 22, 2025

> Hi @ryang-max, could you share your launch cmd?

Hi @IPostYellow, just use the default command described in the blog to start it. If you hit OOM, please try a smaller image size. Currently we have tested flux1 and wan2.1-t2v-1.3B at 480p, and we're actively working on optimizing memory usage and parallelism on the 5090.

@IPostYellow
Contributor

IPostYellow commented Dec 22, 2025

> Hi @ryang-max, could you share your launch cmd?
>
> Hi @IPostYellow, just use the default command described in the blog to start it. If you hit OOM, please try a smaller image size. Currently we have tested flux1 and wan2.1-t2v-1.3B at 480p, and we're actively working on optimizing memory usage and parallelism on the 5090.
@ryang-max thank you for your reply.
During my experiments with qwen-image, I encountered the following error:
`RuntimeError: The size of tensor a (28) must match the size of tensor b (4) at non-singleton dimension 1`
I guess this is because sage3 does not support Qwen2.5-VL, but in sglang/multimodal_gen/runtime/platforms/cuda.py it seems every attention backend will be set to sage3 on the 5090 if selected_backend == None, due to:

```python
if is_sm120():
    try:
        from sglang.multimodal_gen.runtime.layers.attention.backends.sage_attn3 import (  # noqa: F401
            SageAttention3Backend,
        )

        logger.info("Using Sage Attention 3 backend")
        return "sglang.multimodal_gen.runtime.layers.attention.backends.sage_attn3.SageAttention3Backend"
    except ImportError as e:
        logger.info(e)
        logger.info(
            "Sage Attention 3 backend is not installed, Falling back to Torch SDPA (To install it, see https://github.com/thu-ml/SageAttention/tree/main/sageattention3_blackwell#installation)"
        )
        target_backend = AttentionBackendEnum.TORCH_SDPA
```

@ryang-max
Contributor Author

Hi @IPostYellow, yes Qwen-Image has some issues; we're working on fixing them. You can try flux.1-dev or Wan-AI/Wan2.1-I2V-14B-480P-Diffusers for now.

@ryang-max ryang-max deleted the diffusion5090_1 branch December 23, 2025 01:09
Prozac614 pushed a commit to Prozac614/sglang that referenced this pull request Dec 23, 2025
…RTX5090) (sgl-project#15382)

Co-authored-by: Mengxi Li <marcyleemx@gmail.com>
jiaming1130 pushed a commit to zhuyijie88/sglang that referenced this pull request Dec 25, 2025
…RTX5090) (sgl-project#15382)

Co-authored-by: Mengxi Li <marcyleemx@gmail.com>
YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026
…RTX5090) (sgl-project#15382)

Co-authored-by: Mengxi Li <marcyleemx@gmail.com>

Labels

diffusion SGLang Diffusion run-ci
