
[diffusion] attention: add AITER Sage attention backend#20178

Merged
mickqian merged 3 commits into sgl-project:main from avjves:feature/aiter_sage_attention_support
Mar 11, 2026

Conversation

@avjves (Contributor) commented Mar 9, 2026

Motivation

Sage attention is currently supported only on NVIDIA GPUs; AMD support is missing. This PR adds Sage attention support for AMD GPUs via AITER.

Modifications

  • Adds a new backend type AITER_SAGE
  • Registers it as a supported attention backend wherever Sage attention is typically supported
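Registering the new backend type amounts to extending the attention backend enum. A minimal sketch of what that looks like (the other member names here are illustrative assumptions, not the actual contents of `AttentionBackendEnum`):

```python
from enum import Enum

# Sketch only: adding AITER_SAGE alongside existing backend members.
# Member names other than AITER_SAGE are assumptions for illustration.
class AttentionBackendEnum(Enum):
    AITER = "aiter"
    SAGE = "sage"
    AITER_SAGE = "aiter_sage"  # new: Sage attention via AITER on AMD GPUs
```

Downstream code can then look the backend up by its string value, e.g. `AttentionBackendEnum("aiter_sage")`.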

Accuracy Tests

Benchmarking and Profiling

Performance Comparison Report

1. High-level Summary

| Metric | Baseline | New | Diff | Status |
| --- | --- | --- | --- | --- |
| E2E Latency | 276887.31 ms | 241404.22 ms | -35483.09 ms (-12.8%) | |
| Throughput | 0.00 req/s | 0.00 req/s | - | - |
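The diff column in the summary can be reproduced directly from the two raw latencies:

```python
# Reproduce the E2E latency diff from the baseline and new measurements.
baseline_ms = 276887.31
new_ms = 241404.22

diff_ms = new_ms - baseline_ms      # absolute change in ms
pct = diff_ms / baseline_ms * 100   # relative change in percent
print(f"{diff_ms:.2f} ms ({pct:.1f}%)")  # -35483.09 ms (-12.8%)
```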

2. Stage Breakdown

| Stage Name | Baseline (ms) | New (ms) | Diff (ms) | Diff (%) | Status |
| --- | --- | --- | --- | --- | --- |
| InputValidationStage | 5.10 | 4.61 | -0.50 | -9.7% | ⚪️ |
| TextEncodingStage | 1461.92 | 1460.04 | -1.88 | -0.1% | ⚪️ |
| ImageEncodingStage | 861.02 | 914.68 | +53.66 | +6.2% | ⚪️ |
| ImageVAEEncodingStage | 4040.19 | 3997.95 | -42.24 | -1.0% | ⚪️ |
| LatentPreparationStage | 0.17 | 0.13 | -0.05 | -26.3% | ⚪️ |
| TimestepPreparationStage | 0.35 | 0.31 | -0.05 | -13.0% | ⚪️ |
| DenoisingStage | 266231.67 | 230743.97 | -35487.70 | -13.3% | 🟢 |
| DecodingStage | 4282.54 | 4278.92 | -3.63 | -0.1% | ⚪️ |

Output videos

AITER:
https://github.com/user-attachments/assets/62ae8e96-85e0-4b3f-b1dc-c15c708b4738

AITER SAGE:
https://github.com/user-attachments/assets/43f64ee6-dba4-45a5-9bca-8302bcfaff22

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@github-actions bot added labels documentation (Improvements or additions to documentation), amd, diffusion (SGLang Diffusion) on Mar 9, 2026
@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the diffusion model's compatibility and performance by introducing native support for Sage attention on AMD GPUs. By implementing the AITER_SAGE backend, the system can now leverage AMD hardware for attention computations, which was previously limited to NVIDIA GPUs. This change not only broadens hardware support but also delivers a substantial improvement in overall inference latency, making the system more efficient.

Highlights

  • AMD GPU Support: Introduced the AITER_SAGE attention backend to enable Sage attention on AMD GPUs, addressing a previous limitation to NV GPUs.
  • Performance Improvement: Achieved a notable 12.8% reduction in end-to-end latency, primarily driven by a 13.3% improvement in the Denoising Stage.
  • New Backend Integration: Added AITER_SAGE as a new attention backend type and integrated it across various configuration and runtime files.


Changelog
  • docs/diffusion/performance/attention_backends.md
    • Documented the new aiter_sage attention backend.
    • Updated the aiter backend's GPU compatibility to include ROCm.
    • Added aiter_sage as a ROCm-supported backend.
  • python/sglang/multimodal_gen/configs/models/adapter/base.py
    • Registered AITER_SAGE as a supported attention backend for adapter architectures.
  • python/sglang/multimodal_gen/configs/models/dits/base.py
    • Registered AITER_SAGE as a supported attention backend for DiT architectures.
  • python/sglang/multimodal_gen/runtime/layers/attention/backends/aiter_sage.py
    • Implemented the AITERSageBackend and AITERSageImpl for AMD GPU attention, utilizing aiter.ops.triton.attention.fav3_sage.
  • python/sglang/multimodal_gen/runtime/models/dits/qwen_image.py
    • Added AITER_SAGE to the set of supported attention backends for the QwenImageDiffusionModel.
  • python/sglang/multimodal_gen/runtime/platforms/interface.py
    • Extended the AttentionBackendEnum to include AITER_SAGE.
  • python/sglang/multimodal_gen/runtime/platforms/rocm.py
    • Configured the ROCm platform to select AITERSageBackend for AITER_SAGE, with dtype validation.
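The ROCm platform change described above (selecting `AITERSageBackend` with dtype validation) could be sketched roughly as follows. This is a hedged illustration, not the actual `rocm.py` code; the function name, dtype set, and return values are assumptions:

```python
# Illustrative sketch of ROCm backend selection with dtype validation.
# Real logic lives in python/sglang/multimodal_gen/runtime/platforms/rocm.py;
# all names here are hypothetical.
AITER_SAGE_SUPPORTED_DTYPES = {"float16", "bfloat16"}

def select_attention_backend(backend: str, dtype: str) -> str:
    if backend == "aiter_sage":
        # Sage-style quantized attention kernels assume half-precision inputs.
        if dtype not in AITER_SAGE_SUPPORTED_DTYPES:
            raise ValueError(
                f"aiter_sage requires fp16/bf16 activations, got {dtype}"
            )
        return "AITERSageBackend"
    return "AITERBackend"
```

Validating the dtype at selection time fails fast with a clear message instead of erroring deep inside the Triton kernel.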
Activity
  • The author has completed the code formatting, documentation updates, and provided accuracy and speed benchmark results.
  • Unit tests are pending.
  • The PR outlines a review process involving Merge Oncalls, CODEOWNERS, and CI tests.

@gemini-code-assist (Contributor) left a comment

Code Review

This pull request introduces support for the AITER_SAGE attention backend, which is targeted for AMD GPUs. The changes are well-contained and consistently applied across documentation, configuration, and implementation files. The new backend implementation correctly handles the optional aiter dependency. My primary feedback is to enhance the robustness of the new AITERSageImpl by adding explicit checks for unsupported attention features like causal masking, dropout, and grouped-query attention. This will prevent silent misconfigurations and make the backend safer to use.

Comment on lines +51 to +58
```python
try:
    from aiter.ops.triton.attention.fav3_sage import fav3_sage_wrapper_func

    self.aiter_sage_attn_fn = fav3_sage_wrapper_func
except ImportError:
    raise ImportError(
        "AITER Sage attention is not available, please update AITER version."
    )
```
Severity: high

The __init__ method accepts several parameters (causal, num_kv_heads, dropout_p, etc.) that are not used by the implementation. This could lead to silent misconfigurations where a user might expect a feature to be active (e.g., causal attention), but it is not applied. To make the implementation more robust, it's better to explicitly check for unsupported parameter values and raise an error.

```python
        if causal:
            raise NotImplementedError(
                "AITER Sage attention backend does not support causal attention."
            )
        if dropout_p > 0.0:
            raise NotImplementedError(
                "AITER Sage attention backend does not support dropout."
            )
        if num_kv_heads is not None and num_kv_heads != num_heads:
            raise NotImplementedError(
                "AITER Sage attention backend does not support Grouped Query Attention."
            )

        try:
            from aiter.ops.triton.attention.fav3_sage import fav3_sage_wrapper_func

            self.aiter_sage_attn_fn = fav3_sage_wrapper_func
        except ImportError:
            raise ImportError(
                "AITER Sage attention is not available, please update AITER version."
            )
```

@yhyang201 (Collaborator)

/tag-and-rerun-ci

@yhyang201 (Collaborator)

/rerun-failed-ci

@yhyang201 (Collaborator)

@mickqian All CI (Nvidia + AMD) passed and PR is approved, ready for merge

— SGLDHelper bot

@mickqian mickqian merged commit c8bbe50 into sgl-project:main Mar 11, 2026
143 of 151 checks passed
liubiyongge pushed a commit to liubiyongge/sglang that referenced this pull request Mar 13, 2026
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026

Labels

amd diffusion SGLang Diffusion documentation Improvements or additions to documentation run-ci


3 participants