
[AMD] Diffusion Support on ROCm #13760

Merged
mickqian merged 27 commits into sgl-project:main from zyzshishui:amd_diffusion
Dec 19, 2025
Conversation

@zyzshishui
Contributor

@zyzshishui zyzshishui commented Nov 22, 2025

Motivation

Support SGLang Diffusion on ROCm.
This is a more fine-grained follow-up to #13492; thanks @sabreshao for the original support!
It also manually merges #13743.

Modifications

  • Make AITer the default attention backend for diffusion on ROCm.

  • Avoid profiler trace overwrites by suffixing trace filenames with rank IDs.

  • Remove the dependency on yunchang and vendor the relevant function locally (installing yunchang on ROCm raises a "torch not found" error), so that sp_degree/ulysses_degree work on ROCm.

  • Bump the openai version to 2.6.1 in all non-default pyproject files to fix bugs caused by an API change.

  • Fix some CI issues.
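The profiler change above amounts to deriving a per-rank output path before exporting a trace. A minimal sketch, assuming a `trace_rank{N}.json` naming scheme (the helper name and exact filename format are illustrative, not the PR's actual code):

```python
import os


def rank_suffixed_trace_path(out_dir, rank=None):
    """Return a per-rank profiler trace path so that concurrent ranks in a
    multi-GPU run no longer overwrite each other's trace files.

    Falls back to the RANK env var (set by torchrun) and then to rank 0.
    """
    if rank is None:
        rank = int(os.environ.get("RANK", "0"))
    return os.path.join(out_dir, f"trace_rank{rank}.json")


# Each rank then exports its trace to its own file, e.g.:
#   prof.export_chrome_trace(rank_suffixed_trace_path("/tmp/profiles"))
```

With this scheme, eight ranks produce `trace_rank0.json` through `trace_rank7.json` instead of racing to write a single file.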

Note

  • Ring Attention is not yet supported on ROCm; it may be added in the future.

Accuracy Tests

Tested on MI300 GPUs.

Benchmarking and Profiling

Checklist

@github-actions github-actions Bot added documentation Improvements or additions to documentation amd dependencies Pull requests that update a dependency file diffusion SGLang Diffusion labels Nov 22, 2025
@gemini-code-assist
Contributor

Summary of Changes

Hello @zyzshishui, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands Sglang Diffusion's compatibility by integrating full support for AMD's ROCm platform. The changes enable efficient execution of diffusion models on AMD GPUs, primarily by adopting AITer as the default attention backend and implementing robust data type handling. It also streamlines the development environment setup with a new ROCm-specific Dockerfile and improves stability by localizing external dependencies and preventing profiler data conflicts. The overall impact is a more versatile and performant Sglang Diffusion for a broader range of hardware.

Highlights

  • ROCm Support for Sglang Diffusion: This pull request introduces comprehensive support for Sglang Diffusion on ROCm-enabled AMD GPUs, allowing the framework to leverage AMD hardware for diffusion workloads. This includes a new Dockerfile for building ROCm diffusion images and specific environment variable configurations.
  • AITer as Default Attention Backend on ROCm: The AITer backend is now set as the default attention mechanism for Sglang Diffusion when running on ROCm. This is intended to optimize performance and compatibility on AMD platforms.
  • Automatic Data Type Casting for AITer: The AITer backend now transparently casts fp32 Query, Key, and Value (QKV) tensors to bf16/fp16 and casts the outputs back to the original dtype. This resolves issues where CLIP/image encoders would crash due to unsupported data types.
  • Profiler Trace Overwrite Prevention: Profiler traces are now suffixed with rank IDs to prevent overwrites in multi-GPU or distributed environments, ensuring unique and accurate profiling data for each process.
  • Decoupling from Yunchang Library: The dependency on the yunchang library for sequence parallelism (sp_degree/ulysses_degree) has been removed. Relevant functions have been localized within the SGLang codebase to avoid torch not found errors when installing yunchang on ROCm.
  • Documentation Updates for ROCm: Installation documentation has been updated to include a dedicated ROCm quickstart guide, providing clear instructions for AMD Instinct/ROCm users on kernel builds and attention backend settings validated on MI300X GPUs.
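The automatic dtype casting described above can be sketched as a thin wrapper around the attention call; `attn_fn` below stands in for the AITer kernel, and the wrapper mirrors the PR's description rather than its exact code:

```python
import torch


def attention_with_dtype_cast(q, k, v, attn_fn, compute_dtype=torch.bfloat16):
    """Run `attn_fn` on QKV, casting fp32 inputs down to a dtype the kernel
    supports and casting the output back to the caller's original dtype.

    This is the pattern that lets fp32 CLIP/image-encoder layers use a
    bf16/fp16-only attention backend without crashing.
    """
    orig_dtype = q.dtype
    if orig_dtype == torch.float32:
        q, k, v = (t.to(compute_dtype) for t in (q, k, v))
    out = attn_fn(q, k, v)
    # Restore the caller's dtype so downstream layers see what they expect.
    return out.to(orig_dtype) if out.dtype != orig_dtype else out
```

The cast is transparent to callers: they pass fp32 tensors in and get fp32 tensors back, while the kernel only ever sees bf16/fp16.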
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for the GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces ROCm support for SGLang Diffusion, a significant step towards broader hardware compatibility. The changes include a new Dockerfile for ROCm, the integration of the AITer attention backend, and various code modifications to ensure compatibility and remove problematic dependencies on ROCm. My review has identified a couple of critical issues—one in the Dockerfile that would cause build failures and another in the AITer backend implementation that could lead to runtime errors. I have also provided suggestions to improve Dockerfile efficiency and documentation clarity. Overall, this is a valuable contribution that enables SGLang Diffusion on AMD hardware.

Comment thread docker/rocm.diffusion.Dockerfile Outdated
Comment thread python/sglang/multimodal_gen/runtime/layers/attention/backends/aiter.py Outdated
Comment thread docker/rocm.diffusion.Dockerfile Outdated
Comment thread python/sglang/multimodal_gen/docs/cli.md Outdated
@zyzshishui zyzshishui force-pushed the amd_diffusion branch 2 times, most recently from 1a9891b to 586b79b on November 22, 2025 at 08:26
@hubertlu-tw
Collaborator

hubertlu-tw commented Nov 23, 2025

/tag-and-rerun-ci 11/26

@sunxxuns sunxxuns added run-ci and removed run-ci labels Nov 27, 2025
@sabreshao
Contributor

Automatic Data Type Casting for AITer: I suggest falling back to SDPA instead of AITer in CLIP and other non-DiT models, to avoid image incorrectness.

@zhaochenyang20
Collaborator

You are the GOAT!

@guapisolo
Contributor

/rerun-failed-ci

zyzshishui and others added 27 commits December 19, 2025 03:43
Co-authored-by: Sabre Shao <sabre.shao@amd.com>
Co-authored-by: Yusheng (Ethan) Su <yushengsu.thu@gmail.com>
Co-authored-by: Hubert Lu <Hubert.Lu@amd.com>
Co-authored-by: Hubert Lu <Hubert.Lu@amd.com>
- Fix GPU OOM in sequential tests on ROCm/AMD with explicit memory cleanup
- Skip Ring Attention tests on AMD/ROCm (unsupported)
- Fix SGLANG_TEST_OUTPUT_SIZE not applied to actual test requests
- Add MIOpen kernel caching for AMD VAE performance
- Add diagnostics for HF cache and system resources
- Add disk cleanup for non-persistent HF cache between tests
- Enable all diffusion tests including LoRA (except FLUX.2 on 1-GPU)
The Docker image contains pre-compiled AITER kernels at
/sgl-workspace/aiter/aiter/jit/ which may be incompatible.
Clear them before running tests to force fresh JIT compilation.
if self.height is None:
    self.height_not_provided = True

# Allow env var to override num_inference_steps (for faster CI testing on AMD)
Collaborator


Please fix it in a follow-up PR; this can be passed via sampling params.

@zyzshishui
Contributor Author

Automatic Data Type Casting for AITer: I suggest falling back to SDPA instead of AITer in CLIP and other non-DiT models, to avoid image incorrectness.

Could you please provide more detail about this? Under which circumstances would this cause an issue? Thanks!

@mickqian mickqian mentioned this pull request Jan 8, 2026