
[diffusion] [AMD] model: allow AITER backends in Flux 2 pipeline #22802

Merged
HaiShaw merged 4 commits into sgl-project:main from avjves:bug/flux_aiter
Apr 22, 2026

Conversation

@avjves (Contributor) commented Apr 14, 2026

Motivation

This PR is related to #22690.
PR #22423 specified the supported attention backends for Flux 2, but only included SDPA and FA, not AITER, which is a significant performance regression for AMD devices.

Modifications

  • Adds AITER and AITER_SAGE as supported attention backends for Flux 2 (a sketch of the change follows)
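
For orientation, here is a minimal, self-contained sketch of the shape of this change, based on the diff hunk quoted in the review below (lines +868 to +869 of flux_2.py); the stand-in enum and any members other than AITER and AITER_SAGE are assumptions, not the actual sglang code:

    from enum import Enum, auto

    # Minimal stand-in for sglang's AttentionBackendEnum; members other than
    # AITER and AITER_SAGE are assumptions for illustration only.
    class AttentionBackendEnum(Enum):
        TORCH_SDPA = auto()
        FA = auto()
        AITER = auto()
        AITER_SAGE = auto()

    # Shape of the change in Flux2Transformer2DModel (flux_2.py, +2/-0):
    _supported_attention_backends = (
        AttentionBackendEnum.TORCH_SDPA,
        AttentionBackendEnum.FA,
        AttentionBackendEnum.AITER,       # added by this PR
        AttentionBackendEnum.AITER_SAGE,  # added by this PR
    )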

Accuracy Tests

With SDPA as backend:
(output image: Add_a_cool_hat_to_the_cat_torch_sdpa)

With AITER as backend:
(output image: Add_a_cool_hat_to_the_cat_aiter)

Run command for both (the first image was generated without the fixes in this PR, the second with them):

sglang generate --model-path black-forest-labs/FLUX.2-dev --height 1024 --width 1024 --ulysses-degree 8 --ring-degree 1 --num-gpus 8 --num-inference-steps 50 --guidance-scale 4.0 --prompt "Add a cool hat to the cat" --dit-cpu-offload False --dit-layerwise-offload False --text-encoder-cpu-offload False --image-encoder-cpu-offload False --vae-cpu-offload False --warmup True --warmup-steps 2 --vae-precision bf16 --output-path /outputs/flux2.default --seed 42 --image-path /app/data/flux_cat.png --enable-torch-compile

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

github-actions (bot) added the diffusion (SGLang Diffusion) label on Apr 14, 2026
avjves changed the title from "[diffusion] allow AITER backends in Flux 2 pipeline" to "[diffusion] model: allow AITER backends in Flux 2 pipeline" on Apr 14, 2026
avjves changed the title from "[diffusion] model: allow AITER backends in Flux 2 pipeline" to "[diffusion] [AMD] model: allow AITER backends in Flux 2 pipeline" on Apr 14, 2026
@gemini-code-assist (Bot) left a comment

Code Review

This pull request adds AITER and AITER_SAGE to the supported attention backends for the Flux2Transformer2DModel. However, feedback indicates that enabling these backends is premature due to critical integration issues, including a tensor layout mismatch in the AITER implementation, missing entries in the Ring Attention whitelist, and signature incompatibilities in the AITER_SAGE forward method.

Comment on lines +868 to +869:

    AttentionBackendEnum.AITER,
    AttentionBackendEnum.AITER_SAGE,

Severity: high

Enabling these backends for Flux2 is premature due to several integration issues in the underlying attention infrastructure:

  1. AITER Layout Mismatch (High Severity): The AITerImpl.forward implementation in aiter.py expects a [batch_size, num_heads, seq_len, head_dim] layout (as explicitly stated in its docstring and comments). However, USPAttention (which Flux2 uses) provides tensors in [batch_size, seq_len, num_heads, head_dim] format for both local and replicated-prefix paths. This will result in incorrect attention computation.
  2. Ring Attention Whitelist (Medium Severity): Both AITER and AITER_SAGE are currently missing from the Ring Attention whitelist in python/sglang/multimodal_gen/runtime/layers/attention/layer.py. Any attempt to use Flux2 with these backends and Ring Attention enabled will trigger a RuntimeError in USPAttention.__init__.
  3. AITER_SAGE Signature (Medium Severity): The AITERSageImpl.forward method does not accept **kwargs. If the Ring Attention whitelist is updated, ring_attn might still fail if it attempts to pass additional parameters (like dropout_p or is_causal) to the implementation.

Please ensure the backends in aiter.py, aiter_sage.py, and the whitelist in layer.py are updated to support the Flux2 pipeline requirements before enabling them here.
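
For readers unfamiliar with the two conventions, here is a minimal, self-contained illustration of the layouts being contrasted above (plain PyTorch with arbitrary shapes, not sglang code):

    import torch

    batch, seq_len, num_heads, head_dim = 2, 16, 8, 64

    # USPAttention-style layout: [batch_size, seq_len, num_heads, head_dim]
    q_bshd = torch.randn(batch, seq_len, num_heads, head_dim)

    # Layout attributed to AITerImpl.forward: [batch_size, num_heads, seq_len, head_dim]
    q_bhsd = q_bshd.transpose(1, 2)  # swap the seq_len and num_heads axes

    assert q_bhsd.shape == (batch, num_heads, seq_len, head_dim)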

@avjves (Contributor, Author) replied:

They are already supported; it's just the Flux DiT that had them disabled. I tested running with AITER just fine.

Collaborator replied:

Hmm, @avjves could you please provide some insight here? It would also be good to include example runs with the aiter and aiter_sage backends in the PR body.

@avjves (Contributor, Author) replied:

Sorry, my response to Gemini here maybe wasn't as clear as it should have been.

  1. The AITER implementation has a wrong docstring. It says "[batch_size, seq_len, head_num, head_dim]", but that layout would crash whenever AITER is used anywhere in the codebase. The seq_len and head_num dimensions are in reality swapped. As such, there is no layout mismatch here. I can change the docstring as well, though it's not really in the scope of this PR.
  2. A check for backends for ring-attn was added in [diffusion] Validate attention backend for Ring Attention in USPAttention #21828, but it doesn't include AITER. It should be added there as well, but again that's not in the scope of this PR, as it's not a Flux 2 issue per se. I try to avoid folding too many "unrelated" fixes into a PR after the fact, to keep the history clear. I do see this adding review overhead, so maybe it is better to bundle everything together and change the scope of the PR afterwards. (A sketch of the whitelist check follows this list.)
  3. This is a newer change, as most attention backends do not take extra arguments. This should be looked at when fixing 2).
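
As a minimal, self-contained sketch of the kind of whitelist check described in point 2 (the set contents, names, and error message are assumptions, not the actual layer.py code):

    # Hypothetical sketch of the Ring Attention backend check added in #21828.
    RING_ATTN_SUPPORTED_BACKENDS = {"FA", "TORCH_SDPA"}  # AITER absent, per point 2

    def validate_ring_attention_backend(backend_name: str) -> None:
        # USPAttention.__init__ reportedly raises when the backend is not whitelisted.
        if backend_name not in RING_ATTN_SUPPORTED_BACKENDS:
            raise RuntimeError(
                f"Attention backend {backend_name!r} is not supported with Ring Attention"
            )

    validate_ring_attention_backend("FA")       # passes
    # validate_ring_attention_backend("AITER")  # would raise RuntimeError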

Sure, I'll add some example outputs as well :)

@avjves (Contributor, Author) replied:

Added the output images as well :)

@avjves (Contributor, Author) replied:

Fixed the docstring, but I think 2) and 3) warrant their own PR.

@HaiShaw (Collaborator) left a comment

@avjves would you please address the review comments?

@bingxche (Collaborator) commented:

@amd-bot ci-status

@amd-bot commented Apr 16, 2026

@bingxche

CI Status for PR #22802

PR: [diffusion] [AMD] model: allow AITER backends in Flux 2 pipeline
Changed files: python/sglang/multimodal_gen/runtime/models/dits/flux_2.py (+2/-0)

AMD: 2 failures (0 likely related) | Others: 1 failure (0 related)

AMD CI Failures

  • Job: multimodal-gen-test-1-gpu-amd (partition 1)
    Test: test_diffusion_generation[fastwan2_2_ti2v_5b] in sglang/multimodal_gen/test/server/test_server_b.py
    Error: GPU Hang (exit code 134)
    Related? 🟢 Unlikely (GPU hardware hang during FastWan2.2-TI2V-5B video generation; unrelated model/codepath to flux_2.py)
  • Job: multimodal-gen-test-2-gpu-amd (partition 0)
    Test: test_diffusion_generation[flux_2_image_t2i_2_gpus] in sglang/multimodal_gen/test/server/test_server_2_gpu_b.py
    Error: RuntimeError: No available kernel in mistral_3.py:138
    Related? 🟢 Unlikely (Mistral 3 text encoder SDPA kernel issue on ROCm, fixed by PR #22690, merged Apr 16, after this CI ran Apr 14)
  • Job: multimodal-gen-test-2-gpu-amd (partition 0)
    Test: test_diffusion_generation[zimage_image_t2i_2_gpus] in sglang/multimodal_gen/test/server/test_server_2_gpu_b.py
    Error: HfHubHTTPError: 504 Gateway Time-out (Z-Image-Turbo download)
    Related? 🟢 Unlikely (HuggingFace Hub 504 timeout; infra issue)
  • Job: multimodal-gen-test-2-gpu-amd (partition 0)
    Test: test_diffusion_generation[flux_image_t2i_2_gpus] in sglang/multimodal_gen/test/server/test_server_2_gpu_b.py
    Error: HfHubHTTPError: 504 Gateway Time-out (FLUX.1-dev download)
    Related? 🟢 Unlikely (HuggingFace Hub 504 timeout; infra issue)

Other CI Failures

  • Job: pr-test-finish
    Test: N/A
    Error: Gate job: upstream call-gate reported failure
    Related? 🟢 Unlikely (gate aggregator; it fails because the AMD multimodal-gen jobs failed)

Details

All failures are unrelated to this PR's change (adding AITER and AITER_SAGE to _supported_attention_backends in Flux2Transformer2DModel):

  1. 1-GPU GPU Hang (fastwan2_2_ti2v_5b): The process crashed with HW Exception by GPU node-2 reason: GPU Hang during video generation with FastVideo/FastWan2.2-TI2V-5B-FullAttn-Diffusers. This is a hardware/infrastructure issue on the MI325X runner, completely unrelated to the Flux 2 model code. 9 of 10 tests in this partition passed before the hang.

  2. 2-GPU Mistral 3 SDPA failure (flux_2_image_t2i_2_gpus): The FLUX.2-dev model loaded successfully, but warmup failed with RuntimeError: No available kernel. Aborting execution. in mistral_3.py:138. On ROCm, SDPBackend.CUDNN_ATTENTION was being incorrectly applied because the platform check only looked at device.type == "cuda" (which is also true for ROCm/HIP). This is a known pre-existing bug fixed by PR [diffusion] model: Properly validate device for Mistral 3 attention #22690 (merged Apr 16); this CI ran on Apr 14, before that fix landed. (See the sketch after this list.)

  3. 2-GPU HuggingFace timeouts (zimage_image_t2i_2_gpus, flux_image_t2i_2_gpus): Both tests failed because HuggingFace Hub returned 504 Gateway Time-out when trying to download model files. Pure infrastructure flake.
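
Below is a minimal, self-contained sketch of the device-check pitfall described in point 2 above (plain PyTorch; not the actual mistral_3.py code, and the actual fix in #22690 may differ):

    import torch

    # On ROCm builds of PyTorch, devices still report device.type == "cuda",
    # so that check alone cannot rule out HIP.
    def is_rocm_build() -> bool:
        # torch.version.hip is a version string on ROCm builds, None on CUDA builds.
        return getattr(torch.version, "hip", None) is not None

    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

    # Buggy pattern as described: true on both CUDA and ROCm.
    looks_like_cuda = device.type == "cuda"

    # Safer: only treat the device as genuine CUDA (where cuDNN attention is
    # available) on a non-HIP build.
    genuine_cuda = looks_like_cuda and not is_rocm_build()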

Verdict: No action needed from the PR author. All failures are pre-existing bugs (now fixed) or infrastructure issues. Re-running CI after rebasing on latest main (which includes #22690) should resolve the flux_2_image_t2i_2_gpus failure.

Generated by amd-bot using Claude Code CLI

@HaiShaw (Collaborator) commented Apr 19, 2026

/tag-and-rerun-ci

@avjves (Contributor, Author) commented Apr 22, 2026

@HaiShaw Is everything now OK with this? Would be great to get it merged ASAP.

@HaiShaw (Collaborator) commented Apr 22, 2026

> @HaiShaw Is everything now OK with this? Would be great to get it merged ASAP.

Can you provide a full run example with the command (which would be useful for others to begin with)?

@avjves (Contributor, Author) commented Apr 22, 2026

> Can you provide a full run example with the command (which would be useful for others to begin with)?

Sure, added in the PR description. Posting here as well:

    sglang generate --model-path black-forest-labs/FLUX.2-dev \
        --height 1024 --width 1024 --ulysses-degree 8 --ring-degree 1 --num-gpus 8 \
        --num-inference-steps 50 --guidance-scale 4.0 --prompt "Add a cool hat to the cat" \
        --dit-cpu-offload False --dit-layerwise-offload False --text-encoder-cpu-offload False \
        --image-encoder-cpu-offload False --vae-cpu-offload False --warmup True --warmup-steps 2 \
        --vae-precision bf16 --output-path /outputs/flux2.default --seed 42 \
        --image-path /app/data/flux_cat.png --enable-torch-compile

The images shown in the description were generated with this command.

HaiShaw merged commit ac351c1 into sgl-project:main on Apr 22, 2026
65 of 69 checks passed

Labels

diffusion (SGLang Diffusion), run-ci
