[ROCm][Inductor] Enable pipelining for FlexAttention #176676
nithinsubbiah wants to merge 3 commits into pytorch:main
Conversation
Adds an additional tile size `256` to the tuning config for FlexAttention with the Triton backend on AMD. This provides a significant performance boost (~2x) across a broad range of shapes, particularly for larger sequence lengths. The boost is realized when developers pass the `max_autotune=True` option to `torch.compile`.
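As a toy illustration of why adding a larger tile-size candidate can pay off under autotuning, here is a sketch of a candidate sweep. This is not Inductor's actual autotuner: `max_autotune` benchmarks compiled Triton kernels, while the candidate list and cost model below are invented for this example.

```python
# Toy autotune sweep (illustrative only; Inductor's max_autotune actually
# times compiled Triton kernels rather than evaluating a cost model).

CANDIDATE_BLOCK_SIZES = [32, 64, 128, 256]  # 256 is the newly added candidate


def simulated_time_us(block_m: int, seq_len: int) -> float:
    """Synthetic cost model: per-tile launch overhead plus a fixed compute cost."""
    num_tiles = -(-seq_len // block_m)      # ceiling division
    return 2.0 * num_tiles + 0.01 * seq_len


def pick_block_size(seq_len: int) -> int:
    # Autotuning keeps whichever candidate measures fastest.
    return min(CANDIDATE_BLOCK_SIZES, key=lambda b: simulated_time_us(b, seq_len))


print(pick_block_size(16384))  # prints 256: larger tiles amortize per-tile overhead
```

Under this made-up model, long sequences favor the new `256` tile because fewer, larger tiles amortize per-tile overhead, mirroring why the extra candidate helps at larger sequence lengths.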
Setting `num_stages=2` in the config enables double buffering, which yields a ~20-25% improvement in FlexAttention performance.
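A toy cycle-count model (purely illustrative; the load/compute costs are invented, and Triton's `num_stages` controls the real stage buffering in the generated kernel) shows why overlapping the next tile's load with the current tile's compute helps:

```python
# Illustrative cycle model of double buffering; numbers are made up.

def single_stage_cycles(n_tiles: int, load: int, compute: int) -> int:
    # num_stages=1: each tile's load must complete before its compute runs.
    return n_tiles * (load + compute)


def double_buffered_cycles(n_tiles: int, load: int, compute: int) -> int:
    # num_stages=2: while tile i computes, tile i+1 loads, so only the
    # first load and the last compute are fully exposed.
    return load + (n_tiles - 1) * max(load, compute) + compute


single = single_stage_cycles(64, 2, 10)     # 64 * 12 = 768 cycles
double = double_buffered_cycles(64, 2, 10)  # 2 + 63 * 10 + 10 = 642 cycles
print(round(single / double, 2))            # prints 1.2
```

With these arbitrary costs the model lands near a 1.2x speedup, the same ballpark as the ~20-25% improvement reported above, though the real gain depends on actual memory and compute latencies.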
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/176676
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (9 unrelated failures.) As of commit 2eac6ac with merge base c1943cf. FLAKY: the following jobs failed, but were likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "topic: not user facing"
Signed-off-by: nithinsubbiah <nithinsubbiah@gmail.com>
Force-pushed from 5cee8c7 to 2eac6ac
I removed the addition of 256 to
@pytorchbot merge
Pull workflow has not been scheduled for the PR yet. This could be because the author doesn't have permission to run those workflows, or because skip-checks keywords were added to the PR/commits; aborting merge. Please get/give approval for the workflows and/or remove skip-ci decorators before the next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.
To add the ciflow label: this helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: Comment with id 4033515126 not found. Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Changing the `num_stages` value from 1 to 2 enables more efficient pipelining in the Triton backend, which improves performance. Here are some benchmark numbers for comparison, run on MI350X.

| Attn Type | Shape (B,Hq,M,Hkv,N,D) | stages=1 (μs) | stages=2 (μs) | Speedup |
|----------------|----------------------------------|----------------|----------------|---------|
| causal | (2, 16, 512, 16, 512, 64) | 37.6 | 35.8 | 1.05x |
| causal | (2, 16, 512, 2, 512, 128) | 35.7 | 35.1 | 1.02x |
| causal | (2, 16, 1024, 16, 1024, 64) | 39.5 | 31.4 | 1.26x |
| causal | (2, 16, 4096, 16, 4096, 128) | 680.3 | 580.6 | 1.17x |
| causal | (2, 16, 4096, 2, 4096, 64) | 259.0 | 238.4 | 1.09x |
| noop | (8, 16, 1024, 16, 1024, 128) | 196.2 | 183.3 | 1.07x |
| causal | (8, 16, 1024, 2, 1024, 64) | 79.7 | 75.5 | 1.06x |
| alibi | (8, 16, 4096, 16, 4096, 64) | 2017.7 | 1727.3 | 1.17x |
| causal | (8, 16, 4096, 16, 4096, 128) | 2686.0 | 2258.7 | 1.19x |
| sliding_window | (8, 16, 4096, 2, 4096, 64) | 610.4 | 559.3 | 1.09x |
| causal | (16, 16, 512, 16, 512, 128) | 111.6 | 99.0 | 1.13x |
| alibi | (16, 16, 1024, 2, 1024, 128) | 391.6 | 335.3 | 1.17x |
| causal | (16, 16, 1024, 16, 1024, 64) | 163.6 | 142.6 | 1.15x |
| noop | (16, 16, 4096, 16, 4096, 128) | 6260.5 | 5130.3 | 1.22x |
| causal | (16, 16, 4096, 2, 4096, 64) | 2084.5 | 1780.5 | 1.17x |
| causal | (1, 32, 16384, 4, 16384, 64) | 2687.9 | 2472.8 | 1.09x |
| **Geo-mean** | | | | **1.13x** |

All configs: `num_warps=4`, `dtype=bfloat16`, fwd only. Benchmarked with `attention-gym` on ROCm.

Pull Request resolved: pytorch#176676
Approved by: https://github.com/drisspg, https://github.com/jeffdaily
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben