[flex attention][triton pin] triton_helpers shim for TMA apis#154858
Closed
davidberard98 wants to merge 4 commits into gh/davidberard98/359/base from
Conversation
Triton 3.4 will remove the experimental TMA APIs: triton-lang/triton#6488. To allow compatibility across different Triton versions, we implement a shim layer that calls the new API if available and otherwise falls back to the experimental API. Test: `python test/inductor/test_flex_attention.py TestFlexAttentionCUDA.test_GQA_causal_mask_cuda`, which previously fails with triton-lang/triton@cda4229 but now passes. Note: we'll need to apply this to other things in Inductor; this PR only does it for flex attention. [ghstack-poisoned]
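The "new API if available, otherwise the experimental API" dispatch described above can be sketched in plain Python. The attribute names below (`make_tensor_descriptor`, `_experimental_make_tensor_descriptor`) are illustrative assumptions, not the exact symbols used by Triton or by the real shim in `triton_helpers`:

```python
# Minimal sketch of a feature-detection shim (assumed attribute names):
# prefer the new stable API when the module exposes it, otherwise fall
# back to the experimental API. The real shim in triton_helpers targets
# Triton's actual TMA entry points; these names are placeholders.
from types import SimpleNamespace


def make_tma_descriptor_fn(tl):
    """Return whichever TMA-descriptor constructor this Triton exposes."""
    if hasattr(tl, "make_tensor_descriptor"):  # new API (assumed name)
        return tl.make_tensor_descriptor
    # Older Triton: only the experimental API exists (assumed name).
    return tl._experimental_make_tensor_descriptor


# Stand-ins for an old and a new Triton module, for illustration only.
old_tl = SimpleNamespace(
    _experimental_make_tensor_descriptor=lambda ptr: ("experimental", ptr)
)
new_tl = SimpleNamespace(make_tensor_descriptor=lambda ptr: ("new", ptr))
```

Because the check happens at call time via `hasattr`, the same Inductor code path works against either Triton version without import-time version branching.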
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154858
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (1 Unrelated Failure) As of commit 6e91f74 with merge base 0d0058d. UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
…pis" Triton 3.4 will remove the experimental TMA apis: triton-lang/triton#6488 To allow compatibility across different triton versions, we implement a shim layer which calls the new API if available, and otherwise falls back to the experimental API. Test: `python test/inductor/test_flex_attention.py TestFlexAttentionCUDA.test_GQA_causal_mask_cuda` which previously fails w/ triton-lang/triton@cda4229558c5dca7f7c4734bedd3e596ebcae0b8, but now passes. Note: we'll need to apply this for other things in inductor, this just does it for flex attention. [ghstack-poisoned]
This was linked to issues on Jun 2, 2025
…pis" Triton 3.4 will remove the experimental TMA apis: triton-lang/triton#6488 To allow compatibility across different triton versions, we implement a shim layer which calls the new API if available, and otherwise falls back to the experimental API. Test: `python test/inductor/test_flex_attention.py TestFlexAttentionCUDA.test_GQA_causal_mask_cuda` which previously fails w/ triton-lang/triton@cda4229558c5dca7f7c4734bedd3e596ebcae0b8, but now passes. Note: we'll need to apply this for other things in inductor, this just does it for flex attention. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov [ghstack-poisoned]
NikhilAPatel approved these changes on Jun 2, 2025
davidberard98 added a commit that referenced this pull request on Jun 2, 2025
Triton 3.4 will remove the experimental TMA apis: triton-lang/triton#6488 To allow compatibility across different triton versions, we implement a shim layer which calls the new API if available, and otherwise falls back to the experimental API. Test: `python test/inductor/test_flex_attention.py TestFlexAttentionCUDA.test_GQA_causal_mask_cuda` which previously fails w/ triton-lang/triton@cda4229558c5dca7f7c4734bedd3e596ebcae0b8, but now passes. Note: we'll need to apply this for other things in inductor, this just does it for flex attention. ghstack-source-id: ec2a0a1 Pull Request resolved: #154858
Contributor / Author
@pytorchbot merge
Collaborator
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
pytorchmergebot pushed a commit to Eliasj42/pytorch that referenced this pull request on Jun 3, 2025
…h#154858) Triton 3.4 will remove the experimental TMA apis: triton-lang/triton#6488 To allow compatibility across different triton versions, we implement a shim layer which calls the new API if available, and otherwise falls back to the experimental API. Test: `python test/inductor/test_flex_attention.py TestFlexAttentionCUDA.test_GQA_causal_mask_cuda` which previously fails w/ triton-lang/triton@cda4229558c5dca7f7c4734bedd3e596ebcae0b8, but now passes. Note: we'll need to apply this for other things in inductor, this just does it for flex attention. Pull Request resolved: pytorch#154858 Approved by: https://github.com/NikhilAPatel, https://github.com/drisspg
iupaikov-amd pushed a commit to ROCm/pytorch that referenced this pull request on Jun 4, 2025
angelayi pushed a commit to angelayi/pytorch that referenced this pull request on Jun 5, 2025
davidberard98 added a commit that referenced this pull request on Jun 5, 2025
Follow-up to #154858. Triton 3.4 will provide a different API for TMA compared to Triton 3.3; the TMA shim in triton_helpers dispatches to the correct API. This PR updates the TMA usage in mm.py and mm_scaled_grouped.py to use the TMA shim so that they will work with either Triton version. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov [ghstack-poisoned]
davidberard98 added a commit that referenced this pull request on Jun 9, 2025
… mm.py support" Follow-up to #154858. Triton 3.4 will provide a different API for TMA compared to Triton 3.3; the TMA shim in triton_helpers dispatches to the correct API. First, this refactors the TMA shim to drop args that aren't supported from Triton 3.2 to Triton 3.4: in particular, strides (Triton 3.2 version doesn't accept non-contiguous inputs, so we just infer contiguous strides in Triton 3.4) and element_ty (Triton 3.4 doesn't support this arg, so in Triton 3.2 we just infer it from base_ptr). Second, this updates mm.py to use the TMA shim. mm_scaled_grouped.py still needs to be updated, but requires some work around recent changes in #150944. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov [ghstack-poisoned]
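The "infer contiguous strides" step described above amounts to computing row-major strides from a shape. A hypothetical helper (not the actual `triton_helpers` code) might look like:

```python
def contiguous_strides(shape):
    """Row-major (C-contiguous) strides, in elements, for a given shape.

    Sketch of the inference the shim is described as doing when the older
    Triton API only accepts contiguous inputs: the innermost dimension has
    stride 1, and each outer stride is the product of the inner dims.
    """
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides
```

For example, a `[4, 3, 2]` tensor gets strides `[6, 2, 1]`, which is exactly what a contiguous layout implies, so passing explicit strides becomes unnecessary.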
davidberard98 added a commit that referenced this pull request on Jun 9, 2025
… mm, mm_scaled_grouped support" Follow-up to #154858. Triton 3.4 will provide a different API for TMA compared to Triton 3.3; the TMA shim in triton_helpers dispatches to the correct API. First, this refactors the TMA shim to drop args that aren't supported from Triton 3.2 to Triton 3.4: in particular, strides (Triton 3.2 version doesn't accept non-contiguous inputs, so we just infer contiguous strides in Triton 3.4) and element_ty (Triton 3.4 doesn't support this arg, so in Triton 3.2 we just infer it from base_ptr). Second, this updates mm.py & mm_scaled_grouped.py to use the TMA shim. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov [ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request on Jun 10, 2025
…ort (#155182) Follow-up to #154858. Triton 3.4 will provide a different API for TMA compared to Triton 3.3; the TMA shim in triton_helpers dispatches to the correct API. First, this refactors the TMA shim to drop args that aren't supported from Triton 3.2 to Triton 3.4: in particular, strides (Triton 3.2 version doesn't accept non-contiguous inputs, so we just infer contiguous strides in Triton 3.4) and element_ty (Triton 3.4 doesn't support this arg, so in Triton 3.2 we just infer it from base_ptr). Second, this updates mm.py & mm_scaled_grouped.py to use the TMA shim. Differential Revision: [D76318784](https://our.internmc.facebook.com/intern/diff/D76318784) Pull Request resolved: #155182 Approved by: https://github.com/drisspg
pytorchmergebot pushed a commit that referenced this pull request on Jun 11, 2025
…#154858)" (#155640) This reverts commit ea7b233. It fails internal tests in fbcode, but those tests weren't running, so we didn't get signal. Reverting with a PR/diff because the original PR has been landed for ~1 week, which is too old to revert directly from internal. Differential Revision: [D76380887](https://our.internmc.facebook.com/intern/diff/D76380887) Pull Request resolved: #155640 Approved by: https://github.com/nmacchioni, https://github.com/danzimm
davidberard98 added a commit that referenced this pull request on Jun 11, 2025
…py templates" Triton 3.4 will remove the experimental TMA APIs: triton-lang/triton#6488 For mm.py templates, this PR adds support for using the new APIs when they are available (and otherwise falls back to the experimental APIs). For flex_attention, we'll remove TMA support for Triton 3.2 and 3.3 (versions of triton that don't have the new API). For mm_scaled_grouped.py, #150944 will remove TMA support for Triton 3.2. Note: we attempted this earlier with #154858, but this broke TMA usage in Triton 3.2. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov Differential Revision: [D76444471](https://our.internmc.facebook.com/intern/diff/D76444471) [ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request on Jun 12, 2025
#155723) Triton 3.4 will remove the experimental TMA APIs: triton-lang/triton#6488 For mm.py templates, this PR adds support for using the new APIs when they are available (and otherwise falls back to the experimental APIs). For flex_attention, we'll remove TMA support for Triton 3.2 and 3.3 (versions of triton that don't have the new API). For mm_scaled_grouped.py, #150944 will remove TMA support for Triton 3.2. Note: we attempted this earlier with #154858, but this broke TMA usage in Triton 3.2. Differential Revision: [D76444471](https://our.internmc.facebook.com/intern/diff/D76444471) Pull Request resolved: #155723 Approved by: https://github.com/NikhilAPatel
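The policy above (use the new API on Triton 3.4+, drop TMA on 3.2/3.3 for flex attention) implies a version cutoff. A hypothetical gate is sketched below; real code would more likely feature-detect the API (as the shim does) than parse version strings, and the `3.4` cutoff here is taken from the discussion above, not from any library constant:

```python
def supports_stable_tma(triton_version: str) -> bool:
    """True when the version string is at least 3.4 (assumed cutoff).

    Illustrative only: compares the leading major.minor components of a
    version string like "3.4.0" against the assumed (3, 4) threshold.
    """
    major, minor = (int(part) for part in triton_version.split(".")[:2])
    return (major, minor) >= (3, 4)
```

Under this gate, `"3.3.1"` would keep the fallback path and `"3.4.0"` or `"4.0"` would take the new-API path.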
Stack from ghstack (oldest at bottom):
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov