
Enable XPU path for FlexAttention #143553

Closed

liangan1 wants to merge 133 commits into pytorch:main from liangan1:liangan1/flex_attention

Conversation

@liangan1
Contributor

@liangan1 liangan1 commented Dec 19, 2024

RFC: #153024

Motivation

  1. Attention is the critical performance bottleneck in current LLM models, and FlexAttention is a good choice to cover the broad range of attention variants in transformers-series models. With FlexAttention, it is easy for us to enable paged attention and fused SDPA in the transformers repo on the XPU device. Besides, it also provides a candidate path for processing attention in LLM ecosystem libraries, e.g., vLLM and SGLang, on the XPU device (a minimal usage sketch follows this list).
  2. FlexAttention is a good starting point for maturing the Intel Triton-based GEMM kernels. FlexAttention provides both a flexattention kernel and a flexdecoding kernel to cover compute-bound and memory-bound GEMM computation, and different shapes also need to be supported to serve LLM inference, e.g., head_dim = 64, 96, 128, 256.
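To make the motivation concrete, here is a minimal FlexAttention sketch. The shapes, the relative-bias score_mod, and the XPU/CPU device fallback are illustrative assumptions rather than code from this PR; only the public torch.nn.attention.flex_attention API is relied on.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

# Illustrative shapes; "xpu" is used when an Intel GPU build of PyTorch is available.
device = "xpu" if torch.xpu.is_available() else "cpu"
B, H, S, D = 2, 8, 1024, 64
q, k, v = (torch.randn(B, H, S, D, device=device) for _ in range(3))

# One of the "broad variants": a relative-position bias expressed as a score_mod,
# instead of a dedicated fused kernel per attention flavor.
def rel_bias(score, b, h, q_idx, kv_idx):
    return score + (q_idx - kv_idx)

# torch.compile lowers this through Inductor's Triton-based FlexAttention kernel,
# which is the path this PR enables for the XPU device.
out = torch.compile(flex_attention)(q, k, v, score_mod=rel_bias)
print(out.shape)  # torch.Size([2, 8, 1024, 64])
```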

What does this PR do?

  1. Enable the XPU device type for the FlexAttention kernel and UTs to ensure all important UTs pass on the XPU device.
  2. For E2E model inference, ensure that LLM inference with FlexAttention is functional (see the block-mask sketch after this list).
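As an example of the E2E functionality in point 2, the sketch below runs FlexAttention with a causal BlockMask, the pattern a decoder-only LLM uses at inference time. The shapes and device handling are assumptions for illustration; only the public flex_attention/create_block_mask API is assumed.

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

device = "xpu" if torch.xpu.is_available() else "cpu"
B, H, S, D = 1, 4, 512, 64

# Causal masking expressed as a mask_mod; create_block_mask precomputes the sparse
# block structure that the FlexAttention kernel can skip over at runtime.
def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=S, KV_LEN=S, device=device)

q, k, v = (torch.randn(B, H, S, D, device=device) for _ in range(3))
out = torch.compile(flex_attention)(q, k, v, block_mask=block_mask)
```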

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @Lucaskabela @yf225 @ColinPeppler @desertfire

@pytorch-bot

pytorch-bot bot commented Dec 19, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/143553

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 1 New Failure, 4 Unrelated Failures

As of commit 29dbb36 with merge base d153af7:

NEW FAILURE - The following job has failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla

linux-foundation-easycla bot commented Dec 19, 2024

@liangan1 liangan1 marked this pull request as draft December 19, 2024 04:39
@EikanWang EikanWang added the topic: not user facing, triaged, and ciflow/xpu (Run XPU CI tasks) labels Dec 24, 2024
@pytorch-bot

pytorch-bot bot commented Dec 24, 2024

To add the ciflow label ciflow/xpu please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label Dec 24, 2024
@EikanWang EikanWang self-requested a review December 24, 2024 02:14
@EikanWang EikanWang added the ciflow/xpu Run XPU CI tasks label Dec 24, 2024
@pytorch-bot

pytorch-bot bot commented Dec 24, 2024

To add the ciflow label ciflow/xpu please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label Dec 24, 2024
@liangan1
Contributor Author

@pytorchbot rebase

@pytorch-bot

pytorch-bot bot commented Feb 10, 2025

You don't have permissions to rebase this PR since you are a first time contributor. If you think this is a mistake, please contact PyTorch Dev Infra.

enable flex attention TMA flag on xpu by default
@jianan-gu
Contributor

@jianan-gu, can you help take a look at the following CPU-related UT failures, which are irrelevant to the XPU?
'test/inductor/test_flex_attention.py::TestFlexAttentionCPU::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape0_cpu', 'test/inductor/test_flex_attention.py::TestFlexAttentionCPU::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cpu'

Logs: https://hud.pytorch.org/pytorch/pytorch/pull/143553?sha=8fe1251b2a91d249b553fdf78e089de8f9145f46

Hi, @liangan1 @hoshibara

Thanks for mentioning this and for the refinements to the FlexAttention-related UTs on common devices in this PR.
Yes, we also hit this issue when doing similar refinements in #159835.
We have double-checked this UT locally but were not able to reproduce the failure, and we will take some more time to follow up on a final fix. Since this is also irrelevant to the XPU, we suggest skipping this UT in this PR (a sketch of such a skip follows below).
Thanks.

cc @Valentine233
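A minimal sketch of the suggested skip, assuming a plain unittest-style decorator; the decorator placement, class body, and reason string are illustrative, and the real test lives in test/inductor/test_flex_attention.py, so the PR may express the skip differently.

```python
import unittest

class TestFlexAttentionCPU(unittest.TestCase):
    # Illustrative only: skip the unrelated CPU paged-attention stride-ordering case
    # until the follow-up fix lands; the actual test body is elided here.
    @unittest.skip("CPU paged-attention stride-ordering failure is unrelated to XPU; tracked separately")
    def test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape0_cpu(self):
        ...
```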

@EikanWang
Collaborator

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 job has failed; the first few are: trunk / linux-jammy-rocm-py3.10 / build

Details for Dev Infra team (raised by workflow job)

# USE TMA = false by default
cur_kernel_options.setdefault("USE_TMA", False)

if torch.xpu.is_available():
Contributor

The tensor descriptor does not support all Q, K, V memory layouts; Q, K, and V have to be contiguous on the last dim to use the tensor descriptor.

Use can_use_tma in the condition, as in if torch.xpu.is_available() and can_use_tma(query, key, value):.
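A minimal sketch of the suggested gating is below. The body of can_use_tma is an assumption (last-dim contiguity of Q/K/V) and select_tma_option is a made-up wrapper for illustration; neither is the actual Inductor code.

```python
import torch

def can_use_tma(query, key, value):
    # Assumed check: tensor descriptors (TMA) require the last dim of Q/K/V to be contiguous.
    return all(t.stride(-1) == 1 for t in (query, key, value))

def select_tma_option(cur_kernel_options, query, key, value):
    # Mirrors the reviewed snippet: default USE_TMA to False, then turn it on only
    # when an XPU is present and the Q/K/V layouts allow a tensor descriptor.
    cur_kernel_options.setdefault("USE_TMA", False)
    if torch.xpu.is_available() and can_use_tma(query, key, value):
        cur_kernel_options["USE_TMA"] = True
    return cur_kernel_options
```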

Contributor

Done

@hoshibara
Contributor

@pytorchbot label ciflow/xpu

@hoshibara
Contributor

@pytorchbot merge

@pytorch-bot

pytorch-bot bot commented Aug 28, 2025

Pull workflow has not been scheduled for the PR yet. This could be because the author doesn't have permission to run it, or because skip-checks keywords were added to the PR/commits; aborting merge. Please get/give approval for the workflows and/or remove skip-ci decorators before the next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.

@hoshibara
Contributor

@pytorchbot label ciflow/trunk

@pytorch-bot

pytorch-bot bot commented Aug 28, 2025

To add these label(s) (ciflow/trunk) to the PR, please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@hoshibara
Contributor

Hi @EikanWang,
The TMA code has been reverted. The TMA-related UT shows that the USE_TMA flag is interpreted correctly on XPU.

@EikanWang
Collaborator

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 4 checks: xpu / linux-jammy-xpu-n-py3.10 / test (default, 1, 8, linux.idc.xpu), xpu / linux-jammy-xpu-n-py3.10 / test (default, 5, 8, linux.idc.xpu), xpu / linux-jammy-xpu-n-py3.10 / test (default, 3, 8, linux.idc.xpu), xpu / linux-jammy-xpu-n-py3.10 / test (default, 6, 8, linux.idc.xpu)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here


if not _has_sufficient_memory(_device, size_bytes):
# TODO: Memory availability checks for Intel GPU
if device != "xpu" and not _has_sufficient_memory(_device, size_bytes):
Collaborator

@guangyey guangyey Sep 2, 2025

This changed the logic of largeTensorTest. It disabled largeTensorTest on the XPU device, which results in the failure of python test/dynamo/test_aot_autograd_cache.py AOTAutogradCacheTests.test_autograd_inductor_guards_device_xpu_float16_requires_grad_True.

Contributor

In e6ae1ed, we attempted to complete the sufficient-memory check for XPU, but it caused some previously skipped cases to fail.
This needs a new PR to fix (a sketch of such a check is below).
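A sketch of what a sufficient-memory check for XPU could look like, assuming total_memory is exposed via torch.xpu.get_device_properties; the helper name and the headroom factor are illustrative assumptions, not the code from e6ae1ed.

```python
import torch

def _has_sufficient_xpu_memory(device, size_bytes):
    # Illustrative only: compare the requested allocation against the device's total
    # memory with some headroom, analogous to the CUDA path in _has_sufficient_memory.
    props = torch.xpu.get_device_properties(device)
    return size_bytes < props.total_memory * 0.9
```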

Collaborator

@hoshibara Please fix those failures ASAP.

Contributor

Raised #162034 to fix this case.

Contributor

Guangye's PR #161988 will fix this issue.

Collaborator

@hoshibara Thanks. PR landed.


Labels

ciflow/trunk: Trigger trunk jobs on your pull request
ciflow/xpu: Run XPU CI tasks
keep-going: Don't stop on first failure, keep running tests until the end
Merged
module: dynamo
module: inductor
open source
topic: not user facing (topic category)
triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Projects

Archived in project
