
Accelerate SDPA on Arm CPUs: Unroll exp_sum and max_mul kernels#177009

Open
fadara01 wants to merge 4 commits into gh/fadara01/11/base from gh/fadara01/11/head

Conversation

@fadara01 (Collaborator) commented Mar 10, 2026

Stack from ghstack (oldest at bottom):

We noticed that (e.g. in Whisper with B=8, num_heads=20, seqlen=1500, head_dim=64) 40% of our time in scaled-dot-product attention is spent in the `_exp_reduce_sum_fusion_kernel`.
Some of that overhead was addressed by #176881, which introduced a faster exp.
We build on top of that here and squeeze more performance out of SDPA by unrolling the exp_sum and max_mul kernels for better instruction-level parallelism (ILP).

The unrolling pattern used here is already present in other SDPA helper kernels like [`_scale_attn_mask_fusion_kernel`](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/cpu/FlashAttentionKernel.cpp#L53).

While using `VectorizedN` for unrolling, I noticed that:

- we don't have a fast path to convert `VectorizedN<float>` to `VectorizedN<bfloat16>` for NEON, so I added that. We already have one for SVE.
- we don't have a fast path to short-circuit identity conversions for `VectorizedN` (e.g. `VectorizedN<float>` to `VectorizedN<float>`), so I added that too.

Performance

Using [this SDPA benchmark](https://gist.github.com/fadara01/5357a52299a3722587f6691d145e71e9), here are the scaled-dot-product attention speedups achieved with 16 Neoverse-V2 cores:

| B | Hq | Hkv | Lq | Lk | D | causal | gqa | Speedup from #176881 vs current | Speedup from #176881 and this PR vs current |
|---:|---:|---:|---:|---:|---:|---|---|---:|---:|
| 1 | 32 | 8 | 2048 | 2048 | 128 | True | True | +9.48% | +14.91% |
| 1 | 32 | 8 | 1 | 2048 | 128 | False | True | -1.42% | -2.79% |
| 1 | 16 | 16 | 6400 | 6400 | 80 | False | False | +5.18% | +11.60% |
| 1 | 20 | 20 | 1500 | 1500 | 64 | False | False | +6.63% | +11.80% |
| 8 | 20 | 20 | 1500 | 1500 | 64 | False | False | +9.31% | +17.12% |

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01

[ghstack-poisoned]
@pytorch-bot commented Mar 10, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/177009

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 253a1ea with merge base 0951602 (image):

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Mar 10, 2026
@pytorch-bot commented Mar 10, 2026

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@fadara01 (Collaborator, Author)

@pytorchbot label "topic: not user facing"

@Skylion007 (Collaborator) left a comment


Some nits

sandy-gags pushed a commit to sandy-gags/pytorch that referenced this pull request Mar 13, 2026
@fadara01 (Collaborator, Author)

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 16, 2026
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status.

@yangw-dev (Contributor)

@ezyang do you mind taking a look at this, since you are auto-assigned to this change?

@aditew01 (Collaborator)

maybe we need to guard the introduced logic under `!defined(C10_MOBILE)`

EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
Pull Request resolved: pytorch#177009
Approved by: https://github.com/jgong5, https://github.com/Skylion007
ghstack dependencies: pytorch#176881
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
…ls (pytorch#177009)"

This reverts commit e5c56db.

Reverted pytorch#177009 on behalf of https://github.com/yangw-dev due to sorry it seems this breaks internal tests xplat/caffe2/aten/src/ATen/cpu/vec/vec128/vec128_convert.h:390:17: error: use of undeclared identifier 'convert_float_half', D96767295. please reach out to meta internal folks to resolve this ([comment](pytorch#177009 (comment)))
@fadara01 (Collaborator, Author)

@malfet - has Meta's internal PyTorch been modified to work with this PR?
Are we okay to re-land this?

@malfet (Contributor)

malfet commented Mar 30, 2026

@fadara01 there is no Meta-internal PR associated with this one, but let me have a look at what is going on

AaronWang04 pushed a commit to AaronWang04/pytorch that referenced this pull request Mar 31, 2026
AaronWang04 pushed a commit to AaronWang04/pytorch that referenced this pull request Mar 31, 2026
@ezyang (Contributor)

ezyang commented Apr 7, 2026

⏺ Bash(git show main:aten/src/ATen/cpu/vec/vec128/vec128_convert.h | head -10)
  ⎿  #pragma once
     #include <ATen/cpu/vec/vec_base.h>
     #include <ATen/cpu/vec/vec_convert.h>
     … +7 lines

⏺ OK so the original guard was defined(__aarch64__) &&
  !defined(CPU_CAPABILITY_SVE256) — no C10_MOBILE check. That means the existing
  code in this block was available on mobile. Adding !defined(C10_MOBILE) to the
  whole block would be a regression for mobile.

  I should revert my change and instead wrap just the two new specializations.

⏺ Update(aten/src/ATen/cpu/vec/vec128/vec128_convert.h)
  ⎿  Added 1 line, removed 1 line
       4
       5  namespace at::vec {
       6  inline namespace CPU_CAPABILITY {
        7 -#if !defined(C10_MOBILE) && defined(__aarch64__) && !defined(CPU_CAPABILITY_SVE256)
        7 +#if (defined(__aarch64__) && !defined(CPU_CAPABILITY_SVE256))
       8
       9  // Enable auto-vectorization for clang-17+
      10  // GCC-12 has a bug: gcc.gnu.org/bugzilla/show_bug.cgi?id=117001

⏺ Update(aten/src/ATen/cpu/vec/vec128/vec128_convert.h)
  ⎿  Added 2 lines
      373    }
      374  };
      375
      376 +#if !defined(C10_MOBILE)
      377  template <>
      378  struct VecConvert<BFloat16, 1, float, 2> {
      379    static inline VectorizedN<BFloat16, 1> apply(
     ...
      392      return result;
      393    }
      394  };
      395 +#endif // !defined(C10_MOBILE)
      396
      397  #endif // defined(__aarch64__) && !defined(CPU_CAPABILITY_SVE256)
      398  } // namespace CPU_CAPABILITY

⏺ Done. Wrapped the two new VecConvert specializations in #if
  !defined(C10_MOBILE) since convert_float_half and convert_float_bfloat16 are
  defined in headers that are guarded by !defined(C10_MOBILE). The rest of the
  existing code in the block remains available on mobile as before.

@aditew01 (Collaborator)

aditew01 commented Apr 7, 2026

@ezyang does this PR need any additional changes, based on

> Done. Wrapped the two new VecConvert specializations in `#if !defined(C10_MOBILE)` since `convert_float_half` and `convert_float_bfloat16` are defined in headers that are guarded by `!defined(C10_MOBILE)`. The rest of the existing code in the block remains available on mobile as before.

or can we trigger a merge as is?

@ezyang (Contributor)

ezyang commented Apr 7, 2026

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status.

@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / macos-py3-arm64 / build

Details for Dev Infra team (raised by workflow job)

@aditew01 (Collaborator)

aditew01 commented Apr 7, 2026

@pytorchbot merge -i

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged while ignoring the following 1 checks: trunk / macos-py3-arm64 / build

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status.

nklshy-aws pushed a commit to nklshy-aws/pytorch that referenced this pull request Apr 7, 2026
Pull Request resolved: pytorch#177009
Approved by: https://github.com/jgong5, https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: pytorch#176881
@atalman (Contributor)

atalman commented Apr 8, 2026

@pytorchmergebot revert -c ghfirst -m "Failing internally"

@pytorchmergebot (Collaborator)

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot (Collaborator)

@fadara01 your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Apr 8, 2026
…ls (#177009)"

This reverts commit e8dab08.

Reverted #177009 on behalf of https://github.com/atalman due to Failing internally ([comment](#177009 (comment)))
@atalman (Contributor)

atalman commented Apr 8, 2026

Error:

vec128_convert.h:381:17: error: no matching function for call to 'convert_from_float'
  381 |     result[0] = convert_float_bfloat16(src[0], src[1]);
      |                 ^~~~~~~~~~~~~~~~~~~~~~
ATen/cpu/vec/vec_convert.h:153:29: note: candidate template ignored: couldn't infer template argument 'scalar_t'
  153 | inline Vectorized<scalar_t> convert_from_float(
      |                             ^
In file included from /executorch/kernels/optimized/cpu/op_exp.cpp:11:

Looks like it's pointing to: https://github.com/pytorch/executorch/blob/main/kernels/optimized/cpu/op_exp.cpp#L11

And one more error:

vec128_convert.h:390:17: error: use of undeclared identifier 'convert_float_half'
  390 |     result[0] = convert_float_half(src[0], src[1]);
      |                 ^~~~~~~~~~~~~~~~~~
In file included from /executorch/kernels/optimized/cpu/op_exp.cpp:13:

@aditew01 (Collaborator)

aditew01 commented Apr 9, 2026

@atalman is it possible to publish a reproducer, or point to the possible fixes that can go into the PR? I can't make much of the published error trace.


Labels

ci-no-td (Do not run TD on this PR), ciflow/linux-aarch64 (linux aarch64 CI workflow), ciflow/trunk (Trigger trunk jobs on your pull request), Merged, module: cpu (CPU specific problem, e.g. perf, algorithm), open source, Reverted, topic: not user facing (topic category)


10 participants