
Accelerate SDPA: Implement fast exp in SVE vectorizer#177645

Closed
fadara01 wants to merge 1 commit into gh/fadara01/13/base from gh/fadara01/13/head

Conversation

Collaborator

@fadara01 fadara01 commented Mar 17, 2026

Stack from ghstack (oldest at bottom):

Similar to #176881 and #151441, this PR adds an SVE fast exponential implementation, intended for cases where outputs will be downcast to FP16 / BF16 (e.g. attention softmax).

The implementation is similar to exp_u20, but:

  • approximates exp(r) - 1 as r instead of r + 0.5 r^2
  • does not split natural log (ln) into high / low parts
  • avoids special case code by clamping exp(x) to 0 for x < -87.346 and inf for x > 88.717
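A scalar sketch of this scheme (our illustration, not the actual SVE kernel; the 64-entry 2^(i/N) lookup table and the helper name `fexp_sketch` are assumptions, patterned after exp_u20-style table reductions):

```python
import math

N = 64                                    # assumed table size (exp_u20-style)
LN2_OVER_N = math.log(2.0) / N
N_OVER_LN2 = N / math.log(2.0)
TAB = [2.0 ** (i / N) for i in range(N)]  # 2**(i/N); a vector lookup in the kernel

LOWER = -87.346   # below this, FP32 exp underflows to 0
UPPER = 88.717    # above this, FP32 exp overflows to inf

def fexp_sketch(x: float) -> float:
    # Clamping replaces special-case handling, as described above.
    if x < LOWER:
        return 0.0
    if x > UPPER:
        return math.inf
    n = int(round(x * N_OVER_LN2))    # x ≈ n * (ln2/N) + r
    r = x - n * LN2_OVER_N            # |r| <= ln2/(2N) ≈ 0.0054
    k, i = divmod(n, N)               # 2**(n/N) = 2**k * 2**(i/N)
    return math.ldexp(TAB[i] * (1.0 + r), k)  # exp(r) - 1 ≈ r
```

With |r| this small, dropping the 0.5 r^2 term costs roughly r^2/2 ≈ 1.5e-5 in relative error, which is below one ULP of BF16/FP16 but not of FP32; that is why this path is reserved for downcast outputs.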

Accuracy

Tested in a similar fashion to #17688 by iterating over all possible FP32 bit patterns and calculating the ULP distance between:

  • fexp_u20 with inputs in FP32, outputs converted to BF16/FP16
  • std::exp with inputs in FP32, outputs converted to BF16/FP16

From the accuracy study above, this exp is:

  • Accurate within a maximum of 1 ULP for FP16
  • Accurate within a maximum of 1 ULP for BF16 for inputs in [-87.346, max_float]; inputs < -87.346 are clamped to zero.
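The exhaustive check described above can be sketched for FP16 outputs as follows (our illustration; the helper names are hypothetical, and the sign-magnitude-to-monotonic-integer mapping is the standard trick for counting ULPs between float bit patterns):

```python
import numpy as np

def ordered_fp16(x) -> np.ndarray:
    # Map FP16 bit patterns (sign-magnitude) onto a monotonic integer line,
    # so that ULP distance becomes a plain integer difference.
    bits = np.asarray(x, dtype=np.float16).view(np.int16).astype(np.int32)
    return np.where(bits < 0, -32768 - bits, bits)

def ulp_distance_fp16(a, b) -> np.ndarray:
    # Number of representable FP16 values between a and b.
    return np.abs(ordered_fp16(a) - ordered_fp16(b))
```

For example, `ulp_distance_fp16(1.0, 1.0 + 2**-10)` is 1, since 2^-10 is one ULP at 1.0 in FP16; the study applies this distance to the downcast outputs of the fast exp and `std::exp` over every FP32 input.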

Performance

Using [this SDPA benchmark](https://gist.github.com/fadara01/5357a52299a3722587f6691d145e71e9), here are the scaled-dot-product-attention speedups achieved with 16 Neoverse-V1 cores (with SVE256):

| B | Hq | Hkv | Lq | Lk | D | causal | gqa | Speedup vs current |
|---:|---:|---:|---:|---:|---:|---|---|---:|
| 1 | 32 | 8 | 2048 | 2048 | 128 | True | True | +7.20% |
| 1 | 32 | 8 | 1 | 2048 | 128 | False | True | +0.38% (noise) |
| 1 | 16 | 16 | 6400 | 6400 | 80 | False | False | +4.32% |
| 1 | 20 | 20 | 1500 | 1500 | 64 | False | False | +3.38% |
| 8 | 20 | 20 | 1500 | 1500 | 64 | False | False | +6.35% |

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01

[ghstack-poisoned]
fadara01 added a commit that referenced this pull request Mar 17, 2026
ghstack-source-id: da328b4
Pull-Request: #177645

pytorch-bot bot commented Mar 17, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/177645

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit ba6d46e with merge base f8e48d2:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Mar 17, 2026

pytorch-bot bot commented Mar 17, 2026

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Collaborator Author

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Mar 17, 2026
@fadara01 fadara01 added the ciflow/linux-aarch64 linux aarch64 CI workflow label Mar 17, 2026
Collaborator Author

fadara01 commented Mar 17, 2026

Hi @Skylion007 / @jgong5 - this is similar to #176881 but for the SVE vectorizer - I'd appreciate your review on this :)

@fadara01
Collaborator Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 18, 2026
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@fadara01
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

ryanzhang22 pushed a commit to ryanzhang22/pytorch that referenced this pull request Mar 19, 2026
Pull Request resolved: pytorch#177645
Approved by: https://github.com/Skylion007
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
AaronWang04 pushed a commit to AaronWang04/pytorch that referenced this pull request Mar 31, 2026

Labels

ciflow/linux-aarch64 (linux aarch64 CI workflow), ciflow/trunk (Trigger trunk jobs on your pull request), Merged, module: cpu (CPU specific problem, e.g. perf, algorithm), open source, topic: not user facing (topic category)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants