
Accelerate SDPA on Arm CPUs: Update OpenBLAS to v0.3.32#177012

Open
fadara01 wants to merge 4 commits into gh/fadara01/12/base from gh/fadara01/12/head

Conversation

Collaborator

@fadara01 fadara01 commented Mar 10, 2026

Stack from ghstack (oldest at bottom):

OpenBLAS v0.3.31 adds support for BGEMM on SVE128 and SVE256 machines, along with general optimizations for SBGEMM/BGEMM (OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399), among other things.

OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20% through OpenMathLib/OpenBLAS#5667.

This accelerates SDPA, and #172945 will build on it to further accelerate linear, mm, bmm, etc.

Performance

Using this SDPA benchmark, here are the scaled-dot-product-attention speedups achieved with 16 Neoverse-V2 cores:

| B | Hq | Hkv | Lq | Lk | D | causal | gqa | Speedup from #176881 vs current | Speedup from #176881 and this PR vs current | Speedup from #176881, #177009 and this PR vs current |
|---|----|-----|----|----|---|--------|-----|---|---|---|
| 1 | 32 | 8 | 2048 | 2048 | 128 | True | True | +9.48% | +14.91% | +35.60% |
| 1 | 32 | 8 | 1 | 2048 | 128 | False | True | -1.42% | -2.79% | -0.95% |
| 1 | 16 | 16 | 6400 | 6400 | 80 | False | False | +5.18% | +11.60% | +27.95% |
| 1 | 20 | 20 | 1500 | 1500 | 64 | False | False | +6.63% | +11.80% | +24.86% |
| 8 | 20 | 20 | 1500 | 1500 | 64 | False | False | +9.31% | +17.12% | +31.82% |
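For reference, the kernel being benchmarked computes scaled dot-product attention, softmax(Q·Kᵀ/√D)·V. A minimal pure-Python sketch of the math (shapes and values are illustrative only, not taken from the benchmark harness):

```python
import math

def matmul(a, b):
    # Naive matrix multiply: (m x k) @ (k x n) -> (m x n).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    # Numerically stable row-wise softmax.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def sdpa(q, k, v):
    """softmax(Q K^T / sqrt(D)) V for single-head 2-D inputs."""
    d = len(q[0])
    kt = [list(col) for col in zip(*k)]               # K^T: D x Lk
    scores = matmul(q, kt)                            # Lq x Lk
    scale = 1.0 / math.sqrt(d)
    attn = [softmax([s * scale for s in row]) for row in scores]
    return matmul(attn, v)                            # Lq x Dv

# Tiny example: each query attends most to its matching key.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out = sdpa(q, k, v)
```

The attention scores and the attention-times-values products are the GEMM calls that this PR routes to the faster OpenBLAS bf16 kernels; the sketch only shows the arithmetic being accelerated.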

PS: BGEMM means bf16 x bf16 -> bf16, and SBGEMM means bf16 x bf16 -> fp32.
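To illustrate the bf16-output vs. fp32-output distinction, here is a toy model of bf16 rounding and the two accumulation styles. This is only a sketch: real BGEMM kernels typically accumulate partial products in wider registers and round once at the end, so rounding every partial sum exaggerates the effect; all names here are illustrative.

```python
import struct

def to_bf16(x: float) -> float:
    # bfloat16 keeps the top 16 bits of an IEEE-754 fp32 value,
    # with round-to-nearest-even on the dropped 16 bits.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounded = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", rounded & 0xFFFF0000))[0]

def dot(a, b, accumulate_fp32: bool) -> float:
    """Dot product of bf16-rounded inputs.

    accumulate_fp32=True  ~ SBGEMM-style: keep the running sum in fp32.
    accumulate_fp32=False ~ toy BGEMM-style: round every partial sum to bf16.
    """
    acc = 0.0
    for x, y in zip(a, b):
        acc += to_bf16(x) * to_bf16(y)
        if not accumulate_fp32:
            acc = to_bf16(acc)
    return acc

a = [1.0 + i / 256 for i in range(64)]
b = [1.0] * 64
sbgemm_style = dot(a, b, accumulate_fp32=True)    # wide accumulator
bgemm_style = dot(a, b, accumulate_fp32=False)    # narrow accumulator
```

With 64 terms the bf16 accumulator's ~8 bits of precision are no longer enough to represent every partial sum, so the two styles drift apart, which is why the output precision matters for accuracy even when the inputs are identical.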

[ghstack-poisoned]
@fadara01 fadara01 requested a review from jeffdaily as a code owner March 10, 2026 08:26
@pytorch-bot

pytorch-bot bot commented Mar 10, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/177012

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit a8a4728 with merge base 0951602:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Mar 10, 2026
fadara01 added a commit that referenced this pull request Mar 10, 2026
OpenBLAS v0.3.31 adds support for BGEMM on SVE128, SVE256 machines
and general optimizations for SBGEMM/BGEMM: OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399 among other things.

OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20%: OpenMathLib/OpenBLAS#5667

This accelerates SDPA, and will be capitalized on by #172945 further to accelerate linear,mm, bmm, etc

PS: BGEMM means bf16 x bf16 -> bf16 and SBGEMM means: bf16 x bf16 -> fp32
ghstack-source-id: cf38a01
Pull-Request: #177012
@fadara01 fadara01 changed the title Accelerate SDPA on Arm CPUs: Update OpenBLAS to v0.3.32 [DO NOT MERGE YET] Accelerate SDPA on Arm CPUs: Update OpenBLAS to v0.3.32 Mar 10, 2026
@fadara01 fadara01 marked this pull request as draft March 10, 2026 08:31
@fadara01
Collaborator Author

CI is expected to fail because OpenBLAS v0.3.32 has not been released yet.
It should be released any time now: https://github.com/OpenMathLib/OpenBLAS/milestone/51

@fadara01 fadara01 changed the title [DO NOT MERGE YET] Accelerate SDPA on Arm CPUs: Update OpenBLAS to v0.3.32 Accelerate SDPA on Arm CPUs: Update OpenBLAS to v0.3.32 Mar 10, 2026
[ghstack-poisoned]
fadara01 added a commit that referenced this pull request Mar 10, 2026
ghstack-source-id: 952fd9e
Pull-Request: #177012
[ghstack-poisoned]
fadara01 added a commit that referenced this pull request Mar 10, 2026
ghstack-source-id: 596be25
Pull-Request: #177012
[ghstack-poisoned]
fadara01 added a commit that referenced this pull request Mar 16, 2026
ghstack-source-id: 545189c
Pull-Request: #177012
@fadara01 fadara01 marked this pull request as ready for review March 24, 2026 08:56
@fadara01
Collaborator Author

OpenBLAS v0.3.32 has been released: https://github.com/OpenMathLib/OpenBLAS/tree/v0.3.32

@Skylion007 @jgong5 @aditew01 could you please have a look?

@aditew01 aditew01 added ciflow/linux-aarch64 linux aarch64 CI workflow ciflow/trunk Trigger trunk jobs on your pull request ciflow/inductor labels Mar 25, 2026