
Bf16 optimized heuristics#172945

Open
Anallear wants to merge 1 commit into pytorch:main from Anallear:bf16OptimizedHeuristics

Conversation

@Anallear
Contributor

@Anallear Anallear commented Jan 21, 2026

This PR introduces:

  • BGEMM backend integration for BF16 GEMM
  • A data-driven decision-tree heuristic for selecting BGEMM vs oneDNN
  • Significant CPU inference improvements

OpenBLAS update

  • OpenBLAS version updated to 0.3.31.dev (this build contains BGEMM kernels for BF16).
  • BGEMM vs SBGEMM: BGEMM is a BFloat16-native GEMM kernel that operates directly on BF16 inputs, avoiding the float32 conversion; this improves BF16 GEMM throughput and lowers memory pressure. SBGEMM is the earlier path, which required converting to float32 and back (extra copy/convert cost). This PR enables BGEMM for BF16 shapes via a small selector plus a build bump.
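The convert/round-trip cost the SBGEMM path pays can be illustrated with a small sketch that models bf16 as the top 16 bits of an IEEE-754 float32 (truncation rounding). Helper names are hypothetical; this is not PyTorch's or OpenBLAS's implementation:

```python
import struct

def fp32_to_bf16(f: float) -> int:
    # bf16 keeps the sign, exponent, and top 7 mantissa bits of an fp32.
    bits, = struct.unpack("<I", struct.pack("<f", f))
    return bits >> 16

def bf16_to_fp32(h: int) -> float:
    # Widening is exact: pad the low 16 mantissa bits with zeros.
    f, = struct.unpack("<f", struct.pack("<I", h << 16))
    return f

# SBGEMM-style path: widen bf16 inputs to fp32 buffers, multiply in fp32,
# then narrow the result back to bf16 -- two extra convert passes that a
# bf16-native BGEMM kernel avoids.
a, b = fp32_to_bf16(1.5), fp32_to_bf16(2.0)
prod = bf16_to_fp32(a) * bf16_to_fp32(b)   # fp32 accumulate
c = fp32_to_bf16(prod)                     # narrow back to bf16
print(bf16_to_fp32(c))                     # 3.0
```

On real matrices those extra widen/narrow passes touch every element of A, B, and C, which is where the memory-pressure saving of a native-BF16 kernel comes from.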

Benchmark Setup

  • Model: LLaMA-3.1-8B (BF16)
  • Threads: 16
  • Instance: c8g.Metal-48xl
  • OpenBLAS v0.3.31.dev
  • oneDNN v3.12.0
  • My changes compared against upstream PyTorch: bfe8c20037dcf9a9169251b35ed8efc6c66476f3

  • Self CPU total
    • Short prompt: 486.337s → 97.390s (4.99x faster)
    • Long prompt (repeated 512x): 951.762s → 486.337s (1.96x faster)
  • aten::mm self CPU
    • Short prompt: 329.574s → 69.056s (4.77x faster)
    • Long prompt: 771.995s → 329.574s (2.34x faster)

The majority of gains come from improved BF16 GEMM selection.
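The quoted speedup factors are consistent with the raw timings; a quick check:

```python
# Sanity check of the reported speedup ratios (before / after, in seconds).
def speedup(before: float, after: float) -> float:
    return before / after

assert round(speedup(486.337, 97.390), 2) == 4.99   # self CPU total, short prompt
assert round(speedup(951.762, 486.337), 2) == 1.96  # self CPU total, long prompt
assert round(speedup(329.574, 69.056), 2) == 4.77   # aten::mm, short prompt
assert round(speedup(771.995, 329.574), 2) == 2.34  # aten::mm, long prompt
```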

Methodology

  • Forced BGEMM vs oneDNN execution for BF16 GEMM shapes
  • Built ground-truth performance dataset
  • Trained a decision tree
  • Derived a simple rule-based backend heuristic
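The rule the decision tree distills to can be sketched as follows. This is a minimal illustration: the GEMV check and the m*n*k cutoff of 786432 come from the heuristic quoted in the review of this PR, but the function name and shape convention here are hypothetical:

```python
def prefer_onednn_bf16(m: int, n: int, k: int) -> bool:
    """Return True when oneDNN should handle a BF16 GEMM of shape (m, k) x (k, n)."""
    if m == 1 or n == 1:          # GEMV-like shapes: BGEMM wins
        return False
    return m * n * k > 786432     # large GEMMs: oneDNN wins

assert not prefer_onednn_bf16(1, 4096, 4096)   # decode-step GEMV -> BGEMM
assert not prefer_onednn_bf16(64, 64, 64)      # small GEMM -> BGEMM
assert prefer_onednn_bf16(512, 4096, 4096)     # large prefill GEMM -> oneDNN
```

A single threshold on m*n*k plus a GEMV special case is about as simple as a learned tree can be pruned to, which keeps the dispatch overhead negligible.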

Benchmark script

Benchmark.py

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01

@Anallear Anallear requested a review from jeffdaily as a code owner January 21, 2026 14:05
@pytorch-bot

pytorch-bot bot commented Jan 21, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/172945

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 8dd24fd with merge base 24e0e50 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added module: cpu CPU specific problem (e.g., perf, algorithm) release notes: releng release notes category labels Jan 21, 2026
@aditew01 aditew01 added the ciflow/inductor-perf-test-nightly Trigger nightly inductor perf tests label Jan 21, 2026
@pytorch-bot

pytorch-bot bot commented Jan 21, 2026

Unknown label ciflow/inductor-perf-test-nightly.
Currently recognized labels are

  • ciflow/b200
  • ciflow/b200-distributed
  • ciflow/b200-symm-mem
  • ciflow/binaries
  • ciflow/binaries_libtorch
  • ciflow/binaries_wheel
  • ciflow/dynamo
  • ciflow/h100
  • ciflow/h100-cutlass-backend
  • ciflow/h100-distributed
  • ciflow/h100-symm-mem
  • ciflow/inductor
  • ciflow/inductor-cu126
  • ciflow/inductor-micro-benchmark
  • ciflow/inductor-micro-benchmark-cpu-x86
  • ciflow/inductor-pallas
  • ciflow/inductor-perf-compare
  • ciflow/inductor-perf-test-nightly-rocm-mi300
  • ciflow/inductor-perf-test-nightly-rocm-mi355
  • ciflow/inductor-perf-test-nightly-x86-zen
  • ciflow/inductor-perf-test-nightly-xpu
  • ciflow/inductor-periodic
  • ciflow/inductor-rocm-mi200
  • ciflow/inductor-rocm-mi300
  • ciflow/linux-aarch64
  • ciflow/mps
  • ciflow/nightly
  • ciflow/op-benchmark
  • ciflow/periodic
  • ciflow/periodic-rocm-mi200
  • ciflow/periodic-rocm-mi300
  • ciflow/pull
  • ciflow/quantization-periodic
  • ciflow/riscv64
  • ciflow/rocm-mi200
  • ciflow/rocm-mi300
  • ciflow/rocm-mi355
  • ciflow/rocm-navi31
  • ciflow/s390
  • ciflow/slow
  • ciflow/slow-rocm-mi200
  • ciflow/torchbench
  • ciflow/triton_binaries
  • ciflow/trunk
  • ciflow/unstable
  • ciflow/vllm
  • ciflow/win-arm64
  • ciflow/xpu

@robert-hardwick
Collaborator

@pytorchbot label "ciflow/linux-aarch64"

@pytorch-bot pytorch-bot bot added the ciflow/linux-aarch64 linux aarch64 CI workflow label Jan 21, 2026
@drisspg drisspg added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jan 21, 2026
@aditew01
Collaborator

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/172945/head returned non-zero exit code 1

Rebasing (1/10)
hint: Recursive merging with submodules currently only supports trivial cases.
hint: Please manually handle the merging of each conflicted submodule.
hint: This can be accomplished with the following steps:
hint:  - come back to superproject and run:
hint:
hint:       git add third_party/ideep
hint:
hint:    to record the above merge or update
hint:  - resolve any other conflicts in the superproject
hint:  - commit the resulting index in the superproject
hint:
hint: Disable this message with "git config set advice.submoduleMergeConflict false"
CONFLICT (modify/delete): .ci/aarch64_linux/aarch64_ci_build.sh deleted in HEAD and modified in 3fbfdfb25df (tools update).  Version 3fbfdfb25df (tools update) of .ci/aarch64_linux/aarch64_ci_build.sh left in tree.
CONFLICT (modify/delete): .ci/aarch64_linux/aarch64_wheel_ci_build.py deleted in HEAD and modified in 3fbfdfb25df (tools update).  Version 3fbfdfb25df (tools update) of .ci/aarch64_linux/aarch64_wheel_ci_build.py left in tree.
Auto-merging .gitmodules
Auto-merging aten/src/ATen/CMakeLists.txt
Auto-merging aten/src/ATen/native/Activation.cpp
Auto-merging aten/src/ATen/native/LinearAlgebra.cpp
Auto-merging aten/src/ATen/native/cpu/int4mm_kernel.cpp
CONFLICT (content): Merge conflict in aten/src/ATen/native/cpu/int4mm_kernel.cpp
Auto-merging aten/src/ATen/native/kleidiai/kai_kernels.cpp
CONFLICT (content): Merge conflict in aten/src/ATen/native/kleidiai/kai_kernels.cpp
Auto-merging aten/src/ATen/native/kleidiai/kai_pack.h
CONFLICT (content): Merge conflict in aten/src/ATen/native/kleidiai/kai_pack.h
Auto-merging aten/src/ATen/native/kleidiai/kai_ukernel_interface.cpp
CONFLICT (content): Merge conflict in aten/src/ATen/native/kleidiai/kai_ukernel_interface.cpp
Auto-merging aten/src/ATen/native/mkldnn/Matmul.cpp
Auto-merging cmake/Dependencies.cmake
Auto-merging setup.py
CONFLICT (content): Merge conflict in setup.py
Auto-merging test/inductor/test_mkldnn_pattern_matcher.py
Auto-merging test/inductor/test_torchinductor.py
CONFLICT (content): Merge conflict in test/inductor/test_torchinductor.py
Auto-merging test/test_linalg.py
Auto-merging third_party/LICENSES_BUNDLED.txt
CONFLICT (content): Merge conflict in third_party/LICENSES_BUNDLED.txt
CONFLICT (modify/delete): third_party/NVTX deleted in 3fbfdfb25df (tools update) and modified in HEAD.  Version HEAD of third_party/NVTX left in tree.
CONFLICT (modify/delete): third_party/cudnn_frontend deleted in 3fbfdfb25df (tools update) and modified in HEAD.  Version HEAD of third_party/cudnn_frontend left in tree.
CONFLICT (modify/delete): third_party/cutlass deleted in 3fbfdfb25df (tools update) and modified in HEAD.  Version HEAD of third_party/cutlass left in tree.
CONFLICT (modify/delete): third_party/flash-attention deleted in 3fbfdfb25df (tools update) and modified in HEAD.  Version HEAD of third_party/flash-attention left in tree.
Failed to merge submodule third_party/ideep (not checked out)
CONFLICT (submodule): Merge conflict in third_party/ideep
Auto-merging torch/_meta_registrations.py
error: could not apply 3fbfdfb25df... tools update
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Could not apply 3fbfdfb25df... # tools update

Raised by https://github.com/pytorch/pytorch/actions/runs/21246174215

@Anallear Anallear force-pushed the bf16OptimizedHeuristics branch from c7cd14c to 9bd68e7 Compare January 23, 2026 14:48
@pytorch-bot pytorch-bot bot removed ciflow/inductor-perf-test-nightly Trigger nightly inductor perf tests ciflow/linux-aarch64 linux aarch64 CI workflow labels Jan 23, 2026
@Anallear Anallear force-pushed the bf16OptimizedHeuristics branch from 9bd68e7 to 95f34c7 Compare February 17, 2026 16:45
@fadara01
Collaborator

@pytorchbot label "ciflow/linux-aarch64"

@pytorch-bot pytorch-bot bot added the ciflow/linux-aarch64 linux aarch64 CI workflow label Feb 19, 2026
Collaborator

@fadara01 fadara01 left a comment


Great work thank you!
I added a few minor comments.

Could you please also make it explicit in the PR description that you update the OpenBLAS version, that the new version contains BGEMM kernels, and how they differ from SBGEMM?

Could you also attach the benchmark script you ran and the speedups achieved with this PR?

Comment on lines +346 to +347
ENDIF(BLAS_HAS_SBGEMM)
set(CMAKE_REQUIRED_LIBRARIES)
Collaborator


NIT: why does this need to be changed?

bool tf32_usable = std::is_same_v<scalar_t, float> && use_mkldnn_tf32_matmul();
if ( !(bf16_usable || fp16_usable || bf32_usable || tf32_usable) ||
if (bf16_usable) {
// New BF16-only heuristic
Collaborator


NIT: let's have a better comment, maybe something along the lines of: "for these cases oneDNN is better than OpenBLAS"

Comment on lines +13 to +23
@@ -20,7 +19,5 @@ CFLAGS=-O3
BUILD_BFLOAT16=1
"

make -j8 ${OPENBLAS_BUILD_FLAGS} -C $OPENBLAS_CHECKOUT_DIR
sudo make install -C $OPENBLAS_CHECKOUT_DIR

rm -rf $OPENBLAS_CHECKOUT_DIR
\ No newline at end of file
make -j8 ${OPENBLAS_BUILD_FLAGS} -C ${OPENBLAS_CHECKOUT_DIR}
make -j8 ${OPENBLAS_BUILD_FLAGS} install -C ${OPENBLAS_CHECKOUT_DIR}
\ No newline at end of file
Collaborator


Apart from the OpenBLAS version update, why are we modifying this?

}
#endif
#if AT_BUILD_WITH_BLAS() && defined(BLAS_HAS_SBGEMM)
#if AT_BUILD_WITH_BLAS() && (defined(BLAS_HAS_SBGEMM) || defined(BLAS_HAS_BGEMM))
Collaborator


Is the || defined(BLAS_HAS_BGEMM) redundant here?
Will you ever have BLAS_HAS_BGEMM without BLAS_HAS_SBGEMM?

Contributor Author


Thanks for the suggestion. I kept both BLAS_HAS_SBGEMM and BLAS_HAS_BGEMM checks just to stay safe across OpenBLAS versions, since some older builds may only expose SBGEMM while newer ones define BGEMM.

I’m happy to simplify it if we think SBGEMM will always be there going forward. Do you think it’s safe to rely on SBGEMM-only in future versions?

c[j * ldc_ + i] = c10::convert<at::BFloat16>(float_v[j * m_ + i]);
}
}
#endif //
Collaborator


NIT: // defined(BLAS_HAS_BGEMM)

@Anallear Anallear force-pushed the bf16OptimizedHeuristics branch from 95f34c7 to 3e723ba Compare February 20, 2026 16:07
@pytorch-bot pytorch-bot bot removed the ciflow/linux-aarch64 linux aarch64 CI workflow label Feb 20, 2026
@fadara01
Collaborator

@pytorchbot label "ciflow/linux-aarch64"

@pytorch-bot pytorch-bot bot added the ciflow/linux-aarch64 linux aarch64 CI workflow label Feb 20, 2026
@robert-hardwick
Collaborator

The failing CI
HTTP request sent, awaiting response... 404 Not Found

seems unrelated, but the rebase didn't seem to fix it.

@Anallear Anallear force-pushed the bf16OptimizedHeuristics branch from 3e723ba to ea21fe0 Compare February 23, 2026 23:19
@pytorch-bot pytorch-bot bot removed the ciflow/linux-aarch64 linux aarch64 CI workflow label Feb 23, 2026
@Anallear Anallear force-pushed the bf16OptimizedHeuristics branch 2 times, most recently from 6e20e67 to 50da97f Compare February 25, 2026 12:42
@Anallear
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased bf16OptimizedHeuristics onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout bf16OptimizedHeuristics && git pull --rebase)

@Anallear Anallear force-pushed the bf16OptimizedHeuristics branch from 1d11997 to 5e37303 Compare March 4, 2026 02:27
@robert-hardwick
Collaborator

@pytorchbot label "ciflow/linux-aarch64"

@pytorch-bot pytorch-bot bot added the ciflow/linux-aarch64 linux aarch64 CI workflow label Mar 9, 2026
fadara01 added a commit that referenced this pull request Mar 10, 2026
OpenBLAS v0.3.31 adds support for BGEMM on SVE128, SVE256 machines
and general optimizations for SBGEMM/BGEMM: OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399 among other things.

OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20%: OpenMathLib/OpenBLAS#5667

This accelerates SDPA, and will be capitalized on further by #172945 to accelerate linear, mm, bmm, etc.

PS: BGEMM means bf16 x bf16 -> bf16 and SBGEMM means: bf16 x bf16 -> fp32
ghstack-source-id: cf38a01
Pull-Request: #177012
fadara01 added a commit that referenced this pull request Mar 10, 2026 (same message; ghstack-source-id: 952fd9e, Pull-Request: #177012)
fadara01 added a commit that referenced this pull request Mar 10, 2026 (same message; ghstack-source-id: 596be25, Pull-Request: #177012)
fadara01 added a commit that referenced this pull request Mar 16, 2026 (same message; ghstack-source-id: 545189c, Pull-Request: #177012)
@aditew01 aditew01 added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 16, 2026
@aditew01 aditew01 requested review from Skylion007 and albanD March 16, 2026 15:26
@aditew01
Collaborator

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased bf16OptimizedHeuristics onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout bf16OptimizedHeuristics && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the bf16OptimizedHeuristics branch from 5e37303 to 8dd24fd Compare March 17, 2026 10:12
@pytorch-bot pytorch-bot bot removed ciflow/trunk Trigger trunk jobs on your pull request ciflow/linux-aarch64 linux aarch64 CI workflow labels Mar 17, 2026
@aditew01 aditew01 added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 17, 2026
@aditew01 aditew01 requested a review from fadara01 March 18, 2026 09:22
@fadara01
Collaborator

fadara01 commented Mar 18, 2026

Hi @Anallear - nice speedups for llama!
Could you please share a more exhaustive benchmark script with a sweep of M, K, N, etc., and the before/after numbers?

@fadara01
Collaborator

btw, it seems like the new version of OpenBLAS, v0.3.32, will be released by the end of the week: OpenMathLib/OpenBLAS#5682

I raised a separate PR, #177012, to update OpenBLAS to that version.
Let's do the version update in that PR and keep this PR just for the heuristics?
Could you please re-generate the heuristics against the new version of OpenBLAS, as we recently accelerated BGEMM/SBGEMM there.

Comment on lines +164 to +169
if (bf16_usable) {
// BF16 heuristic: use BGEMM for GEMV-like or small shapes,
// otherwise prefer oneDNN for larger workloads.
if ((m == 1 || n == 1) || (m * n * k <= 786432)) {
return false;
}
Collaborator


Changing the rule here might be troublesome for different platforms.

If you care about performance on the Arm platform, I suggest changing this condition (whether to use oneDNN or not) only for Arm.
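One way to scope the rule as suggested, sketched here with a runtime flag purely for illustration; in the actual C++ this would presumably be a compile-time check such as `#if defined(__aarch64__)`, and the names are hypothetical:

```python
def prefer_onednn_bf16(on_arm: bool, m: int, n: int, k: int) -> bool:
    """True when oneDNN should handle a BF16 GEMM of shape (m, k) x (k, n)."""
    if not on_arm:
        return True                   # non-Arm: keep routing BF16 GEMM to oneDNN
    if m == 1 or n == 1:              # Arm, GEMV-like: BGEMM wins
        return False
    return m * n * k > 786432         # Arm, large GEMM: oneDNN wins

assert prefer_onednn_bf16(False, 1, 4096, 4096)      # x86: behaviour unchanged
assert not prefer_onednn_bf16(True, 1, 4096, 4096)   # Arm: GEMV-like -> BGEMM
```

Gating at compile time keeps the x86 dispatch path byte-for-byte identical, which addresses the cross-platform concern without forking the heuristic logic.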
