
Bf16 optimized heuristics#172945

Open
Anallear wants to merge 1 commit into pytorch:main from Anallear:bf16OptimizedHeuristics

Conversation

@Anallear
Contributor

@Anallear Anallear commented Jan 21, 2026

This PR introduces:

  • BGEMM backend integration for BF16 GEMM
  • A data-driven decision-tree heuristic for selecting BGEMM vs oneDNN
  • Significant CPU inference improvements

OpenBLAS update

  • OpenBLAS version updated to 0.3.31.dev (this build contains BGEMM kernels for BF16).
  • BGEMM vs SBGEMM: BGEMM is a BFloat16-native GEMM kernel that operates directly on BF16 inputs, avoiding the float32 conversion; this improves BF16 GEMM throughput and lowers memory pressure. SBGEMM is the earlier path, which required converting to float32 and back (extra copy/convert cost). This PR enables BGEMM for BF16 shapes via a small selector plus a build bump.
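The convert/round-trip cost the SBGEMM path pays can be illustrated with a small sketch that models bf16 as the top 16 bits of an IEEE-754 float32 (truncation rounding). Helper names are hypothetical; this is not PyTorch's or OpenBLAS's implementation:

```python
import struct

def fp32_to_bf16(f: float) -> int:
    # bf16 keeps the sign, exponent, and top 7 mantissa bits of an fp32.
    bits, = struct.unpack("<I", struct.pack("<f", f))
    return bits >> 16

def bf16_to_fp32(h: int) -> float:
    # Widening is exact: pad the low 16 mantissa bits with zeros.
    f, = struct.unpack("<f", struct.pack("<I", h << 16))
    return f

# SBGEMM-style path: widen bf16 inputs to fp32 buffers, multiply in fp32,
# then narrow the result back to bf16 -- two extra convert passes that a
# bf16-native BGEMM kernel avoids.
a, b = fp32_to_bf16(1.5), fp32_to_bf16(2.0)
prod = bf16_to_fp32(a) * bf16_to_fp32(b)   # fp32 accumulate
c = fp32_to_bf16(prod)                     # narrow back to bf16
print(bf16_to_fp32(c))                     # 3.0
```

On real matrices those extra widen/narrow passes touch every element of A, B, and C, which is where the memory-pressure saving of a native-BF16 kernel comes from.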

Benchmark Setup

  • Model: LLaMA-3.1-8B (BF16)
  • Threads: 16
  • Instance: c8g.Metal-48xl
  • OpenBLAS v0.3.31.dev
  • oneDNN v3.12.0
  • My changes compared against upstream PyTorch: bfe8c20037dcf9a9169251b35ed8efc6c66476f3

  • Self CPU total
    • Short prompt: 486.337s → 97.390s (4.99x faster)
    • Long prompt (repeated 512x): 951.762s → 486.337s (1.96x faster)
  • aten::mm self CPU
    • Short prompt: 329.574s → 69.056s (4.77x faster)
    • Long prompt: 771.995s → 329.574s (2.34x faster)

The majority of gains come from improved BF16 GEMM selection.
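The quoted speedup factors are consistent with the raw timings; a quick check:

```python
# Sanity check of the reported speedup ratios (before / after, in seconds).
def speedup(before: float, after: float) -> float:
    return before / after

assert round(speedup(486.337, 97.390), 2) == 4.99   # self CPU total, short prompt
assert round(speedup(951.762, 486.337), 2) == 1.96  # self CPU total, long prompt
assert round(speedup(329.574, 69.056), 2) == 4.77   # aten::mm, short prompt
assert round(speedup(771.995, 329.574), 2) == 2.34  # aten::mm, long prompt
```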

Methodology

  • Forced BGEMM vs oneDNN execution for BF16 GEMM shapes
  • Built ground-truth performance dataset
  • Trained a decision tree
  • Derived a simple rule-based backend heuristic
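The rule the decision tree distills to can be sketched as follows. This is a minimal illustration: the GEMV check and the m*n*k cutoff of 786432 come from the heuristic quoted in the review of this PR, but the function name and shape convention here are hypothetical:

```python
def prefer_onednn_bf16(m: int, n: int, k: int) -> bool:
    """Return True when oneDNN should handle a BF16 GEMM of shape (m, k) x (k, n)."""
    if m == 1 or n == 1:          # GEMV-like shapes: BGEMM wins
        return False
    return m * n * k > 786432     # large GEMMs: oneDNN wins

assert not prefer_onednn_bf16(1, 4096, 4096)   # decode-step GEMV -> BGEMM
assert not prefer_onednn_bf16(64, 64, 64)      # small GEMM -> BGEMM
assert prefer_onednn_bf16(512, 4096, 4096)     # large prefill GEMM -> oneDNN
```

A single threshold on m*n*k plus a GEMV special case is about as simple as a learned tree can be pruned to, which keeps the dispatch overhead negligible.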

Benchmark script

Benchmark.py

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01

@Anallear Anallear requested a review from jeffdaily as a code owner January 21, 2026 14:05
@pytorch-bot

pytorch-bot bot commented Jan 21, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/172945

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 8dd24fd with merge base 24e0e50 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added module: cpu CPU specific problem (e.g., perf, algorithm) release notes: releng release notes category labels Jan 21, 2026
@aditew01 aditew01 added the ciflow/inductor-perf-test-nightly Trigger nightly inductor perf tests label Jan 21, 2026
@pytorch-bot

pytorch-bot bot commented Jan 21, 2026

Unknown label ciflow/inductor-perf-test-nightly.
Currently recognized labels are

  • ciflow/b200
  • ciflow/b200-distributed
  • ciflow/b200-symm-mem
  • ciflow/binaries
  • ciflow/binaries_libtorch
  • ciflow/binaries_wheel
  • ciflow/dynamo
  • ciflow/h100
  • ciflow/h100-cutlass-backend
  • ciflow/h100-distributed
  • ciflow/h100-symm-mem
  • ciflow/inductor
  • ciflow/inductor-cu126
  • ciflow/inductor-micro-benchmark
  • ciflow/inductor-micro-benchmark-cpu-x86
  • ciflow/inductor-pallas
  • ciflow/inductor-perf-compare
  • ciflow/inductor-perf-test-nightly-rocm-mi300
  • ciflow/inductor-perf-test-nightly-rocm-mi355
  • ciflow/inductor-perf-test-nightly-x86-zen
  • ciflow/inductor-perf-test-nightly-xpu
  • ciflow/inductor-periodic
  • ciflow/inductor-rocm-mi200
  • ciflow/inductor-rocm-mi300
  • ciflow/linux-aarch64
  • ciflow/mps
  • ciflow/nightly
  • ciflow/op-benchmark
  • ciflow/periodic
  • ciflow/periodic-rocm-mi200
  • ciflow/periodic-rocm-mi300
  • ciflow/pull
  • ciflow/quantization-periodic
  • ciflow/riscv64
  • ciflow/rocm-mi200
  • ciflow/rocm-mi300
  • ciflow/rocm-mi355
  • ciflow/rocm-navi31
  • ciflow/s390
  • ciflow/slow
  • ciflow/slow-rocm-mi200
  • ciflow/torchbench
  • ciflow/triton_binaries
  • ciflow/trunk
  • ciflow/unstable
  • ciflow/vllm
  • ciflow/win-arm64
  • ciflow/xpu

@robert-hardwick
Collaborator

@pytorchbot label "ciflow/linux-aarch64"

@pytorch-bot pytorch-bot bot added the ciflow/linux-aarch64 linux aarch64 CI workflow label Jan 21, 2026
@drisspg drisspg added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jan 21, 2026
@aditew01
Collaborator

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/172945/head returned non-zero exit code 1

Rebasing (1/10)
hint: Recursive merging with submodules currently only supports trivial cases.
hint: Please manually handle the merging of each conflicted submodule.
hint: This can be accomplished with the following steps:
hint:  - come back to superproject and run:
hint:
hint:       git add third_party/ideep
hint:
hint:    to record the above merge or update
hint:  - resolve any other conflicts in the superproject
hint:  - commit the resulting index in the superproject
hint:
hint: Disable this message with "git config set advice.submoduleMergeConflict false"
CONFLICT (modify/delete): .ci/aarch64_linux/aarch64_ci_build.sh deleted in HEAD and modified in 3fbfdfb25df (tools update).  Version 3fbfdfb25df (tools update) of .ci/aarch64_linux/aarch64_ci_build.sh left in tree.
CONFLICT (modify/delete): .ci/aarch64_linux/aarch64_wheel_ci_build.py deleted in HEAD and modified in 3fbfdfb25df (tools update).  Version 3fbfdfb25df (tools update) of .ci/aarch64_linux/aarch64_wheel_ci_build.py left in tree.
Auto-merging .gitmodules
Auto-merging aten/src/ATen/CMakeLists.txt
Auto-merging aten/src/ATen/native/Activation.cpp
Auto-merging aten/src/ATen/native/LinearAlgebra.cpp
Auto-merging aten/src/ATen/native/cpu/int4mm_kernel.cpp
CONFLICT (content): Merge conflict in aten/src/ATen/native/cpu/int4mm_kernel.cpp
Auto-merging aten/src/ATen/native/kleidiai/kai_kernels.cpp
CONFLICT (content): Merge conflict in aten/src/ATen/native/kleidiai/kai_kernels.cpp
Auto-merging aten/src/ATen/native/kleidiai/kai_pack.h
CONFLICT (content): Merge conflict in aten/src/ATen/native/kleidiai/kai_pack.h
Auto-merging aten/src/ATen/native/kleidiai/kai_ukernel_interface.cpp
CONFLICT (content): Merge conflict in aten/src/ATen/native/kleidiai/kai_ukernel_interface.cpp
Auto-merging aten/src/ATen/native/mkldnn/Matmul.cpp
Auto-merging cmake/Dependencies.cmake
Auto-merging setup.py
CONFLICT (content): Merge conflict in setup.py
Auto-merging test/inductor/test_mkldnn_pattern_matcher.py
Auto-merging test/inductor/test_torchinductor.py
CONFLICT (content): Merge conflict in test/inductor/test_torchinductor.py
Auto-merging test/test_linalg.py
Auto-merging third_party/LICENSES_BUNDLED.txt
CONFLICT (content): Merge conflict in third_party/LICENSES_BUNDLED.txt
CONFLICT (modify/delete): third_party/NVTX deleted in 3fbfdfb25df (tools update) and modified in HEAD.  Version HEAD of third_party/NVTX left in tree.
CONFLICT (modify/delete): third_party/cudnn_frontend deleted in 3fbfdfb25df (tools update) and modified in HEAD.  Version HEAD of third_party/cudnn_frontend left in tree.
CONFLICT (modify/delete): third_party/cutlass deleted in 3fbfdfb25df (tools update) and modified in HEAD.  Version HEAD of third_party/cutlass left in tree.
CONFLICT (modify/delete): third_party/flash-attention deleted in 3fbfdfb25df (tools update) and modified in HEAD.  Version HEAD of third_party/flash-attention left in tree.
Failed to merge submodule third_party/ideep (not checked out)
CONFLICT (submodule): Merge conflict in third_party/ideep
Auto-merging torch/_meta_registrations.py
error: could not apply 3fbfdfb25df... tools update
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Could not apply 3fbfdfb25df... # tools update

Raised by https://github.com/pytorch/pytorch/actions/runs/21246174215

@Anallear Anallear force-pushed the bf16OptimizedHeuristics branch from c7cd14c to 9bd68e7 Compare January 23, 2026 14:48
@pytorch-bot pytorch-bot bot removed ciflow/inductor-perf-test-nightly Trigger nightly inductor perf tests ciflow/linux-aarch64 linux aarch64 CI workflow labels Jan 23, 2026
@Anallear Anallear force-pushed the bf16OptimizedHeuristics branch from 9bd68e7 to 95f34c7 Compare February 17, 2026 16:45
@fadara01
Collaborator

@pytorchbot label "ciflow/linux-aarch64"

@pytorch-bot pytorch-bot bot added the ciflow/linux-aarch64 linux aarch64 CI workflow label Feb 19, 2026
Collaborator

@fadara01 fadara01 left a comment


Great work thank you!
I added a few minor comments.

Could you please also make it explicit in the PR description that you update the OpenBLAS version, that the new version contains BGEMM kernels, and how they differ from SBGEMM?

Could you also attach the benchmark script you ran and the speedups achieved with this PR?

Comment on lines +346 to +347
ENDIF(BLAS_HAS_SBGEMM)
set(CMAKE_REQUIRED_LIBRARIES)
Collaborator


NIT: why does this need to be changed?

bool tf32_usable = std::is_same_v<scalar_t, float> && use_mkldnn_tf32_matmul();
if ( !(bf16_usable || fp16_usable || bf32_usable || tf32_usable) ||
if (bf16_usable) {
// New BF16-only heuristic
Collaborator


NIT: let's have a better comment, maybe something along the lines of: "for these cases oneDNN is better than OpenBLAS"

Comment on lines +13 to +23
@@ -20,7 +19,5 @@ CFLAGS=-O3
BUILD_BFLOAT16=1
"

make -j8 ${OPENBLAS_BUILD_FLAGS} -C $OPENBLAS_CHECKOUT_DIR
sudo make install -C $OPENBLAS_CHECKOUT_DIR

rm -rf $OPENBLAS_CHECKOUT_DIR
\ No newline at end of file
make -j8 ${OPENBLAS_BUILD_FLAGS} -C ${OPENBLAS_CHECKOUT_DIR}
make -j8 ${OPENBLAS_BUILD_FLAGS} install -C ${OPENBLAS_CHECKOUT_DIR}
\ No newline at end of file
Collaborator


Apart from the OpenBLAS version update, why are we modifying this?

}
#endif
#if AT_BUILD_WITH_BLAS() && defined(BLAS_HAS_SBGEMM)
#if AT_BUILD_WITH_BLAS() && (defined(BLAS_HAS_SBGEMM) || defined(BLAS_HAS_BGEMM))
Collaborator


Is the || defined(BLAS_HAS_BGEMM) redundant here?
Will you ever have BLAS_HAS_BGEMM without BLAS_HAS_SBGEMM?

Contributor Author


Thanks for the suggestion. I kept both BLAS_HAS_SBGEMM and BLAS_HAS_BGEMM checks just to stay safe across OpenBLAS versions, since some older builds may only expose SBGEMM while newer ones define BGEMM.

I’m happy to simplify it if we think SBGEMM will always be there going forward. Do you think it’s safe to rely on SBGEMM-only in future versions?

c[j * ldc_ + i] = c10::convert<at::BFloat16>(float_v[j * m_ + i]);
}
}
#endif //
Collaborator


NIT: // defined(BLAS_HAS_BGEMM)

@Anallear Anallear force-pushed the bf16OptimizedHeuristics branch from 95f34c7 to 3e723ba Compare February 20, 2026 16:07
@pytorch-bot pytorch-bot bot removed the ciflow/linux-aarch64 linux aarch64 CI workflow label Feb 20, 2026
@fadara01
Collaborator

@pytorchbot label "ciflow/linux-aarch64"

@pytorch-bot pytorch-bot bot added the ciflow/linux-aarch64 linux aarch64 CI workflow label Feb 20, 2026
@robert-hardwick
Collaborator

The failing CI
HTTP request sent, awaiting response... 404 Not Found

seems unrelated, but the rebase didn't seem to fix it.

@Anallear Anallear force-pushed the bf16OptimizedHeuristics branch from 3e723ba to ea21fe0 Compare February 23, 2026 23:19
@pytorch-bot pytorch-bot bot removed the ciflow/linux-aarch64 linux aarch64 CI workflow label Feb 23, 2026
@Anallear Anallear force-pushed the bf16OptimizedHeuristics branch 2 times, most recently from 6e20e67 to 50da97f Compare February 25, 2026 12:42
@Anallear
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased bf16OptimizedHeuristics onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout bf16OptimizedHeuristics && git pull --rebase)

@Anallear Anallear force-pushed the bf16OptimizedHeuristics branch from 1d11997 to 5e37303 Compare March 4, 2026 02:27
@robert-hardwick
Collaborator

@pytorchbot label "ciflow/linux-aarch64"

@pytorch-bot pytorch-bot bot added the ciflow/linux-aarch64 linux aarch64 CI workflow label Mar 9, 2026
fadara01 added a commit that referenced this pull request Mar 10, 2026
OpenBLAS v0.3.31 adds support for BGEMM on SVE128, SVE256 machines
and general optimizations for SBGEMM/BGEMM: OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399 among other things.

OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20%: OpenMathLib/OpenBLAS#5667

This accelerates SDPA, and will be capitalized on further by #172945 to accelerate linear, mm, bmm, etc.

PS: BGEMM means bf16 x bf16 -> bf16 and SBGEMM means: bf16 x bf16 -> fp32
ghstack-source-id: cf38a01
Pull-Request: #177012
fadara01 added a commit that referenced this pull request Mar 10, 2026 (same message; ghstack-source-id: 952fd9e, Pull-Request: #177012)
fadara01 added a commit that referenced this pull request Mar 10, 2026 (same message; ghstack-source-id: 596be25, Pull-Request: #177012)
fadara01 added a commit that referenced this pull request Mar 16, 2026 (same message; ghstack-source-id: 545189c, Pull-Request: #177012)
@aditew01 aditew01 added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 16, 2026
@aditew01 aditew01 requested review from Skylion007 and albanD March 16, 2026 15:26
@aditew01
Collaborator

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased bf16OptimizedHeuristics onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout bf16OptimizedHeuristics && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the bf16OptimizedHeuristics branch from 5e37303 to 8dd24fd Compare March 17, 2026 10:12
@pytorch-bot pytorch-bot bot removed ciflow/trunk Trigger trunk jobs on your pull request ciflow/linux-aarch64 linux aarch64 CI workflow labels Mar 17, 2026
@aditew01 aditew01 added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 17, 2026
@aditew01 aditew01 requested a review from fadara01 March 18, 2026 09:22
@fadara01
Collaborator

fadara01 commented Mar 18, 2026

Hi @Anallear - nice speedups for llama!
Could you please share a more exhaustive benchmark script with a sweep of M, K, N, etc., and the before/after numbers?

@fadara01
Collaborator

btw, it seems like the new version of OpenBLAS, v0.3.32, will be released by the end of the week: OpenMathLib/OpenBLAS#5682

I raised a separate PR, #177012, to update OpenBLAS to that version.
Let's do the version update in that PR and keep this PR just for the heuristics?
Could you please re-generate the heuristics against the new version of OpenBLAS, as we recently accelerated BGEMM/SBGEMM there.

Comment on lines +164 to +169
if (bf16_usable) {
// BF16 heuristic: use BGEMM for GEMV-like or small shapes,
// otherwise prefer oneDNN for larger workloads.
if ((m == 1 || n == 1) || (m * n * k <= 786432)) {
return false;
}
Collaborator


Changing the rule here might be troublesome for different platforms.

If you care about performance on the Arm platform, I suggest changing this condition (whether to use oneDNN or not) only for Arm.
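One way to scope the rule as suggested, sketched here with a runtime flag purely for illustration; in the actual C++ this would presumably be a compile-time check such as `#if defined(__aarch64__)`, and the names are hypothetical:

```python
def prefer_onednn_bf16(on_arm: bool, m: int, n: int, k: int) -> bool:
    """True when oneDNN should handle a BF16 GEMM of shape (m, k) x (k, n)."""
    if not on_arm:
        return True                   # non-Arm: keep routing BF16 GEMM to oneDNN
    if m == 1 or n == 1:              # Arm, GEMV-like: BGEMM wins
        return False
    return m * n * k > 786432         # Arm, large GEMM: oneDNN wins

assert prefer_onednn_bf16(False, 1, 4096, 4096)      # x86: behaviour unchanged
assert not prefer_onednn_bf16(True, 1, 4096, 4096)   # Arm: GEMV-like -> BGEMM
```

Gating at compile time keeps the x86 dispatch path byte-for-byte identical, which addresses the cross-platform concern without forking the heuristic logic.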
