Use bfdot instruction in ARM bfloat16 dot product if compiling with it available#127488
swolchok wants to merge 4 commits into gh/swolchok/638/base
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/127488
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 6 Unrelated Failures as of commit bf76663 with merge base 36e9b71.
NEW FAILURES - The following jobs have failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
BROKEN TRUNK - The following jobs failed but were present on the merge base: 👉 Rebase onto the `viable/strict` branch to avoid these failures
UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and have been marked as unstable:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
```cpp
#ifdef __ARM_FEATURE_BF16
  return vbfdotq_f32(a, b, c);
#else
  TORCH_CHECK(false, "unsupported operation used!");
```
Should this be a static_assert False situation? I guess we want the function as a stub even if it's unavailable, which precludes that.
> Should this be a static_assert False situation?
It should be statically dead in this situation; the TORCH_CHECK is just a backstop in case that isn't true in the future by accident.
```cpp
}

static C10_ALWAYS_INLINE float32x4_t f32_dot_bf16(float32x4_t a, bfloat16x8_t b, bfloat16x8_t c) {
#ifdef __ARM_FEATURE_BF16
```
To make this evaluate to true, one needs to compile with `-march=armv8.2-a+bf16`.
malfet left a comment:
Hmm, I don't see much perf change:
Before:
```
mv_nt    torch.float32      4.99 usec
mv_nt    torch.float16      9.70 usec
mv_nt    torch.bfloat16    15.38 usec
mv_ta    torch.float32      3.96 usec
mv_ta    torch.float16     26.92 usec
mv_ta    torch.bfloat16   156.50 usec
notrans  torch.float32      3.96 usec
notrans  torch.float16     28.00 usec
notrans  torch.bfloat16    75.61 usec
trans_a  torch.float32      3.94 usec
trans_a  torch.float16     83.89 usec
trans_a  torch.bfloat16   366.66 usec
trans_b  torch.float32      7.44 usec
trans_b  torch.float16      9.58 usec
trans_b  torch.bfloat16    17.40 usec
```
After:
```
mv_nt    torch.float32      4.93 usec
mv_nt    torch.float16      9.69 usec
mv_nt    torch.bfloat16    15.18 usec
mv_ta    torch.float32      3.84 usec
mv_ta    torch.float16     26.41 usec
mv_ta    torch.bfloat16   153.94 usec
notrans  torch.float32      3.83 usec
notrans  torch.float16     27.57 usec
notrans  torch.bfloat16    74.61 usec
trans_a  torch.float32      3.81 usec
trans_a  torch.float16     83.84 usec
trans_a  torch.bfloat16   366.30 usec
trans_b  torch.float32      7.35 usec
trans_b  torch.float16      9.50 usec
trans_b  torch.bfloat16    17.38 usec
```
Very surprising. I would recommend 2 things:
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as
getting to the bottom of this now by benchmarking on phone
Port of pytorch/pytorch#127488 . Differential Revision: [D62159105](https://our.internmc.facebook.com/intern/diff/D62159105/) [ghstack-poisoned]
…M_FEATURE_BF16" Port of pytorch/pytorch#127488 . Differential Revision: [D62159105](https://our.internmc.facebook.com/intern/diff/D62159105/) [ghstack-poisoned]
Summary: Pull Request resolved: #5249 Port of pytorch/pytorch#127488 . ghstack-source-id: 242493751 exported-using-ghexport Reviewed By: kimishpatel Differential Revision: D62159105 fbshipit-source-id: e8256ef21373ee5bbc58c15fa2dcf2753af839e0
This will be replaced by a PR to sync the latest work from ExecuTorch (including pytorch/executorch#5444), and hopefully a follow-up to extract a file that can be identical between pytorch/pytorch and pytorch/executorch.
ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes not-yet-reviewed pytorch/executorch#5444 . Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/) [ghstack-poisoned]
ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes not-yet-reviewed pytorch/executorch#5444 . Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/) ghstack-source-id: 243596406 Pull Request resolved: #136331
Replacement PR is #136331
…t back to ATen BlasKernel" ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes not-yet-reviewed pytorch/executorch#5444 . Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/) [ghstack-poisoned]
…lasKernel" ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes not-yet-reviewed pytorch/executorch#5444 . Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/) [ghstack-poisoned]
Pull Request resolved: #136331 ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes not-yet-reviewed pytorch/executorch#5444 . ghstack-source-id: 244187352 Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/)
Pull Request resolved: #136331 ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes pytorch/executorch#5444 . ghstack-source-id: 245455630 Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/)
…t back to ATen BlasKernel" ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes pytorch/executorch#5444 . Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/) [ghstack-poisoned]
…lasKernel" ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes pytorch/executorch#5444 . Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/) [ghstack-poisoned]
Pull Request resolved: #136331 ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes pytorch/executorch#5444 . ghstack-source-id: 245903000 Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/)
…136331) ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes pytorch/executorch#5444 . Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/) Pull Request resolved: #136331 Approved by: https://github.com/malfet, https://github.com/albanD ghstack dependencies: #136445
Pull Request resolved: pytorch/executorch#5249 Port of pytorch/pytorch#127488 . ghstack-source-id: 242493751 @exported-using-ghexport Differential Revision: [D62159105](https://our.internmc.facebook.com/intern/diff/D62159105/)
Stack from ghstack (oldest at bottom):
Summary: We should use this intrinsic if the hardware supports it.
Test Plan: Currently untested because I don't have access to appropriate hardware.