Accelerate SDPA on Arm CPUs: Unroll exp_sum and max_mul kernels#177009
fadara01 wants to merge 4 commits into gh/fadara01/11/base
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/177009
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (2 unrelated failures) As of commit 253a1ea with merge base 0951602. FLAKY - the following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR needs a
@pytorchbot label "topic: not user facing"
ghstack-source-id: 65fa898 Pull-Request: pytorch/pytorch#177009
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@ezyang do you mind taking a look at this, since you were auto-assigned to this change?
maybe we need to guard the introduced logic under
…rch#177009)

We noticed that (e.g. in Whisper with B=8, num_heads=20, seqlen=1500, head_dim=64) 40% of our time in scaled-dot-product attention is spent in the `_exp_reduce_sum_fusion_kernel`. Some of that overhead was addressed by pytorch#176881, which introduces a faster exp. We build on top of that here and squeeze more performance out of SDPA by unrolling the exp_sum and max_mul kernels for better ILP. The unrolling pattern used here is already present in other SDPA helper kernels like [_scale_attn_mask_fusion_kernel](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/cpu/FlashAttentionKernel.cpp#L53).

While using `VectorizedN` for unrolling, I noticed that:
- we don't have a fast path to convert `VectorizedN<float>` to `VectorizedN<bfloat16>` for NEON, so I added that. We already have that for SVE.
- we don't have a fast path to short-circuit identity conversions for `VectorizedN` (e.g. `VectorizedN<float>` to `VectorizedN<float>`), so I added that too.

## Performance

Using [this SDPA benchmark](https://gist.github.com/fadara01/5357a52299a3722587f6691d145e71e9), here are the scaled-dot-product-attention speedups achieved with 16 Neoverse-V2 cores:

| B | Hq | Hkv | Lq | Lk | D | causal | gqa | Speedup from pytorch#176881 vs current | Speedup from pytorch#176881 and this PR vs current |
|---:|---:|---:|---:|---:|---:|---|---|---:|---:|
| 1 | 32 | 8 | 2048 | 2048 | 128 | True | True | +9.48% | +14.91% |
| 1 | 32 | 8 | 1 | 2048 | 128 | False | True | -1.42% | -2.79% |
| 1 | 16 | 16 | 6400 | 6400 | 80 | False | False | +5.18% | +11.60% |
| 1 | 20 | 20 | 1500 | 1500 | 64 | False | False | +6.63% | +11.80% |
| 8 | 20 | 20 | 1500 | 1500 | 64 | False | False | +9.31% | +17.12% |

Pull Request resolved: pytorch#177009
Approved by: https://github.com/jgong5, https://github.com/Skylion007
ghstack dependencies: pytorch#176881
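To illustrate the unrolling idea, here is a minimal scalar sketch (not the PR's actual `VectorizedN`-based kernel; the function name and loop structure are illustrative only). The exp_sum step computes `exp(in[i] - max)` for each element and accumulates a running sum; keeping several independent partial sums breaks the loop-carried dependency on a single accumulator and exposes more instruction-level parallelism:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative sketch of unrolling the exp_sum step of SDPA's softmax:
// out[i] = exp(in[i] - max), returning the sum of all outputs.
// Four independent accumulators (s0..s3) let the CPU overlap the exp and
// add chains; the real kernel achieves this with VectorizedN<float, N>.
float exp_sum_unrolled(const float* in, float* out, std::size_t n, float max) {
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;  // independent partial sums
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        float e0 = std::exp(in[i + 0] - max);
        float e1 = std::exp(in[i + 1] - max);
        float e2 = std::exp(in[i + 2] - max);
        float e3 = std::exp(in[i + 3] - max);
        out[i + 0] = e0; out[i + 1] = e1;
        out[i + 2] = e2; out[i + 3] = e3;
        s0 += e0; s1 += e1; s2 += e2; s3 += e3;
    }
    for (; i < n; ++i) {  // scalar tail for the remaining elements
        float e = std::exp(in[i] - max);
        out[i] = e;
        s0 += e;
    }
    return (s0 + s1) + (s2 + s3);
}
```

With all inputs equal to the running max, each output is `exp(0) = 1` and the returned sum is simply `n`, which makes the sketch easy to sanity-check.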
…ls (pytorch#177009)" This reverts commit e5c56db. Reverted pytorch#177009 on behalf of https://github.com/yangw-dev due to sorry it seems this breaks internal tests xplat/caffe2/aten/src/ATen/cpu/vec/vec128/vec128_convert.h:390:17: error: use of undeclared identifier 'convert_float_half', D96767295. please reach out to meta internal folks to resolve this ([comment](pytorch#177009 (comment)))
@malfet - has Meta's internal PyTorch been modified to work with this PR?
@fadara01 there is no Meta-internal PR associated with this one, but let me have a look at what is going on
@ezyang does this PR need any additional changes based on the above, or can we trigger a merge as is?
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed: trunk / macos-py3-arm64 / build. Details for Dev Infra team: raised by workflow job.
@pytorchbot merge -i
Merge started. Your change will be merged while ignoring the following 1 check: trunk / macos-py3-arm64 / build. Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
(Re-land of the commit message above.) Pull Request resolved: pytorch#177009. Approved by: https://github.com/jgong5, https://github.com/Skylion007, https://github.com/malfet. ghstack dependencies: pytorch#176881
@pytorchmergebot revert -c ghfirst -m "Failing internally"
@pytorchbot successfully started a revert job. Check the current status here.
@fadara01 your PR has been successfully reverted.
…ls (#177009)" This reverts commit e8dab08. Reverted #177009 on behalf of https://github.com/atalman due to Failing internally ([comment](#177009 (comment)))
Error: Looks like it's pointing to: https://github.com/pytorch/executorch/blob/main/kernels/optimized/cpu/op_exp.cpp#L11 And one more error:
@atalman is it possible to publish a reproducer, or point to the possible fixes that can go in the PR? I can't make much of the published error trace.
Stack from ghstack (oldest at bottom):
We noticed that (e.g. in Whisper with B=8, num_heads=20, seqlen=1500, head_dim=64) 40% of our time in scaled-dot-product attention is spent in the `_exp_reduce_sum_fusion_kernel`. Some of that overhead was addressed by #176881, which introduces a faster exp.
We build on top of that here and squeeze more perf out of SDPA through unrolling the exp_sum and max_mul kernels for better ILP.
The unrolling pattern used here is already present in other SDPA helper kernels like `_scale_attn_mask_fusion_kernel`.

While using `VectorizedN` for unrolling, I noticed that:
- we don't have a fast path to convert `VectorizedN<float>` to `VectorizedN<bfloat16>` for NEON, so I added that. We already have that for SVE.
- we don't have a fast path to short-circuit identity conversions for `VectorizedN` (e.g. `VectorizedN<float>` to `VectorizedN<float>`), so I added that too.

## Performance

Using this SDPA benchmark, here are the scaled-dot-product-attention speedups achieved with 16 Neoverse-V2 cores:

| B | Hq | Hkv | Lq | Lk | D | causal | gqa | Speedup from #176881 vs current | Speedup from #176881 and this PR vs current |
|---:|---:|---:|---:|---:|---:|---|---|---:|---:|
| 1 | 32 | 8 | 2048 | 2048 | 128 | True | True | +9.48% | +14.91% |
| 1 | 32 | 8 | 1 | 2048 | 128 | False | True | -1.42% | -2.79% |
| 1 | 16 | 16 | 6400 | 6400 | 80 | False | False | +5.18% | +11.60% |
| 1 | 20 | 20 | 1500 | 1500 | 64 | False | False | +6.63% | +11.80% |
| 8 | 20 | 20 | 1500 | 1500 | 64 | False | False | +9.31% | +17.12% |
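The identity-conversion short-circuit mentioned above can be illustrated with a toy template sketch. `VecN` and `Convert` here are hypothetical stand-ins for `at::vec::VectorizedN` and its conversion helper, not the PR's actual code: the generic path does an element-wise cast, while a partial specialization for matching source and destination types simply passes the value through, so the compiler emits no per-element work.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Toy fixed-width vector type standing in for at::vec::VectorizedN.
template <typename T, std::size_t N>
struct VecN { std::array<T, N> v; };

// Generic conversion: element-wise static_cast from Src to Dst.
template <typename Dst, typename Src, std::size_t N>
struct Convert {
    static VecN<Dst, N> apply(const VecN<Src, N>& x) {
        VecN<Dst, N> out{};
        for (std::size_t i = 0; i < N; ++i) out.v[i] = static_cast<Dst>(x.v[i]);
        return out;
    }
};

// Identity short-circuit: when Src == Dst, just return the input unchanged,
// avoiding the per-element loop entirely.
template <typename T, std::size_t N>
struct Convert<T, T, N> {
    static VecN<T, N> apply(const VecN<T, N>& x) { return x; }
};
```

Partial specialization is one common way to express such a fast path; the real `VectorizedN` conversion machinery is more involved, but the dispatch principle is the same.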
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01