Extending the Pytorch vec backend for SVE (ARM) #119571
maajidkhann wants to merge 11 commits into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/119571
Note: Links to docs will display an error until the docs builds have been completed. ❌ 1 New Failure, 2 Unrelated Failures as of commit c4499e7 with merge base a0207c8.
NEW FAILURE - The following job has failed:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
@huydhn Can you help us trigger the rest of the pipelines in the CI? |
|
@lezcano @nikitaved @IvanYashchuk |
|
The linalg changes lgtm. I don't think I'm the best reviewer for the rest of the PR. |
|
@jerryzh168 @salilsdesai @kimishpatel @digantdesai @jianyuh |
|
Hello @albanD @seemethere. This contribution PR is a little over 2000 lines, and we have all pipelines passing except Lint / pr-sanity-checks (pull_request). It's better for this contribution to go in as one patch rather than multiple, as that makes it easier for reviewers to understand and approve, and ensures it doesn't break the CI pipelines, since this vec backend requires changes in multiple files. Can you please help us bypass this one check for this pull request? |
I definitely can. Unfortunately it will go very slow due to other things on my plate. This is an important contribution FWIW |
|
this seems to be low risk, @kimishpatel should we just stamp? if it doesn't affect code size etc. |
|
@mcr229 can you import this and run the build size bot to check the build size impact? I haven't looked closely enough to be assured of that. |
|
@mcr229 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
|
Hi @maajidkhann , curious to know how much perf improvement you observed across those accelerated ops, any data you can share? thank you. |
|
This is awesome. Thanks for this PR. I will also try to review this. |
* Modified functions of vec_float.h and vec_double.h in SVE to work without Sleef functions in case of a build without Sleef, using the STL implementations instead. The modified functions are: acos(), acosh(), asin(), atan(), atanh(), atan2(), copysign(), erf(), erfc(), exp(), exp2(), expm1(), fmod(), hypot(), log(), log2(), log10(), log1p(), sin(), sinh(), cos(), cosh(), tan(), tanh(), lgamma(), pow()
* Made slight CMake include changes for SVE *.h files.
Signed-off-by: maajidkhann <maajidkhan.n@fujitsu.com>
* SVE is not supported on Apple silicon.
* The Clang compiler on macOS doesn't support SVE either. Don't enable the SVE build for darwin/macOS.
Signed-off-by: maajidkhann <maajidkhan.n@fujitsu.com>
Signed-off-by: maajidkhann <maajidkhan.n@fujitsu.com>
* We now use the cpuinfo_get_max_arm_sve_length() call from the cpuinfo library to determine the SVE vector length (VL) of the hardware and drive the dispatch mechanism.
Signed-off-by: maajidkhann <maajidkhan.n@fujitsu.com>
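The VL-based dispatch described in that commit can be sketched like this. The real code queries `cpuinfo_get_max_arm_sve_length()` from the cpuinfo library; here the call is stubbed (returning 256) so the sketch compiles and runs anywhere, and the enum name is illustrative rather than PyTorch's actual dispatch enum.

```cpp
#include <cstdint>

// Stub standing in for cpuinfo_get_max_arm_sve_length(); on real hardware
// the cpuinfo library reports the machine's SVE vector length in bits.
inline uint32_t cpuinfo_get_max_arm_sve_length_stub() {
  return 256;  // pretend the hardware reports a 256-bit SVE VL
}

enum class CpuCapability { DEFAULT, SVE256 };

// Pick the kernel set once at startup based on the hardware's SVE VL.
inline CpuCapability compute_cpu_capability() {
  if (cpuinfo_get_max_arm_sve_length_stub() == 256) {
    return CpuCapability::SVE256;  // dispatch to the SVE256 kernels
  }
  return CpuCapability::DEFAULT;   // scalar / NEON fallback kernels
}
```

Since the PR only adds SVE256 kernels, hardware with any other VL falls through to the default path under this scheme.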
|
Successfully rebased |
|
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here |
Merge failed. Reason: Approvers from one of the following sets are needed:
|
Hi @malfet, the new failure is due to a linker issue on the macOS SIMULATOR: periodic / ios-build-test / build (default, 1, 1, macos-14-xlarge, SIMULATOR, arm64, 1, 0, 1) (gh). This failure is produced with every PR with ciflow/periodic.
Also, this PR needs your approval (Core Maintainers) before merging. |
|
@pytorchbot merge -f "Lint + ARM CI + MacOS builds are green" |
|
@malfet approved this pull request.
|
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team. |
|
@maajidkhann I'm interested in this PR, but I would like to know if there would be any benefits based on this PR when using CUDA as the backend? |
Since the code in this PR extends SVE support for ARM CPUs, it primarily benefits CPU workloads running on ARM hardware with SVE support. If the user is leveraging CUDA as the backend for GPU acceleration, the PR would likely have no direct impact: CUDA workloads are accelerated by NVIDIA GPUs, and optimizations specific to ARM CPU vector extensions (like SVE) do not affect those GPU workloads. Possible indirect benefits: |
Thank you very much, this has addressed most of my questions. However, I still have one doubt: according to the blog lets-talk-about-the-pytorch-dispatcher and native_functions.yaml, not all operators have specifically registered implementations for CUDA, meaning some operators use a generic kernel implementation (CPU) across all devices. Could these operators potentially benefit from this change as well, even though I use CUDA as the backend? Unfortunately, I've tried reading PyTorch's documentation and source code, but haven't reached a clear conclusion on this point. |
Yes, I think certain operators that fall back on the CPU implementation could potentially benefit from the SVE optimizations introduced by this PR. Here's why: if the dispatcher falls back to a CPU implementation for an operator on a system using CUDA (because no CUDA-specific kernel exists), then the CPU version of the operator is executed. In that case, the improvements to the ATen CPU backend (with SVE support for ARM) help optimize that CPU-side execution. |
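The fallback behaviour discussed above can be modelled with a toy dispatch table: when an operator has no kernel registered for the requested device, the CPU kernel runs instead (and so benefits from CPU-side SVE optimizations). The names here are illustrative, not PyTorch's actual dispatcher API.

```cpp
#include <functional>
#include <map>
#include <string>

// Toy model of dispatcher fallback: look up a kernel by device name and
// fall back to the "cpu" kernel when no device-specific one is registered.
struct Op {
  std::map<std::string, std::function<int(int)>> kernels;  // keyed by device

  int call(const std::string& device, int x) const {
    auto it = kernels.find(device);
    if (it != kernels.end()) return it->second(x);  // device-specific kernel
    return kernels.at("cpu")(x);                    // fall back to CPU
  }
};
```

In this toy model, an `Op` with only a `"cpu"` kernel registered still answers a `"cuda"` call, via the CPU path.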
got it! thanks for your response |

Motivation:
In PyTorch, ATen vectorization supports multiple platforms, including x86 and Arm, as well as multiple data types. It provides a generic implementation of a Vector (Vec) type that allows the programmer to write code packing various primitives (such as floats) into 256-bit and 512-bit registers. It can easily be extended to support other ISAs by adding more VecISA sub-classes.
Reference Link: https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/cpu/vec
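The generic Vec idea above can be sketched as a fixed-width pack of elements with elementwise ops, which ISA-specific backends (AVX2, AVX-512, NEON, SVE, ...) specialize with real SIMD intrinsics. The class and member names here are illustrative, not the actual ATen declarations.

```cpp
// Minimal sketch of a generic Vec: N lanes of T with elementwise ops.
// A SIMD backend would replace the scalar loop with a single vector
// instruction operating on a 256-bit or 512-bit register.
template <typename T, int N>
struct Vectorized {
  T vals[N];
  static constexpr int size() { return N; }

  Vectorized operator+(const Vectorized& other) const {
    Vectorized out;
    for (int i = 0; i < N; ++i) out.vals[i] = vals[i] + other.vals[i];
    return out;
  }
};

// With 256-bit registers, 8 floats fit in one Vec (8 * 32 bits = 256 bits).
using Vec256f = Vectorized<float, 8>;
```

Code written against this interface stays the same across ISAs; only the specialization behind it changes, which is what makes adding an SVE backend a matter of providing new sub-classes.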
This PR:
Our goal with this contribution is to add an SVE backend for Vec in the ATen CPU vectorization, which benefits any ARM CPU that supports SVE.
More about SVE ISA for ARM: https://developer.arm.com/Architectures/Scalable Vector Extensions
We are using the ARM C Language Extensions for SVE (https://developer.arm.com/documentation/102699/0100/Optimizing-with-intrinsics ) to accelerate performance for various operators in the SVE backend for Vec.
Currently we are adding support only for the SVE ISA with a vector length of 256 bits (SVE 256). In the future, we plan to extend this SVE support to other vector lengths as well.
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @malfet @snadampal @milpuz01 @albanD