hashes: add SHA256 ARM hardware acceleration#5493
apoelstra merged 2 commits into rust-bitcoin:master
Conversation
rust-bitcoin#1962 added SIMD SHA256 intrinsics for x86 machines. However, for ARM machines we're still falling back to software_process_block(), which is ~4x slower. This adds support for the ARM SHA2 crypto extensions. The code is inspired by https://github.com/noloader/SHA-Intrinsics/blob/4e754bec921a9f281b69bd681ca0065763aa911c/sha256-arm.c; variable names are kept the same for easier review and comparison. I ran benchmarks on an AWS EC2 instance (t4g.small, Neoverse-N1), hashing 65 kB:
- without ARM acceleration: 266.71 µs
- with ARM acceleration: 55.956 µs
That's almost 5x faster for larger blocks.
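As a rough sketch of the dispatch pattern described above (falling back to the portable path unless the ARM SHA2 extension is present), something like the following. All function names here and the trivial "compression" stub are hypothetical, not the PR's actual code; the real ARM path uses the `core::arch::aarch64` SHA2 intrinsics (`vsha256hq_u32`, `vsha256h2q_u32`, `vsha256su0q_u32`, `vsha256su1q_u32`):

```rust
// Hypothetical sketch of runtime dispatch between an ARM SHA2 path and a
// software fallback. The stub below is NOT the real compression function;
// it just mixes the block into the state so the example runs anywhere.

fn software_process_block(state: &mut [u32; 8], block: &[u8; 64]) {
    // Stand-in for the portable compression function.
    for (i, word) in state.iter_mut().enumerate() {
        let b = &block[i * 4..i * 4 + 4];
        *word ^= u32::from_be_bytes([b[0], b[1], b[2], b[3]]);
    }
}

#[cfg(target_arch = "aarch64")]
unsafe fn arm_sha2_process_block(state: &mut [u32; 8], block: &[u8; 64]) {
    // Real code would use the aarch64 SHA2 intrinsics here; placeholder
    // so the sketch stays self-contained.
    software_process_block(state, block);
}

fn process_block(state: &mut [u32; 8], block: &[u8; 64]) {
    #[cfg(target_arch = "aarch64")]
    {
        // Runtime feature detection: only take the intrinsics path when
        // the CPU actually has the SHA2 extension.
        if std::arch::is_aarch64_feature_detected!("sha2") {
            return unsafe { arm_sha2_process_block(state, block) };
        }
    }
    software_process_block(state, block)
}

fn main() {
    let mut state = [0u32; 8];
    process_block(&mut state, &[0xABu8; 64]);
    println!("{:08x}", state[0]); // 0 ^ 0xABABABAB
}
```

On non-ARM hosts the `cfg` block compiles out entirely, so the fallback carries no runtime cost there; on aarch64 the feature check is a cheap runtime branch.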
More optimizations are possible, like implementing 2/4/8-way parallelism to hash multiple inputs at once, which is useful for Merkle tree building. See #1962 (comment) and #1962 (comment) for context.
For what it's worth, I double-checked the benchmark results on my ARM laptop (Pinebook Pro with an RK3399 SoC). My results for your PR were: master: This PR: So it's within the same ballpark of improvement as what you measured. I like it. Also, while I understand that the existing x86 code is the same, is there a reason we duplicate the code for each of the round blocks? It makes sense to avoid a loop due to potential performance problems, but a small inline macro gives us similar brevity with only a slight compile-time cost. This is just a general question, not a criticism.
Cool, I did another test on an M1; the results for
We could use macros, but I kept it intentionally unrolled (as mentioned in the PR description) to stay close to Jeffrey Walton's original code and to the implementation in Core, for easier verification and review. I also think using a macro would make each round a bit harder to audit, though I don't feel strongly about it if we prefer to change it.
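For illustration, the macro alternative being discussed could look roughly like this. The round body here is a dummy mixing step (a wrapping add of the round constant), not the real SHA256 round, which would pair the SHA intrinsics:

```rust
// Illustrative only: a round macro that keeps the code unrolled after
// expansion while avoiding repeated boilerplate. The body is a dummy
// mixing step, not a real SHA256 round.
macro_rules! sha_round {
    ($state:ident, $k:expr) => {
        // A real round would invoke the SHA intrinsics here; this just
        // folds the round constant into the state.
        $state = $state.wrapping_add($k);
    };
}

fn main() {
    let mut state: u32 = 1;
    // Each invocation expands in place, so no loop is emitted and the
    // rounds remain individually auditable in the expanded code.
    sha_round!(state, 0x428a2f98);
    sha_round!(state, 0x71374491);
    sha_round!(state, 0xb5c0fbcf);
    println!("{:08x}", state); // prints "69826ff9"
}
```

The trade-off named in the thread is visible here: the macro buys brevity and removes copy-paste drift, at the cost of one level of indirection when auditing each round against the reference C code.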
FYI I'm holding off on reviewing this as long as it's a draft.
I converted this to a draft while waiting to hear back from Jeremiah, since I duplicated his PR #4045. In the other PR, Kix suggested checking with Miri as well; I don't think that needs to block this, and we can do it as a follow-up.
let (mut state0, mut state1);
let (abcd_save, efgh_save);
let (mut msg0, mut msg1, mut msg2, mut msg3);
let (mut tmp0, mut tmp1, mut tmp2);
Suggested change: add a comment above the declarations:

// Variable names are also kept the same as in the original C code for easier comparison.
let (mut state0, mut state1);
let (abcd_save, efgh_save);
let (mut msg0, mut msg1, mut msg2, mut msg3);
let (mut tmp0, mut tmp1, mut tmp2);
Is this true? I just reviewed by looking at process_block_simd_x86_intrinsics for comparison.
I intentionally didn't add this comment, because I changed the variable names abef_save/cdgh_save to abcd_save/efgh_save. Core has these variable names wrong because they copied the original C code from Jeffrey, which is also incorrect for ARM:
- ARM SHA256 intrinsics use abcd/efgh (alphabetical order) for the state variables (see the documentation).
- x86 uses abef/cdgh because the SHA-NI instructions store the state variables in that order (for optimization reasons).
lol, exactly like you said in the PR description. You must love my reviews ...
Thanks for your patience and efforts man.
#1962 added SIMD SHA256 intrinsics for x86 machines. However, for ARM machines we're still falling back to software_process_block(), which is ~4x slower according to benchmarks I ran on my system. The code is inspired by https://github.com/noloader/SHA-Intrinsics/tree/4e754bec921a9f281b69bd681ca0065763aa911c. Variable names are intentionally kept the same for easier review and comparison, although I fixed some incorrect variable names in the original implementation (more details in noloader/SHA-Intrinsics#16).
These are some benchmarks I ran on an AWS EC2 instance (t4g.small) with a Neoverse-N1 CPU:
- without ARM acceleration: 266.71 µs
- with ARM acceleration: 55.956 µs
That's almost 5x faster for larger blocks.