hashes: add SHA256 ARM hardware acceleration#5493
apoelstra merged 2 commits into rust-bitcoin:master
Conversation
rust-bitcoin#1962 added SIMD SHA256 intrinsics for x86 machines. However, for ARM machines we're still falling back to software_process_block(), which is ~4x slower. This adds support for the ARM SHA2 crypto extensions. The code is inspired by https://github.com/noloader/SHA-Intrinsics/blob/4e754bec921a9f281b69bd681ca0065763aa911c/sha256-arm.c; variable names are kept the same for easier review and comparison. I ran benchmarks on an AWS EC2 instance (t4g.small, Neoverse-N1), hashing 65 kB:
- without ARM acceleration: 266.71 µs
- with ARM acceleration: 55.956 µs
That's almost 5x faster for larger blocks.
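As a rough sketch of the dispatch pattern described above (falling back to the portable path unless the ARM SHA2 extension is present), something like the following. All function names here and the trivial "compression" stub are hypothetical, not the PR's actual code; the real ARM path uses the `core::arch::aarch64` SHA2 intrinsics (`vsha256hq_u32`, `vsha256h2q_u32`, `vsha256su0q_u32`, `vsha256su1q_u32`):

```rust
// Hypothetical sketch of runtime dispatch between an ARM SHA2 path and a
// software fallback. The stub below is NOT the real compression function;
// it just mixes the block into the state so the example runs anywhere.

fn software_process_block(state: &mut [u32; 8], block: &[u8; 64]) {
    // Stand-in for the portable compression function.
    for (i, word) in state.iter_mut().enumerate() {
        let b = &block[i * 4..i * 4 + 4];
        *word ^= u32::from_be_bytes([b[0], b[1], b[2], b[3]]);
    }
}

#[cfg(target_arch = "aarch64")]
unsafe fn arm_sha2_process_block(state: &mut [u32; 8], block: &[u8; 64]) {
    // Real code would use the aarch64 SHA2 intrinsics here; placeholder
    // so the sketch stays self-contained.
    software_process_block(state, block);
}

fn process_block(state: &mut [u32; 8], block: &[u8; 64]) {
    #[cfg(target_arch = "aarch64")]
    {
        // Runtime feature detection: only take the intrinsics path when
        // the CPU actually has the SHA2 extension.
        if std::arch::is_aarch64_feature_detected!("sha2") {
            return unsafe { arm_sha2_process_block(state, block) };
        }
    }
    software_process_block(state, block)
}

fn main() {
    let mut state = [0u32; 8];
    process_block(&mut state, &[0xABu8; 64]);
    println!("{:08x}", state[0]); // 0 ^ 0xABABABAB
}
```

On non-ARM hosts the `cfg` block compiles out entirely, so the fallback carries no runtime cost there; on aarch64 the feature check is a cheap runtime branch.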
More optimizations are possible, like implementing 2/4/8-way parallelism to hash multiple inputs at once, which is useful for Merkle tree building. See #1962 (comment) and #1962 (comment) for context.
For what it's worth, I double-checked the benchmark results on my ARM laptop (Pinebook Pro with an RK3399 SoC). My results for your PR were: master: This PR: So it's within the same ballpark of improvement as what you measured. I like it. Also, while I understand that the existing x86 code is the same, is there a reason we duplicate the code for each of the round blocks? It makes sense to avoid a loop due to potential performance problems, but a small inline macro gives us similar brevity with only a slight compile-time cost. This is just a general question, not a criticism.
Cool, I did another test on an M1; the results for
We could use macros, but I kept it intentionally unrolled (as mentioned in the PR description) to stay close to Jeffrey Walton's original code and to the implementation in Core, for easier verification and review. I also think using a macro would make each round a bit harder to audit, though I don't feel strongly about it if we prefer to change it.
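For illustration, the macro alternative being discussed could look roughly like this. The round body here is a dummy mixing step (a wrapping add of the round constant), not the real SHA256 round, which would pair the SHA intrinsics:

```rust
// Illustrative only: a round macro that keeps the code unrolled after
// expansion while avoiding repeated boilerplate. The body is a dummy
// mixing step, not a real SHA256 round.
macro_rules! sha_round {
    ($state:ident, $k:expr) => {
        // A real round would invoke the SHA intrinsics here; this just
        // folds the round constant into the state.
        $state = $state.wrapping_add($k);
    };
}

fn main() {
    let mut state: u32 = 1;
    // Each invocation expands in place, so no loop is emitted and the
    // rounds remain individually auditable in the expanded code.
    sha_round!(state, 0x428a2f98);
    sha_round!(state, 0x71374491);
    sha_round!(state, 0xb5c0fbcf);
    println!("{:08x}", state); // prints "69826ff9"
}
```

The trade-off named in the thread is visible here: the macro buys brevity and removes copy-paste drift, at the cost of one level of indirection when auditing each round against the reference C code.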
FYI I'm holding off on reviewing this as long as it's a draft.
I converted this to a draft while waiting to hear back from Jeremiah, since I duplicated his PR #4045. In the other PR, Kix suggested checking with Miri as well; I don't think that needs to block this, and we can do it as a follow-up.
let (mut state0, mut state1);
let (abcd_save, efgh_save);
let (mut msg0, mut msg1, mut msg2, mut msg3);
let (mut tmp0, mut tmp1, mut tmp2);
Suggested change: add a comment above the declarations:

// Variable names are also kept the same as in the original C code for easier comparison.
let (mut state0, mut state1);
let (abcd_save, efgh_save);
let (mut msg0, mut msg1, mut msg2, mut msg3);
let (mut tmp0, mut tmp1, mut tmp2);
Is this true? I just reviewed by looking at process_block_simd_x86_intrinsics for comparison.
I intentionally didn't add this comment, because I changed the variable names abef_save/cdgh_save to abcd_save/efgh_save. Core has these variable names wrong because they copied the original C code from Jeffrey, which is also incorrect for ARM:
- ARM SHA256 intrinsics use abcd/efgh (alphabetical order) for the state variables (see the documentation).
- x86 uses abef/cdgh because the SHA-NI instructions store the state variables in that order (for optimization reasons).
lol, exactly like you said in the PR description. You must love my reviews ...
Thanks for your patience and efforts man.
#1962 added SIMD SHA256 intrinsics for x86 machines. However, for ARM machines we're still falling back to software_process_block(), which is ~4x slower according to benchmarks I ran on my system. The code is inspired by https://github.com/noloader/SHA-Intrinsics/tree/4e754bec921a9f281b69bd681ca0065763aa911c. Variable names are intentionally kept the same for easier review and comparison, although I fixed some incorrect variable names in the original implementation (more details in noloader/SHA-Intrinsics#16).
These are some benchmarks I ran on an AWS EC2 instance (t4g.small) with a Neoverse-N1 CPU:
- without ARM acceleration: 266.71 µs
- with ARM acceleration: 55.956 µs
That's almost 5x faster for larger blocks.