hashes: add SHA256 ARM hardware acceleration#5493

Merged
apoelstra merged 2 commits intorust-bitcoin:masterfrom
jrakibi:08-09-sha256-arm
Jan 29, 2026
Conversation

@jrakibi
Contributor

@jrakibi jrakibi commented Jan 8, 2026

#1962 adds SIMD SHA256 intrinsics for x86 machines. However, for ARM machines we’re still falling back to software_process_block(), which is ~4x slower according to benchmarks I ran on my system.

The code is inspired by https://github.com/noloader/SHA-Intrinsics/tree/4e754bec921a9f281b69bd681ca0065763aa911c. Variable names are intentionally kept the same for easier review and comparison, although I fixed some incorrect variable names in the original implementation (more details in noloader/SHA-Intrinsics#16).
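As a rough illustration of the fallback described above, here is a minimal sketch of how a build might choose between the ARM SHA2 path and the software path. The `Backend` enum and the function names are hypothetical and are not taken from this PR; the only real API used is the standard library's `is_aarch64_feature_detected!` macro, and whether the PR detects the feature at runtime or at compile time is not stated here.

```rust
/// Which SHA-256 block function would be picked (hypothetical enum, for illustration only).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Backend {
    ArmSha2,
    Software,
}

/// Returns true when the ARM SHA2 crypto extension is reported by the CPU.
#[cfg(target_arch = "aarch64")]
fn arm_sha2_available() -> bool {
    std::arch::is_aarch64_feature_detected!("sha2")
}

/// On every other architecture the ARM path can never be taken.
#[cfg(not(target_arch = "aarch64"))]
fn arm_sha2_available() -> bool {
    false
}

/// Picks the hardware path when available, otherwise the software fallback.
fn select_backend() -> Backend {
    if arm_sha2_available() {
        Backend::ArmSha2
    } else {
        Backend::Software
    }
}

fn main() {
    println!("selected backend: {:?}", select_backend());
}
```

On a machine without the extension (or on a non-aarch64 target) this always reports the software path, which is the situation the PR description refers to for ARM before this change.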

These are some benchmarks I ran on an AWS EC2 instance (t4g.small) with a Neoverse-N1 CPU:

without ARM acceleration

sha256/engine_input/10
  time:   [49.947 ns 49.955 ns 49.965 ns]
  thrpt:  [190.87 MiB/s 190.91 MiB/s 190.94 MiB/s]

sha256/engine_input/1024
  time:   [4.1740 µs 4.1744 µs 4.1747 µs]
  thrpt:  [233.92 MiB/s 233.94 MiB/s 233.96 MiB/s]

sha256/engine_input/65536
  time:   [266.68 µs 266.71 µs 266.75 µs]
  thrpt:  [234.31 MiB/s 234.34 MiB/s 234.36 MiB/s]

with ARM

sha256/engine_input/10
  time:   [16.928 ns 16.930 ns 16.931 ns]
  thrpt:  [563.26 MiB/s 563.31 MiB/s 563.36 MiB/s]

sha256/engine_input/1024
  time:   [875.00 ns 875.07 ns 875.14 ns]
  thrpt:  [1.0897 GiB/s 1.0898 GiB/s 1.0899 GiB/s]

sha256/engine_input/65536
  time:   [55.939 µs 55.956 µs 55.979 µs]
  thrpt:  [1.0903 GiB/s 1.0908 GiB/s 1.0911 GiB/s]

That's roughly 4.8x faster for the larger inputs (266.71 µs vs 55.956 µs at 65536 bytes); the 10-byte case is about 3x faster.
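The speedup and throughput figures can be re-derived from the median times reported above. This is only a small arithmetic sketch; the helper names are made up for illustration, and the inputs are the medians copied from the two benchmark runs.

```rust
/// Speedup of `after` over `before`, both given in the same time unit.
fn speedup(before: f64, after: f64) -> f64 {
    before / after
}

/// Throughput in GiB/s for `bytes` processed in `nanos` nanoseconds.
fn gib_per_sec(bytes: u64, nanos: f64) -> f64 {
    (bytes as f64) / (nanos * 1e-9) / (1u64 << 30) as f64
}

fn main() {
    // (input size in bytes, median ns without ARM, median ns with ARM)
    let rows = [
        (10u64, 49.955, 16.930),
        (1024, 4174.4, 875.07),
        (65536, 266_710.0, 55_956.0),
    ];
    for (bytes, before, after) in rows {
        println!(
            "{bytes:>6} B: {:.2}x faster, {:.4} GiB/s with ARM",
            speedup(before, after),
            gib_per_sec(bytes, after)
        );
    }
}
```

The 65536-byte row reproduces the reported throughput of about 1.09 GiB/s, which is a quick sanity check that the time and throughput columns are consistent with each other.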

rust-bitcoin#1962 added SIMD SHA256 intrinsics for x86 machines. However, for
ARM machines we're still falling back to software_process_block(),
which is ~4x slower.

This adds support for ARM SHA2 crypto extensions. The code is
inspired by https://github.com/noloader/SHA-Intrinsics/blob/4e754bec921a9f281b69bd681ca0065763aa911c/sha256-arm.c.
variable names are kept the same for easier review and comparison.

I did benchmarks on an AWS EC2 (t4g.small, Neoverse-N1):

for 64 KiB (65536 bytes):
- without ARM: 266.71 µs
- with ARM: 55.956 µs

That's roughly 4.8x faster for larger inputs.
@jrakibi jrakibi marked this pull request as draft January 8, 2026 00:48
@github-actions github-actions bot added the C-hashes PRs modifying the hashes crate label Jan 8, 2026
@jrakibi
Contributor Author

jrakibi commented Jan 8, 2026

More optimizations are possible, such as 2/4/8-way parallelism to hash multiple inputs at once, which is useful for Merkle tree building. See #1962 (comment) and #1962 (comment) for context.
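To show why multi-way hashing fits Merkle tree building, here is a minimal sketch of computing one tree level in batches. Everything in it is an assumption made for illustration: `LANES`, `next_level`, and the `hash_batch` callback are hypothetical names, and `toy_batch` is a stand-in that is not SHA-256. The point is only that the pairs within a level do not depend on each other, so an N-way kernel could process several of them at once.

```rust
/// Number of independent inputs a hypothetical multi-way kernel takes at once.
const LANES: usize = 4;

/// Builds the next Merkle level. `hash_batch` stands in for an N-way kernel:
/// it receives up to LANES independent 64-byte inputs and returns one digest each.
fn next_level<F>(nodes: &[[u8; 32]], hash_batch: F) -> Vec<[u8; 32]>
where
    F: Fn(&[[u8; 64]]) -> Vec<[u8; 32]>,
{
    // Concatenate each pair; an odd trailing node is paired with itself, as Bitcoin does.
    let pairs: Vec<[u8; 64]> = nodes
        .chunks(2)
        .map(|c| {
            let mut buf = [0u8; 64];
            buf[..32].copy_from_slice(&c[0]);
            buf[32..].copy_from_slice(c.last().unwrap());
            buf
        })
        .collect();
    // No pair depends on another, so they can be fed to the kernel LANES at a time.
    pairs
        .chunks(LANES)
        .flat_map(|batch| hash_batch(batch))
        .collect()
}

/// Toy stand-in digest (NOT SHA-256): XORs the two halves of each input.
fn toy_batch(inputs: &[[u8; 64]]) -> Vec<[u8; 32]> {
    inputs
        .iter()
        .map(|input| {
            let mut out = [0u8; 32];
            for j in 0..32 {
                out[j] = input[j] ^ input[j + 32];
            }
            out
        })
        .collect()
}

fn main() {
    let leaves = [[1u8; 32], [2u8; 32], [3u8; 32]];
    let level = next_level(&leaves, toy_batch);
    // Three leaves produce two parents; the odd leaf was paired with itself.
    assert_eq!(level.len(), 2);
    println!("next level has {} nodes", level.len());
}
```

A real implementation would replace `toy_batch` with a kernel that interleaves several SHA-256 computations; the batching loop itself would stay the same shape.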


I'll wait for feedback on this first to see if there's interest in going further. We could also optimize other hash functions, but SHA-256 is what really matters since it's used most frequently, so I'm not planning to go down that path for other hashes unless there's real interest.

@jrakibi jrakibi marked this pull request as ready for review January 8, 2026 18:14
@mpbagot
Contributor

mpbagot commented Jan 11, 2026

For what it's worth, I double checked the benchmark results on my ARM laptop (Pinebook Pro with RK3399 SoC). My results for your PR were:

master:

test sha256::benches::sha256_10  ... bench:          88.75 ns/iter (+/- 0.42) = 113 MB/s
test sha256::benches::sha256_1k  ... bench:       6,932.80 ns/iter (+/- 40.11) = 147 MB/s
test sha256::benches::sha256_64k ... bench:     442,439.57 ns/iter (+/- 1,227.15) = 148 MB/s

This PR:

test sha256::benches::sha256_10  ... bench:          35.16 ns/iter (+/- 0.12) = 285 MB/s
test sha256::benches::sha256_1k  ... bench:       1,433.02 ns/iter (+/- 4.54) = 714 MB/s
test sha256::benches::sha256_64k ... bench:      90,760.13 ns/iter (+/- 94.94) = 722 MB/s

So, it's within the same ballpark of improvement as what you measured. I like it.

Also, while I understand that the existing x86 is the same, is there a reason that we duplicate the code for each of the round blocks? It makes sense to avoid a loop due to potential performance problems, but a small inline macro gives us similar brevity with only a slight compile time cost. This is just a general question, not a criticism.

@jrakibi
Contributor Author

jrakibi commented Jan 14, 2026

So, it's within the same ballpark of improvement as what you measured. I like it.

Cool. I ran another test on an M1: sha256/engine_input/65536 took 196.78 µs on master and 32.126 µs with the PR (~6x faster).

is there a reason that we duplicate the code ... a small inline macro gives us similar brevity with only a slight compile time cost.

We can use macros. I intentionally kept it unrolled (as mentioned in the PR description) to stay close to Jeffrey Walton's original code and to the implementation in Core, for easier verification and review. I also think a macro would make each round a bit harder to audit, though I don't feel strongly about it if we'd prefer to change it.
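For readers weighing the two options, here is a minimal sketch of what "unrolled by hand" versus "unrolled by a small macro" looks like. The `mix` function is a toy step and is not a SHA-256 round, and the `rounds!` macro is a hypothetical helper that does not appear in the PR. Both forms expand to the same straight-line code with no loop.

```rust
/// Toy mixing step standing in for one group of rounds (NOT a SHA-256 round).
#[inline(always)]
fn mix(state: u32, k: u32) -> u32 {
    state.wrapping_add(k).rotate_left(7)
}

/// Hypothetical helper macro: expands to one `mix` call per constant at compile time.
macro_rules! rounds {
    ($state:ident, $($k:expr),+ $(,)?) => {
        $( $state = mix($state, $k); )+
    };
}

/// Every step written out explicitly, as the PR does for the real rounds.
fn unrolled_by_hand(mut s: u32) -> u32 {
    s = mix(s, 0x428a2f98);
    s = mix(s, 0x71374491);
    s = mix(s, 0xb5c0fbcf);
    s = mix(s, 0xe9b5dba5);
    s
}

/// The same four steps generated by the macro.
fn unrolled_by_macro(mut s: u32) -> u32 {
    rounds!(s, 0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5);
    s
}

fn main() {
    assert_eq!(unrolled_by_hand(1), unrolled_by_macro(1));
    println!("both forms agree");
}
```

The trade-off discussed in the thread is visible here: the macro version is shorter, while the hand-written version can be compared line by line against the C reference.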

@apoelstra
Member

FYI I'm holding off on reviewing this as long as it's a draft.

@github-actions github-actions bot added the ci label Jan 22, 2026
@jrakibi jrakibi marked this pull request as ready for review January 22, 2026 10:05
@jrakibi
Contributor Author

jrakibi commented Jan 22, 2026

FYI I'm holding off on reviewing this as long as it's a draft.

I converted this to draft while waiting to hear back from Jeremiah, since I duplicated his PR #4045.
It's now ready for review. I also added aarch64 cross testing to CI.

In the other PR, Kix suggested checking with Miri as well. I don't think that needs to block this; we can do it as a follow-up.

Member

@apoelstra apoelstra left a comment

ACK baaab03; successfully ran local tests; though I do not have an aarch64 machine. I reviewed the code to the extent of checking that it looks like a hash function implementation

Comment on lines +567 to +571
let (mut state0, mut state1);
let (abcd_save, efgh_save);

let (mut msg0, mut msg1, mut msg2, mut msg3);
let (mut tmp0, mut tmp1, mut tmp2);
Member

Suggested change
let (mut state0, mut state1);
let (abcd_save, efgh_save);
let (mut msg0, mut msg1, mut msg2, mut msg3);
let (mut tmp0, mut tmp1, mut tmp2);
// Variable names are also kept the same as in the original C code for easier comparison.
let (mut state0, mut state1);
let (abcd_save, efgh_save);
let (mut msg0, mut msg1, mut msg2, mut msg3);
let (mut tmp0, mut tmp1, mut tmp2);

Is this true? I just reviewed by looking at process_block_simd_x86_intrinsics for comparison.

Contributor Author

@jrakibi jrakibi Jan 29, 2026

I intentionally left this comment out because I changed the variable names abef_save/cdgh_save to abcd_save/efgh_save. Core has these variable names wrong because they copied the original C code from Jeffrey, which is also incorrect for ARM.


  • ARM SHA256 intrinsics use abcd/efgh (alphabetical order) for the state variables. (see Documentation)

  • x86 uses abef/cdgh because SHA-NI instructions store state variables in that order (for optimization reasons)
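The naming difference in the two bullets above can be made concrete with a small sketch that splits the eight SHA-256 state words into two 4-word groups each way. This models only which words share a vector register, as described in the comment; the function names are hypothetical, plain arrays stand in for SIMD registers, and the lane order inside each register is not modeled.

```rust
/// SHA-256 working state in alphabetical order: [a, b, c, d, e, f, g, h].
type State = [u32; 8];

/// ARM-style grouping: one vector holds (a, b, c, d), the other (e, f, g, h).
fn split_abcd_efgh(s: &State) -> ([u32; 4], [u32; 4]) {
    ([s[0], s[1], s[2], s[3]], [s[4], s[5], s[6], s[7]])
}

/// x86 SHA-NI-style grouping: one vector holds (a, b, e, f), the other (c, d, g, h).
/// Only the grouping is shown; lane order inside each register is ignored.
fn split_abef_cdgh(s: &State) -> ([u32; 4], [u32; 4]) {
    ([s[0], s[1], s[4], s[5]], [s[2], s[3], s[6], s[7]])
}

fn main() {
    // SHA-256 initial hash values H0..H7 from FIPS 180-4.
    let iv: State = [
        0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
        0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19,
    ];
    let (abcd, efgh) = split_abcd_efgh(&iv);
    let (abef, cdgh) = split_abef_cdgh(&iv);
    println!("ARM: abcd={abcd:08x?} efgh={efgh:08x?}");
    println!("x86: abef={abef:08x?} cdgh={cdgh:08x?}");
}
```

With the ARM grouping, names like abcd_save/efgh_save describe what the saved vectors actually contain, which is the reason given for renaming them.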

Member

lol, exactly like you said in the PR description. You must love my reviews ...

Member

Thanks for your patience and efforts man.

Member

@tcharding tcharding left a comment

code review ACK baaab03 - looks ok when compared to other code in the file. The tests passing speaks for the correctness AFAIU. No further understanding implied and no local testing done by me.

@apoelstra apoelstra merged commit 3ebcd5a into rust-bitcoin:master Jan 29, 2026
29 checks passed
@jrakibi jrakibi mentioned this pull request Feb 17, 2026
4 participants