Add simd sha256 intrinsics for x86 machines by sanket1729 · Pull Request #1962 · rust-bitcoin/rust-bitcoin

sanket1729 · 2023-07-27T10:20:31Z

This is my first time dabbling into architecture specific code and simd. The algorithm is a word to word translation of the C code from https://github.com/noloader/SHA-Intrinsics/blob/4899efc81d1af159c1fd955936c673139f35aea9/sha256-x86.c .

Some benchmarks:

With simd

test sha256::benches::sha256_10                   ... bench:          11 ns/iter (+/- 0) = 909 MB/s
test sha256::benches::sha256_1k                   ... bench:         712 ns/iter (+/- 2) = 1438 MB/s
test sha256::benches::sha256_64k                  ... bench:      45,597 ns/iter (+/- 189) = 1437 MB/s

Without simd

test sha256::benches::sha256_10                   ... bench:          47 ns/iter (+/- 0) = 212 MB/s
test sha256::benches::sha256_1k                   ... bench:       4,243 ns/iter (+/- 17) = 241 MB/s
test sha256::benches::sha256_64k                  ... bench:     271,263 ns/iter (+/- 1,610) = 241 MB/s

RCasatta · 2023-07-27T13:49:46Z

Thanks for this!

I confirm I have similar benchmark improvements on my machine.

The error in CI is just formatting, a cargo +nightly fmt is needed.

Any rationale on choosing the code to base the implementation on over using the code in bitcoin core or in sha2 crate?

Is having some fuzzing between hardware and software implementation a good idea?

apoelstra · 2023-07-27T18:27:08Z

@RCasatta the cited code is public domain. If we took from Bitcoin Core we'd have to ask the author to let us relicense under CC1.0 (which would probably be fine, I assume it's sipa or bluematt, but it's still extra work).

apoelstra

ACK 957d16223b4fd9982cad0114a8dc614ba58338b1

sanket1729 · 2023-07-27T18:33:38Z

Any rationale on choosing the code to base the implementation on over using the code in bitcoin core or in sha2 crate?

No strong reason. Bitcoin core code is also based on the same implementation. The sha256.cpp has the following note on top of it.

// Based on https://github.com/noloader/SHA-Intrinsics/blob/master/sha256-x86.c,
// Written and placed in public domain by Jeffrey Walton.
// Based on code from Intel, and by Sean Gulley for the miTLS project.

As for sha2 crate, I was not sure if the we would need permission to relicense it based on CC0. I am sure we can jut copy and they would be okay, with it. But this one just seemed simpler. I don't mind switching to another implementation if that makes things easier.

sanket1729 · 2023-07-27T18:35:34Z

Is having some fuzzing between hardware and software implementation a good idea?

As for fuzzing, I don't think it would be required, because this code either works or fails completely. There are no data dependant branches. Maybe, @apoelstra has thoughts here.

apoelstra · 2023-07-27T18:42:16Z

I agree with that. I think in the past we've had fuzztests that check compatibility between our crate and the RustCrypto one, and it (unsurprisingly) never ever found anything.

yancyribbens · 2023-07-28T14:38:36Z

I notice the C implementation doesn't create a new function call for each block, although I suppose creating 64 new stack frames probably doesn't impact performance.

apoelstra · 2023-07-28T14:50:44Z

A private function that is only called once is pretty-much guaranteed to be inlined.

sanket1729 · 2023-07-28T20:01:44Z

I notice the C implementation doesn't create a new function call for each block, although I suppose creating 64 new stack frames probably doesn't impact performance.

Yeah, I didn't want to touch other parts of the code or change the architecture of the code by moving the loop into process_block. If it turns to have a performance impact we can move the loop inside the function in a separate PR. I would like keep this PR only related to substituting the equivalent logic to simd.

TheBlueMatt · 2023-07-28T20:09:23Z

Optimizing just sha is nice, but there's also a lot to be gained interleaving multiple sha256 operations at least in merkle tree building - we are very often taking 2/4/8/etc neighboring blocks of 64 bytes and hashing each block at once. IIRC in Bitcoin Core pre-sha-ni it made a huge difference, though I'm not sure what the improvement was interleaving post-sha-ni. No reason to hold this PR up on that or anything, just noting it in case you're excited to do more :)

apoelstra

ACK 546c012

sanket1729 · 2023-08-02T00:32:28Z

we are very often taking 2/4/8/etc neighboring blocks of 64 bytes and hashing each block at once. IIRC in Bitcoin Core pre-sha-ni it made a huge difference

I did some superficial simd instruction counting for 2/4/8 blocks interleaving methods from bitcoin core.
Using Tranform2_sha256d from bitcoin core saves about 28 calls to _mm_sha256msg1_epu32 compared to calling sha256::hash repeatedly. Both of the still have the same number of calls to _mm_sha256rnds2_epu32.

Overall, one sha256d(3 transforms) computation done without transform2 takes about 192 _mm_sha256rnds2_epu32 and 84 _mm_sha256msg1_epu32.

Based on my superficial evaluation, there is still some performance benefit to be gained, but certainly not a significant one.

tcharding

I'd be lying if I said I groked this code, can we have a test (possibly during fuzzing) that hashes with the software implementation and the intrinsics one and asserts they are the same? (Although I forget why we stopped fuzzing against the Rust Crypto's version and if the reason we did that counters my request.)

tcharding · 2023-08-02T01:20:25Z

hashes/src/sha256.rs

+    #[cfg(all(feature = "std", any(target_arch = "x86", target_arch = "x86_64")))]
+    #[target_feature(enable = "sha,sse2,ssse3,sse4.1")]
+    unsafe fn process_block_simd_x86_intrinsics(&mut self) {
+        // Code translated and based on from


Suggested change

// Code translated and based on from

// Code translated from and based on

sanket1729 · 2023-08-02T09:24:10Z

that hashes with the software implementation and the intrinsics one and asserts they are the same?

My expectation is that all the fixed tests fail if the implementation was wrong. They are checked indirectly as the we get the same hash value in all fixed tests with and without simd.
My opinion on fuzzing #1962 (comment).

I wanted to add some tests for this. I don't know how to test this in CI. Add target=native and hope that the CI machine has instructions to test it?

apoelstra · 2023-08-02T13:45:46Z

Although I forget why we stopped fuzzing against the Rust Crypto's version and if the reason we did that counters my request.

If you can get a test vector that passes and a test vector that fails such a test, you'd pretty-much need to break the hash function. I don't think there's any need for testing correctness, beyond the existing fixed test vectors.

tcharding · 2023-08-02T22:04:01Z

Oh yes , my bad, I should have ~~convinced myself of unit test coverage before posting.~~ read the thread before reviewing.

tcharding · 2023-08-02T22:43:06Z

What's the plan from here @sanket1729, are we going to do the other hashes from noloader code too (sha1, sha512)?

sanket1729 · 2023-08-02T23:23:49Z

@tcharding, can do. But I don't think those would be nearly as impactful as sha256 change. Other hash functions are just not used that frequently in bitcoin. sha256 is in used transaction hash, block hash, checksums etc.

I think our efforts are better spent improving the sha256 hash engine. I think that having functions for Transform2, Transform4 and Transform8 might be more useful.
Basically, we should be able to write something like:
https://github.com/bitcoin/bitcoin/blob/2fa60f0b683cefd7956273986dafe3bde00c98fd/src/crypto/sha256.cpp#L731

I don't plan on doing that anytime soon. So it's up for grabs if anyone is interested.

apoelstra · 2023-08-02T23:37:59Z

Fully agreed that we only really need to put work into sha2, and any spare effort that people want to put into this should be focused on sha2.

(I wouldn't refuse work on other hash functions if somebody did it, of course, but if they're making a choice upfront, sha2 is all that really matters.)

tcharding · 2023-08-03T05:09:41Z

Cool, cheers.

TheBlueMatt · 2023-08-04T06:45:13Z

I did some superficial simd instruction counting for 2/4/8 blocks interleaving methods from bitcoin core.

So my point about performance here wasn't about the instruction count, but rather the data inverleaving allowing the CPU to better parallelize internally, which can allow for much higher throughput without fewer instructions.

sanket1729 · 2023-08-04T07:28:34Z

Ah thanks for elaborating. Makes much more sense now.

…

On Thu, Aug 3, 2023, 11:45 PM Matt Corallo ***@***.***> wrote: I did some superficial simd instruction counting for 2/4/8 blocks interleaving methods from bitcoin core. So my point about performance here wasn't about the instruction count, but rather the data inverleaving allowing the CPU to better parallelize internally, which can allow for much higher throughput without fewer instructions. — Reply to this email directly, view it on GitHub <#1962 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABUQEOPPDULLNXLOFCWQM63XTSLAHANCNFSM6AAAAAA2Z3LIJ4> . You are receiving this because you were mentioned.Message ID: ***@***.***>

tcharding · 2023-08-15T20:07:14Z

What's the status of this one @apoelstra? Are you are hoping for another review?

apoelstra · 2023-08-16T15:04:39Z

Oh, I don't think I noticed your ACK @tcharding. Can you re-ack with the commit ID?

tcharding

ACK 546c012

tcharding · 2023-08-17T00:28:03Z

I must have clicked "approve" before somehow, woops. All good now.

rust-bitcoin#1962 added SIMD SHA256 intrinsics for x86 machines. However, for ARM machines we're still falling back to software_process_block(), which is ~4x slower. This adds support for ARM SHA2 crypto extensions. The code is inspired by https://github.com/noloader/SHA-Intrinsics/blob/4e754bec921a9f281b69bd681ca0065763aa911c/sha256-arm.c. Variable names are kept the same for easier review and comparison. I did benchmarks on an AWS EC2 (t4g.small, Neoverse-N1): for 65kb: - without ARM: 266.71 µs - with ARM: 55.956 µs That's almost ~5x faster for larger blocks.

rust-bitcoin#1962 added SIMD SHA256 intrinsics for x86 machines. However, for ARM machines we're still falling back to software_process_block(), which is ~4x slower. This adds support for ARM SHA2 crypto extensions. The code is inspired by https://github.com/noloader/SHA-Intrinsics/blob/4e754bec921a9f281b69bd681ca0065763aa911c/sha256-arm.c. variable names are kept the same for easier review and comparison. I did benchmarks on an AWS EC2 (t4g.small, Neoverse-N1): for 65kb: - without ARM: 266.71 µs - with ARM: 55.956 µs That's almost ~5x faster for larger blocks.

baaab03 add aarch64 cross testing (needed for ARM SHA acceleration) (jrakibi) 2299350 hashes: add SHA256 ARM hardware acceleration (jrakibi) Pull request description: #1962 adds SIMD SHA256 intrinsics for x86 machines. However, for ARM machines we’re still falling back to `software_process_block()`, which is ~4x slower according to benchmarks I ran on my system. The code is inspired by https://github.com/noloader/SHA-Intrinsics/tree/4e754bec921a9f281b69bd681ca0065763aa911c. Variable names are intentionally kept the same for easier review and comparison, although I fixed some incorrect variable names in the original implementation (more details in noloader/SHA-Intrinsics#16). these are some benchmarks I ran on an AWS EC2 instance (t4g.small) with a Neoverse-N1 CPU: without ARM acceleration ``` sha256/engine_input/10 time: [49.947 ns 49.955 ns 49.965 ns] thrpt: [190.87 MiB/s 190.91 MiB/s 190.94 MiB/s] sha256/engine_input/1024 time: [4.1740 µs 4.1744 µs 4.1747 µs] thrpt: [233.92 MiB/s 233.94 MiB/s 233.96 MiB/s] sha256/engine_input/65536 time: [266.68 µs 266.71 µs 266.75 µs] thrpt: [234.31 MiB/s 234.34 MiB/s 234.36 MiB/s] ``` with ARM ``` sha256/engine_input/10 time: [16.928 ns 16.930 ns 16.931 ns] thrpt: [563.26 MiB/s 563.31 MiB/s 563.36 MiB/s] sha256/engine_input/1024 time: [875.00 ns 875.07 ns 875.14 ns] thrpt: [1.0897 GiB/s 1.0898 GiB/s 1.0899 GiB/s] sha256/engine_input/65536 time: [55.939 µs 55.956 µs 55.979 µs] thrpt: [1.0903 GiB/s 1.0908 GiB/s 1.0911 GiB/s] ``` that’s almost ~5x faster for larger blocks ACKs for top commit: apoelstra: ACK baaab03; successfully ran local tests; though I do not have an aarch64 machine. I reviewed the code to the extent of checking that it looks like a hash function implementation tcharding: code review ACK baaab03 - looks ok when compared to other code in the file. The tests passing speaks for the correctness AFAIU. No further understanding implied and no local testing done by me. Tree-SHA512: ec5e54dfa92991727ebae80b42e4e9e96be55db17c1288587e548352c3b4e01016f2102accf5b766bcf5b088d4d85621d9d53f19d678b9c477c4ac72e9bc8249

This was referenced Jul 27, 2023

Accellerate hashes using SIMD or SHA-NI #1424

Closed

Remove sha2 dep once bitcoin_hashes provide hardware accelerated sha256 RCasatta/bitcoin_slices#28

Open

sanket1729 force-pushed the test branch from eb5b5b0 to 957d162 Compare July 27, 2023 18:24

apoelstra previously approved these changes Jul 27, 2023

View reviewed changes

Add simd sha256 intrinsics for x86 machines

546c012

sanket1729 dismissed apoelstra’s stale review via 546c012 July 27, 2023 18:36

sanket1729 force-pushed the test branch from 957d162 to 546c012 Compare July 27, 2023 18:36

apoelstra approved these changes Jul 31, 2023

View reviewed changes

tcharding approved these changes Aug 2, 2023

View reviewed changes

tcharding approved these changes Aug 17, 2023

View reviewed changes

tcharding mentioned this pull request Aug 17, 2023

hashes: Release v0.13.0 #1917

Merged

apoelstra merged commit af85528 into rust-bitcoin:master Aug 17, 2023

sanket1729 mentioned this pull request Jan 6, 2024

Cheap but slow program BlockstreamResearch/simplicity#209

Closed

tcharding mentioned this pull request Aug 22, 2024

Txid and BlockHash should expose the preimage #1300

Open

jrakibi mentioned this pull request Jan 8, 2026

hashes: add SHA256 ARM hardware acceleration #5493

Merged

	// Code translated and based on from
	// Code translated from and based on

Conversation

sanket1729 commented Jul 27, 2023

Uh oh!

RCasatta commented Jul 27, 2023

Uh oh!

apoelstra commented Jul 27, 2023

Uh oh!

apoelstra left a comment

Choose a reason for hiding this comment

Uh oh!

sanket1729 commented Jul 27, 2023

Uh oh!

sanket1729 commented Jul 27, 2023

Uh oh!

apoelstra commented Jul 27, 2023

Uh oh!

yancyribbens commented Jul 28, 2023

Uh oh!

apoelstra commented Jul 28, 2023

Uh oh!

sanket1729 commented Jul 28, 2023

Uh oh!

TheBlueMatt commented Jul 28, 2023

Uh oh!

apoelstra left a comment

Choose a reason for hiding this comment

Uh oh!

sanket1729 commented Aug 2, 2023

Uh oh!

tcharding left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tcharding Aug 2, 2023

Choose a reason for hiding this comment

Uh oh!

sanket1729 commented Aug 2, 2023

Uh oh!

apoelstra commented Aug 2, 2023

Uh oh!

tcharding commented Aug 2, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tcharding commented Aug 2, 2023

Uh oh!

sanket1729 commented Aug 2, 2023

Uh oh!

apoelstra commented Aug 2, 2023

Uh oh!

tcharding commented Aug 3, 2023

Uh oh!

TheBlueMatt commented Aug 4, 2023

Uh oh!

sanket1729 commented Aug 4, 2023 via email

Uh oh!

tcharding commented Aug 15, 2023

Uh oh!

apoelstra commented Aug 16, 2023

Uh oh!

tcharding left a comment

Choose a reason for hiding this comment

Uh oh!

tcharding commented Aug 17, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

tcharding left a comment •

edited

Loading

tcharding commented Aug 2, 2023 •

edited

Loading