hashes: Add optimized ARM SHA256d for 64-byte inputs #5888
apoelstra merged 5 commits into rust-bitcoin:master on Mar 29, 2026
Add the three transforms for `sha256(sha256(64_bytes))`.

Transforms 1 & 2 compute the inner `sha256(64_bytes)`:
- Transform 1: the 64 bytes already fill the first block (no padding needed), so we apply the 64 rounds of compression normally. After T1: state = state + initial_state.
- Transform 2: the second block is 64 bytes of padding, so its message schedule is constant and known in advance. We precompute W[i]+K[i] into MIDS and apply the 64 rounds with no schedule expansion. After T2: state = state + saved_T1_state (state now contains sha256(64_bytes)).
- Transform 3 computes sha256(output_of_T1_and_T2). That output is 32 bytes, so the block contains 32 bytes of data and the rest is known padding. We precompute the message schedule for the known words (w8-w15) into FINS. After T3: state = state + initial_state (state now contains the sha256d result).
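The three transforms above can be sketched in plain scalar Rust. This is only an illustration of the block layout, assuming nothing about the PR's actual code: the names `compress`, `padded_block`, `to_bytes` and `sha256d_64` are hypothetical, and the sketch recomputes the message schedule every time instead of using precomputed tables such as MIDS and FINS.

```rust
// Scalar sketch only: no precomputed schedules, no ARM intrinsics.
const K: [u32; 64] = [
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
    0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
    0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
    0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
    0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
    0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
];
const IV: [u32; 8] = [
    0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a, 0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19,
];

/// One SHA-256 compression: schedule expansion, 64 rounds, then state = state + previous state.
fn compress(state: &mut [u32; 8], block: &[u8; 64]) {
    let mut w = [0u32; 64];
    for i in 0..16 {
        w[i] = u32::from_be_bytes([block[4 * i], block[4 * i + 1], block[4 * i + 2], block[4 * i + 3]]);
    }
    for i in 16..64 {
        let s0 = w[i - 15].rotate_right(7) ^ w[i - 15].rotate_right(18) ^ (w[i - 15] >> 3);
        let s1 = w[i - 2].rotate_right(17) ^ w[i - 2].rotate_right(19) ^ (w[i - 2] >> 10);
        w[i] = w[i - 16].wrapping_add(s0).wrapping_add(w[i - 7]).wrapping_add(s1);
    }
    let [mut a, mut b, mut c, mut d, mut e, mut f, mut g, mut h] = *state;
    for i in 0..64 {
        let s1 = e.rotate_right(6) ^ e.rotate_right(11) ^ e.rotate_right(25);
        let ch = (e & f) ^ (!e & g);
        let t1 = h.wrapping_add(s1).wrapping_add(ch).wrapping_add(K[i]).wrapping_add(w[i]);
        let s0 = a.rotate_right(2) ^ a.rotate_right(13) ^ a.rotate_right(22);
        let maj = (a & b) ^ (a & c) ^ (b & c);
        let t2 = s0.wrapping_add(maj);
        h = g; g = f; f = e; e = d.wrapping_add(t1);
        d = c; c = b; b = a; a = t1.wrapping_add(t2);
    }
    for (s, v) in state.iter_mut().zip([a, b, c, d, e, f, g, h]) {
        *s = s.wrapping_add(v);
    }
}

/// Final block of a message: `tail` bytes, 0x80, zeros, then the total bit length (big-endian).
/// Assumes `tail.len() <= 55` so everything fits in one block.
fn padded_block(tail: &[u8], total_len_bytes: u64) -> [u8; 64] {
    let mut block = [0u8; 64];
    block[..tail.len()].copy_from_slice(tail);
    block[tail.len()] = 0x80;
    block[56..].copy_from_slice(&(total_len_bytes * 8).to_be_bytes());
    block
}

/// Serializes the eight state words as a big-endian 32-byte digest.
fn to_bytes(state: &[u32; 8]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for (chunk, word) in out.chunks_exact_mut(4).zip(state) {
        chunk.copy_from_slice(&word.to_be_bytes());
    }
    out
}

/// sha256(sha256(input)) for exactly 64 bytes of input.
fn sha256d_64(input: &[u8; 64]) -> [u8; 32] {
    let mut state = IV;
    compress(&mut state, input); // Transform 1: the data fills the whole block
    compress(&mut state, &padded_block(&[], 64)); // Transform 2: constant padding-only block
    let inner = to_bytes(&state); // sha256(input)
    let mut state = IV;
    compress(&mut state, &padded_block(&inner, 32)); // Transform 3: 32 data bytes + known padding
    to_bytes(&state)
}
```

Because the block passed to Transform 2 never depends on the input, and words 8 to 15 of the block passed to Transform 3 never do either, an optimized version can fold those constants in ahead of time, which is the precomputation the commit message describes.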
ARM `SHA256H`/`SHA256H2` instructions take 4 cycles to produce a result. The next `SHA256H` needs the result of the current one as input, so the CPU has to wait until it is ready (in Transform 2, for example, 3 out of every 4 cycles are spent waiting). We fill those wasted cycles by computing a second independent hash alongside the first. See https://developer.arm.com/documentation/PJDOC-466751330-7215/r4p1/ (Section 3.20) for Cortex-A76 instruction timings. The exact latency may differ on other ARM chips, but the concept is the same.
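The interleaving idea can be shown with a toy dependency chain. The sketch below is an assumption-laden illustration, not the PR's NEON code: `toy_round` is a made-up stand-in for a dependent round (it is not SHA-256), and `chain_1way` / `chain_2way` are hypothetical names. What it shows is that the two lanes never read each other's values, so a round for lane `b` can issue while the result for lane `a` is still in flight.

```rust
/// Toy stand-in for one dependent round (NOT SHA-256): the output feeds the next call.
fn toy_round(x: u32, k: u32) -> u32 {
    x.rotate_left(5).wrapping_add(k) ^ (x >> 3)
}

/// One lane: a strictly serial chain, each round waits for the previous result.
fn chain_1way(mut x: u32, ks: &[u32]) -> u32 {
    for &k in ks {
        x = toy_round(x, k);
    }
    x
}

/// Two lanes interleaved in the same loop. `a` and `b` are independent, so their
/// rounds can overlap in the pipeline instead of leaving latency cycles idle.
fn chain_2way(mut a: u32, mut b: u32, ks: &[u32]) -> (u32, u32) {
    for &k in ks {
        a = toy_round(a, k);
        b = toy_round(b, k);
    }
    (a, b)
}
```

Interleaving changes only the instruction order, not the results: `chain_2way(a, b, ks)` returns the same pair as two separate `chain_1way` calls.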
Add a sha256d dispatcher that currently handles 2-way ARM, with the idea of extending it to 4-way, 8-way, and 2-way x86. We also expose a public API that will be used in Merkle root computation (in a follow-up PR).
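A dispatcher of this kind could look roughly like the sketch below. It is a guess at the shape, not the PR's API: `dispatch` is a hypothetical name, and the two kernels are passed in as closures so the example stays self-contained without any ARM intrinsics.

```rust
/// Hashes every 64-byte input, sending pairs to the 2-way kernel and a
/// leftover input (if any) to the one-at-a-time software kernel.
fn dispatch(
    inputs: &[[u8; 64]],
    two_way: impl Fn(&[u8; 64], &[u8; 64]) -> [[u8; 32]; 2],
    one_way: impl Fn(&[u8; 64]) -> [u8; 32],
) -> Vec<[u8; 32]> {
    let mut out = Vec::with_capacity(inputs.len());
    let mut pairs = inputs.chunks_exact(2);
    for pair in pairs.by_ref() {
        // Both digests come back from a single interleaved call.
        out.extend(two_way(&pair[0], &pair[1]));
    }
    for last in pairs.remainder() {
        // At most one input is left over when the count is odd.
        out.push(one_way(last));
    }
    out
}
```

Wider kernels would presumably slot in ahead of the 2-way loop in the same way: consume as many 8-input or 4-input groups as possible, then fall through to the narrower paths for the rest.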
Test block counts 0 through 32. This lets the test go through all the SIMD dispatch paths we have:
- 1 block: software path
- 2 blocks: 2-way
- 3 blocks: 2-way + last block software
- 4 blocks: 2-way for now (4-way once we add it)
- 8 blocks: 2-way for now (8-way once we add it)
- ... and so on up to 32
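The testing pattern described above, comparing the batched path against a one-at-a-time reference for every count from 0 to 32, might be sketched as follows. Everything here is hypothetical: `reference` is a toy FNV-1a style function standing in for the scalar sha256d path, and `batched` only mimics the pair-plus-tail control flow rather than calling a real SIMD kernel.

```rust
/// Toy stand-in for the scalar path (NOT a real hash function).
fn reference(input: &[u8; 64]) -> u64 {
    input
        .iter()
        .fold(0xcbf29ce484222325u64, |h, &b| (h ^ b as u64).wrapping_mul(0x100000001b3))
}

/// Toy batched entry point: full pairs take the "2-way" branch, a leftover
/// input takes the scalar branch.
fn batched(inputs: &[[u8; 64]]) -> Vec<u64> {
    let mut out = Vec::with_capacity(inputs.len());
    for chunk in inputs.chunks(2) {
        match chunk {
            [a, b] => {
                // "2-way" branch
                out.push(reference(a));
                out.push(reference(b));
            }
            [a] => out.push(reference(a)), // scalar tail
            _ => unreachable!("chunks(2) yields 1 or 2 items"),
        }
    }
    out
}

/// Checks every batch size from 0 to 32 against the one-at-a-time reference.
fn check_all_counts() -> bool {
    (0..=32usize).all(|n| {
        let inputs: Vec<[u8; 64]> = (0..n).map(|i| [i as u8; 64]).collect();
        let expected: Vec<u64> = inputs.iter().map(reference).collect();
        batched(&inputs) == expected
    })
}
```

Sweeping every count, rather than a few hand-picked ones, means that odd counts exercise the scalar tail and that future 4-way and 8-way branches would be covered without changing the test.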
Member
Gonna one-ACK merge; this feels like it's purely in my court.
tcharding added a commit to tcharding/rust-bitcoin that referenced this pull request on Mar 31, 2026
Add attribute to a bunch of stuff recently introduced in rust-bitcoin#5888.
apoelstra added a commit that referenced this pull request on Mar 31, 2026
8ba4bbf Run the formatter (Tobin C. Harding)
3be528f hashes: fmt skip a bunch of stuff (Tobin C. Harding)

Pull request description:
  Add attribute to a bunch of stuff recently introduced in #5888.

ACKs for top commit:
  apoelstra:
    ACK 8ba4bbf; successfully ran local tests

Tree-SHA512: 7895b0428d11a4b70183aa9478ef26f42dafb3aa47bd0f3b96593bb8b3f4fd32af9d9ca1ec1bf410ae713ab2035d2a9e08eaf5dc72729435c850813bcb52f5d6
Based on bitcoin/bitcoin#13191 and bitcoin/bitcoin#24115
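commit 1: Add the three transforms for `sha256(sha256(64_bytes))`.

Transforms 1 & 2 compute the inner `sha256(64_bytes)`:
- Transform 1: the 64 bytes already fill the first block, so we apply the 64 rounds of compression normally.
- Transform 2: the second block is all padding, so its message schedule is known in advance. We precompute `W[i]+K[i]` and apply 64 rounds without schedule expansion.

Transform 3 computes `sha256(output_of_T2)`. The output of `T2` is 32 bytes, so the block contains 32 bytes of data and the rest is known padding. We precompute the message schedule for the known words (w8-w15).

commit 2: Interleave 2 independent hashes in each group of 4 rounds to fill otherwise wasted CPU cycles.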
commit 3: Expose the public API and prepare dispatcher for 4-way/8-way optimization
commit 4: Add tests
A full benchmark for this will be added in a follow-up PR where we will use this for Merkle root computation (for ref, a quick local bench with 1000 leaves shows ~1.54× speedup)
Addresses part of #5540