hashes: Add optimized ARM SHA256d for 64-byte inputs #5888

Merged
apoelstra merged 5 commits into rust-bitcoin:master from jrakibi:sha256d-2way-arm-simd
Mar 29, 2026

@jrakibi
Contributor

@jrakibi jrakibi commented Mar 24, 2026

Based on bitcoin/bitcoin#13191 and bitcoin/bitcoin#24115

commit 1: Add the three transforms for sha256(sha256(64_bytes)).
Transforms 1 & 2 compute the inner sha256(64_bytes); Transform 3 computes the outer hash:

  • Transform 1: the 64 input bytes exactly fill the first block (no padding needed), so we apply the 64 compression rounds normally
  • Transform 2: the second block is pure padding for a 64-byte message, so its message schedule is constant and known in advance; we precompute W[i]+K[i] and apply 64 rounds without schedule expansion
  • Transform 3: compute sha256(output_of_T2). The output of T2 is 32 bytes, so the block contains those 32 bytes and the rest is known padding. We precompute the message schedule for the known words (w8-w15)
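
The three transforms can be sketched in portable Rust as follows. This is a minimal software illustration, not the PR's ARM SIMD code: the names `schedule`, `add_k`, `rounds`, `sha256`, and `sha256d_64` are hypothetical, and the `W[i]+K[i]` table for Transform 2 is computed at runtime here, whereas the commit message describes it as a precomputed table.

```rust
const K: [u32; 64] = [
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
    0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
    0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
    0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
    0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
    0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
];
const IV: [u32; 8] = [
    0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a, 0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19,
];

/// Expands a 16-word block into the 64-word message schedule W.
fn schedule(block: &[u32; 16]) -> [u32; 64] {
    let mut w = [0u32; 64];
    w[..16].copy_from_slice(block);
    for i in 16..64 {
        let s0 = w[i - 15].rotate_right(7) ^ w[i - 15].rotate_right(18) ^ (w[i - 15] >> 3);
        let s1 = w[i - 2].rotate_right(17) ^ w[i - 2].rotate_right(19) ^ (w[i - 2] >> 10);
        w[i] = w[i - 16].wrapping_add(s0).wrapping_add(w[i - 7]).wrapping_add(s1);
    }
    w
}

/// Adds the round constants, giving the W[i]+K[i] table the rounds consume.
fn add_k(w: [u32; 64]) -> [u32; 64] {
    core::array::from_fn(|i| w[i].wrapping_add(K[i]))
}

/// Runs 64 rounds over `wk`, then adds the input state back (feed-forward).
fn rounds(state: [u32; 8], wk: &[u32; 64]) -> [u32; 8] {
    let [mut a, mut b, mut c, mut d, mut e, mut f, mut g, mut h] = state;
    for &x in wk.iter() {
        let s1 = e.rotate_right(6) ^ e.rotate_right(11) ^ e.rotate_right(25);
        let t1 = h.wrapping_add(s1).wrapping_add((e & f) ^ (!e & g)).wrapping_add(x);
        let s0 = a.rotate_right(2) ^ a.rotate_right(13) ^ a.rotate_right(22);
        let t2 = s0.wrapping_add((a & b) ^ (a & c) ^ (b & c));
        h = g; g = f; f = e; e = d.wrapping_add(t1);
        d = c; c = b; b = a; a = t1.wrapping_add(t2);
    }
    let out = [a, b, c, d, e, f, g, h];
    core::array::from_fn(|i| state[i].wrapping_add(out[i]))
}

fn to_words(b: &[u8]) -> [u32; 16] {
    core::array::from_fn(|i| u32::from_be_bytes([b[4 * i], b[4 * i + 1], b[4 * i + 2], b[4 * i + 3]]))
}

fn to_bytes(s: &[u32; 8]) -> [u8; 32] {
    core::array::from_fn(|i| s[i / 4].to_be_bytes()[i % 4])
}

/// Generic reference SHA-256 with ordinary padding, used only for comparison.
fn sha256(msg: &[u8]) -> [u8; 32] {
    let mut data = msg.to_vec();
    data.push(0x80);
    while data.len() % 64 != 56 {
        data.push(0);
    }
    data.extend_from_slice(&((msg.len() as u64) * 8).to_be_bytes());
    let mut state = IV;
    for chunk in data.chunks_exact(64) {
        state = rounds(state, &add_k(schedule(&to_words(chunk))));
    }
    to_bytes(&state)
}

/// sha256(sha256(input)) for exactly 64 bytes, as three compression calls.
fn sha256d_64(input: &[u8; 64]) -> [u8; 32] {
    // Transform 1: the input is exactly one block.
    let t1 = rounds(IV, &add_k(schedule(&to_words(input))));
    // Transform 2: constant padding block (0x80, zeros, bit length 512).
    let mut pad = [0u32; 16];
    pad[0] = 0x8000_0000;
    pad[15] = 512;
    let mids = add_k(schedule(&pad)); // input-independent, so it could be a constant table
    let inner = rounds(t1, &mids);
    // Transform 3: 8 digest words, then constant padding (bit length 256) in w8..w15.
    let mut last = [0u32; 16];
    last[..8].copy_from_slice(&inner);
    last[8] = 0x8000_0000;
    last[15] = 256;
    to_bytes(&rounds(IV, &add_k(schedule(&last))))
}
```

For any 64-byte `input`, `sha256d_64(&input)` should equal `sha256(&sha256(&input))`; the specialised version simply skips the padding logic and the schedule expansion that do not depend on the input.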

commit 2: Interleave 2 independent hashes in each group of 4 rounds to fill otherwise wasted CPU cycles.
commit 3: Expose the public API and prepare the dispatcher for 4-way/8-way optimization
commit 4: Add tests

A full benchmark for this will be added in a follow-up PR where we will use this for Merkle root computation (for reference, a quick local bench with 1000 leaves shows a ~1.54× speedup).

Addresses part of #5540

jrakibi added 3 commits March 24, 2026 02:40
Add the three transforms for sha256(sha256(64_bytes))

Transforms 1 & 2 compute the inner `sha256(64_bytes)`:

- Transform 1: 64 bytes already fill the first block (no
padding needed), so we apply 64 rounds of compression
normally.
After T1: state = state + initial_state

- Transform 2: the message schedule at this step is constant
and known in advance (64 bytes of padding), so
we precompute W[i]+K[i] into MIDS and apply 64 rounds (no
schedule expansion needed).
After T2: state = state + saved_T1_state
(state now contains sha256(64_bytes))

- Transform 3 computes sha256(output_of_T1_and_T2). The output
is 32 bytes, so the block contains 32 bytes and the rest is
known padding. We precompute the message
schedule for the known words (w8-w15) into FINS.
After T3: state = state + initial_state
(state now contains the sha256d result)

ARM SHA256H/H2 instructions take 4 cycles to produce a
result. The next `SHA256H` needs the result of the current
one as input, so the CPU has to wait until it is ready
(in Transform 2, for example, 3 out of every 4 cycles are
wasted doing nothing).

We fill those wasted cycles by computing a second independent
hash alongside the first.

See https://developer.arm.com/documentation/PJDOC-466751330-7215/r4p1/
(Section 3.20) for Cortex-A76 instruction timings. The exact
latency may differ on other ARM chips but the concept is the
same.
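
The interleaving idea can be illustrated with a toy model. In this sketch `round_group` is a made-up stand-in for one dependent `SHA256H`-style step (it is not SHA-256 and uses no ARM intrinsics), and `one_way`/`two_way` are hypothetical names; the only point is that the two chains in `two_way` never read each other's state, so a CPU is free to overlap them.

```rust
/// Toy stand-in for one 4-round group: the output feeds the next call,
/// so consecutive calls on the same chain cannot overlap.
fn round_group(state: u32, wk: u32) -> u32 {
    (state ^ wk).rotate_left(7).wrapping_mul(0x9E37_79B1)
}

/// One hash: a single dependency chain, each step waits on the previous one.
fn one_way(mut s: u32, wk: &[u32]) -> u32 {
    for &x in wk {
        s = round_group(s, x);
    }
    s
}

/// Two independent hashes advanced in lockstep, one round group each per
/// iteration. `b`'s step does not depend on `a`'s, so it can be issued while
/// `a`'s result is still in flight.
fn two_way(mut a: u32, mut b: u32, wk_a: &[u32], wk_b: &[u32]) -> (u32, u32) {
    assert_eq!(wk_a.len(), wk_b.len());
    for (&x, &y) in wk_a.iter().zip(wk_b) {
        a = round_group(a, x); // hash A
        b = round_group(b, y); // hash B, independent of A
    }
    (a, b)
}
```

Interleaving changes only the instruction order, not the results: `two_way(a, b, wa, wb)` returns the same pair as calling `one_way` on each chain separately.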
Add sha256d dispatcher that currently handles 2-way ARM,
with the idea to extend to 4-way, 8-way, and 2-way x86.
We also expose a public API that will be used in Merkle
root computation (in a follow-up PR).
@github-actions github-actions bot added the C-hashes PRs modifying the hashes crate label Mar 24, 2026
@jrakibi jrakibi marked this pull request as draft March 24, 2026 05:43
jrakibi added 2 commits March 25, 2026 18:50
Test block counts 0 through 32
This allows our test to go through all the SIMD dispatch paths we have:

- 1 block: software path
- 2 blocks: 2-way
- 3 blocks: 2-way + last block software
- 4 blocks: 2-way for now (4-way once we add it)
- 8 blocks: 2-way for now (8-way once we add it)
- ... and so on up to 32
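
The dispatch paths listed above can be sketched as a small planning function. This is an assumption-laden illustration: `Path` and `plan` are hypothetical names, not the PR's API, and the sketch models only the current behaviour described here (pairs go to the 2-way path, a leftover block goes to the software path).

```rust
/// Which implementation handles a chunk of 64-byte blocks (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Path {
    /// Two blocks hashed together by the 2-way SIMD routine.
    TwoWay,
    /// A single block hashed by the portable software routine.
    Software,
}

/// Returns the sequence of paths taken for `blocks` inputs.
fn plan(blocks: usize) -> Vec<Path> {
    let mut steps = vec![Path::TwoWay; blocks / 2];
    if blocks % 2 == 1 {
        // An odd block count leaves one block for the software path.
        steps.push(Path::Software);
    }
    steps
}
```

Under this model, `plan(3)` yields `[TwoWay, Software]` and `plan(4)` yields `[TwoWay, TwoWay]`; iterating counts 0 through 32 covers the empty, odd, and even cases.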
@jrakibi jrakibi marked this pull request as ready for review March 25, 2026 11:01
Member

@apoelstra apoelstra left a comment


ACK 32220e6; successfully ran local tests; I didn't really review the SIMD code in the first two patches.

@apoelstra
Member

Gonna one-ACK merge; this feels like it's purely in my court.

@apoelstra apoelstra merged commit d3a85f2 into rust-bitcoin:master Mar 29, 2026
28 checks passed
tcharding added a commit to tcharding/rust-bitcoin that referenced this pull request Mar 31, 2026
Add attribute to a bunch of stuff recently introduced in rust-bitcoin#5888.
apoelstra added a commit that referenced this pull request Mar 31, 2026
8ba4bbf Run the formatter (Tobin C. Harding)
3be528f hashes: fmt skip a bunch of stuff (Tobin C. Harding)

Pull request description:

  Add attribute to a bunch of stuff recently introduced in #5888.


ACKs for top commit:
  apoelstra:
    ACK 8ba4bbf; successfully ran local tests


Tree-SHA512: 7895b0428d11a4b70183aa9478ef26f42dafb3aa47bd0f3b96593bb8b3f4fd32af9d9ca1ec1bf410ae713ab2035d2a9e08eaf5dc72729435c850813bcb52f5d6

Labels

C-hashes PRs modifying the hashes crate
