sha256-arm: fix ABEF_SAVE/CDGH_SAVE var names #16

Open
jrakibi wants to merge 1 commit into noloader:master from jrakibi:07-01-fix-sha256-arm-var-names

jrakibi commented Jan 8, 2026

PR #14 renamed ABEF_SAVE/CDGH_SAVE to ABCD_SAVE/EFGH_SAVE but missed the declaration.

As noted in the original commit of #14: ARM keeps state in natural order [A,B,C,D] and [E,F,G,H] unlike x86 SHA-NI which uses [A,B,E,F] and [C,D,G,H]. Doc
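
The difference between the two layouts can be sketched in portable Rust. This is only an illustration of which state words share a vector: the helper names `pack_arm` and `pack_x86_sha_ni` are hypothetical, plain arrays stand in for the SIMD registers, and the exact lane order inside an x86 register is not modelled.

```rust
/// SHA-256 working state as eight 32-bit words in natural order A..H.
type State = [u32; 8];

/// Standard SHA-256 initial hash values (H0..H7), used here as sample data.
const IV: State = [
    0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
    0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19,
];

/// ARM-style grouping: two vectors holding [A,B,C,D] and [E,F,G,H].
/// Hypothetical helper for illustration; not taken from either code base.
fn pack_arm(s: &State) -> ([u32; 4], [u32; 4]) {
    ([s[0], s[1], s[2], s[3]], [s[4], s[5], s[6], s[7]])
}

/// x86 SHA-NI-style grouping: two vectors holding [A,B,E,F] and [C,D,G,H].
/// Only the grouping is modelled, not the lane order within the register.
fn pack_x86_sha_ni(s: &State) -> ([u32; 4], [u32; 4]) {
    ([s[0], s[1], s[4], s[5]], [s[2], s[3], s[6], s[7]])
}
```

With this picture, a save variable that backs up the first ARM vector holds A, B, C, D, which is why the names ABCD_SAVE/EFGH_SAVE describe the ARM code and ABEF_SAVE/CDGH_SAVE do not.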

apoelstra added a commit to rust-bitcoin/rust-bitcoin that referenced this pull request Jan 29, 2026
baaab03 add aarch64 cross testing (needed for ARM SHA acceleration) (jrakibi)
2299350 hashes: add SHA256 ARM hardware acceleration (jrakibi)

Pull request description:

  #1962 adds SIMD SHA256 intrinsics for x86 machines. However, for ARM machines we’re still falling back to `software_process_block()`, which is ~4x slower according to benchmarks I ran on my system.
  
  The code is inspired by https://github.com/noloader/SHA-Intrinsics/tree/4e754bec921a9f281b69bd681ca0065763aa911c. Variable names are intentionally kept the same for easier review and comparison, although I fixed some incorrect variable names in the original implementation (more details in noloader/SHA-Intrinsics#16).
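
A minimal sketch of how a caller might choose between the accelerated path and the software fallback at run time. It assumes a `std` build and uses the standard `is_aarch64_feature_detected!` macro; the `Backend` enum and the function names are hypothetical and do not reflect how the actual change wires up its dispatch.

```rust
/// Which compression routine to use (illustrative names only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    ArmSha2,
    Software,
}

/// On aarch64, ask the CPU at run time whether the SHA-2 extension is present.
#[cfg(target_arch = "aarch64")]
fn arm_sha2_available() -> bool {
    std::arch::is_aarch64_feature_detected!("sha2")
}

/// On every other architecture the ARM path is never available.
#[cfg(not(target_arch = "aarch64"))]
fn arm_sha2_available() -> bool {
    false
}

/// Pick the hardware path when it is usable, otherwise fall back to software.
fn select_backend() -> Backend {
    if arm_sha2_available() {
        Backend::ArmSha2
    } else {
        Backend::Software
    }
}
```

On a machine without the extension, `select_backend()` returns `Backend::Software`, which corresponds to the `software_process_block()` path mentioned above.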
  
  These are some benchmarks I ran on an AWS EC2 instance (t4g.small) with a Neoverse-N1 CPU:
  
  Without ARM acceleration:
  
  ```
  sha256/engine_input/10
    time:   [49.947 ns 49.955 ns 49.965 ns]
    thrpt:  [190.87 MiB/s 190.91 MiB/s 190.94 MiB/s]
  
  sha256/engine_input/1024
    time:   [4.1740 µs 4.1744 µs 4.1747 µs]
    thrpt:  [233.92 MiB/s 233.94 MiB/s 233.96 MiB/s]
  
  sha256/engine_input/65536
    time:   [266.68 µs 266.71 µs 266.75 µs]
    thrpt:  [234.31 MiB/s 234.34 MiB/s 234.36 MiB/s]
  ```
  
  With ARM acceleration:
  ```
  sha256/engine_input/10
    time:   [16.928 ns 16.930 ns 16.931 ns]
    thrpt:  [563.26 MiB/s 563.31 MiB/s 563.36 MiB/s]
  
  sha256/engine_input/1024
    time:   [875.00 ns 875.07 ns 875.14 ns]
    thrpt:  [1.0897 GiB/s 1.0898 GiB/s 1.0899 GiB/s]
  
  sha256/engine_input/65536
    time:   [55.939 µs 55.956 µs 55.979 µs]
    thrpt:  [1.0903 GiB/s 1.0908 GiB/s 1.0911 GiB/s]
  ```
  That’s almost 5x faster for larger inputs (about 4.8x at 1024 and 65536 bytes).
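
The reported figures can be cross-checked with a little arithmetic. The sketch below takes the median times from the two benchmark runs above and derives speed-up and throughput; the constant and function names are hypothetical and exist only for this check.

```rust
/// (input size in bytes, median ns without acceleration, median ns with acceleration),
/// copied from the benchmark output above.
const MEDIANS_NS: [(u64, f64, f64); 3] = [
    (10, 49.955, 16.930),
    (1024, 4_174.4, 875.07),
    (65_536, 266_710.0, 55_956.0),
];

/// Speed-up factor of the accelerated run over the baseline run.
fn speedup(baseline_ns: f64, accelerated_ns: f64) -> f64 {
    baseline_ns / accelerated_ns
}

/// Throughput in bytes per second for `bytes` processed in `nanos` nanoseconds.
fn throughput_bytes_per_sec(bytes: u64, nanos: f64) -> f64 {
    bytes as f64 / (nanos * 1e-9)
}

/// Convert bytes per second to GiB per second (1 GiB = 2^30 bytes).
fn to_gib_per_sec(bytes_per_sec: f64) -> f64 {
    bytes_per_sec / (1u64 << 30) as f64
}
```

Applying `speedup` to each row of `MEDIANS_NS` shows that the gain is smaller for the 10-byte input (roughly 3x) than for the two larger inputs, and `to_gib_per_sec(throughput_bytes_per_sec(65_536, 55_956.0))` agrees with the reported figure of about 1.09 GiB/s.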


ACKs for top commit:
  apoelstra:
    ACK baaab03; successfully ran local tests; though I do not have an aarch64 machine. I reviewed the code to the extent of checking that it looks like a hash function implementation
  tcharding:
    code review ACK baaab03 - looks ok when compared to other code in the file. The tests passing speaks for the correctness AFAIU. No further understanding implied and no local testing done by me.


Tree-SHA512: ec5e54dfa92991727ebae80b42e4e9e96be55db17c1288587e548352c3b4e01016f2102accf5b766bcf5b088d4d85621d9d53f19d678b9c477c4ac72e9bc8249