Conversation

theuni (Member) commented Dec 16, 2025

Exploit modern simd to calculate 2/4/6/8/16 states at a time depending on the size of the input.

This demonstrates a 2x speedup on x86-64 and 3x for arm+neon. Platforms which require runtime detection (avx2/avx512) improve performance even further, and will come as a follow-up.

Rather than hand-writing assembly or using arch-specific intrinsics, this is written using compiler built-ins understood by gcc and clang.

In practice (at least on x86_64 and armv8), the compilers are able to produce assembly that's not much worse than hand-written.
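For illustration, a minimal sketch of the vector-extension style this refers to (vec256 matches the alias used in the PR, but the rotate helper body below is an illustrative guess rather than the PR's exact code):

#include <cstdint>

// Eight uint32_t lanes; gcc/clang lower element-wise operations on this type to
// SSE2/NEON/AVX2 instructions depending on the target flags.
using vec256 = uint32_t __attribute__((__vector_size__(32)));

// Lane-wise rotate-left, the core ChaCha20 operation, written with plain operators.
template <int BITS>
__attribute__((always_inline)) inline void vec_rotl(vec256& v)
{
    v = (v << BITS) | (v >> (32 - BITS));
}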

This means that every architecture can benefit from its own vectorized instructions without having to write/maintain an implementation for each one. But because each will vary in its ability to exploit the parallelism, we allow (via ifdefs) each architecture to opt out of some or all multi-state calculation at compile time.

Here, as a starting point, x86-64 and arm+neon have been defined based on local benchmarks.

Local profiling revealed that chacha20 accounts for a substantial amount of the network thread's time. It's not clear to me if speeding up chacha20 will improve network performance/latency, but it will definitely make it more efficient.

This is part 1 of a series of PRs for chacha20. I think it makes sense to take a look at the generic implementation and tune the architecture-specific defines for parallel blocks before adding the runtime-dependent platforms.

My WIP branch which includes avx2/avx512 can be seen here: https://github.com/theuni/bitcoin/commits/chacha20-vectorized/

I've been hacking on this for quite a while, trying every imaginable tweak and comparing lots of resulting asm/ir. I'm happy to answer any questions about any choices made which aren't immediately obvious.

Edit: some more impl details:

I wrestled with gcc/clang a good bit, tweaking something and comparing the generated code output. A few things I found, which may explain some of the decisions I made:

  • gcc really did not want to inline some of the helpers, which comes at a very substantial performance cost due to register clobbering (and with avx2, vzeroupper). Hence, all helpers are decorated with ALWAYS_INLINE.
  • gcc/clang do well with the vec256 loads/stores with minimal fussing. Though loading each element with [] is clumsy and verbose, it avoids compiler-specific layout assumptions (see the sketch after this list). Other things I tried (which produced the same asm):
    • casting directly to an unaligned alias: using unaligned_vec256 __attribute__((aligned (1))) = vec256
    • memcpy into ^^
    • clang's __builtin_masked_load
  • Loop unrolling was hit-or-miss without #pragma GCC unroll n, and I tried to avoid macros for loops, hence the awkward recursive inline template loops. But in practice, I see those unrolled 100% of the time.
  • I used std::get in the helpers for some extra compile-time safety (this actually pointed out some off-by-ones that would've been annoying to track down)
  • I avoided using any lambdas or classes for fear of compilers missing obvious optimizations
  • All vec256 are passed by reference to avoid an annoying clang warning about the ABI changing when a vector is returned (this is specific to x86 when not compiling with -mavx). Even though our functions are all inlined, I didn't see any harm in making that adjustment.
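For illustration, a minimal sketch of the load style described in the list above (the helper name is made up; the PR's real helpers may differ): the bytes are memcpy'd into a plain array and assigned lane by lane, so nothing is assumed about vec256's layout or alignment, and the result is written through a reference as noted in the last bullet:

#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

using vec256 = uint32_t __attribute__((__vector_size__(32)));

__attribute__((always_inline)) inline void vec_load_example(const unsigned char* p, vec256& out)
{
    std::array<uint32_t, 8> tmp;
    std::memcpy(tmp.data(), p, sizeof(tmp));             // unaligned-safe 32-byte read
    for (std::size_t i = 0; i < 8; ++i) out[i] = tmp[i]; // element-wise, no layout assumptions
}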

DrahtBot (Contributor) commented Dec 16, 2025

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/34083.

Reviews

See the guideline for information on the review process.
A summary of reviews will appear here.

LLM Linter (✨ experimental)

Possible typos and grammar issues:

  • "32bytes" -> "32 bytes" [missing space between number and unit]
  • "which used to increment the states" -> "which are used to increment the states" [missing auxiliary verb "are"]
  • "256bit registers" -> "256‑bit registers" [missing hyphen for the compound adjective; can be written as "256-bit" or "256 bit" for clarity]

2026-01-13

theuni (Member Author) commented Dec 16, 2025

Adding pings for a few people I've discussed this with: @sipa @ajtowns @l0rinc

ajtowns (Contributor) left a comment

Some nits. Approach looks very nice, and at least going by the bench results, gives a good improvement.

# define CHACHA20_VEC_DISABLE_STATES_2
#endif

#include <crypto/chacha20_vec.ipp>
Contributor

Why separate this into an .ipp file if it's only included in a single .cpp file?

Member Author

See branch here: https://github.com/theuni/bitcoin/commits/chacha20-vectorized/

I decided to exclude the impls which require runtime detection from this PR, as I think how that should be done is a separate conversation.

To answer your question more specifically: some impls may require different compilation flags (-mavx2/-mavx512vl), which have to be in their own compilation units.

// (sse2 and neon respectively) are safe to use without runtime detection.

#if defined(__x86_64__) || defined(__amd64__)
# define CHACHA20_VEC_DISABLE_STATES_16
Contributor

Using #if !defined for these seems old school. Why not static constexpr bools, and if constexpr (..)?

Writing:

#if defined(__x86_64__) || defined(__amd64__)
#  define CHACHA20_VEC_DISABLE_STATES_16 true
#  define CHACHA20_VEC_DISABLE_STATES_8  true
#  define CHACHA20_VEC_DISABLE_STATES_6  true
#  define CHACHA20_VEC_DISABLE_STATES_4  false
#  define CHACHA20_VEC_DISABLE_STATES_2  false
#elif defined(__ARM_NEON)
...

static constexpr bool CHACHA20_VEC_ALL_MULTI_STATES_DISABLED =
    CHACHA20_VEC_DISABLE_STATES_16 &&
    CHACHA20_VEC_DISABLE_STATES_8 &&
    CHACHA20_VEC_DISABLE_STATES_6 &&
    CHACHA20_VEC_DISABLE_STATES_4 &&
    CHACHA20_VEC_DISABLE_STATES_2;

 void chacha20_crypt_vectorized(std::span<const std::byte>& in_bytes, std::span<std::byte>& out_bytes, const std::array<uint32_t, 12>& input) noexcept
 {
    if constexpr (CHACHA20_VEC_ALL_MULTI_STATES_DISABLED) return;
...
    if constexpr (!CHACHA20_VEC_DISABLE_STATES_16) {
         while(in_bytes.size() >= CHACHA20_VEC_BLOCKLEN * 16) {
...

seems to work right, and also seems like it avoids needing to put all the potentially unused code in a #if block.

Contributor

Here's a branch that replaces the #defines and abstracts the recursive template stuff a bit more for your perusal https://github.com/ajtowns/bitcoin/commits/202512-pr34083-templates/

Contributor

+1 for if constexpr, should simplify the code a lot (especially if we migrate from recursive templates as well)

Member Author

if constexpr certainly makes more sense where possible. I'll have a look, thanks!


namespace {

using vec256 = uint32_t __attribute__((__vector_size__(32)));
ajtowns (Contributor) commented Dec 17, 2025

static_assert(sizeof(vec256) == 32) ? (Or sizeof(vec256) == 32 || ALL_MULTI_STATES_DISABLED)

l0rinc (Contributor) commented Dec 17, 2025

I haven't used this before but the internets tells me we could use the attribute syntax here, something like:

Suggested change
using vec256 = uint32_t __attribute__((__vector_size__(32)));
template <typename T, size_t N>
using VectorType [[gnu::vector_size(sizeof(T) * N)]] = T;
using vec256 = VectorType<uint32_t, 8>;

Member Author

Because they're compiler-specific, the code makes no assumptions about the size/structure/alignment of vec256. The only accesses are via operator[]. So afaik, there's no need to check for this.

Contributor

Sorry, I meant just so you get an error if you're using a compiler that ignores the __attribute__ entirely but still passes the #if in _base.cpp.

l0rinc (Contributor) commented Dec 17, 2025

Looking forward to reviewing it - quick question before I do: was it tested on any big-endian systems that don't fall back to the non-vectorized path?

the layout to prepare for the second round.
After the second round, they are used (in reverse) to restore the original
layout.
Contributor

My understanding is that you're inverting the logic here: instead of

#define QUARTERROUND(a,b,c,d) \
    X(a,b,d,16) X(c,d,b,12) X(a,b,d,8) X(c,d,b,7)

This does the X(a,b,d,16) all four times via simd, then X(c,d,b,12) all four times, etc. And the simd part relies on the data for those four steps being exactly +4 units apart, which is why the shuffling and unshuffling is necessary for the second round. (Okay, not four times but eight times because each vec256 is two blocks)

Could use a little more explanation in the comment I think? But makes sense to me.
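For illustration, a hedged sketch of the lane rotation being described; the exact lane layout here is an assumption for illustration, not taken from the PR (which, per a later commit, also avoids __builtin_shufflevector in one spot due to a gcc bug):

#include <cstdint>

using vec256 = uint32_t __attribute__((__vector_size__(32)));

// If a vec256 held one state row for two blocks (lanes 0-3 = block A, lanes 4-7 = block B),
// rotating each half left by one lane would line the words up for the diagonal round; the
// inverse shuffle afterwards would restore the original layout.
inline void rotate_rows_example(vec256& v)
{
    v = __builtin_shufflevector(v, v, 1, 2, 3, 0, 5, 6, 7, 4);
}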

Contributor

Makes sense to me too, but I'm not sure why doing it this way is preferable. Intuitively I would have expected that the vectors hold the same word spread over multiple blocks. But if I understand your approach here, we have multiple words over two blocks. Is it just easier to express the quarter rounds in this way?

assert(blocks * ChaCha20Aligned::BLOCKLEN == out_bytes.size());
#ifdef ENABLE_CHACHA20_VEC
// Only use the vectorized implementations if the counter will not overflow.
const bool overflow = static_cast<uint64_t>(input[8]) + blocks > std::numeric_limits<uint32_t>::max();
Contributor

Overflow happens every 274GB I guess (2^32 blocks × 64 bytes ≈ 275 GB), which presumably isn't worth putting much effort into optimising around.

Member Author

This is actually doing the opposite: it's keeping the vectorized impl from having to worry about this case. In the current code, we have:

++j12;
if (!j12) ++j13;
...
input[8] = j12;
input[9] = j13;

So effectively input[8] and input[9] are treated as a single uint64_t. It's not possible to express "cast to uint64_t elements and increment" or "increment and overflow over there" with the vector extensions, so it turned out to be easier to just forbid the overflow cases from being vectorized at all.
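For illustration, a hedged sketch of the guard being discussed (the helper name is made up): input[8] and input[9] together act as a 64-bit counter, but the vectorized path only increments the low word, so the scalar path handles any call where that word would wrap:

#include <cstdint>
#include <limits>

// Mirrors the check shown above: fall back to the scalar code if the 32-bit block
// counter would wrap while processing `blocks` blocks in this call.
inline bool counter_would_wrap(uint32_t counter_lo, uint64_t blocks)
{
    return static_cast<uint64_t>(counter_lo) + blocks > std::numeric_limits<uint32_t>::max();
}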

Contributor

Yeah, I was thinking about whether the code should just do the non-vectorized stuff to get past the overflow then immediately go back to vectorizing, rather than waiting for the next call. I think what you've got makes sense.

l0rinc (Contributor) left a comment

I went through the code roughly, I like that we're focusing on this and looking forward to specializing other parts of the code that are this critical (looking at you, SipHash!).

My biggest objection currently is that it's broken on big-endian systems - I left a suggestion on how to reproduce and fix it. This also reveals a lack of testing - we have to find a way to selectively enable and disable these optimizations to make sure we have tests that compare their outputs.
I agree with AJ that we could modernize this a bit with constexpr and less general recursive template magic since C++20 allows us to use simple loops and constexpr conditions and lambdas - it could reduce a lot of duplication while maintaining performance.
There's also some repetition (e.g. ALWAYS_INLINE and internal_bswap_32 are already defined elsewhere), which I think we could reuse instead.
Which leads me to think we should extract the new primitives used here (__builtin_shufflevector, vec_rotl, vec_byteswap, vec256) to a reusable header - tested and benchmarked separately from the chacha work.

if constexpr (std::endian::native == std::endian::big)
{
vec256 ret;
ret[0] = __builtin_bswap32(vec[0]);
l0rinc (Contributor) commented Dec 17, 2025

can we assume that all big endian systems have __builtin_bswap32 available?

// All internal_bswap_* functions can be replaced with std::byteswap once we
// require c++23. Both libstdc++ and libc++ implement std::byteswap via these
// builtins.
indicates we could use our existing helpers here instead:

ALWAYS_INLINE void vec_byteswap(vec256& vec)
{
    if constexpr (std::endian::native == std::endian::big) {
        for (size_t i = 0; i < 8; ++i) {
            vec[i] = internal_bswap_32(vec[i]);
        }
    }
}

A small, fixed-size loop like this should be unrolled by every compiler; it's what we did in https://github.com/bitcoin/bitcoin/pull/31144/files#diff-f26d4597a7f5a5d4aa30053032d49427b402ef4fd7a1b3194cb75b2551d58ca4R48 as well.

(nit: could you please reformat the patch, it differs slightly from how clang-format is set for new code)

Member Author

Thanks, yes, I meant to change that to internal_bswap_32 before pushing. Will do.


/** Store a vector in all array elements */
template <size_t I, size_t ITER = 0>
ALWAYS_INLINE void arr_set_vec256(std::array<vec256, I>& arr, const vec256& vec)
Contributor

It seems to me the existing standard test vectors are too short (mostly < 128 bytes) to trigger the vectorized optimization - can we extend the tests to run and compare the new specializations (could show as skipped when the given architecture isn't available locally)?

Member Author

Yep, will do.

# define CHACHA20_VEC_DISABLE_STATES_8
# define CHACHA20_VEC_DISABLE_STATES_6
# define CHACHA20_VEC_DISABLE_STATES_4
# define CHACHA20_VEC_DISABLE_STATES_2
l0rinc (Contributor) commented Dec 17, 2025

Forcing vectorized operations on an emulated big-endian system (otherwise it falls back to the scalar implementation due to the safety checks):

diff --git a/src/crypto/chacha20_vec_base.cpp b/src/crypto/chacha20_vec_base.cpp
index 9fda9452a1..a2aa1e5552 100644
--- a/src/crypto/chacha20_vec_base.cpp
+++ b/src/crypto/chacha20_vec_base.cpp
@@ -20,7 +20,7 @@
 #  define CHACHA20_VEC_DISABLE_STATES_8
 #  define CHACHA20_VEC_DISABLE_STATES_6
 #  define CHACHA20_VEC_DISABLE_STATES_4
-#  define CHACHA20_VEC_DISABLE_STATES_2
+//#  define CHACHA20_VEC_DISABLE_STATES_2
 #endif

 #include <crypto/chacha20_vec.ipp>

and running the crypto_tests:

brew install podman pigz qemu
podman machine init
podman machine start

podman run --platform linux/s390x -it --rm ubuntu:latest /bin/bash -c \
  'apt-get update && \
   DEBIAN_FRONTEND=noninteractive apt-get install -y \
   git build-essential cmake ccache pkg-config \
   libevent-dev libboost-dev libssl-dev libsqlite3-dev python3 && \
   git clone https://github.com/bitcoin/bitcoin.git && cd bitcoin && \
   git fetch origin pull/34083/head:chacha20-vec && git checkout chacha20-vec && \
   sed -i "s/#  define CHACHA20_VEC_DISABLE_STATES_2/\/\/#  define CHACHA20_VEC_DISABLE_STATES_2/" src/crypto/chacha20_vec_base.cpp && \
   cmake -B build -DBUILD_BENCH=OFF -DBUILD_GUI=OFF -DBUILD_DAEMON=OFF -DBUILD_TX=OFF -DENABLE_IPC=OFF -DENABLE_WALLET=OFF -DENABLE_ZMQ=OFF -DENABLE_UPNP=OFF -DENABLE_NATPMP=OFF && \
   cmake --build build --target test_bitcoin -j1 && ./build/bin/test_bitcoin --run_test=crypto_tests'

We're getting a lot of failures:

Running 17 test cases...
./test/crypto_tests.cpp(155): error: in "crypto_tests/chacha20_testvector": check hexout == HexStr(outres) has failed [a3fbf07df3fa2fde4f376ca23e82737041605d9f4f4f57bd8cff2c1d4b7955ec2a97948bd3722915c8f3d337f7d370050e9e96d647b7c39f56e031ca5eb6250d4042e02785ececfa4b4bb5e8ead0440e20b6e8db09d881a7c6132f420e52795042bdfa7773d8a9051447b3291ce1411c680465552aa6c405b7764d5e87bea85ad00f8449ed8f72d0d662ab052691ca66424bc86d2df80ea41f43abf937d3259dc4b2d0dfb48a6c9139ddd7f76966e928e635553ba76c5c879d7b35d49eb2e62b0871cdac638939e25e8a1e0ef9d5280fa8ca328b351c3c765989cbcf3daa8b6ccc3aaf9f3979c92b3720fc88dc95ed84a1be059c6499b9fda236e7e818b04b0bc39c1e876b193bfe5569753f88128cc08aaa9b63d1a16f80ef2554d7189c411f5869ca52c5b83fa36ff216b9c1d30062bebcfd2dc5bce0911934fda79a86f6e698ced759c3ff9b6477338f3da4f9cd8514ea9982ccafb341b2384dd902f3d1ab7ac61dd29c6f21ba5b862f3730e37cfdc4fd806c22f221 != c2ece71ceded38c04f376ca225cc3d6b463409986f263e9db1994a204b6844ec6e9695cfc52b7003e3b6961ceac96a18138981cb4ee5919649b263d542b82b114a57f52d8ba2a2f4540af4f7f49f0b1072a7f9891b97ceb5c61c20420143685f169add2373c4b505122eda2f5df3535d555637680dc5a722b83209519ae7f147c11a9158f48479c9926ea7412ac69d6a584ac97768e412e1514fa7b737ce389dc4bbd9df9cc422b95ccfc5926171fe20e9284834a77646878d7a34c485b3e7300c3589a8448b3bc53b980c6bced42938afc1398c2f1a3a6c2887c5be68a18039cb2fba983271c1202a73af95c79ae29faafb40973694b4afa523f2ef13b84300c39c1e876b193bfe5569753f88128cc08aaa9b63d1a16f80ef2554d7189c411f5869ca52c5b83fa36ff216b9c1d30062bebcfd2dc5bce0911934fda79a86f6e698ced759c3ff9b6477338f3da4f9cd8514ea9982ccafb341b2384dd902f3d1ab7ac61dd29c6f21ba5b862f3730e37cfdc4fd806c22f221]
...
./test/crypto_tests.cpp(185): error: in "crypto_tests/chacha20_testvector": check hexout == HexStr(outres) has failed [a3fbf07df3fa2fde4f376ca23e82737041605d9f4f4f57bd8cff2c1d4b7955ec2a97948bd3722915c8f3d337f7d370050e9e96d647b7c39f56e031ca5eb6250d4042e02785ececfa4b4bb5e8ead0440e20b6e8db09d881a7c6132f420e52795042bdfa7773d8a9051447b3291ce1411c680465552aa6c405b7764d5e87bea85ad00f8449ed8f72d0d662ab052691ca66424bc86d2df80ea41f43abf937d3259dc4b2d0dfb48a6c9139ddd7f76966e928e635553ba76c5c879d7b35d49eb2e62b0871cdac638939e25e8a1e0ef9d5280fa8ca328b351c3c765989cbcf3daa8b6ccc3aaf9f3979c92b3720fc88dc95ed84a1be059c6499b9fda236e7e818b04b0bc39c1e876b193bfe5569753f88128cc08aaa9b63d1a16f80ef2554d7189c411f5869ca52c5b83fa36ff216b9c1d30062bebcfd2dc5bce0911934fda79a86f6e698ced759c3ff9b6477338f3da4f9cd8514ea9982ccafb341b2384dd902f3d1ab7ac61dd29c6f21ba5b862f3730e37cfdc4fd806c22f221 != a3fbf07df3fa2fde4f376ca23e82737041605d9f4f4f57bd8cff2c1d4b7955ec2a97948bd3722915c8f3d337f7d370050e9e96d647b7c39f56e031ca5eb6250d4a57f52d8ba2a2f4540af4f7f49f0b1072a7f9891b97ceb5c61c20420143685f169add2373c4b505122eda2f5df3535d555637680dc5a722b83209519ae7f147c11a9158f48479c9926ea7412ac69d6a584ac97768e412e1514fa7b737ce389dc4bbd9df9cc422b95ccfc5926171fe20e9284834a77646878d7a34c485b3e7300871cdac638939e25e8a1e0ef9d5280fa8ca328b351c3c765989cbcf3daa8b6ccc3aaf9f3979c92b3720fc88dc95ed84a1be059c6499b9fda236e7e818b04b0bc39c1e876b193bfe5569753f88128cc08aaa9b63d1a16f80ef2554d7189c411f5869ca52c5b83fa36ff216b9c1d30062bebcfd2dc5bce0911934fda79a86f6e698ced759c3ff9b6477338f3da4f9cd8514ea9982ccafb341b2384dd902f3d1ab7ac61dd29c6f21ba5b862f3730e37cfdc4fd806c22f221]
./test/crypto_tests.cpp(272): error: in "crypto_tests/chacha20poly1305_testvectors": check cipher == expected_cipher has failed
...
./test/crypto_tests.cpp(337): error: in "crypto_tests/chacha20poly1305_testvectors": check decipher == plain has failed

*** 37 failures are detected in the test module "Bitcoin Core Test Suite"

This appears to be an endianness issue in vec_read_xor_write: the current implementation swaps the result of the XOR, but on Big Endian systems we must swap the state vec before the XOR.

Changing it to:

diff --git a/src/crypto/chacha20_vec.ipp b/src/crypto/chacha20_vec.ipp
index 46a159ce01..cfc0535a92 100644
--- a/src/crypto/chacha20_vec.ipp
+++ b/src/crypto/chacha20_vec.ipp
@@ -174,8 +174,9 @@ ALWAYS_INLINE void vec_read_xor_write(std::span<const std::byte, 32> in_bytes, s
 {
     std::array<uint32_t, 8> temparr;
     memcpy(temparr.data(), in_bytes.data(), in_bytes.size());
-    vec256 tempvec = vec ^ (vec256){temparr[0], temparr[1], temparr[2], temparr[3], temparr[4], temparr[5], temparr[6], temparr[7]};
+    vec256 tempvec = vec;
     vec_byteswap(tempvec);
+    tempvec ^= (vec256){temparr[0], temparr[1], temparr[2], temparr[3], temparr[4], temparr[5], temparr[6], temparr[7]};
     temparr = {tempvec[0], tempvec[1], tempvec[2], tempvec[3], tempvec[4], tempvec[5], tempvec[6], tempvec[7]};
     memcpy(out_bytes.data(), temparr.data(), out_bytes.size());
 }

Makes it pass for me.

We should find a way to exercise this via CI and benchmarks, maybe similarly to SHA256AutoDetect in

std::string SHA256AutoDetect(sha256_implementation::UseImplementation use_implementation)
{
std::string ret = "standard";
Transform = sha256::Transform;
TransformD64 = sha256::TransformD64;
TransformD64_2way = nullptr;
TransformD64_4way = nullptr;
TransformD64_8way = nullptr;
#if !defined(DISABLE_OPTIMIZED_SHA256)
#if defined(HAVE_GETCPUID)
bool have_sse4 = false;
bool have_xsave = false;
bool have_avx = false;
[[maybe_unused]] bool have_avx2 = false;
[[maybe_unused]] bool have_x86_shani = false;
[[maybe_unused]] bool enabled_avx = false;
uint32_t eax, ebx, ecx, edx;
GetCPUID(1, 0, eax, ebx, ecx, edx);
if (use_implementation & sha256_implementation::USE_SSE4) {
have_sse4 = (ecx >> 19) & 1;
}
have_xsave = (ecx >> 27) & 1;
have_avx = (ecx >> 28) & 1;
if (have_xsave && have_avx) {
enabled_avx = AVXEnabled();
}
if (have_sse4) {
GetCPUID(7, 0, eax, ebx, ecx, edx);
if (use_implementation & sha256_implementation::USE_AVX2) {
have_avx2 = (ebx >> 5) & 1;
}
if (use_implementation & sha256_implementation::USE_SHANI) {
have_x86_shani = (ebx >> 29) & 1;
}
}
#if defined(ENABLE_SSE41) && defined(ENABLE_X86_SHANI)
if (have_x86_shani) {
Transform = sha256_x86_shani::Transform;
TransformD64 = TransformD64Wrapper<sha256_x86_shani::Transform>;
TransformD64_2way = sha256d64_x86_shani::Transform_2way;
ret = "x86_shani(1way;2way)";
have_sse4 = false; // Disable SSE4/AVX2;
have_avx2 = false;
}
#endif
if (have_sse4) {
#if defined(__x86_64__) || defined(__amd64__)
Transform = sha256_sse4::Transform;
TransformD64 = TransformD64Wrapper<sha256_sse4::Transform>;
ret = "sse4(1way)";
#endif
#if defined(ENABLE_SSE41)
TransformD64_4way = sha256d64_sse41::Transform_4way;
ret += ";sse41(4way)";
#endif
}
#if defined(ENABLE_AVX2)
if (have_avx2 && have_avx && enabled_avx) {
TransformD64_8way = sha256d64_avx2::Transform_8way;
ret += ";avx2(8way)";
}
#endif
#endif // defined(HAVE_GETCPUID)
#if defined(ENABLE_ARM_SHANI)
bool have_arm_shani = false;
if (use_implementation & sha256_implementation::USE_SHANI) {
#if defined(__linux__)
#if defined(__arm__) // 32-bit
if (getauxval(AT_HWCAP2) & HWCAP2_SHA2) {
have_arm_shani = true;
}
#endif
#if defined(__aarch64__) // 64-bit
if (getauxval(AT_HWCAP) & HWCAP_SHA2) {
have_arm_shani = true;
}
#endif
#endif
#if defined(__APPLE__)
int val = 0;
size_t len = sizeof(val);
if (sysctlbyname("hw.optional.arm.FEAT_SHA256", &val, &len, nullptr, 0) == 0) {
have_arm_shani = val != 0;
}
#endif
}
if (have_arm_shani) {
Transform = sha256_arm_shani::Transform;
TransformD64 = TransformD64Wrapper<sha256_arm_shani::Transform>;
TransformD64_2way = sha256d64_arm_shani::Transform_2way;
ret = "arm_shani(1way;2way)";
}
#endif
#endif // DISABLE_OPTIMIZED_SHA256
assert(SelfTest());
return ret;
}
which would enable us to run benchmarks and tests selectively.

Member Author

Thanks for catching this! I forgot to mention in the PR description that big-endian support was best-effort and untested. I figured our CI would catch any obvious bugs. Agree it's not great that it didn't :(

Thanks for the quick fix too :)

Contributor

I had the same disappointment when implementing the obfuscation optimization. :)

@maflcko do we have a big-endian nightly that supports vectorized operations? Would it have caught this?

Member

See #33436, but it was slow despite having the gui and the fuzz tests disabled. I am running it as part of nightly, so any issues will be caught before a release, but I am not sure if catching everything in pull requests is possible.

Maybe it is possible to split the task into two: One for a cross-compile, which should be faster than a "native" compile via qemu. And another to run the tests, similar to the windows-cross tests.

Member

Actually, I ran the CI config locally, but it didn't catch this, as the platform opts out. The CI doesn't run the sed -i "s/# define CHACHA20_VEC_DISABLE_STATES_2/\/\/# define CHACHA20_VEC_DISABLE_STATES_2/" src/crypto/chacha20_vec_base.cpp && \ portion, so it would not have caught this, even if the task was run.

Contributor

Thanks for checking @maflcko, that's why I asked.
The vectorized operations are obviously supported, but for some reason were not triggered for me either.
@theuni, is this just an emulation anomaly or were we just too cautious?
I understood that @achow101 has access to a real big-endian POWER9 machine that we might be able to test this on when it's ready.

Member Author

@l0rinc As the code is written at the moment, all platforms must opt-in to vectorization as opposed to opting out. I'm not sure that's the best approach, but I figured that was a reasonable starting point.

My reasoning for that was: consider non-x86, non-arm+neon platforms. For the most part, I'm assuming they're under-powered. Enabling (for example) the 4-state path for mipsel would probably cause a drastic slowdown. Obviously there are lots of other powerful platforms, but I figured those would be added over time.

So if there's a big-endian architecture (or more ideally, a specific instruction set ala __ARM_NEON) that demonstrates a performance gain from calculating multiple states at once, it should be added here.

Of course, we could go the opposite direction and opt all platforms IN to all states, and instead opt-out the slow ones.

tl;dr: If we want a big-endian platform to be supported, we need to opt one in or provide an override.
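For illustration, a hedged sketch of what opting another platform in could look like; the PowerPC/VSX condition and the chosen state counts are hypothetical, not part of the PR:

#if defined(__powerpc64__) && defined(__VSX__)
// Hypothetical: a target that benefits only from the 2- and 4-state paths would disable
// the wider ones and leave the rest enabled after benchmarking on real hardware.
#  define CHACHA20_VEC_DISABLE_STATES_16
#  define CHACHA20_VEC_DISABLE_STATES_8
#  define CHACHA20_VEC_DISABLE_STATES_6
#endif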

Comment on lines +90 to +102
template <size_t BITS, size_t I, size_t ITER = 0>
ALWAYS_INLINE void arr_add_xor_rot(std::array<vec256, I>& arr0, const std::array<vec256, I>& arr1, std::array<vec256, I>& arr2)
{
vec256& x = std::get<ITER>(arr0);
const vec256& y = std::get<ITER>(arr1);
vec256& z = std::get<ITER>(arr2);

x += y;
z ^= x;
vec_rotl<BITS>(z);

if constexpr(ITER + 1 < I ) arr_add_xor_rot<BITS, I, ITER + 1>(arr0, arr1, arr2);
}
l0rinc (Contributor) commented Dec 17, 2025

Hmmm, it seems to me we're doing heavy recursive templates here to unroll loops.
My understanding (and experience with the mentioned Obfuscation PR) is that modern compilers are good at unrolling fixed-bound loops.
Can you please try if this also works and results in the same speedup?

Suggested change
template <size_t BITS, size_t I, size_t ITER = 0>
ALWAYS_INLINE void arr_add_xor_rot(std::array<vec256, I>& arr0, const std::array<vec256, I>& arr1, std::array<vec256, I>& arr2)
{
vec256& x = std::get<ITER>(arr0);
const vec256& y = std::get<ITER>(arr1);
vec256& z = std::get<ITER>(arr2);
x += y;
z ^= x;
vec_rotl<BITS>(z);
if constexpr(ITER + 1 < I ) arr_add_xor_rot<BITS, I, ITER + 1>(arr0, arr1, arr2);
}
template <size_t BITS, size_t I>
ALWAYS_INLINE void arr_add_xor_rot(std::array<vec256, I>& arr0, const std::array<vec256, I>& arr1, std::array<vec256, I>& arr2)
{
for (size_t i{0}; i < I; ++i) {
arr0[i] += arr1[i];
arr2[i] ^= arr0[i];
vec_rotl<BITS>(arr2[i]);
}
}

(nit: the size is often N instead of I)

Member

Alternatively, with a compile-time loop:

/** Perform add/xor/rotate for the round function */
template <size_t BITS, size_t I>
ALWAYS_INLINE void arr_add_xor_rot(std::array<vec256, I>& arr0, const std::array<vec256, I>& arr1, std::array<vec256, I>& arr2)
{
    [&]<size_t... ITER>(std::index_sequence<ITER...>) {
        ((
            arr0[ITER] += arr1[ITER],
            arr2[ITER] ^= arr0[ITER],
            vec_rotl<BITS>(arr2[ITER])
        ), ...);
    }(std::make_index_sequence<I>());
}

Member Author

See the updated PR description, where I touched on this.

I found (with lots of experimentation here) that different compilers use complicated and unpredictable heuristics to decide whether or not to unroll loops. Even if an unroll-able loop is detected, unrolling may be skipped because of the function size (as is the case here because it's huge).

So, I'd like to not take chances on unrolling. That means one of:

  • Manual unrolling
  • Macro-based (as-in REPEAT10())
  • A pragma (see the sketch below)
  • Recursive template based

c++26 introduces template for, which is what we really want.

I'm up for whichever of those is generally preferred, as long as it's explicit rather than implicit.
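For illustration, a hedged sketch of the pragma option from the list above (the function and loop body are illustrative): gcc honors #pragma GCC unroll n, which makes the unrolling request explicit without macros or recursion:

#include <array>
#include <cstddef>
#include <cstdint>

using vec256 = uint32_t __attribute__((__vector_size__(32)));

template <std::size_t N>
__attribute__((always_inline)) inline void arr_add_example(std::array<vec256, N>& a, const std::array<vec256, N>& b)
{
    // clang's equivalent spelling is "#pragma unroll 8".
#pragma GCC unroll 8
    for (std::size_t i = 0; i < N; ++i) a[i] += b[i];
}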

Contributor

Is there any way we could help that would make you reconsider? I don't mind running the benchmarks on a few platforms, as long as we can have simpler code. The current template magic is not something I would like to see more of - modern C++20 should be able to handle the situations we have here. I don't mind experimenting with this if you don't want to.

Member

@theuni See the std::make_index_sequence-based approach I demonstrated above. It's template based, but doesn't need recursion, and doesn't rely on compiler unrolling. It simply expands to an expression, e.g. (a[0] = v, a[1] = v, a[2] = v, ...).

l0rinc (Contributor) commented Dec 17, 2025

https://www.agner.org/optimize/optimizing_cpp.pdf contains a few useful examples:

The automatic vectorization works best if the following conditions are satisfied:

  1. Use a compiler with good support for automatic vectorization, such as Gnu, Clang,
    or Intel.

  2. Use the latest version of the compiler. The compilers are becoming better and better
    at vectorization.

  3. If the arrays or structures are accessed through pointers or references then tell the
    compiler explicitly that pointers do not alias, if appropriate, using the __restrict or
    restrict keyword.

  4. Use appropriate compiler options to enable the desired instruction set
    (/arch:SSE2, /arch:AVX etc. for Windows, -msse2, -mavx512f, etc. for Linux)

  5. Use the less restrictive floating point options. For Gnu and Clang compilers, use the
    options -O2 -fno-trapping-math -fno-math-errno -fno-signed-zeros
    (-ffast-math works as well, but functions like isnan(x) do not work under
    -ffast-math).

  6. Align arrays and big structures by 16 for SSE2, preferably 32 for AVX and preferably
    64 for AVX512.

  7. The loop count should preferably be a constant that is divisible by the number of
    elements in a vector.

  8. If arrays are accessed through pointers so that the alignment is not visible in the
    scope of the function where you want vectorization then follow the advice given
    above.

  9. Minimize the use of branches at the vector element level.

  10. Avoid table lookup at the vector element level.

Similarly, https://en.algorithmica.org/hpc/simd/auto-vectorization has useful hints:

The other way, specific to SIMD, is the “ignore vector dependencies” pragma. It is a general way to inform the compiler that there are no dependencies between the loop iterations:

#pragma GCC ivdep
for (int i = 0; i < n; i++)

Comment on lines +25 to +33
#if defined(__has_attribute)
# if __has_attribute(always_inline)
# define ALWAYS_INLINE __attribute__ ((always_inline)) inline
# endif
#endif

#if !defined(ALWAYS_INLINE)
# define ALWAYS_INLINE inline
#endif
Contributor

Could we use

bitcoin/src/attributes.h

Lines 19 to 25 in e16c22f

#if defined(__GNUC__)
# define ALWAYS_INLINE inline __attribute__((always_inline))
#elif defined(_MSC_VER)
# define ALWAYS_INLINE __forceinline
#else
# error No known always_inline attribute for this platform.
#endif
instead?

Comment on lines +295 to +309
#if !defined(CHACHA20_VEC_DISABLE_STATES_16)
while(in_bytes.size() >= CHACHA20_VEC_BLOCKLEN * 16) {
multi_block_crypt<16>(in_bytes, out_bytes, state0, state1, state2);
state2 += (vec256){16, 0, 0, 0, 16, 0, 0, 0};
in_bytes = in_bytes.subspan(CHACHA20_VEC_BLOCKLEN * 16);
out_bytes = out_bytes.subspan(CHACHA20_VEC_BLOCKLEN * 16);
}
#endif
#if !defined(CHACHA20_VEC_DISABLE_STATES_8)
while(in_bytes.size() >= CHACHA20_VEC_BLOCKLEN * 8) {
multi_block_crypt<8>(in_bytes, out_bytes, state0, state1, state2);
state2 += (vec256){8, 0, 0, 0, 8, 0, 0, 0};
in_bytes = in_bytes.subspan(CHACHA20_VEC_BLOCKLEN * 8);
out_bytes = out_bytes.subspan(CHACHA20_VEC_BLOCKLEN * 8);
}
Contributor

There's a lot of repetition here - could we extract that to an ALWAYS_INLINE lambda and have something like:

    if constexpr (ENABLE_16) process_blocks(16);
    else if constexpr (ENABLE_8)  process_blocks(8);
...

I haven't implemented it locally, maybe it's naive, let me know what you think.


ALWAYS_INLINE void arr_set_vec256(std::array<vec256, I>& arr, const vec256& vec)
{
std::get<ITER>(arr) = vec;
if constexpr(ITER + 1 < I ) arr_set_vec256<I, ITER + 1>(arr, vec);
Member

I believe it's possible to use compile-time loops here instead of recursive templates (if runtime loops don't get optimized sufficiently):

/** Store a vector in all array elements */
template <size_t I>
ALWAYS_INLINE void arr_set_vec256(std::array<vec256, I>& arr, const vec256& vec)
{
    [&]<size_t... ITER>(std::index_sequence<ITER...>) {
        ((std::get<ITER>(arr) = vec),...);
    }(std::make_index_sequence<I>());
}

Member Author

I avoided lambdas as I've historically observed lots of optimizations being skipped (with clang, at least) when using them. But since you/@ajtowns/@l0rinc have all made the same comment, I'll play around and see if these indeed compile down to nothing as one would hope.

Member

It can be done with a helper function too, but that's not as concise:

template <size_t I, size_t... ITER>
ALWAYS_INLINE void arr_set_vec256_inner(std::array<vec256, I>& arr, const vec256& vec, std::index_sequence<ITER...>)
{
    ((std::get<ITER>(arr) = vec),...);
}

/** Store a vector in all array elements */
template <size_t I>
ALWAYS_INLINE void arr_set_vec256(std::array<vec256, I>& arr, const vec256& vec)
{
    arr_set_vec256_inner(arr, vec, std::make_index_sequence<I>());
}

Contributor

Writing this as:

    template<size_t... ITER> using iseq = std::index_sequence<ITER...>;
    static constexpr auto ISEQ = std::make_index_sequence<I>();
    
    template<size_t... ITER>
    ALWAYS_INLINE void arr_set_vec256(iseq<ITER...>, std::array<vec256, I>& arr, const vec256& vec)
    {
        ((std::get<ITER>(arr) = vec), ...);
    }

   ...
        arr_set_vec256(ISEQ, arr0, num256);

doesn't seem too bad, and avoids lambdas? Possibly still a bit clumsy if you can't capture I by putting everything in a class? Might still be a bit clumsy for add_xor_rot.

Contributor

Is there a particular reason for the aversion to the current approach? It seems easier to read and probably also to debug to me.

theuni (Member Author) commented Dec 17, 2025

An additional note about the actual vectorizing algorithm itself...

There's a different (and arguably more obvious) algorithm that's possible when calculating exactly 8 states. Rather than loading each vec256 with partial info from 2 states, it's possible to use 16 vec256 where each vector contains 1 element for each of the 8 states. It looks like:

vec256 x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15;

broadcast(x0, 0x61707865);
broadcast(x1, 0x3320646e);
broadcast(x2, 0x79622d32);
broadcast(x3, 0x6b206574);
broadcast(x4, input[0]);
...
broadcast(x15, input[11]);

QUARTERROUND( x0, x4, x8,x12);
QUARTERROUND( x1, x5, x9,x13);
...

vec256 j0, j1, j2, j3, ... j31;
extract_column(x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15, 0, j0, j1);
extract_column(x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15, 1, j2, j3);
...
extract_column(x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15, 15, j30, j31);

vec_read_xor_write(in_bytes, out_bytes, j0);
vec_read_xor_write(in_bytes, out_bytes, j1);
...

The problem with this is the overhead of the transpose at the end. My experiments showed (on x86_64+avx2, at least) that this was actually slower than what's currently implemented. It's possible that this approach is better for other architectures, but I left it out of this PR to reduce the initial complexity.
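For illustration, a hedged sketch of the broadcast step in the pseudocode above; with the vector extensions a splat is just initializing every lane with the same word:

#include <cstdint>

using vec256 = uint32_t __attribute__((__vector_size__(32)));

// Splat one 32-bit word into every lane, matching the broadcast(x, word) calls above.
inline void broadcast(vec256& v, uint32_t word)
{
    v = (vec256){word, word, word, word, word, word, word, word};
}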

Edit: This is what the linux kernel does for x86_64+avx2. I grabbed their .S and hacked it into Core to benchmark, only to find that the impl here actually outperformed their hand-written asm. That was neat :)

theuni (Member Author) commented Dec 17, 2025

Ok, after some experimenting, I am very hesitant to add the helpers as suggested above. For example, consider @ajtowns's seemingly innocuous Repeat function:

template <size_t REPS, typename Fn>
ALWAYS_INLINE void Repeat(Fn&& fn)
{
    if constexpr (REPS > 0) {
        fn();
        Repeat<REPS-1>(std::forward<Fn>(fn));
    }
}

This causes a 10% slowdown on this branch, but with avx2 the performance is abysmal:

Baseline(master):

|                1.38 |      723,275,584.44 | `CHACHA20_1MB`

This PR:

|                0.71 |    1,408,751,284.34 | `CHACHA20_1MB`

202512-pr34083-templates

|                0.79 |    1,258,748,451.87 | `CHACHA20_1MB`

This PR + avx2 commits:

|                0.50 |    1,988,653,922.83 |  `CHACHA20_1MB`

202512-pr34083-templates + avx2 commits:

|                1.34 |      748,148,324.75 | `CHACHA20_1MB`

The problem in AJ's branch is that some functions end up un-inlined. For clang, it's doubleround. For gcc it's arr_read_xor_write and doubleround.

Both are fixed with:

diff --git a/src/crypto/chacha20_vec.ipp b/src/crypto/chacha20_vec.ipp
index 344c68259a5..16d7da45633 100644
--- a/src/crypto/chacha20_vec.ipp
+++ b/src/crypto/chacha20_vec.ipp
@@ -164 +164 @@ public:
-        Repeat<10>([&]() {
+        Repeat<10>([&]() __attribute__ ((always_inline)) {

That means, to be safe, every lambda would need that attribute, which is pretty ugly and easy to forget. So I think my aversion to lambdas in this code was somewhat justified.

Further, clang's inlining docs state that variadic functions aren't inlined. Experimenting now, that's obviously not true as of v21-git. But clearly it was in the recent past.

So.. I'd really prefer to keep things as simple here as possible. I know the recursive template functions aren't exactly pretty, but I don't think they're THAT bad? Plus, with c++26, it all goes away in favor of template for :)

ajtowns (Contributor) commented Dec 18, 2025

So.. I'd really prefer to keep things as simple here as possible. I know the recursive template functions aren't exactly pretty, but I don't think they're THAT bad? Plus, with c++26, it all goes away in favor of template for :)

Could we get some comments in the code as to compiler versions (and architecture) that failed to be efficient enough with lambdas, to hopefully avoid future people modernizing the code without checking that these problems have gone away? Presumably will be a long time before we can modernize to template for (which doesn't even seem to be on cppref's compiler support page yet?)...

(Only thing that I do think is "that bad" about the recursive templates is using ITER and I -- i should be the loop variable, not the size!)

theuni (Member Author) commented Dec 18, 2025

Could we get some comments in the code as to compiler versions (and architecture) that failed to be efficient enough with lambdas, to hopefully avoid future people modernizing the code without checking that these problems have gone away? Presumably will be a long time before we can modernize to template for (which doesn't even seem to be on cppref's compiler support page yet?)...

(Only thing that I do think is "that bad" about the recursive templates is using ITER and I -- i should be the loop variable, not the size!)

Sure, I can definitely add some more context and clean up the wonky variable names.

Another brain dump after thinking about all this some more last night:

  1. I don't think lambdas specifically are the problem; more precisely, the issue is: can the compiler infer enough about the "callback" function to inline it?

There are a few considerations there. @sipa's make_index_sequence suggestion is executed immediately, so I imagine that's more likely to be inlined than AJ's Repeat. But still, from a compiler's POV, if it's evaluating: "here's a function call without ALWAYS_INLINE and my function is already huge, should I exclude it from inlining?", imo it'd be reasonable for it to conclude "yes".

So to be safe, whatever we do, I think we should strive to annotate all functions. And imo, annotating a bunch of lambdas with an inline attribute feels weird.

(as an aside: There's also the flatten attribute, which could be added to multi_block_crypt as a belt-and-suspenders)
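For illustration, a hedged sketch of that aside (the function bodies are made up): __attribute__((flatten)) asks gcc/clang to inline, where possible, every call made inside the decorated function:

__attribute__((always_inline)) inline int quarter_step(int x) { return x * 5; } // stand-in helper

__attribute__((flatten)) int multi_block_crypt_like(int x)
{
    // flatten asks for every call in this body to be inlined; the helper already carries
    // always_inline, so this is the belt-and-suspenders the aside refers to.
    return quarter_step(quarter_step(x));
}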

The index_sequence trick is neat. If that ends up looking cleaner/more obvious than the recursive iteration, that works for me. I'll play around with it.

  2. Not all loops have to be unrolled.

In my testing, loop unrolling always showed slightly better performance (presumably because calculating multiple blocks is otherwise branch-free), but it's not nearly as performance-critical as the inlining. For example, skipping unrolling of doubleround leads to much smaller code, which I imagine could end up being faster on some architectures.

Is there any way we could help that would make you reconsider? I don't mind running the benchmarks on a few platforms, as long as we can have simpler code. The current template magic is not something I would like to see more of - modern C++20 should be able to handle the situations we have here.

My primary goal with this code (and hopefully setting a precedent for other multi-arch simd code... poly1305 is next ;) is to be as explicit to the compiler about what we want as possible. Ideally as portably as possible. So if we want our functions inlined or loops unrolled, we should attempt to communicate those things as opposed to leaving them implicit. Unfortunately, modern c++ doesn't have ways to express either of those yet, but we can make our intentions clear enough.

ajtowns (Contributor) commented Dec 19, 2025

And imo, annotating a bunch of lambdas with an inline attribute feels weird.

You could #define AI __attribute__((always_inline)), then you'd just be adding AI to all the lambdas, which, if nothing else, would be very modern?

My primary goal with this code (and hopefully setting a precedent for other multi-arch simd code... poly1305 is next ;) is to be as explicit to the compiler about what we want as possible. Ideally as portably as possible.

The above was kind-of a joke, but, perhaps you could combine it with a clang-tidy plugin that lets CI check that all lambdas in a particular namespace are annotated in that way? That might be both clear to compilers and reasonably friendly to human authors/reviewers?

(Explicit recursive templates and prohibiting lambdas are fine by me though; that's still a big step up from inline asm)

l0rinc (Contributor) commented Dec 22, 2025

I have measured its effect on IBD and am happy to say it produces a measurable speedup.

IBD | 926619 blocks | dbcache 450 | i7-hdd | x86_64 | Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz | 8 cores | 62Gi RAM | ext4 | HDD
COMMITS="938d7aacabd0bb3784bb3e529b1ed06bb2891864 3bddf59cd3f201ecf8d65bb1f6c0cde5c39595e9"; \
STOP=926619; DBCACHE=450; \
CC=gcc; CXX=g++; \
BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
(echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
(echo "" && echo "IBD | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
hyperfine \
  --sort command \
  --runs 2 \
  --export-json "$BASE_DIR/ibd-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
  --parameter-list COMMIT ${COMMITS// /,} \
  --prepare "killall -9 bitcoind 2>/dev/null; rm -rf $DATA_DIR/*; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
    cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo && ninja -C build bitcoind -j2 && \
    ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=1 -printtoconsole=0; sleep 20" \
  --conclude "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log && \
             grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block #1' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \
  "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -blocksonly -printtoconsole=0"

938d7aa Merge #33657: rest: allow reading partial block data from storage
3bddf59 chacha20: Add generic vectorized chacha20 implementation

Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=926619 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = 938d7aacabd0bb3784bb3e529b1ed06bb2891864)
  Time (mean ± σ):     45198.232 s ± 539.231 s    [User: 57185.689 s, System: 4367.881 s]
  Range (min … max):   44816.939 s … 45579.526 s    2 runs
 
Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=926619 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = 3bddf59cd3f201ecf8d65bb1f6c0cde5c39595e9)
  Time (mean ± σ):     43699.641 s ± 25.532 s    [User: 57610.736 s, System: 4058.824 s]
  Range (min … max):   43681.587 s … 43717.695 s    2 runs
 
Relative speed comparison
        1.03 ±  0.01  COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=926619 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = 938d7aacabd0bb3784bb3e529b1ed06bb2891864)
        1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=926619 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = 3bddf59cd3f201ecf8d65bb1f6c0cde5c39595e9)

ajtowns (Contributor) commented Dec 24, 2025

I have measured its effect on IBD and happy to say it produces a measurable speedup.

Presumably that indicates this IBD run is compute bound (vs disk or network), and that ChaCha20 is using up maybe 7% of CPU time (or 7% of a core when we're bottlenecked on something single-threaded) prior to this PR. That seems like a lot? Is this due to FastRandomContext, or something else? Or is it just that obfuscating all the block data is a large component of IBD CPU currently? This seems a bit surprising to me.

namespace CHACHA20_NAMESPACE {
#endif

void chacha20_crypt_vectorized(std::span<const std::byte>& in_bytes, std::span<std::byte>& out_bytes, const std::array<uint32_t, 12>& input) noexcept
Contributor

I think this function should get a short description too. Wasn't immediately clear to me that it attempts to peel off larger blocks first before potentially finishing with a few smaller blocks.


theuni (Member Author) commented Jan 6, 2026

Presumably that indicates this IBD run is compute bound (vs disk or network), and that ChaCha20 is using up maybe 7% of CPU time (or 7% of a core when we're bottlenecked on something single-threaded) prior to this PR. That seems like a lot? Is this due to FastRandomContext, or something else? Or is it just that obfuscating all the block data is a large component of IBD CPU currently? This seems a bit surprising to me.

Not sure if you missed this in the description, but that's exactly what this is trying to improve:

Local profiling revealed that chacha20 accounts for a substantial amount of the network thread's time. It's not clear to me if speeding up chacha20 will improve network performance/latency, but it will definitely make it more efficient.

IBD would indeed be slowed by Chacha20 via single-threaded bip324 handling on the network thread. While profiling my POC multi-process net binary I observed ChaCha20Aligned::Crypt() accounting for 30% of the net thread's cpu time.

So it's not surprising to me at all that 3x'ing that function (it's only 2x here, avx2/avx512 improve performance even more) speeds up IBD.

theuni and others added 2 commits January 13, 2026 19:05
Co-authored-by: Lőrinc <pap.lorinc@gmail.com>
__builtin_shufflevector emits non-optimal code for this case, so avoid using it
here. See: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107563#c14
theuni (Member Author) commented Jan 13, 2026

I finally managed to track down the gcc slowdown that @l0rinc mentioned during last week's IRC meeting. The culprit was this gcc bug. Thankfully, it's easily worked around by simply not calling the guilty builtin.

Also pushed @l0rinc's fix for big-endian.

Now that gcc/clang are more on par and the impl seems feasible again, I'll address the other feedback.

@l0rinc l0rinc mentioned this pull request Jan 14, 2026