Add initial vectorized chacha20 implementation for 2-3x speedup #34083
base: master
Conversation
Exploit modern simd to calculate 2/4/6/8/16 states at a time depending on the size of the input. This demonstrates a 2x speedup on x86-64 and 3x for arm+neon. Platforms which require runtime detection (avx2/avx512) improve performance even further, and will come as a follow-up.

Rather than hand-writing assembly or using arch-specific intrinsics, this is written using compiler built-ins understood by gcc and clang. In practice (at least on x86_64 and armv8), the compilers are able to produce assembly that's not much worse than hand-written. This means that every architecture can benefit from its own vectorized instructions without having to write/maintain an implementation for each one. But because each will vary in its ability to exploit the parallelism, we allow (via ifdefs) each architecture to opt out of some or all multi-state calculation at compile-time. Here, as a starting point, x86-64 and arm+neon have been defined based on local benchmarks.
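For readers unfamiliar with the GNU vector extensions, here is a minimal sketch (not the PR's code; names are illustrative) of the building blocks: a plain typedef with `__vector_size__` gives element-wise +, ^, << and >>, so a ChaCha20 add/xor/rotate step can be written once and compiled down to SSE2, NEON, AVX2, etc.

```cpp
#include <cstdint>

// 8 uint32_t lanes; understood by both gcc and clang.
using vec256 = uint32_t __attribute__((__vector_size__(32)));

// Rotate every lane left by BITS (assumed 1..31).
template <unsigned BITS>
inline void vec_rotl(vec256& v)
{
    v = (v << BITS) | (v >> (32 - BITS));
}

// One ChaCha20 quarter-round step, applied to all 8 lanes at once:
// a += b; d ^= a; d = rotl(d, 16);
inline void add_xor_rot16(vec256& a, const vec256& b, vec256& d)
{
    a += b;
    d ^= a;
    vec_rotl<16>(d);
}
```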
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
Code Coverage & Benchmarks: for details see https://corecheck.dev/bitcoin/bitcoin/pulls/34083.
Reviews: see the guideline for information on the review process.
LLM Linter (✨ experimental): possible typos and grammar issues.

2026-01-13
ajtowns
left a comment
Some nits. Approach looks very nice, and at least going by the bench results, gives a good improvement.
```cpp
# define CHACHA20_VEC_DISABLE_STATES_2
#endif

#include <crypto/chacha20_vec.ipp>
```
Why separate this into an .ipp file if it's only included in a single .cpp file?
See branch here: https://github.com/theuni/bitcoin/commits/chacha20-vectorized/
I decided to exclude the impls which require runtime detection from this PR, as I think how that should be done is a separate conversation.
To answer your question more specifically: some impls may require different compilation flags (-mavx2/-mavx512vl), which have to be in their own compilation units.
```cpp
// (sse2 and neon respectively) are safe to use without runtime detection.

#if defined(__x86_64__) || defined(__amd64__)
# define CHACHA20_VEC_DISABLE_STATES_16
```
Using #if !defined for these seems old school. Why not static constexpr bools, and if constexpr (..)?
Writing:

```cpp
#if defined(__x86_64__) || defined(__amd64__)
# define CHACHA20_VEC_DISABLE_STATES_16 true
# define CHACHA20_VEC_DISABLE_STATES_8 true
# define CHACHA20_VEC_DISABLE_STATES_6 true
# define CHACHA20_VEC_DISABLE_STATES_4 false
# define CHACHA20_VEC_DISABLE_STATES_2 false
#elif defined(__ARM_NEON)
...

static constexpr bool CHACHA20_VEC_ALL_MULTI_STATES_DISABLED =
    CHACHA20_VEC_DISABLE_STATES_16 &&
    CHACHA20_VEC_DISABLE_STATES_8 &&
    CHACHA20_VEC_DISABLE_STATES_6 &&
    CHACHA20_VEC_DISABLE_STATES_4 &&
    CHACHA20_VEC_DISABLE_STATES_2;

void chacha20_crypt_vectorized(std::span<const std::byte>& in_bytes, std::span<std::byte>& out_bytes, const std::array<uint32_t, 12>& input) noexcept
{
    if constexpr (CHACHA20_VEC_ALL_MULTI_STATES_DISABLED) return;
    ...
    if constexpr (!CHACHA20_VEC_DISABLE_STATES_16) {
        while (in_bytes.size() >= CHACHA20_VEC_BLOCKLEN * 16) {
            ...
```

Seems to work right, and also seems like it avoids needing to put all the potentially unused code in a #if block.
Here's a branch that replaces the #defines and abstracts the recursive template stuff a bit more for your perusal https://github.com/ajtowns/bitcoin/commits/202512-pr34083-templates/
+1 for if constexpr, should simplify the code a lot (especially if we migrate from recursive templates as well)
if constexpr certainly makes more sense where possible. I'll have a look, thanks!
```cpp
namespace {

using vec256 = uint32_t __attribute__((__vector_size__(32)));
```
static_assert(sizeof(vec256) == 32) ? (Or sizeof(vec256) == 32 || ALL_MULTI_STATES_DISABLED)
I haven't used this before but the internets tells me we could use the attribute syntax here, something like:
```diff
-using vec256 = uint32_t __attribute__((__vector_size__(32)));
+template <typename T, size_t N>
+using VectorType [[gnu::vector_size(sizeof(T) * N)]] = T;
+using vec256 = VectorType<uint32_t, 8>;
```
Because they're compiler-specific, the code makes no assumptions about the size/structure/alignment of vec256. The only accesses are via operator[]. So afaik, there's no need to check for this.
Sorry, I meant just so you get an error if you're using a compiler that ignores the __attribute__ entirely but still passes the #if in _base.cpp.
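For concreteness, a sketch of the guard being suggested here (assuming it sits next to the vec256 typedef):

```cpp
// If a compiler accepted the #if in chacha20_vec_base.cpp but silently ignored
// the __vector_size__ attribute, vec256 would collapse to a plain uint32_t and
// this would fail loudly at compile time.
static_assert(sizeof(vec256) == 32, "__vector_size__(32) attribute was not honored");
```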
Looking forward to reviewing it - quick question before I do: was it tested on any big-endian systems that don't fall back to the non-vectorized path?
```
the layout to prepare for the second round.
After the second round, they are used (in reverse) to restore the original
layout.
```
My understanding is that you're inverting the logic here: instead of

```cpp
#define QUARTERROUND(a,b,c,d) \
  X(a,b,d,16) X(c,d,b,12) X(a,d,b,8) X(c,d,b,7)
```

this does the X(a,b,d,16) all four times via simd, then X(c,d,b,12) all four times, etc. And the simd part relies on the data for those four steps being exactly +4 units apart, which is why the shuffling and unshuffling is necessary for the second round. (Okay, not four times but eight times, because each vec256 is two blocks.)

Could use a little more explanation in the comment I think? But makes sense to me.
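For illustration only, a hedged sketch of the shuffle idea in its simplest single-block form (hypothetical names; 128-bit vectors holding one state row each, rather than the PR's two-blocks-per-vec256 layout): the column round runs the same step for all four quarter-rounds at once, and lane rotations turn it into the diagonal round and back. (Note that the PR itself later avoids __builtin_shufflevector in one spot because of a gcc codegen bug; this only shows the lane-rotation idea.)

```cpp
#include <cstdint>

using vec128 = uint32_t __attribute__((__vector_size__(16))); // 4 lanes = one state row

template <unsigned BITS>
inline void rotl_lanes(vec128& v) { v = (v << BITS) | (v >> (32 - BITS)); }

// Four quarter-rounds at once: a = row 0, b = row 1, c = row 2, d = row 3.
inline void column_round(vec128& a, vec128& b, vec128& c, vec128& d)
{
    a += b; d ^= a; rotl_lanes<16>(d);
    c += d; b ^= c; rotl_lanes<12>(b);
    a += b; d ^= a; rotl_lanes<8>(d);
    c += d; b ^= c; rotl_lanes<7>(b);
}

inline void double_round(vec128& a, vec128& b, vec128& c, vec128& d)
{
    column_round(a, b, c, d);
    // Rotate lanes so the diagonals line up, and reuse the same code...
    b = __builtin_shufflevector(b, b, 1, 2, 3, 0);
    c = __builtin_shufflevector(c, c, 2, 3, 0, 1);
    d = __builtin_shufflevector(d, d, 3, 0, 1, 2);
    column_round(a, b, c, d);
    // ...then rotate back to the original layout.
    b = __builtin_shufflevector(b, b, 3, 0, 1, 2);
    c = __builtin_shufflevector(c, c, 2, 3, 0, 1);
    d = __builtin_shufflevector(d, d, 1, 2, 3, 0);
}
```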
Makes sense to me too, but I'm not sure why doing it this way is preferable. Intuitively I would have expected that the vectors hold the same word spread over multiple blocks. But if I understand your approach here, we have multiple words over two blocks. Is it just easier to express the quarter rounds in this way?
```cpp
    assert(blocks * ChaCha20Aligned::BLOCKLEN == out_bytes.size());
#ifdef ENABLE_CHACHA20_VEC
    // Only use the vectorized implementations if the counter will not overflow.
    const bool overflow = static_cast<uint64_t>(input[8]) + blocks > std::numeric_limits<uint32_t>::max();
```
Overflow happens every 274GB I guess, which presumably isn't worth putting much effort into optimising around.
This is actually doing the opposite: it's keeping the vectorized impl from having to worry about this case. In the current code, we have:

```cpp
++j12;
if (!j12) ++j13;
...
input[8] = j12;
input[9] = j13;
```

So effectively input[8] and input[9] are treated as a single uint64_t. It's not possible to express "cast to uint64_t elements and increment" or "increment and overflow over there" with the vector extensions, so it turned out to be easier to just forbid the overflow cases from being vectorized at all.
Yeah, I was thinking about whether the code should just do the non-vectorized stuff to get past the overflow then immediately go back to vectorizing, rather than waiting for the next call. I think what you've got makes sense.
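To spell out the trade-off discussed above, a small sketch built from the scalar code quoted earlier plus the guard in this diff (no new logic, names are illustrative):

```cpp
#include <cstdint>
#include <limits>

// Scalar path: input[8]/input[9] behave like one 64-bit block counter,
// carrying the low word into the high word on wrap-around.
void bump_counter(uint32_t input[12])
{
    uint32_t j12 = input[8];
    uint32_t j13 = input[9];
    ++j12;
    if (!j12) ++j13; // carry into the high word
    input[8] = j12;
    input[9] = j13;
}

// Vectorized path: rather than expressing that carry per-lane, it simply refuses
// to run if the low word could wrap during this call.
bool counter_would_overflow(const uint32_t input[12], uint64_t blocks)
{
    return static_cast<uint64_t>(input[8]) + blocks > std::numeric_limits<uint32_t>::max();
}
```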
l0rinc
left a comment
I went through the code roughly. I like that we're focusing on this, and I'm looking forward to specializing other parts of the code that are this critical (looking at you, SipHash!).
My biggest objection currently is that it's broken on big-endian systems - I left a suggestion on how to reproduce and fix it. This also reveals the lack of testing - we have to find a way to selectively enable and disable these optimizations to make sure we have tests that compare their outputs.
I agree with AJ that we could modernize this a bit with constexpr and less general recursive template magic, since C++20 allows us to use simple loops, constexpr conditions, and lambdas - it could reduce a lot of duplication while maintaining performance.
There's also some repetition (e.g. ALWAYS_INLINE and internal_bswap_32) with helpers already defined elsewhere, which I think we could reuse instead.
Which leads me to think we should extract the new primitives used here (__builtin_shufflevector, vec_rotl, vec_byteswap, vec256) to a reusable header - tested and benchmarked separately from the chacha work.
```cpp
    if constexpr (std::endian::native == std::endian::big)
    {
        vec256 ret;
        ret[0] = __builtin_bswap32(vec[0]);
```
can we assume that all big endian systems have __builtin_bswap32 available?
Lines 14 to 16 in 432b18c:

```cpp
// All internal_bswap_* functions can be replaced with std::byteswap once we
// require c++23. Both libstdc++ and libc++ implement std::byteswap via these
// builtins.
```
```cpp
ALWAYS_INLINE void vec_byteswap(vec256& vec)
{
    if constexpr (std::endian::native == std::endian::big) {
        for (size_t i = 0; i < 8; ++i) {
            vec[i] = internal_bswap_32(vec[i]);
        }
    }
}
```

A small, fixed-size loop like this should be unrolled by every compiler; it's what we did in https://github.com/bitcoin/bitcoin/pull/31144/files#diff-f26d4597a7f5a5d4aa30053032d49427b402ef4fd7a1b3194cb75b2551d58ca4R48 as well.
(nit: could you please reformat the patch, it differs slightly from how clang-format is set for new code)
Thanks, yes, I meant to change that to internal_bswap_32 before pushing. Will do.
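For reference, a hedged sketch of what a portable 32-bit byteswap fallback looks like (the real internal_bswap_32 lives elsewhere in the tree and, per the quoted comment, is builtin-based); it also answers the question above: no big-endian-specific builtin is strictly required.

```cpp
#include <cstdint>

// Portable fallback for compilers without __builtin_bswap32; a shift/mask swap
// like this is typically recognized and compiled down to a single bswap/rev.
inline uint32_t bswap32_fallback(uint32_t x)
{
    return ((x & 0x000000ffU) << 24) |
           ((x & 0x0000ff00U) <<  8) |
           ((x & 0x00ff0000U) >>  8) |
           ((x & 0xff000000U) >> 24);
}
```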
```cpp
/** Store a vector in all array elements */
template <size_t I, size_t ITER = 0>
ALWAYS_INLINE void arr_set_vec256(std::array<vec256, I>& arr, const vec256& vec)
```
It seems to me the existing standard test vectors are too short (mostly < 128 bytes) to trigger the vectorized optimization - can we extend the tests to run and compare the new specializations (could show as skipped when the given architecture isn't available locally)?
Yep, will do.
```cpp
# define CHACHA20_VEC_DISABLE_STATES_8
# define CHACHA20_VEC_DISABLE_STATES_6
# define CHACHA20_VEC_DISABLE_STATES_4
# define CHACHA20_VEC_DISABLE_STATES_2
```
Forcing vectorized operations on an emulated big-endian system (otherwise it falls back to the scalar implementation due to the safety checks):
```diff
diff --git a/src/crypto/chacha20_vec_base.cpp b/src/crypto/chacha20_vec_base.cpp
index 9fda9452a1..a2aa1e5552 100644
--- a/src/crypto/chacha20_vec_base.cpp
+++ b/src/crypto/chacha20_vec_base.cpp
@@ -20,7 +20,7 @@
 # define CHACHA20_VEC_DISABLE_STATES_8
 # define CHACHA20_VEC_DISABLE_STATES_6
 # define CHACHA20_VEC_DISABLE_STATES_4
-# define CHACHA20_VEC_DISABLE_STATES_2
+//# define CHACHA20_VEC_DISABLE_STATES_2
 #endif
 #include <crypto/chacha20_vec.ipp>
```

and running the crypto_tests:
```sh
brew install podman pigz qemu
podman machine init
podman machine start
podman run --platform linux/s390x -it --rm ubuntu:latest /bin/bash -c \
  'apt-get update && \
   DEBIAN_FRONTEND=noninteractive apt-get install -y \
     git build-essential cmake ccache pkg-config \
     libevent-dev libboost-dev libssl-dev libsqlite3-dev python3 && \
   git clone https://github.com/bitcoin/bitcoin.git && cd bitcoin && \
   git fetch origin pull/34083/head:chacha20-vec && git checkout chacha20-vec && \
   sed -i "s/# define CHACHA20_VEC_DISABLE_STATES_2/\/\/# define CHACHA20_VEC_DISABLE_STATES_2/" src/crypto/chacha20_vec_base.cpp && \
   cmake -B build -DBUILD_BENCH=OFF -DBUILD_GUI=OFF -DBUILD_DAEMON=OFF -DBUILD_TX=OFF -DENABLE_IPC=OFF -DENABLE_WALLET=OFF -DENABLE_ZMQ=OFF -DENABLE_UPNP=OFF -DENABLE_NATPMP=OFF && \
   cmake --build build --target test_bitcoin -j1 && \
   ./build/bin/test_bitcoin --run_test=crypto_tests'
```

We're getting a lot of failures:
```
Running 17 test cases...
./test/crypto_tests.cpp(155): error: in "crypto_tests/chacha20_testvector": check hexout == HexStr(outres) has failed [a3fbf07df3fa2fde4f376ca23e82737041605d9f4f4f57bd8cff2c1d4b7955ec2a97948bd3722915c8f3d337f7d370050e9e96d647b7c39f56e031ca5eb6250d4042e02785ececfa4b4bb5e8ead0440e20b6e8db09d881a7c6132f420e52795042bdfa7773d8a9051447b3291ce1411c680465552aa6c405b7764d5e87bea85ad00f8449ed8f72d0d662ab052691ca66424bc86d2df80ea41f43abf937d3259dc4b2d0dfb48a6c9139ddd7f76966e928e635553ba76c5c879d7b35d49eb2e62b0871cdac638939e25e8a1e0ef9d5280fa8ca328b351c3c765989cbcf3daa8b6ccc3aaf9f3979c92b3720fc88dc95ed84a1be059c6499b9fda236e7e818b04b0bc39c1e876b193bfe5569753f88128cc08aaa9b63d1a16f80ef2554d7189c411f5869ca52c5b83fa36ff216b9c1d30062bebcfd2dc5bce0911934fda79a86f6e698ced759c3ff9b6477338f3da4f9cd8514ea9982ccafb341b2384dd902f3d1ab7ac61dd29c6f21ba5b862f3730e37cfdc4fd806c22f221 != c2ece71ceded38c04f376ca225cc3d6b463409986f263e9db1994a204b6844ec6e9695cfc52b7003e3b6961ceac96a18138981cb4ee5919649b263d542b82b114a57f52d8ba2a2f4540af4f7f49f0b1072a7f9891b97ceb5c61c20420143685f169add2373c4b505122eda2f5df3535d555637680dc5a722b83209519ae7f147c11a9158f48479c9926ea7412ac69d6a584ac97768e412e1514fa7b737ce389dc4bbd9df9cc422b95ccfc5926171fe20e9284834a77646878d7a34c485b3e7300c3589a8448b3bc53b980c6bced42938afc1398c2f1a3a6c2887c5be68a18039cb2fba983271c1202a73af95c79ae29faafb40973694b4afa523f2ef13b84300c39c1e876b193bfe5569753f88128cc08aaa9b63d1a16f80ef2554d7189c411f5869ca52c5b83fa36ff216b9c1d30062bebcfd2dc5bce0911934fda79a86f6e698ced759c3ff9b6477338f3da4f9cd8514ea9982ccafb341b2384dd902f3d1ab7ac61dd29c6f21ba5b862f3730e37cfdc4fd806c22f221]
...
./test/crypto_tests.cpp(185): error: in "crypto_tests/chacha20_testvector": check hexout == HexStr(outres) has failed [a3fbf07df3fa2fde4f376ca23e82737041605d9f4f4f57bd8cff2c1d4b7955ec2a97948bd3722915c8f3d337f7d370050e9e96d647b7c39f56e031ca5eb6250d4042e02785ececfa4b4bb5e8ead0440e20b6e8db09d881a7c6132f420e52795042bdfa7773d8a9051447b3291ce1411c680465552aa6c405b7764d5e87bea85ad00f8449ed8f72d0d662ab052691ca66424bc86d2df80ea41f43abf937d3259dc4b2d0dfb48a6c9139ddd7f76966e928e635553ba76c5c879d7b35d49eb2e62b0871cdac638939e25e8a1e0ef9d5280fa8ca328b351c3c765989cbcf3daa8b6ccc3aaf9f3979c92b3720fc88dc95ed84a1be059c6499b9fda236e7e818b04b0bc39c1e876b193bfe5569753f88128cc08aaa9b63d1a16f80ef2554d7189c411f5869ca52c5b83fa36ff216b9c1d30062bebcfd2dc5bce0911934fda79a86f6e698ced759c3ff9b6477338f3da4f9cd8514ea9982ccafb341b2384dd902f3d1ab7ac61dd29c6f21ba5b862f3730e37cfdc4fd806c22f221 != a3fbf07df3fa2fde4f376ca23e82737041605d9f4f4f57bd8cff2c1d4b7955ec2a97948bd3722915c8f3d337f7d370050e9e96d647b7c39f56e031ca5eb6250d4a57f52d8ba2a2f4540af4f7f49f0b1072a7f9891b97ceb5c61c20420143685f169add2373c4b505122eda2f5df3535d555637680dc5a722b83209519ae7f147c11a9158f48479c9926ea7412ac69d6a584ac97768e412e1514fa7b737ce389dc4bbd9df9cc422b95ccfc5926171fe20e9284834a77646878d7a34c485b3e7300871cdac638939e25e8a1e0ef9d5280fa8ca328b351c3c765989cbcf3daa8b6ccc3aaf9f3979c92b3720fc88dc95ed84a1be059c6499b9fda236e7e818b04b0bc39c1e876b193bfe5569753f88128cc08aaa9b63d1a16f80ef2554d7189c411f5869ca52c5b83fa36ff216b9c1d30062bebcfd2dc5bce0911934fda79a86f6e698ced759c3ff9b6477338f3da4f9cd8514ea9982ccafb341b2384dd902f3d1ab7ac61dd29c6f21ba5b862f3730e37cfdc4fd806c22f221]
./test/crypto_tests.cpp(272): error: in "crypto_tests/chacha20poly1305_testvectors": check cipher == expected_cipher has failed
...
./test/crypto_tests.cpp(337): error: in "crypto_tests/chacha20poly1305_testvectors": check decipher == plain has failed
*** 37 failures are detected in the test module "Bitcoin Core Test Suite"
```

This appears to be an endianness issue in vec_read_xor_write: the current implementation swaps the result of the XOR, but on big-endian systems we must swap the state vec before the XOR.
Changing it to:
```diff
diff --git a/src/crypto/chacha20_vec.ipp b/src/crypto/chacha20_vec.ipp
index 46a159ce01..cfc0535a92 100644
--- a/src/crypto/chacha20_vec.ipp
+++ b/src/crypto/chacha20_vec.ipp
@@ -174,8 +174,9 @@ ALWAYS_INLINE void vec_read_xor_write(std::span<const std::byte, 32> in_bytes, s
 {
     std::array<uint32_t, 8> temparr;
     memcpy(temparr.data(), in_bytes.data(), in_bytes.size());
-    vec256 tempvec = vec ^ (vec256){temparr[0], temparr[1], temparr[2], temparr[3], temparr[4], temparr[5], temparr[6], temparr[7]};
+    vec256 tempvec = vec;
     vec_byteswap(tempvec);
+    tempvec ^= (vec256){temparr[0], temparr[1], temparr[2], temparr[3], temparr[4], temparr[5], temparr[6], temparr[7]};
     temparr = {tempvec[0], tempvec[1], tempvec[2], tempvec[3], tempvec[4], tempvec[5], tempvec[6], tempvec[7]};
     memcpy(out_bytes.data(), temparr.data(), out_bytes.size());
 }
```

Makes it pass for me.
We should find a way to exercise this via CI and benchmarks, maybe similarly to SHA256AutoDetect in lines 585 to 690 in bdb8ead:

```cpp
std::string SHA256AutoDetect(sha256_implementation::UseImplementation use_implementation)
{
    std::string ret = "standard";
    Transform = sha256::Transform;
    TransformD64 = sha256::TransformD64;
    TransformD64_2way = nullptr;
    TransformD64_4way = nullptr;
    TransformD64_8way = nullptr;
#if !defined(DISABLE_OPTIMIZED_SHA256)
#if defined(HAVE_GETCPUID)
    bool have_sse4 = false;
    bool have_xsave = false;
    bool have_avx = false;
    [[maybe_unused]] bool have_avx2 = false;
    [[maybe_unused]] bool have_x86_shani = false;
    [[maybe_unused]] bool enabled_avx = false;
    uint32_t eax, ebx, ecx, edx;
    GetCPUID(1, 0, eax, ebx, ecx, edx);
    if (use_implementation & sha256_implementation::USE_SSE4) {
        have_sse4 = (ecx >> 19) & 1;
    }
    have_xsave = (ecx >> 27) & 1;
    have_avx = (ecx >> 28) & 1;
    if (have_xsave && have_avx) {
        enabled_avx = AVXEnabled();
    }
    if (have_sse4) {
        GetCPUID(7, 0, eax, ebx, ecx, edx);
        if (use_implementation & sha256_implementation::USE_AVX2) {
            have_avx2 = (ebx >> 5) & 1;
        }
        if (use_implementation & sha256_implementation::USE_SHANI) {
            have_x86_shani = (ebx >> 29) & 1;
        }
    }
#if defined(ENABLE_SSE41) && defined(ENABLE_X86_SHANI)
    if (have_x86_shani) {
        Transform = sha256_x86_shani::Transform;
        TransformD64 = TransformD64Wrapper<sha256_x86_shani::Transform>;
        TransformD64_2way = sha256d64_x86_shani::Transform_2way;
        ret = "x86_shani(1way;2way)";
        have_sse4 = false; // Disable SSE4/AVX2;
        have_avx2 = false;
    }
#endif
    if (have_sse4) {
#if defined(__x86_64__) || defined(__amd64__)
        Transform = sha256_sse4::Transform;
        TransformD64 = TransformD64Wrapper<sha256_sse4::Transform>;
        ret = "sse4(1way)";
#endif
#if defined(ENABLE_SSE41)
        TransformD64_4way = sha256d64_sse41::Transform_4way;
        ret += ";sse41(4way)";
#endif
    }
#if defined(ENABLE_AVX2)
    if (have_avx2 && have_avx && enabled_avx) {
        TransformD64_8way = sha256d64_avx2::Transform_8way;
        ret += ";avx2(8way)";
    }
#endif
#endif // defined(HAVE_GETCPUID)
#if defined(ENABLE_ARM_SHANI)
    bool have_arm_shani = false;
    if (use_implementation & sha256_implementation::USE_SHANI) {
#if defined(__linux__)
#if defined(__arm__) // 32-bit
        if (getauxval(AT_HWCAP2) & HWCAP2_SHA2) {
            have_arm_shani = true;
        }
#endif
#if defined(__aarch64__) // 64-bit
        if (getauxval(AT_HWCAP) & HWCAP_SHA2) {
            have_arm_shani = true;
        }
#endif
#endif
#if defined(__APPLE__)
        int val = 0;
        size_t len = sizeof(val);
        if (sysctlbyname("hw.optional.arm.FEAT_SHA256", &val, &len, nullptr, 0) == 0) {
            have_arm_shani = val != 0;
        }
#endif
    }
    if (have_arm_shani) {
        Transform = sha256_arm_shani::Transform;
        TransformD64 = TransformD64Wrapper<sha256_arm_shani::Transform>;
        TransformD64_2way = sha256d64_arm_shani::Transform_2way;
        ret = "arm_shani(1way;2way)";
    }
#endif
#endif // DISABLE_OPTIMIZED_SHA256
    assert(SelfTest());
    return ret;
}
```
Thanks for catching this! I forgot to mention in the PR description that big-endian was best-effort and untested. I figured our CI would catch any obvious bugs. Agree it's not great that it didn't :(
Thanks for the quick fix too :)
I had the same disappointment when implementing the obfuscation optimization. :)
@maflcko do we have a big-endian nightly that supports vectorized operations? Would it have caught this?
See #33436, but it was slow despite having the gui and the fuzz tests disabled. I am running it as part of nightly, so any issues will be caught before a release, but I am not sure if catching everything in pull requests is possible.
Maybe it is possible to split the task into two: One for a cross-compile, which should be faster than a "native" compile via qemu. And another to run the tests, similar to the windows-cross tests.
Actually, I ran the CI config locally, but it didn't catch this, as the platform opts out. The CI doesn't run the sed -i "s/# define CHACHA20_VEC_DISABLE_STATES_2/\/\/# define CHACHA20_VEC_DISABLE_STATES_2/" src/crypto/chacha20_vec_base.cpp && \ portion, so it would not have caught this, even if the task was run.
Thanks for checking @maflcko, that's why I asked.
The vectorized operations are obviously supported, but for some reason were not triggered for me either.
@theuni, is this just an emulation anomaly or we were just too cautious?
I understood that @achow101 has access to a real big-endian power9 machine that we might be able to test this on when it's ready.
@l0rinc As the code is written at the moment, all platforms must opt-in to vectorization as opposed to opting out. I'm not sure that's the best approach, but I figured that was a reasonable starting point.
My reasoning for that was: consider non-x86, non-arm+neon platforms. For the most part, I'm assuming they're under-powered. Enabling (for example) the 4-blocks-at-a-time path for mipsel would probably cause a drastic slowdown. Obviously there are lots of other powerful platforms, but I figured those would be added over time.
So if there's a big-endian architecture (or more ideally, a specific instruction set ala __ARM_NEON) that demonstrates a performance gain from calculating multiple states at once, it should be added here.
Of course, we could go the opposite direction and opt all platforms IN to all states, and instead opt-out the slow ones.
tl;dr: If we want a big-endian platform to be supported, we need to opt one in or provide an override.
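To make the opt-in concrete, a hypothetical sketch of what adding another platform to the ifdef ladder might look like (the POWER/VSX condition and the choice of which state counts to leave enabled are pure assumptions; they would need local benchmarks like the x86-64/NEON ones did):

```cpp
#if defined(__x86_64__) || defined(__amd64__)
// existing x86-64 opt-in (see the PR)
#elif defined(__ARM_NEON)
// existing arm+neon opt-in (see the PR)
#elif defined(__powerpc64__) && defined(__VSX__)   // hypothetical new opt-in
# define CHACHA20_VEC_DISABLE_STATES_16
# define CHACHA20_VEC_DISABLE_STATES_8
# define CHACHA20_VEC_DISABLE_STATES_6
// leave the 4- and 2-state paths enabled for this platform
#else
// everyone else falls back to the scalar implementation
# define CHACHA20_VEC_DISABLE_STATES_16
# define CHACHA20_VEC_DISABLE_STATES_8
# define CHACHA20_VEC_DISABLE_STATES_6
# define CHACHA20_VEC_DISABLE_STATES_4
# define CHACHA20_VEC_DISABLE_STATES_2
#endif
```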
```cpp
template <size_t BITS, size_t I, size_t ITER = 0>
ALWAYS_INLINE void arr_add_xor_rot(std::array<vec256, I>& arr0, const std::array<vec256, I>& arr1, std::array<vec256, I>& arr2)
{
    vec256& x = std::get<ITER>(arr0);
    const vec256& y = std::get<ITER>(arr1);
    vec256& z = std::get<ITER>(arr2);

    x += y;
    z ^= x;
    vec_rotl<BITS>(z);

    if constexpr(ITER + 1 < I ) arr_add_xor_rot<BITS, I, ITER + 1>(arr0, arr1, arr2);
}
```
Hmmm, it seems to me we're doing heavy recursive templates here to unroll loops.
My understanding (and experience with the mentioned Obfuscation PR) is that modern compilers are good at unrolling fixed-bound loops.
Can you please try if this also works and results in the same speedup?
```diff
-template <size_t BITS, size_t I, size_t ITER = 0>
-ALWAYS_INLINE void arr_add_xor_rot(std::array<vec256, I>& arr0, const std::array<vec256, I>& arr1, std::array<vec256, I>& arr2)
-{
-    vec256& x = std::get<ITER>(arr0);
-    const vec256& y = std::get<ITER>(arr1);
-    vec256& z = std::get<ITER>(arr2);
-    x += y;
-    z ^= x;
-    vec_rotl<BITS>(z);
-    if constexpr(ITER + 1 < I ) arr_add_xor_rot<BITS, I, ITER + 1>(arr0, arr1, arr2);
-}
+template <size_t BITS, size_t I>
+ALWAYS_INLINE void arr_add_xor_rot(std::array<vec256, I>& arr0, const std::array<vec256, I>& arr1, std::array<vec256, I>& arr2)
+{
+    for (size_t i{0}; i < I; ++i) {
+        arr0[i] += arr1[i];
+        arr2[i] ^= arr0[i];
+        vec_rotl<BITS>(arr2[i]);
+    }
+}
```
(nit: the size is often N instead of I)
Alternatively, with a compile-time loop:

```cpp
/** Perform add/xor/rotate for the round function */
template <size_t BITS, size_t I>
ALWAYS_INLINE void arr_add_xor_rot(std::array<vec256, I>& arr0, const std::array<vec256, I>& arr1, std::array<vec256, I>& arr2)
{
    [&]<size_t... ITER>(std::index_sequence<ITER...>) {
        ((
            arr0[ITER] += arr1[ITER],
            arr2[ITER] ^= arr0[ITER],
            vec_rotl<BITS>(arr2[ITER])
        ), ...);
    }(std::make_index_sequence<I>());
}
```
See the updated PR description, where I touched on this.
I found (with lots of experimentation here) that different compilers use complicated and unpredictable heuristics to decide whether or not to unroll loops. Even if an unroll-able loop is detected, unrolling may be skipped because of the function size (as is the case here because it's huge).
So, I'd like to not take chances on unrolling. That means one of:

- Manual unrolling
- Macro-based (as in REPEAT10())
- A pragma
- Recursive template based

c++26 introduces template for, which is what we really want.
I'm up for whichever of those is generally preferred, as long as it's explicit rather than implicit.
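For comparison, a sketch of what the pragma option would look like on the loop variant suggested above (GCC spelling shown; this is illustrative, not the PR's code, and whether it survives every compiler's size heuristics is exactly the open question):

```cpp
// Explicit unroll request instead of recursion. gcc honors "#pragma GCC unroll N";
// clang has "#pragma clang loop unroll_count(N)" for the same purpose.
template <size_t BITS, size_t I>
ALWAYS_INLINE void arr_add_xor_rot(std::array<vec256, I>& arr0, const std::array<vec256, I>& arr1, std::array<vec256, I>& arr2)
{
#pragma GCC unroll 16
    for (size_t i = 0; i < I; ++i) {
        arr0[i] += arr1[i];
        arr2[i] ^= arr0[i];
        vec_rotl<BITS>(arr2[i]);
    }
}
```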
Is there any way we could help that would make you reconsider? I don't mind running the benchmarks on a few platforms, as long as we can have simpler code. The current template magic is not something I would like to see more of - modern C++20 should be able to handle the situations we have here. I don't mind experimenting with this if you don't want to.
@theuni See the std::make_index_sequence based approach I demonstrated above. It's template based, but doesn't need recursion, and doesn't rely on compiler unrolling. It simply expands to an expression, e.g. (a[0] = v, a[1] = v, a[2] = v, ...).
https://www.agner.org/optimize/optimizing_cpp.pdf contains a few useful examples:

The automatic vectorization works best if the following conditions are satisfied:

- Use a compiler with good support for automatic vectorization, such as Gnu, Clang, or Intel.
- Use the latest version of the compiler. The compilers are becoming better and better at vectorization.
- If the arrays or structures are accessed through pointers or references then tell the compiler explicitly that pointers do not alias, if appropriate, using the __restrict or restrict keyword.
- Use appropriate compiler options to enable the desired instruction set (/arch:SSE2, /arch:AVX etc. for Windows, -msse2, -mavx512f, etc. for Linux).
- Use the less restrictive floating point options. For Gnu and Clang compilers, use the options -O2 -fno-trapping-math -fno-math-errno -fno-signed-zeros (-ffast-math works as well, but functions like isnan(x) do not work under -ffast-math).
- Align arrays and big structures by 16 for SSE2, preferably 32 for AVX and preferably 64 for AVX512.
- The loop count should preferably be a constant that is divisible by the number of elements in a vector.
- If arrays are accessed through pointers so that the alignment is not visible in the scope of the function where you want vectorization then follow the advice given above.
- Minimize the use of branches at the vector element level.
- Avoid table lookup at the vector element level.

Similarly, https://en.algorithmica.org/hpc/simd/auto-vectorization has useful hints:

The other way, specific to SIMD, is the "ignore vector dependencies" pragma. It is a general way to inform the compiler that there are no dependencies between the loop iterations:

```cpp
#pragma GCC ivdep
for (int i = 0; i < n; i++)
```
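Completing that quoted fragment with a trivial, hypothetical body just to show the shape: the pragma asserts the iterations don't depend on each other, so the compiler is free to vectorize.

```cpp
// a, b and c are assumed to point at non-overlapping arrays of length n.
void add_arrays(int* __restrict a, const int* __restrict b, const int* __restrict c, int n)
{
#pragma GCC ivdep
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}
```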
```cpp
#if defined(__has_attribute)
# if __has_attribute(always_inline)
#  define ALWAYS_INLINE __attribute__ ((always_inline)) inline
# endif
#endif

#if !defined(ALWAYS_INLINE)
# define ALWAYS_INLINE inline
#endif
```
Could we use the existing definition instead?

Lines 19 to 25 in e16c22f:

```cpp
#if defined(__GNUC__)
# define ALWAYS_INLINE inline __attribute__((always_inline))
#elif defined(_MSC_VER)
# define ALWAYS_INLINE __forceinline
#else
# error No known always_inline attribute for this platform.
#endif
```
```cpp
#if !defined(CHACHA20_VEC_DISABLE_STATES_16)
    while(in_bytes.size() >= CHACHA20_VEC_BLOCKLEN * 16) {
        multi_block_crypt<16>(in_bytes, out_bytes, state0, state1, state2);
        state2 += (vec256){16, 0, 0, 0, 16, 0, 0, 0};
        in_bytes = in_bytes.subspan(CHACHA20_VEC_BLOCKLEN * 16);
        out_bytes = out_bytes.subspan(CHACHA20_VEC_BLOCKLEN * 16);
    }
#endif
#if !defined(CHACHA20_VEC_DISABLE_STATES_8)
    while(in_bytes.size() >= CHACHA20_VEC_BLOCKLEN * 8) {
        multi_block_crypt<8>(in_bytes, out_bytes, state0, state1, state2);
        state2 += (vec256){8, 0, 0, 0, 8, 0, 0, 0};
        in_bytes = in_bytes.subspan(CHACHA20_VEC_BLOCKLEN * 8);
        out_bytes = out_bytes.subspan(CHACHA20_VEC_BLOCKLEN * 8);
    }
```
There's a lot of repetition here - could we extract that to an ALWAYS_INLINE lambda and have something like:

```cpp
if constexpr (ENABLE_16) process_blocks(16);
else if constexpr (ENABLE_8) process_blocks(8);
...
```

I haven't implemented it locally, maybe it's naive, let me know what you think.
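A hedged sketch of that deduplication idea (process_blocks and the constexpr flags are hypothetical names building on the earlier if constexpr suggestion; the loop body and the signatures are lifted from the call sites quoted above and assumed, not verified against the PR):

```cpp
template <size_t N>
ALWAYS_INLINE void process_blocks(std::span<const std::byte>& in_bytes, std::span<std::byte>& out_bytes,
                                  const vec256& state0, const vec256& state1, vec256& state2)
{
    while (in_bytes.size() >= CHACHA20_VEC_BLOCKLEN * N) {
        multi_block_crypt<N>(in_bytes, out_bytes, state0, state1, state2);
        // Advance the block counter held in state2 by N for both packed blocks.
        state2 += (vec256){uint32_t(N), 0, 0, 0, uint32_t(N), 0, 0, 0};
        in_bytes = in_bytes.subspan(CHACHA20_VEC_BLOCKLEN * N);
        out_bytes = out_bytes.subspan(CHACHA20_VEC_BLOCKLEN * N);
    }
}

// Call sites would then collapse to one line per width, e.g.:
//   if constexpr (!CHACHA20_VEC_DISABLE_STATES_16) process_blocks<16>(in_bytes, out_bytes, state0, state1, state2);
//   if constexpr (!CHACHA20_VEC_DISABLE_STATES_8)  process_blocks<8>(in_bytes, out_bytes, state0, state1, state2);
```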
```cpp
ALWAYS_INLINE void arr_set_vec256(std::array<vec256, I>& arr, const vec256& vec)
{
    std::get<ITER>(arr) = vec;
    if constexpr(ITER + 1 < I ) arr_set_vec256<I, ITER + 1>(arr, vec);
```
I believe it's possible to use compile-time loops here instead of recursive templates (if runtime loops don't get optimized sufficiently):

```cpp
/** Store a vector in all array elements */
template <size_t I>
ALWAYS_INLINE void arr_set_vec256(std::array<vec256, I>& arr, const vec256& vec)
{
    [&]<size_t... ITER>(std::index_sequence<ITER...>) {
        ((std::get<ITER>(arr) = vec), ...);
    }(std::make_index_sequence<I>());
}
```
I avoided lambdas as I've historically observed lots of optims being skipped (with clang, at least) when using them. But since you/@ajtowns/@l0rinc have all made the same comment, I'll play around and see if these indeed compile down to nothing as one would hope.
It can be done with a helper function too, but that's not as concise:

```cpp
template <size_t I, size_t... ITER>
ALWAYS_INLINE void arr_set_vec256_inner(std::array<vec256, I>& arr, const vec256& vec, std::index_sequence<ITER...>)
{
    ((std::get<ITER>(arr) = vec), ...);
}

/** Store a vector in all array elements */
template <size_t I>
ALWAYS_INLINE void arr_set_vec256(std::array<vec256, I>& arr, const vec256& vec)
{
    arr_set_vec256_inner(arr, vec, std::make_index_sequence<I>());
}
```
Writing this as:

```cpp
template<size_t... ITER> using iseq = std::index_sequence<ITER...>;
static constexpr auto ISEQ = std::make_index_sequence<I>();

template<size_t... ITER>
ALWAYS_INLINE void arr_set_vec256(iseq<ITER...>, std::array<vec256, I>& arr, const vec256& vec)
{
    ((std::get<ITER>(arr) = vec), ...);
}

...

arr_set_vec256(ISEQ, arr0, num256);
```

doesn't seem too bad, and avoids lambdas? Possibly still a bit clumsy if you can't capture I by putting everything in a class? Might still be a bit clumsy for add_xor_rot.
Is there a particular reason for the aversion to the current approach? It seems easier to read and probably also to debug to me.
An additional note about the actual vectorizing algorithm itself... There's a different (and arguably more obvious) algorithm that's possible when calculating exactly 8 states: rather than loading each vec256 with consecutive blocks (as is done here), each of the 16 state words is broadcast across all 8 blocks:

```cpp
vec256 x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15;
broadcast(x0, 0x61707865);
broadcast(x1, 0x3320646e);
broadcast(x2, 0x79622d32);
broadcast(x3, 0x6b206574);
broadcast(x4, input[0]);
...
broadcast(x15, input[11]);

QUARTERROUND( x0, x4, x8,x12);
QUARTERROUND( x1, x5, x9,x13);
...

vec256 j0, j1, j2, j3, ... j31;
extract_column(x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15, 0, j0, j1);
extract_column(x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15, 1, j2, j3);
...
extract_column(x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15, 15, j30, j31);

vec_read_xor_write(in_bytes, out_bytes, j0);
vec_read_xor_write(in_bytes, out_bytes, j1);
...
```

The problem with this is the overhead of the transpose at the end. My experiments showed (on x86_64+avx2, at least) that this was actually slower than what's currently implemented. It's possible that this approach is better for other architectures, but I left it out of this PR to reduce the initial complexity.

Edit: This is what the linux kernel does for x86_64+avx2. I grabbed their .S and hacked it into Core to benchmark, only to find that the impl here actually outperformed their hand-written asm. That was neat :)
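For what it's worth, the broadcast step in that alternative layout is trivial with the vector extensions, since initializing every lane with the same word is just a splat (helper name hypothetical, mirroring the pseudocode above):

```cpp
ALWAYS_INLINE void broadcast(vec256& v, uint32_t word)
{
    // Fill all 8 lanes with the same 32-bit word.
    v = (vec256){word, word, word, word, word, word, word, word};
    // gcc/clang also accept the scalar-splat form: v = (vec256){} + word;
}
```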
Ok, after some experimenting, I am very hesitant to add the helpers as suggested above. For example, consider @ajtowns's seemingly innocuous Repeat function:

```cpp
template <size_t REPS, typename Fn>
ALWAYS_INLINE void Repeat(Fn&& fn)
{
    if constexpr (REPS > 0) {
        fn();
        Repeat<REPS-1>(std::forward<Fn>(fn));
    }
}
```

This causes a 10% slowdown on this branch, but with avx2 the performance is abysmal:

Baseline(master): …
This PR: …
This PR + avx2 commits: …
202512-pr34083-templates + avx2 commits: …

The problem in AJ's branch is that some functions end up un-inlined. For clang, it's …. Both are fixed with:

```diff
diff --git a/src/crypto/chacha20_vec.ipp b/src/crypto/chacha20_vec.ipp
index 344c68259a5..16d7da45633 100644
--- a/src/crypto/chacha20_vec.ipp
+++ b/src/crypto/chacha20_vec.ipp
@@ -164 +164 @@ public:
-    Repeat<10>([&]() {
+    Repeat<10>([&]() __attribute__ ((always_inline)) {
```

That means, to be safe, every lambda would need that attribute, which is pretty ugly and easy to forget. So I think my aversion to lambdas in this code was somewhat justified.

Further, clang's inlining docs state that variadic functions aren't inlined. Experimenting now, that's obviously not true as of v21-git. But clearly it was in the recent past.

So... I'd really prefer to keep things as simple here as possible. I know the recursive template functions aren't exactly pretty, but I don't think they're THAT bad? Plus, with c++26, it all goes away in favor of template for.
Could we get some comments in the code as to the compiler versions (and architecture) that failed to be efficient enough with lambdas, to hopefully avoid future people modernizing the code without checking that these problems have gone away? Presumably it will be a long time before we can modernize to template for.
(Only thing that I do think is "that bad" about the recursive templates is using ….)
Sure, I can definitely add some more context and clean up the wonky variable names. Another brain dump after thinking about all this some more last night:
There are a few considerations there. @sipa's …. So to be safe, whatever we do, I think we should strive to annotate all functions. And imo, annotating a bunch of lambdas with an inline attribute feels weird. (As an aside: there's also the ….)

In my testing, loop unrolling always showed slightly better performance (presumably because calculating multiple blocks is otherwise branch-free), but it's not nearly as performance-critical as the inlining. For example, skipping unrolling of ….

My primary goal with this code (and hopefully setting a precedent for other multi-arch simd code... poly1305 is next ;) is to be as explicit to the compiler about what we want as possible. Ideally as portably as possible. So if we want our functions inlined or loops unrolled, we should attempt to communicate those things as opposed to leaving them implicit. Unfortunately, modern c++ doesn't have ways to express either of those yet, but we can make our intentions clear enough.
You could …

The above was kind-of a joke, but perhaps you could combine it with a clang-tidy plugin that lets CI check that all lambdas in a particular namespace are annotated in that way? That might be both clear to compilers and reasonably friendly to human authors/reviewers? (Explicit recursive templates and prohibiting lambdas are fine by me though; that's still a big step up from inline asm.)
I have measured its effect on IBD and am happy to say it produces a measurable speedup.

IBD | 926619 blocks | dbcache 450 | i7-hdd | x86_64 | Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz | 8 cores | 62Gi RAM | ext4 | HDD
Presumably that indicates this IBD run is compute bound (vs disk or network), and that ChaCha20 is using up maybe 7% of CPU time (or 7% of a core when we're bottlenecked on something single-threaded) prior to this PR. That seems like a lot? Is this due to FastRandomContext, or something else? Or is it just that obfuscating all the block data is a large component of IBD CPU currently? This seems a bit surprising to me. |
```cpp
namespace CHACHA20_NAMESPACE {
#endif

void chacha20_crypt_vectorized(std::span<const std::byte>& in_bytes, std::span<std::byte>& out_bytes, const std::array<uint32_t, 12>& input) noexcept
```
I think this function should get a short description too. Wasn't immediately clear to me that it attempts to peel off larger blocks first before potentially finishing with a few smaller blocks.
Not sure if you missed this in the description, but that's exactly what this is trying to improve:

IBD would indeed be slowed by Chacha20 via single-threaded bip324 handling on the network thread. While profiling my POC multi-process net binary I observed …. So it's not surprising to me at all that 3x'ing that function (it's only 2x here; avx2/avx512 improve performance even more) speeds up IBD.
Co-authored-by: Lőrinc <pap.lorinc@gmail.com>
__builtin_shufflevector emits non-optimal code for this case, so avoid using it here. See: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107563#c14
I finally managed to track down the gcc slowdown that @l0rinc mentioned during last week's IRC meeting. The culprit was this gcc bug. Thankfully, it's easily worked around by simply not calling the guilty builtin. Also pushed @l0rinc's fix for big-endian. Now that gcc/clang are more on par and the impl seems feasible again, I'll address the other feedback.
Exploit modern simd to calculate 2/4/6/8/16 states at a time depending on the size of the input.
This demonstrates a 2x speedup on x86-64 and 3x for arm+neon. Platforms which require runtime detection (avx2/avx512) improve performance even further, and will come as a follow-up.
Rather than hand-writing assembly or using arch-specific intrinsics, this is written using compiler built-ins understood by gcc and clang.
In practice (at least on x86_64 and armv8), the compilers are able to produce assembly that's not much worse than hand-written.
This means that every architecture can benefit from its own vectorized instructions without having to write/maintain an implementation for each one. But because each will vary in its ability to exploit the parallelism, we allow (via ifdefs) each architecture to opt out of some or all multi-state calculation at compile-time.
Here, as a starting point, x86-64 and arm+neon have been defined based on local benchmarks.
Local profiling revealed that chacha20 accounts for a substantial amount of the network thread's time. It's not clear to me if speeding up chacha20 will improve network performance/latency, but it will definitely make it more efficient.
This is part 1 of a series of PR's for chacha20. I think it makes sense to take a look at the generic implementation and tune the architecture-specific defines for parallel blocks before adding the runtime-dependent platforms.
My WIP branch which includes avx2/avx512 can be seen here: https://github.com/theuni/bitcoin/commits/chacha20-vectorized/
I've been hacking on this for quite a while, trying every imaginable tweak and comparing lots of resulting asm/ir. I'm happy to answer any questions about any choices made which aren't immediately obvious.
Edit: some more impl details:
I wrestled with gcc/clang a good bit, tweaking something and comparing the generated code output. A few things I found, which may explain some of the decisions I made:
- … vzeroupper). Hence, all helpers are decorated with ALWAYS_INLINE.
- … [] is clumsy and verbose, it avoids compiler-specific layout assumptions. Other things I tried (which produced the same asm): using unaligned_vec256 __attribute__((aligned (1))) = vec256, __builtin_masked_load.
- … #pragma GCC unroll n, and I tried to avoid macros for loops, hence the awkward recursive inline template loops. But in practice, I see those unrolled 100% of the time.
- … std::get in the helpers for some extra compile-time safety (this actually pointed out some off-by-one's that would've been annoying to track down).
- … vec256 are passed by reference to avoid an annoying clang warning about returning a vector changing the abi (this is specific to x86 when not compiling with -avx). Even though our functions are all inlined, I didn't see any harm in making that adjustment.