Refactor: Final iteration on load/store #3869
Conversation
Rebased after #3888.
This refactors the load/store helpers to avoid duplicating all overloads for big-endian and little-endian functionality. We hope to reduce the complexity and maintenance burden of this code. As a side effect, this also centralizes some of the preprocessor defines into low-level helper methods; the actual logic now uses `if constexpr` based on those helpers. Also, the support for legacy overloads (using pointers or C-arrays) produces much less clutter now. Additionally, the load/store helpers can now handle strong types that wrap unsigned integer values. Co-Authored-By: Fabian Albert <fabian.albert@rohde-schwarz.com>
This allows loading and storing of entire collections of unsigned integers from/into byte ranges. That can be used to extract a digest from the hash function's internal state array, for instance. Co-Authored-By: Fabian Albert <fabian.albert@rohde-schwarz.com>
Rebased and resolved conflicts after #3908 got merged.
randombit
left a comment
I’m a little skeptical of this change simply because it’s +900/-300 just to keep feature parity, and the new code is rather more complicated than what came before; that does not seem like much of a simplification to me 🙂 but I can see the benefits regarding strong types and consistent constexpr.
As a smoke test, botan speed seems fine to me, but 7% for AES-NI seems quite surprising! For one, AES-NI uses SIMD_4x32::load_le, which doesn’t use this load infrastructure at all; it is basically just a _mm_loadu_si128. (And likewise for stores.)
I’ll look into benchmarks on my machine ASAP. Approving insofar as, assuming we can clearly demonstrate no significant regressions, the change seems OK.
I'd like to argue that this also adds almost 400 lines of test code. 🙂 However, admittedly, the code is denser than before, that's for sure.
I checked the timings of this PR vs. master on some algorithms that seemed likely to be impacted by load/store performance (especially hash functions); nothing notable, so OK to merge.
Description
This should be the final iteration on the big/little endian helpers for now (modulo the eventual removal of the legacy ptr-based overloads).
The patch fulfills three main objectives:

1. Deduplicate the big/little-endian overloads. This is achieved by introducing `detail::load_any<>` and `detail::store_any<>` with a template parameter `Endianness`. By passing this all the way down to the lowest-level overload, we can make the endianness distinction once and implement the convenience overloads without duplicating each of them (plus further duplication for the legacy overloads).
2. Support strong types and enums. This allows for things like `load_be<MyIntStrongType>(some_bytes)` or `store_be(some_enum_value)`, which makes working with strong types and enums more convenient when loading/storing them.
3. Load/store ranges of integers. Before, one could load/store multiple variables at once (passed as variadic parameters). Now, it is also possible to pass ranges of integers (similarly to `typecast_copy()`). That is useful for initializing and extracting internal hash states (which are collections of `uint64_t`), for instance.

The whole range of features described above may also be used at compile time, for instance to initialize algorithm parameters that need a certain endianness.
Quick and Dirty Benchmark
Running `./botan speed` both on the feature branch and on the `master` commit it is based on, both compiled with clang-15 on Ubuntu 22.04 (running in WSL). Is that a reasonable smoke test?

Many algorithms actually seem to benefit from this patch (without modifying the load/store overloads they use). For instance, AES encryption/decryption (using the NI extensions) clocks in at a 7% speedup. Is that even realistic? 😲 Keccak seems to suffer slightly; for instance, SHAKE throughput is about 2% lower.
Here's the full data set: bench_loadstore_refactoring.txt