Conversation
|
After getting all of the backends to share a common representation when using Karatsuba + powers-of-H, this feels like an annoying backstep, as it's using a I was a bit surprised by the benchmark results, as at least on my M1 Max here it seems to actually be slower than Karatsuba + powers-of-H for longer messages: I tried quickly modifying it to operate over 8-blocks which improved performance slightly but not enough to make up for the regression on larger messages. I suppose Karatsuba+powers-of-H is only doing a (more expensive) reduction every 8-blocks, whereas R/F is doing two simpler reductions. I'm curious if the algorithm were properly adapted to 8-blocks, so it deferred reductions all 8-blocks the way Karatsuba+powers-of-H does, if it would be faster in that case. I haven't tried that yet. This algorithm requires a key expansion twice the size of powers-of-H (it's effectively powers-of-H + a |
|
Sidebar: I ended up liking how I reorganized the crate in this PR (effectively a hybrid of how it used to be and the |
|
I tried turning down So yes, R/F wins, but at a cost of twice the memory usage. I think we should probably adopt it, but then the question becomes do we leave the implementation like this, nice and simple with a one-size-fits-most powers-of-H, or do we make it generic again for people who want to optimize for long messages? |
0a4fc63 to
3d7e37d
Compare
3d7e37d to
1b706b9
Compare
Adapts the Apache 2.0+MIT licensed implementation of the Reduction/Field algorithm (#290, see https://eprint.iacr.org/2025/2171) from the `hpcrypt-mac` crate: https://github.com/seceq/hpcrypt/ Sources were taken as of upstream commit 6c22585, then modified to fit the structure of this crate (e.g. removed buffering). Benchmarks show for 4 powers-of-H, this algorithm is ~66% faster for short messages on ARMv8, and ~20% faster for longer messages. It requires double the memory footprint, however, and for the same size is faster than our previous Karatsuba + 8 x powers-of-H for short messages, but slower for longer ones. This removes the generic parameter, since the implementation we're adapting isn't currently generic, however it could potentially be returned if there's a desire for higher performance on long messages. Additionally support for using a custom init block has been removed, but could be added back. I was noticing there are some oddities to it like `Reset` will always use `0` rather than the custom init block. I think perhaps instead of trying to shoehorn that stuff into `Polyval`/`GHash` and adding undue complexity there, we should instead expose bits through `polyval::hazmat` (like the core R/F algorithm) which make it possible to build your own customized high performance GHASH/POLYVAL-like UHF that can be tweaked in any way desired.
1b706b9 to
2c394f5
Compare
|
Note: also double checked this against |
## Added - `hazmat` feature that exposes `FieldElement` type (#279) ## Changed - Edition changed to 2024 and MSRV bumped to 1.85 (#228) - Relax MSRV policy and allow MSRV bumps in patch releases - Replace `polyval_force_soft` with `polyval_backend="soft"` (#259) - Use `cpubits` crate for `soft` backend selection (#261) - Bump `cpufeatures` to v0.3 (#292) - Use [Reduction/Field algorithm] for parallel block processing on `aarch64`/`x86(_64)` (#294) - `polyval::mulx` moved to `FieldElement::mulx` (#296) - Update to `universal-hash` v0.6 (#310) ## Removed - `cfg(polyval_armv8)` - now enabled by default (#214)
Adapts the Apache 2.0+MIT licensed implementation of the Reduction/Field algorithm (#290, see https://eprint.iacr.org/2025/2171) from
hpcrypt-mac: https://github.com/seceq/hpcrypt/Sources were taken as of upstream commit 6c22585, then modified to fit the structure of this crate (e.g. removed buffering).
Benchmarks show for 4 powers-of-H, this algorithm is ~66% faster for short messages on ARMv8, and ~20% faster for longer messages. It requires double the memory footprint, however, and for the same size is faster than our previous Karatsuba + 8 x powers-of-H for short messages, but slower for longer ones.
This removes the generic parameter, since the implementation we're adapting isn't currently generic, however it could potentially be returned if there's a desire for higher performance on long messages.
Additionally support for using a custom init block has been removed, but could be added back. I was noticing there are some oddities to it like
Resetwill always use0rather than the custom init block. I think perhaps instead of trying to shoehorn that stuff intoPolyval/GHashand adding undue complexity there, we should instead expose bits throughpolyval::hazmat(like the core R/F algorithm) which make it possible to build your own customized high performance GHASH/POLYVAL-like UHF that can be tweaked in any way desired.