
polyval: implement R/F algorithm #294

Merged
tarcieri merged 1 commit into master from polyval/rf-algorithm on Feb 8, 2026

Conversation


@tarcieri tarcieri commented Feb 8, 2026

Adapts the Apache 2.0+MIT licensed implementation of the Reduction/Field algorithm (#290, see https://eprint.iacr.org/2025/2171) from hpcrypt-mac: https://github.com/seceq/hpcrypt/

Sources were taken as of upstream commit 6c22585, then modified to fit the structure of this crate (e.g. removed buffering).

Benchmarks on ARMv8 show that with 4 powers-of-H, this algorithm is ~66% faster for short messages and ~20% faster for longer ones. It requires double the memory footprint, however; at the same overall key-schedule size it is faster than our previous Karatsuba + 8 x powers-of-H implementation for short messages, but slower for longer ones.

This removes the generic parameter, since the implementation we're adapting isn't currently generic; however, it could potentially be reintroduced if there's a desire for higher performance on long messages.

Additionally, support for using a custom init block has been removed, though it could be added back. I noticed some oddities with it, e.g. Reset always resets the state to 0 rather than to the custom init block. Rather than shoehorning that functionality into Polyval/GHash and adding undue complexity there, I think we should instead expose building blocks through polyval::hazmat (like the core R/F algorithm) that make it possible to build your own customized, high-performance GHASH/POLYVAL-like UHF that can be tweaked in any way desired.


tarcieri commented Feb 8, 2026

After getting all of the backends to share a common representation when using Karatsuba + powers-of-H, this feels like an annoying step backward, as it goes back to using a union across backend-specific state representations.

I was a bit surprised by the benchmark results, as at least on my M1 Max here it seems to actually be slower than Karatsuba + powers-of-H for longer messages:

polyval/update_padded/10
                        time:   [7.5722 ns 7.6091 ns 7.6540 ns]
                        thrpt:  [1.2168 GiB/s 1.2240 GiB/s 1.2299 GiB/s]
                 change:
                        time:   [−39.217% −38.757% −38.378%] (p = 0.00 < 0.05)
                        thrpt:  [+62.281% +63.285% +64.519%]
                        Performance has improved.

polyval/update_padded/100
                        time:   [32.548 ns 32.694 ns 32.847 ns]
                        thrpt:  [2.8354 GiB/s 2.8486 GiB/s 2.8614 GiB/s]
                 change:
                        time:   [−62.433% −62.232% −62.054%] (p = 0.00 < 0.05)
                        thrpt:  [+163.53% +164.78% +166.19%]
                        Performance has improved.

polyval/update_padded/1000
                        time:   [164.98 ns 165.57 ns 166.20 ns]
                        thrpt:  [5.6036 GiB/s 5.6248 GiB/s 5.6451 GiB/s]
                 change:
                        time:   [−0.2885% +0.1648% +0.6194%] (p = 0.50 > 0.05)
                        thrpt:  [−0.6156% −0.1645% +0.2893%]
                        No change in performance detected.

polyval/update_padded/10000
                        time:   [1.4803 µs 1.4860 µs 1.4938 µs]
                        thrpt:  [6.2348 GiB/s 6.2673 GiB/s 6.2915 GiB/s]
                 change:
                        time:   [+70.640% +71.959% +73.859%] (p = 0.00 < 0.05)
                        thrpt:  [−42.482% −41.847% −41.397%]
                        Performance has regressed.

I tried quickly modifying it to operate over 8 blocks, which improved performance slightly, but not enough to make up for the regression on larger messages. I suppose Karatsuba + powers-of-H only performs a (more expensive) reduction once every 8 blocks, whereas R/F performs two simpler reductions each iteration.

I'm curious whether, if the algorithm were properly adapted to 8 blocks so it deferred reductions across all 8 blocks the way Karatsuba + powers-of-H does, it would be faster in that case. I haven't tried that yet.

This algorithm requires a key expansion twice the size of powers-of-H (it's effectively powers-of-H + a D value computed for each power-of-H), so it seems like there's also a compute/memory tradeoff.


tarcieri commented Feb 8, 2026

Sidebar: I ended up liking how I reorganized the crate in this PR. It's effectively a hybrid of how it used to be and the FieldElement refactor: it uses an entirely portable implementation of FieldElement and builds the soft backend on top of that, rather than coupling the soft backend to FieldElement itself.


tarcieri commented Feb 8, 2026

I tried turning DEFAULT_PARALLELISM for Karatsuba + powers-of-H down from 8 to 4 and benchmarked that against R/F. The results below show the improvement of R/F over Karatsuba:

polyval/update_padded/10
                        time:   [7.5540 ns 7.5604 ns 7.5679 ns]
                        thrpt:  [1.2306 GiB/s 1.2318 GiB/s 1.2329 GiB/s]
                 change:
                        time:   [−40.267% −40.020% −39.769%] (p = 0.00 < 0.05)
                        thrpt:  [+66.029% +66.723% +67.412%]
                        Performance has improved.

polyval/update_padded/100
                        time:   [32.941 ns 33.074 ns 33.213 ns]
                        thrpt:  [2.8041 GiB/s 2.8159 GiB/s 2.8272 GiB/s]
                 change:
                        time:   [−31.774% −31.406% −31.036%] (p = 0.00 < 0.05)
                        thrpt:  [+45.003% +45.784% +46.572%]
                        Performance has improved.

polyval/update_padded/1000
                        time:   [167.07 ns 168.29 ns 169.70 ns]
                        thrpt:  [5.4881 GiB/s 5.5342 GiB/s 5.5743 GiB/s]
                 change:
                        time:   [−17.896% −16.950% −15.831%] (p = 0.00 < 0.05)
                        thrpt:  [+18.809% +20.410% +21.797%]
                        Performance has improved.

polyval/update_padded/10000
                        time:   [1.4745 µs 1.4758 µs 1.4772 µs]
                        thrpt:  [6.3048 GiB/s 6.3108 GiB/s 6.3160 GiB/s]
                 change:
                        time:   [−16.302% −15.891% −15.536%] (p = 0.00 < 0.05)
                        thrpt:  [+18.394% +18.893% +19.477%]
                        Performance has improved.

So yes, R/F wins, but at a cost of twice the memory usage.

I think we should probably adopt it. The question then becomes: do we leave the implementation like this, nice and simple with a one-size-fits-most powers-of-H, or do we make it generic again for people who want to optimize for long messages?

@tarcieri tarcieri force-pushed the polyval/rf-algorithm branch 4 times, most recently from 0a4fc63 to 3d7e37d on February 8, 2026 at 21:35
@tarcieri tarcieri changed the title [WIP] polyval: implement R/F algorithm polyval: implement R/F algorithm Feb 8, 2026
@tarcieri tarcieri force-pushed the polyval/rf-algorithm branch from 3d7e37d to 1b706b9 on February 8, 2026 at 22:24
@tarcieri tarcieri marked this pull request as ready for review February 8, 2026 22:24
@tarcieri tarcieri force-pushed the polyval/rf-algorithm branch from 1b706b9 to 2c394f5 on February 8, 2026 at 22:29
@tarcieri tarcieri merged commit a327d42 into master Feb 8, 2026
32 checks passed
@tarcieri tarcieri deleted the polyval/rf-algorithm branch February 8, 2026 22:32

tarcieri commented Feb 8, 2026

Note: I also double-checked this against aes-gcm / aes-gcm-siv, at least on ARMv8.

@tarcieri tarcieri mentioned this pull request Feb 27, 2026
tarcieri added a commit that referenced this pull request Feb 27, 2026
## Added
- `hazmat` feature that exposes `FieldElement` type (#279)

## Changed
- Edition changed to 2024 and MSRV bumped to 1.85 (#228)
- Relax MSRV policy and allow MSRV bumps in patch releases
- Replace `polyval_force_soft` with `polyval_backend="soft"` (#259)
- Use `cpubits` crate for `soft` backend selection (#261)
- Bump `cpufeatures` to v0.3 (#292)
- Use [Reduction/Field algorithm] for parallel block processing on
  `aarch64`/`x86(_64)` (#294)
- `polyval::mulx` moved to `FieldElement::mulx` (#296)
- Update to `universal-hash` v0.6 (#310)

## Removed
- `cfg(polyval_armv8)` - now enabled by default (#214)