Refactor: Have a common base for Kyber/Dilithium structures#4024
Refactor: Have a common base for Kyber/Dilithium structures#4024reneme merged 7 commits intorandombit:masterfrom
Conversation
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as outdated.
This comment was marked as outdated.
64f9db8 to
066fc6d
Compare
dd45e61 to
c30b9fe
Compare
This comment was marked as outdated.
This comment was marked as outdated.
reneme
left a comment
There was a problem hiding this comment.
Below are a few highlights and potential conversion starters of things I saw along the way. :)
| // This is a mitigation for a potential side channel called "KyberSlash". | ||
| // It implements the division by Q using a multiplication and a shift. Most | ||
| // compilers would generate similar code for such a division by a constant. | ||
| // Though, in some cases, compilers might use a variable-time int division, | ||
| // resulting in a potential side channel. | ||
| // | ||
| // See "Hacker's Delight" (Second Edition) by Henry S. Warren, Jr. | ||
| // Chapter 10-9 "Unsigned Division by Divisors >= 1" | ||
| auto divide_by_q = [](uint32_t n) -> unsigned_T { | ||
| static_assert(KyberConstants::Q == 3329); | ||
| BOTAN_DEBUG_ASSERT(n < (1 << 23)); | ||
|
|
||
| // These constants work for all values that appear in Kyber with the | ||
| // greatest being 3328 * 2^11 + Q // 2 = 6,817,408 < 2**23 = 8,388,608. | ||
| constexpr uint64_t m = 2580335; | ||
| constexpr size_t p = 33; | ||
| return static_cast<unsigned_T>((n * m) >> p); | ||
| }; | ||
|
|
||
| constexpr unsigned_T mask = (1 << d) - 1; | ||
| return divide_by_q((static_cast<uint32_t>(x) << d) + KyberConstants::Q / 2) & mask; |
There was a problem hiding this comment.
I've adapted the KyberSlash countermeasure to also work for the compression instances with d = 10 and d = 11. Not sure it's necessary but doesn't hurt either. I'd definitely be happy about some other sharp pair of eyes.
| // TODO: perhaps secure vector | ||
| std::vector<T> m_coeffs_storage; |
There was a problem hiding this comment.
Polynomials store their coefficients as a std::vector<>. Some of those store secret values and should therefore use secure_vector. Also, it might be beneficial to add first-level support for Polynomials (and friends) as Strong types. (Most likely this would be another pull request, though.)
| namespace detail { | ||
|
|
||
| constexpr auto as_byte_source(BufferSlicer& slicer) { | ||
| return [&](std::span<uint8_t> out) { slicer.copy_into(out); }; | ||
| } | ||
|
|
||
| constexpr auto as_byte_source(Botan::XOF& xof) { | ||
| return [&](std::span<uint8_t> out) { xof.output(out); }; | ||
| } | ||
|
|
||
| } // namespace detail | ||
|
|
||
| template <typename T> | ||
| concept byte_source = | ||
| requires(T& t) { requires std::invocable<decltype(detail::as_byte_source(t)), std::span<uint8_t>>; }; |
There was a problem hiding this comment.
I feel like this is a top-level concept we might want to establish deeper in the library. Perhaps as an (internal) replacement for the DataSource class.
Namely: A thing that can provide a concrete number of bytes where I don't really know how many bytes will be needed. The source may be infinite or not.
Things that could fall in this abstraction may be:
- the
RandomNumberGenerator - a
XOF - read from some buffer (think
BufferSlicer) - from some I/O source (e.g. the network)
In Kyber/Dithithium this is used to either sample some random polynomial from a XOF or to unmarshal it from a bitpacked encoding.
Other use cases may be:
- parsing in general (TLS, X.509 and friends)
- data loading, e.g. like:
load_be<uint64_t>(rng)to get a random 64 bit integer. (See also Internal: load_be/le should accept a BufferSlicer #4068) - among other things
There was a problem hiding this comment.
From a technical perspective, one might even implement it like shown here: As a C++20 concept that tries to build a thin wrapper around "the source". That way, new data sources could just hook themselves into this "helper" and decide for themselves to be a "data source" or not.
There was a problem hiding this comment.
Cool idea 👍. I'm not sure if we want to commit to a pure concept construction, though. In some cases, it may be more convenient to have a class like DataSource. We could still extend the DataSource class to fulfill the respective concept if one wants to juggle with templates.
| // TODO: Remove the need for the montgomery transformation | ||
| // | ||
| // -> When calculating A*s below, A is not in montgomery form, but s is. The | ||
| // operation uses fqmul internally, which performs a montgomery reduction. | ||
| auto A = montgomery(kyber_sample_matrix(rho, false /* not transposed */, mode)); |
There was a problem hiding this comment.
This was different in the reference implementation. @FAlbertDev we did look into this a while ago, I feel we should take this up again and get to the bottom of it.
There was a problem hiding this comment.
I've just investigated the usage of Montgomery in Kyber. First, a short recap on Montgomery:
The main observation is that every Montgomery multiplication
In our Kyber and Dilithium code, the following operations add or remove the Montgomery factor.
- the function
montgomeryadds a factor$R$ - the function
inv_nttadds a factor$R$ . Note thatnttdoesn't add a Montgomery factor. - As already said, a Montgomery multiplication adds a factor
$R^{-1}$ to each poly, vector, or matrix element. This holds for all multiplication types, i.e., polynomial-polynomial, vector-vector, matrix-vector multiplication, etc.
For encryption and decryption, this setup works well since there is always one multiplication (adding factor inv_ntt (adding factor inv_ntt. We have a factor of montgomery. Instead, I propose to add this factor to t after the multiplication. The fewer elements multiplied by
auto A = kyber_sample_matrix(rho, false /* not transposed */, mode);
auto s = ntt(ps.sample_polynomial_vector_cbd_eta1());
const auto e = ntt(ps.sample_polynomial_vector_cbd_eta1());
auto t = montgomery(A * s);
t += e;
t.reduce();
// End Algorithm 12 ---------------------------There was a problem hiding this comment.
Nice! Thanks for double-checking and the detailed write-up! It would probably be really nice to track the montgomery parameter in the type system as well, but it does feel like overkill.
Also: by applying montgomery() to the t vector, we can remove the overload for PolynomialMatrix altogether. :)
| using KyberPolyNTT = Botan::CRYSTALS::Polynomial<KyberPolyTraits, Botan::CRYSTALS::Domain::NTT>; | ||
| using KyberPolyVecNTT = Botan::CRYSTALS::PolynomialVector<KyberPolyTraits, Botan::CRYSTALS::Domain::NTT>; | ||
| using KyberPolyMat = Botan::CRYSTALS::PolynomialMatrix<KyberPolyTraits>; | ||
|
|
||
| using KyberPoly = Botan::CRYSTALS::Polynomial<KyberPolyTraits, Botan::CRYSTALS::Domain::Normal>; | ||
| using KyberPolyVec = Botan::CRYSTALS::PolynomialVector<KyberPolyTraits, Botan::CRYSTALS::Domain::Normal>; |
There was a problem hiding this comment.
As mentioned before: it would be good if these could be the basis for strong types to disambiguate them in interfaces E.g. to replace the multiple poly_pack_t0, ..._t1, ... methods in dilithium_algos.cpp with an overloaded poly_pack() that just "does the right thing" ™ based on the strong type.
| enum DilithiumTau : uint32_t { _39 = 39, _49 = 49, _60 = 60 }; | ||
|
|
||
| enum DilithiumLambda : uint32_t { _128 = 128, _192 = 192, _256 = 256 }; | ||
|
|
||
| enum DilithiumGamma1 : uint32_t { ToThe17th = (1 << 17), ToThe19th = (1 << 19) }; | ||
|
|
||
| enum DilithiumGamma2 : uint32_t { Qminus1DevidedBy88 = (Q - 1) / 88, Qminus1DevidedBy32 = (Q - 1) / 32 }; | ||
|
|
||
| enum DilithiumEta : uint32_t { _2 = 2, _4 = 4 }; | ||
|
|
||
| enum DilithiumBeta : uint32_t { _78 = 78, _196 = 196, _120 = 120 }; | ||
|
|
||
| enum DilithiumOmega : uint32_t { _80 = 80, _55 = 55, _75 = 75 }; |
There was a problem hiding this comment.
Dilithium has many integer constants that change depending on the instantiation of the algorithm. I found it worthwhile to restrict them with explicit enums. I'm okay with this concept, also given the restricted (algorithm-internal) scope. But I feel there must be a better way for this. :)
There was a problem hiding this comment.
I also like this idea. While the implied switch-cases are annoying, they guarantee we do not forget a value.
| dilithium_poly_pack_t0(p, stuffer); | ||
| } | ||
|
|
||
| BOTAN_ASSERT_NOMSG(stuffer.full()); |
There was a problem hiding this comment.
Those became ubiquitous and at this point I do start wondering whether we should just put that in the destructors of Stuffer and Slicer (perhaps explicitly disengagable). What do y'all think?
There was a problem hiding this comment.
I'll do that after 3.5.0 is cut, to avoid unwanted surprises.
cf16bf3 to
3fc00ec
Compare
3fc00ec to
8dcd72b
Compare
FAlbertDev
left a comment
There was a problem hiding this comment.
I've only looked into dilithium and the crystals module so far. The other sections (including Kyber) will come tomorrow.
For algorithms of this high complexity, the new structure looks very well-organized and clear :)
I've only found some minor nits so far.
(Btw sorry if I also marked lots of removed code. The VSCode extension for GitHub caused this)
1a3ac57 to
ae4298c
Compare
FAlbertDev
left a comment
There was a problem hiding this comment.
Final review iteration.
I really like your work. IMO it's the better reference implementation ;)
| // TODO: Remove the need for the montgomery transformation | ||
| // | ||
| // -> When calculating A*s below, A is not in montgomery form, but s is. The | ||
| // operation uses fqmul internally, which performs a montgomery reduction. | ||
| auto A = montgomery(kyber_sample_matrix(rho, false /* not transposed */, mode)); |
There was a problem hiding this comment.
I've just investigated the usage of Montgomery in Kyber. First, a short recap on Montgomery:
The main observation is that every Montgomery multiplication
In our Kyber and Dilithium code, the following operations add or remove the Montgomery factor.
- the function
montgomeryadds a factor$R$ - the function
inv_nttadds a factor$R$ . Note thatnttdoesn't add a Montgomery factor. - As already said, a Montgomery multiplication adds a factor
$R^{-1}$ to each poly, vector, or matrix element. This holds for all multiplication types, i.e., polynomial-polynomial, vector-vector, matrix-vector multiplication, etc.
For encryption and decryption, this setup works well since there is always one multiplication (adding factor inv_ntt (adding factor inv_ntt. We have a factor of montgomery. Instead, I propose to add this factor to t after the multiplication. The fewer elements multiplied by
auto A = kyber_sample_matrix(rho, false /* not transposed */, mode);
auto s = ntt(ps.sample_polynomial_vector_cbd_eta1());
const auto e = ntt(ps.sample_polynomial_vector_cbd_eta1());
auto t = montgomery(A * s);
t += e;
t.reduce();
// End Algorithm 12 ---------------------------
randombit
left a comment
There was a problem hiding this comment.
Partial review still need to look at changes in dilithium and kyber
e6a97fd to
e9863c1
Compare
|
Everything so far looks good but I'm not enthused about dropping big refactors just before a release so pushing this to 3.6.0 |
See: randombit/botan#3887 This is a fairly minimal adaption, once the full refactoring is merged, we'll have to rewrite this more substantially. That won't happen before Botan 3.5.0, though. See also: randombit/botan#4024
| template <int32_t range, crystals_trait PolyTrait, Domain D, coeff_map_fn<typename PolyTrait::T> MapFnT> | ||
| constexpr void pack(const Polynomial<PolyTrait, D>& p, BufferStuffer& stuffer, MapFnT map) { | ||
| using trait = BitPackingTrait<range, PolyTrait>; |
There was a problem hiding this comment.
Could the gist of this be useful beyond the CRYSTALS scope? E.g. FrodoKEM has a pack/unpack routine that -- on first glance at least -- looks like it could use it as well. @randombit perhaps there are more examples?
botan/src/lib/pubkey/frodokem/frodokem_common/frodo_matrix.cpp
Lines 340 to 384 in 3657a72
randombit
left a comment
There was a problem hiding this comment.
This looks very good. I left one bikeshed suggestion that to me feels a little cleaner - but it's marginal, feel free to ignore if you disagree.
|
Final comment, could use some history cleanup before merge |
Most notably, this is an abstract implementation to handle CRYSTALS polynomials, poly vectors and poly matrices. Also, a generic implementation for bit-packed encoding/decoding of coefficients useful for both Kyber and Dilithium. Co-Authored-By: Fabian Albert <fabian.albert@rohde-schwarz.com>
t1 is part of the public key and thus independent of any other verification input. Precomputing it saves about 20% of verification runtime when performing more than a single verification with the same Botan::PK_Verifier. Co-Authored-By: Fabian Albert <fabian.albert@rohde-schwarz.com>
For Kyber those are {512,768,1024} and for Dilithium {44,65,87}
depending on the respective choice of parameter set.
a476620 to
8bd26c7
Compare
Rebased to master, cleaned up history and built the bikeshed. Waiting for CI, then finally drawing a line here. 🥳 |
This is essentially a complete overhaul of the Kyber and Dilithium implementations aiming at as much shared code as possible between the two algorithms. The rough structure is outlined below. I split-up the changes into three commits, that could be separate pull requests, if needed.
Module:
pqcrystalsThis is a templated base for the polynomials (as well as poly vectors/matrices) used in both implementations. The base implementation shares math where possible and is pluggable where necessary. It also exposes its domain (NTT or "normal") in the type system. Also, it makes some initial optimization effort (mostly memory layout), but more work could be done here.
Both algorithms contain bit-packing algorithms to efficiently encode polynomial coefficients of various known value ranges. Both Kyber and Dilithium now share a generic and pluggable implementation of this.
Module:
kyberThis now makes use of the
pqcrystalsmodule. The polynomial instantiation happens inkyber_polynomial.h(replacing the oldkyber_structures.h). Most low-level "algorithms" (pseudo-codes of the spec) live inkyber_algos.cpp. The indcpa "K-PKE" scheme is implemented inkyber_keys.cppand the cca2-secure wrapper can be found inkyber.cpp. Auxiliary stuff like constants and strong types are in other headers.Note that this now contains almost no verbatim code from the reference implementation anymore.
Module:
dilithiumLikewise, this uses the
pqcrystalsmodule (instantiating the polynomials indilithium_polynomial.h(replacing the almost verbatim reference implementation code). Thedilithium_algos.cppis home to most low-level "algorithms" (as seen in the spec).dilithium.cpphouses the actual signature/verification algorihtms that bind it all together..Care has been taken to have a consistent implementation structure between Kyber and Dilithium. Note, though, that Dilithium's flexibility might require some love before a clean integration of ML-KEM (similar to what happened to Kyber in #3887).
Benchmark
Despite not explicitly aiming to increase speed,./botan speedshows nice gains for Kyber and no losses for Dilithium.It might be worth looking into why Kyber is so much faster, though. Feels suspicious.Edit: The new implementation pre-computes and stores the expanded public matrix A. So as long as one re-uses thePK_KEM_Encryptorfor multiple encapsulations, one saves a lot of invocations to the XOF. That explains. 😏GCC 13.2 vs. Clang 18.1 on Linux (in WSL)