Parallel dual-prime 1024-bit modular exponentiation for AVX512_IFMA capable processors.#13750
Parallel dual-prime 1024-bit modular exponentiation for AVX512_IFMA capable processors.#13750amatyuko-intc wants to merge 1 commit intoopenssl:masterfrom
Conversation
6ce0f2e to
0978c89
Compare
|
So you mention this is for IceLake processors, which seem to be a mobile version. It was my understanding that the use of those instructions will cause license-based downclocking. At least for some (older) generation Intel processors, this down clocking is probably not what we want. But looking at IceLake information, the downlclocking seems reasonable. So I have to wonder if this is going to affect older CPUs in a negative way, and why you do this for a mobile CPU that's probably not going to do that many RSA operations. |
The older CPUs are not affected by this optimization as it utilizes AVX512_IFMA ISA, which is initially appeared in IceLake microarchitecture. As to downclocking issues, we intentionally sticked to 256-bit width registers (AVX12VL) with less power license instead of full 512-bit ones which reduces such effect. IceLake has not only mobile version, there is a ICX server as well which can benefit from this. |
ded3d01 to
e16af8a
Compare
|
I have no more comment on it at this time, the changes look good. But I didn't have time to actually review the code itself. |
Thank you for taking a look on it. |
|
I am a little bit worried about adding another special code path for 2k RSA keys for just specific family of processors. |
|
Will all feature Intel server processors have support for this?
|
It will be supported by IceLake server processors and beyond, and will provide benefits for multiple generations (just like ADX code path that was initially added for Broadwell based platforms). Intel can release only client CPU performance numbers, as server parts are not available on the market yet. I'd expect no less speedup than on mobile IceLake processor. |
|
Is there any decision about this PR? |
Tried with the patch on an IceLake server CPU running at 2.2G base clock, I can confirm that RSA 2048 signing performance improved over 100%. Before the patch: After the patch: That's about 103% performance gain. |
|
Is there any decision about this PR?
There isn't any decision on to include it or not.
I will probably not spend time reviewing this before the 3.0
release, but that shouldn't prevent someone else from doing this.
Others might feel the same, so this might need to wait for 3.1.
|
mattcaswell
left a comment
There was a problem hiding this comment.
I can't comment on whether the maths is right, and I've not reviewed the assembler. Consequently most of my review comments are restricted to coding style, and "fit" with OpenSSL.
4c18d55 to
7899e65
Compare
|
@mattcaswell, thanks for the comments. All of them are addressed in the latest commit. |
|
This pull request is ready to merge |
|
This will need a rebase before it can be merged. I doubt it would require re-approval. |
77f003f to
93b5934
Compare
|
Hmm. There appears to be some relevant failures in the CIs. |
Yes, seems like CI switched to another GCC since the last PR update (I suspect 9.x? not sure how to get exact version from CI info) and it started producing a warning. Looking into it. |
93b5934 to
87b7760
Compare
|
Made a tiny change to to_words52() function to make it a bit cleaner for compiler optimizer. Should be good now. |
mattcaswell
left a comment
There was a problem hiding this comment.
Still good.
@paulidale please can you reconfirm?
paulidale
left a comment
There was a problem hiding this comment.
Actually one tiny nit.
with AVX512_IFMA + AVX512_VL instructions, primarily for RSA CRT private key operations. It uses 256-bit registers to avoid CPU frequency scaling issues. The performance speedup for RSA2k signature on ICL is ~2x.
87b7760 to
98baa53
Compare
|
Still good. No need to reset 24 hour timer for this I think. |
|
24 hours has passed since 'approval: done' was set, but as this PR has been updated in that time the label 'approval: ready to merge' is not being automatically set. Please review the updates and set the label manually. |
with AVX512_IFMA + AVX512_VL instructions, primarily for RSA CRT private key operations. It uses 256-bit registers to avoid CPU frequency scaling issues. The performance speedup for RSA2k signature on ICL is ~2x. Reviewed-by: Paul Dale <pauli@openssl.org> Reviewed-by: Matt Caswell <matt@openssl.org> (Merged from #13750)
|
Pushed! Thanks for the contribution! |
|
@amatyuko-intc or OpenSSL team: I would like to test this function with my fuzzer (https://github.com/guidovranken/cryptofuzz). IF there are bugs in the code, I can probably quickly find them. Unfortunately I don't have the requisite CPU. Is anyone willing to either give me (temporary) SSH access to a system with this CPU, or run Cryptofuzz on their system (with my guidance)? |
|
@guidovranken, I would be glad to help you run the tool on IceLake CPU, it would be interesting to check this. Do you have some pre-work ready to have this function call in the fuzzer? And I appreciate any guidance (you can find my email in the github profile). |
Great! I will e-mail you. |
This change adds AVX-512 support for RSA 2k, 3k and 4k signing. It is built around the use of AVX512_IFMA within the [(Almost) Montgomery Multiplication](https://eprint.iacr.org/2011/239) implementation that comprises the modular exponentiation part of the RSA algorithm. It is ported from the [OpenSSL patch](openssl/openssl#13750). On C6i instance, clang 12, Release build: Before: Did 832 RSA 2048 signing operations in 1009511us (824.2 ops/sec) Did 41000 RSA 2048 verify (same key) operations in 1019103us (40231.5 ops/sec) Did 30000 RSA 2048 verify (fresh key) operations in 1007956us (29763.2 ops/sec) Did 3684 RSA 2048 private key parse operations in 1067692us (3450.4 ops/sec) Did 340 RSA 3072 signing operations in 1051690us (323.3 ops/sec) Did 13000 RSA 3072 verify (same key) operations in 1087695us (11951.9 ops/sec) Did 16000 RSA 3072 verify (fresh key) operations in 1005781us (15908.0 ops/sec) Did 1870 RSA 3072 private key parse operations in 1017467us (1837.9 ops/sec) Did 128 RSA 4096 signing operations in 1015724us (126.0 ops/sec) Did 10000 RSA 4096 verify (same key) operations in 1071670us (9331.2 ops/sec) Did 6952 RSA 4096 verify (fresh key) operations in 1016484us (6839.3 ops/sec) Did 1110 RSA 4096 private key parse operations in 1092991us (1015.6 ops/sec) After: Did 1690 RSA 2048 signing operations in 1025072us (1648.7 ops/sec) Did 63000 RSA 2048 verify (same key) operations in 1008785us (62451.4 ops/sec) Did 54000 RSA 2048 verify (fresh key) operations in 1000298us (53983.9 ops/sec) Did 8000 RSA 2048 private key parse operations in 1000938us (7992.5 ops/sec) Did 550 RSA 3072 signing operations in 1012078us (543.4 ops/sec) Did 30000 RSA 3072 verify (same key) operations in 1022061us (29352.5 ops/sec) Did 27000 RSA 3072 verify (fresh key) operations in 1037663us (26020.0 ops/sec) Did 4140 RSA 3072 private key parse operations in 1006526us (4113.2 ops/sec) Did 253 RSA 4096 signing operations in 1050767us (240.8 ops/sec) Did 18000 RSA 4096 verify (same key) operations in 1057742us (17017.4 ops/sec) Did 15000 RSA 4096 verify (fresh key) operations in 1000483us (14992.8 ops/sec) Did 2510 RSA 4096 private key parse operations in 1004408us (2499.0 ops/sec) There is currently no support for 8k, so no change there. However, this could be a follow on if there is interest in that. Call-outs: This patch is primarily additive modulo a small logic change that occurs in `mod_exp()` in `rsa_impl.c`, where, previously, the calls to `mod_montgomery` and `BN_mod_exp_mont_consttime` were interleaved. Here, in order to make possible the parallel exponentiations, `r1` is kept around and a new `BIGNUM`, `r2`, is created on the context. --------- Co-authored-by: Nevine Ebeid <nebeid@amazon.com> Co-authored-by: Nevine Ebeid <66388554+nebeid@users.noreply.github.com>
The patch includes:
Note: other exponentiation kernels for RSA-3K and RSA-4K will follow.