Parallel dual-prime 1024-bit modular exponentiation for AVX512_IFMA capable processors. by amatyuko-intc · Pull Request #13750 · openssl/openssl

amatyuko-intc · 2020-12-29T19:36:33Z

Improves RSA-2K performance by more than two times on AVX512_IFMA capable CPUs (Intel's IceLake architecture)
Implementation utilizes 256bit (ymm) registers to avoid frequency scaling issues.

The patch includes:

Dual 1024-bit modular exponentiation kernel (RSAZ using 256bit AVX512VL version of AVX512_IFMA ISA)
Note: other exponentiation kernels for RSA-3K and RSA-4K will follow.
RSA-2K (CRT case with 2 1024-bit primes) enabled with the kernel above
Dual exponentiation test

include/openssl/bn.h

crypto/bn/asm/rsaz-avx512.pl

kroeckx · 2020-12-30T15:01:15Z

So you mention this is for IceLake processors, which seem to be a mobile version. It was my understanding that the use of those instructions will cause license-based downclocking. At least for some (older) generation Intel processors, this down clocking is probably not what we want. But looking at IceLake information, the downlclocking seems reasonable.

So I have to wonder if this is going to affect older CPUs in a negative way, and why you do this for a mobile CPU that's probably not going to do that many RSA operations.

amatyuko-intc · 2020-12-30T18:42:47Z

So I have to wonder if this is going to affect older CPUs in a negative way, and why you do this for a mobile CPU that's probably not going to do that many RSA operations.

The older CPUs are not affected by this optimization as it utilizes AVX512_IFMA ISA, which is initially appeared in IceLake microarchitecture. As to downclocking issues, we intentionally sticked to 256-bit width registers (AVX12VL) with less power license instead of full 512-bit ones which reduces such effect.

IceLake has not only mobile version, there is a ICX server as well which can benefit from this.

crypto/bn/bn_exp.c

crypto/bn/asm/rsaz-avx512.pl

crypto/bn/rsaz_exp_x2.c

kroeckx · 2021-01-04T22:23:06Z

I have no more comment on it at this time, the changes look good. But I didn't have time to actually review the code itself.

amatyuko-intc · 2021-01-04T23:04:18Z

I have no more comment on it at this time, the changes look good. But I didn't have time to actually review the code itself.

Thank you for taking a look on it.

t8m · 2021-01-05T09:08:53Z

I am a little bit worried about adding another special code path for 2k RSA keys for just specific family of processors.

kroeckx · 2021-01-06T12:03:14Z

Will all feature Intel server processors have support for this?

ilya-albrekht · 2021-01-06T15:34:03Z

Will all feature Intel server processors have support for this?

It will be supported by IceLake server processors and beyond, and will provide benefits for multiple generations (just like ADX code path that was initially added for Broadwell based platforms). Intel can release only client CPU performance numbers, as server parts are not available on the market yet. I'd expect no less speedup than on mobile IceLake processor.

amatyuko-intc · 2021-01-11T16:56:56Z

Is there any decision about this PR?

vanc · 2021-01-11T17:36:33Z

Will all feature Intel server processors have support for this?

It will be supported by IceLake server processors and beyond, and will provide benefits for multiple generations (just like ADX code path that was initially added for Broadwell based platforms). Intel can release only client CPU performance numbers, as server parts are not available on the market yet. I'd expect no less speedup than on mobile IceLake processor.

Tried with the patch on an IceLake server CPU running at 2.2G base clock, I can confirm that RSA 2048 signing performance improved over 100%.

Before the patch:

                sign      verify   sign/s  verify/s
rsa 2048 bits 0.000942s 0.000028s **1061.8**  36232.2

After the patch:

                sign      verify   sign/s  verify/s
rsa 2048 bits 0.000466s 0.000029s **2145.4** 34772.2

That's about 103% performance gain.

kroeckx · 2021-01-11T17:40:35Z

Is there any decision about this PR?

There isn't any decision on to include it or not. I will probably not spend time reviewing this before the 3.0 release, but that shouldn't prevent someone else from doing this. Others might feel the same, so this might need to wait for 3.1.

mattcaswell

I can't comment on whether the maths is right, and I've not reviewed the assembler. Consequently most of my review comments are restricted to coding style, and "fit" with OpenSSL.

crypto/bn/bn_exp.c

crypto/bn/rsaz_exp_x2.c

doc/man3/BN_mod_exp_mont.pod

amatyuko-intc · 2021-01-20T19:51:30Z

@mattcaswell, thanks for the comments. All of them are addressed in the latest commit.

openssl-machine · 2021-03-17T01:00:17Z

This pull request is ready to merge

paulidale · 2021-03-17T02:34:24Z

This will need a rebase before it can be merged. I doubt it would require re-approval.

mattcaswell · 2021-03-17T09:04:56Z

Hmm. There appears to be some relevant failures in the CIs.

amatyuko-intc · 2021-03-17T09:09:35Z

Hmm. There appears to be some relevant failures in the CIs.

Yes, seems like CI switched to another GCC since the last PR update (I suspect 9.x? not sure how to get exact version from CI info) and it started producing a warning. Looking into it.

amatyuko-intc · 2021-03-17T20:37:28Z

Made a tiny change to to_words52() function to make it a bit cleaner for compiler optimizer. Should be good now.

mattcaswell

Still good.

@paulidale please can you reconfirm?

paulidale

Still good.

paulidale

Actually one tiny nit.

crypto/bn/rsaz_exp_x2.c

with AVX512_IFMA + AVX512_VL instructions, primarily for RSA CRT private key operations. It uses 256-bit registers to avoid CPU frequency scaling issues. The performance speedup for RSA2k signature on ICL is ~2x.

mattcaswell · 2021-03-18T16:55:45Z

Still good. No need to reset 24 hour timer for this I think.

openssl-machine · 2021-03-19T17:00:16Z

24 hours has passed since 'approval: done' was set, but as this PR has been updated in that time the label 'approval: ready to merge' is not being automatically set. Please review the updates and set the label manually.

with AVX512_IFMA + AVX512_VL instructions, primarily for RSA CRT private key operations. It uses 256-bit registers to avoid CPU frequency scaling issues. The performance speedup for RSA2k signature on ICL is ~2x. Reviewed-by: Paul Dale <pauli@openssl.org> Reviewed-by: Matt Caswell <matt@openssl.org> (Merged from #13750)

mattcaswell · 2021-03-22T10:39:26Z

Pushed! Thanks for the contribution!

guidovranken · 2021-04-15T19:44:53Z

@amatyuko-intc or OpenSSL team: I would like to test this function with my fuzzer (https://github.com/guidovranken/cryptofuzz). IF there are bugs in the code, I can probably quickly find them. Unfortunately I don't have the requisite CPU. Is anyone willing to either give me (temporary) SSH access to a system with this CPU, or run Cryptofuzz on their system (with my guidance)?

amatyuko-intc · 2021-04-16T07:20:20Z

@guidovranken, I would be glad to help you run the tool on IceLake CPU, it would be interesting to check this. Do you have some pre-work ready to have this function call in the fuzzer? And I appreciate any guidance (you can find my email in the github profile).

guidovranken · 2021-04-16T07:55:19Z

@guidovranken, I would be glad to help you run the tool on IceLake CPU, it would be interesting to check this. Do you have some pre-work ready to have this function call in the fuzzer? And I appreciate any guidance (you can find my email in the github profile).

Great! I will e-mail you.

This change adds AVX-512 support for RSA 2k, 3k and 4k signing. It is built around the use of AVX512_IFMA within the [(Almost) Montgomery Multiplication](https://eprint.iacr.org/2011/239) implementation that comprises the modular exponentiation part of the RSA algorithm. It is ported from the [OpenSSL patch](openssl/openssl#13750). On C6i instance, clang 12, Release build: Before: Did 832 RSA 2048 signing operations in 1009511us (824.2 ops/sec) Did 41000 RSA 2048 verify (same key) operations in 1019103us (40231.5 ops/sec) Did 30000 RSA 2048 verify (fresh key) operations in 1007956us (29763.2 ops/sec) Did 3684 RSA 2048 private key parse operations in 1067692us (3450.4 ops/sec) Did 340 RSA 3072 signing operations in 1051690us (323.3 ops/sec) Did 13000 RSA 3072 verify (same key) operations in 1087695us (11951.9 ops/sec) Did 16000 RSA 3072 verify (fresh key) operations in 1005781us (15908.0 ops/sec) Did 1870 RSA 3072 private key parse operations in 1017467us (1837.9 ops/sec) Did 128 RSA 4096 signing operations in 1015724us (126.0 ops/sec) Did 10000 RSA 4096 verify (same key) operations in 1071670us (9331.2 ops/sec) Did 6952 RSA 4096 verify (fresh key) operations in 1016484us (6839.3 ops/sec) Did 1110 RSA 4096 private key parse operations in 1092991us (1015.6 ops/sec) After: Did 1690 RSA 2048 signing operations in 1025072us (1648.7 ops/sec) Did 63000 RSA 2048 verify (same key) operations in 1008785us (62451.4 ops/sec) Did 54000 RSA 2048 verify (fresh key) operations in 1000298us (53983.9 ops/sec) Did 8000 RSA 2048 private key parse operations in 1000938us (7992.5 ops/sec) Did 550 RSA 3072 signing operations in 1012078us (543.4 ops/sec) Did 30000 RSA 3072 verify (same key) operations in 1022061us (29352.5 ops/sec) Did 27000 RSA 3072 verify (fresh key) operations in 1037663us (26020.0 ops/sec) Did 4140 RSA 3072 private key parse operations in 1006526us (4113.2 ops/sec) Did 253 RSA 4096 signing operations in 1050767us (240.8 ops/sec) Did 18000 RSA 4096 verify (same key) operations in 1057742us (17017.4 ops/sec) Did 15000 RSA 4096 verify (fresh key) operations in 1000483us (14992.8 ops/sec) Did 2510 RSA 4096 private key parse operations in 1004408us (2499.0 ops/sec) There is currently no support for 8k, so no change there. However, this could be a follow on if there is interest in that. Call-outs: This patch is primarily additive modulo a small logic change that occurs in `mod_exp()` in `rsa_impl.c`, where, previously, the calls to `mod_montgomery` and `BN_mod_exp_mont_consttime` were interleaved. Here, in order to make possible the parallel exponentiations, `r1` is kept around and a new `BIGNUM`, `r2`, is created on the context. --------- Co-authored-by: Nevine Ebeid <nebeid@amazon.com> Co-authored-by: Nevine Ebeid <66388554+nebeid@users.noreply.github.com>

amatyuko-intc changed the title ~~Parallel dual-prime 1024-bit modular exponentiation for AVX512_IFMA capable processors.~~ WIP: Parallel dual-prime 1024-bit modular exponentiation for AVX512_IFMA capable processors. Dec 29, 2020

amatyuko-intc force-pushed the dev/2020/rsa-2k-icx branch 2 times, most recently from 6ce0f2e to 0978c89 Compare December 29, 2020 23:09

kroeckx reviewed Dec 29, 2020

View reviewed changes

include/openssl/bn.h Outdated Show resolved Hide resolved

kroeckx reviewed Dec 29, 2020

View reviewed changes

crypto/bn/asm/rsaz-avx512.pl Outdated Show resolved Hide resolved

amatyuko-intc changed the title ~~WIP: Parallel dual-prime 1024-bit modular exponentiation for AVX512_IFMA capable processors.~~ Parallel dual-prime 1024-bit modular exponentiation for AVX512_IFMA capable processors. Dec 30, 2020

kroeckx reviewed Dec 30, 2020

View reviewed changes

crypto/bn/bn_exp.c Outdated Show resolved Hide resolved