Skip to content

Parallel dual-prime 1024-bit modular exponentiation for AVX512_IFMA capable processors.#13750

Closed
amatyuko-intc wants to merge 1 commit intoopenssl:masterfrom
amatyuko-intc:dev/2020/rsa-2k-icx
Closed

Parallel dual-prime 1024-bit modular exponentiation for AVX512_IFMA capable processors.#13750
amatyuko-intc wants to merge 1 commit intoopenssl:masterfrom
amatyuko-intc:dev/2020/rsa-2k-icx

Conversation

@amatyuko-intc
Copy link
Copy Markdown
Contributor

  • Improves RSA-2K performance by more than two times on AVX512_IFMA capable CPUs (Intel's IceLake architecture)
  • Implementation utilizes 256bit (ymm) registers to avoid frequency scaling issues.

The patch includes:

  • Dual 1024-bit modular exponentiation kernel (RSAZ using 256bit AVX512VL version of AVX512_IFMA ISA)
    Note: other exponentiation kernels for RSA-3K and RSA-4K will follow.
  • RSA-2K (CRT case with 2 1024-bit primes) enabled with the kernel above
  • Dual exponentiation test

@amatyuko-intc amatyuko-intc changed the title Parallel dual-prime 1024-bit modular exponentiation for AVX512_IFMA capable processors. WIP: Parallel dual-prime 1024-bit modular exponentiation for AVX512_IFMA capable processors. Dec 29, 2020
@amatyuko-intc amatyuko-intc force-pushed the dev/2020/rsa-2k-icx branch 2 times, most recently from 6ce0f2e to 0978c89 Compare December 29, 2020 23:09
@amatyuko-intc amatyuko-intc changed the title WIP: Parallel dual-prime 1024-bit modular exponentiation for AVX512_IFMA capable processors. Parallel dual-prime 1024-bit modular exponentiation for AVX512_IFMA capable processors. Dec 30, 2020
@kroeckx
Copy link
Copy Markdown
Member

kroeckx commented Dec 30, 2020

So you mention this is for IceLake processors, which seem to be a mobile version. It was my understanding that the use of those instructions will cause license-based downclocking. At least for some (older) generation Intel processors, this down clocking is probably not what we want. But looking at IceLake information, the downlclocking seems reasonable.

So I have to wonder if this is going to affect older CPUs in a negative way, and why you do this for a mobile CPU that's probably not going to do that many RSA operations.

@amatyuko-intc
Copy link
Copy Markdown
Contributor Author

So I have to wonder if this is going to affect older CPUs in a negative way, and why you do this for a mobile CPU that's probably not going to do that many RSA operations.

The older CPUs are not affected by this optimization as it utilizes AVX512_IFMA ISA, which is initially appeared in IceLake microarchitecture. As to downclocking issues, we intentionally sticked to 256-bit width registers (AVX12VL) with less power license instead of full 512-bit ones which reduces such effect.

IceLake has not only mobile version, there is a ICX server as well which can benefit from this.

@amatyuko-intc amatyuko-intc force-pushed the dev/2020/rsa-2k-icx branch 2 times, most recently from ded3d01 to e16af8a Compare January 4, 2021 14:00
@amatyuko-intc amatyuko-intc requested a review from kroeckx January 4, 2021 21:20
@kroeckx
Copy link
Copy Markdown
Member

kroeckx commented Jan 4, 2021

I have no more comment on it at this time, the changes look good. But I didn't have time to actually review the code itself.

@amatyuko-intc
Copy link
Copy Markdown
Contributor Author

I have no more comment on it at this time, the changes look good. But I didn't have time to actually review the code itself.

Thank you for taking a look on it.

@t8m
Copy link
Copy Markdown
Member

t8m commented Jan 5, 2021

I am a little bit worried about adding another special code path for 2k RSA keys for just specific family of processors.

@kroeckx
Copy link
Copy Markdown
Member

kroeckx commented Jan 6, 2021 via email

@ilya-albrekht
Copy link
Copy Markdown

Will all feature Intel server processors have support for this?

It will be supported by IceLake server processors and beyond, and will provide benefits for multiple generations (just like ADX code path that was initially added for Broadwell based platforms). Intel can release only client CPU performance numbers, as server parts are not available on the market yet. I'd expect no less speedup than on mobile IceLake processor.

@amatyuko-intc
Copy link
Copy Markdown
Contributor Author

Is there any decision about this PR?

@vanc
Copy link
Copy Markdown

vanc commented Jan 11, 2021

Will all feature Intel server processors have support for this?

It will be supported by IceLake server processors and beyond, and will provide benefits for multiple generations (just like ADX code path that was initially added for Broadwell based platforms). Intel can release only client CPU performance numbers, as server parts are not available on the market yet. I'd expect no less speedup than on mobile IceLake processor.

Tried with the patch on an IceLake server CPU running at 2.2G base clock, I can confirm that RSA 2048 signing performance improved over 100%.

Before the patch:

                sign      verify   sign/s  verify/s
rsa 2048 bits 0.000942s 0.000028s **1061.8**  36232.2

After the patch:

                sign      verify   sign/s  verify/s
rsa 2048 bits 0.000466s 0.000029s **2145.4** 34772.2

That's about 103% performance gain.

@kroeckx
Copy link
Copy Markdown
Member

kroeckx commented Jan 11, 2021 via email

Copy link
Copy Markdown
Member

@mattcaswell mattcaswell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't comment on whether the maths is right, and I've not reviewed the assembler. Consequently most of my review comments are restricted to coding style, and "fit" with OpenSSL.

@amatyuko-intc amatyuko-intc force-pushed the dev/2020/rsa-2k-icx branch 2 times, most recently from 4c18d55 to 7899e65 Compare January 20, 2021 19:40
@amatyuko-intc
Copy link
Copy Markdown
Contributor Author

@mattcaswell, thanks for the comments. All of them are addressed in the latest commit.

@openssl-machine
Copy link
Copy Markdown
Collaborator

This pull request is ready to merge

@openssl-machine openssl-machine added the approval: ready to merge The 24 hour grace period has passed, ready to merge label Mar 17, 2021
@paulidale
Copy link
Copy Markdown
Contributor

This will need a rebase before it can be merged. I doubt it would require re-approval.

@mattcaswell
Copy link
Copy Markdown
Member

Hmm. There appears to be some relevant failures in the CIs.

@amatyuko-intc
Copy link
Copy Markdown
Contributor Author

Hmm. There appears to be some relevant failures in the CIs.

Yes, seems like CI switched to another GCC since the last PR update (I suspect 9.x? not sure how to get exact version from CI info) and it started producing a warning. Looking into it.

@amatyuko-intc
Copy link
Copy Markdown
Contributor Author

Made a tiny change to to_words52() function to make it a bit cleaner for compiler optimizer. Should be good now.

Copy link
Copy Markdown
Member

@mattcaswell mattcaswell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still good.

@paulidale please can you reconfirm?

@mattcaswell mattcaswell added approval: review pending This pull request needs review by a committer and removed approval: ready to merge The 24 hour grace period has passed, ready to merge labels Mar 18, 2021
Copy link
Copy Markdown
Contributor

@paulidale paulidale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still good.

Copy link
Copy Markdown
Contributor

@paulidale paulidale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually one tiny nit.

with AVX512_IFMA + AVX512_VL instructions, primarily for RSA CRT private key
operations. It uses 256-bit registers to avoid CPU frequency scaling issues.
The performance speedup for RSA2k signature on ICL is ~2x.
@mattcaswell mattcaswell added approval: done This pull request has the required number of approvals and removed approval: review pending This pull request needs review by a committer labels Mar 18, 2021
@mattcaswell
Copy link
Copy Markdown
Member

Still good. No need to reset 24 hour timer for this I think.

@openssl-machine
Copy link
Copy Markdown
Collaborator

24 hours has passed since 'approval: done' was set, but as this PR has been updated in that time the label 'approval: ready to merge' is not being automatically set. Please review the updates and set the label manually.

openssl-machine pushed a commit that referenced this pull request Mar 22, 2021
with AVX512_IFMA + AVX512_VL instructions, primarily for RSA CRT private key
operations. It uses 256-bit registers to avoid CPU frequency scaling issues.
The performance speedup for RSA2k signature on ICL is ~2x.

Reviewed-by: Paul Dale <pauli@openssl.org>
Reviewed-by: Matt Caswell <matt@openssl.org>
(Merged from #13750)
@mattcaswell
Copy link
Copy Markdown
Member

Pushed! Thanks for the contribution!

@guidovranken
Copy link
Copy Markdown
Contributor

@amatyuko-intc or OpenSSL team: I would like to test this function with my fuzzer (https://github.com/guidovranken/cryptofuzz). IF there are bugs in the code, I can probably quickly find them. Unfortunately I don't have the requisite CPU. Is anyone willing to either give me (temporary) SSH access to a system with this CPU, or run Cryptofuzz on their system (with my guidance)?

@amatyuko-intc
Copy link
Copy Markdown
Contributor Author

@guidovranken, I would be glad to help you run the tool on IceLake CPU, it would be interesting to check this. Do you have some pre-work ready to have this function call in the fuzzer? And I appreciate any guidance (you can find my email in the github profile).

@guidovranken
Copy link
Copy Markdown
Contributor

@guidovranken, I would be glad to help you run the tool on IceLake CPU, it would be interesting to check this. Do you have some pre-work ready to have this function call in the fuzzer? And I appreciate any guidance (you can find my email in the github profile).

Great! I will e-mail you.

nebeid added a commit to aws/aws-lc that referenced this pull request Sep 17, 2024
This change adds AVX-512 support for RSA 2k, 3k and 4k signing. It is
built around the use of AVX512_IFMA within the [(Almost) Montgomery
Multiplication](https://eprint.iacr.org/2011/239) implementation that
comprises the modular exponentiation part of the RSA algorithm. It is
ported from the [OpenSSL
patch](openssl/openssl#13750).

On C6i instance, clang 12, Release build:
Before:
Did 832 RSA 2048 signing operations in 1009511us (824.2 ops/sec)
Did 41000 RSA 2048 verify (same key) operations in 1019103us (40231.5 ops/sec)
Did 30000 RSA 2048 verify (fresh key) operations in 1007956us (29763.2 ops/sec)
Did 3684 RSA 2048 private key parse operations in 1067692us (3450.4 ops/sec)
Did 340 RSA 3072 signing operations in 1051690us (323.3 ops/sec)
Did 13000 RSA 3072 verify (same key) operations in 1087695us (11951.9 ops/sec)
Did 16000 RSA 3072 verify (fresh key) operations in 1005781us (15908.0 ops/sec)
Did 1870 RSA 3072 private key parse operations in 1017467us (1837.9 ops/sec)
Did 128 RSA 4096 signing operations in 1015724us (126.0 ops/sec)
Did 10000 RSA 4096 verify (same key) operations in 1071670us (9331.2 ops/sec)
Did 6952 RSA 4096 verify (fresh key) operations in 1016484us (6839.3 ops/sec)
Did 1110 RSA 4096 private key parse operations in 1092991us (1015.6 ops/sec)
After:
Did 1690 RSA 2048 signing operations in 1025072us (1648.7 ops/sec)
Did 63000 RSA 2048 verify (same key) operations in 1008785us (62451.4 ops/sec)
Did 54000 RSA 2048 verify (fresh key) operations in 1000298us (53983.9 ops/sec)
Did 8000 RSA 2048 private key parse operations in 1000938us (7992.5 ops/sec)
Did 550 RSA 3072 signing operations in 1012078us (543.4 ops/sec)
Did 30000 RSA 3072 verify (same key) operations in 1022061us (29352.5 ops/sec)
Did 27000 RSA 3072 verify (fresh key) operations in 1037663us (26020.0 ops/sec)
Did 4140 RSA 3072 private key parse operations in 1006526us (4113.2 ops/sec)
Did 253 RSA 4096 signing operations in 1050767us (240.8 ops/sec)
Did 18000 RSA 4096 verify (same key) operations in 1057742us (17017.4 ops/sec)
Did 15000 RSA 4096 verify (fresh key) operations in 1000483us (14992.8 ops/sec)
Did 2510 RSA 4096 private key parse operations in 1004408us (2499.0 ops/sec)

There is currently no support for 8k, so no change there. However, this
could be a follow on if there is interest in that.

Call-outs:
This patch is primarily additive modulo a small logic change that occurs in `mod_exp()` in `rsa_impl.c`,
where, previously, the calls to `mod_montgomery` and `BN_mod_exp_mont_consttime` were
interleaved. Here, in order to make possible the parallel exponentiations, `r1` is kept around and a new 
`BIGNUM`, `r2`, is created on the context.

---------

Co-authored-by: Nevine Ebeid <nebeid@amazon.com>
Co-authored-by: Nevine Ebeid <66388554+nebeid@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approval: done This pull request has the required number of approvals branch: master Applies to master branch

Projects

None yet

Development

Successfully merging this pull request may close these issues.