Avoid mixing SSE and AVX in XTS-mode AVX512 implementation by torben-hansen · Pull Request #2140 · aws/aws-lc

torben-hansen · 2025-01-25T16:27:58Z

Issues:

P192842055

Description of changes:

Define the set S = [i*128+80,i*128+80+15], i=1,2,3,..., . The set of inputs with lengths from S, on the XTS encrypt path using the AVX512 implementation, a bimodal performance occurred. We have observed more than 80% drop in performance.

The graph depicts AWS-LC (blue) and AWS-LC-FIPS-2024 (orange), showing the pattern.

This is caused by mixing SSE and AVX instructions in the AVX512 implementation. In particular, for lengths in S, the path L_remaining_num_blocks_is_5_ is taken. This path contains a single move movdqa, an SSE instruction:

  .L_remaining_num_blocks_is_5_${rndsuffix}:
  vmovdqu8 	 ($input),%zmm1
  vmovdqu 	 0x40($input),%xmm2

  [...]

  vmovdqu8 	 %zmm1,($output)
  vmovdqu 	 %xmm2,0x40($output)
  add 	 \$0x50,$output
  movdqa 	 %xmm2,%xmm8
  vextracti32x4 	 \$0x1,%zmm10,%xmm0
  [...]

It's perhaps a less known fact that mixing SSE and AVX instruction can lead to severe performance issues. See e.g. Agner Fog's manual Section 9.12 "Transitions between VEX and non-VEX modes" (at time of writing page 132 in https://www.agner.org/optimize/microarchitecture.pdf):

The transitions B → C, C → B and C → A take approximately 70 clock cycles each on the Sandy Bridge

That appears to be true on our test CPU (Sapphire Rapids) from the c7i instance family. Although, page Section 12.10 in Agner's does state:

The severe penalty for mixing 256-bit VEX code with 128-bit non-VEX code in early processors (see chapter 9.12 page 134) is no longer found in the Ice Lake

Sapphire Rapids is not Ice Lake, but is newer though. So, that's slightly confusing. Nonetheless, to recover performance, a fix should simply be to use the VEX instruction instead of the non-VEX i.e. vmovdqa.

Before:

$ ./tool/bssl speed -filter AES-256-XTS -chunks 206,207,208,209
Did 27399500 AES-256-XTS encrypt (206 bytes) operations in 1000006us (27399335.6 ops/sec): 5644.3 MB/s
Did 28754500 AES-256-XTS encrypt (207 bytes) operations in 1000002us (28754442.5 ops/sec): 5952.2 MB/s
Did 5494000 AES-256-XTS encrypt (208 bytes) operations in 1000024us (5493868.1 ops/sec): 1142.7 MB/s
Did 5346750 AES-256-XTS encrypt (209 bytes) operations in 1000030us (5346589.6 ops/sec): 1117.4 MB/s

After:

$ ./tool/bssl speed -filter AES-256-XTS -chunks 206,207,208,209
Did 27859000 AES-256-XTS encrypt (206 bytes) operations in 1000015us (27858582.1 ops/sec): 5738.9 MB/s
Did 27360500 AES-256-XTS encrypt (207 bytes) operations in 1000007us (27360308.5 ops/sec): 5663.6 MB/s
Did 33774750 AES-256-XTS encrypt (208 bytes) operations in 1000003us (33774648.7 ops/sec): 7025.1 MB/s
Did 27810000 AES-256-XTS encrypt (209 bytes) operations in 1000004us (27809888.8 ops/sec): 5812.3 MB/s

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and the ISC license.

codecov-commenter · 2025-01-25T16:49:05Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.95%. Comparing base (81f138a) to head (42a9f04).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #2140   +/-   ##
=======================================
  Coverage   78.95%   78.95%           
=======================================
  Files         610      610           
  Lines      105293   105293           
  Branches    14919    14920    +1     
=======================================
+ Hits        83136    83138    +2     
+ Misses      21505    21502    -3     
- Partials      652      653    +1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

nebeid

Excellent find, @torben-hansen and thanks @andrewhop for reporting.

A bimodal performance occurred in the XTS encrypt AVX512 implementation. We have observed more than 80% drop in performance. This is caused by mixing SSE and AVX instructions in the AVX512 implementation. For a subset of input lengths, the code path contained a single move movdqa, an SSE instruction. Use vmovdqa instead.

A bimodal performance occurred in the XTS encrypt AVX512 implementation. We have observed more than 80% drop in performance. This is caused by mixing SSE and AVX instructions in the AVX512 implementation. For a subset of input lengths, the code path contained a single move movdqa, an SSE instruction. Use vmovdqa instead. (cherry picked from commit 37c2b5e)

…lace SSE instructions that degraded performance for certain input lengths (#2319) Original commits: a39439b and 37c2b5e This is a follow-up to #2228 where an out-of-bound (OOB) read was fixed in the AVX512 implementation of AES-XTS and more tests were added. This cherry-picks: - further hardening tests on checking pre-bound reads #2286 - a fix to a performance glitch on a code path that had an SSE instruction instead of an AVX512 one which was triggered by certain input lengths. #2140 By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and the ISC license.

torben-hansen added 2 commits January 24, 2025 18:22

Avoid mixing SSE and AVX in XTS-mode AVX512 implementation

8d80811

Update build files in generated-src

42a9f04

torben-hansen requested a review from a team as a code owner January 25, 2025 16:27

torben-hansen mentioned this pull request Jan 25, 2025

AVX-512 Support for AES-XTS openssl/openssl#26410

Closed

andrewhop approved these changes Jan 26, 2025

View reviewed changes

justsmth requested a review from nebeid January 27, 2025 13:46

nebeid approved these changes Jan 27, 2025

View reviewed changes

nebeid reviewed Jan 27, 2025

View reviewed changes

torben-hansen merged commit 37c2b5e into aws:main Jan 27, 2025

nebeid mentioned this pull request Apr 8, 2025

Cherry-pick: Harden bound checking tests of AES-XTS and replace SSE instructions that degraded performance for certain input lengths #2319

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoid mixing SSE and AVX in XTS-mode AVX512 implementation#2140

Avoid mixing SSE and AVX in XTS-mode AVX512 implementation#2140
torben-hansen merged 2 commits intoaws:mainfrom
torben-hansen:xts_avx512_bimodal_performance

torben-hansen commented Jan 25, 2025 •

edited

Loading

Uh oh!

codecov-commenter commented Jan 25, 2025

Uh oh!

nebeid left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

torben-hansen commented Jan 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Issues:

Description of changes:

Uh oh!

codecov-commenter commented Jan 25, 2025

Codecov Report

Uh oh!

nebeid left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

torben-hansen commented Jan 25, 2025 •

edited

Loading