Skip to content

Avoid mixing SSE and AVX in XTS-mode AVX512 implementation#2140

Merged
torben-hansen merged 2 commits intoaws:mainfrom
torben-hansen:xts_avx512_bimodal_performance
Jan 27, 2025
Merged

Avoid mixing SSE and AVX in XTS-mode AVX512 implementation#2140
torben-hansen merged 2 commits intoaws:mainfrom
torben-hansen:xts_avx512_bimodal_performance

Conversation

@torben-hansen
Copy link
Copy Markdown
Contributor

@torben-hansen torben-hansen commented Jan 25, 2025

Issues:

P192842055

Description of changes:

Define the set S = [i*128+80,i*128+80+15], i=1,2,3,..., . The set of inputs with lengths from S, on the XTS encrypt path using the AVX512 implementation, a bimodal performance occurred. We have observed more than 80% drop in performance.

The graph depicts AWS-LC (blue) and AWS-LC-FIPS-2024 (orange), showing the pattern.

Screenshot 2025-01-24 at 6 29 13 PM

This is caused by mixing SSE and AVX instructions in the AVX512 implementation. In particular, for lengths in S, the path L_remaining_num_blocks_is_5_ is taken. This path contains a single move movdqa, an SSE instruction:

  .L_remaining_num_blocks_is_5_${rndsuffix}:
  vmovdqu8 	 ($input),%zmm1
  vmovdqu 	 0x40($input),%xmm2

  [...]

  vmovdqu8 	 %zmm1,($output)
  vmovdqu 	 %xmm2,0x40($output)
  add 	 \$0x50,$output
  movdqa 	 %xmm2,%xmm8
  vextracti32x4 	 \$0x1,%zmm10,%xmm0
  [...]

It's perhaps a less known fact that mixing SSE and AVX instruction can lead to severe performance issues. See e.g. Agner Fog's manual Section 9.12 "Transitions between VEX and non-VEX modes" (at time of writing page 132 in https://www.agner.org/optimize/microarchitecture.pdf):

The transitions B → C, C → B and C → A take approximately 70 clock cycles each on the Sandy Bridge

That appears to be true on our test CPU (Sapphire Rapids) from the c7i instance family. Although, page Section 12.10 in Agner's does state:

The severe penalty for mixing 256-bit VEX code with 128-bit non-VEX code in early processors (see chapter 9.12 page 134) is no longer found in the Ice Lake

Sapphire Rapids is not Ice Lake, but is newer though. So, that's slightly confusing. Nonetheless, to recover performance, a fix should simply be to use the VEX instruction instead of the non-VEX i.e. vmovdqa.

Before:

$ ./tool/bssl speed -filter AES-256-XTS -chunks 206,207,208,209
Did 27399500 AES-256-XTS encrypt (206 bytes) operations in 1000006us (27399335.6 ops/sec): 5644.3 MB/s
Did 28754500 AES-256-XTS encrypt (207 bytes) operations in 1000002us (28754442.5 ops/sec): 5952.2 MB/s
Did 5494000 AES-256-XTS encrypt (208 bytes) operations in 1000024us (5493868.1 ops/sec): 1142.7 MB/s
Did 5346750 AES-256-XTS encrypt (209 bytes) operations in 1000030us (5346589.6 ops/sec): 1117.4 MB/s

After:

$ ./tool/bssl speed -filter AES-256-XTS -chunks 206,207,208,209
Did 27859000 AES-256-XTS encrypt (206 bytes) operations in 1000015us (27858582.1 ops/sec): 5738.9 MB/s
Did 27360500 AES-256-XTS encrypt (207 bytes) operations in 1000007us (27360308.5 ops/sec): 5663.6 MB/s
Did 33774750 AES-256-XTS encrypt (208 bytes) operations in 1000003us (33774648.7 ops/sec): 7025.1 MB/s
Did 27810000 AES-256-XTS encrypt (209 bytes) operations in 1000004us (27809888.8 ops/sec): 5812.3 MB/s

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and the ISC license.

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.95%. Comparing base (81f138a) to head (42a9f04).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2140   +/-   ##
=======================================
  Coverage   78.95%   78.95%           
=======================================
  Files         610      610           
  Lines      105293   105293           
  Branches    14919    14920    +1     
=======================================
+ Hits        83136    83138    +2     
+ Misses      21505    21502    -3     
- Partials      652      653    +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@justsmth justsmth requested a review from nebeid January 27, 2025 13:46
Copy link
Copy Markdown
Contributor

@nebeid nebeid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent find, @torben-hansen and thanks @andrewhop for reporting.

@torben-hansen torben-hansen merged commit 37c2b5e into aws:main Jan 27, 2025
manastasova pushed a commit to manastasova/aws-lc that referenced this pull request Jan 30, 2025
A bimodal performance occurred in the XTS encrypt AVX512 implementation. We have observed more than 80% drop in performance. This is caused by mixing SSE and AVX instructions in the AVX512 implementation. For a subset of input lengths, the code path contained a single move movdqa, an SSE instruction. Use vmovdqa instead.
nebeid pushed a commit to nebeid/aws-lc that referenced this pull request Apr 8, 2025
A bimodal performance occurred in the XTS encrypt AVX512 implementation. We have observed more than 80% drop in performance. This is caused by mixing SSE and AVX instructions in the AVX512 implementation. For a subset of input lengths, the code path contained a single move movdqa, an SSE instruction. Use vmovdqa instead.
nebeid pushed a commit to nebeid/aws-lc that referenced this pull request Apr 8, 2025
A bimodal performance occurred in the XTS encrypt AVX512 implementation. We have observed more than 80% drop in performance. This is caused by mixing SSE and AVX instructions in the AVX512 implementation. For a subset of input lengths, the code path contained a single move movdqa, an SSE instruction. Use vmovdqa instead.

(cherry picked from commit 37c2b5e)
nebeid added a commit that referenced this pull request Apr 22, 2025
…lace SSE instructions that degraded performance for certain input lengths (#2319)

Original commits:
a39439b
and
37c2b5e

This is a follow-up to #2228 where an
out-of-bound (OOB) read was fixed in the AVX512 implementation of
AES-XTS and more tests were added.
This cherry-picks:
- further hardening tests on checking pre-bound reads #2286 
- a fix to a performance glitch on a code path that had an SSE
instruction instead of an AVX512 one which was triggered by certain
input lengths. #2140

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license and the ISC license.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants