Optimize Avx512 Keccak #9413

Draft: benaadams wants to merge 31 commits into master from optimize-keccak

@benaadams benaadams commented Oct 5, 2025

Member

Changes

Performance improvements:

  • Code size: 1919 → 994 bytes (48% reduction)
  • Stack usage: 1104 → 32 bytes (1072 bytes saved, 10 fewer XMM saves)
  • Hot loop: eliminated ~40 constant vector loads per iteration

Optimizations:

  • Hoisted constants - moved rotation/permutation vectors to static readonly fields, eliminating redundant vector creations and memory loads in the hot loop
  • Improved instruction scheduling - restructured the Keccak round function to overlap independent operations:
    • Theta: use 3-way TernaryLogic XOR to avoid materializing intermediate results
    • Rho+Pi: pipeline permutes immediately after rotates complete
    • Pi: rewrite using matrix transpose (unpack/shuffle) instead of 25× PermuteVar8x64x2 calls
  • Cache optimization - increased KeccakCache entry size from 96 to 128 bytes (two 64-byte cache lines per entry), added SSE prefetch hints
  • Benchmarking - added BenchmarkHash method for AVX-512 vs scalar comparison
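
As a rough illustration of the "hoisted constants" bullet, the sketch below keeps one permutation index vector in a static readonly field and reuses it on every call. The class and member names are hypothetical, the index values are copied from the RWD64 row in the listings in this description, and the portable `Vector512.Shuffle` (.NET 8 or later) stands in for whichever intrinsic the PR actually uses.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class HoistedConstantsSketch
{
    // Hypothetical field name. Lanes 0-4 rotate left by one position,
    // lanes 5-7 stay where they are (values match the RWD64 row).
    private static readonly Vector512<ulong> PermuteLeft1 =
        Vector512.Create(1UL, 2UL, 3UL, 4UL, 0UL, 5UL, 6UL, 7UL);

    // The index vector is read from the static field instead of being
    // rebuilt with Vector512.Create on every call.
    public static Vector512<ulong> RotateRowLeft1(Vector512<ulong> row)
        => Vector512.Shuffle(row, PermuteLeft1);

    public static void Main()
    {
        Vector512<ulong> row =
            Vector512.Create(10UL, 11UL, 12UL, 13UL, 14UL, 15UL, 16UL, 17UL);
        Vector512<ulong> rotated = RotateRowLeft1(row);

        // Print each lane of the permuted row.
        for (int i = 0; i < Vector512<ulong>.Count; i++)
        {
            Console.WriteLine($"lane {i}: {rotated.GetElement(i)}");
        }
    }
}
```

Whether the JIT keeps such a field in a register across the loop depends on the runtime version and tiering, so the generated assembly is still worth checking.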

Assembly impact:

Before: hot loop with 40+ memory loads for constant vectors, poor instruction interleaving

vmovups  zmm18, zmmword ptr [reloc @RWD64]
vmovaps  zmm19, zmm18
vpermi2q zmm19, zmm6, zmm7
vmovups  zmm20, zmmword ptr [reloc @RWD128]  ; repeated 40+ times

After: no constant loads, better scheduling

vpternlogq zmm0, zmm24, zmm25, -106  ; direct use of hoisted constants
vprolvq  zmm0, zmm0, zmm19
vpermq   zmm1, zmm16, zmm1            ; overlapped execution
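
The disassembler prints the vpternlogq immediates as signed bytes: -106 is 0x96 and -46 (used later in the "After" listing) is 0xD2. The scalar model below is an illustration with hypothetical names, not code from the PR. It evaluates the truth-table lookup bit by bit and checks that 0x96 is a three-way XOR and that 0xD2 computes a ^ (~b & c), the shape of Keccak's Chi step.

```csharp
using System;

public static class TernaryLogicSketch
{
    // Scalar model of vpternlogq on one 64-bit lane: for every bit position,
    // the bits of a, b and c form a 3-bit index (a is the high bit) that
    // selects one bit of the 8-bit truth table imm8.
    public static ulong TernaryLogic(ulong a, ulong b, ulong c, byte imm8)
    {
        ulong result = 0;
        for (int bit = 0; bit < 64; bit++)
        {
            int index = (int)((((a >> bit) & 1UL) << 2)
                            | (((b >> bit) & 1UL) << 1)
                            | ((c >> bit) & 1UL));
            result |= (ulong)((imm8 >> index) & 1) << bit;
        }
        return result;
    }

    public static void Main()
    {
        byte xor3 = unchecked((byte)-106); // 0x96
        byte chi = unchecked((byte)-46);   // 0xD2

        ulong a = 0x0123456789ABCDEFUL;
        ulong b = 0xFEDCBA9876543210UL;
        ulong c = 0x0F0F0F0FF0F0F0F0UL;

        // Both comparisons hold for any inputs because the lookup is bitwise.
        Console.WriteLine(TernaryLogic(a, b, c, xor3) == (a ^ b ^ c));
        Console.WriteLine(TernaryLogic(a, b, c, chi) == (a ^ (~b & c)));
    }
}
```

In the listings the first operand plays the role of `a`, which is why a single vpternlogq can replace two separate XOR instructions in the Theta step.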

Before

G_M000_IG04:                ;; offset=0x013F
       vmovaps  zmm18, zmm6
       vpternlogq zmm18, zmm7, zmm8, -106
       vpternlogq zmm18, zmm10, zmm9, -106
       vpermq   zmm19, zmm2, zmm18
       vpermq   zmm18, zmm0, zmm18
       vprolq   zmm18, zmm18, 1
       vpxorq   zmm18, zmm18, zmm19
       vpxorq   zmm6, zmm6, zmm18
       vpxorq   zmm7, zmm7, zmm18
       vpxorq   zmm8, zmm8, zmm18
       vpxorq   zmm9, zmm9, zmm18
       vpxorq   zmm10, zmm10, zmm18
       vprolvq  zmm6, zmm6, zmm3
       vprolvq  zmm7, zmm7, zmm4
       vprolvq  zmm8, zmm8, zmm5
       vprolvq  zmm9, zmm9, zmm16
       vprolvq  zmm10, zmm10, zmm17
       vmovups  zmm18, zmmword ptr [reloc @RWD64]
       vmovaps  zmm19, zmm18
       vpermi2q zmm19, zmm6, zmm7
       vmovups  zmm20, zmmword ptr [reloc @RWD128]
       vpermt2q zmm19, zmm20, zmm8
       vmovups  zmm20, zmmword ptr [reloc @RWD192]
       vpermt2q zmm19, zmm20, zmm9
       vmovups  zmm20, zmmword ptr [reloc @RWD256]
       vpermt2q zmm19, zmm20, zmm10
       vmovups  zmm20, zmmword ptr [reloc @RWD320]
       vmovaps  zmm21, zmm6
       vpermt2q zmm21, zmm20, zmm7
       vmovups  zmm20, zmmword ptr [reloc @RWD384]
       vpermt2q zmm21, zmm20, zmm8
       vmovups  zmm20, zmmword ptr [reloc @RWD448]
       vpermt2q zmm21, zmm20, zmm9
       vmovups  zmm20, zmmword ptr [reloc @RWD512]
       vpermt2q zmm21, zmm20, zmm10
       vmovups  zmm20, zmmword ptr [reloc @RWD576]
       vmovaps  zmm22, zmm6
       vpermt2q zmm22, zmm20, zmm7
       vmovups  zmm20, zmmword ptr [reloc @RWD640]
       vpermt2q zmm22, zmm20, zmm8
       vmovups  zmm20, zmmword ptr [reloc @RWD704]
       vpermt2q zmm22, zmm20, zmm9
       vmovups  zmm20, zmmword ptr [reloc @RWD768]
       vpermt2q zmm22, zmm20, zmm10
       vmovups  zmm20, zmmword ptr [reloc @RWD832]
       vmovaps  zmm23, zmm6
       vpermt2q zmm23, zmm20, zmm7
       vmovups  zmm20, zmmword ptr [reloc @RWD896]
       vpermt2q zmm23, zmm20, zmm8
       vmovups  zmm20, zmmword ptr [reloc @RWD960]
       vpermt2q zmm23, zmm20, zmm9
       vmovups  zmm20, zmmword ptr [reloc @RWD1024]
       vpermt2q zmm23, zmm20, zmm10
       vmovups  zmm20, zmmword ptr [reloc @RWD1088]
       vpermt2q zmm6, zmm20, zmm7
       vmovups  zmm20, zmmword ptr [reloc @RWD1152]
       vpermt2q zmm6, zmm20, zmm8
       vmovups  zmm20, zmmword ptr [reloc @RWD1216]
       vpermt2q zmm6, zmm20, zmm9
       vmovups  zmm20, zmmword ptr [reloc @RWD1280]
       vpermt2q zmm10, zmm20, zmm6

After

G_M000_IG04:                ;; offset=0x00DA
       vmovaps  zmm24, zmm0
       vpternlogq zmm24, zmm1, zmm2, -106
       vpternlogq zmm24, zmm4, zmm3, -106
       vpermq   zmm25, zmm16, zmm24
       vpermq   zmm24, zmm5, zmm24
       vprolq   zmm25, zmm25, 1
       vpternlogq zmm0, zmm24, zmm25, -106
       vpternlogq zmm1, zmm24, zmm25, -106
       vpternlogq zmm2, zmm24, zmm25, -106
       vpternlogq zmm3, zmm24, zmm25, -106
       vpternlogq zmm4, zmm24, zmm25, -106
       vprolvq  zmm0, zmm0, zmm19
       vprolvq  zmm1, zmm1, zmm20
       vpermq   zmm1, zmm16, zmm1
       vprolvq  zmm2, zmm2, zmm21
       vpermq   zmm2, zmm17, zmm2
       vprolvq  zmm3, zmm3, zmm22
       vpermq   zmm3, zmm18, zmm3
       vprolvq  zmm4, zmm4, zmm23
       vpunpcklqdq zmm24, zmm0, zmm1
       vpunpcklqdq zmm25, zmm2, zmm3
       vpunpckhqdq zmm0, zmm0, zmm1
       vpunpckhqdq zmm1, zmm2, zmm3
       vpermq   zmm4, zmm5, zmm4
       vpunpcklqdq zmm2, zmm4, zmm4
       vpunpckhqdq zmm3, zmm4, zmm4
       vshufi64x2 zmm4, zmm24, zmm25, 68
       vshufi64x2 zmm0, zmm0, zmm1, 68
       vshufi64x2 zmm1, zmm4, zmm2, -120
       vshufi64x2 zmm26, zmm0, zmm3, -35
       vshufi64x2 zmm0, zmm0, zmm3, -120
       vshufi64x2 zmm3, zmm24, zmm25, -18
       vshufi64x2 zmm3, zmm3, zmm2, -88
       vpermq   zmm24, zmm16, zmm1
       vpermq   zmm25, zmm17, zmm1
       vpermq   zmm27, zmm16, zmm26
       vpermq   zmm28, zmm17, zmm26
       vpternlogq zmm1, zmm24, zmm25, -46
       vpermq   zmm24, zmm16, zmm0
       vpermq   zmm25, zmm17, zmm0
       vpermq   zmm29, zmm16, zmm3
       vpermq   zmm30, zmm17, zmm3
       vpternlogq zmm26, zmm27, zmm28, -46
       vpternlogq zmm0, zmm24, zmm25, -46
       vshufi64x2 zmm4, zmm4, zmm2, -35
       vpermq   zmm2, zmm16, zmm4
       vpermq   zmm24, zmm17, zmm4
       vpternlogq zmm3, zmm29, zmm30, -46
       vpternlogq zmm4, zmm2, zmm24, -46
       vmovq    xmm2, qword ptr [rax]
       vpxord   zmm1, zmm2, zmm1
       add      rax, 8
       vmovaps  zmm2, zmm1

  • 1072 bytes less stack used; 10 fewer 16-byte XMM register saves in the prologue

Before

; Method Nethermind.Core.Crypto.KeccakHash:KeccakF1600Avx512F(System.Span`1[ulong]) (FullOpts)
G_M000_IG01:                ;; offset=0x0000
       push     rbx
       sub      rsp, 1104
       vmovaps  xmmword ptr [rsp+0x440], xmm6
       vmovaps  xmmword ptr [rsp+0x430], xmm7
       vmovaps  xmmword ptr [rsp+0x420], xmm8
       vmovaps  xmmword ptr [rsp+0x410], xmm9
       vmovaps  xmmword ptr [rsp+0x400], xmm10
       vmovaps  xmmword ptr [rsp+0x3F0], xmm11
       vmovaps  xmmword ptr [rsp+0x3E0], xmm12
       vmovaps  xmmword ptr [rsp+0x3D0], xmm13
       vmovaps  xmmword ptr [rsp+0x3C0], xmm14
       vmovaps  xmmword ptr [rsp+0x3B0], xmm15

G_M000_IG02:                ;; offset=0x0062

After

; Method Nethermind.Core.Crypto.KeccakHash:KeccakF1600Avx512F(System.Span`1[ulong]) (FullOpts)
G_M000_IG01:                ;; offset=0x0000
       push     rbx
       sub      rsp, 32

G_M000_IG02:                ;; offset=0x0005

  • Reduced code size and constant loads

Before

G_M000_IG11:                ;; offset=0x0779
       call     CORINFO_HELP_RNGCHKFAIL
       int3     
RWD00  	dq	FFFFFFFFFFFFFFFFh, FFFFFFFFFFFFFFFFh, FFFFFFFFFFFFFFFFh, FFFFFFFFFFFFFFFFh, FFFFFFFFFFFFFFFFh, 0000000000000000h, 0000000000000000h, 0000000000000000h
RWD64  	dq	0000000000000001h, 0000000000000002h, 0000000000000003h, 0000000000000004h, 0000000000000000h, 0000000000000005h, 0000000000000006h, 0000000000000007h
RWD128 	dq	0000000000000002h, 0000000000000003h, 0000000000000004h, 0000000000000000h, 0000000000000001h, 0000000000000005h, 0000000000000006h, 0000000000000007h
RWD192 	dq	0000000000000004h, 0000000000000000h, 0000000000000001h, 0000000000000002h, 0000000000000003h, 0000000000000005h, 0000000000000006h, 0000000000000007h
RWD256 	dq	0000000000000000h, 0000000000000001h, 000000000000003Eh, 000000000000001Ch, 000000000000001Bh, 0000000000000000h, 0000000000000000h, 0000000000000000h
RWD320 	dq	0000000000000024h, 000000000000002Ch, 0000000000000006h, 0000000000000037h, 0000000000000014h, 0000000000000000h, 0000000000000000h, 0000000000000000h
RWD384 	dq	0000000000000003h, 000000000000000Ah, 000000000000002Bh, 0000000000000019h, 0000000000000027h, 0000000000000000h, 0000000000000000h, 0000000000000000h
RWD448 	dq	0000000000000029h, 000000000000002Dh, 000000000000000Fh, 0000000000000015h, 0000000000000008h, 0000000000000000h, 0000000000000000h, 0000000000000000h
RWD512 	dq	0000000000000012h, 0000000000000002h, 000000000000003Dh, 0000000000000038h, 000000000000000Eh, 0000000000000000h, 0000000000000000h, 0000000000000000h
RWD576 	dq	0000000000000000h, 0000000000000009h, 0000000000000002h, 0000000000000003h, 0000000000000004h, 0000000000000005h, 0000000000000006h, 0000000000000007h
RWD640 	dq	0000000000000000h, 0000000000000001h, 000000000000000Ah, 0000000000000003h, 0000000000000004h, 0000000000000005h, 0000000000000006h, 0000000000000007h
RWD704 	dq	0000000000000000h, 0000000000000001h, 0000000000000002h, 000000000000000Bh, 0000000000000004h, 0000000000000005h, 0000000000000006h, 0000000000000007h
RWD768 	dq	0000000000000000h, 0000000000000001h, 0000000000000002h, 0000000000000003h, 000000000000000Ch, 0000000000000005h, 0000000000000006h, 0000000000000007h
RWD832 	dq	0000000000000003h, 000000000000000Ch, 0000000000000002h, 0000000000000003h, 0000000000000004h, 0000000000000005h, 0000000000000006h, 0000000000000007h
RWD896 	dq	0000000000000000h, 0000000000000001h, 0000000000000008h, 0000000000000003h, 0000000000000004h, 0000000000000005h, 0000000000000006h, 0000000000000007h
RWD960 	dq	0000000000000000h, 0000000000000001h, 0000000000000002h, 0000000000000009h, 0000000000000004h, 0000000000000005h, 0000000000000006h, 0000000000000007h
RWD1024	dq	0000000000000000h, 0000000000000001h, 0000000000000002h, 0000000000000003h, 000000000000000Ah, 0000000000000005h, 0000000000000006h, 0000000000000007h
RWD1088	dq	0000000000000001h, 000000000000000Ah, 0000000000000002h, 0000000000000003h, 0000000000000004h, 0000000000000005h, 0000000000000006h, 0000000000000007h
RWD1152	dq	0000000000000000h, 0000000000000001h, 000000000000000Bh, 0000000000000003h, 0000000000000004h, 0000000000000005h, 0000000000000006h, 0000000000000007h
RWD1216	dq	0000000000000000h, 0000000000000001h, 0000000000000002h, 000000000000000Ch, 0000000000000004h, 0000000000000005h, 0000000000000006h, 0000000000000007h
RWD1280	dq	0000000000000000h, 0000000000000001h, 0000000000000002h, 0000000000000003h, 0000000000000008h, 0000000000000005h, 0000000000000006h, 0000000000000007h
RWD1344	dq	0000000000000004h, 0000000000000008h, 0000000000000002h, 0000000000000003h, 0000000000000004h, 0000000000000005h, 0000000000000006h, 0000000000000007h
RWD1408	dq	0000000000000000h, 0000000000000001h, 0000000000000009h, 0000000000000003h, 0000000000000004h, 0000000000000005h, 0000000000000006h, 0000000000000007h
RWD1472	dq	0000000000000000h, 0000000000000001h, 0000000000000002h, 000000000000000Ah, 0000000000000004h, 0000000000000005h, 0000000000000006h, 0000000000000007h
RWD1536	dq	0000000000000000h, 0000000000000001h, 0000000000000002h, 0000000000000003h, 000000000000000Bh, 0000000000000005h, 0000000000000006h, 0000000000000007h
RWD1600	dq	0000000000000002h, 000000000000000Bh, 0000000000000002h, 0000000000000003h, 0000000000000004h, 0000000000000005h, 0000000000000006h, 0000000000000007h
RWD1664	dq	0000000000000000h, 0000000000000001h, 000000000000000Ch, 0000000000000003h, 0000000000000004h, 0000000000000005h, 0000000000000006h, 0000000000000007h
RWD1728	dq	0000000000000000h, 0000000000000001h, 0000000000000002h, 0000000000000008h, 0000000000000004h, 0000000000000005h, 0000000000000006h, 0000000000000007h
RWD1792	dq	0000000000000000h, 0000000000000001h, 0000000000000002h, 0000000000000003h, 0000000000000009h, 0000000000000005h, 0000000000000006h, 0000000000000007h
; Total bytes of code: 1919

After

G_M000_IG09:                ;; offset=0x03CE
       mov      rcx, 0x7FFEEC264120
       call     CORINFO_HELP_GET_GCSTATIC_BASE
       jmp      G_M000_IG03
RWD00  	dq	0000000000000004h, 0000000000000000h, 0000000000000001h, 0000000000000002h, 0000000000000003h, 0000000000000005h, 0000000000000006h, 0000000000000007h
RWD64  	dq	0000000000000001h, 0000000000000002h, 0000000000000003h, 0000000000000004h, 0000000000000000h, 0000000000000005h, 0000000000000006h, 0000000000000007h
RWD128 	dq	0000000000000002h, 0000000000000003h, 0000000000000004h, 0000000000000000h, 0000000000000001h, 0000000000000005h, 0000000000000006h, 0000000000000007h
RWD192 	dq	0000000000000003h, 0000000000000004h, 0000000000000000h, 0000000000000001h, 0000000000000002h, 0000000000000005h, 0000000000000006h, 0000000000000007h
RWD256 	dq	0000000000000000h, 0000000000000001h, 000000000000003Eh, 000000000000001Ch, 000000000000001Bh, 0000000000000000h, 0000000000000000h, 0000000000000000h
RWD320 	dq	0000000000000024h, 000000000000002Ch, 0000000000000006h, 0000000000000037h, 0000000000000014h, 0000000000000000h, 0000000000000000h, 0000000000000000h
RWD384 	dq	0000000000000003h, 000000000000000Ah, 000000000000002Bh, 0000000000000019h, 0000000000000027h, 0000000000000000h, 0000000000000000h, 0000000000000000h
RWD448 	dq	0000000000000029h, 000000000000002Dh, 000000000000000Fh, 0000000000000015h, 0000000000000008h, 0000000000000000h, 0000000000000000h, 0000000000000000h
RWD512 	dq	0000000000000012h, 0000000000000002h, 000000000000003Dh, 0000000000000038h, 000000000000000Eh, 0000000000000000h, 0000000000000000h, 0000000000000000h
; Total bytes of code: 994

Types of changes

What types of changes does your code introduce?

  • Optimization

Testing

Requires testing

  • No

@benaadams benaadams requested a review from Copilot October 5, 2025 01:47

Copilot AI left a comment

Contributor

Pull Request Overview

This PR optimizes the AVX512-based Keccak hash implementation by extracting frequently used vector constants into static readonly fields and reducing redundant vector creations within the hot loop.

  • Hoists vector constants (permutation indices, rotation values) to static readonly fields to reduce GC pressure and improve JIT optimization
  • Refactors Pi step computation to reuse a single indexes vector with element modifications instead of creating new vectors each time
  • Simplifies the Iota step by using Vector512.CreateScalarUnsafe for more efficient round constant application

Comment thread src/Nethermind/Nethermind.Core/Crypto/KeccakHash.cs Outdated
Comment thread src/Nethermind/Nethermind.Core/Crypto/KeccakHash.cs Outdated
LukaszRozmej previously approved these changes Oct 5, 2025

@LukaszRozmej LukaszRozmej left a comment

Member

Any benchmarks?

flcl42 previously approved these changes Oct 30, 2025

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Comment thread src/Nethermind/Nethermind.Core/Crypto/KeccakHash.cs Outdated
Comment thread src/Nethermind/Nethermind.Core/Crypto/KeccakHash.cs
Comment thread src/Nethermind/Nethermind.Core/Crypto/KeccakCache.cs Outdated
Comment thread src/Nethermind/Nethermind.Benchmark/Core/Keccak256Benchmarks.cs Outdated
@benaadams

Member Author

@copilot summarize the changes in this PR

Copilot AI commented Dec 13, 2025

Contributor

@benaadams I've opened a new pull request, #9929, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.


using static System.Numerics.BitOperations;

[assembly: InternalsVisibleTo("Nethermind.Benchmark")]

Copilot AI Dec 14, 2025

The InternalsVisibleTo attribute uses "Nethermind.Benchmark", but the actual root namespace is "Nethermind.Benchmarks" (note the 's' at the end). This mismatch will prevent the benchmark project from accessing internal members like BenchmarkHash and Implementation enum. The attribute should be changed to "Nethermind.Benchmarks" to match the actual namespace used in the benchmark project.

Suggested change
- [assembly: InternalsVisibleTo("Nethermind.Benchmark")]
+ [assembly: InternalsVisibleTo("Nethermind.Benchmarks")]

foreach (Job job in jobs)
{
-    AddJob(job.WithToolchain(InProcessNoEmitToolchain.Instance));
+    AddJob(job);

Copilot AI Dec 14, 2025

The removal of InProcessNoEmitToolchain.Instance could affect benchmarking accuracy and behavior. The InProcess toolchain prevents creating a new process for each benchmark, which can be important for consistent measurements. This change should be intentional and documented, or reverted if it was an accidental modification.

Suggested change
- AddJob(job);
+ AddJob(job.WithToolchain(InProcessNoEmitToolchain.Instance));

Unsafe.As<ulong, Vector256<ulong>>(ref Unsafe.Add(ref MemoryMarshal.GetReference(state), 20)) = c4.GetLower();
state[24] = c4.GetElement(4);
[SkipLocalsInit]
public static void KeccakF1600Avx2(Span<ulong> state)

Copilot AI Dec 14, 2025

KeccakF1600Avx2 is declared as public, but it's only called from internal BenchmarkHash method. For consistency and to prevent unintended external usage, this should be private since it's not part of the public API surface.

Suggested change
- public static void KeccakF1600Avx2(Span<ulong> state)
+ private static void KeccakF1600Avx2(Span<ulong> state)

@benaadams benaadams marked this pull request as draft December 22, 2025 17:03

ref Entry e = ref Unsafe.Add(ref Unsafe.AsRef<Entry>(Memory), index);
if (Sse.IsSupported)
{
Sse.PrefetchNonTemporal((byte*)Unsafe.AsPointer(ref e) + 64);

Contributor

can we use a constant here instead of magic number 64?
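
One way this suggestion could look, as a sketch only: name the cache-line size once and derive the entry size from it. The class, constant and field names are hypothetical, and the real `Entry` layout is not shown in this thread, so the struct here is a stand-in padded to 128 bytes.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

public static class CacheLayoutSketch
{
    // Hypothetical constants: one named cache-line size instead of a bare 64.
    public const int CacheLineSize = 64;
    public const int EntrySize = 2 * CacheLineSize; // 128 bytes, two cache lines

    // Stand-in entry type padded to EntrySize; the real fields are assumed.
    [StructLayout(LayoutKind.Sequential, Size = EntrySize)]
    public struct Entry
    {
        public ulong Placeholder;
    }

    public static void Main()
    {
        // The second cache line of an entry starts CacheLineSize bytes in.
        Console.WriteLine($"entry size: {Unsafe.SizeOf<Entry>()}");
        Console.WriteLine($"second line offset: {CacheLineSize}");
    }
}
```

With a constant like this, the prefetch in the snippet above would read `(byte*)Unsafe.AsPointer(ref e) + CacheLineSize`.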

// State layout:
// - Each zmm holds one Keccak row (y fixed, x varies) in lanes 0-4.
// - Lanes 5-7 are treated as "dead" and must never be permuted into lanes 0-4.
Vector512<ulong> c0 = Unsafe.As<ulong, Vector512<ulong>>(ref state);

Contributor

why don't we need to do the mask thing any more?

@benaadams benaadams dismissed stale reviews from flcl42 and LukaszRozmej via 1161276 April 21, 2026 08:37

6 participants