Optimize Avx512 Keccak#9413
Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR optimizes the AVX512-based Keccak hash implementation by extracting frequently used vector constants into static readonly fields and reducing redundant vector creations within the hot loop.
- Hoists vector constants (permutation indices, rotation values) to static readonly fields to reduce GC pressure and improve JIT optimization
- Refactors Pi step computation to reuse a single
indexesvector with element modifications instead of creating new vectors each time - Simplifies the Iota step by using
Vector512.CreateScalarUnsafefor more efficient round constant application
|
@copilot summarize the changes in this PR |
|
@benaadams I've opened a new pull request, #9929, to work on those changes. Once the pull request is ready, I'll request review from you. |
|
|
||
| using static System.Numerics.BitOperations; | ||
|
|
||
| [assembly: InternalsVisibleTo("Nethermind.Benchmark")] |
There was a problem hiding this comment.
The InternalsVisibleTo attribute uses "Nethermind.Benchmark", but the actual root namespace is "Nethermind.Benchmarks" (note the 's' at the end). This mismatch will prevent the benchmark project from accessing internal members like BenchmarkHash and Implementation enum. The attribute should be changed to "Nethermind.Benchmarks" to match the actual namespace used in the benchmark project.
| [assembly: InternalsVisibleTo("Nethermind.Benchmark")] | |
| [assembly: InternalsVisibleTo("Nethermind.Benchmarks")] |
| foreach (Job job in jobs) | ||
| { | ||
| AddJob(job.WithToolchain(InProcessNoEmitToolchain.Instance)); | ||
| AddJob(job); |
There was a problem hiding this comment.
The removal of InProcessNoEmitToolchain.Instance could affect benchmarking accuracy and behavior. The InProcess toolchain prevents creating a new process for each benchmark, which can be important for consistent measurements. This change should be intentional and documented, or reverted if it was an accidental modification.
| AddJob(job); | |
| AddJob(job.WithToolchain(InProcessNoEmitToolchain.Instance)); |
| Unsafe.As<ulong, Vector256<ulong>>(ref Unsafe.Add(ref MemoryMarshal.GetReference(state), 20)) = c4.GetLower(); | ||
| state[24] = c4.GetElement(4); | ||
| [SkipLocalsInit] | ||
| public static void KeccakF1600Avx2(Span<ulong> state) |
There was a problem hiding this comment.
KeccakF1600Avx2 is declared as public, but it's only called from internal BenchmarkHash method. For consistency and to prevent unintended external usage, this should be private since it's not part of the public API surface.
| public static void KeccakF1600Avx2(Span<ulong> state) | |
| private static void KeccakF1600Avx2(Span<ulong> state) |
| ref Entry e = ref Unsafe.Add(ref Unsafe.AsRef<Entry>(Memory), index); | ||
| if (Sse.IsSupported) | ||
| { | ||
| Sse.PrefetchNonTemporal((byte*)Unsafe.AsPointer(ref e) + 64); |
There was a problem hiding this comment.
can we use a constant here instead of magic number 64?
| // State layout: | ||
| // - Each zmm holds one Keccak row (y fixed, x varies) in lanes 0-4. | ||
| // - Lanes 5-7 are treated as "dead" and must never be permuted into lanes 0-4. | ||
| Vector512<ulong> c0 = Unsafe.As<ulong, Vector512<ulong>>(ref state); |
There was a problem hiding this comment.
why don't we need to do the mask thing any more?
Changes
Performance improvements:
Optimizations:
BenchmarkHashmethod for AVX-512 vs scalar comparisonAssembly impact:
Before: hot loop with 40+ memory loads for constant vectors, poor instruction interleaving
After: no constant loads, better scheduling
Before
After
Before
After
Before
After
Types of changes
What types of changes does your code introduce?
Testing
Requires testing