Fuse XOR-and-permute in BMI2 assembly and minor cleanups#8
Conversation
Combine the XOR of rate bytes into state with the Keccak-f[1600] permutation in a single assembly call, eliminating one full memory pass over the 200-byte state. Remove the Go-level xorIn helper and its unsafe usage. Also: reuse s.buf directly in padAndSqueeze (avoids lastBlock stack alloc), remove redundant BenchmarkSum256_500K, simplify benchName, and clarify D-value register comments in the generator. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR updates the accelerated Keccak-256 implementation to reduce overhead on amd64 BMI2 by allowing the assembly permutation to optionally XOR a full rate block into the state before permuting, and performs a few related cleanups in padding and benchmarks.
Changes:
- Extend
keccakF1600BMI2(amd64/BMI2) to accept an optionalbuf *byteand XOR 136 bytes into the state before running the 24 rounds when provided. - Refactor final-block padding to avoid
unsafe-based XOR helpers and reuse the sponge buffer inpadAndSqueeze. - Remove a redundant benchmark and simplify benchmark naming logic; update generator comments accordingly.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
keccakf_amd64_bmi2.s |
Adds optional pre-round XOR of 17 lanes (rate) from buf into the state before permutation. |
keccak_amd64.go |
Updates the BMI2 function signature and routes keccakF1600 / xorAndPermute through the new assembly entrypoint. |
keccak_asm.go |
Removes unsafe xor helper; updates final padding paths to use xorAndPermute and in-place buffer padding for squeezing. |
gen_keccakf_bmi2.go |
Updates the generator to emit the new BMI2 entrypoint signature and the optional XOR prelude; clarifies comments. |
keccak_test.go |
Removes a duplicate benchmark and simplifies benchName. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
After Read is called, padAndSqueeze mutates the sponge state (XORs padding, permutes). A subsequent Sum256 would re-apply padding and permute a copy of this already-finalized state, producing an incorrect digest. Panic to make this misuse visible. Note: this differs from x/crypto/sha3, where Sum() works after Read() by cloning the full sponge state before finalizing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Re-add xorIn using binary.LittleEndian.Uint64 (no unsafe) for the padding/finalization path (Sum256, padAndSqueeze, sum256Sponge). Keep fused xorAndPermute assembly only for full blocks in Write. This eliminates the ~2% ARM regression from the previous approach (clear + full-rate xorAndPermute for partial blocks) while aligning the ARM and amd64 code paths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
keccakF1600Sha3 now takes an optional buf *byte parameter, matching the amd64 keccakF1600BMI2 pattern. When buf != nil, XORs rate bytes into state before permuting; when nil, just permutes. The round loop and store are shared between both paths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4129299 to
db74886
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Update `github.com/erigontech/fastkeccak` to latest master (`08e7b660`) to pick up erigontech/fastkeccak#8
Update `github.com/erigontech/fastkeccak` to latest master (`08e7b660`) to pick up erigontech/fastkeccak#8
Summary
keccakF1600BMI2now accepts an optionalbuf *byteparameter. When non-nil, it XORs 17 lanes (136 bytes) into state before permuting — eliminating one full memory pass. Removes the Go-levelxorInhelper and itsunsafeusage.keccakF1600andxorAndPermuteinto a singlekeccakF1600Sha3(a, buf)function with the same optional-buf pattern as amd64. Deduplicates the 24-round loop (-143 lines).xorInfor finalization: Re-addxorInusingencoding/binary.LittleEndian.Uint64(nounsafe) for the padding/finalization path (Sum256, padAndSqueeze, sum256Sponge). Full blocks still use fused assemblyxorAndPermute. This avoids an ARM regression from always-XOR-136-bytes on partial blocks.Sumpanics once squeezing has started.useASMon BMI1 && BMI2: The assembly usesANDNQ(BMI1) in addition toRORXQ(BMI2).BenchmarkSum256_500K, simplifybenchName, clarify D-value register comments in the generator.Benchmark results (amd64)
Benchmark results (arm64)
Zero allocations maintained across all benchmarks (0 B/op, 0 allocs/op).
Test plan
go test -v ./...) on both amd64 and arm64go test -fuzz=FuzzSum256 -fuzztime=10s)benchstat(6 runs, 500ms each) on both platforms🤖 Generated with Claude Code