Fuse XOR-and-permute in BMI2 assembly and minor cleanups by yperbasis · Pull Request #8 · erigontech/fastkeccak

yperbasis · 2026-03-15T06:53:19Z

Summary

Fuse XOR + permute in BMI2 assembly: keccakF1600BMI2 now accepts an optional buf *byte parameter. When non-nil, it XORs 17 lanes (136 bytes) into state before permuting — eliminating one full memory pass. Removes the Go-level xorIn helper and its unsafe usage.
Unify ARM64 assembly: Merge keccakF1600 and xorAndPermute into a single keccakF1600Sha3(a, buf) function with the same optional-buf pattern as amd64. Deduplicates the 24-round loop (-143 lines).
Safe xorIn for finalization: Re-add xorIn using encoding/binary.LittleEndian.Uint64 (no unsafe) for the padding/finalization path (Sum256, padAndSqueeze, sum256Sponge). Full blocks still use fused assembly xorAndPermute. This avoids an ARM regression from always-XOR-136-bytes on partial blocks.
Panic in Sum256 after Read: Sum panics once squeezing has started.
Gate useASM on BMI1 && BMI2: The assembly uses ANDNQ (BMI1) in addition to RORXQ (BMI2).
Minor cleanups: Remove redundant BenchmarkSum256_500K, simplify benchName, clarify D-value register comments in the generator.

Benchmark results (amd64)

goos: linux
goarch: amd64
cpu: 13th Gen Intel(R) Core(TM) i9-13900KS

                           │    master    │            this PR              │
                           │    sec/op    │   sec/op     vs base            │
FasterKeccak/32B-32           151.8n ± 2%   154.7n ± 2%       ~ (p=0.002)
FasterKeccak/128B-32          154.0n ± 1%   155.7n ± 1%       ~ (p=0.006)
FasterKeccak/256B-32          298.1n ± 2%   298.1n ± 1%       ~ (p=0.987)
FasterKeccak/1K-32            1.190µ ± 1%   1.178µ ± 1%  -1.01% (p=0.006)
FasterKeccak/4K-32            4.614µ ± 2%   4.523µ ± 1%  -1.97% (p=0.002)
FasterKeccak/500K-32          561.9µ ± 1%   551.0µ ± 2%  -1.93% (p=0.009)

Benchmark results (arm64)

goos: darwin
goarch: arm64
cpu: Apple M2 Max

                           │    master     │            this PR               │
                           │    sec/op     │   sec/op     vs base             │
FasterKeccak/32B-12           144.8n ±  8%   145.9n ±  1%       ~ (p=0.970)
FasterKeccak/128B-12          154.0n ±  3%   151.8n ±  1%       ~ (p=0.623)
FasterKeccak/256B-12          299.3n ± 11%   290.0n ±  1%       ~ (p=0.509)
FasterKeccak/1K-12            1.138µ ±  2%   1.127µ ±  5%       ~ (p=0.331)
FasterKeccak/4K-12            4.375µ ±  6%   4.366µ ± 11%       ~ (p=0.699)
FasterKeccak/500K-12          530.7µ ±  2%   528.1µ ±  6%       ~ (p=0.394)
FasterKeccakHasher/4K-12      4.372µ ±  1%   4.331µ ±  3%       ~ (p=0.180)
FasterKeccakHasher/500K-12    530.7µ ±  1%   524.3µ ±  2%       ~ (p=0.180)

Zero allocations maintained across all benchmarks (0 B/op, 0 allocs/op).

Test plan

All unit tests pass (go test -v ./...) on both amd64 and arm64
Fuzz test passes (go test -fuzz=FuzzSum256 -fuzztime=10s)
Benchmarks verified with benchstat (6 runs, 500ms each) on both platforms

🤖 Generated with Claude Code

Combine the XOR of rate bytes into state with the Keccak-f[1600] permutation in a single assembly call, eliminating one full memory pass over the 200-byte state. Remove the Go-level xorIn helper and its unsafe usage. Also: reuse s.buf directly in padAndSqueeze (avoids lastBlock stack alloc), remove redundant BenchmarkSum256_500K, simplify benchName, and clarify D-value register comments in the generator. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR updates the accelerated Keccak-256 implementation to reduce overhead on amd64 BMI2 by allowing the assembly permutation to optionally XOR a full rate block into the state before permuting, and performs a few related cleanups in padding and benchmarks.

Changes:

Extend keccakF1600BMI2 (amd64/BMI2) to accept an optional buf *byte and XOR 136 bytes into the state before running the 24 rounds when provided.
Refactor final-block padding to avoid unsafe-based XOR helpers and reuse the sponge buffer in padAndSqueeze.
Remove a redundant benchmark and simplify benchmark naming logic; update generator comments accordingly.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
`keccakf_amd64_bmi2.s`	Adds optional pre-round XOR of 17 lanes (rate) from `buf` into the state before permutation.
`keccak_amd64.go`	Updates the BMI2 function signature and routes `keccakF1600` / `xorAndPermute` through the new assembly entrypoint.
`keccak_asm.go`	Removes `unsafe` xor helper; updates final padding paths to use `xorAndPermute` and in-place buffer padding for squeezing.
`gen_keccakf_bmi2.go`	Updates the generator to emit the new BMI2 entrypoint signature and the optional XOR prelude; clarifies comments.
`keccak_test.go`	Removes a duplicate benchmark and simplifies `benchName`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

After Read is called, padAndSqueeze mutates the sponge state (XORs padding, permutes). A subsequent Sum256 would re-apply padding and permute a copy of this already-finalized state, producing an incorrect digest. Panic to make this misuse visible. Note: this differs from x/crypto/sha3, where Sum() works after Read() by cloning the full sponge state before finalizing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Re-add xorIn using binary.LittleEndian.Uint64 (no unsafe) for the padding/finalization path (Sum256, padAndSqueeze, sum256Sponge). Keep fused xorAndPermute assembly only for full blocks in Write. This eliminates the ~2% ARM regression from the previous approach (clear + full-rate xorAndPermute for partial blocks) while aligning the ARM and amd64 code paths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

keccakF1600Sha3 now takes an optional buf *byte parameter, matching the amd64 keccakF1600BMI2 pattern. When buf != nil, XORs rate bytes into state before permuting; when nil, just permutes. The round loop and store are shared between both paths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Update `github.com/erigontech/fastkeccak` to latest master (`08e7b660`) to pick up erigontech/fastkeccak#8

yperbasis requested review from Giulio2002 and chfast March 15, 2026 06:54

AskAlexSharov requested a review from Copilot April 7, 2026 01:40

Copilot started reviewing on behalf of AskAlexSharov April 7, 2026 01:40 View session

Copilot AI reviewed Apr 7, 2026

View reviewed changes

yperbasis mentioned this pull request Apr 7, 2026

Implement Read method on Hasher for sponge squeeze output #1

Merged

yperbasis requested a review from AskAlexSharov April 7, 2026 12:44

This was referenced Apr 7, 2026

Optimize ARM slightly #2

Merged

Add BMI2 keccak-f[1600] for amd64 #6

Merged

yperbasis requested a review from Copilot April 7, 2026 13:39

Copilot started reviewing on behalf of yperbasis April 7, 2026 13:39 View session

Copilot AI reviewed Apr 7, 2026

View reviewed changes

Comment thread keccak_asm.go

yperbasis and others added 4 commits April 7, 2026 15:52

Also check for BMI1 since ANDNQ is a BMI1 instruction

a60658f

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

yperbasis force-pushed the yperbasis/minor-improvements branch from 4129299 to db74886 Compare April 7, 2026 13:53

AskAlexSharov requested a review from Copilot April 7, 2026 14:52

Copilot started reviewing on behalf of AskAlexSharov April 7, 2026 14:53 View session

Copilot AI reviewed Apr 7, 2026

View reviewed changes

Comment thread keccak_asm.go

AskAlexSharov approved these changes Apr 8, 2026

View reviewed changes

AskAlexSharov merged commit 08e7b66 into master Apr 8, 2026
4 checks passed

AskAlexSharov deleted the yperbasis/minor-improvements branch April 8, 2026 01:07

yperbasis mentioned this pull request Apr 8, 2026

deps: update fastkeccak erigontech/erigon#20395

Merged

github-merge-queue Bot pushed a commit to erigontech/erigon that referenced this pull request Apr 10, 2026

deps: update fastkeccak (#20395)

2bb276d

Update `github.com/erigontech/fastkeccak` to latest master (`08e7b660`) to pick up erigontech/fastkeccak#8

github-merge-queue Bot pushed a commit to erigontech/erigon that referenced this pull request Apr 10, 2026

deps: update fastkeccak (#20395)

b331d19

Update `github.com/erigontech/fastkeccak` to latest master (`08e7b660`) to pick up erigontech/fastkeccak#8

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fuse XOR-and-permute in BMI2 assembly and minor cleanups#8

Fuse XOR-and-permute in BMI2 assembly and minor cleanups#8
AskAlexSharov merged 5 commits into
masterfrom
yperbasis/minor-improvements

yperbasis commented Mar 15, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

yperbasis commented Mar 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Benchmark results (amd64)

Benchmark results (arm64)

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

yperbasis commented Mar 15, 2026 •

edited

Loading