Skip to content

Fuse XOR-and-permute in BMI2 assembly and minor cleanups#8

Merged
AskAlexSharov merged 5 commits into
masterfrom
yperbasis/minor-improvements
Apr 8, 2026
Merged

Fuse XOR-and-permute in BMI2 assembly and minor cleanups#8
AskAlexSharov merged 5 commits into
masterfrom
yperbasis/minor-improvements

Conversation

@yperbasis

@yperbasis yperbasis commented Mar 15, 2026

Copy link
Copy Markdown
Member

Summary

  • Fuse XOR + permute in BMI2 assembly: keccakF1600BMI2 now accepts an optional buf *byte parameter. When non-nil, it XORs 17 lanes (136 bytes) into state before permuting — eliminating one full memory pass. Removes the Go-level xorIn helper and its unsafe usage.
  • Unify ARM64 assembly: Merge keccakF1600 and xorAndPermute into a single keccakF1600Sha3(a, buf) function with the same optional-buf pattern as amd64. Deduplicates the 24-round loop (-143 lines).
  • Safe xorIn for finalization: Re-add xorIn using encoding/binary.LittleEndian.Uint64 (no unsafe) for the padding/finalization path (Sum256, padAndSqueeze, sum256Sponge). Full blocks still use fused assembly xorAndPermute. This avoids an ARM regression from always-XOR-136-bytes on partial blocks.
  • Panic in Sum256 after Read: Sum panics once squeezing has started.
  • Gate useASM on BMI1 && BMI2: The assembly uses ANDNQ (BMI1) in addition to RORXQ (BMI2).
  • Minor cleanups: Remove redundant BenchmarkSum256_500K, simplify benchName, clarify D-value register comments in the generator.

Benchmark results (amd64)

goos: linux
goarch: amd64
cpu: 13th Gen Intel(R) Core(TM) i9-13900KS

                           │    master    │            this PR              │
                           │    sec/op    │   sec/op     vs base            │
FasterKeccak/32B-32           151.8n ± 2%   154.7n ± 2%       ~ (p=0.002)
FasterKeccak/128B-32          154.0n ± 1%   155.7n ± 1%       ~ (p=0.006)
FasterKeccak/256B-32          298.1n ± 2%   298.1n ± 1%       ~ (p=0.987)
FasterKeccak/1K-32            1.190µ ± 1%   1.178µ ± 1%  -1.01% (p=0.006)
FasterKeccak/4K-32            4.614µ ± 2%   4.523µ ± 1%  -1.97% (p=0.002)
FasterKeccak/500K-32          561.9µ ± 1%   551.0µ ± 2%  -1.93% (p=0.009)

Benchmark results (arm64)

goos: darwin
goarch: arm64
cpu: Apple M2 Max

                           │    master     │            this PR               │
                           │    sec/op     │   sec/op     vs base             │
FasterKeccak/32B-12           144.8n ±  8%   145.9n ±  1%       ~ (p=0.970)
FasterKeccak/128B-12          154.0n ±  3%   151.8n ±  1%       ~ (p=0.623)
FasterKeccak/256B-12          299.3n ± 11%   290.0n ±  1%       ~ (p=0.509)
FasterKeccak/1K-12            1.138µ ±  2%   1.127µ ±  5%       ~ (p=0.331)
FasterKeccak/4K-12            4.375µ ±  6%   4.366µ ± 11%       ~ (p=0.699)
FasterKeccak/500K-12          530.7µ ±  2%   528.1µ ±  6%       ~ (p=0.394)
FasterKeccakHasher/4K-12      4.372µ ±  1%   4.331µ ±  3%       ~ (p=0.180)
FasterKeccakHasher/500K-12    530.7µ ±  1%   524.3µ ±  2%       ~ (p=0.180)

Zero allocations maintained across all benchmarks (0 B/op, 0 allocs/op).

Test plan

  • All unit tests pass (go test -v ./...) on both amd64 and arm64
  • Fuzz test passes (go test -fuzz=FuzzSum256 -fuzztime=10s)
  • Benchmarks verified with benchstat (6 runs, 500ms each) on both platforms

🤖 Generated with Claude Code

Combine the XOR of rate bytes into state with the Keccak-f[1600]
permutation in a single assembly call, eliminating one full memory
pass over the 200-byte state. Remove the Go-level xorIn helper and
its unsafe usage.

Also: reuse s.buf directly in padAndSqueeze (avoids lastBlock stack
alloc), remove redundant BenchmarkSum256_500K, simplify benchName,
and clarify D-value register comments in the generator.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the accelerated Keccak-256 implementation to reduce overhead on amd64 BMI2 by allowing the assembly permutation to optionally XOR a full rate block into the state before permuting, and performs a few related cleanups in padding and benchmarks.

Changes:

  • Extend keccakF1600BMI2 (amd64/BMI2) to accept an optional buf *byte and XOR 136 bytes into the state before running the 24 rounds when provided.
  • Refactor final-block padding to avoid unsafe-based XOR helpers and reuse the sponge buffer in padAndSqueeze.
  • Remove a redundant benchmark and simplify benchmark naming logic; update generator comments accordingly.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file
File Description
keccakf_amd64_bmi2.s Adds optional pre-round XOR of 17 lanes (rate) from buf into the state before permutation.
keccak_amd64.go Updates the BMI2 function signature and routes keccakF1600 / xorAndPermute through the new assembly entrypoint.
keccak_asm.go Removes unsafe xor helper; updates final padding paths to use xorAndPermute and in-place buffer padding for squeezing.
gen_keccakf_bmi2.go Updates the generator to emit the new BMI2 entrypoint signature and the optional XOR prelude; clarifies comments.
keccak_test.go Removes a duplicate benchmark and simplifies benchName.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread keccak_asm.go
yperbasis and others added 4 commits April 7, 2026 15:52
After Read is called, padAndSqueeze mutates the sponge state (XORs
padding, permutes). A subsequent Sum256 would re-apply padding and
permute a copy of this already-finalized state, producing an incorrect
digest. Panic to make this misuse visible.

Note: this differs from x/crypto/sha3, where Sum() works after Read()
by cloning the full sponge state before finalizing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Re-add xorIn using binary.LittleEndian.Uint64 (no unsafe) for the
padding/finalization path (Sum256, padAndSqueeze, sum256Sponge).
Keep fused xorAndPermute assembly only for full blocks in Write.

This eliminates the ~2% ARM regression from the previous approach
(clear + full-rate xorAndPermute for partial blocks) while aligning
the ARM and amd64 code paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
keccakF1600Sha3 now takes an optional buf *byte parameter, matching the
amd64 keccakF1600BMI2 pattern. When buf != nil, XORs rate bytes into
state before permuting; when nil, just permutes. The round loop and
store are shared between both paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread keccak_asm.go
@AskAlexSharov AskAlexSharov merged commit 08e7b66 into master Apr 8, 2026
4 checks passed
@AskAlexSharov AskAlexSharov deleted the yperbasis/minor-improvements branch April 8, 2026 01:07
github-merge-queue Bot pushed a commit to erigontech/erigon that referenced this pull request Apr 10, 2026
Update `github.com/erigontech/fastkeccak` to latest master (`08e7b660`)
to pick up erigontech/fastkeccak#8
github-merge-queue Bot pushed a commit to erigontech/erigon that referenced this pull request Apr 10, 2026
Update `github.com/erigontech/fastkeccak` to latest master (`08e7b660`)
to pick up erigontech/fastkeccak#8
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants