Skip to content

txnprovider/txpool: drop and kick peers on blob KZG verify failure#21421

Merged
yperbasis merged 9 commits into
mainfrom
yperbasis/blob-kzg-batch-shortcircuit
May 27, 2026
Merged

txnprovider/txpool: drop and kick peers on blob KZG verify failure#21421
yperbasis merged 9 commits into
mainfrom
yperbasis/blob-kzg-batch-shortcircuit

Conversation

@yperbasis

@yperbasis yperbasis commented May 26, 2026

Copy link
Copy Markdown
Member

Summary

Fixes GHSA-f4rj-p5qm-7m36. Two pieces, one commit each:

1. Within-batch break in validateTxns — once one remote (IsLocal=false) blob txn fails KZG verify, the per-tx loop stops; trailing slots are excluded from goodTxns via a checkedCount cap. Caps per-batch attacker CPU at one verify regardless of N. Local txns keep per-tx semantics, so an eth_sendRawTransaction batch with one bad tx still validates the rest.

2. Peer-kick in processRemoteTxnsAddRemoteTxnsFromPeer(ctx, slots, peerID, sentry) retains the source peer per slot in unprocessedRemotePeers. After validateTxns, kickKZGOffenders calls PenalizePeer(Kick) on each UnmatchedBlobTxExt slot's peer (deduped, at most once per pass). Mirrors the across-batch half of geth fdfd1235ac8ed9e1ed95a8db3e7ad0c54f4eb45f. Together with (1) this eliminates both the per-batch amplification and the cross-batch flood.

Scope kept to KZG verify failure (UnmatchedBlobTxExt) only — fetch.go already kicks for the cheap protocol violations (BlobHashCheckFail, UnequalBlobTxExt) via checkBlobSidecar, and the remaining validateTxns reasons (NonceTooLow, InsufficientFunds, UnderPriced, fork-skew, …) aren't protocol violations worth kicking for.

Test plan

  • TestValidateTxnsBlobKZGShortCircuit — drives validateTxns directly with a 2-txn batch where the first txn's blob byte is mutated post-proof (mirrors advisory PoC: cheap commit→versioned-hash check passes, KZG pairing fails). Remote subtest: goodTxns empty. Local subtest: valid txn[1] reaches goodTxns.
  • TestProcessRemoteTxnsKicksKZGOffender — end-to-end via AddRemoteTxnsFromPeer + processRemoteTxns against a MockSentryServer; asserts PenalizePeer(Kick) called once with the attacker's PeerId.
  • Both tests confirmed to fail when the respective fix is stashed (within-batch break removed → goodTxns non-empty; kickKZGOffenders call removed → gomock reports missing PenalizePeer call).
  • Existing tests still green: TestPenalizePeerForMalformedMessages, TestDropRemoteAtNoGossip, TestBlobSlots, TestFetch.
  • make lint clean.

🤖 Generated with Claude Code

…G verify failure

GHSA-f4rj-p5qm-7m36.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yperbasis yperbasis requested a review from taratorio as a code owner May 26, 2026 12:04
@yperbasis yperbasis added this to the 3.5.0 milestone May 26, 2026
@yperbasis yperbasis requested a review from Copilot May 26, 2026 12:04

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Mitigates a CPU DoS vector (GHSA-f4rj-p5qm-7m36) by short-circuiting remote batch validation after the first blob KZG verification failure, while preserving per-tx validation semantics for local submissions.

Changes:

  • Short-circuit validateTxns for remote batches after the first UnmatchedBlobTxExt.
  • Ensure goodTxns excludes trailing (unchecked) remote slots after short-circuit.
  • Add a regression test covering remote short-circuiting vs local per-tx behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
txnprovider/txpool/pool.go Adds remote-batch short-circuit logic and limits the “goodTxns” collection loop to checked transactions.
txnprovider/txpool/pool_blob_kzg_shortcircuit_test.go Adds a focused test to assert remote short-circuit behavior and local per-tx semantics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread txnprovider/txpool/pool.go
Comment thread txnprovider/txpool/pool.go
Comment thread txnprovider/txpool/pool.go
Comment thread txnprovider/txpool/pool.go
@yperbasis yperbasis marked this pull request as draft May 26, 2026 12:11
yperbasis and others added 2 commits May 26, 2026 14:27
Plumb (peerID, sentry) through AddRemoteTxnsFromPeer so processRemoteTxns
can call PenalizePeer(Kick) on each UnmatchedBlobTxExt slot, matching the
across-batch half of geth fdfd1235ac8ed9e1ed95a8db3e7ad0c54f4eb45f.

GHSA-f4rj-p5qm-7m36.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Clarify that trailing reasons[i]=NotSet after a remote-batch UnmatchedBlobTxExt
short-circuit means "not validated" rather than "accepted". Caller must also
consult goodTxns to disambiguate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yperbasis yperbasis changed the title txnprovider/txpool: short-circuit remote batch validation on first KZG verify failure txnprovider/txpool: drop and kick peers on blob KZG verify failure May 26, 2026
@yperbasis yperbasis requested a review from Copilot May 26, 2026 12:39

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 4 changed files in this pull request and generated 7 comments.

Files not reviewed (1)
  • txnprovider/txpool/pool_mock.go: Language not supported

Comment thread txnprovider/txpool/pool.go
Comment thread txnprovider/txpool/pool.go
Comment thread txnprovider/txpool/pool.go
Comment thread txnprovider/txpool/pool.go
Comment thread txnprovider/txpool/pool.go
Comment thread txnprovider/txpool/pool.go
Comment thread txnprovider/txpool/pool.go Outdated
processRemoteTxns runs under p.lock; calling sentry.PenalizePeer inline would
block every other pool operation on sentry I/O. Collect the deduped offenders
while still under the lock (using a [64]byte map key to avoid string allocation
per peer) and fire PenalizePeer from a goroutine — the kick is best-effort and
no caller depends on its completion. Test syncs via a channel signaled from the
mock's DoAndReturn.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yperbasis yperbasis marked this pull request as ready for review May 26, 2026 14:01
@yperbasis yperbasis requested a review from anacrolix May 26, 2026 14:01
info@weblogix.biz and others added 2 commits May 26, 2026 14:07
Fold AddRemoteTxnsFromPeer back into AddRemoteTxns now that fetch.go is the
only caller — no need for the two-name split. Drop the [64]byte-keyed dedup
map in kickKZGOffenders: the validateTxns short-circuit guarantees at most
one UnmatchedBlobTxExt per pass, so the dedup was dead code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.

Files not reviewed (1)
  • txnprovider/txpool/pool_mock.go: Language not supported

Comment thread txnprovider/txpool/pool.go
Comment thread txnprovider/txpool/pool_mock.go Outdated
The previous regen invoked mockgen with the full import path, leaving the
header out of sync with pool.go's //go:generate directive (which uses `. Pool`).
Re-regenerate via `go generate` so the header matches the directive and future
regenerations are stable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yperbasis yperbasis enabled auto-merge May 26, 2026 14:55
@yperbasis yperbasis added this pull request to the merge queue May 27, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 27, 2026
@taratorio taratorio added this pull request to the merge queue May 27, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 27, 2026
@yperbasis yperbasis added this pull request to the merge queue May 27, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 27, 2026
@yperbasis yperbasis added this pull request to the merge queue May 27, 2026
@yperbasis yperbasis removed this pull request from the merge queue due to a manual request May 27, 2026
@yperbasis yperbasis added this pull request to the merge queue May 27, 2026
Merged via the queue into main with commit e4deed1 May 27, 2026
89 of 90 checks passed
@yperbasis yperbasis deleted the yperbasis/blob-kzg-batch-shortcircuit branch May 27, 2026 14:47
pull Bot pushed a commit to Dustin4444/erigon that referenced this pull request May 28, 2026
…phemeral merge-queue ref (erigontech#21447)

## Problem

In the merge queue, hive / hive-eest jobs intermittently fail with **all
clients failed to build**:

```
Cloning: erigontech/erigon - gh-readonly-queue/main/pr-XXXXX-<sha>
fatal: Remote branch gh-readonly-queue/main/pr-XXXXX-<sha> not found in upstream origin
→ image build failed (128) → "too few tests" → ci-gate failure
```

`test-hive.yml` / `test-hive-eest.yml` build the hive erigon client via
`clients/erigon/Dockerfile.git`, which runs `git clone --depth 1
--branch $tag https://github.com/$github`. For `merge_group` events
`GITHUB_HEAD_REF` is empty, so `$tag` falls back to
`${GITHUB_REF#refs/heads/}` = the **ephemeral** `gh-readonly-queue/...`
ref. When the queue re-forms the group or a runner is slow to start,
GitHub deletes that ref before the in-container clone runs → the clone
fails. This evicts PRs from the merge queue (e.g. erigontech#21421), independent
of the failing PR's code.

`actions/checkout` doesn't hit this — GitHub reliably provides the
`merge_group` commit to the run; only the *independent* `git clone
--branch <ephemeral-ref>` inside hive's builder races the ref's
deletion.

## Fix

Build the erigon image from the **checked-out commit** and wrap it with
hive's default *prebuilt-image* `Dockerfile` (`FROM $baseimage:$tag`),
in both hive workflows:

- check out the full erigon source into a fresh **`erigon-full`** path.
The old sparse `erigon-src` path leaves a stale
`.git/info/sparse-checkout` on the **reused** self-hosted hive runners;
a full checkout into that same path doesn't clear it, so the tree stays
sparse, `cmd/erigon` is missing, and the build fails (`cd ./cmd/erigon:
No such file or directory`). A fresh path no sparse run has touched
avoids that.
- plain **`docker build -t hive/erigon:cilocal`** on the host daemon
(persistent layer cache on the reused runners) — *not* a shared
`type=gha` cache, so no concurrent-writer 504s and no cross-workflow
scope contention (the failure mode `test-kurtosis-assertoor.yml`
centralizes its build to avoid).
- point hive's default client Dockerfile at that local image
(`baseimage=hive/erigon`, `tag=cilocal`) instead of `mv`-ing in
`Dockerfile.git`.
- drop the now-unused `SOURCE_REPO` / `branch_name` / clone-arg /
builder-Go-version plumbing.

`-docker.pull` defaults to false (CI doesn't pass it), so `FROM
hive/erigon:cilocal` resolves from the local daemon, never force-pulled.
erigon's image is `debian:13-slim` with `erigon` on PATH, so hive's
default Dockerfile (`apt-get`, `erigon --version`) layers on top cleanly
— on both `ethereum/hive` and the `erigontech/hive` fork used by
hive-eest.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants