Skip to content

execution/engineapi: replay cached validation error on bad-block short-circuit#21362

Merged
yperbasis merged 5 commits into
mainfrom
yperbasis/badheaders-cache-validation-err
May 22, 2026
Merged

execution/engineapi: replay cached validation error on bad-block short-circuit#21362
yperbasis merged 5 commits into
mainfrom
yperbasis/badheaders-cache-validation-err

Conversation

@yperbasis

@yperbasis yperbasis commented May 22, 2026

Copy link
Copy Markdown
Member

Resolves #21363. Likely also resolves Mode A (the "previously known bad block" cache short-circuit) of #21364 — leaving that issue open until verified in-tree across all its parametrizations.

Summary

  • When engine_newPayload short-circuits via the badHeaders LRU on a previously-rejected block hash, replay the original validationErr string (e.g. "max initcode size exceeded") instead of the generic "previously known bad block".
  • For parent-inheritance hits, wrap as "ancestor 0x… rejected: <parent err>" so the cause is still traceable when the child hash hadn't been seen on its own before.
  • Falls back to the current "previously known bad block" string when the cache entry has no recorded error (sync-time downloader path, "invalid block number" header check before first cache-populate).
  • Shrink the badHeaders LRU capacity from 10_000 → 96. Each entry now holds a heap-allocated error string (not GC'd while cached); the realistic working set for newPayload bad blocks is in the single digits per session, so 96 leaves substantial headroom while keeping memory well-bounded.

Motivation

Harmless in production until the hive warm-daemon client pool (#20812) started reusing erigon processes across EEST tests. The pool calls debug_setHead(0) between tests but does not clear in-memory caches (the pool doc-comment in internal/libhive/pool.go calls this out). A later EEST test that happens to submit a payload with a block-hash also produced by an earlier test on the same warm daemon hits the short-circuit and gets the generic string in validationError — EEST's ErigonExceptionMapper has no rule for it, so the test fails with the wrong exception even though the block was correctly rejected on first sight.

Concretely: hive-eest / test-hive-eest (glamsterdam-devnet) intermittently failed with 3-4 failures over the max-failures: 2 budget. Two of those failures are deterministic (the documented wrong-EEST-expectation test_fork_transition pair). The flaky 1-2 extras rotated across runs between variants of test_max_initcode_size[over_max] and test_bal_invalid_extraneous_entries[*] — exactly the pattern you'd expect from an LRU-pool-assignment race on top of a process-lifetime cache. See the project memory writeup for the detailed trace.

This also makes Engine API replies more informative for legitimate retries — a CL resubmitting after a config change / debugger session / network glitch now gets the same actionable error twice instead of a generic short-circuit on retry.

Test plan

  • make lint clean (verified locally, 3 passes)
  • make erigon integration clean (verified locally)
  • go test ./execution/engineapi/... -short clean (verified locally — includes new unit tests pinning Report/IsBadHeader round-trip in block_downloader_test.go)
  • hive-eest / test-hive-eest (glamsterdam-devnet) passes on the merge queue without the flaky previously known bad block failures
  • No regression on other hive-eest shards

🤖 Generated with Claude Code

…t-circuit

Previously, when engine_newPayload was called twice for the same bad
block, the second call short-circuited via the badHeaders LRU with a
generic "previously known bad block" string in ValidationError —
discarding the specific reason (e.g. "max initcode size exceeded") that
the first call would have returned.

This was harmless for production until the hive warm-daemon client pool
(PR #20812) started reusing erigon processes across EEST tests. The pool
calls debug_setHead(0) between tests but does not clear in-memory
caches, so a later test that happens to send a payload with a
block-hash also produced by an earlier test gets the cached
short-circuit instead of fresh validation — and EEST's
ErigonExceptionMapper has no rule for the generic string, so the test
fails with the wrong exception even though the block was correctly
rejected on first sight.

Cache the original validationErr string alongside the lastValidHash and
replay it on hit. Fall back to "previously known bad block" when the
cache entry was produced by a path that didn't surface a validation
string (sync-time downloader, "invalid block number" header check). For
parent-inheritance hits, wrap as "ancestor 0x… rejected: <parent err>"
so the cause is still traceable when the child hash hadn't been seen
before.

Beyond the EEST flake, this is also a better Engine API experience for
production: a CL that resubmits a bad payload (retry-after-glitch,
debugging session) now sees the same actionable error twice instead of
a generic short-circuit on retry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves Engine API engine_newPayload retry diagnostics by preserving and replaying the original cached bad-block validation error (instead of returning the generic "previously known bad block"), including an "ancestor … rejected: …" wrapper when the invalidity is inherited from a rejected parent.

Changes:

  • Extend the bad-header LRU cache to store both LatestValidHash and the original validationErr string.
  • Update getQuickPayloadStatusIfPossible to replay cached validationErr (and wrap parent-inheritance causes) when short-circuiting.
  • Plumb validationErr through ReportBadHeader call sites (EngineServer + downloader validation paths).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
execution/engineapi/engine_server.go Replay cached validation errors on bad-header short-circuits; wrap parent-inherited causes; pass validation strings into bad-header reporting.
execution/engineapi/engine_block_downloader/block_downloader.go Store (lastValidAncestor, validationErr) in the bad-headers LRU; plumb validation error from ValidateChain into the cache.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread execution/engineapi/engine_block_downloader/block_downloader.go Outdated
Comment thread execution/engineapi/engine_server.go
Comment thread execution/engineapi/engine_block_downloader/block_downloader.go Outdated
@yperbasis yperbasis force-pushed the yperbasis/badheaders-cache-validation-err branch from d97bf58 to 820b5b3 Compare May 22, 2026 08:47
@yperbasis

Copy link
Copy Markdown
Member Author

Moved the unrelated branch-naming.md doc commit out into #21365. PR #21362 is now scope-pure: just the engineapi change.

The force-push rolled HEAD back from d97bf581d7 to 820b5b3395 — the latter is the original engineapi-only commit, unchanged from what was already in this PR. @mh0lt's approval still applies (strict subset of what was approved).

- Drop the BadHeaderEntry doc comment (the field names are
  self-descriptive and the prior wording incorrectly listed the
  "invalid block number" header check as an empty-string path —
  this PR populates it). Per #21355.
- Shrink the badHeaders LRU from 10_000 to 96. Each entry now holds a
  heap-allocated string that won't be GC'd while cached, and the
  realistic working set on a healthy node is tiny.
- Add unit tests pinning the ReportBadHeader/IsBadHeader round-trip
  for the validation-error string, including the empty-err fallback
  and the later-report-overrides path.
- Drop the scenario-narrative comment block above the IsBadHeader
  short-circuit per #21355.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread execution/engineapi/engine_block_downloader/block_downloader.go
Comment thread execution/engineapi/engine_server.go
yperbasis and others added 2 commits May 22, 2026 12:17
PR #21362 fixes the EIP-7954 test_max_initcode_size[over_max] flake
(Mode A of #21364), so the budget no longer needs to cover it. Mode B
(test_bal_invalid_extraneous_entries) can still fire intermittently
until further investigation, so one flake slot remains on top of the
two known wrong-test-expectation baseline failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread execution/engineapi/engine_server.go Outdated
Comment thread .github/workflows/test-hive-eest.yml Outdated
Adds a small generic helper

    func Deref[T any](p *T) T

and adopts it in six spots that all hand-rolled the same shape — an
explicit nil check followed by *p, with the zero value of T as the
implicit fallback. Five different pointer element types are
consolidated (*string, *[]byte, *uint64, *log.Logger, plus the
package-private derefString that this PR introduced earlier).

Also trims the glamsterdam-devnet workflow comment: now that the
EIP-7954 test_max_initcode_size flake is fixed by this PR, the
budget breakdown no longer needs to enumerate the "Mode A / Mode B"
framing — the one remaining flake slot is for #21364.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yperbasis yperbasis enabled auto-merge May 22, 2026 10:47
@yperbasis yperbasis added this pull request to the merge queue May 22, 2026
Merged via the queue into main with commit 7ec886b May 22, 2026
74 checks passed
@yperbasis yperbasis deleted the yperbasis/badheaders-cache-validation-err branch May 22, 2026 13:55
taratorio added a commit that referenced this pull request May 29, 2026
…ard download when gap exceeds limit (#21502)

Cherry-pick of #21489 to `release/3.4`.

## r3.4-specific adaptations

- `execution/engineapi/engine_block_downloader/block_downloader_test.go`
did not exist on `release/3.4` (file was introduced by #21362, not
backported). The cherry-pick keeps only the new
`TestNewPayloadGapExceedsLimit` test added by #21489; the unrelated
`TestBadHeader_*` tests from #21362 are not pulled in.

Co-authored-by: Andrew Ashikhmin <34320705+yperbasis@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

hive eest: flaky EIP-7954 test_max_initcode_size on glamsterdam-devnet shard

4 participants