execution/engineapi: replay cached validation error on bad-block short-circuit#21362
Conversation
…t-circuit Previously, when engine_newPayload was called twice for the same bad block, the second call short-circuited via the badHeaders LRU with a generic "previously known bad block" string in ValidationError — discarding the specific reason (e.g. "max initcode size exceeded") that the first call would have returned. This was harmless for production until the hive warm-daemon client pool (PR #20812) started reusing erigon processes across EEST tests. The pool calls debug_setHead(0) between tests but does not clear in-memory caches, so a later test that happens to send a payload with a block-hash also produced by an earlier test gets the cached short-circuit instead of fresh validation — and EEST's ErigonExceptionMapper has no rule for the generic string, so the test fails with the wrong exception even though the block was correctly rejected on first sight. Cache the original validationErr string alongside the lastValidHash and replay it on hit. Fall back to "previously known bad block" when the cache entry was produced by a path that didn't surface a validation string (sync-time downloader, "invalid block number" header check). For parent-inheritance hits, wrap as "ancestor 0x… rejected: <parent err>" so the cause is still traceable when the child hash hadn't been seen before. Beyond the EEST flake, this is also a better Engine API experience for production: a CL that resubmits a bad payload (retry-after-glitch, debugging session) now sees the same actionable error twice instead of a generic short-circuit on retry. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR improves Engine API engine_newPayload retry diagnostics by preserving and replaying the original cached bad-block validation error (instead of returning the generic "previously known bad block"), including an "ancestor … rejected: …" wrapper when the invalidity is inherited from a rejected parent.
Changes:
- Extend the bad-header LRU cache to store both
LatestValidHashand the originalvalidationErrstring. - Update
getQuickPayloadStatusIfPossibleto replay cachedvalidationErr(and wrap parent-inheritance causes) when short-circuiting. - Plumb
validationErrthroughReportBadHeadercall sites (EngineServer + downloader validation paths).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| execution/engineapi/engine_server.go | Replay cached validation errors on bad-header short-circuits; wrap parent-inherited causes; pass validation strings into bad-header reporting. |
| execution/engineapi/engine_block_downloader/block_downloader.go | Store (lastValidAncestor, validationErr) in the bad-headers LRU; plumb validation error from ValidateChain into the cache. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
d97bf58 to
820b5b3
Compare
|
Moved the unrelated The force-push rolled HEAD back from |
- Drop the BadHeaderEntry doc comment (the field names are self-descriptive and the prior wording incorrectly listed the "invalid block number" header check as an empty-string path — this PR populates it). Per #21355. - Shrink the badHeaders LRU from 10_000 to 96. Each entry now holds a heap-allocated string that won't be GC'd while cached, and the realistic working set on a healthy node is tiny. - Add unit tests pinning the ReportBadHeader/IsBadHeader round-trip for the validation-error string, including the empty-err fallback and the later-report-overrides path. - Drop the scenario-narrative comment block above the IsBadHeader short-circuit per #21355. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cache-validation-err
PR #21362 fixes the EIP-7954 test_max_initcode_size[over_max] flake (Mode A of #21364), so the budget no longer needs to cover it. Mode B (test_bal_invalid_extraneous_entries) can still fire intermittently until further investigation, so one flake slot remains on top of the two known wrong-test-expectation baseline failures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a small generic helper
func Deref[T any](p *T) T
and adopts it in six spots that all hand-rolled the same shape — an
explicit nil check followed by *p, with the zero value of T as the
implicit fallback. Five different pointer element types are
consolidated (*string, *[]byte, *uint64, *log.Logger, plus the
package-private derefString that this PR introduced earlier).
Also trims the glamsterdam-devnet workflow comment: now that the
EIP-7954 test_max_initcode_size flake is fixed by this PR, the
budget breakdown no longer needs to enumerate the "Mode A / Mode B"
framing — the one remaining flake slot is for #21364.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ard download when gap exceeds limit (#21502) Cherry-pick of #21489 to `release/3.4`. ## r3.4-specific adaptations - `execution/engineapi/engine_block_downloader/block_downloader_test.go` did not exist on `release/3.4` (file was introduced by #21362, not backported). The cherry-pick keeps only the new `TestNewPayloadGapExceedsLimit` test added by #21489; the unrelated `TestBadHeader_*` tests from #21362 are not pulled in. Co-authored-by: Andrew Ashikhmin <34320705+yperbasis@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves #21363. Likely also resolves Mode A (the
"previously known bad block"cache short-circuit) of #21364 — leaving that issue open until verified in-tree across all its parametrizations.Summary
engine_newPayloadshort-circuits via thebadHeadersLRU on a previously-rejected block hash, replay the originalvalidationErrstring (e.g."max initcode size exceeded") instead of the generic"previously known bad block"."ancestor 0x… rejected: <parent err>"so the cause is still traceable when the child hash hadn't been seen on its own before."previously known bad block"string when the cache entry has no recorded error (sync-time downloader path,"invalid block number"header check before first cache-populate).badHeadersLRU capacity from 10_000 → 96. Each entry now holds a heap-allocated error string (not GC'd while cached); the realistic working set for newPayload bad blocks is in the single digits per session, so 96 leaves substantial headroom while keeping memory well-bounded.Motivation
Harmless in production until the hive warm-daemon client pool (#20812) started reusing erigon processes across EEST tests. The pool calls
debug_setHead(0)between tests but does not clear in-memory caches (the pool doc-comment ininternal/libhive/pool.gocalls this out). A later EEST test that happens to submit a payload with a block-hash also produced by an earlier test on the same warm daemon hits the short-circuit and gets the generic string invalidationError— EEST'sErigonExceptionMapperhas no rule for it, so the test fails with the wrong exception even though the block was correctly rejected on first sight.Concretely:
hive-eest / test-hive-eest (glamsterdam-devnet)intermittently failed with 3-4 failures over themax-failures: 2budget. Two of those failures are deterministic (the documented wrong-EEST-expectationtest_fork_transitionpair). The flaky 1-2 extras rotated across runs between variants oftest_max_initcode_size[over_max]andtest_bal_invalid_extraneous_entries[*]— exactly the pattern you'd expect from an LRU-pool-assignment race on top of a process-lifetime cache. See the project memory writeup for the detailed trace.This also makes Engine API replies more informative for legitimate retries — a CL resubmitting after a config change / debugger session / network glitch now gets the same actionable error twice instead of a generic short-circuit on retry.
Test plan
make lintclean (verified locally, 3 passes)make erigon integrationclean (verified locally)go test ./execution/engineapi/... -shortclean (verified locally — includes new unit tests pinning Report/IsBadHeader round-trip inblock_downloader_test.go)hive-eest / test-hive-eest (glamsterdam-devnet)passes on the merge queue without the flakypreviously known bad blockfailureshive-eestshards🤖 Generated with Claude Code