ci: enable hive-eest client pool (~2.7× faster gate)#20812

Merged: anacrolix merged 9 commits into main from yperbasis/hive-pool-experiment on Apr 27, 2026

@yperbasis yperbasis commented Apr 25, 2026

Summary

Enables the hive warm-daemon client pool on the test-hive-eest.yml workflow.

The pool keeps client containers running across tests and resets chain state via JSON-RPC debug_setHead(0) between them, instead of creating a fresh container per test. On corpora with high pre-state reuse this skips ~1.9 s of per-test overhead (docker create + tar upload + erigon init + daemon boot).
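
As a rough illustration of the reset step described above, the following Go sketch builds and posts a `debug_setHead(0)` JSON-RPC request. It is not hive's implementation: the names `setHeadBody` and `resetToGenesis`, the endpoint URL, and the hex-quantity encoding of the block number (`"0x0"`) are assumptions made for the example.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// rpcRequest is a minimal JSON-RPC 2.0 request envelope.
type rpcRequest struct {
	JSONRPC string `json:"jsonrpc"`
	ID      int    `json:"id"`
	Method  string `json:"method"`
	Params  []any  `json:"params"`
}

// setHeadBody builds the payload for debug_setHead(block).
// Assumption: the block number is passed as a hex quantity string ("0x0").
func setHeadBody(block uint64) ([]byte, error) {
	return json.Marshal(rpcRequest{
		JSONRPC: "2.0",
		ID:      1,
		Method:  "debug_setHead",
		Params:  []any{fmt.Sprintf("0x%x", block)},
	})
}

// resetToGenesis posts debug_setHead(0) to a client's JSON-RPC endpoint.
// rpcURL is hypothetical, e.g. "http://<container-ip>:8545".
// Only the HTTP status is checked here; a fuller version would also
// decode the response and inspect its JSON-RPC "error" field.
func resetToGenesis(rpcURL string) error {
	body, err := setHeadBody(0)
	if err != nil {
		return err
	}
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Post(rpcURL, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("debug_setHead: unexpected HTTP status %s", resp.Status)
	}
	return nil
}
```

The point of the sketch is that a warm reset is a single small HTTP round trip, which is why it can be much cheaper than recreating and booting a container.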

Effect on the hive-eest merge-queue gate:

| Shard | pool.size | Baseline wall | New wall | Δ |
|---|---|---|---|---|
| cancun | 4 | 52.2 min | 18.0 min | −65.5% |
| prague | 4 | 61.0 min | 22.4 min | −63.4% |
| osaka | 4 | 63.0 min | 23.6 min | −62.6% |
| glamsterdam-devnet | 24 | 8.5 min | 6.5 min | −22.9% |
| paris+shanghai | 0 | 13.5 min | 12.3 min | ~baseline |
| consume-rlp | 0 | 32.3 min | 31.1 min | ~baseline |

Pacemaker: 63 min → 24 min (the gate is now 2.7× faster on every PR / merge group). Baseline run #24827222279; current run #24941388548.

Per-shard pool.size

The pool only helps where tests share pre-state. From a trace-instrumented run + offline LRU simulation, the EEST access pattern is depth-2 consecutive: each pre-state is hit twice in a row, then never seen again. So pool.size = 4 captures all available reuse on the long consume-engine shards; bigger caches buy nothing. The other extreme — paris+shanghai and consume-rlp — has ~unique pre-state per test (~0% hit at any cache size) and gets pool.size = 0, which short-circuits every new code path in hive (byte-identical to the pre-pool flow).

| Shard | Reuse | pool.size |
|---|---|---|
| cancun, prague, osaka | ~2× | 4 |
| glamsterdam-devnet | curve still climbs at 24 | 24 |
| paris+shanghai, consume-rlp | ~1× | 0 (pool disabled) |
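
To make the "depth-2 consecutive" argument concrete, here is a small offline LRU simulation in Go. It is only a sketch: the trace is synthetic (a hypothetical `depth2Trace` that interleaves a few parallel workers, each touching a fresh pre-state twice in a row), not the real EEST trace, and `hitRate` models a plain LRU cache rather than hive's pool.

```go
package main

import (
	"container/list"
	"fmt"
)

// hitRate replays trace against an LRU cache holding at most size keys
// and returns the fraction of accesses that were hits.
func hitRate(trace []string, size int) float64 {
	if len(trace) == 0 {
		return 0
	}
	order := list.New()                 // front = most recently used
	index := map[string]*list.Element{} // key -> its element in order
	hits := 0
	for _, key := range trace {
		if el, ok := index[key]; ok {
			hits++
			order.MoveToFront(el)
			continue
		}
		if size <= 0 {
			continue // cache disabled: nothing is ever retained
		}
		index[key] = order.PushFront(key)
		if order.Len() > size {
			oldest := order.Back()
			order.Remove(oldest)
			delete(index, oldest.Value.(string))
		}
	}
	return float64(hits) / float64(len(trace))
}

// depth2Trace builds a synthetic trace (an assumption for illustration):
// `workers` interleaved streams, each using a fresh key twice in a row
// and then never again.
func depth2Trace(workers, rounds int) []string {
	var trace []string
	for r := 0; r < rounds; r++ {
		for rep := 0; rep < 2; rep++ {
			for w := 0; w < workers; w++ {
				trace = append(trace, fmt.Sprintf("w%d-r%d", w, r))
			}
		}
	}
	return trace
}
```

On `depth2Trace(4, 100)`, `hitRate` returns 0 for sizes below 4 and 0.5 for size 4 and for size 24. Under this interleaving assumption, a cache as large as the number of concurrent streams already captures all the reuse there is, and a larger one adds nothing.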

Files

  • .github/workflows/test-hive-eest.yml:
    • Pinned hive checkout to erigontech/hive @ 06b9774991053bf4952c98750c53fc52bceb3991 — see "Pre-merge" below.
    • Added pool-size per matrix entry.
    • Added --client.pool.size=${{ matrix.pool-size }} to the hive invocation.

No other files touched.

Pre-merge

The hive-side change is up as a draft at ethereum/hive#1449. This Erigon PR currently pins the hive checkout at the corresponding erigontech/hive branch, which is fine for the trial dispatches but not for merge.

Before this lands:

  1. ethereum/hive#1449 (libhive: optional warm-daemon client pool) merges.
  2. hive-versions.json gets a routine bump to the new ethereum/hive ref.
  3. The repository: / ref: override in test-hive-eest.yml is reverted to the existing ethereum/hive + steps.hive-version.outputs.ref pattern.
  4. The pool-size matrix var and --client.pool.size flag stay — they're the actual change this PR ships.

I'll squash the experiment commits down to that final shape once #1449 is in.

Test plan

  • All 6 shards green on the bare-metal hive runner group across multiple dispatches.
  • Per-shard pool.size derived from a trace-instrumented run and an offline LRU simulator (the curve plateaus past 4 on long shards).
  • Low-reuse shards (paris+shanghai, consume-rlp) verified at pool.size=0 — wall time matches the no-pool baseline (no regression from the pool's per-test reset overhead).
  • No container-count / disk-pressure issues on the runner (global LRU cap bounds the running daemon count regardless of test corpus).

yperbasis and others added 7 commits April 25, 2026 10:58
…ze=4

Wires the test-hive-eest workflow to a hive prototype that retains
stopped client containers in a pool keyed by (image, sanitized HIVE_*
env, genesis bytes), restarting them across tests instead of
creating-from-scratch every time.

Hive ref: erigontech/hive @ c46813fa410308b4defe849c0518404a6c77505b
Branch:   yperbasis/genesis-pool

The simplified prototype on this branch saves only docker-create + tar-
upload + container-start per pool hit — about 250-400 ms in CI. A
follow-up snapshot-based path could also skip `erigon init` itself but
needs Linux substrate to validate (the cp -a was prohibitively slow on
macOS Docker Desktop during local benchmarking).

Pool size of 4 is intentionally conservative for the first run. With
~4500 unique pre-states across osaka and 21580 tests, expected idle
container count peaks around the same order; we want to confirm it
doesn't blow up disk or the docker daemon before tuning further.

This commit must be reverted before merge (or the workflow merged
unmodified after retargeting at upstream ethereum/hive once the pool
lands there). The PR is purely for CI dispatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

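
The commit above keys the pool by (image, sanitized HIVE_* env, genesis bytes). A minimal Go sketch of such a key follows. The function name `poolKey` and the sanitization rule used here (keep only `HIVE_`-prefixed variables and sort them by name) are assumptions for illustration; the source does not spell out what hive's sanitization does.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"
	"strings"
)

// poolKey derives a bucket key from the client image, the HIVE_* environment
// and the genesis bytes. Keeping only HIVE_* variables and sorting them by
// name is an assumed form of "sanitizing", chosen so that map iteration
// order and unrelated variables cannot change the key.
func poolKey(image string, env map[string]string, genesis []byte) string {
	names := make([]string, 0, len(env))
	for name := range env {
		if strings.HasPrefix(name, "HIVE_") {
			names = append(names, name)
		}
	}
	sort.Strings(names)

	h := sha256.New()
	h.Write([]byte(image))
	h.Write([]byte{0}) // separator so adjacent fields cannot run together
	for _, name := range names {
		h.Write([]byte(name + "=" + env[name]))
		h.Write([]byte{0})
	}
	h.Write(genesis)
	return hex.EncodeToString(h.Sum(nil))
}
```

Two tests that start from the same image, HIVE_* settings and genesis produce the same key and can share a container; any difference in genesis produces a different key.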
Adds back the post-init snapshot save / restore on pool reuse to skip
`erigon init` itself on hits. The Mac substrate constraint that
forced this off doesn't apply on bare-metal Linux runners.

Hive ref: erigontech/hive @ bc2cfecf24066898d0308f967d1d6d66a2dafc41

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Switches the pool from snapshot-restore to warm-daemon: client containers
stay running across tests, hive sends JSON-RPC `debug_setHead(0)` to
revert chain state to genesis between tests. Per-hit cost drops from
~700-1500 ms (full restart) to ~10-20 ms (one HTTP RPC).

Hive ref: erigontech/hive @ c916d4357edcfdb2b23900c2b04157b6b924a84f

Validated locally with TestSetHeadEnablesDaemonReuseAcrossTests in
execution/execmodule (added but not committed) — debug_setHead(0)
correctly rewinds a 1-block chain to genesis in ~285 µs and a different
block 1 inserts cleanly afterward.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Hive ref: erigontech/hive @ 08ea28aa57c5a04afa0bc76afd0d5bbef6640607

Pool flag is now a *global* idle ceiling rather than per-bucket — addresses
the iter 3 failure where ~3500 unique pre-states on paris+shanghai each
got their own running daemon and exhausted the docker daemon. With LRU
eviction the pool's memory footprint is bounded.

Bumped pool.size from 4 to 24 (= 2× parallelism). At ~150 MB per Erigon
daemon that's ~3.6 GB additional RAM, comfortable on the bare-metal
runners. The LRU now has room to actually keep hot pre-states warm
between tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

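
The difference between a per-bucket limit and the global idle ceiling described in this commit can be sketched with a single-list LRU pool. This is an illustrative sketch only: the types `entry` and `pool` and their methods are hypothetical, and the code is not safe for concurrent use without a mutex.

```go
package main

// entry is one idle warm daemon. Both fields are illustrative.
type entry struct {
	key         string // pre-state key, e.g. a hash of image + env + genesis
	containerID string
}

// pool keeps idle entries in one slice, oldest release first, and caps the
// total number of idle entries across all keys.
type pool struct {
	max  int
	idle []entry
}

// acquire removes and returns the most recently released entry for key.
func (p *pool) acquire(key string) (entry, bool) {
	for i := len(p.idle) - 1; i >= 0; i-- {
		if p.idle[i].key == key {
			e := p.idle[i]
			p.idle = append(p.idle[:i], p.idle[i+1:]...)
			return e, true
		}
	}
	return entry{}, false
}

// release returns e to the pool and reports the entries evicted to stay
// within max (least recently released first), so the caller can delete
// their containers.
func (p *pool) release(e entry) (evicted []entry) {
	if p.max <= 0 {
		return []entry{e} // pool disabled: caller tears the container down
	}
	p.idle = append(p.idle, e)
	for len(p.idle) > p.max {
		evicted = append(evicted, p.idle[0])
		p.idle = p.idle[1:]
	}
	return evicted
}
```

Because the cap applies to the whole list rather than to each key, a corpus with thousands of unique pre-states can never hold more than `max` idle daemons at once.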
- Bump hive ref to erigontech/hive yperbasis/client-pool @ 684f7a63
  (the upstream-PR-ready version: simplified single-list LRU pool,
  posted as ethereum/hive#1449).
- Add `pool-size` matrix var: 24 on cancun/prague/osaka/glamsterdam-devnet
  where iter 4 measured a clean -32% to -40% wall-time win, 0 on
  paris+shanghai and rlp where the corpus has unique pre-state per test
  and the warm-daemon path was a small net regression (+7%, +11%).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Trace-instrumented run 24938677388 + offline LRU simulation showed
the hit-rate curve plateaus at ~50% past size=4 on cancun/osaka/prague
(reuse pattern is depth-2 consecutive, captured by a tiny cache).
Drop those shards from 24 → 4 — same hit rate, ~3 GB less RAM per
runner, simpler.

glamsterdam-devnet's curve keeps climbing through size=512; keep at
24. paris+shanghai and rlp stay at 0 (~0% hit at any size).

Hit rate at size=4 vs size=24 from the trace:
  cancun:  46.9% vs 49.8%  (Δ 2.9 pp, ~rounding error wall-time)
  osaka:   47.5% vs 49.1%
  prague:  39.9% vs 42.3%

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ortID)

Three follow-up commits on top of the iter-6 hive ref:

- 204e90f9: log path on warm acquire, networks in pool key, ResetPort,
  Drain↔Release WaitGroup race, suitability docstring.
- 06b97749: TCP probe on Acquire to reject dead daemons, fold Key into
  PoolEntry (Release takes one arg), shortID() in api.startClient slog
  calls.

The probe is the load-bearing fix: cold path's CheckLive wait was
skipped on warm reuse, so a daemon that died between Release and
Acquire would surface as a confusing RPC timeout. Probe now catches
that pre-test, drops the entry, schedules its container for delete,
and walks to the next candidate.

Validation re-run; same per-shard pool sizes (cancun/prague/osaka=4,
glamsterdam=24, paris+shanghai/rlp=0).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
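
The probe-on-acquire behaviour described in this commit (reject a dead daemon, drop it, move on to the next candidate) might look roughly like the Go sketch below. The names `tcpProbe` and `firstLive` are hypothetical, and the liveness check is injected as a function so the walk can be exercised without a network; hive's actual signatures are not shown in the source.

```go
package main

import (
	"net"
	"time"
)

// tcpProbe reports whether a TCP connection to addr opens within timeout.
func tcpProbe(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

// firstLive walks candidates in order and returns the first one for which
// alive reports true, along with the dead candidates skipped on the way so
// the caller can schedule their containers for deletion.
func firstLive(candidates []string, alive func(string) bool) (live string, dead []string, ok bool) {
	for _, addr := range candidates {
		if alive(addr) {
			return addr, dead, true
		}
		dead = append(dead, addr)
	}
	return "", dead, false
}
```

A caller would wire the two together with something like `firstLive(addrs, func(a string) bool { return tcpProbe(a, time.Second) })`, falling back to the cold path when `ok` is false.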
@yperbasis yperbasis changed the title from "[experiment] hive client pool keyed by genesis (pool.size=4)" to "ci: enable hive-eest client pool (~2.7× faster gate)" Apr 25, 2026
@yperbasis yperbasis requested a review from Copilot April 25, 2026 20:59

Copilot AI left a comment

Pull request overview

Enables Hive’s warm-daemon client pool for the EEST workflow to reduce per-test container startup overhead and speed up the hive-eest CI gate.

Changes:

  • Add per-shard pool-size values to the workflow matrix to tune cache effectiveness by shard.
  • Pass --client.pool.size=${{ matrix.pool-size }} to the hive invocation.
  • Temporarily pin the Hive checkout to erigontech/hive at a specific SHA (experimental override).

Comment thread .github/workflows/test-hive-eest.yml Outdated
Comment thread .github/workflows/test-hive-eest.yml
204e90f9 collapsed the pool reset port with options.CheckLive. EEST
consume-engine sets HIVE_CHECK_LIVE_PORT=8551 (engine port with JWT)
so iter 7's reset RPC went to the wrong port and got 403 on every
test. Pool effectively size=0; long shards regressed +140-170%.

Hive a3c9afd9 separates the two: ResetPort defaults to 8545 (the
public JSON-RPC port where debug_setHead lives), overridable via
new HIVE_RESET_PORT env var if any client puts debug elsewhere.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
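
The port selection described in this commit (default 8545, overridable through `HIVE_RESET_PORT`) can be sketched in Go as follows. The function name `resetPort`, the map-based environment, and the validation rules are assumptions; only the default value and the variable name come from the commit message.

```go
package main

import (
	"fmt"
	"strconv"
)

// defaultResetPort is the public JSON-RPC port named in the commit message.
const defaultResetPort = 8545

// resetPort picks the port for the reset RPC. It deliberately ignores
// HIVE_CHECK_LIVE_PORT, which may point at the JWT-protected engine port.
func resetPort(env map[string]string) (int, error) {
	raw := env["HIVE_RESET_PORT"]
	if raw == "" {
		return defaultResetPort, nil
	}
	port, err := strconv.Atoi(raw)
	if err != nil || port < 1 || port > 65535 {
		return 0, fmt.Errorf("invalid HIVE_RESET_PORT %q", raw)
	}
	return port, nil
}
```

With `HIVE_CHECK_LIVE_PORT=8551` set and no override, this still resolves to 8545, which is the separation the fix introduces.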
@yperbasis yperbasis marked this pull request as ready for review April 26, 2026 07:21

@Giulio2002 Giulio2002 left a comment

LGTM — small CI-only workflow change: tunes hive-eest client pool sizes and pins the temporary hive fork/commit with clear revert notes.

@anacrolix anacrolix added this pull request to the merge queue Apr 27, 2026
Merged via the queue into main with commit 8460422 Apr 27, 2026
74 checks passed
@anacrolix anacrolix deleted the yperbasis/hive-pool-experiment branch April 27, 2026 04:01
Sahil-4555 pushed a commit to Sahil-4555/erigon that referenced this pull request Apr 27, 2026
Sahil-4555 pushed a commit to Sahil-4555/erigon that referenced this pull request Apr 27, 2026
lupin012 pushed a commit that referenced this pull request May 2, 2026
pull Bot pushed a commit to Dustin4444/erigon that referenced this pull request May 22, 2026
…t-circuit (erigontech#21362)

Resolves erigontech#21363. Likely also resolves Mode A (the `"previously known bad
block"` cache short-circuit) of erigontech#21364 — leaving that issue open until
verified in-tree across all its parametrizations.

## Summary

- When `engine_newPayload` short-circuits via the `badHeaders` LRU on a
previously-rejected block hash, replay the original `validationErr`
string (e.g. `"max initcode size exceeded"`) instead of the generic
`"previously known bad block"`.
- For parent-inheritance hits, wrap as `"ancestor 0x… rejected: <parent
err>"` so the cause is still traceable when the child hash hadn't been
seen on its own before.
- Falls back to the current `"previously known bad block"` string when
the cache entry has no recorded error (sync-time downloader path,
`"invalid block number"` header check before first cache-populate).
- Shrink the `badHeaders` LRU capacity from 10_000 → 96. Each entry now
holds a heap-allocated error string (not GC'd while cached); the
realistic working set for newPayload bad blocks is in the single digits
per session, so 96 leaves substantial headroom while keeping memory
well-bounded.
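
The three reply cases listed above (replay the recorded error, wrap an inherited parent error, fall back to the generic string) can be summarised in a short Go sketch. The function `cachedValidationError` and its parameters are hypothetical and do not reflect Erigon's actual types; only the message strings are taken from the description.

```go
package main

import "fmt"

// genericBadBlock is the fallback message quoted in the summary.
const genericBadBlock = "previously known bad block"

// cachedValidationError chooses the validationError string for a bad-header
// cache hit. recorded is the error stored when the block was first rejected
// (empty if none was stored). ancestorHash is non-empty when the hit came
// through a rejected ancestor rather than the block's own hash.
func cachedValidationError(recorded, ancestorHash string) string {
	if recorded == "" {
		return genericBadBlock // no stored cause, keep the old behaviour
	}
	if ancestorHash != "" {
		return fmt.Sprintf("ancestor %s rejected: %s", ancestorHash, recorded)
	}
	return recorded
}
```

For example, a direct hit with a stored `"max initcode size exceeded"` replays that string unchanged, so an exception mapper that matched it on first sight also matches it on a repeat submission.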

## Motivation

Harmless in production until the hive warm-daemon client pool (erigontech#20812)
started reusing erigon processes across EEST tests. The pool calls
`debug_setHead(0)` between tests but does not clear in-memory caches
(the pool doc-comment in `internal/libhive/pool.go` calls this out). A
later EEST test that happens to submit a payload with a block-hash also
produced by an earlier test on the same warm daemon hits the
short-circuit and gets the generic string in `validationError` — EEST's
`ErigonExceptionMapper` has no rule for it, so the test fails with the
wrong exception even though the block was correctly rejected on first
sight.

Concretely: `hive-eest / test-hive-eest (glamsterdam-devnet)`
intermittently failed with 3-4 failures over the `max-failures: 2`
budget. Two of those failures are deterministic (the documented
wrong-EEST-expectation `test_fork_transition` pair). The flaky 1-2
extras rotated across runs between variants of
`test_max_initcode_size[over_max]` and
`test_bal_invalid_extraneous_entries[*]` — exactly the pattern you'd
expect from an LRU-pool-assignment race on top of a process-lifetime
cache. See the [project memory
writeup](https://github.com/erigontech/erigon/actions/runs/26233342007/job/77199556910)
for the detailed trace.

This also makes Engine API replies more informative for legitimate
retries — a CL resubmitting after a config change / debugger session /
network glitch now gets the same actionable error twice instead of a
generic short-circuit on retry.

## Test plan

- [ ] `make lint` clean (verified locally, 3 passes)
- [ ] `make erigon integration` clean (verified locally)
- [ ] `go test ./execution/engineapi/... -short` clean (verified locally
— includes new unit tests pinning Report/IsBadHeader round-trip in
`block_downloader_test.go`)
- [ ] `hive-eest / test-hive-eest (glamsterdam-devnet)` passes on the
merge queue without the flaky `previously known bad block` failures
- [ ] No regression on other `hive-eest` shards

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>