ci: enable hive-eest client pool (~2.7× faster gate) by yperbasis · Pull Request #20812 · erigontech/erigon

yperbasis · 2026-04-25T08:59:27Z

Summary

Enables the hive warm-daemon client pool on the test-hive-eest.yml workflow.

The pool keeps client containers running across tests and resets chain state via JSON-RPC debug_setHead(0) between them, instead of creating a fresh container per test. On corpora with high pre-state reuse this skips ~1.9 s of per-test overhead (docker create + tar upload + erigon init + daemon boot).

Effect on the hive-eest merge-queue gate:

| Shard | `pool.size` | Baseline wall | New wall | Δ |
|---|---|---|---|---|
| cancun | 4 | 52.2 min | 18.0 min | −65.5% |
| prague | 4 | 61.0 min | 22.4 min | −63.4% |
| osaka | 4 | 63.0 min | 23.6 min | −62.6% |
| glamsterdam-devnet | 24 | 8.5 min | 6.5 min | −22.9% |
| paris+shanghai | 0 | 13.5 min | 12.3 min | ~baseline |
| consume-rlp | 0 | 32.3 min | 31.1 min | ~baseline |

Pacemaker (the slowest shard, which sets the gate's wall time): 63 min → 24 min, so the gate is now ~2.7× faster on every PR / merge group. Baseline run #24827222279; current run #24941388548.

Per-shard `pool.size`

The pool only helps where tests share pre-state. From a trace-instrumented run + offline LRU simulation, the EEST access pattern is depth-2 consecutive: each pre-state is hit twice in a row, then never seen again. So pool.size = 4 captures all available reuse on the long consume-engine shards; bigger caches buy nothing. The other extreme — paris+shanghai and consume-rlp — has ~unique pre-state per test (~0% hit at any cache size) and gets pool.size = 0, which short-circuits every new code path in hive (byte-identical to the pre-pool flow).

| Shard | Reuse | `pool.size` |
|---|---|---|
| cancun, prague, osaka | ~2× | 4 |
| glamsterdam-devnet | curve still climbs at 24 | 24 |
| paris+shanghai, consume-rlp | ~1× | 0 (pool disabled) |
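The two extremes described above can be reproduced with a toy LRU simulation. This is a minimal sketch on synthetic traces, not the offline simulator used for the PR: `depth2_trace` and `unique_trace` are hypothetical helpers, and the model ignores the interleaving that parallel test execution introduces in the real trace.

```python
from collections import OrderedDict


def lru_hit_rate(trace, size):
    """Fraction of accesses that find their key in an LRU cache of `size` entries."""
    if not trace or size <= 0:
        return 0.0
    cache = OrderedDict()
    hits = 0
    for key in trace:
        if key in cache:
            hits += 1
            cache.move_to_end(key)          # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > size:
                cache.popitem(last=False)   # evict the least recently used key
    return hits / len(trace)


def depth2_trace(n_states):
    """Synthetic 'depth-2 consecutive' trace: each pre-state twice in a row."""
    return [f"state-{i}" for i in range(n_states) for _ in range(2)]


def unique_trace(n_tests):
    """Synthetic trace where every test has its own pre-state."""
    return [f"state-{i}" for i in range(n_tests)]


if __name__ == "__main__":
    for size in (0, 4, 24):
        print(size, lru_hit_rate(depth2_trace(1000), size),
              lru_hit_rate(unique_trace(1000), size))
```

On the depth-2 trace the hit rate is capped at one half however large the cache grows, because only the second access to each pre-state can hit. On the unique trace it stays at zero, which is the case where disabling the pool costs nothing.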

Files

.github/workflows/test-hive-eest.yml:
- Pinned hive checkout to erigontech/hive @ 06b9774991053bf4952c98750c53fc52bceb3991 — see "Pre-merge" below.
- Added pool-size per matrix entry.
- Added --client.pool.size=${{ matrix.pool-size }} to the hive invocation.

No other files touched.

Pre-merge

The hive-side change is up as a draft at ethereum/hive#1449. This Erigon PR currently pins the hive checkout at the corresponding erigontech/hive branch, which is fine for the trial dispatches but not for merge.

Before this lands:

1. ethereum/hive#1449 (libhive: optional warm-daemon client pool) merges.
2. hive-versions.json gets a routine bump to the new ethereum/hive ref.
3. The repository: / ref: override in test-hive-eest.yml is reverted to the existing ethereum/hive + steps.hive-version.outputs.ref pattern.
4. The pool-size matrix var and --client.pool.size flag stay; they're the actual change this PR ships.

I'll squash the experiment commits down to that final shape once #1449 is in.

Test plan

- [x] All 6 shards green on the bare-metal hive runner group across multiple dispatches.
- [x] Per-shard pool.size derived from a trace-instrumented run and an offline LRU simulator (the curve plateaus past 4 on long shards).
- [x] Low-reuse shards (paris+shanghai, consume-rlp) verified at pool.size=0: wall time matches the no-pool baseline (no regression from the pool's per-test reset overhead).
- [x] No container-count / disk-pressure issues on the runner (global LRU cap bounds the running daemon count regardless of test corpus).

…ze=4

Wires the test-hive-eest workflow to a hive prototype that retains stopped client containers in a pool keyed by (image, sanitized HIVE_* env, genesis bytes), restarting them across tests instead of creating-from-scratch every time.

Hive ref: erigontech/hive @ c46813fa410308b4defe849c0518404a6c77505b
Branch: yperbasis/genesis-pool

The simplified prototype on this branch saves only docker-create + tar-upload + container-start per pool hit, about 250-400 ms in CI. A follow-up snapshot-based path could also skip `erigon init` itself but needs a Linux substrate to validate (the cp -a was prohibitively slow on macOS Docker Desktop during local benchmarking).

Pool size of 4 is intentionally conservative for the first run. With ~4500 unique pre-states across osaka and 21580 tests, expected idle container count peaks around the same order; we want to confirm it doesn't blow up disk or the docker daemon before tuning further.

This commit must be reverted before merge (or the workflow merged unmodified after retargeting at upstream ethereum/hive once the pool lands there). The PR is purely for CI dispatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds back the post-init snapshot save / restore on pool reuse to skip `erigon init` itself on hits. The Mac substrate constraint that forced this off doesn't apply on bare-metal Linux runners.

Hive ref: erigontech/hive @ bc2cfecf24066898d0308f967d1d6d66a2dafc41

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Switches the pool from snapshot-restore to warm-daemon: client containers stay running across tests, hive sends JSON-RPC `debug_setHead(0)` to revert chain state to genesis between tests. Per-hit cost drops from ~700-1500 ms (full restart) to ~10-20 ms (one HTTP RPC).

Hive ref: erigontech/hive @ c916d4357edcfdb2b23900c2b04157b6b924a84f

Validated locally with TestSetHeadEnablesDaemonReuseAcrossTests in execution/execmodule (added but not committed): debug_setHead(0) correctly rewinds a 1-block chain to genesis in ~285 µs and a different block 1 inserts cleanly afterward.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
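The reset itself is a single JSON-RPC call. The sketch below shows what such a request could look like from a client script; it is not hive's Go code. It assumes the block number is passed hex-encoded, as in geth's `debug_setHead`, and the URL passed to `reset_to_genesis` is a placeholder the caller must supply.

```python
import json
import urllib.request


def set_head_request(block_number=0, request_id=1):
    """Build the JSON-RPC body for debug_setHead.

    Assumption: the block number is sent as a hex string (geth convention).
    """
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "debug_setHead",
        "params": [hex(block_number)],
    }


def reset_to_genesis(url, timeout=5.0):
    """POST debug_setHead(0) to a running client and return the decoded reply."""
    body = json.dumps(set_head_request(0)).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Only prints the payload; no daemon is contacted here.
    print(json.dumps(set_head_request(0)))
```

A caller would treat an `error` member in the reply, or any HTTP failure, as a reason to discard the daemon and fall back to a cold start.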

Hive ref: erigontech/hive @ 08ea28aa57c5a04afa0bc76afd0d5bbef6640607

Pool flag is now a *global* idle ceiling rather than per-bucket. This addresses the iter 3 failure where ~3500 unique pre-states on paris+shanghai each got their own running daemon and exhausted the docker daemon. With LRU eviction the pool's memory footprint is bounded.

Bumped pool.size from 4 to 24 (= 2× parallelism). At ~150 MB per Erigon daemon that's ~3.6 GB additional RAM, comfortable on the bare-metal runners. The LRU now has room to actually keep hot pre-states warm between tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
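To illustrate the difference between a per-bucket limit and a global idle ceiling, here is a minimal sketch of a pool that keeps all idle daemons in one LRU list. The class and method names are hypothetical and do not mirror hive's Go implementation; the RAM helper only restates the ~150 MB per daemon estimate from the commit message.

```python
from collections import OrderedDict


class WarmPool:
    """Idle daemons in a single LRU list with one global cap, not one cap per key."""

    def __init__(self, max_idle):
        self.max_idle = max_idle
        self._idle = OrderedDict()   # daemon id -> pool key, oldest first
        self.evicted = []            # daemon ids the caller should stop and delete

    def release(self, key, daemon_id):
        """Return a daemon to the pool, evicting least recently used ones over the cap."""
        if self.max_idle <= 0:
            self.evicted.append(daemon_id)   # pool disabled: never keep a daemon
            return
        self._idle[daemon_id] = key
        while len(self._idle) > self.max_idle:
            old_id, _ = self._idle.popitem(last=False)
            self.evicted.append(old_id)

    def acquire(self, key):
        """Take the most recently released idle daemon with a matching key, or None."""
        for daemon_id in reversed(self._idle):
            if self._idle[daemon_id] == key:
                del self._idle[daemon_id]
                return daemon_id
        return None

    def __len__(self):
        return len(self._idle)


def idle_ram_gb(max_idle, mb_per_daemon=150):
    """Upper bound on idle-daemon RAM, using the per-daemon estimate quoted above."""
    return max_idle * mb_per_daemon / 1000
```

With a global cap, thousands of distinct keys can pass through the pool while the number of running idle daemons never exceeds `max_idle`.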

- Bump hive ref to erigontech/hive yperbasis/client-pool @ 684f7a63 (the upstream-PR-ready version: simplified single-list LRU pool, posted as ethereum/hive#1449).
- Add `pool-size` matrix var: 24 on cancun/prague/osaka/glamsterdam-devnet where iter 4 measured a clean -32% to -40% wall-time win, 0 on paris+shanghai and rlp where the corpus has unique pre-state per test and the warm-daemon path was a small net regression (+7%, +11%).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Trace-instrumented run 24938677388 + offline LRU simulation showed the hit-rate curve plateaus at ~50% past size=4 on cancun/osaka/prague (reuse pattern is depth-2 consecutive, captured by a tiny cache). Drop those shards from 24 → 4: same hit rate, ~3 GB less RAM per runner, simpler. glamsterdam-devnet's curve keeps climbing through size=512; keep at 24. paris+shanghai and rlp stay at 0 (~0% hit at any size).

Hit rate at size=4 vs size=24 from the trace:

- cancun: 46.9% vs 49.8% (Δ 2.9 pp, ~rounding error wall-time)
- osaka: 47.5% vs 49.1%
- prague: 39.9% vs 42.3%

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ortID)

Three follow-up commits on top of the iter-6 hive ref:

- 204e90f9: log path on warm acquire, networks in pool key, ResetPort, Drain↔Release WaitGroup race, suitability docstring.
- 06b97749: TCP probe on Acquire to reject dead daemons, fold Key into PoolEntry (Release takes one arg), shortID() in api.startClient slog calls.

The probe is the load-bearing fix: cold path's CheckLive wait was skipped on warm reuse, so a daemon that died between Release and Acquire would surface as a confusing RPC timeout. Probe now catches that pre-test, drops the entry, schedules its container for delete, and walks to the next candidate.

Validation re-run; same per-shard pool sizes (cancun/prague/osaka=4, glamsterdam=24, paris+shanghai/rlp=0).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
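The probe-then-walk behaviour described in this commit can be sketched as follows. This is an illustration in Python rather than hive's Go code; the entry layout (a dict with `host` and `port`) and the function names are assumptions made for the example.

```python
import socket


def is_alive(host, port, timeout=0.5):
    """TCP probe: True if something accepts a connection on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def acquire_live(candidates, probe=is_alive):
    """Walk idle candidates in order.

    Returns (entry, dead): the first entry whose daemon answers the probe
    (or None), plus the entries that failed and should be deleted.
    """
    dead = []
    for entry in candidates:
        if probe(entry["host"], entry["port"]):
            return entry, dead
        dead.append(entry)   # daemon died after Release; schedule its container for delete
    return None, dead
```

Passing the probe in as a parameter keeps the walk testable without opening real sockets. If no candidate survives, the caller falls back to the cold path.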

Copilot

Pull request overview

Enables Hive’s warm-daemon client pool for the EEST workflow to reduce per-test container startup overhead and speed up the hive-eest CI gate.

Changes:

- Add per-shard pool-size values to the workflow matrix to tune cache effectiveness by shard.
- Pass --client.pool.size=${{ matrix.pool-size }} to the hive invocation.
- Temporarily pin the Hive checkout to erigontech/hive at a specific SHA (experimental override).

204e90f9 collapsed the pool reset port with options.CheckLive. EEST consume-engine sets HIVE_CHECK_LIVE_PORT=8551 (engine port with JWT) so iter 7's reset RPC went to the wrong port and got 403 on every test. Pool effectively size=0; long shards regressed +140-170%.

Hive a3c9afd9 separates the two: ResetPort defaults to 8545 (the public JSON-RPC port where debug_setHead lives), overridable via new HIVE_RESET_PORT env var if any client puts debug elsewhere.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
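The fix amounts to resolving the reset port independently of the liveness port. A minimal sketch, assuming the environment is available as a plain mapping; `resolve_reset_port` is a hypothetical helper, and only the two variable names and the 8545 default come from the commit message.

```python
DEFAULT_RESET_PORT = 8545  # public JSON-RPC port, per the commit message


def resolve_reset_port(env):
    """Pick the port for the reset RPC.

    HIVE_CHECK_LIVE_PORT is deliberately ignored: it may point at the
    JWT-protected engine port, where the reset call is rejected.
    """
    raw = env.get("HIVE_RESET_PORT", "").strip()
    if not raw:
        return DEFAULT_RESET_PORT
    port = int(raw)
    if not 0 < port < 65536:
        raise ValueError(f"HIVE_RESET_PORT out of range: {port}")
    return port


if __name__ == "__main__":
    print(resolve_reset_port({"HIVE_CHECK_LIVE_PORT": "8551"}))
```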

Giulio2002

LGTM — small CI-only workflow change: tunes hive-eest client pool sizes and pins the temporary hive fork/commit with clear revert notes.


…t-circuit (erigontech#21362)

Resolves erigontech#21363. Likely also resolves Mode A (the `"previously known bad block"` cache short-circuit) of erigontech#21364; leaving that issue open until verified in-tree across all its parametrizations.

## Summary

- When `engine_newPayload` short-circuits via the `badHeaders` LRU on a previously-rejected block hash, replay the original `validationErr` string (e.g. `"max initcode size exceeded"`) instead of the generic `"previously known bad block"`.
- For parent-inheritance hits, wrap as `"ancestor 0x… rejected: <parent err>"` so the cause is still traceable when the child hash hadn't been seen on its own before.
- Falls back to the current `"previously known bad block"` string when the cache entry has no recorded error (sync-time downloader path, `"invalid block number"` header check before first cache-populate).
- Shrink the `badHeaders` LRU capacity from 10_000 → 96. Each entry now holds a heap-allocated error string (not GC'd while cached); the realistic working set for newPayload bad blocks is in the single digits per session, so 96 leaves substantial headroom while keeping memory well-bounded.

## Motivation

Harmless in production until the hive warm-daemon client pool (erigontech#20812) started reusing erigon processes across EEST tests. The pool calls `debug_setHead(0)` between tests but does not clear in-memory caches (the pool doc-comment in `internal/libhive/pool.go` calls this out). A later EEST test that happens to submit a payload with a block-hash also produced by an earlier test on the same warm daemon hits the short-circuit and gets the generic string in `validationError`; EEST's `ErigonExceptionMapper` has no rule for it, so the test fails with the wrong exception even though the block was correctly rejected on first sight.

Concretely: `hive-eest / test-hive-eest (glamsterdam-devnet)` intermittently failed with 3-4 failures over the `max-failures: 2` budget. Two of those failures are deterministic (the documented wrong-EEST-expectation `test_fork_transition` pair). The flaky 1-2 extras rotated across runs between variants of `test_max_initcode_size[over_max]` and `test_bal_invalid_extraneous_entries[*]`, exactly the pattern you'd expect from an LRU-pool-assignment race on top of a process-lifetime cache. See the [project memory writeup](https://github.com/erigontech/erigon/actions/runs/26233342007/job/77199556910) for the detailed trace.

This also makes Engine API replies more informative for legitimate retries: a CL resubmitting after a config change / debugger session / network glitch now gets the same actionable error twice instead of a generic short-circuit on retry.

## Test plan

- [ ] `make lint` clean (verified locally, 3 passes)
- [ ] `make erigon integration` clean (verified locally)
- [ ] `go test ./execution/engineapi/... -short` clean (verified locally; includes new unit tests pinning Report/IsBadHeader round-trip in `block_downloader_test.go`)
- [ ] `hive-eest / test-hive-eest (glamsterdam-devnet)` passes on the merge queue without the flaky `previously known bad block` failures
- [ ] No regression on other `hive-eest` shards

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
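The replay rules listed in that summary (own error, ancestor wrap, generic fallback) can be sketched with a small LRU keyed by block hash. This is an illustrative Python model, not Erigon's Go implementation; the class name, method names, and the hashes in the usage below are hypothetical.

```python
from collections import OrderedDict

GENERIC = "previously known bad block"


class BadHeaders:
    """Small LRU of rejected block hashes that remembers why each was rejected."""

    def __init__(self, capacity=96):
        self.capacity = capacity
        self._entries = OrderedDict()   # block hash -> error string, or None

    def report(self, block_hash, err=None):
        """Record a bad block; err is None when no validation error was captured."""
        self._entries[block_hash] = err
        self._entries.move_to_end(block_hash)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)   # evict the oldest entry

    def validation_error(self, block_hash, parent_hash):
        """Error string to reply with on a cache short-circuit, or None on a miss."""
        if block_hash in self._entries:
            return self._entries[block_hash] or GENERIC
        if parent_hash in self._entries:
            parent_err = self._entries[parent_hash]
            if parent_err is None:
                return GENERIC
            return f"ancestor {parent_hash} rejected: {parent_err}"
        return None


if __name__ == "__main__":
    cache = BadHeaders()
    cache.report("0xaa", "max initcode size exceeded")
    print(cache.validation_error("0xaa", "0x00"))   # replays the original error
    print(cache.validation_error("0xbb", "0xaa"))   # wrapped ancestor error
```

Storing the string alongside the hash is what makes a small capacity attractive: each entry now pins an error message in memory for as long as it stays cached.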

yperbasis and others added 7 commits April 25, 2026 10:58

[experiment] fix hive ref: full SHA required by actions/checkout

a5aa529

yperbasis added QA performance labels Apr 25, 2026

yperbasis changed the title ~~[experiment] hive client pool keyed by genesis (pool.size=4)~~ ci: enable hive-eest client pool (~2.7× faster gate) Apr 25, 2026

yperbasis requested a review from Copilot April 25, 2026 20:59

Copilot started reviewing on behalf of yperbasis April 25, 2026 20:59

Copilot AI reviewed Apr 25, 2026

Comment thread .github/workflows/test-hive-eest.yml Outdated

Comment thread .github/workflows/test-hive-eest.yml

yperbasis marked this pull request as ready for review April 26, 2026 07:21

yperbasis requested review from AskAlexSharov and mriccobene as code owners April 26, 2026 07:21

yperbasis requested review from anacrolix, mh0lt and taratorio April 26, 2026 07:22

Giulio2002 approved these changes Apr 26, 2026

anacrolix added this pull request to the merge queue Apr 27, 2026

Merged via the queue into main with commit 8460422 Apr 27, 2026
74 checks passed

anacrolix deleted the yperbasis/hive-pool-experiment branch April 27, 2026 04:01

yperbasis mentioned this pull request May 22, 2026

execution/engineapi: replay cached validation error on bad-block short-circuit #21362

Merged

5 tasks

anacrolix merged 9 commits into main from yperbasis/hive-pool-experiment

4 participants
