ci: enable hive-eest client pool (~2.7× faster gate) #20812
Merged
…ze=4

Wires the test-hive-eest workflow to a hive prototype that retains stopped client containers in a pool keyed by (image, sanitized HIVE_* env, genesis bytes), restarting them across tests instead of creating-from-scratch every time.

Hive ref: erigontech/hive @ c46813fa410308b4defe849c0518404a6c77505b
Branch: yperbasis/genesis-pool

The simplified prototype on this branch saves only docker-create + tar-upload + container-start per pool hit, about 250-400 ms in CI. A follow-up snapshot-based path could also skip `erigon init` itself but needs Linux substrate to validate (the cp -a was prohibitively slow on macOS Docker Desktop during local benchmarking).

Pool size of 4 is intentionally conservative for the first run. With ~4500 unique pre-states across osaka and 21580 tests, expected idle container count peaks around the same order; we want to confirm it doesn't blow up disk or the docker daemon before tuning further.

This commit must be reverted before merge (or the workflow merged unmodified after retargeting at upstream ethereum/hive once the pool lands there). The PR is purely for CI dispatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
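A minimal sketch of how a pool key over (image, sanitized HIVE_* env, genesis bytes) could be derived. This is an illustration only, not the hive prototype's code: the function name `pool_key` is hypothetical, and "sanitized" is assumed here to mean keeping only `HIVE_*` variables in a stable order.

```python
import hashlib


def pool_key(image, env, genesis):
    """Derive a pool bucket key from (image, HIVE_* env, genesis bytes)."""
    # Assumption: "sanitized" means keeping only HIVE_* variables, sorted so
    # that dict ordering does not change the key. The real prototype may also
    # drop per-test variables.
    hive_env = sorted((k, v) for k, v in env.items() if k.startswith("HIVE_"))
    h = hashlib.sha256()
    h.update(image.encode())
    for name, value in hive_env:
        h.update(b"\0" + name.encode() + b"=" + value.encode())
    h.update(b"\0" + genesis)
    return h.hexdigest()
```

Two tests that share the image, the `HIVE_*` settings, and the genesis bytes map to the same key under this sketch, so a container stopped after the first test is a candidate for the second.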
Adds back the post-init snapshot save / restore on pool reuse to skip `erigon init` itself on hits. The Mac substrate constraint that forced this off doesn't apply on bare-metal Linux runners. Hive ref: erigontech/hive @ bc2cfecf24066898d0308f967d1d6d66a2dafc41 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switches the pool from snapshot-restore to warm-daemon: client containers stay running across tests, hive sends JSON-RPC `debug_setHead(0)` to revert chain state to genesis between tests. Per-hit cost drops from ~700-1500 ms (full restart) to ~10-20 ms (one HTTP RPC). Hive ref: erigontech/hive @ c916d4357edcfdb2b23900c2b04157b6b924a84f Validated locally with TestSetHeadEnablesDaemonReuseAcrossTests in execution/execmodule (added but not committed) — debug_setHead(0) correctly rewinds a 1-block chain to genesis in ~285 µs and a different block 1 inserts cleanly afterward. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
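For reference, a sketch of what the reset request body might look like. It only builds the JSON-RPC payload and sends nothing. The helper name `set_head_request` is hypothetical, and the hex-encoded block number is an assumption based on the geth-style debug API convention; it is not taken from the hive change itself.

```python
import json


def set_head_request(block_number=0, request_id=1):
    """Build the JSON-RPC body for debug_setHead(block_number)."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "debug_setHead",
        # Assumption: the block number is passed as a hex string ("0x0" for
        # genesis), as in the geth-style debug namespace.
        "params": [hex(block_number)],
    }
    return json.dumps(payload).encode()
```

The body would be POSTed to the client's JSON-RPC endpoint between tests; one such round trip is the ~10-20 ms per-hit cost quoted above.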
Hive ref: erigontech/hive @ 08ea28aa57c5a04afa0bc76afd0d5bbef6640607 Pool flag is now a *global* idle ceiling rather than per-bucket — addresses the iter 3 failure where ~3500 unique pre-states on paris+shanghai each got their own running daemon and exhausted the docker daemon. With LRU eviction the pool's memory footprint is bounded. Bumped pool.size from 4 to 24 (= 2× parallelism). At ~150 MB per Erigon daemon that's ~3.6 GB additional RAM, comfortable on the bare-metal runners. The LRU now has room to actually keep hot pre-states warm between tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
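The difference between a per-bucket limit and a global idle ceiling can be sketched as follows. This is a simplified model, not the hive implementation: the class `IdlePool` and its methods are hypothetical names, and containers are represented by plain string ids.

```python
from collections import OrderedDict


class IdlePool:
    """Idle daemons across all keys, bounded by one global LRU cap."""

    def __init__(self, size):
        self.size = size
        self.idle = OrderedDict()  # container id -> pool key, oldest first
        self.evicted = []          # ids the caller would stop and delete

    def release(self, container_id, key):
        """Return a daemon to the pool, evicting the oldest idle ones if full."""
        if self.size <= 0:
            self.evicted.append(container_id)  # pool disabled: always tear down
            return
        self.idle[container_id] = key
        while len(self.idle) > self.size:
            oldest, _ = self.idle.popitem(last=False)
            self.evicted.append(oldest)

    def acquire(self, key):
        """Take the most recently released daemon for `key`, or None on a miss."""
        match = next(
            (cid for cid in reversed(self.idle) if self.idle[cid] == key), None
        )
        if match is not None:
            del self.idle[match]
        return match
```

Because the cap applies to the whole pool rather than to each key, thousands of unique pre-states cannot each leave a running daemon behind; at most `size` idle daemons exist at any time.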
- Bump hive ref to erigontech/hive yperbasis/client-pool @ 684f7a63 (the upstream-PR-ready version: simplified single-list LRU pool, posted as ethereum/hive#1449).
- Add `pool-size` matrix var: 24 on cancun/prague/osaka/glamsterdam-devnet where iter 4 measured a clean -32% to -40% wall-time win, 0 on paris+shanghai and rlp where the corpus has unique pre-state per test and the warm-daemon path was a small net regression (+7%, +11%).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trace-instrumented run 24938677388 + offline LRU simulation showed the hit-rate curve plateaus at ~50% past size=4 on cancun/osaka/prague (reuse pattern is depth-2 consecutive, captured by a tiny cache). Drop those shards from 24 → 4: same hit rate, ~3 GB less RAM per runner, simpler.

glamsterdam-devnet's curve keeps climbing through size=512; keep at 24. paris+shanghai and rlp stay at 0 (~0% hit at any size).

Hit rate at size=4 vs size=24 from the trace:

- cancun: 46.9% vs 49.8% (Δ 2.9 pp, ~rounding error wall-time)
- osaka: 47.5% vs 49.1%
- prague: 39.9% vs 42.3%

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
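The plateau can be reproduced with a small offline LRU simulation. The sketch below is not the simulator used for this PR; it runs on a synthetic, strictly sequential "depth-2 consecutive" trace, whereas the real trace interleaves parallel workers, which is presumably why measured rates sit slightly below 50%.

```python
from collections import OrderedDict


def lru_hit_rate(trace, size):
    """Fraction of accesses in `trace` served by an LRU cache of `size` keys."""
    cache = OrderedDict()
    hits = 0
    for key in trace:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # refresh recency on a hit
        elif size > 0:
            cache[key] = True
            if len(cache) > size:
                cache.popitem(last=False)  # evict the least recently used key
    return hits / len(trace) if trace else 0.0


# Synthetic trace: every pre-state appears twice in a row, then never again.
trace = [k for k in range(1000) for _ in range(2)]
rates = {size: lru_hit_rate(trace, size) for size in (0, 1, 4, 24)}
```

On this idealized trace the hit rate is the same for every non-zero size, which matches the observation that growing the pool from 4 to 24 buys almost nothing on cancun, osaka, and prague.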
…ortID)

Three follow-up commits on top of the iter-6 hive ref:

- 204e90f9: log path on warm acquire, networks in pool key, ResetPort, Drain↔Release WaitGroup race, suitability docstring.
- 06b97749: TCP probe on Acquire to reject dead daemons, fold Key into PoolEntry (Release takes one arg), shortID() in api.startClient slog calls.

The probe is the load-bearing fix: cold path's CheckLive wait was skipped on warm reuse, so a daemon that died between Release and Acquire would surface as a confusing RPC timeout. Probe now catches that pre-test, drops the entry, schedules its container for delete, and walks to the next candidate.

Validation re-run; same per-shard pool sizes (cancun/prague/osaka=4, glamsterdam=24, paris+shanghai/rlp=0).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
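A rough sketch of the probe-then-walk behaviour described above, using a plain TCP connect as the liveness check. The names `is_alive` and `acquire_live` are hypothetical and the timeout value is an arbitrary assumption; the hive code is written in Go and may differ in detail.

```python
import socket


def is_alive(host, port, timeout=0.5):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def acquire_live(candidates, probe=is_alive):
    """Walk (host, port) candidates; return the first live one plus the dead ones."""
    dead = []
    for host, port in candidates:
        if probe(host, port):
            return (host, port), dead
        dead.append((host, port))  # caller would schedule these for deletion
    return None, dead
```

Probing before handing the daemon to a test turns a late, confusing RPC timeout into an early pool miss: dead entries are dropped and the caller either gets the next live candidate or falls back to the cold path.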
Pull request overview
Enables Hive’s warm-daemon client pool for the EEST workflow to reduce per-test container startup overhead and speed up the hive-eest CI gate.
Changes:
- Add per-shard `pool-size` values to the workflow matrix to tune cache effectiveness by shard.
- Pass `--client.pool.size=${{ matrix.pool-size }}` to the `hive` invocation.
- Temporarily pin the Hive checkout to `erigontech/hive` at a specific SHA (experimental override).
204e90f9 collapsed the pool reset port with options.CheckLive. EEST consume-engine sets HIVE_CHECK_LIVE_PORT=8551 (engine port with JWT) so iter 7's reset RPC went to the wrong port and got 403 on every test. Pool effectively size=0; long shards regressed +140-170%. Hive a3c9afd9 separates the two: ResetPort defaults to 8545 (the public JSON-RPC port where debug_setHead lives), overridable via new HIVE_RESET_PORT env var if any client puts debug elsewhere. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
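The separation of the two ports can be illustrated with a small resolver. This is a sketch under the assumptions stated in the commit message (default 8545, `HIVE_RESET_PORT` override); the function name `reset_port` and the range check are hypothetical additions.

```python
DEFAULT_RESET_PORT = 8545  # public JSON-RPC port, per the commit message


def reset_port(env):
    """Pick the port for the reset RPC, independent of HIVE_CHECK_LIVE_PORT."""
    raw = env.get("HIVE_RESET_PORT", "")
    if not raw:
        return DEFAULT_RESET_PORT
    port = int(raw)
    # Hypothetical sanity check, not part of the described change.
    if not 0 < port < 65536:
        raise ValueError(f"HIVE_RESET_PORT out of range: {port}")
    return port
```

With this shape, a simulator that sets `HIVE_CHECK_LIVE_PORT=8551` no longer redirects the reset call to the JWT-protected engine port.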
Giulio2002
approved these changes
Apr 26, 2026
LGTM — small CI-only workflow change: tunes hive-eest client pool sizes and pins the temporary hive fork/commit with clear revert notes.
Sahil-4555 pushed a commit to Sahil-4555/erigon that referenced this pull request on Apr 27, 2026:

## Summary

Enables the hive *warm-daemon client pool* on the `test-hive-eest.yml` workflow.

The pool keeps client containers running across tests and resets chain state via JSON-RPC `debug_setHead(0)` between them, instead of creating a fresh container per test. On corpora with high pre-state reuse this skips ~1.9 s of per-test overhead (`docker create` + tar upload + `erigon init` + daemon boot).

**Effect on the hive-eest merge-queue gate:**

| Shard | `pool.size` | Baseline wall | New wall | Δ |
|---|---|---|---|---|
| cancun | 4 | 52.2 min | **18.0 min** | **−65.5%** |
| prague | 4 | 61.0 min | **22.4 min** | **−63.4%** |
| osaka | 4 | 63.0 min | **23.6 min** | **−62.6%** |
| glamsterdam-devnet | 24 | 8.5 min | 6.5 min | −22.9% |
| paris+shanghai | 0 | 13.5 min | 12.3 min | ~baseline |
| consume-rlp | 0 | 32.3 min | 31.1 min | ~baseline |

Pacemaker: **63 min → 24 min** (the gate is now 2.7× faster on every PR / merge group). Baseline run [#24827222279](https://github.com/erigontech/erigon/actions/runs/24827222279); current run [#24941388548](https://github.com/erigontech/erigon/actions/runs/24941388548).

## Per-shard `pool.size`

The pool only helps where tests share pre-state. From a trace-instrumented run + offline LRU simulation, the EEST access pattern is *depth-2 consecutive*: each pre-state is hit twice in a row, then never seen again. So `pool.size = 4` captures all available reuse on the long consume-engine shards; bigger caches buy nothing. The other extreme, paris+shanghai and consume-rlp, has ~unique pre-state per test (~0% hit at any cache size) and gets `pool.size = 0`, which short-circuits every new code path in hive (byte-identical to the pre-pool flow).

| Shard | Reuse | `pool.size` |
|---|---|---|
| cancun, prague, osaka | ~2× | 4 |
| glamsterdam-devnet | curve still climbs at 24 | 24 |
| paris+shanghai, consume-rlp | ~1× | 0 (pool disabled) |

## Files

- `.github/workflows/test-hive-eest.yml`:
  - Pinned hive checkout to `erigontech/hive @ 06b9774991053bf4952c98750c53fc52bceb3991` (see "Pre-merge" below).
  - Added `pool-size` per matrix entry.
  - Added `--client.pool.size=${{ matrix.pool-size }}` to the hive invocation.

No other files touched.

## Pre-merge

The hive-side change is up as a draft at [ethereum/hive#1449](ethereum/hive#1449). This Erigon PR currently pins the hive checkout at the corresponding `erigontech/hive` branch, which is fine for the trial dispatches but **not for merge**.

Before this lands:

1. ethereum/hive#1449 merges.
2. `hive-versions.json` gets a routine bump to the new ethereum/hive ref.
3. The `repository:` / `ref:` override in `test-hive-eest.yml` is reverted to the existing `ethereum/hive` + `steps.hive-version.outputs.ref` pattern.
4. The `pool-size` matrix var and `--client.pool.size` flag stay: they're the actual change this PR ships.

I'll squash the experiment commits down to that final shape once ethereum/hive#1449 is in.

## Test plan

- [x] All 6 shards green on the bare-metal `hive` runner group across multiple dispatches.
- [x] Per-shard `pool.size` derived from a trace-instrumented run and an offline LRU simulator (the curve plateaus past 4 on long shards).
- [x] Low-reuse shards (paris+shanghai, consume-rlp) verified at `pool.size=0`: wall time matches the no-pool baseline (no regression from the pool's per-test reset overhead).
- [x] No container-count / disk-pressure issues on the runner (global LRU cap bounds the running daemon count regardless of test corpus).

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
lupin012 pushed a commit that referenced this pull request on May 2, 2026.
pull Bot pushed a commit to Dustin4444/erigon that referenced this pull request on May 22, 2026:

…t-circuit (erigontech#21362)

Resolves erigontech#21363. Likely also resolves Mode A (the `"previously known bad block"` cache short-circuit) of erigontech#21364, leaving that issue open until verified in-tree across all its parametrizations.

## Summary

- When `engine_newPayload` short-circuits via the `badHeaders` LRU on a previously-rejected block hash, replay the original `validationErr` string (e.g. `"max initcode size exceeded"`) instead of the generic `"previously known bad block"`.
- For parent-inheritance hits, wrap as `"ancestor 0x… rejected: <parent err>"` so the cause is still traceable when the child hash hadn't been seen on its own before.
- Falls back to the current `"previously known bad block"` string when the cache entry has no recorded error (sync-time downloader path, `"invalid block number"` header check before first cache-populate).
- Shrink the `badHeaders` LRU capacity from 10_000 → 96. Each entry now holds a heap-allocated error string (not GC'd while cached); the realistic working set for newPayload bad blocks is in the single digits per session, so 96 leaves substantial headroom while keeping memory well-bounded.

## Motivation

Harmless in production until the hive warm-daemon client pool (erigontech#20812) started reusing erigon processes across EEST tests. The pool calls `debug_setHead(0)` between tests but does not clear in-memory caches (the pool doc-comment in `internal/libhive/pool.go` calls this out). A later EEST test that happens to submit a payload with a block-hash also produced by an earlier test on the same warm daemon hits the short-circuit and gets the generic string in `validationError`; EEST's `ErigonExceptionMapper` has no rule for it, so the test fails with the wrong exception even though the block was correctly rejected on first sight.

Concretely: `hive-eest / test-hive-eest (glamsterdam-devnet)` intermittently failed with 3-4 failures over the `max-failures: 2` budget. Two of those failures are deterministic (the documented wrong-EEST-expectation `test_fork_transition` pair). The flaky 1-2 extras rotated across runs between variants of `test_max_initcode_size[over_max]` and `test_bal_invalid_extraneous_entries[*]`, exactly the pattern you'd expect from an LRU-pool-assignment race on top of a process-lifetime cache. See the [project memory writeup](https://github.com/erigontech/erigon/actions/runs/26233342007/job/77199556910) for the detailed trace.

This also makes Engine API replies more informative for legitimate retries: a CL resubmitting after a config change / debugger session / network glitch now gets the same actionable error twice instead of a generic short-circuit on retry.

## Test plan

- [ ] `make lint` clean (verified locally, 3 passes)
- [ ] `make erigon integration` clean (verified locally)
- [ ] `go test ./execution/engineapi/... -short` clean (verified locally; includes new unit tests pinning Report/IsBadHeader round-trip in `block_downloader_test.go`)
- [ ] `hive-eest / test-hive-eest (glamsterdam-devnet)` passes on the merge queue without the flaky `previously known bad block` failures
- [ ] No regression on other `hive-eest` shards

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
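The replay behaviour listed in that summary can be modelled roughly as below. This is an illustrative Python sketch, not Erigon's Go implementation: the class `BadHeaders` and its method names are hypothetical, and only the three reply shapes and the capacity of 96 come from the description above.

```python
from collections import OrderedDict

GENERIC = "previously known bad block"


class BadHeaders:
    """Bounded LRU of rejected block hashes with their original error, if any."""

    def __init__(self, capacity=96):
        self.capacity = capacity
        self.entries = OrderedDict()  # block hash -> recorded error or None

    def report(self, block_hash, err=None):
        """Record a rejected block; `err` is None when no cause was captured."""
        self.entries[block_hash] = err
        self.entries.move_to_end(block_hash)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)

    def validation_error(self, block_hash, parent_hash):
        """Return the error string to replay, or None if neither hash is cached."""
        if block_hash in self.entries:
            return self.entries[block_hash] or GENERIC
        if parent_hash in self.entries:
            parent_err = self.entries[parent_hash]
            if parent_err:
                return f"ancestor {parent_hash} rejected: {parent_err}"
            return GENERIC
        return None
```

On a warm daemon, a second test that resubmits a block hash rejected in an earlier test would get the original, mappable error back under this model instead of the generic string.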