stagedsync: fix commitmentCalculator asOfReader.txNum=0 lazy-load on snapshot-loaded chains by mh0lt · Pull Request #21010 · erigontech/erigon

mh0lt · 2026-05-06T11:19:22Z

Summary

commitmentCalculator lazy-loads from the storage/account domains via a shared asOfStateReader. The reader is constructed with txNum=0 and only updated inside computeAndPublish / computeWithoutCheck — both of which run on a blockResult message. But cc.state.ApplyWrites runs on every txResult and triggers ensureAccount / ensureStorage lazy-loads through the same reader.

Since txResults arrive before the first blockResult, the very first lazy-load fires with asOfReader.txNum = 0. On a synced-from-genesis datadir txNum=0 is in the visible history window so seekInFiles returns gracefully. On a snapshot-loaded datadir the visible window starts well past genesis and seekInFiles errors with:

ensureStorage(...): seekInFiles(invIndex=storage,txNum=0) but data before txNum=2918750000 not available

The calculator publishes the wrapped error and FCU fails on every block.

Fix

Pin asOfReader.txNum to the current tx's txNum before ApplyWrites runs. Semantically GetAsOf(r.txNum) returns the value before tx N modified it = the pre-tx baseline, which is what lazy-load needs (only first-touch slots lazy-load; subsequent same-slot writes hit cs.storageState and skip the read path).

computeAndPublish overwrites this back to lastTxNum + 1 right before ComputeCommitment, so the per-tx setting only affects the lazy-load path and never leaks into the trie fold path.

Reproduction

Run the parallel commitment calculator on a snapshot-loaded chain (e.g. perf-devnet-3 starting at txNum~2.9B). Every block fails with the seekInFiles error above. Synced-from-genesis chains don't trip the bug because txNum=0 is in their window.

Test plan

CI green
Bench on snapshot-loaded perf-devnet-3 with EXEC3_PARALLEL=true: 17 SETUP blocks + 1 TEST block all VALID with 0 lazy-load errors (verified locally)
Bench on synced-from-genesis chain: no behavior change (lazy-load already worked at txNum=0 there)

🤖 Generated with Claude Code

commitmentCalculator constructs asOfReader with txNum=0 and only updates it inside computeAndPublish/computeWithoutCheck (which run on blockResult). But cc.state.ApplyWrites runs on every txResult and triggers ensureAccount/ensureStorage lazy-loads via the same asOfReader. Since the first message into cc.in is always a txResult (txs complete before their containing block boundary fires blockResult), the first lazy-load fires with asOfReader.txNum=0. On a synced-from-genesis datadir txNum=0 is in the visible history window so seekInFiles returns gracefully. On a snapshot-loaded datadir (perf-devnet-3, anything started from a chunked snapshot) the visible window starts well past genesis (~2.9B in our case) and seekInFiles errors with "data before txNum=2918750000 not available". The calculator publishes the wrapped error and FCU fails for every block. Pin asOfReader.txNum to r.txNum before ApplyWrites. Semantically GetAsOf(r.txNum) returns the value before tx N modified it = pre-tx state, which is the baseline ApplyWrites needs (only first-touch slots lazy-load; subsequent same-slot writes hit cs.storageState and skip the read path). computeAndPublish overwrites this back to lastTxNum+1 right before ComputeCommitment, so the per-tx setting only affects the lazy-load path and never leaks into the trie fold.

mh0lt · 2026-05-07T14:53:17Z

Linked as a precursor for #21017 (CI matrix exec_mode={serial,parallel}). Without this fix, the matrix's QA entries that run against snapshot-loaded datadirs (qa-rpc-performance-comparison-tests, qa-rpc-integration-tests-latest) hit seekInFiles(txNum=0) on every block under EXEC3_PARALLEL=true because asOfReader's initial txNum=0 falls outside the visible history window. Companion to #21032 (SD/incarnation differentiator) and #21039 (apply-loop completeness).

… pin Locks down the post-condition exercised by the fix in this PR: after handleMessage(*txResult), cc.asOfReader.txNum must equal the txResult's txNum. This is the property whose absence makes snapshot-loaded chains crash with seekInFiles(txNum=0) on the first lazy-load. Three subassertions: 1. First txResult bumps txNum from 0 → r.txNum (the basic fix). 2. Subsequent txResult bumps it again (the per-tx update). 3. Empty-writes txResult does NOT bump (matches the fix's `if len(r.writes) > 0` guard — no lazy-loads to seed, so leaving the prior pin alone is correct). Verified the test fails on origin/main without the fix (expected 12345, actual 0).

Copilot

Pull request overview

Fixes a commitment-calculator failure on snapshot-loaded datadirs where early per-tx lazy-loads could read with asOfReader.txNum=0 (outside the visible history window), causing seekInFiles(txNum=0) errors and failing FCU.

Changes:

Pin cc.asOfReader.txNum to the current txResult.txNum before ApplyWrites to ensure lazy-load reads use the correct pre-tx baseline.
Add a regression unit test asserting asOfReader.txNum is updated on txResult processing (and not updated for empty writes).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
execution/stagedsync/committer.go	Pins `asOfReader.txNum` on `txResult` before applying writes to prevent snapshot-window `txNum=0` lazy-load failures.
execution/stagedsync/committer_test.go	Adds a regression test validating `txResult` processing updates `asOfReader.txNum` as intended.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+// the snapshot-loaded lazy-load crash: cc.asOfReader is constructed with
+// txNum=0, and only computeAndPublish / computeAndCheck (which run on
+// blockResult) overwrite that field. But cc.state.ApplyWrites runs on
+// every txResult and triggers ensureAccount / ensureStorage lazy-loads
+// through the same reader. txResults arrive before the first blockResult,


Two related additions to the snapshot-vs-MDBX perf-equivalence investigation (memory: snapshot-vs-mdbx-performance-equivalence.md): 1. H_GetAsOf bench (db/state/snapshot_vs_mdbx_bench_test.go). New runHGetAsOf with four sub-benches for HistorySeek-via-GetAsOf on file-resident keys: GetLatest_baseline, GetAsOf_recent (asOf near endTxNum), GetAsOf_mid (asOf at endTxNum/2), GetAsOf_floor (asOf=1). Tests the path the calculator's HistoryStateReader.Read uses, which is distinct from getLatestFromFiles measured in H0. Real-datadir results on perf-devnet-3 (AccountsDomain, endTxNum=2.9B): GetLatest_baseline 202 ns/op 0 allocs GetAsOf_recent 570 ns/op 0 allocs <- 2.8x baseline, no result GetAsOf_mid 235 ns/op 5 allocs GetAsOf_floor 196 ns/op 4 allocs GetAsOf_recent (the calculator's pattern after PR #21010) scans the .ef looking for a record at-or-after endTxNum-1, finds none (most keys haven't changed in the last txNum), falls through to GetLatest. The 370ns/op overhead vs GetLatest is wasted scan. Confirms the GetAsOf shortcut described in getasof-regression-suspect.md as a real lever, though small in absolute terms (~2ms/block on the bloat workload). 2. Surface "took" + "keys" on the existing [commitment][cache-fp] Info log line (commitmentdb). Was already computed in the debug-level "[commitment] processed" log, but the bench runs with --log.dir.disable so debug logs aren't captured. This made it possible to attribute the 4.3s gap inside newPayload(TEST block) directly: the calculator's ComputeCommitment takes 4220ms for the 5910-key bloat block — 91% of the entire block wall time. Per-key cost is ~700us, consistent across blocks of all sizes. The actual perf lever for the bloat workload is making per-branch ComputeCommitment cheaper, not file/state reads. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…igontech#21017) > **2026-05-15 — erigontech#21153 has merged.** The companion PR carrying the substantive parallel-exec correctness + perf fix family (`mh/parallel-exec-fixes`, merged 2026-05-15 02:35Z as `958b2fbb85`) is now on `main`. This PR has been rebased onto fresh main; the trimmed branch contains only the serial/parallel exec-mode CI matrix yamls plus two CI hygiene fixes (a shared `build-erigon-image` job for the kurtosis matrix, and the `setup-erigon` apt-mirror flake fix), plus three workflow follow-ups from the 2026-05-13 review. > > **Landed precursors:** > - ✅ erigontech#21153 (merged 2026-05-15) — parallel-exec correctness/perf stack split from this PR > - ✅ erigontech#21088 (merged 2026-05-10) — parallel-commitment correctness for reorg/unwind + SD recreate > - ✅ erigontech#21032 (merged 2026-05-08) — wrong-trie-root on Cancun / SD paths > - ✅ erigontech#21039 (merged 2026-05-08) — apply-loop completeness false-flag on partial-batch exit > - ✅ erigontech#21010 (merged 2026-05-07) — `commitmentCalculator.asOfReader.txNum=0` lazy-load fix > > Open follow-up issues (separate from this PR's scope): erigontech#21106 (parallel-exec hardening), erigontech#21107 (stage-exec `from-0`/`chaintip`), erigontech#21108 (residual `EXEC3_PARALLEL` functional cases), erigontech#21136 (gated SkipLoads), erigontech#21138 (heuristic-removal structural cleanup). ## Summary Adds an `exec_mode: [serial, parallel]` matrix axis to every CI test workflow that exercises the EL execution path so divergence between `dbg.Exec3Parallel` true and false is caught on the PR rather than after release. The toggle is plumbed via `ERIGON_EXEC3_PARALLEL` — `envLookup` in [common/dbg/dbg_env.go:38](common/dbg/dbg_env.go#L38) auto-prepends the `ERIGON_` prefix to the source-side `EXEC3_PARALLEL` flag in [common/dbg/experiments.go:81](common/dbg/experiments.go#L81). ## Why `Exec3Parallel = false` is currently the source default on `main`. With no CI workflow setting the env var, every CI run today exercises only the **serial** path. The parallel path lands via `--bal` chains and tests like `experimentalBAL`, but the broad correctness signal (unit / race / hive / kurtosis / RPC integration) is single-mode. This PR makes both modes a per-PR signal. ## Affected workflows **Always-on** (matrix on every PR / dispatch / call): | Workflow | Runner | Parallelism | Cost | |---|---|---|---| | test-all-erigon.yml | GitHub-hosted (per-OS) | true parallel | wall-clock unchanged, +1× runner-min | | test-all-erigon-race.yml | GitHub-hosted | true parallel | same | | test-hive.yml | hive group | parallel if pool has slots | same (group is sized) | | test-hive-eest.yml | hive group | parallel if pool has slots | same | | test-kurtosis-assertoor.yml | `ubuntu-latest` | true parallel | same | **Auto-fire on PRs touching their own YAML, dispatch otherwise** (so regular PRs don't pay the cost; this PR triggers each one once for end-to-end validation): | Workflow | Notes | |---|---| | test-bench.yml | `go bench` | | qa-rpc-integration-tests-latest.yml | self-hosted single-pool, `max-parallel: 1` (shared testbed state) | | qa-rpc-performance-comparison-tests.yml | erigon runs serial+parallel, geth single-mode (placeholder ignored) | | qa-txpool-performance-test.yml | kurtosis txpool, `max-parallel: 1` | | qa-stage-exec.yml | 3 mode_names × 2 exec_modes = 6 entries | **Skipped** — `test-integration-caplin.yml` runs `cl/spectest` only, doesn't exercise the EL exec path; matrix-doubling would just re-run identical CL tests. ## Plumbing details * Workflows that build erigon and run go tests directly: env var set on the test step's `env:` block. * Hive-based workflows: an `ENV ERIGON_EXEC3_PARALLEL=...` line is appended to `clients/erigon/Dockerfile` during the existing sed-patch loop so every hive-launched erigon inherits the toggle. * Kurtosis-based workflows: a single upstream `build-erigon-image` job builds `test/erigon:current-base` once for the whole matrix and uploads it as a run-scoped artifact; each matrix entry downloads + `docker load`s and adds a cheap `ENV ERIGON_EXEC3_PARALLEL=…` layer on top to produce `test/erigon:current`. Avoids ~12 concurrent buildx jobs all writing the same `type=gha` cache scope and 504-ing. * Self-hosted single-pool workflows use `max-parallel: 1` to serialize matrix entries cleanly when state on the runner box is shared (testbed datadir, reference datadir, etc.). * All artifact / enclave / testbed-dir names are disambiguated by `exec_mode` so the two matrix entries don't clobber each other's outputs. ## Review follow-ups (2026-05-15 rebuild) After rebasing onto post-erigontech#21153 main, three workflow fixes from the 2026-05-13 review are applied as separate commits on top of the 4 trimmed CI commits: | Commit | Fixes | |---|---| | `ci: gate parallel-suffix QA test_name on client==erigon` | yperbasis Blocker 2 + Copilot thread #4 — geth's placeholder `exec_mode: parallel` previously caused its Grafana `test_name` to gain a stray `-parallel-` suffix, breaking historical dashboard queries | | `ci: align test-hive devp2p sim-limit between serial and parallel matrix legs` | yperbasis nit 5 + Copilot threads #3/#9 — both legs now run `sim-limit: eth` (discv5 doesn't exercise the EL exec path; matches the long-standing comment) | | `ci: fix Targetting typo in test-hive-eest.yml` | Copilot thread #6 — s/Targetting/Targeting/ | The yperbasis Blocker 1 (`cmd/integration/commands/stages.go:631` ignores `ERIGON_` prefix, so `qa-stage-exec`'s serial entry silently runs in parallel mode) is a Go change, raised as a separate small PR to keep this one strictly `.github/` only. ## Test plan This PR's CI run **is** the test. The 5 always-on workflows fire automatically on PR. The 5 perf workflows auto-fire because their `pull_request: paths` filter matches the workflow's own YAML — so opening this PR triggers all 10 affected workflows. What to look for in this PR's "Checks" tab: - [x] Each affected workflow has both `serial` and `parallel` matrix entries listed. - [x] Job display names show the mode (e.g. `tests-mac-linux (ubuntu-24.04, parallel)`). - [x] Wall-clock for hosted-runner workflows is essentially unchanged from main baseline (jobs ran concurrently on separate runners). - [x] Self-hosted single-pool workflows show ~2× wall-clock (matrix entries serialize). - [x] Both modes pass on every workflow. **If serial fails where parallel passes (or vice versa), that's a real regression we'd want to catch — exactly the point of this change.** After merge, regular PRs only pay the matrix cost on the 5 always-on workflows; the 5 perf workflows go back to being dispatch / schedule-only. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: taratorio <94537774+taratorio@users.noreply.github.com> Co-authored-by: Alex Sharov <AskAlexSharov@gmail.com> Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com> Co-authored-by: milen <taratorio@users.noreply.github.com>

mh0lt requested a review from yperbasis as a code owner May 6, 2026 11:19

yperbasis requested review from awskii and taratorio May 6, 2026 11:52

yperbasis added this to the 3.5.0 milestone May 7, 2026

yperbasis assigned awskii May 7, 2026

This was referenced May 7, 2026

ci: matrix-test serial vs parallel exec across the test workflows #21017

Merged

execution/stagedsync: don't flag maxBlockNum on partial-batch apply-loop exit #21039

Merged

yperbasis requested a review from Copilot May 7, 2026 15:09

Copilot started reviewing on behalf of yperbasis May 7, 2026 15:10 View session

Copilot AI reviewed May 7, 2026

View reviewed changes

yperbasis approved these changes May 7, 2026

View reviewed changes

mh0lt added this pull request to the merge queue May 7, 2026

Merged via the queue into main with commit 69d6181 May 7, 2026
42 checks passed

mh0lt deleted the mh/fix-commit-calc-asofreader-txnum branch May 7, 2026 19:51

mh0lt mentioned this pull request May 11, 2026

parallel_exec: chiado exec from 0 trie-root missmatch #20711

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

stagedsync: fix commitmentCalculator asOfReader.txNum=0 lazy-load on snapshot-loaded chains#21010

stagedsync: fix commitmentCalculator asOfReader.txNum=0 lazy-load on snapshot-loaded chains#21010
mh0lt merged 2 commits into
mainfrom
mh/fix-commit-calc-asofreader-txnum

mh0lt commented May 6, 2026

Uh oh!

mh0lt commented May 7, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

mh0lt commented May 6, 2026

Summary

Fix

Reproduction

Test plan

Uh oh!

mh0lt commented May 7, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants