stagedsync: fix commitmentCalculator asOfReader.txNum=0 lazy-load on snapshot-loaded chains#21010
Conversation
commitmentCalculator constructs asOfReader with txNum=0 and only updates it inside computeAndPublish/computeWithoutCheck (which run on blockResult). But cc.state.ApplyWrites runs on every txResult and triggers ensureAccount/ensureStorage lazy-loads via the same asOfReader. Since the first message into cc.in is always a txResult (txs complete before their containing block boundary fires blockResult), the first lazy-load fires with asOfReader.txNum=0. On a synced-from-genesis datadir txNum=0 is in the visible history window so seekInFiles returns gracefully. On a snapshot-loaded datadir (perf-devnet-3, anything started from a chunked snapshot) the visible window starts well past genesis (~2.9B in our case) and seekInFiles errors with "data before txNum=2918750000 not available". The calculator publishes the wrapped error and FCU fails for every block. Pin asOfReader.txNum to r.txNum before ApplyWrites. Semantically GetAsOf(r.txNum) returns the value before tx N modified it = pre-tx state, which is the baseline ApplyWrites needs (only first-touch slots lazy-load; subsequent same-slot writes hit cs.storageState and skip the read path). computeAndPublish overwrites this back to lastTxNum+1 right before ComputeCommitment, so the per-tx setting only affects the lazy-load path and never leaks into the trie fold.
|
Linked as a precursor for #21017 (CI matrix |
… pin
Locks down the post-condition exercised by the fix in this PR:
after handleMessage(*txResult), cc.asOfReader.txNum must equal
the txResult's txNum. This is the property whose absence makes
snapshot-loaded chains crash with seekInFiles(txNum=0) on the
first lazy-load.
Three subassertions:
1. First txResult bumps txNum from 0 → r.txNum (the basic fix).
2. Subsequent txResult bumps it again (the per-tx update).
3. Empty-writes txResult does NOT bump (matches the fix's
`if len(r.writes) > 0` guard — no lazy-loads to seed, so
leaving the prior pin alone is correct).
Verified the test fails on origin/main without the fix
(expected 12345, actual 0).
There was a problem hiding this comment.
Pull request overview
Fixes a commitment-calculator failure on snapshot-loaded datadirs where early per-tx lazy-loads could read with asOfReader.txNum=0 (outside the visible history window), causing seekInFiles(txNum=0) errors and failing FCU.
Changes:
- Pin
cc.asOfReader.txNumto the currenttxResult.txNumbeforeApplyWritesto ensure lazy-load reads use the correct pre-tx baseline. - Add a regression unit test asserting
asOfReader.txNumis updated ontxResultprocessing (and not updated for empty writes).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| execution/stagedsync/committer.go | Pins asOfReader.txNum on txResult before applying writes to prevent snapshot-window txNum=0 lazy-load failures. |
| execution/stagedsync/committer_test.go | Adds a regression test validating txResult processing updates asOfReader.txNum as intended. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // the snapshot-loaded lazy-load crash: cc.asOfReader is constructed with | ||
| // txNum=0, and only computeAndPublish / computeAndCheck (which run on | ||
| // blockResult) overwrite that field. But cc.state.ApplyWrites runs on | ||
| // every txResult and triggers ensureAccount / ensureStorage lazy-loads | ||
| // through the same reader. txResults arrive before the first blockResult, |
Two related additions to the snapshot-vs-MDBX perf-equivalence
investigation (memory: snapshot-vs-mdbx-performance-equivalence.md):
1. H_GetAsOf bench (db/state/snapshot_vs_mdbx_bench_test.go).
New runHGetAsOf with four sub-benches for HistorySeek-via-GetAsOf
on file-resident keys: GetLatest_baseline, GetAsOf_recent (asOf
near endTxNum), GetAsOf_mid (asOf at endTxNum/2), GetAsOf_floor
(asOf=1). Tests the path the calculator's HistoryStateReader.Read
uses, which is distinct from getLatestFromFiles measured in H0.
Real-datadir results on perf-devnet-3 (AccountsDomain, endTxNum=2.9B):
GetLatest_baseline 202 ns/op 0 allocs
GetAsOf_recent 570 ns/op 0 allocs <- 2.8x baseline, no result
GetAsOf_mid 235 ns/op 5 allocs
GetAsOf_floor 196 ns/op 4 allocs
GetAsOf_recent (the calculator's pattern after PR #21010) scans
the .ef looking for a record at-or-after endTxNum-1, finds none
(most keys haven't changed in the last txNum), falls through to
GetLatest. The 370ns/op overhead vs GetLatest is wasted scan.
Confirms the GetAsOf shortcut described in
getasof-regression-suspect.md as a real lever, though small in
absolute terms (~2ms/block on the bloat workload).
2. Surface "took" + "keys" on the existing [commitment][cache-fp]
Info log line (commitmentdb). Was already computed in the
debug-level "[commitment] processed" log, but the bench runs with
--log.dir.disable so debug logs aren't captured.
This made it possible to attribute the 4.3s gap inside
newPayload(TEST block) directly: the calculator's ComputeCommitment
takes 4220ms for the 5910-key bloat block — 91% of the entire
block wall time. Per-key cost is ~700us, consistent across blocks
of all sizes. The actual perf lever for the bloat workload is
making per-branch ComputeCommitment cheaper, not file/state reads.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…igontech#21017) > **2026-05-15 — erigontech#21153 has merged.** The companion PR carrying the substantive parallel-exec correctness + perf fix family (`mh/parallel-exec-fixes`, merged 2026-05-15 02:35Z as `958b2fbb85`) is now on `main`. This PR has been rebased onto fresh main; the trimmed branch contains only the serial/parallel exec-mode CI matrix yamls plus two CI hygiene fixes (a shared `build-erigon-image` job for the kurtosis matrix, and the `setup-erigon` apt-mirror flake fix), plus three workflow follow-ups from the 2026-05-13 review. > > **Landed precursors:** > - ✅ erigontech#21153 (merged 2026-05-15) — parallel-exec correctness/perf stack split from this PR > - ✅ erigontech#21088 (merged 2026-05-10) — parallel-commitment correctness for reorg/unwind + SD recreate > - ✅ erigontech#21032 (merged 2026-05-08) — wrong-trie-root on Cancun / SD paths > - ✅ erigontech#21039 (merged 2026-05-08) — apply-loop completeness false-flag on partial-batch exit > - ✅ erigontech#21010 (merged 2026-05-07) — `commitmentCalculator.asOfReader.txNum=0` lazy-load fix > > Open follow-up issues (separate from this PR's scope): erigontech#21106 (parallel-exec hardening), erigontech#21107 (stage-exec `from-0`/`chaintip`), erigontech#21108 (residual `EXEC3_PARALLEL` functional cases), erigontech#21136 (gated SkipLoads), erigontech#21138 (heuristic-removal structural cleanup). ## Summary Adds an `exec_mode: [serial, parallel]` matrix axis to every CI test workflow that exercises the EL execution path so divergence between `dbg.Exec3Parallel` true and false is caught on the PR rather than after release. The toggle is plumbed via `ERIGON_EXEC3_PARALLEL` — `envLookup` in [common/dbg/dbg_env.go:38](common/dbg/dbg_env.go#L38) auto-prepends the `ERIGON_` prefix to the source-side `EXEC3_PARALLEL` flag in [common/dbg/experiments.go:81](common/dbg/experiments.go#L81). ## Why `Exec3Parallel = false` is currently the source default on `main`. With no CI workflow setting the env var, every CI run today exercises only the **serial** path. The parallel path lands via `--bal` chains and tests like `experimentalBAL`, but the broad correctness signal (unit / race / hive / kurtosis / RPC integration) is single-mode. This PR makes both modes a per-PR signal. ## Affected workflows **Always-on** (matrix on every PR / dispatch / call): | Workflow | Runner | Parallelism | Cost | |---|---|---|---| | test-all-erigon.yml | GitHub-hosted (per-OS) | true parallel | wall-clock unchanged, +1× runner-min | | test-all-erigon-race.yml | GitHub-hosted | true parallel | same | | test-hive.yml | hive group | parallel if pool has slots | same (group is sized) | | test-hive-eest.yml | hive group | parallel if pool has slots | same | | test-kurtosis-assertoor.yml | `ubuntu-latest` | true parallel | same | **Auto-fire on PRs touching their own YAML, dispatch otherwise** (so regular PRs don't pay the cost; this PR triggers each one once for end-to-end validation): | Workflow | Notes | |---|---| | test-bench.yml | `go bench` | | qa-rpc-integration-tests-latest.yml | self-hosted single-pool, `max-parallel: 1` (shared testbed state) | | qa-rpc-performance-comparison-tests.yml | erigon runs serial+parallel, geth single-mode (placeholder ignored) | | qa-txpool-performance-test.yml | kurtosis txpool, `max-parallel: 1` | | qa-stage-exec.yml | 3 mode_names × 2 exec_modes = 6 entries | **Skipped** — `test-integration-caplin.yml` runs `cl/spectest` only, doesn't exercise the EL exec path; matrix-doubling would just re-run identical CL tests. ## Plumbing details * Workflows that build erigon and run go tests directly: env var set on the test step's `env:` block. * Hive-based workflows: an `ENV ERIGON_EXEC3_PARALLEL=...` line is appended to `clients/erigon/Dockerfile` during the existing sed-patch loop so every hive-launched erigon inherits the toggle. * Kurtosis-based workflows: a single upstream `build-erigon-image` job builds `test/erigon:current-base` once for the whole matrix and uploads it as a run-scoped artifact; each matrix entry downloads + `docker load`s and adds a cheap `ENV ERIGON_EXEC3_PARALLEL=…` layer on top to produce `test/erigon:current`. Avoids ~12 concurrent buildx jobs all writing the same `type=gha` cache scope and 504-ing. * Self-hosted single-pool workflows use `max-parallel: 1` to serialize matrix entries cleanly when state on the runner box is shared (testbed datadir, reference datadir, etc.). * All artifact / enclave / testbed-dir names are disambiguated by `exec_mode` so the two matrix entries don't clobber each other's outputs. ## Review follow-ups (2026-05-15 rebuild) After rebasing onto post-erigontech#21153 main, three workflow fixes from the 2026-05-13 review are applied as separate commits on top of the 4 trimmed CI commits: | Commit | Fixes | |---|---| | `ci: gate parallel-suffix QA test_name on client==erigon` | yperbasis Blocker 2 + Copilot thread #4 — geth's placeholder `exec_mode: parallel` previously caused its Grafana `test_name` to gain a stray `-parallel-` suffix, breaking historical dashboard queries | | `ci: align test-hive devp2p sim-limit between serial and parallel matrix legs` | yperbasis nit 5 + Copilot threads #3/#9 — both legs now run `sim-limit: eth` (discv5 doesn't exercise the EL exec path; matches the long-standing comment) | | `ci: fix Targetting typo in test-hive-eest.yml` | Copilot thread #6 — s/Targetting/Targeting/ | The yperbasis Blocker 1 (`cmd/integration/commands/stages.go:631` ignores `ERIGON_` prefix, so `qa-stage-exec`'s serial entry silently runs in parallel mode) is a Go change, raised as a separate small PR to keep this one strictly `.github/` only. ## Test plan This PR's CI run **is** the test. The 5 always-on workflows fire automatically on PR. The 5 perf workflows auto-fire because their `pull_request: paths` filter matches the workflow's own YAML — so opening this PR triggers all 10 affected workflows. What to look for in this PR's "Checks" tab: - [x] Each affected workflow has both `serial` and `parallel` matrix entries listed. - [x] Job display names show the mode (e.g. `tests-mac-linux (ubuntu-24.04, parallel)`). - [x] Wall-clock for hosted-runner workflows is essentially unchanged from main baseline (jobs ran concurrently on separate runners). - [x] Self-hosted single-pool workflows show ~2× wall-clock (matrix entries serialize). - [x] Both modes pass on every workflow. **If serial fails where parallel passes (or vice versa), that's a real regression we'd want to catch — exactly the point of this change.** After merge, regular PRs only pay the matrix cost on the 5 always-on workflows; the 5 perf workflows go back to being dispatch / schedule-only. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: taratorio <94537774+taratorio@users.noreply.github.com> Co-authored-by: Alex Sharov <AskAlexSharov@gmail.com> Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com> Co-authored-by: milen <taratorio@users.noreply.github.com>
Summary
commitmentCalculatorlazy-loads from the storage/account domains via a sharedasOfStateReader. The reader is constructed withtxNum=0and only updated insidecomputeAndPublish/computeWithoutCheck— both of which run on ablockResultmessage. Butcc.state.ApplyWritesruns on everytxResultand triggersensureAccount/ensureStoragelazy-loads through the same reader.Since
txResults arrive before the firstblockResult, the very first lazy-load fires withasOfReader.txNum = 0. On a synced-from-genesis datadirtxNum=0is in the visible history window soseekInFilesreturns gracefully. On a snapshot-loaded datadir the visible window starts well past genesis andseekInFileserrors with:The calculator publishes the wrapped error and FCU fails on every block.
Fix
Pin
asOfReader.txNumto the current tx'stxNumbeforeApplyWritesruns. SemanticallyGetAsOf(r.txNum)returns the value before tx N modified it = the pre-tx baseline, which is what lazy-load needs (only first-touch slots lazy-load; subsequent same-slot writes hitcs.storageStateand skip the read path).computeAndPublishoverwrites this back tolastTxNum + 1right beforeComputeCommitment, so the per-tx setting only affects the lazy-load path and never leaks into the trie fold path.Reproduction
Run the parallel commitment calculator on a snapshot-loaded chain (e.g. perf-devnet-3 starting at txNum~2.9B). Every block fails with the seekInFiles error above. Synced-from-genesis chains don't trip the bug because txNum=0 is in their window.
Test plan
EXEC3_PARALLEL=true: 17 SETUP blocks + 1 TEST block all VALID with 0 lazy-load errors (verified locally)🤖 Generated with Claude Code