Skip to content

stagedsync: fix commitmentCalculator asOfReader.txNum=0 lazy-load on snapshot-loaded chains#21010

Merged
mh0lt merged 2 commits into
mainfrom
mh/fix-commit-calc-asofreader-txnum
May 7, 2026
Merged

stagedsync: fix commitmentCalculator asOfReader.txNum=0 lazy-load on snapshot-loaded chains#21010
mh0lt merged 2 commits into
mainfrom
mh/fix-commit-calc-asofreader-txnum

Conversation

@mh0lt

@mh0lt mh0lt commented May 6, 2026

Copy link
Copy Markdown
Contributor

Summary

commitmentCalculator lazy-loads from the storage/account domains via a shared asOfStateReader. The reader is constructed with txNum=0 and only updated inside computeAndPublish / computeWithoutCheck — both of which run on a blockResult message. But cc.state.ApplyWrites runs on every txResult and triggers ensureAccount / ensureStorage lazy-loads through the same reader.

Since txResults arrive before the first blockResult, the very first lazy-load fires with asOfReader.txNum = 0. On a synced-from-genesis datadir txNum=0 is in the visible history window so seekInFiles returns gracefully. On a snapshot-loaded datadir the visible window starts well past genesis and seekInFiles errors with:

ensureStorage(...): seekInFiles(invIndex=storage,txNum=0) but data before txNum=2918750000 not available

The calculator publishes the wrapped error and FCU fails on every block.

Fix

Pin asOfReader.txNum to the current tx's txNum before ApplyWrites runs. Semantically GetAsOf(r.txNum) returns the value before tx N modified it = the pre-tx baseline, which is what lazy-load needs (only first-touch slots lazy-load; subsequent same-slot writes hit cs.storageState and skip the read path).

computeAndPublish overwrites this back to lastTxNum + 1 right before ComputeCommitment, so the per-tx setting only affects the lazy-load path and never leaks into the trie fold path.

Reproduction

Run the parallel commitment calculator on a snapshot-loaded chain (e.g. perf-devnet-3 starting at txNum~2.9B). Every block fails with the seekInFiles error above. Synced-from-genesis chains don't trip the bug because txNum=0 is in their window.

Test plan

  • CI green
  • Bench on snapshot-loaded perf-devnet-3 with EXEC3_PARALLEL=true: 17 SETUP blocks + 1 TEST block all VALID with 0 lazy-load errors (verified locally)
  • Bench on synced-from-genesis chain: no behavior change (lazy-load already worked at txNum=0 there)

🤖 Generated with Claude Code

commitmentCalculator constructs asOfReader with txNum=0 and only updates
it inside computeAndPublish/computeWithoutCheck (which run on
blockResult). But cc.state.ApplyWrites runs on every txResult and triggers
ensureAccount/ensureStorage lazy-loads via the same asOfReader. Since the
first message into cc.in is always a txResult (txs complete before their
containing block boundary fires blockResult), the first lazy-load fires
with asOfReader.txNum=0.

On a synced-from-genesis datadir txNum=0 is in the visible history
window so seekInFiles returns gracefully. On a snapshot-loaded datadir
(perf-devnet-3, anything started from a chunked snapshot) the visible
window starts well past genesis (~2.9B in our case) and seekInFiles
errors with "data before txNum=2918750000 not available". The
calculator publishes the wrapped error and FCU fails for every block.

Pin asOfReader.txNum to r.txNum before ApplyWrites. Semantically
GetAsOf(r.txNum) returns the value before tx N modified it = pre-tx
state, which is the baseline ApplyWrites needs (only first-touch slots
lazy-load; subsequent same-slot writes hit cs.storageState and skip
the read path). computeAndPublish overwrites this back to lastTxNum+1
right before ComputeCommitment, so the per-tx setting only affects the
lazy-load path and never leaks into the trie fold.
@mh0lt mh0lt requested a review from yperbasis as a code owner May 6, 2026 11:19
@yperbasis yperbasis requested review from awskii and taratorio May 6, 2026 11:52
@yperbasis yperbasis added this to the 3.5.0 milestone May 7, 2026
@mh0lt

mh0lt commented May 7, 2026

Copy link
Copy Markdown
Contributor Author

Linked as a precursor for #21017 (CI matrix exec_mode={serial,parallel}). Without this fix, the matrix's QA entries that run against snapshot-loaded datadirs (qa-rpc-performance-comparison-tests, qa-rpc-integration-tests-latest) hit seekInFiles(txNum=0) on every block under EXEC3_PARALLEL=true because asOfReader's initial txNum=0 falls outside the visible history window. Companion to #21032 (SD/incarnation differentiator) and #21039 (apply-loop completeness).

… pin

Locks down the post-condition exercised by the fix in this PR:
after handleMessage(*txResult), cc.asOfReader.txNum must equal
the txResult's txNum. This is the property whose absence makes
snapshot-loaded chains crash with seekInFiles(txNum=0) on the
first lazy-load.

Three subassertions:
  1. First txResult bumps txNum from 0 → r.txNum (the basic fix).
  2. Subsequent txResult bumps it again (the per-tx update).
  3. Empty-writes txResult does NOT bump (matches the fix's
     `if len(r.writes) > 0` guard — no lazy-loads to seed, so
     leaving the prior pin alone is correct).

Verified the test fails on origin/main without the fix
(expected 12345, actual 0).

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a commitment-calculator failure on snapshot-loaded datadirs where early per-tx lazy-loads could read with asOfReader.txNum=0 (outside the visible history window), causing seekInFiles(txNum=0) errors and failing FCU.

Changes:

  • Pin cc.asOfReader.txNum to the current txResult.txNum before ApplyWrites to ensure lazy-load reads use the correct pre-tx baseline.
  • Add a regression unit test asserting asOfReader.txNum is updated on txResult processing (and not updated for empty writes).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
execution/stagedsync/committer.go Pins asOfReader.txNum on txResult before applying writes to prevent snapshot-window txNum=0 lazy-load failures.
execution/stagedsync/committer_test.go Adds a regression test validating txResult processing updates asOfReader.txNum as intended.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +110 to +114
// the snapshot-loaded lazy-load crash: cc.asOfReader is constructed with
// txNum=0, and only computeAndPublish / computeAndCheck (which run on
// blockResult) overwrite that field. But cc.state.ApplyWrites runs on
// every txResult and triggers ensureAccount / ensureStorage lazy-loads
// through the same reader. txResults arrive before the first blockResult,
@mh0lt mh0lt added this pull request to the merge queue May 7, 2026
Merged via the queue into main with commit 69d6181 May 7, 2026
42 checks passed
@mh0lt mh0lt deleted the mh/fix-commit-calc-asofreader-txnum branch May 7, 2026 19:51
mh0lt pushed a commit that referenced this pull request May 8, 2026
Two related additions to the snapshot-vs-MDBX perf-equivalence
investigation (memory: snapshot-vs-mdbx-performance-equivalence.md):

1. H_GetAsOf bench (db/state/snapshot_vs_mdbx_bench_test.go).
   New runHGetAsOf with four sub-benches for HistorySeek-via-GetAsOf
   on file-resident keys: GetLatest_baseline, GetAsOf_recent (asOf
   near endTxNum), GetAsOf_mid (asOf at endTxNum/2), GetAsOf_floor
   (asOf=1). Tests the path the calculator's HistoryStateReader.Read
   uses, which is distinct from getLatestFromFiles measured in H0.

   Real-datadir results on perf-devnet-3 (AccountsDomain, endTxNum=2.9B):
     GetLatest_baseline 202 ns/op   0 allocs
     GetAsOf_recent     570 ns/op   0 allocs   <- 2.8x baseline, no result
     GetAsOf_mid        235 ns/op   5 allocs
     GetAsOf_floor      196 ns/op   4 allocs

   GetAsOf_recent (the calculator's pattern after PR #21010) scans
   the .ef looking for a record at-or-after endTxNum-1, finds none
   (most keys haven't changed in the last txNum), falls through to
   GetLatest. The 370ns/op overhead vs GetLatest is wasted scan.
   Confirms the GetAsOf shortcut described in
   getasof-regression-suspect.md as a real lever, though small in
   absolute terms (~2ms/block on the bloat workload).

2. Surface "took" + "keys" on the existing [commitment][cache-fp]
   Info log line (commitmentdb). Was already computed in the
   debug-level "[commitment] processed" log, but the bench runs with
   --log.dir.disable so debug logs aren't captured.

   This made it possible to attribute the 4.3s gap inside
   newPayload(TEST block) directly: the calculator's ComputeCommitment
   takes 4220ms for the 5910-key bloat block — 91% of the entire
   block wall time. Per-key cost is ~700us, consistent across blocks
   of all sizes. The actual perf lever for the bloat workload is
   making per-branch ComputeCommitment cheaper, not file/state reads.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pull Bot pushed a commit to Dustin4444/erigon that referenced this pull request May 26, 2026
…igontech#21017)

> **2026-05-15 — erigontech#21153 has merged.** The companion PR carrying the
substantive parallel-exec correctness + perf fix family
(`mh/parallel-exec-fixes`, merged 2026-05-15 02:35Z as `958b2fbb85`) is
now on `main`. This PR has been rebased onto fresh main; the trimmed
branch contains only the serial/parallel exec-mode CI matrix yamls plus
two CI hygiene fixes (a shared `build-erigon-image` job for the kurtosis
matrix, and the `setup-erigon` apt-mirror flake fix), plus three
workflow follow-ups from the 2026-05-13 review.
>
> **Landed precursors:**
> - ✅ erigontech#21153 (merged 2026-05-15) — parallel-exec correctness/perf stack
split from this PR
> - ✅ erigontech#21088 (merged 2026-05-10) — parallel-commitment correctness for
reorg/unwind + SD recreate
> - ✅ erigontech#21032 (merged 2026-05-08) — wrong-trie-root on Cancun / SD paths
> - ✅ erigontech#21039 (merged 2026-05-08) — apply-loop completeness false-flag on
partial-batch exit
> - ✅ erigontech#21010 (merged 2026-05-07) —
`commitmentCalculator.asOfReader.txNum=0` lazy-load fix
>
> Open follow-up issues (separate from this PR's scope): erigontech#21106
(parallel-exec hardening), erigontech#21107 (stage-exec `from-0`/`chaintip`),
erigontech#21108 (residual `EXEC3_PARALLEL` functional cases), erigontech#21136 (gated
SkipLoads), erigontech#21138 (heuristic-removal structural cleanup).

## Summary

Adds an `exec_mode: [serial, parallel]` matrix axis to every CI test
workflow that exercises the EL execution path so divergence between
`dbg.Exec3Parallel` true and false is caught on the PR rather than after
release.

The toggle is plumbed via `ERIGON_EXEC3_PARALLEL` — `envLookup` in
[common/dbg/dbg_env.go:38](common/dbg/dbg_env.go#L38) auto-prepends the
`ERIGON_` prefix to the source-side `EXEC3_PARALLEL` flag in
[common/dbg/experiments.go:81](common/dbg/experiments.go#L81).

## Why

`Exec3Parallel = false` is currently the source default on `main`. With
no CI workflow setting the env var, every CI run today exercises only
the **serial** path. The parallel path lands via `--bal` chains and
tests like `experimentalBAL`, but the broad correctness signal (unit /
race / hive / kurtosis / RPC integration) is single-mode. This PR makes
both modes a per-PR signal.

## Affected workflows

**Always-on** (matrix on every PR / dispatch / call):

| Workflow | Runner | Parallelism | Cost |
|---|---|---|---|
| test-all-erigon.yml | GitHub-hosted (per-OS) | true parallel |
wall-clock unchanged, +1× runner-min |
| test-all-erigon-race.yml | GitHub-hosted | true parallel | same |
| test-hive.yml | hive group | parallel if pool has slots | same (group
is sized) |
| test-hive-eest.yml | hive group | parallel if pool has slots | same |
| test-kurtosis-assertoor.yml | `ubuntu-latest` | true parallel | same |

**Auto-fire on PRs touching their own YAML, dispatch otherwise** (so
regular PRs don't pay the cost; this PR triggers each one once for
end-to-end validation):

| Workflow | Notes |
|---|---|
| test-bench.yml | `go bench` |
| qa-rpc-integration-tests-latest.yml | self-hosted single-pool,
`max-parallel: 1` (shared testbed state) |
| qa-rpc-performance-comparison-tests.yml | erigon runs serial+parallel,
geth single-mode (placeholder ignored) |
| qa-txpool-performance-test.yml | kurtosis txpool, `max-parallel: 1` |
| qa-stage-exec.yml | 3 mode_names × 2 exec_modes = 6 entries |

**Skipped** — `test-integration-caplin.yml` runs `cl/spectest` only,
doesn't exercise the EL exec path; matrix-doubling would just re-run
identical CL tests.

## Plumbing details

* Workflows that build erigon and run go tests directly: env var set on
the test step's `env:` block.
* Hive-based workflows: an `ENV ERIGON_EXEC3_PARALLEL=...` line is
appended to `clients/erigon/Dockerfile` during the existing sed-patch
loop so every hive-launched erigon inherits the toggle.
* Kurtosis-based workflows: a single upstream `build-erigon-image` job
builds `test/erigon:current-base` once for the whole matrix and uploads
it as a run-scoped artifact; each matrix entry downloads + `docker
load`s and adds a cheap `ENV ERIGON_EXEC3_PARALLEL=…` layer on top to
produce `test/erigon:current`. Avoids ~12 concurrent buildx jobs all
writing the same `type=gha` cache scope and 504-ing.
* Self-hosted single-pool workflows use `max-parallel: 1` to serialize
matrix entries cleanly when state on the runner box is shared (testbed
datadir, reference datadir, etc.).
* All artifact / enclave / testbed-dir names are disambiguated by
`exec_mode` so the two matrix entries don't clobber each other's
outputs.

## Review follow-ups (2026-05-15 rebuild)

After rebasing onto post-erigontech#21153 main, three workflow fixes from the
2026-05-13 review are applied as separate commits on top of the 4
trimmed CI commits:

| Commit | Fixes |
|---|---|
| `ci: gate parallel-suffix QA test_name on client==erigon` | yperbasis
Blocker 2 + Copilot thread #4 — geth's placeholder `exec_mode: parallel`
previously caused its Grafana `test_name` to gain a stray `-parallel-`
suffix, breaking historical dashboard queries |
| `ci: align test-hive devp2p sim-limit between serial and parallel
matrix legs` | yperbasis nit 5 + Copilot threads #3/#9 — both legs now
run `sim-limit: eth` (discv5 doesn't exercise the EL exec path; matches
the long-standing comment) |
| `ci: fix Targetting typo in test-hive-eest.yml` | Copilot thread #6 —
s/Targetting/Targeting/ |

The yperbasis Blocker 1 (`cmd/integration/commands/stages.go:631`
ignores `ERIGON_` prefix, so `qa-stage-exec`'s serial entry silently
runs in parallel mode) is a Go change, raised as a separate small PR to
keep this one strictly `.github/` only.

## Test plan

This PR's CI run **is** the test. The 5 always-on workflows fire
automatically on PR. The 5 perf workflows auto-fire because their
`pull_request: paths` filter matches the workflow's own YAML — so
opening this PR triggers all 10 affected workflows.

What to look for in this PR's "Checks" tab:
- [x] Each affected workflow has both `serial` and `parallel` matrix
entries listed.
- [x] Job display names show the mode (e.g. `tests-mac-linux
(ubuntu-24.04, parallel)`).
- [x] Wall-clock for hosted-runner workflows is essentially unchanged
from main baseline (jobs ran concurrently on separate runners).
- [x] Self-hosted single-pool workflows show ~2× wall-clock (matrix
entries serialize).
- [x] Both modes pass on every workflow. **If serial fails where
parallel passes (or vice versa), that's a real regression we'd want to
catch — exactly the point of this change.**

After merge, regular PRs only pay the matrix cost on the 5 always-on
workflows; the 5 perf workflows go back to being dispatch /
schedule-only.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: taratorio <94537774+taratorio@users.noreply.github.com>
Co-authored-by: Alex Sharov <AskAlexSharov@gmail.com>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: milen <taratorio@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants