Skip to content

db/state: fix collation/pruning race and unwind visibility in BuildFilesInBackground#20445

Merged
yperbasis merged 58 commits into
mainfrom
fix/getLatestFromDb-deletion-entries
Apr 16, 2026
Merged

db/state: fix collation/pruning race and unwind visibility in BuildFilesInBackground#20445
yperbasis merged 58 commits into
mainfrom
fix/getLatestFromDb-deletion-entries

Conversation

@yperbasis

@yperbasis yperbasis commented Apr 9, 2026

Copy link
Copy Markdown
Member

Summary

Fixes two related bugs in the domain state layer that cause gas mismatches during execution (#20169).

Bug 1: Collation/pruning race

BuildFilesInBackground collates domain files by reading values from the DB via per-worker read-transactions opened at collation time. Between the moment a step is deemed ready and the collation reads, execution can commit new batches that overwrite step S values with step S+1 data. The collated file for step S then contains wrong values. After pruning removes the DB entries, GetLatest returns the stale file values, causing SSTORE gas mispricing (20,000 vs 2,900 = 17,100 gas per affected slot).

Fix: Restructure buildFiles into two phases:

  1. Phase 1 (sequential, single read-tx): Collate all domains and indices using one MDBX read-transaction. All collations see the exact same DB snapshot — zero race window.
  2. Phase 2 (parallel, no DB access): Build files from collations. This is the expensive part and remains fully parallel.

Additionally, add a committed txNum guard: don't collate step S until ComputeCommitment has confirmed all data through the end of step S is flushed.

Bug 2: Unwind entry visibility after filing

After a reorg, the unwind restores domain entries tagged with a step derived from the unwind-target txNum. If BuildFilesInBackground has filed that step, getLatestFromDb discards the restored entry (step covered by files) and falls through to getLatestFromFiles, returning the stale end-of-step value instead of the changeset-restored value.

The fix must use the current (atomically published) file boundary from Aggregator.EndTxNumMinimax(), not the tx-scoped DomainRoTx.FirstStepNotInFiles() — because BuildFilesInBackground may have filed new steps since the tx was opened, and getLatestFromDb will see those via the current view.

Fix: Pass Aggregator.EndTxNumMinimax() into the unwind and tag restored entries with max(naturalStep, currentFilesEndStep).

Changes

  • db/state/aggregator.go: single-tx collation + parallel file building; committed txNum guard; stepFullyCommitted helper; pass current file boundary to unwind
  • db/state/domain.go: bump unwind step tag past current filed range
  • db/state/aggregator_test.go: TestAggregator_CommittedTxNumGuard
  • db/state/domain_test.go: TestDomain_CollationIsolatedFromLaterSteps, TestDomain_UnwindRestoredEntryVisibility
  • execution/commitment/commitmentdb/commitment_context.go: export DecodeTxBlockNums; fix minUnwindale typo; short-value length guard

Test plan

  • make lint passes
  • TestDomain_CollationIsolatedFromLaterSteps — step S collation ignores step S+1 data
  • TestDomain_UnwindRestoredEntryVisibility — unwind-restored entries survive getLatestFromDb after filing (fails without fix)
  • TestAggregator_CommittedTxNumGuard — guard blocks premature file building
  • Integration: sync mainnet to chain tip without gas mismatches, including after reorgs

🤖 Generated with Claude Code

When a storage slot is zeroed (SSTORE→0), the DB stores an empty-value
entry. If snapshot files later advance past that entry's step,
getLatestFromDb discarded it (returned found=false) because of the
lastTxNumOfStep < files.EndTxNum() guard. The caller then fell through
to getLatestFromFiles, which scans files newest-to-oldest. Since
snapshot files represent deletions as absent keys (no tombstone), the
scan reached an older file containing the pre-deletion value and
returned stale data.

This caused EIP-2200 SSTORE gas mispricing: GetCommittedState returned a
non-zero "original" value for a deleted slot, taking the SSTORE_RESET
path (2,900 gas) instead of SSTORE_SET (20,000 gas), producing gasUsed
deficits that fail block validation.

Fix: treat DB deletion entries (len(v)==0) as authoritative regardless
of step age. The DB returns the most-recent entry per key (inverted-step
ordering), so an empty value means the key was deleted and no newer
write exists.

Fixes #20169

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
yperbasis and others added 2 commits April 9, 2026 16:07
Return v ([]byte{}) instead of nil to preserve the empty-vs-nil
distinction. DeleteWithPrev and PutWithPrev(key, []byte{}, ...) are
indistinguishable at the DB level — both store empty values. Returning
nil broke TestDomain_GetAfterAggregation which writes legitimate empty
values and expects to read them back as []byte{}.

The critical part of the fix remains: found=true prevents the
fallthrough to getLatestFromFiles. For storage gas pricing, []byte{} is
correctly treated as zero by ReadAccountStorage (checks len(enc) > 0).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@yperbasis yperbasis marked this pull request as draft April 9, 2026 20:58
…otstrapping

Two additional fixes for the gas mismatch issue (#20169):

1. **Preserve deletion entries during domain pruning** (db/state/domain.go):
   The background file collation (`buildFilesInBackground`) runs concurrently
   with block execution. When it opens its MDBX read-transaction before the
   execution's CommitCycle flushes a deletion entry, the resulting domain file
   silently misses that deletion. Pruning then removes the DB entry because the
   file "covers" that step — leaving no record of the deletion in either the DB
   or the files. Subsequent reads fall through to an older file and return a
   stale pre-deletion value, causing EIP-2200 SSTORE gas mispricing.

   Fix: route Prune() through oldPrune() and skip deletion entries (empty
   values) during pruning. Combined with the getLatestFromDb fix from the
   previous commit, this ensures deletion evidence is never lost regardless
   of collation/execution timing.

2. **Tolerate domain-ahead-of-blocks during InsertBlocks** (inserters.go):
   When pre-verified state snapshots extend past the block snapshot boundary,
   NewSharedDomains returns ErrBehindCommitment. InsertBlocks only needs to
   accumulate block headers/bodies/TDs in the overlay — it does not need a
   valid commitment state. Allow it to proceed so the ForkChoice stage can
   run execution and close the TxNums gap.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@yperbasis yperbasis force-pushed the fix/getLatestFromDb-deletion-entries branch from 955c24c to e1475aa Compare April 10, 2026 04:22
yperbasis and others added 2 commits April 10, 2026 10:59
The background file builder (buildFilesInBackground) runs concurrently
with block execution. Its collation opens a read-transaction that can
snapshot the DB before the execution's CommitCycle flushes all writes for
a step. The resulting file silently misses those entries. Pruning then
removes them from the DB (the file "covers" the step), and the values
are lost — causing both stale reads for deleted keys and missing reads
for existing keys. See #20169.

Fix: the execution layer publishes its last committed txNum via
Aggregator.SetLastCommittedTxNum after each CommitCycle flush+commit.
The file builder checks this before collating each step: if
lastCommittedTxNum < firstTxNum(step+1), the step may have uncommitted
data and collation is deferred.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously initialCycle switched to false only when a stage-loop
iteration completed in under 5 minutes. This meant the switch happened
very close to the chain tip, leaving only ~50 blocks with changesets.
Any reorg request beyond that depth failed with "bad forkchoice" because
the node had no changesets to unwind with.

Add an explicit check: if the execution stage has reached within
MAX_REORG_DEPTH (default 96) blocks of the header-chain tip, switch
initialCycle to false immediately. This ensures at least reorgBlockDepth
blocks always have changesets, regardless of iteration timing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@yperbasis yperbasis force-pushed the fix/getLatestFromDb-deletion-entries branch from 326bef9 to 5d257a0 Compare April 10, 2026 09:22
Reverts the InsertBlocks change from e1475aa that allowed
ErrBehindCommitment to be silently ignored. The bootstrapping issue
it addressed is better handled by the initialCycle switch fix in
5d257a0.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@yperbasis yperbasis changed the title db/state: fix getLatestFromDb discarding deletion entries for old steps db/state: fix domain collation/pruning race causing gas mismatches Apr 10, 2026
OldPrune() is never called in production code. The deletion-entry
preservation for the active prune path is handled via the
IsDeletionEntry callback in TableScanningPrune.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread execution/execmodule/forkchoice.go Outdated
Comment thread execution/execmodule/forkchoice.go Outdated
yperbasis and others added 2 commits April 10, 2026 11:44
Mirror the serial executor fix: when the parallel executor is within
MaxReorgDepth of the batch end, switch initialCycle to false so
changesets are generated. Without this the parallel executor can reach
the tip with no changesets, making reorgs impossible.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- TestDomain_DeletionSurvivesPruning: verifies that deletion entries
  (empty values) in the DB are preserved by the scanning-prune path's
  IsDeletionEntry callback, even after the step is collated into a file.

- TestDomain_NonDeletionEntrySurvivesPruning: verifies that regular
  (non-empty) values ARE pruned from the DB after collation into a file,
  ensuring the deletion-preservation logic doesn't block normal pruning.

- TestAggregator_LastCommittedTxNumGuard: verifies the guard logic that
  prevents buildFilesInBackground from collating a step whose data may
  not be fully committed (lastCommittedTxNum < firstTxNum(step+1)).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@yperbasis

This comment was marked as low quality.

Address review feedback (AskAlexSharov): remove `lastCommittedTxNum`
state from the Aggregator. Instead, the background file builder reads
the latest committed txNum directly from the commitment domain state
in the DB via `GetLatest(kv.CommitmentDomain, KeyCommitmentState)`.

This avoids adding statefulness to the Aggregator and uses the existing
SharedDomains.ComputeCommitment write as the synchronization point.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread db/state/domain_test.go Outdated

This comment was marked as outdated.

AskAlexSharov added a commit that referenced this pull request Apr 16, 2026
…lesInBackground

Port of #20445 to release/3.4.

Fixes two related bugs causing gas mismatches during execution (#20169).

Bug 1: Collation/pruning race — BuildFilesInBackground opened a new
read-transaction per domain/index, so execution could commit step S+1
data between reads. The collated file for step S then contained wrong
values; after pruning, GetLatest returned stale file values.
Fix: restructure buildFiles into two phases — Phase 1 collates all
domains and indices using a single MDBX read-transaction; Phase 2
builds files in parallel from the collations (no DB access).
Also adds a committedTxNum guard: don't collate step S until
ComputeCommitment has confirmed all data through end of step S is flushed.

Bug 2: Unwind entry visibility — after a reorg, restored entries were
tagged with the natural step, which may already be covered by files.
getLatestFromDb discarded those entries and fell through to stale file values.
Fix: pass Aggregator.EndTxNumMinimax() into unwind and tag restored
entries with max(naturalStep, currentFilesEndStep).

Also: export DecodeTxBlockNums; fix minUnwindale typo; short-value guard.
Commit 835dd9b in ethpandaops/ethereum-package introduced GpuConfig,
a Starlark built-in that requires Kurtosis CLI >=1.18.1. CI installs
an older version so Starlark eval fails. Pin to e07503d (the commit
just before the regression) until CI is upgraded to Kurtosis 1.18.1.
Commit 835dd9b on main introduced GpuConfig, a Starlark built-in that
requires Kurtosis CLI >=1.18.1. CI installs an older version so Starlark
eval fails at startup. Pin to the 6.1.0 release tag which has GLOAS
support but not the breaking GpuConfig. Unpin once CI is on 1.18.1.
@yperbasis

Copy link
Copy Markdown
Member Author

Tracing results: deterministic gas mismatch at block 24,887,755

Setup

  • Branch fix/getLatestFromDb-deletion-entries (commit 00cee0a5)
  • NO_MERGE=true, --fcu.background.prune=false — merge and background prune both disabled
  • Fresh torrent-downloaded snapshots, clean chaindata
  • Added ad-hoc EVM tracer to dump SLOAD/SSTORE/BALANCE/SELFBALANCE for block 24,887,755

Per-tx gas comparison

tx canonical gas our gas diff
9 720,506 703,406 −17,100
48 372,786 355,686 −17,100
134 415,794 395,894 −19,900
184 691,502 657,302 −34,200
total −88,300

Root cause

Traced tx[9] (first diverging tx) on the real node. Found the specific corrupt value:

  • Contract: USDC (0xA0b86991c6218b36c1d19D4a2e9Eb0cE3606eB48)
  • Slot: 0xdd982ffea2f8bab549c39bd565f71cb62638c12955411757083fea7837ba6304
  • Canonical value at block 24,887,754: 0x0 (zero — balance was zeroed)
  • Our node's value: 0x30479e80 (810,106,496 — stale USDC balance)

The stale value was confirmed by SSTORE gas cost: canonical charges 20,000 (SET: slot was zero → non-zero), our node charges 2,900 (RESET: slot was already non-zero).

No earlier tx (0–8) in the block touches this slot. The stale value comes directly from GetLatest reading the domain file.

Why eth_getStorageAt shows correct values but execution reads stale ones

  • eth_getStorageAt(addr, slot, blockNum) uses GetAsOf → reads from history files → returns correct zero
  • Execution uses GetLatest → reads from domain .kv files → returns stale 0x30479e80
  • stateRoot comparison passes because it's computed from commitment history, not domain files

This is why all my earlier pre-block comparisons showed "0 mismatches" — they used eth_getStorageAt which reads correct history, while execution uses the corrupt domain files.

Which file is corrupt

The slot was non-zero canonically around block 24,850,000 (step ~8800) and zeroed by block 24,870,000 (step ~8816). The torrent-published domain file v2.0-storage.8800-8816.kv (or adjacent) captured the non-zero value but missed the deletion.

This is a DELETION entry that was lost during the original collation race on the node that built the published torrent files. The file has a stale non-zero value for a slot that should have a zero/deletion entry.

Why the earlier "0 mismatches" verification was wrong

My verification tool compared domain file entries against GetAsOf(endTxNum). Both the history AND the domain file have entries for this key — history has the correct deletion, the domain file has the stale non-zero value. The tool compared and found they both had "an entry" — it didn't detect that the VALUES differ because the domain file entry has a non-zero value while history says it should be zero at that point.

Fix needed

The getLatestFromDb deletion fix (treating DB deletion entries as authoritative, overriding stale file values) IS needed as defense-in-depth against corrupt published torrent files. The single-tx collation fix prevents producing NEW corrupt files, but can't protect against EXISTING corrupt files in the published torrent set.

@AskAlexSharov

AskAlexSharov commented Apr 16, 2026

Copy link
Copy Markdown
Collaborator

Fix needed - do it in separated PR plz. because i plan to merge this one. after #20593

@yperbasis

Copy link
Copy Markdown
Member Author

@AskAlexSharov OK. I'm reverting the unwind fix and keeping only the collation fix. Let's merge the collation fix and do further fixes in separate PRs.

Comment thread db/state/aggregator.go
// tx-scoped snapshot. BuildFilesInBackground may have filed new steps
// since this AggregatorRoTx was opened, and getLatestFromDb will use
// the current view — so the unwind must tag entries beyond it.
currentFilesEndStep := at.a.EndTxNumMinimax() / at.StepSize()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is ok

@AskAlexSharov

Copy link
Copy Markdown
Collaborator

current unwind fix looks ok: https://github.com/erigontech/erigon/pull/20445/changes#r3091413460

@yperbasis yperbasis marked this pull request as ready for review April 16, 2026 07:23
@yperbasis yperbasis enabled auto-merge April 16, 2026 07:25
@yperbasis yperbasis added this pull request to the merge queue Apr 16, 2026
Merged via the queue into main with commit 16ac985 Apr 16, 2026
65 checks passed
@yperbasis yperbasis deleted the fix/getLatestFromDb-deletion-entries branch April 16, 2026 09:16
AskAlexSharov added a commit that referenced this pull request Apr 16, 2026
…BuildFilesInBackground (#20594)

Port of #20445 to release/3.4.

## Summary

Fixes two related bugs in the domain state layer that cause gas
mismatches during execution (#20169).

### Bug 1: Collation/pruning race

`BuildFilesInBackground` collates domain files by reading values from
the DB via per-worker read-transactions opened at collation time.
Between the moment a step is deemed ready and the collation reads,
execution can commit new batches that overwrite step S values with step
S+1 data. The collated file for step S then contains wrong values. After
pruning removes the DB entries, `GetLatest` returns the stale file
values, causing SSTORE gas mispricing.

**Fix:** Restructure `buildFiles` into two phases:
1. **Phase 1 (sequential, single read-tx):** Collate all domains and
indices using one MDBX read-transaction. All collations see the exact
same DB snapshot — zero race window.
2. **Phase 2 (parallel, no DB access):** Build files from collations.
This is the expensive part and remains fully parallel.

Additionally, add a committed txNum guard: don't collate step S until
`ComputeCommitment` has confirmed all data through the end of step S is
flushed.

### Bug 2: Unwind entry visibility after filing

After a reorg, the unwind restores domain entries tagged with a step
derived from the unwind-target txNum. If `BuildFilesInBackground` has
filed that step, `getLatestFromDb` discards the restored entry (step
covered by files) and falls through to `getLatestFromFiles`, returning
the stale end-of-step value instead of the changeset-restored value.

**Fix:** Pass `Aggregator.EndTxNumMinimax()` into the unwind and tag
restored entries with `max(naturalStep, currentFilesEndStep)`.

### Changes

- `db/state/aggregator.go`: single-tx collation + parallel file
building; committed txNum guard; `stepFullyCommitted` helper; pass
current file boundary to unwind
- `db/state/domain.go`: bump unwind step tag past current filed range
- `db/state/aggregator_test.go`: `TestAggregator_CommittedTxNumGuard`
- `db/state/domain_test.go`:
`TestDomain_CollationIsolatedFromLaterSteps`,
`TestDomain_UnwindRestoredEntryVisibility`
- `execution/commitment/commitmentdb/commitment_context.go`: export
`DecodeTxBlockNums`; fix `minUnwindale` typo; short-value length guard
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Gas mismatches on mainnet & Hoodi

6 participants