State Cache Consolidation (PR #1 of the perf stack) by mh0lt · Pull Request #21380 · erigontech/erigon

mh0lt · 2026-05-23T18:47:15Z

This is PR #1 of a 3-PR perf stack. It consolidates state caching across all modes — sync, tip-tracking, integration and testing — so the state cache is either on and completely trustworthy, or off to measure — never off because it unexpectedly breaks things.

The caches (the account/storage StateCache and the commitment BranchCache) are an internal implementation detail of SharedDomains. No external entity accesses or mutates them directly: callers drive state through Flush / Commit / GetLatest / DomainPut, and the full cache lifecycle (population, invalidation, commit-gating) is owned inside SharedDomains.

What this PR contains

`BranchCache` — single aggregator-scope commitment cache

Aggregator-lifetime cache: pinned root slot + bounded LRU tail, behind the sd.mem chain so unwinds and fork-validations see consistent state.
Wired into the trie read + encoder write paths.
Tx-precise unwind invalidation: entries are stamped with their per-key write txNum; sd.Unwind evicts everything above the unwind watermark (BranchCache.UnwindTo).

One switch for all caches

The BranchCache is a type of state cache, so it rides the existing USE_STATE_CACHE toggle rather than getting its own env. One operator switch turns all caching off — the relevant operation when bisecting a state-root mismatch, where an operator shouldn't have to reason about the interaction of several independent caches. (This is a deliberate deviation from the review's "add a separate BranchCache kill-switch" suggestion — flagged for confirmation.)

The BranchCache reflects only committed state — by construction

The BranchCache can never hold a value a failed commit rolled back: SharedDomains.Commit flushes the in-memory batch into the tx, commits, and only then applies the flushed commitment branches to the cache (the flush is implicit in committing; a failed commit applies nothing). Plain Flush — callers that own their own commit, e.g. offline tools — never touches the cache; read-through populates from committed files. The caches are an internal detail of SharedDomains; nothing else writes to them.

StateCache no-poisoning is #21386, not this PR. Unlike the BranchCache, the account/storage StateCache is an in-flight, cross-transaction cache — it holds prior txs' not-yet-committed writes within a batch, and later txs read them from it. So it can't be made commit-safe by simply invalidating on write (that breaks cross-tx reads in serial exec). Its no-poisoning is the txNum/epoch rework in #21386; this PR keeps its existing ValidateAndPrepare/unwind invalidation.

BUG #21138 — parallel-exec from-0 wrong trie root

ResetExec wipes the commitment DB table; the aggregator's in-memory BranchCache could still reference the just-deleted trie nodes, so a from-0 re-exec served stale entries when computing block 0's commitment → wrong root, dropping genesis-allocated balances no later block touched (mainnet block 46147, 0xA1E4380A3B1f749673E270229993eE55F35663b4). Fix: ResetExec clears the aggregator's BranchCache. TestFromZero_GenesisAllocPreservedAfterResetReExec passes on current main; the test's value here is keeping this PR's cache safe across reset, not fixing a live main bug.

Follow-ups (the rest of the stack)

StateCache LRU + Mode rework (PR #2 of the perf stack) #21386 (PR Pull from go-ethereum up to 2f24e25 (6 Mar 2019) #2 of the stack) — StateCache LRU + Mode rework: consistency + no performance drop-off at a 1 GB cache for long-running nodes; re-adds the warm StateCache repopulation deferred above, under its txNum/epoch model.
Pinning (the stack's third step) — no PR yet; in progress and under test on branch mh/branch-cache-trunk-pin, to be re-benchmarked before merge.
Unify the temporal kv GetLatest interface — eliminate accreted duck-typed metered-getter variants #21739 — interface-unification follow-up: collapse the duck-typed GetLatest variants into a single metered, txNum-returning GetLatest.

Testing

Behaves identically across parallel and serial exec — confirmed in CI across both exec modes. Unit coverage for the BranchCache (tiers, tx-precise UnwindTo, commit-gated population) plus the engine/exec-module FCU commit paths.

Given today's changes (the commit-gating of both caches), we will do another A/B performance run before merging.

Pure refactor — behavior preserved. Splits the encoded-branch→cells decode logic out of HexPatriciaHashed.unfoldBranchNode into a free function DecodeBranchInto so the same code is consumed by: - unfoldBranchNode (existing trie unfold path) - future cache populators (decoded-payload BranchCache) - future parallel pre-unfold orchestrator (Stage E) Today these would each have to re-derive the encoded-branch parsing logic. Centralising it ensures one decoder, one set of edge cases, one place to fix any bug in the on-disk format handling. DecodeBranchInto is intentionally PURE — it does not call deriveHashedKeys. Trie callers (which need hashed keys for the fold state machine) follow the decode with their own keccak loop. Cache callers can skip the derive step entirely until the cell is consumed by the trie. Tests: - TestDecodeBranchInto_RoundTrip: BranchEncoder.EncodeBranch produces bytes that DecodeBranchInto recovers cell-for-cell. Property test that keeps the canonical decoder consistent with the canonical encoder. - TestDecodeBranchInto_DeletedFlag: touchMap/afterMap convention with the deleted parameter. - TestDecodeBranchInto_TruncatedInput: clean errors on truncated input (no panic). Plus the existing commitment test suite (incl. trie-mismatch tests in TestBranchData_*) all pass without modification, confirming the refactor preserves unfoldBranchNode's behavior. This is the foundation for the next refactors in the representation-reduction track (see agentspecs/trie-data-pipeline-complexity-tax.md): subsequent PRs will introduce a decoded-payload cache that reads through this same decoder, and will lift unfoldKeyPath as a per-key traversal primitive that the warmer + future Stage E both consume. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Introduces a new BranchCache type, distinct from WarmupCache and designed for the longer-lived caching the cross-block persistence work needs (step 7 of the representation-reduction sequence). Distinguishing characteristics vs WarmupCache: - Bounded LRU tail with configurable capacity (vs WarmupCache's unbounded map). Suitable for caches that outlive a single Process without unbounded memory growth. - Single pinned slot for the root branch (compact prefix [0x00]). Root never evicts. Atomic-pointer load/store on the hot read path, no lock involved. - dirty-flag + PutIfClean invariants — same semantics as the invariants added to WarmupCache in the previous commit. Lets cross-block writers race safely with fold updates. - Lazy GetDecoded — same lazy-decode pattern as WarmupCache's GetBranchDecoded; cells populated on first decoded-read and cached for subsequent reads. NOT yet wired into the trie's read or write paths. This commit just adds the type, with tests. The trie integration (where this cache plugs into branchFromCacheOrDB and the encoder's PutBranch) is the discussion point at the step 6 boundary — see the conversation captured at this point in the representation-reduction sequence. Today the cache is intended to be ephemeral (per-Process, constructed alongside the trie, dies with it). Step 7 lifts the lifetime to the aggTx level for cross-block persistence; the cache shape (bounded LRU + pinned root + dirty-flag) is in place ahead of that. Tests: - TestBranchCache_RootPinning: root branch lands in pinned slot, deep branches land in LRU tail; per-tier hit counters update independently. - TestBranchCache_RootSurvivesEvictionPressure: root persists when tail is overfilled past capacity. - TestBranchCache_DirtyFlag: PutIfClean refuses dirty entry, unconditional Put replaces and clears dirty. - TestBranchCache_GetDecoded: lazy-decode round-trip with BranchEncoder; cells pointer reused across reads. - TestBranchCache_Invalidate: removes from both tiers. - TestBranchCache_Clear: empties both tiers, resets stats. - TestBranchCache_Stats: deterministic format with per-tier counts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Doc-only addition to BranchCache's package-level comment, capturing the caller invariants the cache assumes and the conditions under which the existing concurrent trie satisfies them. Three caller invariants: 1. Single writer per prefix at any moment. 2. Mark-dirty-then-Put discipline for racing writers. 3. Decoded cells from GetDecoded are read-only (alias entry storage). The current ConcurrentPatriciaHashed satisfies all three by construction: - Mounts partition by first nibble (disjoint prefix spaces, no cross-mount writes to the same key). - Root branch written by single sequential fold post-Wait. - Mount→root grid roll-up is rootMu-protected (in-memory grid only, separate from cache writes). Doc explicitly flags that any future parallel fold redesign (Stage F in agentspecs/stage-e-pre-unfold-design.md) MUST preserve these invariants — particularly the single-writer-per-prefix one, which breaks if parent branches are written incrementally as children complete in parallel. The required coordination layer goes at the orchestrator (per-parent atomic counter; only the last-decrementer writes the parent), NOT inside the cache. The cache's existing primitives (atomic dirty flag, thread-safe LRU, atomic root pointer) are sufficient for that orchestrator to build on. Motivation: Stage F is likely deferred because the bench data shows fold isn't the bottleneck for the canonical SSTORE-bloat workload. But adding the constraint to the cache later (after caching is in production) is much harder than documenting it now — correctness regressions from a missed coordination layer can hide for many blocks. Documenting the contract on the cache itself ensures any engineer touching parallel fold sees it. No code change. Doc-only. All tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Step 7a of the representation-reduction sequence: per-Process integration of the BranchCache type added in the previous commit. Plumbing: - Trie interface gains SetBranchCache(*BranchCache). HexPatriciaHashed implementation propagates to its branchEncoder. ConcurrentPatriciaHashed implementation propagates the SAME instance to root + all 16 mounts (sharing one cache is correct under the concurrency contract — mounts partition prefix space by first nibble, so cross-mount writes target distinct keys). - InitializeTrieAndUpdates constructs a new BranchCache(default) per trie instance and attaches it. Lifetime today = trie lifetime = per-Process. Future cross-block persistence work (step 7b) lifts this to aggTx scope by constructing the cache one layer up and passing it in. Read path (HPH.branchFromCacheOrDB): L1 WarmupCache (existing) → L2 BranchCache (new) → L3 ctx.Branch. L3 hits with non-empty result populate L2 so subsequent reads hit L2 within the cache's lifetime. L1 stays first because warmup workers may have pre-fetched with prefix-walk-derived freshness. Write path (BranchEncoder.CollectUpdate): - MarkDirty(prefix) BEFORE encode work — protects against concurrent warmup-style writers racing into PutIfClean during the encode (race documented in the cache's Concurrency Contract). - Put(prefixCopy, updateCopy) AFTER ctx.PutBranch succeeds — replaces the dirty entry with fresh canonical bytes. Single writer per prefix per fold step (current sequential fold + first-nibble mount partitioning) means no race on this Put. Lifecycle: HPH.Reset clears the BranchCache when called from the root trie (gated by !hph.mounted). Mounted subtries share the root's cache, so a mount calling Clear would dump entries the root still wants. Carries the invariant from PR #19954 commit 1612d56. Today's expected performance impact: minimal. Per-Process lifetime means cache is empty at Process start, so first reads always miss. The cache helps only branches that are read multiple times within ONE Process — uncommon in current code paths. Step 7b is where the real perf swing comes from (cross-block persistence so block N reads hit branches written by block N-1). This commit is the safe stepping stone: it validates the wire-up end-to-end (read path + write path + concurrency contract + lifecycle) without changing perf characteristics. Bench should match Run I baseline (7.16 mgas/s on canonical SSTORE-bloat block). All commitment tests pass, lint clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previously the BranchCache was constructed inside InitializeTrieAndUpdates, giving it per-SharedDomains lifetime. SharedDomains is reconstructed for every batch / many tx boundaries — verified with logs in the prototype (921 fresh BranchCache constructions per bench run) — so the cache started cold every batch and never delivered the cross-block hits the design is about. Per-Process scope kept cache-cleanup correctness simple but defeated the whole point of caching. Place the cache on the commitment Domain struct (one cache per aggregator, matching the pattern on add_execution_context_with_caches where each domain owns its valueCache). ConfigureDomains attaches the cache once after domains are initialised — idempotent, lifetime = aggregator lifetime. AggregatorRoTx exposes BranchCache() returning the commitment domain's cache, so SharedDomains' construction path can fetch it without forcing db/state/execctx to import db/state (db/state already imports execctx via squeeze.go, so the reverse import would create a cycle). The placeholder commitment.BranchCacheProvider interface lets the SD construction path use a duck-typed type assertion on tx.AggTx() any. Plumb the cache through NewSharedDomainsCommitmentContext into InitializeTrieAndUpdates as an explicit parameter; nil falls back to a fresh per-init cache so test helpers without an aggregator still get a valid cache. Reset behaviour: HexPatriciaHashed.Reset no longer calls Clear on the cache. Aggregator-scope persistence requires the cache to survive Reset between commitment calculations. Callers that genuinely need to invalidate the cache (unwind, fork validation) now call ClearBranchCache explicitly. The bench is forward-only so this is a safe change for measurement; an explicit unwind clear path can land in a follow-up commit when needed.

Adds two debug-gated diagnostics for cross-block-cache-lifetime investigations. Both off by default — meant to be flipped on when a correctness regression surfaces and you need to localise *where in the fold path* the cache started lying, instead of waiting for a downstream trie-root-mismatch many blocks later. 1. BRANCH_CACHE_VERIFY: branchFromCacheOrDB cross-checks every L2 (BranchCache) hit against ctx.Branch and increments a divergence counter when bytes disagree. Logs the prefix and both byte forms so the first divergent read shows up directly in the erigon log. BranchCache.VerifyDivergences() exposes the count for assertion in tests. 2. BRANCH_CACHE_FINGERPRINT: SharedDomainsCommitmentContext.ComputeCommitment emits a "[cache-fp]" log at end of every compute with (block, root, cache fingerprint, divergence count). Two builds running the same workload can be diffed offline ("first block at which their fp's differ") to nail down the first block where lifecycle invariants diverged. Fingerprint is an order-independent FNV-1a fold over (key-hash, data-hash) pairs across both root and tail tiers. Adds maphash.LRU.Range so the Fingerprint can iterate the LRU tail without touching recency (Peek under the hood). The wrapper otherwise discards original byte-keys on insert; mixing by hash instead of key is correctness-equivalent up to hash collision, which is acceptable at the working-set sizes here. Motivation: today we just chased a wrong-root regression to commit d673052 (aggregator-scope cache lifetime). Took two full bench cycles (~12 min) to localise. With the divergence detector running, the first divergent read would surface in the erigon log within seconds. With fingerprint logging in two builds (one good, one regressed), the diverging block boundary would be a single grep across two log files.

The deferred-encoding path (CollectDeferredUpdate + ApplyDeferredUpdates) parallelises EncodeBranch + merge work at apply time. The cache correctness consequence: between collect and apply, sd.mem holds the old state while the cache is also unchanged. When apply finally fires (end of Process or duplicate-prefix flush mid-Process), sd.mem advances but the cache isn't touched — CollectDeferredUpdate doesn't have a cache hook (and wiring one breaks TestSharedDomain_RepeatedUnwindAcrossStepBoundary + TestCustomTraceReceiptDomain because it violates an implicit trie-during-Process cache stability invariant). The result on the FV bench: cache stays at the very-first-Process's read-side L3 fallback bytes for prefixes the trie writes, while ctx.Branch advances to each block's actual state. Divergence on every hot prefix (root, root-zone children) starting from block 2. Use CollectUpdate (inline) instead. CollectUpdate writes sd.mem + WarmupCache + BranchCache atomically at fold time via PutBranch — cache mirrors sd.mem at every write, the trie sees its own writes consistently, and the cross-Process cache state matches what post-FCU MDBX commit produced. Loses the encoder's parallel-encoding optimisation, but bench profile is I/O-bound, not encode-CPU-bound, so the trade is favourable. History (ETL) writes are still inline via DomainPut. Splitting that ("sd.mem inline, history queued for flush at FCU") is the proper architectural answer to defer the slow disk work without touching the sd.mem invariant — tracked as a follow-up.

Bisection helper: force branchFromCacheOrDB to skip the L2 BranchCache read path entirely so every read goes via ctx.Branch (sd.mem → MDBX). Cache writes still fire so verify-mode can keep comparing cache vs canonical. Flipping the env at runtime distinguishes "cache holds bad data" (bench passes further with cache reads off) from "deeper compute bug" (same failure regardless). Used in the 2026-05-06 investigation to confirm the cache was actively corrupting block 13's compute on the canonical SSTORE-bloat bench: with cache reads on, wrong-trie-root at block 13. With reads off, blocks 13-16 produce the correct roots and the bench advances to a separate failure at block 17 (unrelated pre-existing bug). Default off; gate via env DISABLE_BRANCH_CACHE_READS=true.

When verifyBranchCache=true and a cache hit disagrees with ctx.Branch, sample sd.mem, sd.parent.mem, and tx-direct (MDBX) for the same prefix and dump all layers in the divergence log line. Comparing those four byte sequences against the cached and canonical bytes pinpoints which state layer holds the bytes the cache disagrees with — the rewriter we need to identify before fixing the canonical-store-divergence bugs the cache currently exposes. Decision matrix (read off the log line): - cache != tx, sd.mem == cache → in-memory writer is fresh, MDBX is stale (commit-timing issue). - cache != tx, tx == ctx.Branch, sd.mem != cache, parent.mem != cache → MDBX has been rewritten by something outside the CollectUpdate write path (collation, file build, squeeze). - cache != ctx.Branch, parent.mem matches ctx.Branch but sd.mem doesn't → parent merge is the source. - cache != ctx.Branch, all of sd.mem / parent.mem / tx == ctx.Branch → cache itself was populated incorrectly (write-side bug). Three changes: 1. SharedDomains.ProbeReadLayers (db/state/execctx/domain_shared.go): public method that samples sd.mem, sd.parent.mem (private field accessed from the same package), and tx.GetLatest. Read-only; copies bytes so callers can hold them past tx lifetime. 2. TrieContext (execution/commitment/commitmentdb/commitment_context.go): add probeSd + probeTx fields populated at trieContext() construction; expose ProbeStateLayers method that delegates to sd.ProbeReadLayers. The local `sd` interface gets the ProbeReadLayers method too so the duck-typed reference can call it without an import cycle to execctx. 3. branchFromCacheOrDB log site (execution/commitment/hex_patricia_hashed.go): on divergence with verifyBranchCache, type-assert ctx for the probe interface and append sd_mem / parent_mem / mdbx fields to the log line. Existing field shape preserved for log parsers; new fields are additive. Pre-existing test failures (TestSharedDomain_RepeatedUnwindAcrossStepBoundary, TestValidateChainAndUpdateForkChoiceWithSideForksThatGoBackAndForwardInHeight) are unchanged — they were failing on the stack before this probe landed and are part of what the divergence work is meant to localise.

Per-write provenance for divergence-detection diagnostics. When a divergence fires (cache hit disagrees with ctx.Branch), we now log which write site produced the cached bytes and when, so we can correlate the bad write against the FCU / build / step timeline. Tag fields added to branchCacheEntry: - origin short label of the write site (e.g. "CollectUpdate", "L3-fallback-read") - writeSeq monotonic counter per BranchCache instance - writeTimeNanos unix nanos at write time Put / PutIfClean signatures take an origin string. Two writers updated: - BranchEncoder.CollectUpdate → "CollectUpdate" - branchFromCacheOrDB L3-fallback Put → "L3-fallback-read" GetWithOrigin returns bytes plus the metadata; uses a non-counting peek so it can be called alongside Get without double-counting hits. The divergence-detection log site at branchFromCacheOrDB now appends cache_origin / cache_seq / cache_t_ns fields. Combine with the existing sd_mem / parent_mem / mdbx fields to localise both who wrote the stale bytes and which layer the canonical value lives in. Pre-existing test failures (TestValidateChainAndUpdateForkChoiceWithSideForksThatGoBackAndForwardInHeight) are unchanged from previous commits.

BranchCache previously sat in front of the sd.mem -> parent.mem -> MDBX read chain (consulted in branchFromCacheOrDB before ctx.Branch). The shared aggregator-scope cache was written from CollectUpdate by every SD running commitment compute, including fork-validator SDs whose writes never reach MDBX. Origin-tagged probe (run-step7b-probe-sdid) showed five distinct SD pointers writing the same prefix to a single cache entry, so any reader whose lineage didn't match the most-recent writer saw bytes that disagreed with MDBX -> wrong trie root from block 13 onward in the canonical SSTORE-bloat fork bench. Layering after this change: Read: sd.mem -> sd.parent.mem -> branchCache -> aggTx (MDBX) Write: sd.mem only (DomainPut path) Flush: sd.mem -> MDBX, then branchCache.Clear() The cache now mirrors MDBX-flushed bytes only. Writers' in-flight bytes live in sd.mem above; cache hits below sd.mem are always equivalent to reading MDBX, so cross-SD pollution is impossible by construction. Cache fills lazily on the MDBX-read path inside sd.GetLatest, and clear-on-flush prevents pre-flush bytes from coexisting with new MDBX state. Per-key invalidation is a follow-up (PR2). BranchCache entries gain a step field so Get returns (data, step, ok) matching the aggTx contract. Without this, sd.GetLatest's cache hit returned step=0 and CheckDataAvailable rejected the boot SeekCommitment with "commitment state out of date". Removed: - cache.Put from CollectUpdate (commitment.go) - cache.Put + divergence detection from branchFromCacheOrDB (hex_patricia_hashed.go); now just calls ctx.Branch - L3-fallback Put (cache fills via sd.GetLatest now) Validated on canonical cold bench (run-step8b): first FCU VALID, all payloads through end VALID, 0 cache divergences, 0 wrong-root errors. Prior probe bench had 23 divergences and INVALID payloads from the fail block onward. Probe scaffolding (SiteIdentity, ProbeStateLayers, divergence counters) left in place for now; can be stripped in a cleanup follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ache state Update the BranchCache type comment to reflect the architectural state after the WarmupCache consolidation (steps 2a-2c, 3, 4): - BranchCache is the single branch cache (WarmupCache deleted) - Aggregator-scope lifetime, plumbed via BranchCacheProvider - Passive store: cache itself never reaches into underlying state - Branch warmer is branch-scoped; no leaf-data prefetch - Block-processing trie walker takes Updates from executor + memoization for siblings — no prefetch needed - Witness / proof generation walker drives its own state reads; if that path turns out to be cold-bound it indicates a need for separate account/storage caches (treat as separate concern with different scope/lifetime/invalidation; do not regrow the branch warmer to cover it) - disk_sto / disk_acc counters on cache-fp log surface any unexpected fall-through to ctx.Account / ctx.Storage as a signal of memoization gap or missing walker-side prefetch Doc-only; no behaviour change.

Adds a third cache tier between the root-pin slot and the LRU tail: a per-prefix pinned map fed by PinEntry. Pinned entries: - Never evict (no LRU pressure on this tier). - Are checked in lookup before the LRU tail, after the root slot. - Survive Put/SD.Flush updates: a Put for a pinned prefix updates the pinned entry in place rather than displacing it. Cross-block correctness via the existing dirty-flag invalidation discipline is preserved — the new bytes land in the same pinned slot. - Carry the same metadata as Put-tier entries (step, origin, writeSeq, writeTimeNanos) so divergence-detection and Stats treat them uniformly. Sized by the preload policy. Intended consumer is the storage trunk preload for big contracts (the 'storage root trunk cache for big accounts' direction): per-contract trunk branches at depth 65-70 get pinned at SD/cache creation, persist for the cache's lifetime. PinEntry is the public API; pinnedHits/pinnedMisses atomic counters are the new stats. PinnedCount() exposes current size for observability. Step 2 of the storage-trunk pin prototype.

Adds a function to pre-pin commitment branches for a given contract's storage subtree. Walks the trie depth-by-depth from depth 64 (storage subtree root) down to maxDepth: reads each branch via the supplied CommitmentReader, pins it via BranchCache.PinEntry, decodes the child bitmap, and recurses only into children that actually exist (no blind 16-way probing). contractHash is keccak256(address); the trunk lives at the prefix corresponding to the first 64 nibbles of the storage path. For a dense storage subtree (the SSTORE-bloat workload's bloat contract), expected pin count is ~16 + 256 + 4096 ≈ 4.4 K branches at depth 65/66/67, plus the root at depth 64. Sparse subtrees produce fewer. Loading strategy: per-prefix lookups via the reader. Simplest correct implementation. A bulk seg.Getter range-scan over sorted .kv would amortize disk seeks (per the parity-cluster observation in the consolidation memo) but requires building a prefix-range API on top of recsplit; defer until per-prefix lookup is shown to bottleneck. Step 3 of the storage-trunk pin prototype.

Adds a SharedDomains constructor hook that, when PIN_CONTRACT_TRUNKS is set, fires a one-shot background goroutine to preload the storage-subtree-trunk of each listed contract into BranchCache's pinned tier. Format: comma-separated list of 64-hex-char contract hashes (each is keccak256(addr)). Mechanism: - BranchCache.TryClaimPreload (atomic CAS) ensures the goroutine fires exactly once per cache lifetime, even though many SDs may be constructed (per-tx instances etc.). - Goroutine wraps sd.GetLatest as a CommitmentReader and calls commitment.PreloadContractTrunk for each contract hash, depth 64-70. - Logs progress per contract on completion. Closure-over-(sd, tx) is the prototype shape — works for the bench (both live for the whole process). Production deployment needs to revisit the lifetime — sd's tx may not outlive the goroutine. Step 4 of the storage-trunk pin prototype. Bench measurement is the next step (commit 5).

Previous async-goroutine shape (d204c1b) shared the SD's MDBX tx with the calling thread. Concurrent cursor use under the same tx tripped Go's cgo-pointer-pinning runtime check: panic: runtime error: cgo argument has Go pointer to unpinned Go pointer surfacing in an unrelated PruneBlocks goroutine during boot. Make the preload synchronous in the SD constructor for now: same TryClaimPreload guard (fires once per cache lifetime), but no goroutine. Boot pays the per-contract preload time as a one-off. Background-with-own-tx is the proper shape and remains a follow-up; owning the SD's tx exclusively for the preload duration is the safe shape until that lands.

… cap The previous bench (run-pin-trunk-instrumented-cold-cgroup-191347) hung at SD construction with no [trunk-preload] log lines for 5+ minutes. Erigon never reached "engine RPC ready" so all blocks came in as SYNCING. Two changes to localise + bound: 1. **Localisation**: add INFO logs at triggerTrunkPreload entry, per-contract starting/done with took, and a 500-prefix progress log inside PreloadContractTrunk. Whatever it does (or hangs on) is now observable. 2. **Bound**: cap PreloadContractTrunk at 10000 branches (vs ~4.4K expected for a saturated 4-level subtree at maxDepth=67). Drops maxDepth from 70 → 67 in the trigger (depth 64-67 = 16+256+4096 max branches) so we don't recurse into the per-slot tail where pinning has no value. Preload fails-fast on pathological subtrees rather than blocking SD construction indefinitely.

The previous shape (d204c1b) ran triggerTrunkPreload BEFORE sd.SeekCommitment in NewSharedDomains. Bench result: when the preload fired (PIN_CONTRACT_TRUNKS set), every subsequent block came back SYNCING — engine kept attempting backward-download which fails on this peerless setup, no block ever validated, no cache-fp ever fired. Without the preload firing, the same binary works fine (verify-bench PASS at 3.26s). Hypothesis (untested but matches the symptom): preload's sd.GetLatest reads ran before SeekCommitment had resolved the SD's view of the chain head. Pinned values were therefore inconsistent with the committed state, and the trie compute on the first block got wrong root → SYNCING → backward-download → no peers → death spiral with no Flush ever updating the (stale) pinned entries. Fix is mechanical: move the preload call to after SeekCommitment. The TryClaimPreload guard still ensures fire-once-per-cache lifetime. If subsequent bench shows pin_count > 0 + pin_hit > 0 + blocks validating normally, the hypothesis is confirmed; if SYNCING repeats, the bug is something else and we need to revert and debug differently.

…ParaTrieDB Previous prototype iterations both broke block validation: 1. Async sharing the SD's MDBX tx (d204c1b) → cgo "unpinned Go pointer" panic from concurrent cursor use. 2. Synchronous from NewSharedDomains (5a81976 / 4c9ead456d) → blocked the engine HTTP handler for ~3-4s during the preload window, causing the bench's first NewPayload to be dropped. Confirmed: the bench's height=24358001 is ABSENT from the erigon log; the next received block (24358002) then fails backward-download (no peers) → SYNCING forever. Restructure: - Move trigger from NewSharedDomains to EnableParaTrieDB. The latter is called from the staged-sync exec-stage init, NOT from request handlers, AND has access to a kv.TemporalRoDB. - triggerTrunkPreload now takes the DB (not a tx) and spawns a goroutine that opens its OWN tx via db.BeginTemporalRo. No shared cursors with the main pipeline; no blocking the engine. - Reader uses tx.GetLatest directly (not sd.GetLatest) — the SD layering would re-introduce shared-state risk and isn't needed (pinned bytes don't depend on sd.mem state). Same TryClaimPreload guard ensures the preload fires once per BranchCache lifetime regardless of how many SDs construct. If this works the bench should: - Show [trunk-preload] log lines firing once - Pin ~4369 branches - TEST block cache-fp shows pin_hit > 0 and files_comm < 1K - All blocks validate normally (no SYNCING failure)

Make the trunk-pin maxDepth configurable via env (default 67) so we can sweep depths to find the memory/perf sweet spot without rebuilding. Bump the per-contract maxBranches cap from 10K to 200K so deeper saturated subtrees don't get truncated mid-walk.

The previous code disabled the Warmuper for the parallel commitment path out of concern that it would interact with the calculator's SetUpdates call. In practice the Warmuper's reads are independent of the calculator's update buffer — they pre-fetch branch data while EVM execution runs, and the calculator's SetUpdates only affects ComputeCommitment's input set, not the warmup paths. Re-enabling produces a measured 8× throughput improvement on the perf-devnet-3 SSTORE-bloated benchmark (block 24358306, the canonical fixture for #20920), restoring the win first observed in Run H/I of the trie-perf investigation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds atomic counters (branch hit/miss/evict + bytes-served, account hit/miss, storage hit/miss) to WarmupCache, plus a Stats() string formatter and ResetStats() for per-Process accumulation reset. Counters are updated on every Get/Evict path (existing Put paths were already counted via cache size). Useful for: - Confirming warmup effectiveness in production logs - Per-block diagnostics when investigating commitment perf - Future per-pool dashboards once a coordinator/observability layer lands (tracked separately) No behavior change beyond the counter updates themselves. Stats() format is one line, suitable for embedding in the existing LogCommitments output. ResetStats() zeros counters without touching cached data — useful for per-Process windowed measurement. Clear() also resets counters along with the data, since data and counters were accumulated together. Test: TestWarmupCache_Stats covers hit/miss/evict accounting across branch/account/storage paths and verifies Stats() format + ResetStats() preserves data. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pure refactor — behavior preserved. Lifts the unfold-loop from HexPatriciaHashed.followAndUpdate into its own method, parameterized over (hashedKey, plainKey) and intended as the per-key traversal primitive that future orchestrators consume. Today only followAndUpdate calls it, replacing the inline loop with a one-line call. The extracted method preserves the existing metric attribution (StartUnfolding) and trace-print behavior verbatim. Why now: this is the second step in the representation-reduction sequence (see agentspecs/trie-data-pipeline-complexity-tax.md). Future PRs will introduce orchestrators that drive unfold-only walks of touched-key paths to fill cell state without going through the full fold/update cycle: - Cache populator (decoded-payload BranchCache) needs to walk a touched-key path and capture the cells encountered, without triggering fold or modifying the trie's update buffer. - Stage E parallel pre-unfold orchestrator drives unfoldKeyPath across multiple HexPatriciaHashed instances concurrently to pre-warm trie state before commit. Both consume the same primitive. Centralising it now means each future orchestrator is a thin wrapper rather than a duplicate of the unfold-loop logic. Tests: full commitment test suite passes without modification (all 8+ test files in execution/commitment/), confirming the refactor preserves followAndUpdate's behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds the carry-as-is correctness invariants from the PR #19954 investigation as scaffolding on the existing WarmupCache: - branchEntry.dirty atomic.Bool — signals "stale until cleared" - PutBranchIfClean(prefix, data) bool — skips write if entry dirty - MarkBranchDirty(prefix) — mark for later refusal of stale puts These are scaffolding additions; no callsites use them yet. Existing PutBranch unconditionally overwrites and clears any prior dirty flag (creates a fresh entry), preserving today's semantic exactly for existing callers. Why now: the prototype investigation (see agentspecs/commitment-cache-prototype-dev-context.md) found that inline-invalidate-on-write is incompatible with deferred encoding — update-in-place breaks correctness because there's a window between fold (computes hash, holds new state) and encoder (writes encoded bytes) where readers see stale cached bytes. The reth-research (agentspecs/reth-1ggas-research.md §4) calls the dirty-flag pattern out as the design that resolves this without forcing synchronous encoding: the encoder marks the entry dirty BEFORE its own write completes, so any racing read knows to bypass the cache for that key. Today's WarmupCache lifecycle (per-Process, warmup completes before fold begins) does NOT exhibit this race — these invariants are infrastructure for the future cross-block persistence work where warmup-style writers can outlive their parent Process. Tests: - TestWarmupCache_DirtyFlag: PutBranchIfClean refuses dirty entry, unconditional PutBranch clears dirty. - TestWarmupCache_DirtyFlag_MarkAbsentKey: marking absent key is no-op (no panic, no entry created). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds an additive read method that returns cached branches in already- decoded form, lazy-decoding on first decoded-read per entry and caching the parsed cells for subsequent reads. No existing callsite changes; existing GetBranch / GetAndEvictBranch / PutBranch callers continue to work with encoded bytes unchanged. Why now: this is step 5 of the representation-reduction sequence (see agentspecs/trie-data-pipeline-complexity-tax.md). The trie's read path currently does GetBranch (encoded) + DecodeBranchInto on every cache hit — paying decode CPU on every read. Switching that callsite to GetBranchDecoded (in a separate later commit) eliminates the redundant decode. The ENCODED form remains the source of truth — the encoder needs it for the merge-with-prev step, and it's what gets written to disk via PutBranch. The decoded form is derived lazily and cached alongside the entry. When PutBranch overwrites an entry, the new entry starts fresh and the next decoded read re-derives from the new bytes. API design notes: - Returns (bitmap, *[16]cell, ok). Caller derives touchMap/afterMap from bitmap based on its own deleted-vs-present-after context — the cache stores cells independent of that context so the same entry serves both readers. - The returned *[16]cell aliases entry-owned storage. Read-only consumption is safe across concurrent calls (decode runs at most once per entry via sync.Once); MUST NOT be modified in place. - Decode error → ok=false (don't count as hit OR miss; caller falls through to canonical re-read). Tests: - TestWarmupCache_GetBranchDecoded: round-trip equality with direct DecodeBranchInto, plus same-pointer reuse on repeat reads. - TestWarmupCache_GetBranchDecoded_Miss: behaves like GetBranch on absent keys. - TestWarmupCache_GetBranchDecoded_TruncatedData: graceful failure on corrupt entry (no panic). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ng payloads Closes a timing hole that surfaced once we wired the aggregator-scope BranchCache: between an FCU completing and the next newPayload, MDBX hasn't been committed yet (RunLoop's CommitCycle only fires under memory pressure), but currentContext.mem holds the latest writes from MergeExtendingFork. The fresh doms created in ValidateChain has no parent and a fresh roTx, so its ctx.Branch reads stale-MDBX while the aggregator-scope BranchCache (populated by the prior FV's CollectUpdate writes) holds the fresh state. That's the cross-newPayload divergence pattern observed in the bench (8-61 divergences and wrong-trie-root errors at block 3-14 across runs). Set doms.SetParent(currentContext) when the new payload extends the current canonical head (header.ParentHash == ReadHeadBlockHash). For fork payloads that don't extend head, leave parent unset: unwindToCommonCanonical below reverts doms's view to the common ancestor, and exposing currentContext.mem (post-divergence canonical writes) via the parent chain would shadow the unwound base and break fork validation. Verified by TestReorgsWithInsertChain — the "head-only" predicate is what the current single-canonical-chain SD topology supports. A proper per-branch SD lineage (each fork's validation chains to the last validated SD on its own branch, not always currentContext) is the follow-up needed for concurrent multi-fork validation. The current design supports a single canonical chain only; that's enough to close the divergence we have today, with the lineage extension tracked separately.

Foundation for the "Snapshot vs MDBX read-cost equivalence" investigation (memory: snapshot-vs-mdbx-performance-equivalence.md). This file produces the headline ratio that quantifies the gap the investigation aims to close: warm-cache reads from snapshot .kv files should cost the same as warm-cache reads from MDBX (same disk, same page cache). H0 measures how far apart they are today. Five sub-benches: - MDBX_path full chain, key in MDBX - File_path full chain, key in file - Forced_file_path file-only debug path, file-resident keys - Forced_db_path DB-only debug path, MDBX-resident keys - Bloom_miss_path file-only debug path, MDBX-resident keys (file misses in xorfilter for every probe) Two operating modes; only synthetic is wired in this commit: - Synthetic (testDbAndAggregatorBench fixture): writes 64 full 16-tx steps, BuildFiles + repeated PruneSmallBatches drains all but the tip step into files. Phase 2 keys at txNums past the built-step boundary stay in MDBX. Partition by *actual* residency after setup so bench inputs match where keys really live. - Real-datadir (--snapdatadir flag): TODO. Opens an existing chaindata+snapshots datadir read-only and picks keys via cursor iteration / .kv decompressor walk. Required for production- relevant numbers since synthetic has tiny files and small values. Initial synthetic results on AMD EPYC 4244P (Accounts domain): MDBX_path 173 ns/op File_path 226 ns/op (1.31x MDBX) Forced_file_path 30 ns/op Forced_db_path 158 ns/op Bloom_miss_path 30 ns/op Synthetic dataset is too small to surface the production gap that pprof shows (xorfilter at 35% CPU on real bloat workload). H1-H4 benches and the real-datadir mode are the next steps. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds the --snapdatadir flag path to the H0 bench. Opens an existing chaindata + snapshots datadir read-only via the same recipe as cmd/integration (mdbx Accede + state.New + temporal.New) and picks keys by cursor-walking the per-domain values table. Pragmatic adjustments: - On heavily-pruned production datadirs (perf-devnet-3-run was 100% pruned), every MDBX values-table row is step-shadowed by a file, so getLatestFromDb returns ok=false. The MDBX-side sub-benches skip in this case; File_path numbers stand on their own and the synthetic MDBX_path baseline serves as the cross- mode comparator. - skipIfEmpty short-circuits per sub-bench rather than failing the whole run, so we can still get the file-path numbers. First production numbers (AMD EPYC 4244P, AccountsDomain, 2012 file-resident keys from perf-devnet-3-run, fully pruned): File_path 211 ns/op (synthetic was 226; essentially same) Forced_file_path 30 ns/op (synthetic was 30; identical) Surprising finding: real .kv file reads cost the same as synthetic. This means production bloat-workload bottleneck is NOT in getLatestFromFiles — it must be in HistorySeek (.ef history files walked by HistoryStateReader.GetAsOf). The GetAsOf shortcut work flagged in getasof-regression-suspect.md is the right lead. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two related additions to the snapshot-vs-MDBX perf-equivalence investigation (memory: snapshot-vs-mdbx-performance-equivalence.md): 1. H_GetAsOf bench (db/state/snapshot_vs_mdbx_bench_test.go). New runHGetAsOf with four sub-benches for HistorySeek-via-GetAsOf on file-resident keys: GetLatest_baseline, GetAsOf_recent (asOf near endTxNum), GetAsOf_mid (asOf at endTxNum/2), GetAsOf_floor (asOf=1). Tests the path the calculator's HistoryStateReader.Read uses, which is distinct from getLatestFromFiles measured in H0. Real-datadir results on perf-devnet-3 (AccountsDomain, endTxNum=2.9B): GetLatest_baseline 202 ns/op 0 allocs GetAsOf_recent 570 ns/op 0 allocs <- 2.8x baseline, no result GetAsOf_mid 235 ns/op 5 allocs GetAsOf_floor 196 ns/op 4 allocs GetAsOf_recent (the calculator's pattern after PR #21010) scans the .ef looking for a record at-or-after endTxNum-1, finds none (most keys haven't changed in the last txNum), falls through to GetLatest. The 370ns/op overhead vs GetLatest is wasted scan. Confirms the GetAsOf shortcut described in getasof-regression-suspect.md as a real lever, though small in absolute terms (~2ms/block on the bloat workload). 2. Surface "took" + "keys" on the existing [commitment][cache-fp] Info log line (commitmentdb). Was already computed in the debug-level "[commitment] processed" log, but the bench runs with --log.dir.disable so debug logs aren't captured. This made it possible to attribute the 4.3s gap inside newPayload(TEST block) directly: the calculator's ComputeCommitment takes 4220ms for the 5910-key bloat block — 91% of the entire block wall time. Per-key cost is ~700us, consistent across blocks of all sizes. The actual perf lever for the bloat workload is making per-branch ComputeCommitment cheaper, not file/state reads. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds three groups of fields to the existing [commitment][cache-fp] log line so the calculator's per-block behaviour is observable at Info level (the bench runs with --log.dir.disable, so debug logs aren't captured): - took, keys: ComputeCommitment wall + key-count for the block. Pre-existing internally, now surfaced. - load, skipped, reset: process-cumulative counts of computeCellHash decisions: * load = had no memoized stateHash, fetched value from DB * skipped = had memoized stateHash, reused without fetch * reset = had stateHash but had to invalidate it Surfaced via new commitment.SkipLoadResetCounters(). - files_acc / files_sto / files_code / files_comm: per-domain file-read counts pulled from sd.Metrics().Domains[domain]. Decomposes the aggregate `files=N` from the [domain reads] log line into its actual sources (e.g. on the SSTORE-bloated block the 32k file reads break down as 5.9k Storage value loads + 26.6k Commitment branch reads + a handful of others). All counters are cumulative; per-block deltas are obtained by subtracting consecutive cache-fp lines. Pure observability — no behaviour change. Used as the measurement framework for the snapshot-vs-MDBX perf-equivalence investigation and the follow-on commits that target specific levers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The WithFlushCallback path updated the downstream BranchCache before flushLocked wrote to MDBX. On a flush that fails and is not retried (e.g. a background post-forkchoice flush that logs the error and continues) that left the aggregator-lifetime cache permanently ahead of MDBX, so a later read served branch bytes that never reached disk. Run the callbacks after flushLocked succeeds instead. The reorder is safe because flushLocked flushes the domain writers, not sd.domains (the in-memory history map), so the mem batch still holds this batch's values during the write→callback window — a reader (sd buffer → cache → MDBX) finds them in mem or MDBX, never missing from all sources. Everything stays under latestStateLock, so no concurrent DomainPut interleaves. If flushLocked errors, the cache is simply not updated for the batch. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ters The sstoreInsert/Update/Delete/Noop and hasStorageMiss package-global atomics were incremented on the commitment/exec hot paths but their getters (SstoreClassificationCounts, HasStorageMissCount) have no consumer anywhere in the stack (#21380, #21386, the perfviz view) — write-only prototype perf-debug scaffolding. The canonical metrics framework (kvmetrics, #21663) is in main and covers KV reads, not these. Remove the counters, their Record* funcs/getters, and the call sites; this also drops execution/state's only import of execution/commitment. If SSTORE classification is wanted in production later, express it via the kvmetrics collector rather than bespoke globals. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

mh0lt · 2026-06-11T00:30:32Z

CI is green on the latest push. Summary of what's been addressed since the last review:

Split (the blocker + most majors) — the trunk-pin/adaptive subsystem is extracted to mh/branch-cache-trunk-pin; this PR is now the consolidation core only. That removed the immortal-pin wrong-root blocker, the Clear()/c.pinned race, the adaptive-default-on concern, the controller/cache lifetime mismatch, both swallowed-cursor-error sites, the PublishMetrics underflow, and the uncancellable preload goroutine.

Remaining review items, now done:

SKIP_EIP684_HASPREFIX correctness-broken scaffold removed.
GetDecoded (+ the [16]cell footprint), the dirty-flag protocol (PutIfClean/MarkDirty), and the divergence/verify scaffold moved out to the trunk-pin branch (no production consumers here; also drops the per-Put time.Now()/writeSeq overhead and a state→commitment import).
Flush callback reordered to run after the MDBX write, so a failed/un-retried flush can't leave the BranchCache ahead of MDBX.
Dead sstore* / hasStorageMiss counters removed (write-only, no consumers anywhere in the stack). skipLoadReset kept — it feeds a live skip/reset-ratio log line.
EnableWarmupCache (write-only) removed; reset_stages.go got the requested log.Warn on the silent HasAgg fallthrough + a defer on the aggregator tx; dead ensureStorage + the warmuper.go pass-through removed; the stale agentspecs/"Cleared by sd.Flush"/txN=0-Put comments fixed.
Interface debt (MeteredGetter / MeteredGetterWithTxN / GetLatest) tracked in Unify the temporal kv GetLatest interface — eliminate accreted duck-typed metered-getter variants #21739 for a dedicated follow-up (references db/state/kvmetrics: process-level channel-fed KV-read metrics collector #21663's ctx-carried approach).

Re-requesting review, thanks.

taratorio · 2026-06-11T03:16:58Z

@mh0lt as per #21380 (comment) please provide some numbers from A/B testing this on mainnet to verify this doesn't introduce performance regressions since it touches warmup.

Re-adds the pinned-cache subsystem on top of the clean BranchCache core (#21380): per-prefix pinned tier (PinEntry/PinnedCount), trunk preload (preload*.go), adaptive pin controller (adaptive_pin.go), trunk_pin_metrics, and the SharedDomains preload wiring. Reconciled onto the post-cleanup core: the lazy-decode read path (GetDecoded), dirty-flag write discipline (PutIfClean/MarkDirty), and the divergence verify counter were dropped as dead here — they are not used by the pin tier and will be re-added by the immortal-pin re-implementation when wired. GetWithOrigin and the per-entry write provenance (origin/writeSeq/writeTimeNanos) are retained for that re-implementation's divergence localization. BranchCache.Put carries the origin label. Carries the known immortal-pin wrong-root blocker; not to be merged as-is.

WithFlushCallback / FlushConfig.DomainCallbacks deliver the value's per-key write txNum alongside the step, and SharedDomains.Flush stamps the BranchCache entry with it instead of the batch high-water sd.txNum. This makes flush-time cache population tx-precise: an unwind to a txNum inside the latest step invalidates exactly the entries above it, not the whole step. The txNum was dropped from the callback during the FlushOption tidy as "not required" here, but the stacked state-cache PR needs it — the state/code caches key their (txNum, epoch) unwind invalidation on this value (#21752). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…perf-statecache-lru-pr Reconcile #21386 onto the cleaned commitment-cache model from #21380. Resolution: - branch_cache.go (+test), temporal_mem_batch.go, kv_interface.go -> take #21380's cleaned model: reduced BranchCache API, txN-watermark UnwindTo, FlushOption pattern, flush-callback-after-MDBX-write. - preload*.go, trunk_pin_test.go -> deleted (trunk-pin extracted to mh/branch-cache-trunk-pin). - domain_shared.go -> combine: keep #21386's StateCache refresh, keccak code-cache fix, codeHash->code read bypass, per-worker kvmetrics (wm) and the collector/reqMetrics fields; adopt #21380's cleaned FlushOption multi-domain callback (cb-after-MDBX-write), MeteredGetterWithTxN watermark + txN=0 skip; drop the extracted adaptive-pin controller + PublishMetrics. Notable behavioural deltas (flagged for review): - BranchCache unwind: #21386's epoch/unwindFloor model -> #21380's txN-watermark UnwindTo (same effect: evict entries above the unwind point). - StateCache flush entries now stamped with sd.txNum (batch high-water) as the unwind watermark: the cleaned WithFlushCallback exposes step, not per-key txN; sd.txNum is a safe conservative upper bound. - temporal_mem_batch.go DomainMetrics refs retargeted changeset -> kvmetrics. Carries: keccak codeHash fix, #21706 CodePath recovery, stateCache. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Byte-match #21380's WithFlushCallback txNum doc so re-syncing this stacked branch onto #21380 attributes the kv plumbing cleanly to #21380 (#21752). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Port the resident fixed-array trunk onto the #21380+#21386 cache stack: an accountTrunk holding account-trie branches at nibble depths 1-4 in dense arrays (d1[16]/d2[256]/d3[4096]/d4[65536]) indexed directly by the compact-hex prefix, plus a per-contract storageTrunk (depths 0-3 + deep overflow) in a pinned map keyed by account hash. Each slot is an atomic.Pointer, so trunk reads/writes take no mutex and don't serialize through a shared lock the way the LRU tail and a storage trunk's deep overflow map do. Trunk and storage-trunk entries flow through the shared lookup/store/Invalidate walk, so Get's (txN, epoch) staleness check covers them unchanged. Adds PinEntry/PinnedCount and a SetMissCallback seam for the residency/adaptive layer (added separately). BRANCH_CACHE_TRUNK_DISABLE routes depths 1-4 back to the tail for A/B.

mh0lt · 2026-06-11T21:33:49Z

@mh0lt as per #21380 (comment) please provide some numbers from A/B testing this on mainnet to verify this doesn't introduce performance regressions since it touches warmup.

State-cache benchmark — bench-n1 vs bench-n2

Setup

n1 = 3.6.0-dev-6d1324b8, USE_STATE_CACHE on ([state-cache] logged 8,381×); restarted 06-10 19:39.
n2 = 3.6.0-dev-c2269cc0, no state cache ([state-cache] logged 0×); uptime 2.7 days.
Hosts identical: 16 core / 64 GB / /dev/md3 3.5 T (datadir ~448 G) / beacon-chain-v7 CL.
Config: mainnet, --externalcl, --prune.mode=full, debug verbosity.
Metric: per-block execution= from head updated, joined on block number (hashes identical both nodes).
Window: last 24 h, 7,183 joined blocks (25288988..25296171). State-cache hit-rate: flat 14–16% throughout.
Phases keyed to time since the 06-10 19:39 restart.

Phase totals

Phase	Since restart	Blocks	n2 mean	n1 mean	Δ mean	Δ median	n1 win%
Start	0 – ~11 min	69	150.8	159.7	+9.0 ms	+3 ms	40.6%
Warmup	~11 min – ~14 h	4,234	152.1	154.5	+2.4 ms	−1 ms	50.6%
Steady state	~14 h onward	2,695	151.5	133.6	−18.0 ms	−13 ms	70.5%
Pooled 24 h	—	7,183	151.5	146.0	−5.4 ms	−5 ms	58.5%

Excluded: 185 blocks before the 19:39 restart (n1 on old pre-cache binary).

Hourly timeline

Hour	Blocks	n2 mean	n1 mean	Δ	n1 win%	cache hit%
06-10 18	7	108.3	126.3	+18.0	0%	—
06-10 19	297	140.4	134.5	−5.9	68%	14.5
06-10 20	300	148.3	135.3	−13.0	69%	14.6
06-10 21	300	144.0	146.3	+2.3	66%	15.0
06-10 22	300	136.9	147.0	+10.1	47%	14.7
06-10 23	299	151.7	144.6	−7.1	63%	14.1
06-11 00	296	137.6	158.5	+20.8	34%	14.7
06-11 01	300	152.9	157.8	+4.9	39%	14.4
06-11 02	298	156.1	164.3	+8.2	26%	14.2
06-11 03	298	155.7	154.9	−0.8	53%	14.1
06-11 04	300	192.6	194.3	+1.7	42%	13.8
06-11 05	299	155.7	187.7	+32.0	13%	14.0
06-11 06	297	148.1	147.5	−0.6	51%	14.0
06-11 07	299	155.8	141.8	−14.0	86%	14.4
06-11 08	298	137.1	131.2	−6.0	54%	14.7
06-11 09	300	159.7	157.8	−1.9	60%	14.8
06-11 10	299	150.9	141.6	−9.3	61%	15.2
06-11 11	300	151.3	141.3	−10.0	63%	14.8
06-11 12	297	151.1	123.2	−27.9	98%	14.8
06-11 13	300	182.4	129.8	−52.6	97%	14.8
06-11 14	299	141.7	147.1	+5.4	24%	15.1
06-11 15	300	140.8	127.1	−13.7	82%	15.0
06-11 16	298	156.5	128.4	−28.1	89%	15.5
06-11 17	299	158.8	130.0	−28.8	86%	15.9
06-11 18	298	130.5	133.6	+3.1	34%	15.0
06-11 19	5	132.4	132.0	−0.4	40%	15.5

Start phase — n1 per-minute read latency & execution

Minute	dbdur	fdur	exec
19:39	111.9 µs	120.4 µs	134.5 ms
19:40	137.5 µs	165.6 µs	146.0 ms
19:41	150.8 µs	132.4 µs	137.4 ms
19:42	104.6 µs	192.6 µs	174.8 ms
19:43	58.8 µs	230.6 µs	175.4 ms
19:44	63.4 µs	198.2 µs	141.0 ms
19:45	78.4 µs	470.6 µs	290.8 ms
19:46	52.2 µs	71.4 µs	150.6 ms
19:47	62.6 µs	80.8 µs	99.6 ms
19:48	72.8 µs	92.6 µs	200.4 ms
19:49	66.8 µs	82.2 µs	156.6 ms
19:50	64.0 µs	79.6 µs	139.6 ms

Warmup phase — commitment time, heavy blocks (≥2k keys), per hour

Hour (06-11)	Since restart	n1 (cache)	n2 (baseline)	n1 − n2
00	~4.3 h	87.8 ms	55.1 ms	+32.7
01	~5.3 h	77.9 ms	76.3 ms	+1.6
02	~6.3 h	90.1 ms	84.9 ms	+5.2
03	~7.3 h	78.9 ms	81.7 ms	−2.8
04	~8.3 h	103.4 ms	102.5 ms	+0.9
05	~9.3 h	110.6 ms	86.6 ms	+24.0
06	~10.3 h	85.1 ms	85.9 ms	−0.8
07	~11.3 h	76.1 ms	85.7 ms	−9.6
08	~12.3 h	73.4 ms	64.8 ms	+8.6
09	~13.3 h	95.7 ms	82.6 ms	+13.1
10	~14.3 h	63.8 ms	66.3 ms	−2.5
11	~15.3 h	70.5 ms	70.3 ms	+0.2
12	~16.3 h	55.8 ms	71.1 ms	−15.3
13	~17.3 h	59.3 ms	87.3 ms	−28.0
14	~18.3 h	69.8 ms	57.4 ms	+12.4
15	~18.8 h	54.4 ms	56.4 ms	−2.0
16	~20.3 h	61.0 ms	75.0 ms	−14.0
17	~21.3 h	59.8 ms	74.1 ms	−14.3
18	~22.3 h	58.1 ms	54.6 ms	+3.5
19	~23.3 h	54.6 ms	51.2 ms	+3.4

Steady phase — 20 largest n1-faster blocks

Block	n2	n1	Δ	mgas
25295241	439 ms	121 ms	−318 ms	28.6
25294735	374 ms	104 ms	−270 ms	20.7
25294215	414 ms	160 ms	−254 ms	43.5
25294466	490 ms	267 ms	−223 ms	47.1
25294467	427 ms	221 ms	−206 ms	52.9
25294686	458 ms	255 ms	−203 ms	31.0
25294431	377 ms	176 ms	−201 ms	44.6
25294531	369 ms	180 ms	−189 ms	49.7
25294562	334 ms	151 ms	−183 ms	48.5
25294395	283 ms	108 ms	−175 ms	28.2
25294528	402 ms	237 ms	−165 ms	59.8
25294524	461 ms	297 ms	−164 ms	33.8
25295734	316 ms	162 ms	−154 ms	37.2
25294432	358 ms	205 ms	−153 ms	49.0
25295672	359 ms	210 ms	−149 ms	48.8
25293761	406 ms	259 ms	−147 ms	39.4
25294558	301 ms	156 ms	−145 ms	39.0
25294041	382 ms	238 ms	−144 ms	59.6
25294596	388 ms	245 ms	−143 ms	54.5
25294549	267 ms	126 ms	−141 ms	36.1

Steady phase — 20 largest n1-slower blocks

Block	n2	n1	Δ	mgas
25293861	249 ms	444 ms	+195 ms	59.6
25293640	249 ms	385 ms	+136 ms	55.9
25294746	202 ms	322 ms	+120 ms	51.1
25293888	210 ms	319 ms	+109 ms	55.4
25294771	185 ms	282 ms	+97 ms	52.8
25294974	206 ms	298 ms	+92 ms	60.0
25295105	246 ms	338 ms	+92 ms	56.5
25294649	247 ms	328 ms	+81 ms	56.8
25294784	170 ms	246 ms	+76 ms	48.6
25294763	226 ms	295 ms	+69 ms	52.9
25294779	227 ms	295 ms	+68 ms	49.2
25293862	137 ms	204 ms	+67 ms	36.0
25294758	230 ms	295 ms	+65 ms	45.1
25295921	129 ms	192 ms	+63 ms	26.3
25293903	148 ms	209 ms	+61 ms	35.9
25296094	246 ms	303 ms	+57 ms	53.4
25293544	93 ms	147 ms	+54 ms	28.1
25293867	143 ms	197 ms	+54 ms	40.2
25294749	152 ms	206 ms	+54 ms	31.6
25294747	178 ms	231 ms	+53 ms	38.1

taratorio · 2026-06-12T03:22:59Z

@mh0lt as per #21380 (comment) please provide some numbers from A/B testing this on mainnet to verify this doesn't introduce performance regressions since it touches warmup.

State-cache benchmark — bench-n1 vs bench-n2

Setup

n1 = 3.6.0-dev-6d1324b8, USE_STATE_CACHE on ([state-cache] logged 8,381×); restarted 06-10 19:39.

n2 = 3.6.0-dev-c2269cc0, no state cache ([state-cache] logged 0×); uptime 2.7 days.

Hosts identical: 16 core / 64 GB / /dev/md3 3.5 T (datadir ~448 G) / beacon-chain-v7 CL.

Config: mainnet, --externalcl, --prune.mode=full, debug verbosity.

Metric: per-block execution= from head updated, joined on block number (hashes identical both nodes).

Window: last 24 h, 7,183 joined blocks (25288988..25296171). State-cache hit-rate: flat 14–16% throughout.

Phases keyed to time since the 06-10 19:39 restart.

Phase totals

Phase Since restart Blocks n2 mean n1 mean Δ mean Δ median n1 win%
Start 0 – ~11 min 69 150.8 159.7 +9.0 ms +3 ms 40.6%
Warmup ~11 min – ~14 h 4,234 152.1 154.5 +2.4 ms −1 ms 50.6%
Steady state ~14 h onward 2,695 151.5 133.6 −18.0 ms −13 ms 70.5%
Pooled 24 h — 7,183 151.5 146.0 −5.4 ms −5 ms 58.5%
Excluded: 185 blocks before the 19:39 restart (n1 on old pre-cache binary).

Hourly timeline

Hour Blocks n2 mean n1 mean Δ n1 win% cache hit%
06-10 18 7 108.3 126.3 +18.0 0% —
06-10 19 297 140.4 134.5 −5.9 68% 14.5
06-10 20 300 148.3 135.3 −13.0 69% 14.6
06-10 21 300 144.0 146.3 +2.3 66% 15.0
06-10 22 300 136.9 147.0 +10.1 47% 14.7
06-10 23 299 151.7 144.6 −7.1 63% 14.1
06-11 00 296 137.6 158.5 +20.8 34% 14.7
06-11 01 300 152.9 157.8 +4.9 39% 14.4
06-11 02 298 156.1 164.3 +8.2 26% 14.2
06-11 03 298 155.7 154.9 −0.8 53% 14.1
06-11 04 300 192.6 194.3 +1.7 42% 13.8
06-11 05 299 155.7 187.7 +32.0 13% 14.0
06-11 06 297 148.1 147.5 −0.6 51% 14.0
06-11 07 299 155.8 141.8 −14.0 86% 14.4
06-11 08 298 137.1 131.2 −6.0 54% 14.7
06-11 09 300 159.7 157.8 −1.9 60% 14.8
06-11 10 299 150.9 141.6 −9.3 61% 15.2
06-11 11 300 151.3 141.3 −10.0 63% 14.8
06-11 12 297 151.1 123.2 −27.9 98% 14.8
06-11 13 300 182.4 129.8 −52.6 97% 14.8
06-11 14 299 141.7 147.1 +5.4 24% 15.1
06-11 15 300 140.8 127.1 −13.7 82% 15.0
06-11 16 298 156.5 128.4 −28.1 89% 15.5
06-11 17 299 158.8 130.0 −28.8 86% 15.9
06-11 18 298 130.5 133.6 +3.1 34% 15.0
06-11 19 5 132.4 132.0 −0.4 40% 15.5

Start phase — n1 per-minute read latency & execution

Minute dbdur fdur exec
19:39 111.9 µs 120.4 µs 134.5 ms
19:40 137.5 µs 165.6 µs 146.0 ms
19:41 150.8 µs 132.4 µs 137.4 ms
19:42 104.6 µs 192.6 µs 174.8 ms
19:43 58.8 µs 230.6 µs 175.4 ms
19:44 63.4 µs 198.2 µs 141.0 ms
19:45 78.4 µs 470.6 µs 290.8 ms
19:46 52.2 µs 71.4 µs 150.6 ms
19:47 62.6 µs 80.8 µs 99.6 ms
19:48 72.8 µs 92.6 µs 200.4 ms
19:49 66.8 µs 82.2 µs 156.6 ms
19:50 64.0 µs 79.6 µs 139.6 ms

Warmup phase — commitment time, heavy blocks (≥2k keys), per hour

Hour (06-11) Since restart n1 (cache) n2 (baseline) n1 − n2
00 ~4.3 h 87.8 ms 55.1 ms +32.7
01 ~5.3 h 77.9 ms 76.3 ms +1.6
02 ~6.3 h 90.1 ms 84.9 ms +5.2
03 ~7.3 h 78.9 ms 81.7 ms −2.8
04 ~8.3 h 103.4 ms 102.5 ms +0.9
05 ~9.3 h 110.6 ms 86.6 ms +24.0
06 ~10.3 h 85.1 ms 85.9 ms −0.8
07 ~11.3 h 76.1 ms 85.7 ms −9.6
08 ~12.3 h 73.4 ms 64.8 ms +8.6
09 ~13.3 h 95.7 ms 82.6 ms +13.1
10 ~14.3 h 63.8 ms 66.3 ms −2.5
11 ~15.3 h 70.5 ms 70.3 ms +0.2
12 ~16.3 h 55.8 ms 71.1 ms −15.3
13 ~17.3 h 59.3 ms 87.3 ms −28.0
14 ~18.3 h 69.8 ms 57.4 ms +12.4
15 ~18.8 h 54.4 ms 56.4 ms −2.0
16 ~20.3 h 61.0 ms 75.0 ms −14.0
17 ~21.3 h 59.8 ms 74.1 ms −14.3
18 ~22.3 h 58.1 ms 54.6 ms +3.5
19 ~23.3 h 54.6 ms 51.2 ms +3.4

Steady phase — 20 largest n1-faster blocks

Block n2 n1 Δ mgas
25295241 439 ms 121 ms −318 ms 28.6
25294735 374 ms 104 ms −270 ms 20.7
25294215 414 ms 160 ms −254 ms 43.5
25294466 490 ms 267 ms −223 ms 47.1
25294467 427 ms 221 ms −206 ms 52.9
25294686 458 ms 255 ms −203 ms 31.0
25294431 377 ms 176 ms −201 ms 44.6
25294531 369 ms 180 ms −189 ms 49.7
25294562 334 ms 151 ms −183 ms 48.5
25294395 283 ms 108 ms −175 ms 28.2
25294528 402 ms 237 ms −165 ms 59.8
25294524 461 ms 297 ms −164 ms 33.8
25295734 316 ms 162 ms −154 ms 37.2
25294432 358 ms 205 ms −153 ms 49.0
25295672 359 ms 210 ms −149 ms 48.8
25293761 406 ms 259 ms −147 ms 39.4
25294558 301 ms 156 ms −145 ms 39.0
25294041 382 ms 238 ms −144 ms 59.6
25294596 388 ms 245 ms −143 ms 54.5
25294549 267 ms 126 ms −141 ms 36.1

Steady phase — 20 largest n1-slower blocks

Block n2 n1 Δ mgas
25293861 249 ms 444 ms +195 ms 59.6
25293640 249 ms 385 ms +136 ms 55.9
25294746 202 ms 322 ms +120 ms 51.1
25293888 210 ms 319 ms +109 ms 55.4
25294771 185 ms 282 ms +97 ms 52.8
25294974 206 ms 298 ms +92 ms 60.0
25295105 246 ms 338 ms +92 ms 56.5
25294649 247 ms 328 ms +81 ms 56.8
25294784 170 ms 246 ms +76 ms 48.6
25294763 226 ms 295 ms +69 ms 52.9
25294779 227 ms 295 ms +68 ms 49.2
25293862 137 ms 204 ms +67 ms 36.0
25294758 230 ms 295 ms +65 ms 45.1
25295921 129 ms 192 ms +63 ms 26.3
25293903 148 ms 209 ms +61 ms 35.9
25296094 246 ms 303 ms +57 ms 53.4
25293544 93 ms 147 ms +54 ms 28.1
25293867 143 ms 197 ms +54 ms 40.2
25294749 152 ms 206 ms +54 ms 31.6
25294747 178 ms 231 ms +53 ms 38.1

looks like mixed results, cache off was much quicker on some blocks, cache on was much quicker on other blocks.
the hourly mean times are nearly identical
likely the poor 14-16% cache hit explains this
overall not seeing any exciting gains

Copilot

Pull request overview

Copilot reviewed 32 out of 32 changed files in this pull request and generated 4 comments.

+			if bc := aggTx.BranchCache(); bc != nil {
+				bc.Clear()
+			}
+			branchCacheCleared = true


+func (dm *DomainMetrics) UpdateFileReadsUnique(domain kv.Domain, key []byte, start time.Time) {
+	// Composite "<domain>:<key>" so two domains can hold the same prefix
+	// shape (commitment compact-encoded paths vs accounts plain addresses)
+	// without colliding.
+	domainKey := domain.String() + ":" + string(key)
+	_, alreadySeen := dm.seenFileReads.LoadOrStore(domainKey, struct{}{})
+	bucket := lenBucket(len(key))
+


+// SetBranchCache attaches a BranchCache for branch read-through and
+// write-through. Also propagates to the trie's BranchEncoder so encoder
+// writes update the cache via mark-dirty-then-Put (see CollectUpdate).
+//
+// nil disables — both this trie's branchFromCacheOrDB Level-2 and the
+// encoder's cache write-back become no-ops.
+func (hph *HexPatriciaHashed) SetBranchCache(c *BranchCache) {


yperbasis

Reviewed at head 65810b14f3. The extraction resolved the prior blocker and most majors; core, new tests, and CI all check out. Requesting changes on what's left.

Blocking

Commit-failure cache poisoning — the unresolved half of the flush-window major. The flush callback now runs after the MDBX write, but not after commit. If sd.Flush succeeds (cache updated, ClearRam follows) and rwTx.Commit() then fails, the aggregator-lifetime cache is permanently ahead of MDBX with no masking layer left; the bg path logs-and-continues (forkchoice.go:685-688). The retry then unfolds against poisoned branches and computes wrong roots until restart — pre-PR, a failed bg commit recovered with a cold cache. Same shape in the foreground commit, SetHead, and the exec3 stage-loop. Containable fix: on any error after a successful Flush, clear the cache (or stash the tuples and apply them post-commit); runForkchoiceFlushCommit covers both FCU paths.
Dead diagnostics surface (divergence-detector leftovers, all zero callers): ProbeReadLayers/ProbeStateLayers/SiteIdentity, plus Metrics() and ProbeReadLayers forced into the commitmentdb sd interface (commitment_context.go:46-50); SkipLoadResetCounters (hex_patricia_hashed.go:1763-1784); ContainsHashStats; WarmerBranchOutcomeStats. Delete. Related: the two unconditional atomics on every ExistenceFilter.ContainsHash process-wide (existence_filter.go:88-115) are still ungated, unlike the other new metrics — gate by dbg.KVReadLevelledMetrics or drop.
PR body no longer describes the PR. It still advertises the extracted trunk-preload pipeline (pinned tier, adaptive controller, wave-BFS, MDBX-resident prefetch), the divergence detector, and GetWithOrigin; "BranchCache cleared on SetHead" is actually watermark eviction via sd.Unwind; and the "Generated with Claude Code" footer is forbidden repo-wide. Also: TestFromZero_GenesisAllocPreservedAfterResetReExec passes on current main — the ResetExec clear keeps this PR's own cache safe across reset rather than fixing a live main bug; the body should say so.

Should fix

Comment policy (CLAUDE.md hard limits): the ~110-line BranchCache header (references the add_execution_context_with_caches branch), the ValidateChain "DO NOT CHANGE…" essay (exec_module.go:502-536) whose cherry-pick note references e.latestGen() which doesn't exist on this branch, the in-source PR reference "#21752" in kv_interface.go, and "step 2b of WarmupCache deletion" narration in hex_patricia_hashed.go. Factually stale: Invalidate's doc still contrasts MarkDirty/PutIfClean, both removed (branch_cache.go:283-284).
seenFileReads is unbounded (state_changeset.go:509-515): process-cumulative sync.Map, one entry per unique prefix ever file-read, never cleared, plus two string allocs per file read — a slow leak whenever dbg.KVReadLevelledMetrics is enabled on a long-running node. Cap it or document the trade-off at the flag.
No kill-switch for the BranchCache itself — the "on and completely trustworthy, or off to measure" thesis currently only holds for USE_STATE_CACHE. A dbg env to disable the BranchCache would serve both A/B measurement and emergency mitigation.

Minor

The acknowledged UnwindTo Put-vs-scan race ("the next UnwindTo will catch it") is unreachable today only via engine-semaphore serialization plus mem masking — add a test or assertion-mode check so a future concurrency change doesn't silently break it. BranchCache.Get returns the cache-owned slice uncopied — document the no-mutate contract. Two log.Debug lines fire on every SetStateCache call.

…drop dead diagnostics Resolves the open review on PR #21380. Correctness (commit-failure cache poisoning): the aggregator-lifetime caches must only ever reflect committed state. SharedDomains.Commit now owns the flush+commit and applies the flushed commitment branches to the BranchCache ONLY after the commit succeeds (flush is implicit in committing). A failed commit can no longer leave a cache ahead of MDBX — by construction, with no caller-side "clear on error" discipline. Plain Flush no longer touches the caches (callers that own their own commit get a cold-but-correct result). The exec-module commit sites (forkchoice x3, set_head, exec3 commit-cycle) call sd.Commit. The caches are documented as an internal detail of SharedDomains. One switch: the BranchCache is gated on USE_STATE_CACHE (it is a state cache), so one operator switch disables all caching — rather than a separate env. Dead diagnostics removed (zero callers): SkipLoadResetCounters + its disk-load counters, ProbeReadLayers/ProbeStateLayers/SiteIdentity (+ probe fields and the sd-interface methods), ContainsHashStats, WarmerBranchOutcomeStats, and the ungated ExistenceFilter.ContainsHash atomics. Comments: trimmed the BranchCache header, the ValidateChain and changesetMu essays, the WarmupCache-deletion narration; removed the in-source #21752 ref and the stale MarkDirty/PutIfClean reference; documented Get's no-mutate contract and the seenFileReads unbounded-growth trade-off. Dropped the per-call SetStateCache debug logs and the reset-stages warning. Added a BranchCache late-Put/UnwindTo invariant test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… both caches) Pulls forward the minimal StateCache consistency guarantee from #21386 so this PR is logically consistent across both aggregator-lifetime caches. The StateCache write path (domainPut/DomainDel) now INVALIDATES the cache entry instead of storing the written value. The value already lives in sd.mem (the write path's local copy), so storing it in the cache both double-stored it and placed an uncommitted value into a long-lived cache — which a failed commit would leave ahead of MDBX (the same poisoning class fixed for the BranchCache). The cache now holds only committed state; reads repopulate from committed files. Consequently the ClearWithHash-on-invalid-block call is removed: an invalid block (and fork validation, which never commits) only invalidates entries, never stores wrong ones, so there is nothing to clear. #21386 will, on rebase, add warm post-commit repopulation back under its txNum/epoch model and remove the now-redundant block-hash machinery (ValidateAndPrepare/ClearWithHash). Reorg invalidation (RevertWithDiffset) and the read-through populate are unchanged here. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

mh0lt · 2026-06-12T10:55:58Z

Thanks — addressed all of it. Summary of the changes:

Blocking

Commit-failure cache poisoning — fixed by construction rather than by clearing on error. SharedDomains.Commit now owns the flush+commit: it flushes the mem batch into the tx, commits, and only then applies the flushed commitment branches to the BranchCache. A failed commit applies nothing, so the cache can never sit ahead of MDBX — no caller-side "clear on error" discipline, no window. Plain Flush no longer touches the cache at all (callers that own their own commit get a cold-but-correct result). The FCU paths, SetHead, and the exec3 commit-cycle all call sd.Commit.
While here I also closed the StateCache half of the same hazard (it was populated pre-commit in domainPut): the write path now invalidates the entry instead of storing it (the value already lives in sd.mem), so both caches hold only committed state. This pulls a minimal slice of StateCache LRU + Mode rework (PR #2 of the perf stack) #21386's consistency work forward to keep PR Pull from go-ethereum up to 26aea73 (7 Feb 2019) #1 logically consistent across both caches — see the PR description; StateCache LRU + Mode rework (PR #2 of the perf stack) #21386 re-adds warm repopulation under its txNum/epoch model on rebase.
Dead diagnostics — deleted: SkipLoadResetCounters (+ its disk-load counters), ProbeReadLayers/ProbeStateLayers/SiteIdentity (+ the probe fields and the two sd-interface methods), ContainsHashStats, WarmerBranchOutcomeStats. The two unconditional ExistenceFilter.ContainsHash atomics are gone too.
PR body — rewritten to describe what's actually here (no trunk-preload/adaptive/divergence-detector/GetWithOrigin); the "Generated with Claude Code" footer is removed; the reset test is described as keeping this PR's cache safe across reset rather than fixing a live main bug; "SetHead clears the cache" reworded to watermark eviction via sd.Unwind. Also marked as PR Pull from go-ethereum up to 26aea73 (7 Feb 2019) #1 of 3 and linked the follow-ups (consistency/perf Pull from go-ethereum up to 2f24e25 (6 Mar 2019) #2, pinning Pull go-ethereum up to 26aea736 (7 Feb 2019) #3, and the GetLatest-unification follow-up Unify the temporal kv GetLatest interface — eliminate accreted duck-typed metered-getter variants #21739).

Should-fix

Comment policy: trimmed the BranchCache header, the ValidateChain and changesetMu essays, the WarmupCache-deletion narration; removed the in-source #21752 reference and the stale MarkDirty/PutIfClean doc.
seenFileReads: documented the bounded-growth trade-off at the field (intentionally not capped — a cap would corrupt UniqueFileReadCount; only accumulates under KVReadLevelledMetrics, a short-lived diagnostic toggle).
Kill-switch — done differently, flagging for your OK. Rather than a separate BranchCache env, the BranchCache rides the existing USE_STATE_CACHE toggle: it is a state cache, and one switch that disables all caching is the right operation when bisecting a state-root mismatch — an operator shouldn't have to reason about the interaction of several independent cache switches. Happy to add a dedicated env instead if you'd prefer.

Minor

Added a BranchCache late-Put/UnwindTo invariant test; documented Get's no-mutate contract; dropped the two per-call SetStateCache debug logs (and the reset-stages warning, which was also masking the Copilot-flagged nil-cache case).

…ning for both caches)" This reverts commit 2415c13.

yperbasis

Reviewed the delta since this morning's review (65810b14f3..2415c13772). The two pushes are genuinely responsive — blocking items 1–3 and should-fixes 4–6 from the morning review are all structurally addressed — but the new commit-gating work introduces two regressions, both proven red/green, one of which is already failing CI's serial matrix at head. The body's "behaves identically across parallel and serial exec — confirmed in CI" no longer holds: serial is red.

Blocking

1. The StateCache invalidate-on-write commit (`2415c13772`) breaks serial execution — 4 deterministic CI failures

tests / tests-mac-linux (macos-15, serial) and race-tests / tests-linux (execution-other, serial) both fail, deterministically (the mac job failed twice):

TestValidateForkPayloadOffNonTipCanonicalBlockWithCache
TestUpdateForkChoiceRecoversWhenStateAheadOfTxNums
TestUpdateForkChoiceForwardExecutesAfterStateAheadRecovery
TestAssembleBlockGasOverflow

Local repro:

ERIGON_EXEC3_PARALLEL=false go test ./execution/execmodule/ -run 'TestValidateForkPayloadOffNonTipCanonicalBlockWithCache' -count=1

Surfacing the swallowed ValidationError shows the root cause — validating canonical block 2 (not even the fork part of the test) fails with:

[5/5 Execution] invalid block, txnIdx=0, nonce too high: address 0xD46FbdBa…, tx: 1 state: 0

i.e. the sender reads pre-block-1 state through the cache path.

Single-hunk bisect: restoring stateCache.Put(domain, k, v) in domainPut — keeping everything else in the commit, including the ClearWithHash removal — makes all four pass. With USE_STATE_CACHE=false both heads pass, so this is a stale cache hit, not a missing warm value.

Mechanism: the read-through populate (domain_shared.go:861-863) is unguarded. Pre-change it was coherent only because write-time Puts always overwrote any stale read-through entry. With invalidate-on-write, any reader holding an older view that reads a key after the write deleted it re-populates the pre-block value — permanently. That is exactly the consistency problem #21386's txNum/epoch model is scoped to solve; the "minimal slice" pulled forward is not self-consistent without the guard. Recommendation: drop the StateCache half from this PR (revert the domainPut hunk and the ClearWithHash removal — both belong in #21386), or pull the epoch guard forward too.

2. Plain `Flush` + self-commit now leaves the BranchCache stale (`f4c82e8f34`)

Red/green repro — fails at head, passes at 65810b14f3, where the flush callback still refreshed the cache per-key:

TestBranchCacheStaleAfterPlainFlushCommit (drop into db/state/execctx/)

package execctx_test

import (
	"testing"

	"github.com/stretchr/testify/require"

	"github.com/erigontech/erigon/common/log/v3"
	"github.com/erigontech/erigon/db/kv"
	"github.com/erigontech/erigon/db/state/execctx"
)

func TestBranchCacheStaleAfterPlainFlushCommit(t *testing.T) {
	stepSize := uint64(100)
	db := newTestDb(t, stepSize)
	ctx := t.Context()
	logger := log.New()

	key := []byte{0x0a, 0x0b}

	// Window 1: write K=v1, plain Flush, self-commit (cache untouched, empty).
	rwTx, err := db.BeginTemporalRw(ctx)
	require.NoError(t, err)
	defer rwTx.Rollback()
	sd, err := execctx.NewSharedDomains(ctx, rwTx, logger)
	require.NoError(t, err)
	defer sd.Close()

	err = sd.DomainPut(kv.CommitmentDomain, rwTx, key, []byte("v1-branch-bytes"), 1, nil)
	require.NoError(t, err)
	require.NoError(t, sd.Flush(ctx, rwTx))
	sd.ClearRam(true)
	require.NoError(t, rwTx.Commit())
	sd.Close()

	// Window 2: read K (read-through populates the BranchCache with v1),
	// then overwrite K=v2, plain Flush, self-commit.
	rwTx, err = db.BeginTemporalRw(ctx)
	require.NoError(t, err)
	defer rwTx.Rollback()
	sd, err = execctx.NewSharedDomains(ctx, rwTx, logger)
	require.NoError(t, err)
	defer sd.Close()

	v, _, err := sd.GetLatest(kv.CommitmentDomain, rwTx, key)
	require.NoError(t, err)
	require.Equal(t, []byte("v1-branch-bytes"), v)

	err = sd.DomainPut(kv.CommitmentDomain, rwTx, key, []byte("v2-branch-bytes"), 2, nil)
	require.NoError(t, err)
	require.NoError(t, sd.Flush(ctx, rwTx))
	sd.ClearRam(true)
	require.NoError(t, rwTx.Commit())
	sd.Close()

	// Window 3: fresh SD, fresh tx. MDBX holds v2; a coherent cache must not
	// serve v1.
	roTx, err := db.BeginTemporalRw(ctx)
	require.NoError(t, err)
	defer roTx.Rollback()
	sd, err = execctx.NewSharedDomains(ctx, roTx, logger)
	require.NoError(t, err)
	defer sd.Close()

	v, _, err = sd.GetLatest(kv.CommitmentDomain, roTx, key)
	require.NoError(t, err)
	require.Equal(t, []byte("v2-branch-bytes"), v,
		"BranchCache served a stale pre-flush value after a plain Flush + self-commit")
}

Sequence: write K=v1 → Flush → self-commit; read K (read-through populates the cache with v1); write K=v2 → Flush → self-commit; a fresh SD then reads v1 from the cache while MDBX holds v2.

Affected callers — all flush commitment writes, self-commit, and keep computing in the same process, with the cache wired by the USE_STATE_CACHE=true default:

rebuild_commitment_history (db/state/squeeze.go:533) — and RebuildStateReader explicitly routes commitment reads through LatestStateReader → the BranchCache, so the rebuild's own reads hit the stale entries
cmd/integration stage_exec chain-tip loop (cmd/integration/commands/stages.go:761) and batch loop (stages.go:807)
cmd/integration state_stages (cmd/integration/commands/state_stages.go:286)

The integration tool — the thing used to bisect state-root mismatches — now self-corrupts its commitment reads after the first flush window. CI green doesn't cover this: EEST/hive exercise the execmodule sd.Commit paths only.

Root asymmetry: the StateCache got write-time invalidation in domainPut; the BranchCache got nothing at write or flush time — its coherence now exists only via sd.Commit (5 execmodule sites). Containable fix that keeps the no-poisoning property: flushMem always invalidates flushed commitment keys via the existing callback (invalidation is always safe — worst case cold), and Commit additionally stashes and re-Puts post-commit (warm). Every caller then gets the promised "cold-but-correct" for real. Also fix the now-false comment at domain_shared.go:833-838 ("Refreshed per-key by sd.Flush … a cache hit never coexists with a newer MDBX state" — the repro disproves both halves).

Should fix

No test calls sd.Commit. The body claims unit coverage for "commit-gated population"; it doesn't exist anywhere in the tree. The new semantics (cache advances only after a successful commit; failed commit leaves it unchanged; Flush leaves it alone) are this round's centerpiece and deserve red/green coverage — the repro above can seed the Flush side.
sd.Commit would silently mis-handle a mem-backed tx. MemoryMutation.Commit "succeeds" without durability, so passing a BlockOverlay would populate the cache with never-committed state — the exact hazard the design declares impossible "by construction". All 5 current call sites pass real MDBX rw-txs; add a cheap runtime guard rejecting overlay-backed txs (prefer code that enforces the invariant over a doc note).
Stale doc on runForkchoiceFlushCommit (forkchoice.go:813-822): still says the roTx is "released between Flush and Commit … closing it after Flush is safe", but the code now closes it before sd.Commit. The early close is functionally fine (flushMem writes mem + pending updates via the new rwTx; flushPendingUpdates passes prev-values so nothing reads the old roTx) — but this comment documents the safety argument and must match the code.
The posted A/B numbers describe a different build — collected at 6d1324b8, before the trunk-pin extraction, the commit-gating, and invalidate-on-write (which changes the StateCache hit profile: write-hot keys go cold until re-read). The body already promises a re-run after today's changes — agreed, and it should gate the merge, especially since fixing the two blockers above will touch this code again.

Verified resolved from the morning review

Commit-failure poisoning: the sd.Commit stash-then-apply design is sound for its call sites, and all node-runtime commit paths (FCU fg/bg, TooFarAway, hasMore CommitCycle, SetHead, ProcessFrozenBlocks) route through it; no tx writes occur after Commit in any of them.
Dead diagnostics: all listed symbols gone; everything builds.
PR body: rewritten accurately (footer gone, reset-test framing fixed) — modulo the two now-false claims flagged above ("identical across serial/parallel", "commit-gated population" coverage).
Comment policy: BranchCache header is ~14 lines; the ValidateChain/changesetMu essays trimmed; the in-source #21752 ref and stale MarkDirty/PutIfClean doc gone; Get's no-mutate contract and seenFileReads trade-off documented; SetStateCache debug logs dropped; the late-Put/UnwindTo invariant test added.
The Clear() race is gone with the pinned-tier extraction (in-place root.Store(nil) + tail.Purge()).
Kill-switch via USE_STATE_CACHE — accepted. Wiring at aggregator.go:353 is correct, and one switch for all caches is a reasonable call. Dropping ClearWithHash-on-invalid is also sound in isolation (the ValidateAndPrepare hash-mismatch self-heals with a full clear) — but it ships in the same commit as blocking item 1, so it goes back with it unless split out.

Verified locally: go build on all affected packages; commitment + execctx suites green; the serial repro/bisect and the cache-off counterfactuals as above.

mh0lt · 2026-06-12T12:13:58Z

Known issue: StateCache fork-validation correctness is incidental (fixed by #21386)

While hardening this PR we established that the StateCache's correctness on the fork-validation / unwind path is incidental — it passes by luck, and this is true on both this branch and main. We are flagging it here as a known issue, resolved by #21386, and are deliberately not merging the stack at this stage so the two PRs stay independently reviewable and the brittleness stays explicit rather than being papered over by a merge.

What we proved

All experiments use the four fork-validation/unwind tests:
TestValidateForkPayloadOffNonTipCanonicalBlockWithCache, TestUpdateForkChoiceRecoversWhenStateAheadOfTxNums, TestUpdateForkChoiceForwardExecutesAfterStateAheadRecovery, TestAssembleBlockGasOverflow.

1. The underlying read path is correct — the cache is not load-bearing for correctness.
With USE_STATE_CACHE=false (cache fully off) all four tests pass, in both serial and parallel exec. So mem → parent.mem → files always yields the right answer; the cache is a pure optimisation layered on top.

2. The pass is fragile to the write strategy.
Switching the write path from the current write-through (domainPut → stateCache.Put) to the more principled invalidate-on-write (domainPut → stateCache.Delete, "cache holds committed state only") makes all four tests fail with BadBlock. Restoring the invalid-block ClearWithHash cleanup does not help (isolated via a 2×2) — the breakage is purely the Put → Delete write strategy, not the invalid-block cleanup.

3. ASSERT_STATE_CACHE shows both strategies serve stale values.
ASSERT_STATE_CACHE panics whenever a cache hit disagrees with the authoritative committed value. It is a valid staleness detector here because the cache is consulted only after mem and parent.mem miss — so a hit that differs from committed state is cross-batch/pre-unwind leakage, not legitimate in-flight state. Both write strategies diverge:

Put    (current, test PASSES): accounts key=d3db8cce… txNum=1  cached=0101…(nonce 1, advanced)  db=00…(nonce 0, base)
Delete (test FAILS, BadBlock): accounts key=92097af3… txNum=5  cached=00…(nonce 0, base)        db=0101…(nonce 1, advanced)

The cache is stale in both cases. Put leaves a stale-ahead entry whose read at txNum=1 happens not to flip the resulting trie root → the test passes. Delete leaves a stale-behind entry whose read at txNum=5 does flip the root → BadBlock. Which one "passes" is luck, not correctness.

Root cause

The StateCache (introduced in #18868, unchanged on main) is module-scope, read-through-populated, and has no unwind invalidation — its only coherence machinery is hash-based (ClearWithHash / RevertWithDiffset / ValidateAndPrepare). Fork validation does an unwind-to-common-ancestor on the domain state, but the shared module cache is not unwound by txNum, so tip-era entries survive into the fork re-execution. Put masks this by continuously rewriting entries to the re-exec's own values; Delete removes them and lets read-through repopulate a mid-unwind snapshot. Neither is correct by construction. main carries the byte-identical cache model (same methods, same read-through Put, no txNum unwind); the …WithCache tests are new in this stack, so main was simply never exercised on this path, but the unwind-incoherence is structurally present there too.

Fixed by #21386

#21386 reworks the cache to a txNum/epoch model: populate at flush (committed-only) and invalidate by txNum watermark on unwind (epoch disambiguates a reused txNum across forks). On #21386 the same four tests pass, and under ASSERT_STATE_CACHE there is no divergence — verified with a witnessed cache hit (accounts/99b523a1… txNum=1 via the commitment-unfold read path), so it is genuine coherence across the unwind, not a zero-hit pass.

Decision

Tracked as a known issue on this PR (and main), resolved by StateCache LRU + Mode rework (PR #2 of the perf stack) #21386.
Not merging State Cache Consolidation (PR #1 of the perf stack) #21380 into StateCache LRU + Mode rework (PR #2 of the perf stack) #21386 at this stage — the PRs stay independent and the brittleness stays documented here.

yperbasis

CI is red

…onto #21380) Brings #21380's review-fix commit (f4c82e8: trimmed comments, dropped debug logs) onto #21386, plus the net-zero invalidate-on-write commit + its revert. Conflict resolutions: - db/state/execctx/domain_shared.go: reconciled the two commit-gating models into one. flushMem/Flush stay plain (cold-but-correct). Commit now stashes the flush-callback tuples for ALL caches (CommitmentDomain->BranchCache, Accounts/Storage/Code->StateCache) during flush and applies them only after tx.Commit() succeeds — extending #21380's by-construction commit-gating to the StateCache so no cache can be advanced past durable MDBX on a failed commit. Restored ProbeReadLayers (dropped by auto-merge). - db/state/changeset/state_changeset.go: kept #21386's relocation of the metrics types to db/state/kvmetrics; dropped the stale duplicate DomainIOMetrics block. - execution/commitment/branch_cache_test.go: kept #21386's Unwind-API test; dropped #21380's UnwindTo-API tests (that API was renamed/superseded in #21386). - execution/commitment/commitmentdb/commitment_context.go: kept #21386's sd interface additions (ProbeReadLayers, Metrics, kvmetrics import). - db/state/execctx/flush_storage_cache_test.go: rewrote Flush->Commit since the caches are now commit-gated (Flush no longer warms them). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

mh0lt · 2026-06-12T15:56:05Z

@yperbasis the red CI was not a code failure — it was a GitHub runner running out of disk space. The eest-spec-blocktests-stable-race-pre-cancun-sequential job crashed mid-run with:

System.IO.IOException: No space left on device : '/home/runner/actions-runner/cached/.../Worker_....log'

ci-gate then went red only because it aggregates required jobs (one required job had failed). I re-ran the failed job and it passed (15m49s), and ci-gate is now green — CI is fully green (no failing or pending checks; the earlier failure was infra, not this PR).

All review items are addressed (taratorio's approved; Copilot's 4 comments handled). Could you re-review / clear the change request when you have a moment? 🙏

Mark Holt and others added 30 commits May 21, 2026 21:35

mh0lt and others added 2 commits June 10, 2026 23:57

mh0lt mentioned this pull request Jun 11, 2026

Unify the temporal kv GetLatest interface — eliminate accreted duck-typed metered-getter variants #21739

Open

mh0lt requested a review from yperbasis June 11, 2026 00:30

Merge remote-tracking branch 'origin/main' into mh/perf-caches-pr

e5b40cf

mh0lt mentioned this pull request Jun 11, 2026

kv/state caches: tidy FlushOption + make cache consistency under unwind (txNum, epoch)-based without full iteration #21752

Open

taratorio approved these changes Jun 12, 2026

View reviewed changes

yperbasis requested a review from Copilot June 12, 2026 07:53

Copilot started reviewing on behalf of yperbasis June 12, 2026 07:54 View session

Copilot AI reviewed Jun 12, 2026

View reviewed changes

yperbasis requested changes Jun 12, 2026

View reviewed changes

mh0lt and others added 2 commits June 12, 2026 10:32

Revert "db/state, execution: commit-gate the StateCache too (no poiso…

f8d5acc

…ning for both caches)" This reverts commit 2415c13.

yperbasis requested changes Jun 12, 2026

View reviewed changes

mh0lt requested a review from yperbasis June 12, 2026 15:56

Conversation

mh0lt commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR contains

BranchCache — single aggregator-scope commitment cache

One switch for all caches

The BranchCache reflects only committed state — by construction

BUG #21138 — parallel-exec from-0 wrong trie root

Follow-ups (the rest of the stack)

Testing

Uh oh!

mh0lt commented Jun 11, 2026

Uh oh!

taratorio commented Jun 11, 2026

Uh oh!

mh0lt commented Jun 11, 2026

State-cache benchmark — bench-n1 vs bench-n2

Setup

Phase totals

Hourly timeline

Start phase — n1 per-minute read latency & execution

Warmup phase — commitment time, heavy blocks (≥2k keys), per hour

Steady phase — 20 largest n1-faster blocks

Steady phase — 20 largest n1-slower blocks

Uh oh!

taratorio commented Jun 12, 2026

State-cache benchmark — bench-n1 vs bench-n2

Setup

Phase totals

Hourly timeline

Start phase — n1 per-minute read latency & execution

Warmup phase — commitment time, heavy blocks (≥2k keys), per hour

Steady phase — 20 largest n1-faster blocks

Steady phase — 20 largest n1-slower blocks

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

yperbasis left a comment

Choose a reason for hiding this comment

Blocking

Should fix

Minor

Uh oh!

mh0lt commented Jun 12, 2026

Uh oh!

yperbasis left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Blocking

1. The StateCache invalidate-on-write commit (2415c13772) breaks serial execution — 4 deterministic CI failures

2. Plain Flush + self-commit now leaves the BranchCache stale (f4c82e8f34)

Should fix

Verified resolved from the morning review

Uh oh!

mh0lt commented Jun 12, 2026

Known issue: StateCache fork-validation correctness is incidental (fixed by #21386)

What we proved

Root cause

Fixed by #21386

Decision

Uh oh!

yperbasis left a comment

Choose a reason for hiding this comment

Uh oh!

mh0lt commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

mh0lt commented May 23, 2026 •

edited

Loading

`BranchCache` — single aggregator-scope commitment cache

yperbasis left a comment •

edited

Loading

1. The StateCache invalidate-on-write commit (`2415c13772`) breaks serial execution — 4 deterministic CI failures

2. Plain `Flush` + self-commit now leaves the BranchCache stale (`f4c82e8f34`)