Skip to content

[r3.4] execution/cache: fix StateCache dropping deleted storage keys, causing stale reads#20271

Merged
yperbasis merged 1 commit into
release/3.4from
cherry-pick-20265
Apr 1, 2026
Merged

[r3.4] execution/cache: fix StateCache dropping deleted storage keys, causing stale reads#20271
yperbasis merged 1 commit into
release/3.4from
cherry-pick-20265

Conversation

@yperbasis

Copy link
Copy Markdown
Member

Cherry-pick of #20265 to release/3.4.

Summary

Fix a bug in StateCache.Get()/Put() that caused deleted storage keys to be treated as cache misses, allowing stale pre-delete values to be read from the DB.

Root cause of #20169 (Erigon nodes stuck on Hoodi).

🤖 Generated with Claude Code

…g stale reads (#20265)

## Summary

Fix a bug in `StateCache.Get()`/`Put()` that caused deleted storage keys
to be treated as cache misses, allowing stale pre-delete values to be
read from the DB.

**Root cause of #20169** (Erigon nodes stuck on Hoodi).

## The bug

`StateCache` is an in-memory cache in `SharedDomains` that caches
storage values between transactions in the same batch. When a storage
slot is deleted (written to zero via `DomainDel`):

1. `DomainDel` → `stateCache.Delete(key)` — removes key from cache
2. `GetLatest` reads mem batch → gets `nil` → `stateCache.Put(key, nil)`
— but `Put` sees `len(nil)==0` → calls `cache.Delete(key)` again
(removes from cache!)
3. Next tx reads → `stateCache.Get(key)` → key not found → returns
`(nil, false)` — **cache miss**
4. Falls through to DB → returns the old non-zero value that's still in
MDBX

## Impact

On Hoodi, block 2523978 tx[21] cleared 5 storage slots at contract
`0x1d150609ee9edcc6143506ba55a4faaedd562cd9`. The clearing was lost in
the cache, so those slots retained stale non-zero values. 1,525 blocks
later, block 2525503 tx[25] hit those slots: 5 SSTOREs cost 91,100 less
gas than expected (warm/existing writes instead of cold/new slot
creation), causing a gas mismatch that stuck the node.

## Fix

- **`Get`**: Remove the `len(v)==0 → false` short-circuit so that cached
empty values are returned as `([]byte{}, true)` (cache hit with empty
value), not `(nil, false)` (cache miss).
- **`Put`**: When `len(value)==0`, store `[]byte{}` sentinel instead of
deleting the key, so the "deleted" state is preserved in cache.

## Why release/3.3 is not affected

release/3.3's `SharedDomains` does not have `stateCache`. All reads go
directly to the mem batch → DB, bypassing the buggy cache layer.

## Test plan

- [x] `go test ./execution/cache/...` passes
- [x] Re-execution from clean snapshots on Hoodi produces correct state
(block 2525503 passes with `diff=0`)
- [ ] Full Hoodi sync from scratch with this fix

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@yperbasis yperbasis requested a review from mh0lt as a code owner April 1, 2026 14:14
@yperbasis yperbasis added this to the 3.4.0 milestone Apr 1, 2026
@yperbasis yperbasis enabled auto-merge (squash) April 1, 2026 14:15
@yperbasis yperbasis merged commit c247ad3 into release/3.4 Apr 1, 2026
22 checks passed
@yperbasis yperbasis deleted the cherry-pick-20265 branch April 1, 2026 14:25
@AskAlexSharov AskAlexSharov requested a review from Copilot April 6, 2026 00:35

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Cherry-pick to release/3.4 fixing a correctness bug in the SharedDomains in-memory StateCache where deleted (empty) storage values were treated as cache misses, allowing stale pre-delete values to be read from DB and potentially causing execution/gas mismatches (e.g., #20169).

Changes:

  • Fix StateCache.Get() to return cached empty values as cache hits instead of forcing a miss.
  • Fix StateCache.Put() to preserve empty/nil values in cache (instead of translating them into deletes).
  • Add regression tests validating Put(nil) / Put([]byte{}) + Get() semantics for deleted storage keys.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
execution/cache/state_cache.go Adjusts StateCache Get/Put behavior so empty values remain cached and detectable as hits.
execution/cache/generic_cache.go Updates GenericCache Get to no longer treat “zero-sized” values as cache misses.
execution/cache/cache_test.go Adds focused unit regression tests for deleted-key caching semantics.
execution/tests/blockchain_test.go Adds an additional regression test that exercises StateCache deleted-key hit behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +2362 to +2386
// Tests the SharedDomains StateCache directly, reproducing the exact
// sequence that caused Hoodi nodes to get stuck:
//
// 1. DomainPut writes key=0x42 (populates stateCache with 0x42)
// 2. Flush + ClearRam (persists to DB, clears mem buffer)
// 3. DomainDel deletes the key (stateCache.Delete removes entry)
// 4. Flush + ClearRam (persists deletion to DB, clears mem buffer)
// 5. GetLatest reads the key — should return nil (deleted)
//
// With the bug, step 3's DomainDel calls stateCache.Delete (removing
// the entry) but step 4's Flush path calls GetLatest which reads nil
// from mem buffer and does stateCache.Put(nil) → which ALSO deletes
// from cache. After ClearRam, the mem buffer is empty and the stateCache
// has no entry. Step 5's GetLatest falls through to the DB, which
// returns nil (correct, since Flush persisted the deletion).
//
// However, if the stateCache was populated between step 2 and step 3
// by a GetLatest (which caches 0x42 from DB), and the DomainDel's
// stateCache.Delete removes it, and then stateCache.Put(nil) removes
// the deletion sentinel, the cache has NO entry for the key. If the
// DB read in step 5 somehow returns the stale value (e.g. due to
// transaction isolation), the stale value is returned.
//
// This test exercises the StateCache at the cache level to verify
// that Put(nil) correctly records the deletion as a cache hit.

Copilot AI Apr 6, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regression test comment describes a Flush/ClearRam/DB scenario, but the test itself only exercises StateCache.Put/Delete/Get and doesn't model SharedDomains.Flush/ClearRam behavior. Consider rewriting this doc to reflect the actual reproduction this test performs (Delete + Put(nil) must remain a cache hit) so future readers don't assume the test covers DB flush/isolation semantics.

Suggested change
// Tests the SharedDomains StateCache directly, reproducing the exact
// sequence that caused Hoodi nodes to get stuck:
//
// 1. DomainPut writes key=0x42 (populates stateCache with 0x42)
// 2. Flush + ClearRam (persists to DB, clears mem buffer)
// 3. DomainDel deletes the key (stateCache.Delete removes entry)
// 4. Flush + ClearRam (persists deletion to DB, clears mem buffer)
// 5. GetLatest reads the key — should return nil (deleted)
//
// With the bug, step 3's DomainDel calls stateCache.Delete (removing
// the entry) but step 4's Flush path calls GetLatest which reads nil
// from mem buffer and does stateCache.Put(nil) → which ALSO deletes
// from cache. After ClearRam, the mem buffer is empty and the stateCache
// has no entry. Step 5's GetLatest falls through to the DB, which
// returns nil (correct, since Flush persisted the deletion).
//
// However, if the stateCache was populated between step 2 and step 3
// by a GetLatest (which caches 0x42 from DB), and the DomainDel's
// stateCache.Delete removes it, and then stateCache.Put(nil) removes
// the deletion sentinel, the cache has NO entry for the key. If the
// DB read in step 5 somehow returns the stale value (e.g. due to
// transaction isolation), the stale value is returned.
//
// This test exercises the StateCache at the cache level to verify
// that Put(nil) correctly records the deletion as a cache hit.
// The production bug involved SharedDomains, but this test intentionally
// covers only the cache-level behavior that made the bug possible.
//
// Reproduced sequence in StateCache:
// 1. Put a value for a storage key.
// 2. Delete the key.
// 3. Put(nil) for the same key to represent a cached deletion.
// 4. Get must still be a cache hit and return the deleted marker
// as (nil, true), rather than a cache miss as (nil, false).
//
// The critical invariant is that Delete followed by Put(nil) must leave
// a cached tombstone. If Get reports a miss instead, callers may fall
// through to lower layers and observe stale data.

Copilot uses AI. Check for mistakes.
//
// This test exercises the StateCache at the cache level to verify
// that Put(nil) correctly records the deletion as a cache hit.
func TestStateCacheDeletedStorageSSTOREGas(t *testing.T) {

Copilot AI Apr 6, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test name mentions "SSTOREGas", but the assertions only validate cache hit semantics for deleted storage keys. Renaming to something that matches the behavior under test (e.g., Delete+Put(nil) then Get is hit) would make failures easier to interpret.

Suggested change
func TestStateCacheDeletedStorageSSTOREGas(t *testing.T) {
func TestStateCacheDeletedStorageDeletePutNilGetHit(t *testing.T) {

Copilot uses AI. Check for mistakes.
Comment on lines +2390 to +2394
c := cache.NewStateCache(100, 100, 100, 100, 100)

key := make([]byte, 52) // addr(20) + slot(32)
key[0] = 0x1d
key[51] = 0xa2

Copilot AI Apr 6, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test duplicates the new StateCache Put(nil)/Get regression coverage already added in execution/cache/cache_test.go. To keep execution/tests focused on blockchain-level behavior (and avoid redundant maintenance), consider keeping the regression only in the cache package tests or moving this test alongside the other StateCache unit tests.

Copilot uses AI. Check for mistakes.
Comment on lines 83 to 89
if cache == nil {
return
}
if domain == kv.CommitmentDomain && bytes.Equal(key, commitmentdb.KeyCommitmentState) {
return
}
if len(value) == 0 {
cache.Delete(key)
return
}
cache.Put(key, common.Copy(value))

Copilot AI Apr 6, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CommitmentDomain special-case here appears unreachable because NewStateCache/NewDefaultStateCache don't initialize a commitment cache (CommitmentDomain is commented out), so Put() returns early for that domain. Consider either removing this guard or enabling the commitment cache to avoid dead code/confusing intent.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants