[r3.4] execution/cache: fix StateCache dropping deleted storage keys, causing stale reads#20271
Conversation
…g stale reads (#20265) ## Summary Fix a bug in `StateCache.Get()`/`Put()` that caused deleted storage keys to be treated as cache misses, allowing stale pre-delete values to be read from the DB. **Root cause of #20169** (Erigon nodes stuck on Hoodi). ## The bug `StateCache` is an in-memory cache in `SharedDomains` that caches storage values between transactions in the same batch. When a storage slot is deleted (written to zero via `DomainDel`): 1. `DomainDel` → `stateCache.Delete(key)` — removes key from cache 2. `GetLatest` reads mem batch → gets `nil` → `stateCache.Put(key, nil)` — but `Put` sees `len(nil)==0` → calls `cache.Delete(key)` again (removes from cache!) 3. Next tx reads → `stateCache.Get(key)` → key not found → returns `(nil, false)` — **cache miss** 4. Falls through to DB → returns the old non-zero value that's still in MDBX ## Impact On Hoodi, block 2523978 tx[21] cleared 5 storage slots at contract `0x1d150609ee9edcc6143506ba55a4faaedd562cd9`. The clearing was lost in the cache, so those slots retained stale non-zero values. 1,525 blocks later, block 2525503 tx[25] hit those slots: 5 SSTOREs cost 91,100 less gas than expected (warm/existing writes instead of cold/new slot creation), causing a gas mismatch that stuck the node. ## Fix - **`Get`**: Remove the `len(v)==0 → false` short-circuit so that cached empty values are returned as `([]byte{}, true)` (cache hit with empty value), not `(nil, false)` (cache miss). - **`Put`**: When `len(value)==0`, store `[]byte{}` sentinel instead of deleting the key, so the "deleted" state is preserved in cache. ## Why release/3.3 is not affected release/3.3's `SharedDomains` does not have `stateCache`. All reads go directly to the mem batch → DB, bypassing the buggy cache layer. ## Test plan - [x] `go test ./execution/cache/...` passes - [x] Re-execution from clean snapshots on Hoodi produces correct state (block 2525503 passes with `diff=0`) - [ ] Full Hoodi sync from scratch with this fix 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Cherry-pick to release/3.4 fixing a correctness bug in the SharedDomains in-memory StateCache where deleted (empty) storage values were treated as cache misses, allowing stale pre-delete values to be read from DB and potentially causing execution/gas mismatches (e.g., #20169).
Changes:
- Fix
StateCache.Get()to return cached empty values as cache hits instead of forcing a miss. - Fix
StateCache.Put()to preserve empty/nil values in cache (instead of translating them into deletes). - Add regression tests validating
Put(nil)/Put([]byte{})+Get()semantics for deleted storage keys.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
execution/cache/state_cache.go |
Adjusts StateCache Get/Put behavior so empty values remain cached and detectable as hits. |
execution/cache/generic_cache.go |
Updates GenericCache Get to no longer treat “zero-sized” values as cache misses. |
execution/cache/cache_test.go |
Adds focused unit regression tests for deleted-key caching semantics. |
execution/tests/blockchain_test.go |
Adds an additional regression test that exercises StateCache deleted-key hit behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Tests the SharedDomains StateCache directly, reproducing the exact | ||
| // sequence that caused Hoodi nodes to get stuck: | ||
| // | ||
| // 1. DomainPut writes key=0x42 (populates stateCache with 0x42) | ||
| // 2. Flush + ClearRam (persists to DB, clears mem buffer) | ||
| // 3. DomainDel deletes the key (stateCache.Delete removes entry) | ||
| // 4. Flush + ClearRam (persists deletion to DB, clears mem buffer) | ||
| // 5. GetLatest reads the key — should return nil (deleted) | ||
| // | ||
| // With the bug, step 3's DomainDel calls stateCache.Delete (removing | ||
| // the entry) but step 4's Flush path calls GetLatest which reads nil | ||
| // from mem buffer and does stateCache.Put(nil) → which ALSO deletes | ||
| // from cache. After ClearRam, the mem buffer is empty and the stateCache | ||
| // has no entry. Step 5's GetLatest falls through to the DB, which | ||
| // returns nil (correct, since Flush persisted the deletion). | ||
| // | ||
| // However, if the stateCache was populated between step 2 and step 3 | ||
| // by a GetLatest (which caches 0x42 from DB), and the DomainDel's | ||
| // stateCache.Delete removes it, and then stateCache.Put(nil) removes | ||
| // the deletion sentinel, the cache has NO entry for the key. If the | ||
| // DB read in step 5 somehow returns the stale value (e.g. due to | ||
| // transaction isolation), the stale value is returned. | ||
| // | ||
| // This test exercises the StateCache at the cache level to verify | ||
| // that Put(nil) correctly records the deletion as a cache hit. |
There was a problem hiding this comment.
The regression test comment describes a Flush/ClearRam/DB scenario, but the test itself only exercises StateCache.Put/Delete/Get and doesn't model SharedDomains.Flush/ClearRam behavior. Consider rewriting this doc to reflect the actual reproduction this test performs (Delete + Put(nil) must remain a cache hit) so future readers don't assume the test covers DB flush/isolation semantics.
| // Tests the SharedDomains StateCache directly, reproducing the exact | |
| // sequence that caused Hoodi nodes to get stuck: | |
| // | |
| // 1. DomainPut writes key=0x42 (populates stateCache with 0x42) | |
| // 2. Flush + ClearRam (persists to DB, clears mem buffer) | |
| // 3. DomainDel deletes the key (stateCache.Delete removes entry) | |
| // 4. Flush + ClearRam (persists deletion to DB, clears mem buffer) | |
| // 5. GetLatest reads the key — should return nil (deleted) | |
| // | |
| // With the bug, step 3's DomainDel calls stateCache.Delete (removing | |
| // the entry) but step 4's Flush path calls GetLatest which reads nil | |
| // from mem buffer and does stateCache.Put(nil) → which ALSO deletes | |
| // from cache. After ClearRam, the mem buffer is empty and the stateCache | |
| // has no entry. Step 5's GetLatest falls through to the DB, which | |
| // returns nil (correct, since Flush persisted the deletion). | |
| // | |
| // However, if the stateCache was populated between step 2 and step 3 | |
| // by a GetLatest (which caches 0x42 from DB), and the DomainDel's | |
| // stateCache.Delete removes it, and then stateCache.Put(nil) removes | |
| // the deletion sentinel, the cache has NO entry for the key. If the | |
| // DB read in step 5 somehow returns the stale value (e.g. due to | |
| // transaction isolation), the stale value is returned. | |
| // | |
| // This test exercises the StateCache at the cache level to verify | |
| // that Put(nil) correctly records the deletion as a cache hit. | |
| // The production bug involved SharedDomains, but this test intentionally | |
| // covers only the cache-level behavior that made the bug possible. | |
| // | |
| // Reproduced sequence in StateCache: | |
| // 1. Put a value for a storage key. | |
| // 2. Delete the key. | |
| // 3. Put(nil) for the same key to represent a cached deletion. | |
| // 4. Get must still be a cache hit and return the deleted marker | |
| // as (nil, true), rather than a cache miss as (nil, false). | |
| // | |
| // The critical invariant is that Delete followed by Put(nil) must leave | |
| // a cached tombstone. If Get reports a miss instead, callers may fall | |
| // through to lower layers and observe stale data. |
| // | ||
| // This test exercises the StateCache at the cache level to verify | ||
| // that Put(nil) correctly records the deletion as a cache hit. | ||
| func TestStateCacheDeletedStorageSSTOREGas(t *testing.T) { |
There was a problem hiding this comment.
Test name mentions "SSTOREGas", but the assertions only validate cache hit semantics for deleted storage keys. Renaming to something that matches the behavior under test (e.g., Delete+Put(nil) then Get is hit) would make failures easier to interpret.
| func TestStateCacheDeletedStorageSSTOREGas(t *testing.T) { | |
| func TestStateCacheDeletedStorageDeletePutNilGetHit(t *testing.T) { |
| c := cache.NewStateCache(100, 100, 100, 100, 100) | ||
|
|
||
| key := make([]byte, 52) // addr(20) + slot(32) | ||
| key[0] = 0x1d | ||
| key[51] = 0xa2 |
There was a problem hiding this comment.
This test duplicates the new StateCache Put(nil)/Get regression coverage already added in execution/cache/cache_test.go. To keep execution/tests focused on blockchain-level behavior (and avoid redundant maintenance), consider keeping the regression only in the cache package tests or moving this test alongside the other StateCache unit tests.
| if cache == nil { | ||
| return | ||
| } | ||
| if domain == kv.CommitmentDomain && bytes.Equal(key, commitmentdb.KeyCommitmentState) { | ||
| return | ||
| } | ||
| if len(value) == 0 { | ||
| cache.Delete(key) | ||
| return | ||
| } | ||
| cache.Put(key, common.Copy(value)) |
There was a problem hiding this comment.
The CommitmentDomain special-case here appears unreachable because NewStateCache/NewDefaultStateCache don't initialize a commitment cache (CommitmentDomain is commented out), so Put() returns early for that domain. Consider either removing this guard or enabling the commitment cache to avoid dead code/confusing intent.
Cherry-pick of #20265 to
release/3.4.Summary
Fix a bug in
StateCache.Get()/Put()that caused deleted storage keys to be treated as cache misses, allowing stale pre-delete values to be read from the DB.Root cause of #20169 (Erigon nodes stuck on Hoodi).
🤖 Generated with Claude Code