Pull from go-ethereum up to 26aea73 (7 Feb 2019) #1

Merged
AlexeyAkhunov merged 3 commits into master from upstream_1
May 28, 2019

Conversation

@AlexeyAkhunov


No description provided.

atsushi-ishibashi and others added 3 commits February 7, 2019 10:44
* node: close AccountsManager in new Close method

* p2p/simulations, p2p/simulations/adapters: handle node close on shutdown

* node: move node ephemeralKeystore cleanup to stop method

* node: call Stop in Node.Close method

* cmd/geth: close node.Node created with makeFullNode in cli commands

* node: close Node instances in tests

* cmd/geth, node: minor code style fixes

* cmd, console, miner, mobile: proper node Close() termination
@AlexeyAkhunov AlexeyAkhunov merged this pull request into master May 28, 2019
yperbasis added a commit that referenced this pull request Jun 3, 2019
AlexeyAkhunov added a commit that referenced this pull request Feb 14, 2021
dmitry123 pushed a commit that referenced this pull request Jan 11, 2022
the ankr's implementation of erigon for bsc
enriavil1 pushed a commit that referenced this pull request Jan 29, 2022
Giulio2002 added a commit that referenced this pull request Jun 14, 2022
Giulio2002 added a commit that referenced this pull request Jun 14, 2022
* progress #1

* progress #2

* proper file naming

* more mature memory mutation
AlexeyAkhunov pushed a commit that referenced this pull request Jun 29, 2022
* rpc: add eth_callMany (#1)

* clean the repo

* clean style

* remove unwanted err check

* fix header bug

* Add RPC `debug_traceCallMany` (#4)

* update submodule

* fix error msg
Giulio2002 added a commit that referenced this pull request Jul 3, 2022
Giulio2002 added a commit that referenced this pull request Jul 3, 2022
* experiment #1

* experiment #2

* experiment #3

* experiment 4
revitteth referenced this pull request in revitteth/erigon Jul 13, 2022
feat(ci): run hive tests as part of CI
BlinkyStitt referenced this pull request in llamanodes/erigon Jan 3, 2023
@Brando753 Brando753 mentioned this pull request Sep 14, 2023
taratorio pushed a commit that referenced this pull request Jul 23, 2024
mh0lt added a commit that referenced this pull request May 8, 2026
- Remove dead LightCollector.DeleteAccount IncarnationPath emit (concern #3):
  IBS.Selfdestruct's versionWritten already publishes IncarnationPath via
  the version map; CollectorWrites is not the rawWrites source.
- Add MIRROR-OF tag on versionedWriteCollector.DeleteAccount cross-
  referencing LightCollector.DeleteAccount and noting that IncarnationPath
  is emitted via IBS.Selfdestruct, not either DeleteAccount (concern #7).
- Rewrite TestSDOfPreExistingContract_FullPipeline to populate vm with
  IBS.Selfdestruct's versionWritten emits; assert BalancePath=0 comes
  from vm.Read (not preBlockBalance from stateReader fallback) and that
  the trie sees a leaf with {Balance=0, Nonce=preBlock, CodeHash=preBlock}
  through the default FlushToUpdates branch (concern #1).
- Add TestSDStorageCascade_EmitsPerSlotDeletes locking in the storage-
  cascade load-bearing invariant: normalizeWriteSet's vm.StorageKeys
  loop emits StoragePath=0 entries that overwrite cs.storageState's
  pre-SD slot values, so each slot emits DeleteUpdate (concern #6).
- Update defensive-test docstrings to spell out unreachability from real
  production writesets and cross-link to the realistic-pipeline tests
  that cover the actual fix mechanism (concern #4).
Sahil-4555 pushed a commit to Sahil-4555/erigon that referenced this pull request May 9, 2026
…tyRemoval (erigontech#21032)

## Summary

Fixes a wrong trie-root in the parallel commitment calculator when
post-tx writesets are indistinguishable between two cases:

1. **SELFDESTRUCT of a pre-existing contract** — serial keeps the leaf
with `incarnation>0`, zero balance/nonce/empty codeHash. Parallel was
emitting `DeleteUpdate` and removing the leaf.
2. **EIP-161 emptyRemoval** of a touched-empty account (e.g. `0xff…fe`
after the EIP-4788 system call) — serial emits `DeleteUpdate`. Parallel
was emitting a zero-account UPDATE.

The discriminator is the **pre-block incarnation**, which the writeset
alone didn't carry. Fix wires it through `LightCollector.DeleteAccount`
→ `IncarnationPath` write → `calcAccountState.Incarnation` → 3-way
branch in `FlushToUpdates`.

## Dependency direction

This PR is a **precursor for erigontech#21017** (the CI matrix that runs every
test under both `serial` and `parallel` exec modes). Without this fix,
parallel-mode tests on hive `rpc-compat`, `engine
api/cancun/withdrawals`, and the eest selfdestruct sub-suites all fail
with wrong-trie-root errors at SD/empty-removal blocks. **erigontech#21017 cannot
land until this PR lands.**

The matrix in erigontech#21017 will validate this PR end-to-end via the hive
parallel sub-suites — meaning this PR's parallel-exec changes don't run
in *this* PR's own CI, only in erigontech#21017's. The rebased erigontech#21017 is the
meaningful go/no-go signal.

## Changes

* `LightCollector.DeleteAccount` now emits `IncarnationPath` alongside
`SelfDestructPath=true` when `original.Incarnation > 0`. Calc receives
the pre-deletion incarnation through the same channel as every other
write — no exec-side state leakage.
* `calc_state.ApplyWrites` consumes `IncarnationPath` into
`calcAccountState.Incarnation` via direct type-assertion (panic on
mismatch — see *Concern 3* below).
* `calc_state.FlushToUpdates` switches on a 3-way `Deleted` branch with
`isAllZero` defense-in-depth:
* `Deleted && Incarnation>0 && all-zero` → zero-account UPDATE (matches
serial's `DomainDel`-leaves-post-CREATE-encoding for SD'd contracts)
* `Deleted && all-zero && Incarnation==0` → `DeleteUpdate` (matches
serial's `DomainDel`-truly-empties for EIP-161 emptyRemoval)
* `Deleted` with retained non-zero values → regular UPDATE — defensive
coverage for OOG-CREATE2-with-retained-balance and any future
write-ordering race
* `SelfDestructPath` also marks all tracked storage slots dirty so
`FlushToUpdates` emits per-slot updates alongside the account reset.
Load-bearing invariant: `normalizeWriteSet`'s `vm.StorageKeys(addr)`
loop emits `StoragePath=0` entries that arrive in `ApplyWrites` after
`SelfDestructPath`, overwriting the marked slots' values to zero so they
emit `DeleteUpdate` not `StorageUpdate(pre-SD value)`. Inline comment in
`calc_state.go` spells this out — see *Concern 2* below.
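
The three-way `Deleted` branch described above can be sketched roughly as follows. This is a minimal illustration, not the code in `calc_state.go`: the `accountState` struct, the `updateKind` enum, and the `classify` helper are hypothetical stand-ins, and balance and code are reduced to simple fields.

```go
package main

import "fmt"

// accountState is a hypothetical, simplified stand-in for calcAccountState.
type accountState struct {
	Deleted     bool
	Incarnation uint64 // pre-block incarnation carried via IncarnationPath
	Balance     uint64
	Nonce       uint64
	HasCode     bool
}

// updateKind names the three outcomes listed in the PR description.
type updateKind int

const (
	regularUpdate updateKind = iota
	zeroAccountUpdate
	deleteUpdate
)

func (k updateKind) String() string {
	switch k {
	case zeroAccountUpdate:
		return "zero-account UPDATE"
	case deleteUpdate:
		return "DeleteUpdate"
	default:
		return "regular UPDATE"
	}
}

// isAllZero is the defense-in-depth check: no balance, nonce, or code retained.
func isAllZero(s accountState) bool {
	return s.Balance == 0 && s.Nonce == 0 && !s.HasCode
}

// classify picks the update kind for one account (assumed logic, see lead-in).
func classify(s accountState) updateKind {
	switch {
	case s.Deleted && isAllZero(s) && s.Incarnation > 0:
		return zeroAccountUpdate // SELFDESTRUCT of a pre-existing contract
	case s.Deleted && isAllZero(s):
		return deleteUpdate // EIP-161 emptyRemoval
	default:
		return regularUpdate // live account, or Deleted with retained values
	}
}

func main() {
	fmt.Println(classify(accountState{Deleted: true, Incarnation: 1}))
	fmt.Println(classify(accountState{Deleted: true}))
	fmt.Println(classify(accountState{Deleted: true, Balance: 7}))
}
```

The point of the sketch is that the first two cases have identical post-tx values and differ only in the pre-block incarnation, which is why that value has to travel with the writeset.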

## Earlier draft snags (resolved)

The first draft also added an `IncarnationPath > 0` exclusion to
`normalizeWriteSet`'s empty-account check. This was **redundant** (the
empty-check already requires `Nonce == 0`, which excludes successful
CREATE/CREATE2) and **broke OOG-during-CREATE2 cases** (which leave
`Nonce=0/Balance=0/Code=empty/Incarnation=1` and *must* still be
deleted). Removed in `9539998f14`. The `exec3_parallel.go` diff in this
PR is now comments-only.

## Reviewer concerns addressed

### Concern 1 (yperbasis): PR description was stale

This body. ✓

### Concern 2 (yperbasis + Copilot): SD storage-cascade load-bearing invariant

Inline comment in `calc_state.go`'s `SelfDestructPath` case now spells
out the dependency on `normalizeWriteSet`'s `vm.StorageKeys(addr)` loop.
✓

### Concern 3 (yperbasis + Copilot): IncarnationPath guarded type-assertion

Changed to direct `w.Val.(uint64)` to match the other paths. Silent zero
would route a real SD into the EIP-161 branch and reproduce the very
wrong-root bug — better to panic at the source. ✓
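
The difference between the guarded and the direct assertion can be illustrated with a small sketch. The function names `guardedIncarnation`, `directIncarnation`, and `panics` are hypothetical and exist only for this example; the real code asserts on `w.Val`.

```go
package main

import "fmt"

// guardedIncarnation uses the comma-ok form: a value of the wrong type
// silently becomes 0, which would look like "no prior incarnation".
func guardedIncarnation(val any) uint64 {
	inc, _ := val.(uint64)
	return inc
}

// directIncarnation uses the single-value form: a value of the wrong type
// panics at the assertion instead of flowing on as 0.
func directIncarnation(val any) uint64 {
	return val.(uint64)
}

// panics reports whether calling f panics.
func panics(f func()) (p bool) {
	defer func() { p = recover() != nil }()
	f()
	return false
}

func main() {
	wrong := int(3) // deliberately not a uint64
	fmt.Println("guarded result:", guardedIncarnation(wrong))
	fmt.Println("direct panics:", panics(func() { directIncarnation(wrong) }))
}
```

With the guarded form a type mismatch is indistinguishable from `Incarnation == 0`, which is exactly the value that selects the EIP-161 branch.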

### Concern 4 (yperbasis): `TestFlushToUpdates_DeletedWithRetainedBalance_EmitsRegularUpdate` docstring

Updated to clarify: this is **defensive coverage** for the third
`FlushToUpdates` branch in isolation, NOT a direct repro of the
eest_devnet OOG path. The actual OOG fix is the removal of the redundant
`IncarnationPath > 0` clause from `normalizeWriteSet` (the OOG writeset
has `Nonce=0` → empty-account → `DeleteUpdate`, not
`Deleted+RetainedBalance`). End-to-end coverage of that path lives in
the eest_devnet suite, surfaced via erigontech#21017's matrix. ✓

### Concern 5 (yperbasis): `versionedWriteCollector.DeleteAccount` asymmetry — *intentional non-fix*

Decision: **keep the asymmetry, document why.** Inline comment added on
`versionedWriteCollector.DeleteAccount` explaining:

* It's wired only into `txResult.finalize` (fee calc + post-Cancun
system calls).
* Neither path SDs a pre-existing contract today, so the
SD-with-incarnation differentiator is unreachable from here.
* If a future caller ever does emit `DeleteAccount` on a pre-existing
contract through this collector, the comment flags that this code should
mirror `LightCollector.DeleteAccount`'s `IncarnationPath` emit.

Adding the emit speculatively was rejected because: (a) it changes the
writeset shape for paths that today don't need it, (b) any test
exercising the new emit would be vacuous since no production caller hits
the `original.Incarnation > 0` branch, and (c) the comment is enough to
attribute the bug at first sight if someone *does* reach that code path
in the future.

## Intentional non-fixes

* **Concern 5 above** — `versionedWriteCollector.DeleteAccount` left
without the `IncarnationPath` emit (rationale above).
* **Defensive `TestFlushToUpdates_DeletedWithRetainedBalance` test
kept** despite the state being unreachable from real LightCollector
writesets today — protects the FlushToUpdates branch in isolation
against future ApplyWrites refactors that might drop the
`BalancePath`-clears-`Deleted` invariant.

## Test plan

- [x] All 6 unit tests in `calc_state_test.go` pass
(`TestFlushToUpdates_DeletedWithIncarnation_EmitsZeroAccountUpdate`,
`TestFlushToUpdates_DeletedWithoutIncarnation_EmitsDelete`,
`TestFlushToUpdates_DeletedWithRetainedBalance_EmitsRegularUpdate`,
`TestFlushToUpdates_LiveAccount_EmitsFullUpdate`,
`TestApplyWrites_IncarnationPath`,
`TestApplyWrites_BalancePathClearsDeleted`)
- [x] eest_devnet
`for_amsterdam/constantinople/eip1052_extcodehash/extcodehash/extcodehash_subcall_create2_oog`
all 6 variants pass locally
- [x] Full `for_amsterdam/constantinople` eest_devnet suite passes
- [x] `make lint` clean
- [x] CI on `9539998f14` was green

End-to-end validation comes via erigontech#21017's CI matrix once it rebases on
top of this PR.
sudeepdino008 added a commit that referenced this pull request May 21, 2026
…21310)

## Summary

Two spec-deviations in `getFilterBlockTree` compound to reject valid
current-epoch leaves at every epoch boundary, causing Caplin's `GetHead`
to fall back to the justified checkpoint root (a ~30-50 slot regression)
and triggering execution-side unwinds.

## Observed symptom (from #21301)

On bloatnet, once execution catches up to chain tip, the node enters a
steady-state cycle of **22-36 block execution unwinds every ~6 minutes**
— one per epoch boundary. Histogram from one ~3h run (110 unwind
events):

```
   1 size=11
   1 size=14
   4 size=15
   1 size=16
   4 size=18
   2 size=19
   3 size=20
   5 size=21
   9 size=22
  14 size=23     ← most common
   3 size=24
   5 size=25
   4 size=26
   4 size=27
   8 size=28
   1 size=30
   2 size=31
   4 size=32
   3 size=33
   4 size=34
   ...many size=35-36
```

The unwinds are on the **same chain** (no real reorg) — execution is
forced to roll back to an older head because Caplin regresses its
`GetHead()` return value:

```
08:38:45 Caplin FCU: headSlot=652928  currentSlot=652988  lagSlots=60   eth1Head=0xe0f3...
08:39:36 Caplin FCU: headSlot=652993  currentSlot=652993  lagSlots=0    eth1Head=0x263b... ← at tip
08:44:00 Caplin FCU: headSlot=652959  currentSlot=653015  lagSlots=56   eth1Head=0x7674... ← regressed 34 slots
```

The 08:44:00 FCU triggers a 23-block exec unwind:

```
[forkchoice] entering unwind path  fcuNum=24701328  canonHash=0x7674... fcuHash=0x7674...
                                   hashesDiffer=false  finishProgress=24701350  executionAhead=true
Unwind Execution from=24701350 to=24701327
```

`hashesDiffer=false` confirms same-chain. `executionAhead=true` because
exec raced ahead of Caplin's regressing head.

## Root cause: filter rejects mid-epoch leaves

Tracing inside `GetHead`:

```
[GetHead] cache miss, rebuilding from justifiedCheckpoint  justifiedEpoch=20410
[filterTree] reject leaf: checkpoint mismatch  blockRoot=0xe644... slot=653150
            justifiedOk=false  finalizedOk=false
            storeJustEpoch=20410  blockJustEpoch=20411
            storeFinalEpoch=20409  blockFinalEpoch=20410
[GetHead] filtered tree size  viableBlocks=0  totalHeads=1
[GetHead] rebuilt head  headHash=0x... headSlot=653119  ← fallback to justified root
Caplin is sending forkchoice  headSlot=653119  currentSlot=653174  lagSlots=55
```

Every leaf in the tree gets filtered because its
`unrealizedJustifications[blockRoot].epoch = N+1` but the store's
`justifiedCheckpoint.epoch = N` (store's realized lags by 1 epoch
mid-epoch). The walk falls back to the justified root → 50+ slot
regression.

## Spec deviations (the fix)

### 1. Voting-source selection used unrealized unconditionally

Spec
[`get_voting_source`](https://github.com/ethereum/consensus-specs/blob/dev/specs/phase0/fork-choice.md#get_voting_source)
picks unrealized only for **prior-epoch** blocks (pull-up justification
view); current-epoch blocks should use the block's realized state
checkpoint:

```python
def get_voting_source(store: Store, block_root: Root) -> Checkpoint:
    block = store.blocks[block_root]
    current_epoch = get_current_store_epoch(store)
    block_epoch = compute_epoch_at_slot(block.slot)
    if current_epoch > block_epoch:
        return store.unrealized_justifications[block_root]
    else:
        return store.block_states[block_root].current_justified_checkpoint
```

Erigon's code:

```go
// Use per-block unrealized justifications (spec: store.unrealized_justifications[block_root])
// Fall back to realized checkpoints if unrealized not available
currentJustifiedCheckpoint, has := f.getUnrealizedJustification(blockRoot)
if !has {
    currentJustifiedCheckpoint, has = f.forkGraph.GetCurrentJustifiedCheckpoint(blockRoot)
    ...
}
```

→ Always picks unrealized when available. The "fall back" comment
misreads the spec — spec is a conditional branch, not a fallback.
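
As a rough Go rendering of the spec's conditional, the selection could look like the sketch below. The `checkpoint` type, the map-based lookups, and the `votingSource` signature are assumptions made for the example; Erigon's actual accessors are `f.getUnrealizedJustification` and `f.forkGraph.GetCurrentJustifiedCheckpoint`, whose signatures are not reproduced here.

```go
package main

import "fmt"

// checkpoint is a hypothetical, epoch-only stand-in for a justified checkpoint.
type checkpoint struct {
	Epoch uint64
}

// votingSource branches on the block's epoch, as get_voting_source does:
// prior-epoch blocks use the unrealized view, others use the realized one.
// A missing entry returns ok=false instead of falling back to the other map.
func votingSource(
	currentEpoch, blockEpoch uint64,
	root string,
	unrealized, realized map[string]checkpoint,
) (checkpoint, bool) {
	if currentEpoch > blockEpoch {
		cp, ok := unrealized[root]
		return cp, ok
	}
	cp, ok := realized[root]
	return cp, ok
}

func main() {
	unrealized := map[string]checkpoint{"a": {Epoch: 11}}
	realized := map[string]checkpoint{"a": {Epoch: 10}}

	prior, _ := votingSource(20, 19, "a", unrealized, realized)
	current, _ := votingSource(20, 20, "a", unrealized, realized)
	fmt.Println("prior-epoch block uses epoch", prior.Epoch)
	fmt.Println("current-epoch block uses epoch", current.Epoch)
}
```

The key design point is that the two sources are alternatives chosen by epoch, so the presence of an unrealized entry never overrides the realized checkpoint for a current-epoch block.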

### 2. Justified/finalized check was strict equality

Spec
[`filter_block_tree`](https://github.com/ethereum/consensus-specs/blob/dev/specs/phase0/fork-choice.md#filter_block_tree)
has a lenient `+2 >= current_epoch` lookback:

```python
correct_justified = (
    store.justified_checkpoint.epoch == GENESIS_EPOCH
    or voting_source.epoch == store.justified_checkpoint.epoch
    or voting_source.epoch + 2 >= current_epoch  # ← lenient lookback
)
```

Erigon's code only allowed the first two clauses (strict `Equal()`).
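
The three disjuncts translate to Go fairly directly. The sketch below is illustrative only: `correctJustified` and `strictJustified` are hypothetical helper names, and the epoch values in `main` are chosen to resemble the log excerpt above (store justified epoch 20410, voting source epoch 20411).

```go
package main

import "fmt"

// genesisEpoch matches the spec constant GENESIS_EPOCH.
const genesisEpoch uint64 = 0

// strictJustified models the old check: genesis or exact epoch equality.
func strictJustified(storeJustifiedEpoch, votingSourceEpoch uint64) bool {
	return storeJustifiedEpoch == genesisEpoch ||
		votingSourceEpoch == storeJustifiedEpoch
}

// correctJustified adds the spec's lenient lookback as a third disjunct.
func correctJustified(storeJustifiedEpoch, votingSourceEpoch, currentEpoch uint64) bool {
	return storeJustifiedEpoch == genesisEpoch ||
		votingSourceEpoch == storeJustifiedEpoch ||
		votingSourceEpoch+2 >= currentEpoch
}

func main() {
	// Store lags the block's voting source by one epoch mid-epoch.
	fmt.Println("strict:", strictJustified(20410, 20411))
	fmt.Println("lenient:", correctJustified(20410, 20411, 20411))
}
```

A leaf whose voting source is one epoch ahead of the store fails the strict check but passes the lookback, as long as the source is within two epochs of the current epoch.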

## Regression provenance

The deviations were introduced in #20035 (`caplin: unified Engine API
client for standalone mode`, Mar 25 2026) by Mark Holt:

```
082697d  cl/phase1/forkchoice/get_head.go  +Use per-block unrealized justifications
```

Before that PR, `getFilterBlockTree` used
`f.forkGraph.GetCurrentJustifiedCheckpoint(blockRoot)` directly — which
returns the block's **realized** justified checkpoint. Block.realized
matched store.realized (both on the same epoch-boundary timeline), so
the equality check passed.

PR #20035 introduced the per-block unrealized lookup
(`store.unrealized_justifications` per the spec), but missed the
conditional branching (spec deviation #1) and didn't add the +2 lookback
(spec deviation #2). The result: block.unrealized goes ahead by 1 epoch
mid-epoch, and store.realized doesn't catch up until `OnTick`'s
epoch-boundary path runs.

## Fix

Restore spec-faithful behavior in `getFilterBlockTree`:

1. Branch the voting source selection on `currentEpoch > blockEpoch`:
   - Prior-epoch → `store.unrealized_justifications[block_root]`
   - Current/future-epoch → block's realized `current_justified_checkpoint`

In both branches, return `false` (reject the leaf) if the lookup is
absent. `OnBlock` populates `unrealizedJustifications` for every
imported block above the finalized slot, so a missing entry indicates
either an invariant violation or a leaf below finalized — neither should
produce a viable head.

2. Replace the strict `Equal()` check on the justified side with the
spec's three flat disjuncts: `store.justified_checkpoint.epoch ==
GENESIS_EPOCH || voting_source.epoch == store.justified_checkpoint.epoch
|| voting_source.epoch + 2 >= current_epoch`.

3. Replace the finalized-side checkpoint-equality check with the spec's
ancestor-descent rule: `store.finalized_checkpoint.root ==
get_checkpoint_block(block_root, finalized.epoch)`, implemented via
`f.Ancestor(blockRoot, finalizedSlot)`.

4. Snapshot `currentEpoch` once at the start of the walk (in
`getFilteredBlockTree`) and thread it through the recursion so the whole
rebuild uses a single store-epoch value — `OnTick` updates `f.time`
without holding `f.mu`, so per-leaf `f.Slot()` reads can mix epochs
across leaves around a slot boundary.

## Validation

Tested on bloatnet at chain tip, on `performance` HEAD + this fix:

- **Before fix**: 22-36 block unwinds every ~6 min, one per epoch
boundary, persisting indefinitely. 110 events in 3h.
- **After fix**: zero unwinds at epoch boundaries observed across 30+
minutes (~5 epoch boundaries crossed). `[filterTree] reject` no longer
fires.

Effects:

- **CPU waste eliminated**: ~30 blocks of execution work were being
discarded and re-done every cycle.
- **db_size**: unchanged (was already not affected since rolled-back
state was in MemoryMutation overlay, not MDBX).
- **Logs**: clean — no more spurious `Unwind Execution` lines at tip.

## Test plan

- [x] Run on bloatnet to chain tip, observe no large unwinds at epoch
boundaries
- [x] Confirm no head regression in `Caplin is sending forkchoice` logs
(lagSlots stays ~0 at tip)
- [ ] Spectest still passes (`make spectest`)

## Refs

Full diagnosis trail: #21301
lystopad added a commit that referenced this pull request May 21, 2026
Findings from Copilot (3) and yperbasis (8). The Copilot findings on Peers
routing, empty NodeInfo, and close(statusReady) are real; yperbasis
identified the matching doc/code mismatches and a few polish items.

1. Peers() / message routing (Copilot #1). The multi-sentry client uses
   Peers() to decide which sentry owns each peer and routes SendMessageById
   via that sentry's gRPC. With the previous reporter-only gating, every
   peer mapped to Servers[0] (which is the *lowest* configured protocol,
   ETH69 in the default — yperbasis #1 noted the comment said "highest").
   Servers[0]'s goodPeers doesn't have eth/70 or eth/71 peers, so
   SendMessageById silently no-op'd.

   Fix: each GrpcServer.Peers() now reports its own goodPeers, filtered to
   skip entries where both protocol and witProtocol are zero. Each peer
   ends up in exactly one eth-sentry's goodPeers (the negotiated version)
   plus, at most, the sentry hosting the wit sideprotocol, so admin_peers
   aggregation is naturally non-duplicating and routing is correct.

2. NodeInfo() (Copilot #3). Non-reporters returning empty replies
   polluted admin_nodeInfo with blank entries that sorted first. With the
   shared p2p.Server every sentry has the same Node ID and the same enode,
   so they now return identical NodeInfo. node/eth.NodesInfo deduplicates
   by Enode before sorting.

3. SetStatus's close(ss.statusReady) (Copilot #2) panicked for callers
   that construct GrpcServer outside NewGrpcServer (existing
   TestSentryServerImpl_* does this). Guarded the close with a nil check;
   awaitStatus tolerates a nil channel via the select's other cases.

4. SetP2PServer returns an error instead of panicking on double-call
   (yperbasis #3). The "ownership decided up front" invariant is still
   enforced, just propagated up the Provider.Initialize path cleanly.

5. SimplePeerCount filters protocol=0 ghosts (yperbasis #4). With wit/0
   deduped to one sentry, peers that negotiate eth/N on a different sentry
   end up as protocol=0/witProtocol=0 ghosts on the wit-hosting sentry.
   Counting them would emit a bogus eth.ProtocolToString[0] bucket in the
   GoodPeers log. The Peers() filter already drops them from admin_peers;
   this aligns SimplePeerCount.

6. awaitStatus logs a Debug line when its timeout fires (yperbasis #5)
   so operators can tell "core didn't send status in time" from "core
   never tried" when the caller disconnects the peer with
   PeerErrorLocalStatusNeeded. Doc comment clarifies the ctx.Done case
   too (yperbasis #7).

Drops the reportsPeers flag and IsPeerReporter accessor — no longer needed
once per-sentry goodPeers replaces the reporter gating. SetP2PServer
signature loses its second argument.

Tests updated for the new shape; new TestGrpcServer_PeersReturnsPerSentryGoodPeers
(per-sentry view + ghost-entry filter) and TestGrpcServer_SetStatus_NilStatusReadyIsSafe
(close-nil guard).

Deferred to follow-up:
- BootstrapNodes/DNS resolution helper to dedupe logic between
  makeP2PServer and startSharedP2PServer (yperbasis #6).
- End-to-end Provider.Initialize test in local mode (yperbasis #8).
- [r3.4] backport PR (yperbasis #9).

Co-Authored-By: Claude
AskAlexSharov pushed a commit to HoustonOla35/erigon that referenced this pull request May 22, 2026
…rgs) (erigontech#21211)

## Summary

First incremental cut toward
erigontech#21138's structural
goal: **one finalize function per parallel-exec result, with
`IntraBlockState` used nowhere outside workers**.

This PR removes two finalize variants that are already unreachable from
production:

| Function | LOC | Production callers on main |
|---|---|---|
| `finalizeWithIBS` (full IBS reconstruction, BAL-compat path) | ~120 | 0 |
| `finalizeTx` (delta-args variant, direct fee-balance path) | ~250 | 0 (only `TestFinalizeTx_AllScenarios`) |

Plus the test suite that exclusively exercised the delta-args path
(`TestFinalizeTx_*`, fixture builders `coinbaseIsRecipientScenario` /
`selfTransferScenario`, helpers `hasCoinbaseDelta` /
`adjustForTransferDelta` / `buildWriteMap` / `fmtWriteVal` /
`extractBalanceReads`) and one stale comment in
`engine_api_bal_test.go`.

Net: **-690 lines**, **+1 line**, no semantic change.

## Why now

The parallel-exec correctness stack landed in erigontech#21153 (merged
2026-05-15). The combined effect of that PR plus erigontech#21177 routed all
production finalize flows through `finalizeTxSimple` — these two
functions became unreachable. Removing them shrinks `exec3_parallel.go`
from 3640 → 3268 lines, making subsequent IBS-dependency drains easier
to review.

The next steps in the erigontech#21138 sequence:

- **PR 2** — drain IBS dependency 1 (SD address lookup):
`LogSelfDestructedAccounts` consumes `result.SelfDestructedWithBalance`
only, no `ibs.GetRemovedAccountsWithBalance()` call.
- **PR 3** — drain IBS deps 2 (`AddLog` → return logs) and 3
(`AddBalance` bookkeeping → already on `CollectorWrites`);
`finalizeTxSimple` becomes IBS-free.
- Later — `normalizeWriteSet` → `filterWritesByVersionMap`;
`calcState.ApplyWrites` → `VersionedWrites.TouchUpdates`; move
EIP-7002/7251 syscall execution into the worker pool.

End state: one `finalizeTx`, no IBS outside workers.

## Test plan

- [x] `make lint` clean
- [x] `make test-short` (full `execution/stagedsync`, `execution/state`,
`execution/tests`, `rpc/jsonrpc` packages) green under
`EXEC3_PARALLEL=true`
- [x] BAL family (`TestEngineApiBAL*`) 8/8 parallel
- [x] `TestEIP7708BurnLogWhenCoinbaseSelfDestructs` green
- [x] Surviving `TestFinalizeTxSimple_*` family green
- [ ] CI: race-tests, kurtosis, hive matrix legs green on both serial
and parallel

## Related

- erigontech#21017 — serial/parallel CI matrix that surfaces parallel-leg failures
(now rebuilt on post-erigontech#21153 main; CI fresh-running)
- erigontech#21153 — parallel-exec correctness stack (merged)
- erigontech#21138 — heuristic-removal / IBS-dependency-removal tracker (the
parent)
mh0lt pushed a commit that referenced this pull request May 24, 2026
…ash LRU

Two surgical commits bundled (both touch the code-read hot path):

1. IntraBlockState.GetCodeSize now loads the full bytes via
   stateReader.ReadAccountCode on first touch and populates
   stateObject.code, so subsequent same-addr EXTCODESIZE /
   EXTCODEHASH / CALL within the tx are in-struct slice-len calls
   (~50 ns), not full reader round-trips. Mirrors geth's pattern
   at core/state/state_object.go ~Code() — pay one read per addr
   per tx, free for the rest.

2. CodeCache.addrToHash switched from a no-op-when-full
   maphash.Map[versionedAddressID] to an LRU
   lru.Cache[[20]byte, versionedAddressID] (hashicorp/golang-lru/v2,
   already imported elsewhere). Cap derived from the existing byte
   budget at ~28 bytes/entry (~580 k entries for the 16 MB default).
   Fresh-address workloads (mainnet thousands of new addrs per
   block) now warm up the addr layer over time instead of silently
   dropping new entries forever; matches geth's lru.Cache at
   core/state/database_code.go.

   The hashToCode layer is unchanged (content-addressed bytes,
   immutable, byte-capped with new-entry no-op when full — the same
   semantic as before since code bytes by codeHash never change).
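
To make the eviction difference concrete, here is a minimal LRU built on the standard library's `container/list`. It is a stand-in for illustration only: the commit uses `lru.Cache` from hashicorp/golang-lru/v2, and the `lruCache` type, its string keys, and its int values below are assumptions of this sketch.

```go
package main

import (
	"container/list"
	"fmt"
)

// entry is what each list element stores.
type entry struct {
	key string
	val int
}

// lruCache is a hypothetical minimal LRU; capacity is assumed to be >= 1.
type lruCache struct {
	capacity int
	order    *list.List // front = most recently used
	items    map[string]*list.Element
}

func newLRU(capacity int) *lruCache {
	return &lruCache{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[string]*list.Element),
	}
}

// Get returns the value and marks the key as most recently used.
func (c *lruCache) Get(key string) (int, bool) {
	el, ok := c.items[key]
	if !ok {
		return 0, false
	}
	c.order.MoveToFront(el)
	return el.Value.(entry).val, true
}

// Add inserts or updates a key, evicting the least recently used entry when
// full. A no-op-when-full cache would return here without storing anything.
func (c *lruCache) Add(key string, val int) {
	if el, ok := c.items[key]; ok {
		el.Value = entry{key: key, val: val}
		c.order.MoveToFront(el)
		return
	}
	if c.order.Len() >= c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(entry).key)
	}
	c.items[key] = c.order.PushFront(entry{key: key, val: val})
}

func main() {
	c := newLRU(2)
	c.Add("a", 1)
	c.Add("b", 2)
	c.Get("a")    // "a" becomes most recently used
	c.Add("c", 3) // evicts "b", the least recently used
	_, hasB := c.Get("b")
	_, hasC := c.Get("c")
	fmt.Println("b cached:", hasB, "c cached:", hasC)
}
```

With a no-op-when-full policy the third insert would be dropped and the cache would stay frozen on its first two addresses, which is the behaviour the commit calls a bug.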

Bench on the EXTCODESIZE-EXISTING_CONTRACT-30M family: 62.34 mgas/s
(was 61.50). The marginal gain is small on this bench because BAL
prefetch already populates the cache layers; neither lever fires
heavily. The expected wins are on non-BAL workloads where
EXTCODESIZE-loop patterns repeat within a tx (#1) and
fresh-address-churn mainnet blocks fill the addr layer (#2).

Updated TestCodeCache_AddrCapacityLimit to assert LRU eviction
(was asserting no-op-when-full); the prior behaviour was the bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mh0lt pushed a commit that referenced this pull request May 25, 2026
Sahil-4555 pushed a commit to Sahil-4555/erigon that referenced this pull request May 29, 2026
…ckly (erigontech#21483)

## Problem

When a merge-queue run has a hive-eest shard fail, the failing job calls
`gh run cancel ${{ github.run_id }}` (added in erigontech#21445). That sends
SIGTERM to all in-flight matrix siblings, but the Docker-bound hive
simulators take ~20 minutes to actually drain. `ci-gate` is `if:
always()` and waits for every `needs` job to reach a terminal state, so
the broken PR sits at `AWAITING_CHECKS` for the full drain time —
blocking the head of the merge queue.

Concrete example from today (PR erigontech#21470 at position 1):

- 08:29:57 — `hive-eest / test-hive-eest (paris+shanghai, serial)`
fails, calls `gh run cancel 26562610423`, emits the "Merge-queue
root-cause failure" annotation from erigontech#21445.
- 08:48 (~19 min later) — paris+shanghai-parallel,
prague-serial/parallel, cancun-serial/parallel, osaka-parallel,
rlp-serial/parallel, and glamsterdam-devnet-parallel were all still
`in_progress`. Every other ci-gate child (tests, race-tests,
eest-spec-tests, kurtosis, hive, lint, bench, repro, sonar, caplin) had
already completed.

The bottleneck was specifically the hive-eest matrix siblings.

## Fix

```yaml
strategy:
  fail-fast: ${{ github.event_name == 'merge_group' }}
```

- **In `merge_group`**: first failed shard immediately cancels all
siblings at the GitHub API layer — much faster than the `gh run cancel`
→ SIGTERM → runner-drain path. ci-gate's `needs` reach terminal state in
seconds, ci-gate fails, the broken PR is evicted.
- **In PR runs**: stays `false`, so authors still see the full failure
breakdown across every shard. No regression in PR feedback.

## What's left in place and why

The per-job `gh run cancel` step (test-hive-eest.yml lines 311-317)
stays. Two reasons:

- Matrix `fail-fast` only cancels siblings **within the same matrix** —
it doesn't cancel sibling reusable workflows. If a future failure
pattern leaks across workflows, `gh run cancel` still covers it.
- ci-gate.yml's root-cause annotator (line 188) keys off "the leaf that
ran `gh run cancel` successfully" to single out the true root cause
among collateral cancellations. Removing the step would silently regress
erigontech#21445's attribution.

## Scope choice

Only `test-hive-eest.yml` is changed. Other matrix-bearing reusable
workflows (`test-all-erigon.yml`, `test-all-erigon-race.yml`,
`test-eest-spec.yml`, `test-kurtosis-assertoor.yml`, `test-hive.yml`,
`test-bench.yml`) all use `fail-fast: false` too, but none of them were
the queue-blocking long pole in this incident. Keeping the patch
minimal; we can generalize if another workflow becomes the bottleneck.

## Tradeoff to be aware of

Queue runs will now show siblings as `cancelled` instead of `failed`
whenever any one shard fails. That's the correct tradeoff in
`merge_group` — the goal is fast eviction, not detailed diagnostics;
full per-shard breakdown remains available on the PR run.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
yperbasis added a commit that referenced this pull request May 29, 2026
profiling at tip (MDBX, 60s window) pinned bytes.Clone as the #1
single allocation source at 64.3 GB / 60s (~16.85% of total alloc
traffic). 37.86 GB / 60s (~631 MB/sec) of that came from the
defensive common.Copy at TrieContext.Branch.

The clone was redundant: every downstream consumer either reads the
slice inline (cell.fillFromFields copies into pre-allocated arrays;
merger.Merge consumes and produces a fresh buffer; trie_reader
parses bytes into cells; unfoldBranchNode similar) or clones it
itself at the queue boundary (getDeferredUpdate clones both prefix
and prev when storing in the deferred-update pool). Branch's clone
was a third copy of bytes that nothing needed to retain.

Document the new contract (borrowed slice valid for the current
ComputeCommitment scope) and update the test that exercised the old
"returns owned bytes" guarantee to verify the new aliasing
guarantee instead.
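
The difference between the old "owned bytes" contract and the new "borrowed slice" contract can be sketched as follows. This is a minimal illustration, not the erigon code: `branchStore`, `load`, `BranchOwned`, and `BranchBorrowed` are hypothetical names, and the reused buffer only stands in for whatever memory backs the real `TrieContext.Branch` result.

```go
package main

import (
	"bytes"
	"fmt"
)

// branchStore is a hypothetical stand-in for the storage behind a Branch lookup.
// It reuses one backing buffer between reads, so returned slices can alias it.
type branchStore struct {
	buf []byte
}

// load overwrites the reused buffer in place with new data.
func (s *branchStore) load(data []byte) {
	s.buf = append(s.buf[:0], data...)
}

// BranchOwned models the old contract: every call allocates a defensive copy.
func (s *branchStore) BranchOwned() []byte {
	return bytes.Clone(s.buf)
}

// BranchBorrowed models the new contract: no allocation, but the result is
// only valid until the store is written again.
func (s *branchStore) BranchBorrowed() []byte {
	return s.buf
}

func main() {
	s := &branchStore{}
	s.load([]byte{1, 2, 3})

	owned := s.BranchOwned()
	borrowed := s.BranchBorrowed()
	// A consumer that needs to keep the bytes clones them at its own boundary.
	retained := bytes.Clone(borrowed)

	s.load([]byte{9, 9, 9}) // the scope in which `borrowed` was valid has ended

	fmt.Println("owned:   ", owned)
	fmt.Println("borrowed:", borrowed)
	fmt.Println("retained:", retained)
}
```

In this sketch only the borrowed slice changes after the second `load`. That is why the contract is safe only when every consumer either reads the bytes inline or copies them itself before the scope ends.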

After-measurement on the rig is blocked by an unrelated stage-loop
persistence inconsistency (chaindata head pointer ahead of state
writes on restart) that reproduces on every restart cycle today.
The change is minimal (a single Clone removal plus a test contract
update), and unit tests and make lint pass, so it ships on that
reasoning without an after-profile.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
bloxster added a commit that referenced this pull request Jun 1, 2026
Addresses the CHANGES_REQUESTED review, verified against `main`:

database.md
- Receipts ARE stored: ReceiptDomain (required) + RCacheDomain (optional).
  Reword the "not stored / re-computed" claim to: compact per-txn receipt
  metadata persisted in a required domain, full receipts optionally cached,
  logs re-derived by re-execution only when not cached.
- snapshots/domain holds 6 domains (4 state + 2 receipt), not 4.
- Soften "rm -rf chaindata/": recoverable, but re-derives state from snapshots
  and resyncs the tip from the CL over the Engine API (devp2p block download
  removed in #21505) — not a quick rebuild.

architecture.md
- prune `full` keeps a rolling ~262k-block window (DefaultPruneDistance), not
  all post-Merge blocks; add the `blocks` prune mode to the table.
- Pipeline: Snapshots is stage #1; no separate Commitment stage (it runs inside
  Execution); add Senders.
- Mermaid: Downloader (BitTorrent) is independent of Sentry (devp2p); blocks
  arrive from Caplin via the Engine API.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sahil-4555 pushed a commit to Sahil-4555/erigon that referenced this pull request Jun 3, 2026
…#21579)

Two tweaks to how Claude Code operates in this repo.

## 1. Bare `#N` references in GitHub text

Claude has used a bare `#`+number to mean "point N" (e.g. writing
"`#1`" for "the nit from point 1") in PR descriptions, issue
descriptions, and comments. GitHub auto-links that to the issue/PR with
that number, so "`#1`" turns into a link to the repo's first-ever PR.
This came up in a comment on PR 21510 (erigontech#21510)
and has happened on other occasions.

`agents.md` now instructs Claude to write "point 1" / "item 1" / "the
first nit" and reserve `#N` for genuine issue/PR references.
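
A rough way to spot such references before posting is sketched below. It is an illustration only and not part of this change: `bareRefs` is a hypothetical helper, the regular expressions approximate GitHub's auto-linking rules rather than reproduce them, and the check cannot tell a genuine issue reference from a "point N" shorthand, so a human still has to judge each hit.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// inlineCode matches `...` spans, which GitHub does not auto-link.
	inlineCode = regexp.MustCompile("`[^`]*`")
	// bareRef matches #N at the start of the text or after whitespace or "(".
	bareRef = regexp.MustCompile(`(?:^|[\s(])#(\d+)\b`)
)

// bareRefs returns every #N outside inline code that would likely auto-link.
func bareRefs(text string) []string {
	stripped := inlineCode.ReplaceAllString(text, "")
	var out []string
	for _, m := range bareRef.FindAllStringSubmatch(stripped, -1) {
		out = append(out, "#"+m[1])
	}
	return out
}

func main() {
	text := "Addressed the nit from #1; the example `#2` is shown as code."
	fmt.Println(bareRefs(text))
}
```

Anything the helper reports is a candidate for rewording as "point 1" or for wrapping in backticks.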

## 2. Claude attribution in commits, PRs, and issues

Stop adding `Co-Authored-By: Claude` and `🤖 Generated with Claude Code`
lines.

Enforced deterministically in `.claude/settings.json` rather than as a
prose instruction, so it does not depend on the model remembering:

- `includeCoAuthoredBy: false` — the one switch the installed Claude
Code (v2.1.160) honors across *both* the commit and PR code paths; it
returns empty attribution for each.
- `attribution: { commit: "", pr: "" }` — the newer, forward-looking
mechanism; empties the footer text.

**Why both:** `attribution` alone is insufficient today — the PR code
path does a truthy check on `attribution.pr`, so an empty string is
ignored and the "Generated with" footer still leaks into PR bodies.
`includeCoAuthoredBy: false` is the reliable global off-switch.
`agents.md` documents the rule for human readers too.

---

Docs/config only — no Go changes, so build/lint are unaffected. This PR
follows both rules itself: no attribution trailers on the commit, and
the `#1` examples above are shown as code so they don't auto-link.
yperbasis pushed a commit to fr0mano/erigon that referenced this pull request Jun 9, 2026
…LOAS (erigontech#21698)

## Problem

Lido Hoodi validators lost attestation score after switching to
`release/3.5`. They missed **head votes** at ~30–60% per epoch (≈2× a
random network sample) while **target and source votes were always
correct** — the attestations were produced, published, and included
on-chain on time, but voted for a **stale head**.

## Root cause

The GLOAS merge (erigontech#18956) added `indexedWeightStore`
(`cl/phase1/forkchoice/weight_store_indexed.go`). It is instantiated
unconditionally, and its `IndexVote`/`RemoveVote` are called per
validator index on **every** attestation via `setLatestMessage`,
regardless of fork.

However its results are only consumed by GLOAS `get_head`. Pre-GLOAS
`get_head` (and `timing.go`) use the non-indexed `weightStore`, and
`GetIndexedWeightStore()` has no callers. So on pre-GLOAS chains the
index is maintained but never read.

On a high-validator-count network (Hoodi, ~1.2M active validators) this
maintenance was the single largest CPU consumer (`RemoveVote`, ~15% of
CPU, fresh slice allocation per call) plus a large GC load — all under
the fork-choice write lock. The lock contention delayed `OnBlock` (block
import) and `get_head` past the attestation deadline, so the head served
to the validator client was stale → wrong head votes.

## Fix

Gate the indexed-store maintenance to the GLOAS vote path only, via an
explicit `maintainIndexedVotes` flag on `setLatestMessage` (mirroring
the existing `updateLatestMessages` pre-GLOAS/GLOAS dispatch).
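
The shape of the gate can be sketched as follows. This is a simplified model, not the Caplin implementation: only the names `setLatestMessage` and `maintainIndexedVotes` come from the description above, while `forkChoice`, `voteIndex`, `onAttestation`, and the string block roots are hypothetical simplifications.

```go
package main

import "fmt"

// voteIndex is a hypothetical stand-in for indexedWeightStore.
type voteIndex struct {
	votes  map[uint64]string
	writes int // counts maintenance work, for illustration only
}

// forkChoice is a hypothetical, heavily reduced fork-choice store.
type forkChoice struct {
	latest map[uint64]string // latest message per validator (always kept)
	index  *voteIndex        // per-validator index (only needed on GLOAS)
}

func newForkChoice() *forkChoice {
	return &forkChoice{
		latest: make(map[uint64]string),
		index:  &voteIndex{votes: make(map[uint64]string)},
	}
}

// setLatestMessage always records the latest message, but touches the
// index only when the caller asks for it.
func (f *forkChoice) setLatestMessage(validator uint64, root string, maintainIndexedVotes bool) {
	f.latest[validator] = root
	if !maintainIndexedVotes {
		return // pre-GLOAS path: skip the index entirely
	}
	f.index.votes[validator] = root
	f.index.writes++
}

// onAttestation dispatches per fork; only the GLOAS path maintains the index.
func (f *forkChoice) onAttestation(validators []uint64, root string, gloas bool) {
	for _, v := range validators {
		f.setLatestMessage(v, root, gloas)
	}
}

func main() {
	pre := newForkChoice()
	pre.onAttestation([]uint64{1, 2, 3}, "0xaa", false)
	fmt.Println("pre-GLOAS index writes:", pre.index.writes)

	post := newForkChoice()
	post.onAttestation([]uint64{1, 2, 3}, "0xaa", true)
	fmt.Println("GLOAS index writes:", post.index.writes)
}
```

Passing the flag explicitly keeps the decision at the call site, so the pre-GLOAS path never pays for index maintenance it does not read.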

## Validation (live Hoodi node, before vs after rebuild)

- Node CPU (operator Grafana): ~7.5% → <2%.
- 60s CPU profile: total samples 148.9% → 41.6%;
`indexedWeightStore.RemoveVote` (was the #1 hotspot) and the GC storm both
gone.
- Head-at-attestation-deadline staleness: 64% → 4%.
- Validator head-vote misses: 40% → 0% in the first fully post-fix epoch
(network sample ~23% in the same epoch); target/source unaffected
throughout.

Affects all pre-GLOAS networks on this branch (mainnet/Sepolia/Gnosis),
with impact scaling by validator count.

## Tests

- `TestPreGloasDoesNotMaintainIndexedWeightStore` — fails before the fix
(the pre-GLOAS path populates the index), passes after.
- `TestGloasMaintainsIndexedWeightStore` — locks in that the GLOAS path
still indexes votes.
- `go test ./cl/phase1/forkchoice/...`, `make lint`, `make erigon` all
clean.

## Follow-up

`indexedWeightStore` is currently unused even on GLOAS (`get_head` uses
the non-indexed store). The Caplin team should either wire it into GLOAS
`get_head` or remove it; tracking separately.