execution/commitment: stop cloning bytes at TrieContext.Branch#21524
Conversation
profiling at tip (MDBX, 60s window) pinned bytes.Clone as the #1 single allocation source at 64.3 GB / 60s (~16.85% of total alloc traffic). 37.86 GB / 60s (~631 MB/sec) of that came from the defensive common.Copy at TrieContext.Branch. The clone was redundant: every downstream consumer either reads the slice inline (cell.fillFromFields copies into pre-allocated arrays; merger.Merge consumes and produces a fresh buffer; trie_reader parses bytes into cells; unfoldBranchNode similar) or clones it itself at the queue boundary (getDeferredUpdate clones both prefix and prev when storing in the deferred-update pool). Branch's clone was a third copy of bytes that nothing needed to retain. Document the new contract (borrowed slice valid for the current ComputeCommitment scope) and update the test that exercised the old "returns owned bytes" guarantee to verify the new aliasing guarantee instead. After-measurement on the rig is blocked by an unrelated stage-loop persistence inconsistency (chaindata head pointer ahead of state writes on restart) that's reproducing on every restart cycle today; the change is mathematically minimal (single Clone removal + test contract update) and unit tests + make lint pass, so shipping on the math. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR reduces allocation pressure in the commitment trie path by making TrieContext.Branch return the domain reader’s slice directly instead of cloning it, relying on callers to consume inline or clone at retention boundaries.
Changes:
- Removed the
common.CopyinTrieContext.Branch. - Added an implementation-level borrowed-slice contract comment.
- Updated the unit test to assert aliasing/borrowing behavior instead of copy ownership.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
execution/commitment/commitmentdb/commitment_context.go |
Changes Branch to return borrowed data directly and documents the new lifetime expectation. |
execution/commitment/commitmentdb/commitment_context_test.go |
Updates the branch ownership test to validate borrowed-slice aliasing behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…e PatriciaContext interface Address Copilot's review on PR #21524: the borrowed-slice lifetime note was only on the concrete *TrieContext implementation; callers that depend on the interface couldn't see it. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Giulio2002
left a comment
There was a problem hiding this comment.
LGTM \u2014 small obvious change (~39 changed lines across 3 files)
|
@yperbasis longevity of the A/B test? we had an issue with branch merger before that the value we got from context while unfolding and when we do merge later, drifts. |
|
Ran the rig overnight (~9 h, n=2,627 matched-pair |
…ies (erigontech#21597) ## Summary Fixes the hive devp2p `BlobViolations` flake (`expected disconnect on blob violation, got msg code: 24`) seen on the CI Gate run for erigontech#21524 (parallel leg) and on a runner-experiment branch the day before (serial leg) — and the gossip-duplication regression behind it. ## Root cause Since erigontech#21335 all per-eth-version sentry `GrpcServer`s share one `p2p.Server` and one `PeerStore`. `SendMessageById` resolves the target peer in the shared store, so callers that fan a by-id send out across every sentry client — e.g. the txpool's new-peer pool sync (`PropagatePooledTxnsToPeersList`) — now deliver the same frame once per sentry: three identical `NewPooledTransactionHashes` messages per new peer on the default eth/69+70+71 build. Before erigontech#21335 each sentry had its own peer set, so the off-sentry sends were silent no-ops. `SendMessageToAll` and `SendMessageToRandomPeers` already guard against this with `protocolVersions.Contains(peerInfo.EthProtocol())`; `SendMessageById` was the only outbound path without the negotiated-version check. The BlobViolations failure sequence, reproduced locally with wire-level logging (hive `--sim devp2p --sim.limit eth` against an instrumented image; failed on the second loop iteration with the exact CI error): 1. The test peers, announces two blob txs (one with a lying size/type), delivers them, and waits for a disconnect. 2. The violating `PooledTransactions` reply sits in the txpool fetcher's 250 ms inbound batch before the announcement check kicks the peer. 3. If the 5 s `syncToNewPeersEvery` tick lands in that window, the all-pool announcement (~70 KB — about 2000 hashes accumulated by earlier suite tests) is written to the test peer three times: ``` 08:16:32.606 inbound NPTH68 peer=de6008eb count=2 08:16:32.607 WRITE GET_POOLED_TXS_66 peer=de6008eb 08:16:32.607 sync-to-new-peers entries=2002 <- 5s tick fires 08:16:32.608 WRITE NPTH68 70084 bytes peer=de6008eb <- one write 08:16:32.608 WRITE NPTH68 70084 bytes peer=de6008eb <- per 08:16:32.608 WRITE NPTH68 70084 bytes peer=de6008eb <- sentry 08:16:32.855 VIOLATION -> PenalizePeer peer=de6008eb <- next 250ms batch flush ``` 4. The devp2p test deliberately tolerates one pre-disconnect hash announcement (go-ethereum commit 88c8459, Sept 2024, "sometimes we'll get a blob transaction hashes announcement before the disconnect"); the duplicated second copy arrives before the kick and fails the test. In the failed CI run the timing lines up exactly: the txpool started at 06:33:04.64 and the test failed at 06:33:14.644 — the tick+10s sync. With the duplication fixed, at most one announcement can precede the disconnect in that window, which the test tolerates by design. Beyond the test, every new peer was receiving the full pool sync in triplicate. ## Fix Apply the same negotiated-eth-version guard in `SendMessageById` that the other outbound paths use. Eth-protocol messages only; wit is unaffected (it is deduplicated to a single `GrpcServer` at shared-server construction). This also restores the routing contract documented at `FetchBlockAccessLists` ("sentryIndex MUST be the sentry where peerID is actually connected"): sends via a non-matching sentry are silent no-ops again, as they were with per-sentry peer sets. Not present on `release/3.4` (erigontech#21335 is main-only), so no backport is needed. ## Testing - New regression test `TestGrpcServer_SendMessageById_SharedStore_NoDuplicateWrites` (TDD): three `GrpcServer`s (eth/69/70/71) sharing a `PeerStore`, peer negotiated eth/70, by-id fan-out across all three must produce exactly one write. Red before the fix (`expected: 1, actual: 3`), green after. - `go test ./p2p/sentry/... ./txnprovider/txpool/...` clean. - Local hive validation in a loop: before the fix `not ok 18 BlobViolations` reproduced on iteration 2; after the fix 12/12 iterations green, with instrumented logs showing exactly one `NEW_POOLED_TRANSACTION_HASHES_68` write per new peer where there were three. - `make lint` (repeatedly) and `make erigon integration` clean.
…vert The initial revert paraphrased the PatriciaContext.Branch and TrieContext.Branch comments and trimmed one assertion from the unit test. Restore the exact pre-#21524 text and the full two-direction Test_TrieContext_BranchCopiesData, so the PR's net diff is precisely the inverse of #21524 and nothing else.
…proof/trace corruption (erigontech#21630) **[SharovBot]** ## Problem PR erigontech#21524 removed `common.Copy` from `TrieContext.Branch()`, claiming the returned slice can be safely borrowed. However, mainnet RPC integration tests show widespread failures: - All 20 `eth_getProof` tests: diff mismatch - 50+ `eth_simulateV1` tests: diff mismatch - `debug_traceBlockByNumber`: malformed JSON + diff mismatch The borrowed slice is being mutated before callers finish using it. ## Fix Pure revert of erigontech#21524 (`b2329dc55b`). The net diff is exactly the inverse of that commit and touches nothing else — equivalent to `git revert b2329dc`. It: - restores `common.Copy(enc)` in `TrieContext.Branch()` so each returned slice is independent; - restores the original `PatriciaContext.Branch` interface doc and `TrieContext.Branch` comment; - restores `Test_TrieContext_BranchCopiesData` in full, including the original two-direction independence assertions (source mutation does not reach the returned slice, and mutating the returned slice does not reach the source). ## CI evidence Failing job: https://github.com/erigontech/erigon/actions/runs/26987787453/job/79641095034 Regressing commit: b2329dc Fixes regression from erigontech#21524. --------- Co-authored-by: erigon-copilot[bot] <erigon-copilot[bot]@users.noreply.github.com> Co-authored-by: Giulio Rebuffo <giulio.rebuffo@gmail.com> Co-authored-by: yperbasis <andrey.ashikhmin@gmail.com> Co-authored-by: Andrew Ashikhmin <34320705+yperbasis@users.noreply.github.com>
Summary
TrieContext.Branchclones every returned byte slice viacommon.Copy. Profiling shows this single site is responsible for ~631 MB/s of allocated bytes at tip — about 10% of total alloc traffic and 59% of allbytes.Clonecalls — for no benefit: every downstream consumer either reads inline (cell.fillFromFieldscopies into pre-allocated arrays;merger.Merge,trie_reader,unfoldBranchNode,EncodeBranchconsume the bytes synchronously) or clones at its own queue boundary (getDeferredUpdateclones bothprefixandprevwhen storing in the deferred-update pool).Change
common.Copyatcommitment_context.go;Branchnow returns a borrowed slice valid for the lifetime of the currentComputeCommitmentcall.PatriciaContextinterface.Test_TrieContext_BranchCopiesData→Test_TrieContext_BranchBorrowsData) to verify the new aliasing guarantee.Performance
Matched-pair A/B vs main (both at base
fefd6ff92e), mainnet tip via engine-mux, each EL pinned to 4 P-cores / 8 hyperthreads on i9-13900KS, n=2,627 both-VALIDengine_newPayloadV4matched pairs over ~9 h:fefd6ff9d141cd80Per-block delta stdev = 33.73 ms, SE of mean ≈ 0.66 ms — the wall-clock delta is statistically indistinguishable from zero. 58.5% of blocks are faster on the PR side, but the magnitude is within noise.
What the PR delivers: the allocation reduction is real (~631 MB/s of
bytes.Clonetraffic eliminated, visible in pprof), but at today's GC fitness on this workload it doesn't translate to measurable newPayload latency.Correctness validation
Same 9 h / 2,627 matched-pair window: zero INVALID responses, zero status mismatches between main and the PR. The log-tail monitor flagged no
panic,bad block,state root mismatch,commitment mismatch, orwrong rootevents on the candidate.Test plan
go build ./...cleanmake lintclean./execution/commitment/...unit tests pass