Skip to content

execution/commitment: stop cloning bytes at TrieContext.Branch#21524

Merged
awskii merged 3 commits into
mainfrom
yperbasis/trie-context-branch-no-clone
Jun 4, 2026
Merged

execution/commitment: stop cloning bytes at TrieContext.Branch#21524
awskii merged 3 commits into
mainfrom
yperbasis/trie-context-branch-no-clone

Conversation

@yperbasis

@yperbasis yperbasis commented May 30, 2026

Copy link
Copy Markdown
Member

Summary

TrieContext.Branch clones every returned byte slice via common.Copy. Profiling shows this single site is responsible for ~631 MB/s of allocated bytes at tip — about 10% of total alloc traffic and 59% of all bytes.Clone calls — for no benefit: every downstream consumer either reads inline (cell.fillFromFields copies into pre-allocated arrays; merger.Merge, trie_reader, unfoldBranchNode, EncodeBranch consume the bytes synchronously) or clones at its own queue boundary (getDeferredUpdate clones both prefix and prev when storing in the deferred-update pool).

Change

  • Drop the common.Copy at commitment_context.go; Branch now returns a borrowed slice valid for the lifetime of the current ComputeCommitment call.
  • Document the new contract on the PatriciaContext interface.
  • Update the test (Test_TrieContext_BranchCopiesDataTest_TrieContext_BranchBorrowsData) to verify the new aliasing guarantee.

Performance

Matched-pair A/B vs main (both at base fefd6ff92e), mainnet tip via engine-mux, each EL pinned to 4 P-cores / 8 hyperthreads on i9-13900KS, n=2,627 both-VALID engine_newPayloadV4 matched pairs over ~9 h:

EL mean p50 p90 p99
main fefd6ff9 85.65 75.81 126.35 313.78
this PR d141cd80 85.85 75.67 125.65 315.11
Δ = main − PR −0.20 +0.14 +0.70 −1.33

Per-block delta stdev = 33.73 ms, SE of mean ≈ 0.66 ms — the wall-clock delta is statistically indistinguishable from zero. 58.5% of blocks are faster on the PR side, but the magnitude is within noise.

What the PR delivers: the allocation reduction is real (~631 MB/s of bytes.Clone traffic eliminated, visible in pprof), but at today's GC fitness on this workload it doesn't translate to measurable newPayload latency.

Correctness validation

Same 9 h / 2,627 matched-pair window: zero INVALID responses, zero status mismatches between main and the PR. The log-tail monitor flagged no panic, bad block, state root mismatch, commitment mismatch, or wrong root events on the candidate.

Test plan

  • go build ./... clean
  • make lint clean
  • ./execution/commitment/... unit tests pass
  • Matched-pair rig: no perf regression, no correctness regression over ~9 h / 2,627 payloads

profiling at tip (MDBX, 60s window) pinned bytes.Clone as the #1
single allocation source at 64.3 GB / 60s (~16.85% of total alloc
traffic). 37.86 GB / 60s (~631 MB/sec) of that came from the
defensive common.Copy at TrieContext.Branch.

The clone was redundant: every downstream consumer either reads the
slice inline (cell.fillFromFields copies into pre-allocated arrays;
merger.Merge consumes and produces a fresh buffer; trie_reader
parses bytes into cells; unfoldBranchNode similar) or clones it
itself at the queue boundary (getDeferredUpdate clones both prefix
and prev when storing in the deferred-update pool). Branch's clone
was a third copy of bytes that nothing needed to retain.

Document the new contract (borrowed slice valid for the current
ComputeCommitment scope) and update the test that exercised the old
"returns owned bytes" guarantee to verify the new aliasing
guarantee instead.

After-measurement on the rig is blocked by an unrelated stage-loop
persistence inconsistency (chaindata head pointer ahead of state
writes on restart) that's reproducing on every restart cycle today;
the change is mathematically minimal (single Clone removal +
test contract update) and unit tests + make lint pass, so shipping
on the math.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reduces allocation pressure in the commitment trie path by making TrieContext.Branch return the domain reader’s slice directly instead of cloning it, relying on callers to consume inline or clone at retention boundaries.

Changes:

  • Removed the common.Copy in TrieContext.Branch.
  • Added an implementation-level borrowed-slice contract comment.
  • Updated the unit test to assert aliasing/borrowing behavior instead of copy ownership.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
execution/commitment/commitmentdb/commitment_context.go Changes Branch to return borrowed data directly and documents the new lifetime expectation.
execution/commitment/commitmentdb/commitment_context_test.go Updates the branch ownership test to validate borrowed-slice aliasing behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread execution/commitment/commitmentdb/commitment_context.go
…e PatriciaContext interface

Address Copilot's review on PR #21524: the borrowed-slice lifetime
note was only on the concrete *TrieContext implementation; callers
that depend on the interface couldn't see it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@yperbasis yperbasis marked this pull request as ready for review May 30, 2026 08:59

@Giulio2002 Giulio2002 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM \u2014 small obvious change (~39 changed lines across 3 files)

@awskii

awskii commented Jun 1, 2026

Copy link
Copy Markdown
Member

@yperbasis longevity of the A/B test? we had an issue with branch merger before that the value we got from context while unfolding and when we do merge later, drifts.

@yperbasis

Copy link
Copy Markdown
Member Author

Ran the rig overnight (~9 h, n=2,627 matched-pair engine_newPayloadV4 at mainnet tip): zero INVALID responses, zero status mismatches between main and this PR. No drift surfaced.

chris-mercer pushed a commit to white-b0x/erigon that referenced this pull request Jun 3, 2026
…ies (erigontech#21597)

## Summary

Fixes the hive devp2p `BlobViolations` flake (`expected disconnect on
blob violation, got msg code: 24`) seen on the CI Gate run for erigontech#21524
(parallel leg) and on a runner-experiment branch the day before (serial
leg) — and the gossip-duplication regression behind it.

## Root cause

Since erigontech#21335 all per-eth-version sentry `GrpcServer`s share one
`p2p.Server` and one `PeerStore`. `SendMessageById` resolves the target
peer in the shared store, so callers that fan a by-id send out across
every sentry client — e.g. the txpool's new-peer pool sync
(`PropagatePooledTxnsToPeersList`) — now deliver the same frame once per
sentry: three identical `NewPooledTransactionHashes` messages per new
peer on the default eth/69+70+71 build. Before erigontech#21335 each sentry had
its own peer set, so the off-sentry sends were silent no-ops.

`SendMessageToAll` and `SendMessageToRandomPeers` already guard against
this with `protocolVersions.Contains(peerInfo.EthProtocol())`;
`SendMessageById` was the only outbound path without the
negotiated-version check.

The BlobViolations failure sequence, reproduced locally with wire-level
logging (hive `--sim devp2p --sim.limit eth` against an instrumented
image; failed on the second loop iteration with the exact CI error):

1. The test peers, announces two blob txs (one with a lying size/type),
delivers them, and waits for a disconnect.
2. The violating `PooledTransactions` reply sits in the txpool fetcher's
250 ms inbound batch before the announcement check kicks the peer.
3. If the 5 s `syncToNewPeersEvery` tick lands in that window, the
all-pool announcement (~70 KB — about 2000 hashes accumulated by earlier
suite tests) is written to the test peer three times:

```
08:16:32.606  inbound NPTH68            peer=de6008eb count=2
08:16:32.607  WRITE GET_POOLED_TXS_66   peer=de6008eb
08:16:32.607  sync-to-new-peers         entries=2002            <- 5s tick fires
08:16:32.608  WRITE NPTH68 70084 bytes  peer=de6008eb           <- one write
08:16:32.608  WRITE NPTH68 70084 bytes  peer=de6008eb           <- per
08:16:32.608  WRITE NPTH68 70084 bytes  peer=de6008eb           <- sentry
08:16:32.855  VIOLATION -> PenalizePeer peer=de6008eb           <- next 250ms batch flush
```

4. The devp2p test deliberately tolerates one pre-disconnect hash
announcement (go-ethereum commit 88c8459, Sept 2024, "sometimes we'll
get a blob transaction hashes announcement before the disconnect"); the
duplicated second copy arrives before the kick and fails the test.

In the failed CI run the timing lines up exactly: the txpool started at
06:33:04.64 and the test failed at 06:33:14.644 — the tick+10s sync.

With the duplication fixed, at most one announcement can precede the
disconnect in that window, which the test tolerates by design. Beyond
the test, every new peer was receiving the full pool sync in triplicate.

## Fix

Apply the same negotiated-eth-version guard in `SendMessageById` that
the other outbound paths use. Eth-protocol messages only; wit is
unaffected (it is deduplicated to a single `GrpcServer` at shared-server
construction). This also restores the routing contract documented at
`FetchBlockAccessLists` ("sentryIndex MUST be the sentry where peerID is
actually connected"): sends via a non-matching sentry are silent no-ops
again, as they were with per-sentry peer sets.

Not present on `release/3.4` (erigontech#21335 is main-only), so no backport is
needed.

## Testing

- New regression test
`TestGrpcServer_SendMessageById_SharedStore_NoDuplicateWrites` (TDD):
three `GrpcServer`s (eth/69/70/71) sharing a `PeerStore`, peer
negotiated eth/70, by-id fan-out across all three must produce exactly
one write. Red before the fix (`expected: 1, actual: 3`), green after.
- `go test ./p2p/sentry/... ./txnprovider/txpool/...` clean.
- Local hive validation in a loop: before the fix `not ok 18
BlobViolations` reproduced on iteration 2; after the fix 12/12
iterations green, with instrumented logs showing exactly one
`NEW_POOLED_TRANSACTION_HASHES_68` write per new peer where there were
three.
- `make lint` (repeatedly) and `make erigon integration` clean.
@awskii awskii added this pull request to the merge queue Jun 4, 2026
Merged via the queue into main with commit b2329dc Jun 4, 2026
169 of 171 checks passed
@awskii awskii deleted the yperbasis/trie-context-branch-no-clone branch June 4, 2026 17:40
yperbasis added a commit that referenced this pull request Jun 5, 2026
…vert

The initial revert paraphrased the PatriciaContext.Branch and TrieContext.Branch comments and trimmed one assertion from the unit test. Restore the exact pre-#21524 text and the full two-direction Test_TrieContext_BranchCopiesData, so the PR's net diff is precisely the inverse of #21524 and nothing else.
Sahil-4555 pushed a commit to Sahil-4555/erigon that referenced this pull request Jun 5, 2026
…proof/trace corruption (erigontech#21630)

**[SharovBot]**

## Problem

PR erigontech#21524 removed `common.Copy` from `TrieContext.Branch()`, claiming
the returned slice can be safely borrowed. However, mainnet RPC
integration tests show widespread failures:
- All 20 `eth_getProof` tests: diff mismatch
- 50+ `eth_simulateV1` tests: diff mismatch
- `debug_traceBlockByNumber`: malformed JSON + diff mismatch

The borrowed slice is being mutated before callers finish using it.

## Fix

Pure revert of erigontech#21524 (`b2329dc55b`). The net diff is exactly the
inverse of that commit and touches nothing else — equivalent to `git
revert b2329dc`. It:
- restores `common.Copy(enc)` in `TrieContext.Branch()` so each returned
slice is independent;
- restores the original `PatriciaContext.Branch` interface doc and
`TrieContext.Branch` comment;
- restores `Test_TrieContext_BranchCopiesData` in full, including the
original two-direction independence assertions (source mutation does not
reach the returned slice, and mutating the returned slice does not reach
the source).

## CI evidence

Failing job:
https://github.com/erigontech/erigon/actions/runs/26987787453/job/79641095034
Regressing commit: b2329dc

Fixes regression from erigontech#21524.

---------

Co-authored-by: erigon-copilot[bot] <erigon-copilot[bot]@users.noreply.github.com>
Co-authored-by: Giulio Rebuffo <giulio.rebuffo@gmail.com>
Co-authored-by: yperbasis <andrey.ashikhmin@gmail.com>
Co-authored-by: Andrew Ashikhmin <34320705+yperbasis@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants