p2p/sentry: share one p2p.Server across all eth protocols by lystopad · Pull Request #21335 · erigontech/erigon

lystopad · 2026-05-21T10:41:28Z

Summary

Erigon now opens a single TCP listener on --port (30303 by default) carrying every configured eth protocol version, instead of one listener per protocol on 30303/30304/30305.

This fixes a discovery-DHT race that left inbound peers stuck at a fraction of --maxpeers for multi-protocol deployments.

The bug

Each entry in ProtocolVersion (today: [ETH69, ETH70, ETH71] on main) used to back its own p2p.Server bound to its own port from --p2p.allowed-ports. All those Servers share the same node key, so each one signed an ENR for the same Node ID but advertising its own port. The discv4 DHT keeps one ENR per Node ID — the one with the highest seq — so the Servers raced at startup and only one of the ENRs survived in the network's view.

Concrete repro on a live Hoodi node with maxpeers: 80 and ProtocolVersion=[ETH69, ETH68]:

ENR seq numbers from erigon_nodeInfo are within a few milliseconds of each other.
Whichever protocol won the race got 50+ inbound; the other got 0-4.
Total peers plateaued around 30 instead of 80; outbound was identical between the two nodes I compared (capped at MaxPeers/DialRatio), so the entire delta was on the inbound side.

With three protocols (ETH69/70/71 on main) the race is three-way and a node can spend the whole session advertising whichever port is least useful.

The fix

node/components/sentry.Provider now:

Creates one sentry.GrpcServer per protocol (so the existing MultiClient / protocol-routing keeps working).
Collects the union of every GrpcServer's Protocols (dedup'd by name+version so wit isn't registered N times).
Builds one p2p.Server with that union and starts it on a single port.
Injects that shared Server back into every GrpcServer via a new SetP2PServer(srv) hook.

Result: one Node ID, one ENR, one TCP listener, one UDP discovery — multiple protocols multiplexed per peer connection (which is how p2p.Server was designed to work in the first place, and how geth has always done it).

Side fixes that fall out of the share

GrpcServer.Close() leaves an externally-injected Server alone (external == true skips Stop) so one sentry shutting down doesn't tear the listener out from under the others; Provider.Close owns stopping the shared server.
A shared sentry.PeerStore is wired into every GrpcServer so wit/0 (deduped to a single sentry) and the negotiated eth/* (on a different sentry per peer) see the same PeerInfo and share its eth-ready signal. The field uses atomic.Pointer[PeerStore] so the swap is race-free.
pi.protocol / pi.witProtocol reads go through locked accessors (EthProtocol(), WitProtocol()); the writers (SetEthProtocol, SetWitProtocol) already hold pi.lock. Verified with go test -race.
pi.rw is split into per-subprotocol pi.ethRw / pi.witRw so writePeer can route by sentryproto.MessageId (globally unique) — necessary because eth and wit reuse low msgcodes (e.g. 0x01 is both eth.NewBlockHashesMsg and wit.NewWitnessHashesMsg).
Per-sentry Peers() / SimplePeerCount() filter by the GrpcServer's own eth version (cached in ss.ethVersion), so the shared PeerStore doesn't N-fold-duplicate admin_peers aggregation.
node/eth.NodesInfo deduplicates by Enode — every sentry now returns identical NodeInfo, so the dedup is what keeps admin_nodeInfo from listing the same node N times.
NodeDatabase moves from per-protocol subdirs (nodes/eth68, nodes/eth69, …) to a single nodes/eth. Existing per-protocol dirs become inert on upgrade — peer discovery rebuilds from bootnodes within a few minutes.

User-visible changes

Port surface shrinks: only --port is opened now, not --port, --port+1, --port+2. Firewall / monitoring / Kubernetes Service rules that explicitly opened 30304 and 30305 can drop those entries.
--p2p.allowed-ports flag removed. Drop it from CLI args / config files; passing it now errors. --port is the only knob.
--maxpeers caps total connections honestly. Before this fix, with N protocols enabled, each per-protocol Server enforced maxpeers independently, so the actual ceiling was ~N×maxpeers. The new ceiling matches what the flag's docs say.
Default --maxpeers bumped 32 → 64 to compensate for the now-honest cap. (With 3 eth versions and the old multiplication, the effective ceiling was ~96; geth defaults to 50; 64 lands above geth with a small headroom.)
Standalone sentry binary (cmd/sentry) and --sentry.api.addr (remote sentry over gRPC) are functionally unchanged — neither had the bug. The cmd/sentry flag surface shrinks alongside the main binary: it no longer accepts --p2p.allowed-ports.
First run after upgrade loses the warm peer cache in nodes/eth{68,69,…}/; nothing on disk is deleted, the dirs are just no longer read. Discovery refills from bootnodes within a few minutes.

Why not the smaller fixes I considered

Force seq ordering so the highest protocol always wins the DHT race — still publishes N ENRs, still wastes ports, still N-fold counts maxpeers.
Disable discovery on all-but-one Server — kills inbound to the non-primary protocols entirely (fine when eth/68 is being deprecated, not fine for ETH69/70/71 living concurrently).

The shared-Server route is the structurally clean one and matches geth's model.

Verification

make erigon integration builds clean.
make lint clean (×3 — non-deterministic per CLAUDE.md).
go test ./p2p/sentry/... ./node/components/sentry/... ./node/eth/... all green.
go test -race ./p2p/sentry/ ./node/components/sentry/ clean.
Manual repro of the original bug confirmed: ENR seqs from both sentries decoded; in_30303=0, in_30304>0 consistently after a restart that lost the eth/69 race. With this patch applied the race goes away because there's only one ENR.

Test plan

Sync from snapshots on a public testnet (Hoodi / Sepolia) and confirm --maxpeers is reached.
Confirm only --port shows up in netstat -ltnp | grep erigon (not --port+1/+2).
Confirm admin_peers returns each peer once (no len(ProtocolVersion) duplication).
Confirm admin_nodeInfo shows one enode (one port).
Confirm standalone sentry binary still starts: ./build/bin/sentry --datadir=… --chain=….
Confirm --sentry.api.addr=… mode still connects to an external sentry process.

🤖 Generated with Claude Code

…cols Before: each requested eth protocol version backed its own p2p.Server bound to its own TCP/UDP port from --p2p.allowed-ports (30303/30304/30305 by default), all sharing the same node key. The discovery DHT keeps one ENR per Node ID (highest seq wins), so the per-Server ENRs raced at startup and only one survived — peers ended up dialing whichever port that one ENR advertised and inbound on the others stuck at zero. On a Hoodi node with maxpeers=80 and ProtocolVersion=[ETH69,ETH68] inbound on the eth/69 listener was 4-14 while the eth/68 listener got the attention; total peers plateaued near 30. With ETH69/70/71 on main the race is three-way. After: the node opens ONE TCP listener (and one UDP discovery socket) on --port (30303 by default) that carries every configured eth protocol version and the wit sideprotocol multiplexed per peer connection. One Node ID, one ENR, one port — no race. How it's wired: node/components/sentry.Provider builds the union of every GrpcServer's Protocols (dedup'd by name+version so wit isn't registered N times) and injects a single shared p2p.Server back into each GrpcServer via a new SetP2PServer hook. Side fixes that fall out of the share: - GrpcServer.Close() leaves an externally-injected Server alone so one sentry shutting down doesn't tear the listener out from under the others. - Peers() / NodeInfo() on non-reporter sentries return empty so admin_peers aggregation across sentries doesn't N-fold duplicate every entry. - NodeDatabase moves from per-protocol subdirs (nodes/eth68, nodes/eth69, …) to a single nodes/eth — existing per-protocol dirs become inert on upgrade and peer discovery rebuilds from bootnodes within a few minutes. User-visible: - One TCP port instead of three; firewall / monitoring setups that opened 30304/30305 specifically can drop those rules. - --maxpeers now caps total connections honestly (was effectively N×cap). - Standalone sentry binary and --sentry.api.addr (remote sentry) paths are unchanged; neither had the bug. Co-Authored-By: Claude

Copilot

Pull request overview

This PR changes Erigon’s local sentry wiring to run one shared p2p.Server (single ENR / Node ID / TCP listener) across all configured ETH protocol versions, instead of one p2p.Server per protocol/port. This addresses the discovery ENR “last writer wins” race when multiple Servers sign ENRs with the same Node ID but different ports.

Changes:

Add a GrpcServer.SetP2PServer(...) hook to inject an externally-managed, shared p2p.Server, plus lifecycle/peer-reporting controls for shared-server mode.
Update the sentry Provider to build per-protocol GrpcServer instances, merge/dedupe their p2p.Protocols, start one shared p2p.Server, and inject it into all GrpcServers.
Unify node discovery DB layout from per-protocol subdirs to a single nodes/eth directory in local multi-protocol mode.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File	Description
p2p/sentry/sentry_grpc_server.go	Adds shared-server injection (`SetP2PServer`), external ownership behavior in `Close`, peer-reporting gating for `Peers`/`NodeInfo`, and exports DNS discovery setup.
node/components/sentry/provider.go	Builds a shared `p2p.Server` across protocol GrpcServers, merges/dedupes protocol registrations, and centralizes node DB + listen-port selection.

Comments suppressed due to low confidence (1)

node/components/sentry/provider.go:405

GrpcServer.Close intentionally does not Stop() externally-injected p2p.Servers. Since the Provider is the coordinator creating this shared server, it should retain a reference and Stop() it in Provider.Close; otherwise Close can leave the TCP listener and P2P goroutines running unless the caller also cancels SentryCtx.

	for i, ss := range p.Servers {
		// First sentry (highest configured protocol version) reports peers.
		ss.SetP2PServer(srv, i == 0)
	}
	return nil

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…iring Follow-ups to the shared p2p.Server refactor in the previous commit, addressing Copilot review feedback on PR #21335. - Provider.Close now stops the shared p2p.Server. GrpcServer.Close is a no-op for externally-injected Servers (by design — the coordinator owns lifecycle), so without this the listener + discovery goroutines outlive Provider.Close until SentryCtx is cancelled. - Protocol.Run waits up to 10s for the first SetStatus instead of disconnecting immediately when ss.statusData is still nil. The shared Server's listener comes up in Initialize; the multi-client's first SetStatus arrives later via StartStreamLoops in Provider.Start. A new GrpcServer.statusReady channel is closed on the first SetStatus, and awaitStatus selects on it / the timeout / ctx.Done. Closes the startup window where peers would have bounced off DiscProtocolError. - buildSharedP2PConfig now errors out when every entry in AllowedPorts is busy. The legacy per-protocol loop had this guard; my rewrite dropped it and would silently keep the last listenPort and fail on bind. - buildSharedP2PConfig probes port availability against 127.0.0.1 instead of an empty host. checkPortIsFree dials its target, and the common ":30303" form would yield "missing host" -> dial fails -> port reported free even when actually in use. The bug pre-existed in the legacy per-protocol loop too; cleaning it up here while the surrounding code is in motion. The ListenAddr returned to the Server keeps the original empty host so the listener still binds on all interfaces. Co-Authored-By: Claude

Adds focused unit tests covering the new shared-Server wiring that landed in this PR, addressing the two test-coverage items from the Copilot review: p2p/sentry: - SetP2PServer flips external/reportsPeers correctly and panics on a second injection (ownership is decided up front). - Peers() / NodeInfo() on a non-reporter sentry return empty replies so admin_peers aggregation across sentries doesn't N-fold every entry. - Close() leaves an externally-injected Server running; Close() on a self-owned Server stops it. - awaitStatus returns immediately when statusData is already set, times out cleanly when it isn't, and unblocks as soon as SetStatus stores the first non-nil status — the startup-window fix in Protocol.Run. node/components/sentry: - buildSharedP2PConfig points NodeDatabase at nodes/eth, honours an explicit ListenAddr when AllowedPorts is empty, picks the first free AllowedPorts entry, errors when all are busy, and probes 127.0.0.1 when the configured host is empty (":port" form). - startSharedP2PServer dedupes Protocols by name+version, wires the shared p2p.Server into every GrpcServer, and marks only the first as the peer-list reporter. - Provider.Close stops the shared p2p.Server it created (the GrpcServer side leaves it alone by design). Also exposes GrpcServer.IsPeerReporter for observability — the dedup test in the components/sentry package needs to inspect which sentry got marked as the reporter, and it lives in a different package than the unexported field. Co-Authored-By: Claude

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

yperbasis

Overview

Fixes a real bug where erigon's local-mode sentry opened one p2p.Server per eth protocol version. Each Server signed its own ENR under the same Node ID but with a different listener port, racing in the discovery DHT — only the highest-seq ENR survives, so peers dial the "wrong" port and inbound peer count stalls at a fraction of --maxpeers.

The fix moves to geth's model: one p2p.Server with the union of all protocol caps multiplexed per peer connection. Side-effect cleanups (admin_peers de-duplication, MaxPeers honesty, externally-owned-Server lifecycle) fall out naturally.

Stats: +642 / −50, 4 files. Scope is well-contained to the sentry component.

What's strong about this PR

Real bug with measurable impact. The PR body documents the discovery DHT race precisely, with concrete millisecond-level ENR seq timing and a Hoodi repro. Not a speculative refactor.
Architectural alignment. Matches geth's design (one Server, multiplexed protocols). Reduces port surface and removes a class of races.
Clean separation. buildSharedP2PConfig (pure) and startSharedP2PServer (lifecycle) are each small and testable; tests in provider_test.go exercise both directly. SetP2PServer on GrpcServer is a minimal, well-documented injection hook.
Backwards-compatible. Standalone cmd/sentry and remote --sentry.api.addr mode are untouched — they continue to use the lazy-server path inside SetStatus. The new statusReady channel doesn't change behavior for those (the constructor still creates it; awaitStatus returns immediately if status is already set).
Good test coverage. Eleven new tests across two packages, including: dedup, reporter wiring, idempotent SetP2PServer panic, external-server lifecycle, owned-server lifecycle, awaitStatus fast/timeout/unblock paths, empty-host port probing, all-ports-busy error path. All pass under go test -race.

I ran the full ./node/components/sentry/... and ./p2p/sentry/... test packages locally with -race -count=1 — all pass cleanly.

Concerns / suggestions

1. Doc/code mismatch: which sentry reports peers (`p.Servers[0]`)

The PR description and the startSharedP2PServer doc comment both claim:

"Exactly one GrpcServer is marked as the peer-list reporter (the first one in p.Servers, which matches the highest entry in ProtocolVersion)"

But nodecfg/defaults.go:51 has:

ProtocolVersion: []uint{direct.ETH69, direct.ETH70, direct.ETH71},

— lowest first, highest last. So p.Servers[0] is ETH69, not the highest. The new test in provider_test.go puts eth/71 as first, which is the opposite of the production default ordering.

Functionally this doesn't matter — p2pServer.PeersInfo() returns the global list regardless of which sentry calls it. But the doc comment will mislead a reader. Either fix the comment ("the first configured protocol version, typically the lowest") or, if "highest" is intentional, sort p.Servers by Protocols[0].Version desc before the i == 0 reporter pick. The latter is cleaner if you care about which goodPeers map the wit/0 traffic lands in (see #4 below).

2. Unlocked reads of `external` / `reportsPeers` in `Peers()` / `NodeInfo()`

p2p/sentry/sentry_grpc_server.go Peers() and NodeInfo() read these fields without holding p2pServerLock:

p2pServer := ss.getP2PServer()    // takes RLock, releases it
if p2pServer == nil { ... }
if ss.external && !ss.reportsPeers {   // unlocked read
    return &sentryproto.PeersReply{}, nil
}

SetP2PServer writes both fields under Lock. In practice this is fine because SetP2PServer is called once during Initialize, strictly before any gRPC handler can fire — there's a happens-before edge through the multi-client wiring. go test -race didn't flag it because the tests don't drive that race.

But it's a latent risk for future changes (e.g., a hot-reconfig path). Cheap fix: capture both fields inside getP2PServer (rename to return a triple, or add a sibling getServerState() (*p2p.Server, bool, bool)).

3. `SetP2PServer` panics on double-call

if ss.p2pServer != nil {
    panic("sentry.GrpcServer: SetP2PServer called when p2pServer is already set")
}

The invariant ("ownership decided up front") is sound, but a panic in node startup is harsh. An error return would let the provider fail Initialize cleanly with the same diagnostic. Not a blocker — the panic message is descriptive — but worth reconsidering.

4. `wit/0` dedup picks the first server's handler

In startSharedP2PServer, the dedup keeps the first wit/0 Protocol encountered (currently from p.Servers[0], i.e. the ETH69 sentry by default ordering). So every wit-protocol peer's runWitPeer runs on the ETH69 sentry's goodPeers map — even peers that negotiate eth/71 with the ETH71 sentry. That peer ends up in p.Servers[0].goodPeers with protocol=0, witProtocol=0 AND in p.Servers[2].goodPeers with protocol=71.

Effect on runPeerCountLogger: such peers get counted under protocol=0 in the ETH69 sentry's bucket and under their actual eth version in the highest sentry's bucket — double-counted across protocols in the log line. admin_peers is unaffected (gated to the reporter, uses PeersInfo()). Cosmetic logging quirk, not a correctness issue. Worth a one-line note in the comment, or — more cleanly — bind the deduped wit Protocol to the reporter sentry so wit state co-locates with the global peer view.

5. `awaitStatus`: 10s timeout is reasonable but undiscoverable

awaitStatusTimeout = 10 * time.Second is a package-private constant. It governs how long an inbound peer's Protocol.Run waits for the multi-client to broadcast its first SetStatus. The window between the listener coming up (in Initialize) and the multi-client calling SetStatus (after BuildMultiClient + Start) is normally sub-second, so 10s is plenty.

But if anything blocks the startup path (slow chain init, large snapshot reindex, debugger), peers connecting in the window get a silent DiscProtocolError. Worth a debug-level log on timeout so the operator can correlate "no inbound for the first N seconds" with the cause. The current code drops to nil silently and the caller logs PeerErrorLocalStatusNeeded — which doesn't make the timeout vs. the actual "core didn't send status" case distinguishable.

6. `BootstrapNodes` mutation lives in the wrong layer

startSharedP2PServer does:

if cfg.BootstrapNodes == nil && len(chainBootnodes) > 0 {
    bootstrapNodes, err := enode.ParseNodesFromURLs(chainBootnodes)
    ...
    cfg.BootstrapNodes = bootstrapNodes
    cfg.BootstrapNodesV5 = bootstrapNodes
}

This duplicates the exact logic in p2p/sentry/sentry_grpc_server.go makeP2PServer. Since you're already building the p2p.Server yourself in startSharedP2PServer, you can either:

Reuse makeP2PServer (export it, or extract a buildBootstrapNodes helper), or
Pull the DNS-discovery / bootnodes resolution into a single shared helper that both makeP2PServer and startSharedP2PServer call.

The current duplication is a future-divergence hazard.

7. Minor: `awaitStatus` doesn't return `nil` on `ctx.Done()` deterministically

select {
case <-ss.statusReady:
case <-time.After(maxWait):
case <-ss.ctx.Done():
}
return ss.GetStatus()

On ctx.Done(), the function returns whatever GetStatus() is — which could be non-nil if SetStatus ran concurrently and ctx.Done() won the select. Probably fine (we just return the latest status either way), but the doc comment says "returns nil on timeout so the caller can disconnect the peer" without mentioning the ctx.Done case. Tighten the comment or explicitly return nil when ctx is done.

8. Test gap: end-to-end `Initialize` with the new shared-Server path

The new unit tests cover buildSharedP2PConfig and startSharedP2PServer in isolation, but no test exercises Provider.Initialize end-to-end in local mode (which is what the bug actually lived in). I understand that requires a full ChainDB + chainspec stub which is non-trivial, but it would be the strongest regression guard for the original DHT-race scenario. At minimum, a test that asserts after Initialize with len(ProtocolVersion) >= 2:

p.sharedP2PServer != nil
len(p.Servers) == len(ProtocolVersion)
p.Servers[0].GetP2PServer() == p.Servers[1].GetP2PServer() == p.sharedP2PServer
p.Servers[0].IsPeerReporter() == true && p.Servers[1].IsPeerReporter() == false

would catch any future regression that breaks the wiring without a full network stand-up.

9. Backport plan

The PR notes that release/3.4 will need a separate cherry-pick PR landing the fix directly in node/eth/backend.go (no sentryProvider abstraction there). Given this is a peer-connectivity bug affecting users on 3.4 right now, that backport should be tracked as a follow-up immediately after merge. Per CLAUDE.md conventions, the title prefix is [r3.4].

Security / correctness

Nothing concerning. The new code:

Doesn't introduce new network endpoints — actually closes off two (--port+1, --port+2).
Doesn't change cryptographic material (same Node ID, same ENR signing).
Doesn't loosen any peer admission checks.
Bounds the new awaitStatus window so a stuck startup can't cause unbounded blocking.

Performance

One less p2p.Server per protocol version: less goroutine overhead, one discovery instance instead of N.
One DNS-discovery iterator per eth/N protocol still (created in each NewGrpcServer) — same DNS list. Could be deduped to one DNS iterator shared across protocols, but it's not visibly costly and matches geth's per-cap-iterator pattern. Out of scope for this PR.

Verdict

Solid PR. The core change is correct, well-tested, and addresses a documented production bug. My concerns are all polish (doc accuracy, future-proofing locking, factoring duplicated logic). Recommend:

Fix the "highest entry in ProtocolVersion" comment in provider.go and the PR body — it's the first configured entry (which is the lowest in the default).
Add a one-line debug log when awaitStatus times out.
(Optional but nice) Capture external/reportsPeers under p2pServerLock in Peers/NodeInfo for hygiene.
Track the [r3.4] backport as a follow-up.

Items 5-8 above are nice-to-haves that can land later. Nothing in this PR blocks merge.

Findings from Copilot (3) and yperbasis (8). The Copilot findings on Peers routing, empty NodeInfo, and close(statusReady) are real; yperbasis identified the matching doc/code mismatches and a few polish items. 1. Peers() / message routing (Copilot #1). The multi-sentry client uses Peers() to decide which sentry owns each peer and routes SendMessageById via that sentry's gRPC. With the previous reporter-only gating, every peer mapped to Servers[0] (which is the *lowest* configured protocol, ETH69 in the default — yperbasis #1 noted the comment said "highest"). Servers[0]'s goodPeers doesn't have eth/70 or eth/71 peers, so SendMessageById silently no-op'd. Fix: each GrpcServer.Peers() now reports its own goodPeers, filtered to skip entries where both protocol and witProtocol are zero. Each peer ends up in exactly one eth-sentry's goodPeers (the negotiated version) plus, at most, the sentry hosting the wit sideprotocol, so admin_peers aggregation is naturally non-duplicating and routing is correct. 2. NodeInfo() (Copilot #3). Non-reporters returning empty replies polluted admin_nodeInfo with blank entries that sorted first. With the shared p2p.Server every sentry has the same Node ID and the same enode, so they now return identical NodeInfo. node/eth.NodesInfo deduplicates by Enode before sorting. 3. SetStatus's close(ss.statusReady) (Copilot #2) panicked for callers that construct GrpcServer outside NewGrpcServer (existing TestSentryServerImpl_* does this). Guarded the close with a nil check; awaitStatus tolerates a nil channel via the select's other cases. 4. SetP2PServer returns an error instead of panicking on double-call (yperbasis #3). The "ownership decided up front" invariant is still enforced, just propagated up the Provider.Initialize path cleanly. 5. SimplePeerCount filters protocol=0 ghosts (yperbasis #4). With wit/0 deduped to one sentry, peers that negotiate eth/N on a different sentry end up as protocol=0/witProtocol=0 ghosts on the wit-hosting sentry. Counting them would emit a bogus eth.ProtocolToString[0] bucket in the GoodPeers log. The Peers() filter already drops them from admin_peers; this aligns SimplePeerCount. 6. awaitStatus logs a Debug line when its timeout fires (yperbasis #5) so operators can tell "core didn't send status in time" from "core never tried" when the caller disconnects the peer with PeerErrorLocalStatusNeeded. Doc comment clarifies the ctx.Done case too (yperbasis #7). Drops the reportsPeers flag and IsPeerReporter accessor — no longer needed once per-sentry goodPeers replaces the reporter gating. SetP2PServer signature loses its second argument. Tests updated for the new shape; new TestGrpcServer_PeersReturnsPerSentryGoodPeers (per-sentry view + ghost-entry filter) and TestGrpcServer_SetStatus_NilStatusReadyIsSafe (close-nil guard). Deferred to follow-up: - BootstrapNodes/DNS resolution helper to dedupe logic between makeP2PServer and startSharedP2PServer (yperbasis #6). - End-to-end Provider.Initialize test in local mode (yperbasis #8). - [r3.4] backport PR (yperbasis #9). Co-Authored-By: Claude

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Three new findings, all real lifecycle / probing edge cases. 1. SetP2PServer accepted a nil *p2p.Server and still flipped external=true. A subsequent SetStatus would then take the lazy path and build its own Server, but Close() would see external=true and skip Stop() — leaking the listener and discovery goroutines. Reject nil at the door. 2. startSharedP2PServer leaked the shared p2p.Server on inject failure: srv.Start succeeds and p.sharedP2PServer is assigned, but if any SetP2PServer call later errors, the partial Server was left running when Initialize returned. Stop it and clear the field on that path. 3. checkPortIsFree only handled the empty-host form of an unspecified bind address. 0.0.0.0:N and [::]:N would also DialTimeout-fail and read as "port is free" even when a listener was actually bound on all interfaces. Add a loopbackProbeHost helper that maps all four unspecified forms (empty, 0.0.0.0, ::, [::]) to a concrete loopback target so the probe actually exercises the listener; ListenAddr itself keeps the original host. Tests added: - TestGrpcServer_SetP2PServer_RejectsNil — nil rejection + external flag invariant. - TestLoopbackProbeHost — table-driven coverage of the four unspecified forms and concrete-host passthrough. Co-Authored-By: Claude

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Copilot review on PR #21335: with dedupe-by-Enode in place, the previous for-i loop over the first `limit` sentries could return fewer than `limit` unique entries if duplicates appeared early. Treat `limit` as the cap on *unique* results and keep iterating the rest of the sentry list past duplicates so a hybrid layout (some sentries sharing a p2p.Server, others external) can still fill the cap. Co-Authored-By: Claude

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

CI failed on windows-2025 with: expected: "/data/nodes/eth" actual: "\\data\\nodes\\eth" TestBuildSharedP2PConfig_NodeDatabaseUnified hardcoded a forward-slash expectation, but filepath.Join uses OS-native separators. Build the expected value with filepath.Join from a t.TempDir base so the assertion holds on Windows too. Co-Authored-By: Claude

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (3)

p2p/sentry/sentry_grpc_server.go:1212

The “already connected” checks later in getOrCreatePeer read existingPeerInfo.protocol / existingPeerInfo.witProtocol without taking existingPeerInfo.lock, while those fields are written under that lock (e.g. SetEthProtocol). This is a Go data race and can make the one-connection-per-protocol guard flaky under concurrent handshakes. Please read these fields under PeerInfo.lock (or via locked accessors).

		peerInfo := NewPeerInfo(peer)
		ss.peers.peers[peerID] = peerInfo
		return peerInfo, nil
	}

p2p/sentry/sentry_grpc_server.go:1683

peerInfo.protocol is read here without taking peerInfo.lock, but it’s written under that lock (via SetEthProtocol). Since Peers() can be called concurrently with ongoing handshakes, this is a Go data race. Consider reading the protocol once under peerInfo.lock and using that local value for filtering.

	ss.rangePeers(func(peerInfo *PeerInfo) bool {
		if peerInfo.protocol == 0 || peerInfo.protocol != myVersion {
			return true
		}

p2p/sentry/sentry_grpc_server.go:1716

SimplePeerCount reads peerInfo.protocol without peerInfo.lock, but the field is written under that lock. This can show up as a race under load (RPC calls vs handshake goroutines). Read it once under peerInfo.lock and use the local value for both the filter and the increment.

	ss.rangePeers(func(peerInfo *PeerInfo) bool {
		// Per-sentry counting: each sentry reports only the peers whose
		// negotiated eth version matches its own. Mirrors Peers(). Without
		// this filter the Provider's runPeerCountLogger would aggregate
		// every peer once per sentry (shared PeerStore makes the map
		// visible to all sentries).
		if peerInfo.protocol == 0 || peerInfo.protocol != myVersion {
			return true
		}
		counts[peerInfo.protocol]++
		return true

Round-10 Copilot review (3 "low confidence" comments suppressed in the UI, but all real): pi.protocol and pi.witProtocol are written under pi.lock (SetEthProtocol, the wit Run handover) but read in several places without the lock — Go data races under concurrent handshakes vs admin_peers / SimplePeerCount / the message routers. Add locked accessors: - EthProtocol() uint — mirrors SetEthProtocol - SetWitProtocol(uint) — new setter; wit Run now uses it - WitProtocol() uint Replace bare reads at: - getOrCreatePeer's "already connected" checks (one-connection-per- protocol guard) - Peers() filter - SimplePeerCount filter+count - SendMessageById error message - SendMessageToRandomPeers protocol-version filter - SendMessageToAll protocol-version filter Also dropped a duplicate `peerInfo.protocol = protocol` write in the eth Run — SetEthProtocol on the line above already stored it under pi.lock; the bare write was a second, unsynchronised mutation. Verified with `go test -race ./p2p/sentry/`. Co-Authored-By: Claude

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

ss.peers becomes atomic.Pointer[PeerStore] so SetSharedPeerStore is race-free under the documented "swap before srv.Start" contract — a bare field write would trip the race detector if anyone ever reshuffled the call ordering. Readers Load() once and use the inner mutex as before. Cache the eth protocol version in GrpcServer.ethVersion (set in NewGrpcServer) so Peers / SimplePeerCount don't rescan ss.Protocols on every call; the ethProtocolVersion() helper goes away. Both filters now combine the cached version with #88bae21e's locked EthProtocol() snapshot. writePeer's nil-rw drop branch now emits a Trace log carrying the (protocol, msgID, peerID) so an operator chasing a missing message has a fingerprint to grep for instead of guessing. startSharedP2PServer validates that every GrpcServer has a nil p2pServer before mutating any of them — defends against a future change that makes SetP2PServer fail mid-loop and leaves half the sentries holding the shared PeerStore. Verified with `go test -race ./p2p/sentry/ ./node/components/sentry/` and `make lint`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rs to 64 Follow-up to the single-shared-Server change. With one p2p.Server carrying all eth versions on --port, --p2p.allowed-ports (which selected one TCP port per protocol from a list) no longer means anything. Removing it: - Drops the P2pProtocolAllowedPorts CLI flag and its registration in node/cli/default_flags.go. - Drops the allowedPorts parameter from utils.NewP2PConfig and the flag wiring in cmd/sentry/main.go. - Drops the AllowedPorts field from p2p.Config. - Replaces the AllowedPorts fallback loop in buildSharedP2PConfig with a straight ListenAddr passthrough; if --port is busy the bind now fails fast (matches geth). - Removes the obsolete port_util.go helpers (checkPortIsFree, loopbackProbeHost, splitAddrIntoHostAndPort) and the four AllowedPorts/probe-host tests in provider_test.go. - Strips AllowedPorts from three test helpers (polygon miner, engine api tester, txpool p2p client) — they relied on []uint{0} as an ephemeral hint, but ListenAddr already does that. Also bumps default MaxPeers 32 -> 64. Pre-fix with 3 eth protocol Servers the effective cap was ~3*32 = 96; geth defaults to 50. 64 lands above geth with a small headroom so existing operators don't see a sharp drop after upgrading. Also fixes the stale per-protocol qualifiers in flag Usage strings (MaxPeers, MaxPendingPeers) and the NodesDir doc comment that still referenced eth68/eth69 subdirectories. Docs: - README.md port table collapses sentry 30303/30304 -> 30303. - docs/site/docs/fundamentals/default-ports.md table + Sentry CLI section updated; --p2p.allowed-ports bullet removed. - docs/site/docs/fundamentals/multiple-instances.md port-allocation table updated. - ChangeLog.md gets a [3.5.0] Breaking Changes entry with a before/ after table and migration bullets. Verified with `go test -race ./p2p/sentry/ ./node/components/sentry/`, `make lint`, and `make erigon integration`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The flag is gone (see prior commit) — the three skill files that documented or scripted it (erigon-network-ports, erigon-ephemeral, launch-devnet) would otherwise tell agents to pass a flag that errors at startup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 1 comment.

The docs-site CI step `python3 docs/site/scripts/generate-llms.py --check` caught that the committed llms-full.txt artifacts drifted from regenerated content after the default-ports.md / multiple-instances.md edits in the prior commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot review (#21335 r3288296443): SetStatus is a gRPC entrypoint exposed over the wire by cmd/sentry, but it dereferences statusData.ForkData unconditionally (genesisHash on the entry, and HeightForks/TimeForks for the ENR entry builder). A malformed payload with nil ForkData would crash the sentry via nil-pointer panic. Guard at the door: return a clear error if either statusData or statusData.ForkData is nil. Test covers both shapes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…s, findPeerByMinBlock, PeerEvents (erigontech#21394) Follow-up fixes for erigontech#21335 (merged). ## Bugs fixed With the shared `PeerStore` introduced in erigontech#21335, every `GrpcServer` can see **all** peers regardless of which eth protocol version they negotiated. Three `rangePeers` loops were missing the per-sentry version filter that `Peers()` and `SimplePeerCount()` already apply correctly: ### 1. `findBestPeersWithPermit` / `findPeerByMinBlock` Used by `SendMessageByMinBlock` to select target peers. Without the filter, an eth/71 sentry could pick an eth/69 peer: - `messageCode` encodes the request with **eth/71** wire codes. - `writePeer` sends it over the eth/69 peer's `ethRw`. - The remote eth/69 peer receives an unrecognised (or misinterpreted) code and disconnects. Contrast: `SendMessageToRandomPeers` and `SendMessageToAll` already filter with `protocolVersions.Contains(peerInfo.EthProtocol())`. ### 2. `PeerEvents` replay The replay pass at the start of `PeerEvents` iterated the entire shared store and emitted a `Connect` event for every peer. Subscribers for eth/68 would receive `Connect` events for eth/69 and eth/71 peers. Those peers never generate a `Disconnect` on the eth/68 sentry's `peersStreams`, so the subscriber accumulates ghost peers that are never cleaned up. ## Fix Apply the same two-condition guard already used by `Peers()` and `SimplePeerCount()`: ```go if pv := peerInfo.EthProtocol(); pv == 0 || pv != ss.ethVersion { return true } ``` ## Tests Three new tests added to `sentry_grpc_server_test.go`: - `TestGrpcServer_FindBestPeersWithPermit_FiltersVersion` - `TestGrpcServer_FindPeerByMinBlock_FiltersVersion` - `TestGrpcServer_PeerEvents_ReplayFiltersByVersion` All existing tests pass. `make lint` clean ×2.

…ies (erigontech#21597) ## Summary Fixes the hive devp2p `BlobViolations` flake (`expected disconnect on blob violation, got msg code: 24`) seen on the CI Gate run for erigontech#21524 (parallel leg) and on a runner-experiment branch the day before (serial leg) — and the gossip-duplication regression behind it. ## Root cause Since erigontech#21335 all per-eth-version sentry `GrpcServer`s share one `p2p.Server` and one `PeerStore`. `SendMessageById` resolves the target peer in the shared store, so callers that fan a by-id send out across every sentry client — e.g. the txpool's new-peer pool sync (`PropagatePooledTxnsToPeersList`) — now deliver the same frame once per sentry: three identical `NewPooledTransactionHashes` messages per new peer on the default eth/69+70+71 build. Before erigontech#21335 each sentry had its own peer set, so the off-sentry sends were silent no-ops. `SendMessageToAll` and `SendMessageToRandomPeers` already guard against this with `protocolVersions.Contains(peerInfo.EthProtocol())`; `SendMessageById` was the only outbound path without the negotiated-version check. The BlobViolations failure sequence, reproduced locally with wire-level logging (hive `--sim devp2p --sim.limit eth` against an instrumented image; failed on the second loop iteration with the exact CI error): 1. The test peers, announces two blob txs (one with a lying size/type), delivers them, and waits for a disconnect. 2. The violating `PooledTransactions` reply sits in the txpool fetcher's 250 ms inbound batch before the announcement check kicks the peer. 3. If the 5 s `syncToNewPeersEvery` tick lands in that window, the all-pool announcement (~70 KB — about 2000 hashes accumulated by earlier suite tests) is written to the test peer three times: ``` 08:16:32.606 inbound NPTH68 peer=de6008eb count=2 08:16:32.607 WRITE GET_POOLED_TXS_66 peer=de6008eb 08:16:32.607 sync-to-new-peers entries=2002 <- 5s tick fires 08:16:32.608 WRITE NPTH68 70084 bytes peer=de6008eb <- one write 08:16:32.608 WRITE NPTH68 70084 bytes peer=de6008eb <- per 08:16:32.608 WRITE NPTH68 70084 bytes peer=de6008eb <- sentry 08:16:32.855 VIOLATION -> PenalizePeer peer=de6008eb <- next 250ms batch flush ``` 4. The devp2p test deliberately tolerates one pre-disconnect hash announcement (go-ethereum commit 88c8459, Sept 2024, "sometimes we'll get a blob transaction hashes announcement before the disconnect"); the duplicated second copy arrives before the kick and fails the test. In the failed CI run the timing lines up exactly: the txpool started at 06:33:04.64 and the test failed at 06:33:14.644 — the tick+10s sync. With the duplication fixed, at most one announcement can precede the disconnect in that window, which the test tolerates by design. Beyond the test, every new peer was receiving the full pool sync in triplicate. ## Fix Apply the same negotiated-eth-version guard in `SendMessageById` that the other outbound paths use. Eth-protocol messages only; wit is unaffected (it is deduplicated to a single `GrpcServer` at shared-server construction). This also restores the routing contract documented at `FetchBlockAccessLists` ("sentryIndex MUST be the sentry where peerID is actually connected"): sends via a non-matching sentry are silent no-ops again, as they were with per-sentry peer sets. Not present on `release/3.4` (erigontech#21335 is main-only), so no backport is needed. ## Testing - New regression test `TestGrpcServer_SendMessageById_SharedStore_NoDuplicateWrites` (TDD): three `GrpcServer`s (eth/69/70/71) sharing a `PeerStore`, peer negotiated eth/70, by-id fan-out across all three must produce exactly one write. Red before the fix (`expected: 1, actual: 3`), green after. - `go test ./p2p/sentry/... ./txnprovider/txpool/...` clean. - Local hive validation in a loop: before the fix `not ok 18 BlobViolations` reproduced on iteration 2; after the fix 12/12 iterations green, with instrumented logs showing exactly one `NEW_POOLED_TRANSACTION_HASHES_68` write per new peer where there were three. - `make lint` (repeatedly) and `make erigon integration` clean.

lystopad requested review from AskAlexSharov, anacrolix, taratorio and yperbasis as code owners May 21, 2026 10:41

lystopad self-assigned this May 21, 2026

lystopad added the Networking label May 21, 2026

lystopad requested a review from mh0lt May 21, 2026 11:00

yperbasis requested a review from Copilot May 21, 2026 11:29

Copilot started reviewing on behalf of yperbasis May 21, 2026 11:30 View session