p2p/sentry: share one p2p.Server across all eth protocols#21335
Conversation
…cols Before: each requested eth protocol version backed its own p2p.Server bound to its own TCP/UDP port from --p2p.allowed-ports (30303/30304/30305 by default), all sharing the same node key. The discovery DHT keeps one ENR per Node ID (highest seq wins), so the per-Server ENRs raced at startup and only one survived — peers ended up dialing whichever port that one ENR advertised and inbound on the others stuck at zero. On a Hoodi node with maxpeers=80 and ProtocolVersion=[ETH69,ETH68] inbound on the eth/69 listener was 4-14 while the eth/68 listener got the attention; total peers plateaued near 30. With ETH69/70/71 on main the race is three-way. After: the node opens ONE TCP listener (and one UDP discovery socket) on --port (30303 by default) that carries every configured eth protocol version and the wit sideprotocol multiplexed per peer connection. One Node ID, one ENR, one port — no race. How it's wired: node/components/sentry.Provider builds the union of every GrpcServer's Protocols (dedup'd by name+version so wit isn't registered N times) and injects a single shared p2p.Server back into each GrpcServer via a new SetP2PServer hook. Side fixes that fall out of the share: - GrpcServer.Close() leaves an externally-injected Server alone so one sentry shutting down doesn't tear the listener out from under the others. - Peers() / NodeInfo() on non-reporter sentries return empty so admin_peers aggregation across sentries doesn't N-fold duplicate every entry. - NodeDatabase moves from per-protocol subdirs (nodes/eth68, nodes/eth69, …) to a single nodes/eth — existing per-protocol dirs become inert on upgrade and peer discovery rebuilds from bootnodes within a few minutes. User-visible: - One TCP port instead of three; firewall / monitoring setups that opened 30304/30305 specifically can drop those rules. - --maxpeers now caps total connections honestly (was effectively N×cap). - Standalone sentry binary and --sentry.api.addr (remote sentry) paths are unchanged; neither had the bug. Co-Authored-By: Claude
There was a problem hiding this comment.
Pull request overview
This PR changes Erigon’s local sentry wiring to run one shared p2p.Server (single ENR / Node ID / TCP listener) across all configured ETH protocol versions, instead of one p2p.Server per protocol/port. This addresses the discovery ENR “last writer wins” race when multiple Servers sign ENRs with the same Node ID but different ports.
Changes:
- Add a
GrpcServer.SetP2PServer(...)hook to inject an externally-managed, sharedp2p.Server, plus lifecycle/peer-reporting controls for shared-server mode. - Update the sentry Provider to build per-protocol
GrpcServerinstances, merge/dedupe theirp2p.Protocols, start one sharedp2p.Server, and inject it into all GrpcServers. - Unify node discovery DB layout from per-protocol subdirs to a single
nodes/ethdirectory in local multi-protocol mode.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| p2p/sentry/sentry_grpc_server.go | Adds shared-server injection (SetP2PServer), external ownership behavior in Close, peer-reporting gating for Peers/NodeInfo, and exports DNS discovery setup. |
| node/components/sentry/provider.go | Builds a shared p2p.Server across protocol GrpcServers, merges/dedupes protocol registrations, and centralizes node DB + listen-port selection. |
Comments suppressed due to low confidence (1)
node/components/sentry/provider.go:405
- GrpcServer.Close intentionally does not Stop() externally-injected p2p.Servers. Since the Provider is the coordinator creating this shared server, it should retain a reference and Stop() it in Provider.Close; otherwise Close can leave the TCP listener and P2P goroutines running unless the caller also cancels SentryCtx.
for i, ss := range p.Servers {
// First sentry (highest configured protocol version) reports peers.
ss.SetP2PServer(srv, i == 0)
}
return nil
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…iring Follow-ups to the shared p2p.Server refactor in the previous commit, addressing Copilot review feedback on PR #21335. - Provider.Close now stops the shared p2p.Server. GrpcServer.Close is a no-op for externally-injected Servers (by design — the coordinator owns lifecycle), so without this the listener + discovery goroutines outlive Provider.Close until SentryCtx is cancelled. - Protocol.Run waits up to 10s for the first SetStatus instead of disconnecting immediately when ss.statusData is still nil. The shared Server's listener comes up in Initialize; the multi-client's first SetStatus arrives later via StartStreamLoops in Provider.Start. A new GrpcServer.statusReady channel is closed on the first SetStatus, and awaitStatus selects on it / the timeout / ctx.Done. Closes the startup window where peers would have bounced off DiscProtocolError. - buildSharedP2PConfig now errors out when every entry in AllowedPorts is busy. The legacy per-protocol loop had this guard; my rewrite dropped it and would silently keep the last listenPort and fail on bind. - buildSharedP2PConfig probes port availability against 127.0.0.1 instead of an empty host. checkPortIsFree dials its target, and the common ":30303" form would yield "missing host" -> dial fails -> port reported free even when actually in use. The bug pre-existed in the legacy per-protocol loop too; cleaning it up here while the surrounding code is in motion. The ListenAddr returned to the Server keeps the original empty host so the listener still binds on all interfaces. Co-Authored-By: Claude
Adds focused unit tests covering the new shared-Server wiring that landed in
this PR, addressing the two test-coverage items from the Copilot review:
p2p/sentry:
- SetP2PServer flips external/reportsPeers correctly and panics on a
second injection (ownership is decided up front).
- Peers() / NodeInfo() on a non-reporter sentry return empty replies so
admin_peers aggregation across sentries doesn't N-fold every entry.
- Close() leaves an externally-injected Server running; Close() on a
self-owned Server stops it.
- awaitStatus returns immediately when statusData is already set, times
out cleanly when it isn't, and unblocks as soon as SetStatus stores
the first non-nil status — the startup-window fix in Protocol.Run.
node/components/sentry:
- buildSharedP2PConfig points NodeDatabase at nodes/eth, honours an
explicit ListenAddr when AllowedPorts is empty, picks the first free
AllowedPorts entry, errors when all are busy, and probes 127.0.0.1
when the configured host is empty (":port" form).
- startSharedP2PServer dedupes Protocols by name+version, wires the
shared p2p.Server into every GrpcServer, and marks only the first as
the peer-list reporter.
- Provider.Close stops the shared p2p.Server it created (the GrpcServer
side leaves it alone by design).
Also exposes GrpcServer.IsPeerReporter for observability — the dedup test
in the components/sentry package needs to inspect which sentry got marked
as the reporter, and it lives in a different package than the unexported
field.
Co-Authored-By: Claude
yperbasis
left a comment
There was a problem hiding this comment.
Overview
Fixes a real bug where erigon's local-mode sentry opened one p2p.Server per eth protocol version. Each Server signed its own ENR under the same Node ID but with a different listener port, racing in the discovery DHT — only the highest-seq ENR survives, so peers dial the "wrong" port and inbound peer count stalls at a fraction of --maxpeers.
The fix moves to geth's model: one p2p.Server with the union of all protocol caps multiplexed per peer connection. Side-effect cleanups (admin_peers de-duplication, MaxPeers honesty, externally-owned-Server lifecycle) fall out naturally.
Stats: +642 / −50, 4 files. Scope is well-contained to the sentry component.
What's strong about this PR
- Real bug with measurable impact. The PR body documents the discovery DHT race precisely, with concrete millisecond-level ENR seq timing and a Hoodi repro. Not a speculative refactor.
- Architectural alignment. Matches geth's design (one
Server, multiplexed protocols). Reduces port surface and removes a class of races. - Clean separation.
buildSharedP2PConfig(pure) andstartSharedP2PServer(lifecycle) are each small and testable; tests inprovider_test.goexercise both directly.SetP2PServeronGrpcServeris a minimal, well-documented injection hook. - Backwards-compatible. Standalone
cmd/sentryand remote--sentry.api.addrmode are untouched — they continue to use the lazy-server path insideSetStatus. The newstatusReadychannel doesn't change behavior for those (the constructor still creates it;awaitStatusreturns immediately if status is already set). - Good test coverage. Eleven new tests across two packages, including: dedup, reporter wiring, idempotent
SetP2PServerpanic, external-server lifecycle, owned-server lifecycle,awaitStatusfast/timeout/unblock paths, empty-host port probing, all-ports-busy error path. All pass undergo test -race.
I ran the full ./node/components/sentry/... and ./p2p/sentry/... test packages locally with -race -count=1 — all pass cleanly.
Concerns / suggestions
1. Doc/code mismatch: which sentry reports peers (p.Servers[0])
The PR description and the startSharedP2PServer doc comment both claim:
"Exactly one GrpcServer is marked as the peer-list reporter (the first one in p.Servers, which matches the highest entry in ProtocolVersion)"
But nodecfg/defaults.go:51 has:
ProtocolVersion: []uint{direct.ETH69, direct.ETH70, direct.ETH71},— lowest first, highest last. So p.Servers[0] is ETH69, not the highest. The new test in provider_test.go puts eth/71 as first, which is the opposite of the production default ordering.
Functionally this doesn't matter — p2pServer.PeersInfo() returns the global list regardless of which sentry calls it. But the doc comment will mislead a reader. Either fix the comment ("the first configured protocol version, typically the lowest") or, if "highest" is intentional, sort p.Servers by Protocols[0].Version desc before the i == 0 reporter pick. The latter is cleaner if you care about which goodPeers map the wit/0 traffic lands in (see #4 below).
2. Unlocked reads of external / reportsPeers in Peers() / NodeInfo()
p2p/sentry/sentry_grpc_server.go Peers() and NodeInfo() read these fields without holding p2pServerLock:
p2pServer := ss.getP2PServer() // takes RLock, releases it
if p2pServer == nil { ... }
if ss.external && !ss.reportsPeers { // unlocked read
return &sentryproto.PeersReply{}, nil
}SetP2PServer writes both fields under Lock. In practice this is fine because SetP2PServer is called once during Initialize, strictly before any gRPC handler can fire — there's a happens-before edge through the multi-client wiring. go test -race didn't flag it because the tests don't drive that race.
But it's a latent risk for future changes (e.g., a hot-reconfig path). Cheap fix: capture both fields inside getP2PServer (rename to return a triple, or add a sibling getServerState() (*p2p.Server, bool, bool)).
3. SetP2PServer panics on double-call
if ss.p2pServer != nil {
panic("sentry.GrpcServer: SetP2PServer called when p2pServer is already set")
}The invariant ("ownership decided up front") is sound, but a panic in node startup is harsh. An error return would let the provider fail Initialize cleanly with the same diagnostic. Not a blocker — the panic message is descriptive — but worth reconsidering.
4. wit/0 dedup picks the first server's handler
In startSharedP2PServer, the dedup keeps the first wit/0 Protocol encountered (currently from p.Servers[0], i.e. the ETH69 sentry by default ordering). So every wit-protocol peer's runWitPeer runs on the ETH69 sentry's goodPeers map — even peers that negotiate eth/71 with the ETH71 sentry. That peer ends up in p.Servers[0].goodPeers with protocol=0, witProtocol=0 AND in p.Servers[2].goodPeers with protocol=71.
Effect on runPeerCountLogger: such peers get counted under protocol=0 in the ETH69 sentry's bucket and under their actual eth version in the highest sentry's bucket — double-counted across protocols in the log line. admin_peers is unaffected (gated to the reporter, uses PeersInfo()). Cosmetic logging quirk, not a correctness issue. Worth a one-line note in the comment, or — more cleanly — bind the deduped wit Protocol to the reporter sentry so wit state co-locates with the global peer view.
5. awaitStatus: 10s timeout is reasonable but undiscoverable
awaitStatusTimeout = 10 * time.Second is a package-private constant. It governs how long an inbound peer's Protocol.Run waits for the multi-client to broadcast its first SetStatus. The window between the listener coming up (in Initialize) and the multi-client calling SetStatus (after BuildMultiClient + Start) is normally sub-second, so 10s is plenty.
But if anything blocks the startup path (slow chain init, large snapshot reindex, debugger), peers connecting in the window get a silent DiscProtocolError. Worth a debug-level log on timeout so the operator can correlate "no inbound for the first N seconds" with the cause. The current code drops to nil silently and the caller logs PeerErrorLocalStatusNeeded — which doesn't make the timeout vs. the actual "core didn't send status" case distinguishable.
6. BootstrapNodes mutation lives in the wrong layer
startSharedP2PServer does:
if cfg.BootstrapNodes == nil && len(chainBootnodes) > 0 {
bootstrapNodes, err := enode.ParseNodesFromURLs(chainBootnodes)
...
cfg.BootstrapNodes = bootstrapNodes
cfg.BootstrapNodesV5 = bootstrapNodes
}This duplicates the exact logic in p2p/sentry/sentry_grpc_server.go makeP2PServer. Since you're already building the p2p.Server yourself in startSharedP2PServer, you can either:
- Reuse
makeP2PServer(export it, or extract abuildBootstrapNodeshelper), or - Pull the DNS-discovery / bootnodes resolution into a single shared helper that both
makeP2PServerandstartSharedP2PServercall.
The current duplication is a future-divergence hazard.
7. Minor: awaitStatus doesn't return nil on ctx.Done() deterministically
select {
case <-ss.statusReady:
case <-time.After(maxWait):
case <-ss.ctx.Done():
}
return ss.GetStatus()On ctx.Done(), the function returns whatever GetStatus() is — which could be non-nil if SetStatus ran concurrently and ctx.Done() won the select. Probably fine (we just return the latest status either way), but the doc comment says "returns nil on timeout so the caller can disconnect the peer" without mentioning the ctx.Done case. Tighten the comment or explicitly return nil when ctx is done.
8. Test gap: end-to-end Initialize with the new shared-Server path
The new unit tests cover buildSharedP2PConfig and startSharedP2PServer in isolation, but no test exercises Provider.Initialize end-to-end in local mode (which is what the bug actually lived in). I understand that requires a full ChainDB + chainspec stub which is non-trivial, but it would be the strongest regression guard for the original DHT-race scenario. At minimum, a test that asserts after Initialize with len(ProtocolVersion) >= 2:
p.sharedP2PServer != nillen(p.Servers) == len(ProtocolVersion)p.Servers[0].GetP2PServer() == p.Servers[1].GetP2PServer() == p.sharedP2PServerp.Servers[0].IsPeerReporter() == true && p.Servers[1].IsPeerReporter() == false
would catch any future regression that breaks the wiring without a full network stand-up.
9. Backport plan
The PR notes that release/3.4 will need a separate cherry-pick PR landing the fix directly in node/eth/backend.go (no sentryProvider abstraction there). Given this is a peer-connectivity bug affecting users on 3.4 right now, that backport should be tracked as a follow-up immediately after merge. Per CLAUDE.md conventions, the title prefix is [r3.4].
Security / correctness
Nothing concerning. The new code:
- Doesn't introduce new network endpoints — actually closes off two (
--port+1,--port+2). - Doesn't change cryptographic material (same Node ID, same ENR signing).
- Doesn't loosen any peer admission checks.
- Bounds the new
awaitStatuswindow so a stuck startup can't cause unbounded blocking.
Performance
- One less
p2p.Serverper protocol version: less goroutine overhead, one discovery instance instead of N. - One DNS-discovery iterator per eth/N protocol still (created in each
NewGrpcServer) — same DNS list. Could be deduped to one DNS iterator shared across protocols, but it's not visibly costly and matches geth's per-cap-iterator pattern. Out of scope for this PR.
Verdict
Solid PR. The core change is correct, well-tested, and addresses a documented production bug. My concerns are all polish (doc accuracy, future-proofing locking, factoring duplicated logic). Recommend:
- Fix the "highest entry in ProtocolVersion" comment in
provider.goand the PR body — it's the first configured entry (which is the lowest in the default). - Add a one-line debug log when
awaitStatustimes out. - (Optional but nice) Capture
external/reportsPeersunderp2pServerLockinPeers/NodeInfofor hygiene. - Track the
[r3.4]backport as a follow-up.
Items 5-8 above are nice-to-haves that can land later. Nothing in this PR blocks merge.
Findings from Copilot (3) and yperbasis (8). The Copilot findings on Peers routing, empty NodeInfo, and close(statusReady) are real; yperbasis identified the matching doc/code mismatches and a few polish items. 1. Peers() / message routing (Copilot #1). The multi-sentry client uses Peers() to decide which sentry owns each peer and routes SendMessageById via that sentry's gRPC. With the previous reporter-only gating, every peer mapped to Servers[0] (which is the *lowest* configured protocol, ETH69 in the default — yperbasis #1 noted the comment said "highest"). Servers[0]'s goodPeers doesn't have eth/70 or eth/71 peers, so SendMessageById silently no-op'd. Fix: each GrpcServer.Peers() now reports its own goodPeers, filtered to skip entries where both protocol and witProtocol are zero. Each peer ends up in exactly one eth-sentry's goodPeers (the negotiated version) plus, at most, the sentry hosting the wit sideprotocol, so admin_peers aggregation is naturally non-duplicating and routing is correct. 2. NodeInfo() (Copilot #3). Non-reporters returning empty replies polluted admin_nodeInfo with blank entries that sorted first. With the shared p2p.Server every sentry has the same Node ID and the same enode, so they now return identical NodeInfo. node/eth.NodesInfo deduplicates by Enode before sorting. 3. SetStatus's close(ss.statusReady) (Copilot #2) panicked for callers that construct GrpcServer outside NewGrpcServer (existing TestSentryServerImpl_* does this). Guarded the close with a nil check; awaitStatus tolerates a nil channel via the select's other cases. 4. SetP2PServer returns an error instead of panicking on double-call (yperbasis #3). The "ownership decided up front" invariant is still enforced, just propagated up the Provider.Initialize path cleanly. 5. SimplePeerCount filters protocol=0 ghosts (yperbasis #4). With wit/0 deduped to one sentry, peers that negotiate eth/N on a different sentry end up as protocol=0/witProtocol=0 ghosts on the wit-hosting sentry. Counting them would emit a bogus eth.ProtocolToString[0] bucket in the GoodPeers log. The Peers() filter already drops them from admin_peers; this aligns SimplePeerCount. 6. awaitStatus logs a Debug line when its timeout fires (yperbasis #5) so operators can tell "core didn't send status in time" from "core never tried" when the caller disconnects the peer with PeerErrorLocalStatusNeeded. Doc comment clarifies the ctx.Done case too (yperbasis #7). Drops the reportsPeers flag and IsPeerReporter accessor — no longer needed once per-sentry goodPeers replaces the reporter gating. SetP2PServer signature loses its second argument. Tests updated for the new shape; new TestGrpcServer_PeersReturnsPerSentryGoodPeers (per-sentry view + ghost-entry filter) and TestGrpcServer_SetStatus_NilStatusReadyIsSafe (close-nil guard). Deferred to follow-up: - BootstrapNodes/DNS resolution helper to dedupe logic between makeP2PServer and startSharedP2PServer (yperbasis #6). - End-to-end Provider.Initialize test in local mode (yperbasis #8). - [r3.4] backport PR (yperbasis #9). Co-Authored-By: Claude
Three new findings, all real lifecycle / probing edge cases. 1. SetP2PServer accepted a nil *p2p.Server and still flipped external=true. A subsequent SetStatus would then take the lazy path and build its own Server, but Close() would see external=true and skip Stop() — leaking the listener and discovery goroutines. Reject nil at the door. 2. startSharedP2PServer leaked the shared p2p.Server on inject failure: srv.Start succeeds and p.sharedP2PServer is assigned, but if any SetP2PServer call later errors, the partial Server was left running when Initialize returned. Stop it and clear the field on that path. 3. checkPortIsFree only handled the empty-host form of an unspecified bind address. 0.0.0.0:N and [::]:N would also DialTimeout-fail and read as "port is free" even when a listener was actually bound on all interfaces. Add a loopbackProbeHost helper that maps all four unspecified forms (empty, 0.0.0.0, ::, [::]) to a concrete loopback target so the probe actually exercises the listener; ListenAddr itself keeps the original host. Tests added: - TestGrpcServer_SetP2PServer_RejectsNil — nil rejection + external flag invariant. - TestLoopbackProbeHost — table-driven coverage of the four unspecified forms and concrete-host passthrough. Co-Authored-By: Claude
Copilot review on PR #21335: with dedupe-by-Enode in place, the previous for-i loop over the first `limit` sentries could return fewer than `limit` unique entries if duplicates appeared early. Treat `limit` as the cap on *unique* results and keep iterating the rest of the sentry list past duplicates so a hybrid layout (some sentries sharing a p2p.Server, others external) can still fill the cap. Co-Authored-By: Claude
CI failed on windows-2025 with: expected: "/data/nodes/eth" actual: "\\data\\nodes\\eth" TestBuildSharedP2PConfig_NodeDatabaseUnified hardcoded a forward-slash expectation, but filepath.Join uses OS-native separators. Build the expected value with filepath.Join from a t.TempDir base so the assertion holds on Windows too. Co-Authored-By: Claude
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (3)
p2p/sentry/sentry_grpc_server.go:1212
- The “already connected” checks later in
getOrCreatePeerreadexistingPeerInfo.protocol/existingPeerInfo.witProtocolwithout takingexistingPeerInfo.lock, while those fields are written under that lock (e.g.SetEthProtocol). This is a Go data race and can make the one-connection-per-protocol guard flaky under concurrent handshakes. Please read these fields underPeerInfo.lock(or via locked accessors).
peerInfo := NewPeerInfo(peer)
ss.peers.peers[peerID] = peerInfo
return peerInfo, nil
}
p2p/sentry/sentry_grpc_server.go:1683
peerInfo.protocolis read here without takingpeerInfo.lock, but it’s written under that lock (viaSetEthProtocol). SincePeers()can be called concurrently with ongoing handshakes, this is a Go data race. Consider reading the protocol once underpeerInfo.lockand using that local value for filtering.
ss.rangePeers(func(peerInfo *PeerInfo) bool {
if peerInfo.protocol == 0 || peerInfo.protocol != myVersion {
return true
}
p2p/sentry/sentry_grpc_server.go:1716
SimplePeerCountreadspeerInfo.protocolwithoutpeerInfo.lock, but the field is written under that lock. This can show up as a race under load (RPC calls vs handshake goroutines). Read it once underpeerInfo.lockand use the local value for both the filter and the increment.
ss.rangePeers(func(peerInfo *PeerInfo) bool {
// Per-sentry counting: each sentry reports only the peers whose
// negotiated eth version matches its own. Mirrors Peers(). Without
// this filter the Provider's runPeerCountLogger would aggregate
// every peer once per sentry (shared PeerStore makes the map
// visible to all sentries).
if peerInfo.protocol == 0 || peerInfo.protocol != myVersion {
return true
}
counts[peerInfo.protocol]++
return true
Round-10 Copilot review (3 "low confidence" comments suppressed in the
UI, but all real): pi.protocol and pi.witProtocol are written under
pi.lock (SetEthProtocol, the wit Run handover) but read in several
places without the lock — Go data races under concurrent handshakes vs
admin_peers / SimplePeerCount / the message routers.
Add locked accessors:
- EthProtocol() uint — mirrors SetEthProtocol
- SetWitProtocol(uint) — new setter; wit Run now uses it
- WitProtocol() uint
Replace bare reads at:
- getOrCreatePeer's "already connected" checks (one-connection-per-
protocol guard)
- Peers() filter
- SimplePeerCount filter+count
- SendMessageById error message
- SendMessageToRandomPeers protocol-version filter
- SendMessageToAll protocol-version filter
Also dropped a duplicate `peerInfo.protocol = protocol` write in the
eth Run — SetEthProtocol on the line above already stored it under
pi.lock; the bare write was a second, unsynchronised mutation.
Verified with `go test -race ./p2p/sentry/`.
Co-Authored-By: Claude
ss.peers becomes atomic.Pointer[PeerStore] so SetSharedPeerStore is race-free under the documented "swap before srv.Start" contract — a bare field write would trip the race detector if anyone ever reshuffled the call ordering. Readers Load() once and use the inner mutex as before. Cache the eth protocol version in GrpcServer.ethVersion (set in NewGrpcServer) so Peers / SimplePeerCount don't rescan ss.Protocols on every call; the ethProtocolVersion() helper goes away. Both filters now combine the cached version with #88bae21e's locked EthProtocol() snapshot. writePeer's nil-rw drop branch now emits a Trace log carrying the (protocol, msgID, peerID) so an operator chasing a missing message has a fingerprint to grep for instead of guessing. startSharedP2PServer validates that every GrpcServer has a nil p2pServer before mutating any of them — defends against a future change that makes SetP2PServer fail mid-loop and leaves half the sentries holding the shared PeerStore. Verified with `go test -race ./p2p/sentry/ ./node/components/sentry/` and `make lint`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rs to 64
Follow-up to the single-shared-Server change. With one p2p.Server carrying
all eth versions on --port, --p2p.allowed-ports (which selected one TCP port
per protocol from a list) no longer means anything. Removing it:
- Drops the P2pProtocolAllowedPorts CLI flag and its registration in
node/cli/default_flags.go.
- Drops the allowedPorts parameter from utils.NewP2PConfig and the
flag wiring in cmd/sentry/main.go.
- Drops the AllowedPorts field from p2p.Config.
- Replaces the AllowedPorts fallback loop in buildSharedP2PConfig with
a straight ListenAddr passthrough; if --port is busy the bind now
fails fast (matches geth).
- Removes the obsolete port_util.go helpers (checkPortIsFree,
loopbackProbeHost, splitAddrIntoHostAndPort) and the four
AllowedPorts/probe-host tests in provider_test.go.
- Strips AllowedPorts from three test helpers (polygon miner, engine
api tester, txpool p2p client) — they relied on []uint{0} as an
ephemeral hint, but ListenAddr already does that.
Also bumps default MaxPeers 32 -> 64. Pre-fix with 3 eth protocol Servers
the effective cap was ~3*32 = 96; geth defaults to 50. 64 lands above geth
with a small headroom so existing operators don't see a sharp drop after
upgrading.
Also fixes the stale per-protocol qualifiers in flag Usage strings
(MaxPeers, MaxPendingPeers) and the NodesDir doc comment that still
referenced eth68/eth69 subdirectories.
Docs:
- README.md port table collapses sentry 30303/30304 -> 30303.
- docs/site/docs/fundamentals/default-ports.md table + Sentry CLI
section updated; --p2p.allowed-ports bullet removed.
- docs/site/docs/fundamentals/multiple-instances.md port-allocation
table updated.
- ChangeLog.md gets a [3.5.0] Breaking Changes entry with a before/
after table and migration bullets.
Verified with `go test -race ./p2p/sentry/ ./node/components/sentry/`,
`make lint`, and `make erigon integration`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The flag is gone (see prior commit) — the three skill files that documented or scripted it (erigon-network-ports, erigon-ephemeral, launch-devnet) would otherwise tell agents to pass a flag that errors at startup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The docs-site CI step `python3 docs/site/scripts/generate-llms.py --check` caught that the committed llms-full.txt artifacts drifted from regenerated content after the default-ports.md / multiple-instances.md edits in the prior commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot review (#21335 r3288296443): SetStatus is a gRPC entrypoint exposed over the wire by cmd/sentry, but it dereferences statusData.ForkData unconditionally (genesisHash on the entry, and HeightForks/TimeForks for the ENR entry builder). A malformed payload with nil ForkData would crash the sentry via nil-pointer panic. Guard at the door: return a clear error if either statusData or statusData.ForkData is nil. Test covers both shapes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s, findPeerByMinBlock, PeerEvents (erigontech#21394) Follow-up fixes for erigontech#21335 (merged). ## Bugs fixed With the shared `PeerStore` introduced in erigontech#21335, every `GrpcServer` can see **all** peers regardless of which eth protocol version they negotiated. Three `rangePeers` loops were missing the per-sentry version filter that `Peers()` and `SimplePeerCount()` already apply correctly: ### 1. `findBestPeersWithPermit` / `findPeerByMinBlock` Used by `SendMessageByMinBlock` to select target peers. Without the filter, an eth/71 sentry could pick an eth/69 peer: - `messageCode` encodes the request with **eth/71** wire codes. - `writePeer` sends it over the eth/69 peer's `ethRw`. - The remote eth/69 peer receives an unrecognised (or misinterpreted) code and disconnects. Contrast: `SendMessageToRandomPeers` and `SendMessageToAll` already filter with `protocolVersions.Contains(peerInfo.EthProtocol())`. ### 2. `PeerEvents` replay The replay pass at the start of `PeerEvents` iterated the entire shared store and emitted a `Connect` event for every peer. Subscribers for eth/68 would receive `Connect` events for eth/69 and eth/71 peers. Those peers never generate a `Disconnect` on the eth/68 sentry's `peersStreams`, so the subscriber accumulates ghost peers that are never cleaned up. ## Fix Apply the same two-condition guard already used by `Peers()` and `SimplePeerCount()`: ```go if pv := peerInfo.EthProtocol(); pv == 0 || pv != ss.ethVersion { return true } ``` ## Tests Three new tests added to `sentry_grpc_server_test.go`: - `TestGrpcServer_FindBestPeersWithPermit_FiltersVersion` - `TestGrpcServer_FindPeerByMinBlock_FiltersVersion` - `TestGrpcServer_PeerEvents_ReplayFiltersByVersion` All existing tests pass. `make lint` clean ×2.
…ies (erigontech#21597) ## Summary Fixes the hive devp2p `BlobViolations` flake (`expected disconnect on blob violation, got msg code: 24`) seen on the CI Gate run for erigontech#21524 (parallel leg) and on a runner-experiment branch the day before (serial leg) — and the gossip-duplication regression behind it. ## Root cause Since erigontech#21335 all per-eth-version sentry `GrpcServer`s share one `p2p.Server` and one `PeerStore`. `SendMessageById` resolves the target peer in the shared store, so callers that fan a by-id send out across every sentry client — e.g. the txpool's new-peer pool sync (`PropagatePooledTxnsToPeersList`) — now deliver the same frame once per sentry: three identical `NewPooledTransactionHashes` messages per new peer on the default eth/69+70+71 build. Before erigontech#21335 each sentry had its own peer set, so the off-sentry sends were silent no-ops. `SendMessageToAll` and `SendMessageToRandomPeers` already guard against this with `protocolVersions.Contains(peerInfo.EthProtocol())`; `SendMessageById` was the only outbound path without the negotiated-version check. The BlobViolations failure sequence, reproduced locally with wire-level logging (hive `--sim devp2p --sim.limit eth` against an instrumented image; failed on the second loop iteration with the exact CI error): 1. The test peers, announces two blob txs (one with a lying size/type), delivers them, and waits for a disconnect. 2. The violating `PooledTransactions` reply sits in the txpool fetcher's 250 ms inbound batch before the announcement check kicks the peer. 3. If the 5 s `syncToNewPeersEvery` tick lands in that window, the all-pool announcement (~70 KB — about 2000 hashes accumulated by earlier suite tests) is written to the test peer three times: ``` 08:16:32.606 inbound NPTH68 peer=de6008eb count=2 08:16:32.607 WRITE GET_POOLED_TXS_66 peer=de6008eb 08:16:32.607 sync-to-new-peers entries=2002 <- 5s tick fires 08:16:32.608 WRITE NPTH68 70084 bytes peer=de6008eb <- one write 08:16:32.608 WRITE NPTH68 70084 bytes peer=de6008eb <- per 08:16:32.608 WRITE NPTH68 70084 bytes peer=de6008eb <- sentry 08:16:32.855 VIOLATION -> PenalizePeer peer=de6008eb <- next 250ms batch flush ``` 4. The devp2p test deliberately tolerates one pre-disconnect hash announcement (go-ethereum commit 88c8459, Sept 2024, "sometimes we'll get a blob transaction hashes announcement before the disconnect"); the duplicated second copy arrives before the kick and fails the test. In the failed CI run the timing lines up exactly: the txpool started at 06:33:04.64 and the test failed at 06:33:14.644 — the tick+10s sync. With the duplication fixed, at most one announcement can precede the disconnect in that window, which the test tolerates by design. Beyond the test, every new peer was receiving the full pool sync in triplicate. ## Fix Apply the same negotiated-eth-version guard in `SendMessageById` that the other outbound paths use. Eth-protocol messages only; wit is unaffected (it is deduplicated to a single `GrpcServer` at shared-server construction). This also restores the routing contract documented at `FetchBlockAccessLists` ("sentryIndex MUST be the sentry where peerID is actually connected"): sends via a non-matching sentry are silent no-ops again, as they were with per-sentry peer sets. Not present on `release/3.4` (erigontech#21335 is main-only), so no backport is needed. ## Testing - New regression test `TestGrpcServer_SendMessageById_SharedStore_NoDuplicateWrites` (TDD): three `GrpcServer`s (eth/69/70/71) sharing a `PeerStore`, peer negotiated eth/70, by-id fan-out across all three must produce exactly one write. Red before the fix (`expected: 1, actual: 3`), green after. - `go test ./p2p/sentry/... ./txnprovider/txpool/...` clean. - Local hive validation in a loop: before the fix `not ok 18 BlobViolations` reproduced on iteration 2; after the fix 12/12 iterations green, with instrumented logs showing exactly one `NEW_POOLED_TRANSACTION_HASHES_68` write per new peer where there were three. - `make lint` (repeatedly) and `make erigon integration` clean.
Summary
Erigon now opens a single TCP listener on
--port(30303 by default) carrying every configured eth protocol version, instead of one listener per protocol on 30303/30304/30305.This fixes a discovery-DHT race that left inbound peers stuck at a fraction of
--maxpeersfor multi-protocol deployments.The bug
Each entry in
ProtocolVersion(today:[ETH69, ETH70, ETH71]onmain) used to back its ownp2p.Serverbound to its own port from--p2p.allowed-ports. All those Servers share the same node key, so each one signed an ENR for the same Node ID but advertising its own port. The discv4 DHT keeps one ENR per Node ID — the one with the highestseq— so the Servers raced at startup and only one of the ENRs survived in the network's view.Concrete repro on a live Hoodi node with
maxpeers: 80andProtocolVersion=[ETH69, ETH68]:seqnumbers fromerigon_nodeInfoare within a few milliseconds of each other.MaxPeers/DialRatio), so the entire delta was on the inbound side.With three protocols (
ETH69/70/71onmain) the race is three-way and a node can spend the whole session advertising whichever port is least useful.The fix
node/components/sentry.Providernow:sentry.GrpcServerper protocol (so the existingMultiClient/ protocol-routing keeps working).Protocols(dedup'd byname+versionsowitisn't registered N times).p2p.Serverwith that union and starts it on a single port.SetP2PServer(srv)hook.Result: one Node ID, one ENR, one TCP listener, one UDP discovery — multiple protocols multiplexed per peer connection (which is how
p2p.Serverwas designed to work in the first place, and how geth has always done it).Side fixes that fall out of the share
GrpcServer.Close()leaves an externally-injected Server alone (external == trueskipsStop) so one sentry shutting down doesn't tear the listener out from under the others;Provider.Closeowns stopping the shared server.sentry.PeerStoreis wired into every GrpcServer so wit/0 (deduped to a single sentry) and the negotiated eth/* (on a different sentry per peer) see the samePeerInfoand share its eth-ready signal. The field usesatomic.Pointer[PeerStore]so the swap is race-free.pi.protocol/pi.witProtocolreads go through locked accessors (EthProtocol(),WitProtocol()); the writers (SetEthProtocol,SetWitProtocol) already holdpi.lock. Verified withgo test -race.pi.rwis split into per-subprotocolpi.ethRw/pi.witRwsowritePeercan route bysentryproto.MessageId(globally unique) — necessary because eth and wit reuse low msgcodes (e.g.0x01is botheth.NewBlockHashesMsgandwit.NewWitnessHashesMsg).Peers()/SimplePeerCount()filter by the GrpcServer's own eth version (cached inss.ethVersion), so the sharedPeerStoredoesn't N-fold-duplicateadmin_peersaggregation.node/eth.NodesInfodeduplicates by Enode — every sentry now returns identicalNodeInfo, so the dedup is what keepsadmin_nodeInfofrom listing the same node N times.NodeDatabasemoves from per-protocol subdirs (nodes/eth68,nodes/eth69, …) to a singlenodes/eth. Existing per-protocol dirs become inert on upgrade — peer discovery rebuilds from bootnodes within a few minutes.User-visible changes
--portis opened now, not--port,--port+1,--port+2. Firewall / monitoring / Kubernetes Service rules that explicitly opened 30304 and 30305 can drop those entries.--p2p.allowed-portsflag removed. Drop it from CLI args / config files; passing it now errors.--portis the only knob.--maxpeerscaps total connections honestly. Before this fix, with N protocols enabled, each per-protocol Server enforcedmaxpeersindependently, so the actual ceiling was ~N×maxpeers. The new ceiling matches what the flag's docs say.--maxpeersbumped 32 → 64 to compensate for the now-honest cap. (With 3 eth versions and the old multiplication, the effective ceiling was ~96; geth defaults to 50; 64 lands above geth with a small headroom.)sentrybinary (cmd/sentry) and--sentry.api.addr(remote sentry over gRPC) are functionally unchanged — neither had the bug. Thecmd/sentryflag surface shrinks alongside the main binary: it no longer accepts--p2p.allowed-ports.nodes/eth{68,69,…}/; nothing on disk is deleted, the dirs are just no longer read. Discovery refills from bootnodes within a few minutes.Why not the smaller fixes I considered
maxpeers.The shared-Server route is the structurally clean one and matches geth's model.
Verification
make erigon integrationbuilds clean.make lintclean (×3 — non-deterministic per CLAUDE.md).go test ./p2p/sentry/... ./node/components/sentry/... ./node/eth/...all green.go test -race ./p2p/sentry/ ./node/components/sentry/clean.in_30303=0,in_30304>0consistently after a restart that lost the eth/69 race. With this patch applied the race goes away because there's only one ENR.Test plan
--maxpeersis reached.--portshows up innetstat -ltnp | grep erigon(not--port+1/+2).admin_peersreturns each peer once (nolen(ProtocolVersion)duplication).admin_nodeInfoshows one enode (one port)../build/bin/sentry --datadir=… --chain=….--sentry.api.addr=…mode still connects to an external sentry process.🤖 Generated with Claude Code