Skip to content

p2p/sentry: share one p2p.Server across all eth protocols#21335

Merged
lystopad merged 21 commits into
mainfrom
feature/lystopad/p2p_unified_server
May 22, 2026
Merged

p2p/sentry: share one p2p.Server across all eth protocols#21335
lystopad merged 21 commits into
mainfrom
feature/lystopad/p2p_unified_server

Conversation

@lystopad

@lystopad lystopad commented May 21, 2026

Copy link
Copy Markdown
Member

Summary

Erigon now opens a single TCP listener on --port (30303 by default) carrying every configured eth protocol version, instead of one listener per protocol on 30303/30304/30305.

This fixes a discovery-DHT race that left inbound peers stuck at a fraction of --maxpeers for multi-protocol deployments.

The bug

Each entry in ProtocolVersion (today: [ETH69, ETH70, ETH71] on main) used to back its own p2p.Server bound to its own port from --p2p.allowed-ports. All those Servers share the same node key, so each one signed an ENR for the same Node ID but advertising its own port. The discv4 DHT keeps one ENR per Node ID — the one with the highest seq — so the Servers raced at startup and only one of the ENRs survived in the network's view.

Concrete repro on a live Hoodi node with maxpeers: 80 and ProtocolVersion=[ETH69, ETH68]:

  • ENR seq numbers from erigon_nodeInfo are within a few milliseconds of each other.
  • Whichever protocol won the race got 50+ inbound; the other got 0-4.
  • Total peers plateaued around 30 instead of 80; outbound was identical between the two nodes I compared (capped at MaxPeers/DialRatio), so the entire delta was on the inbound side.

With three protocols (ETH69/70/71 on main) the race is three-way and a node can spend the whole session advertising whichever port is least useful.

The fix

node/components/sentry.Provider now:

  1. Creates one sentry.GrpcServer per protocol (so the existing MultiClient / protocol-routing keeps working).
  2. Collects the union of every GrpcServer's Protocols (dedup'd by name+version so wit isn't registered N times).
  3. Builds one p2p.Server with that union and starts it on a single port.
  4. Injects that shared Server back into every GrpcServer via a new SetP2PServer(srv) hook.

Result: one Node ID, one ENR, one TCP listener, one UDP discovery — multiple protocols multiplexed per peer connection (which is how p2p.Server was designed to work in the first place, and how geth has always done it).

Side fixes that fall out of the share

  • GrpcServer.Close() leaves an externally-injected Server alone (external == true skips Stop) so one sentry shutting down doesn't tear the listener out from under the others; Provider.Close owns stopping the shared server.
  • A shared sentry.PeerStore is wired into every GrpcServer so wit/0 (deduped to a single sentry) and the negotiated eth/* (on a different sentry per peer) see the same PeerInfo and share its eth-ready signal. The field uses atomic.Pointer[PeerStore] so the swap is race-free.
  • pi.protocol / pi.witProtocol reads go through locked accessors (EthProtocol(), WitProtocol()); the writers (SetEthProtocol, SetWitProtocol) already hold pi.lock. Verified with go test -race.
  • pi.rw is split into per-subprotocol pi.ethRw / pi.witRw so writePeer can route by sentryproto.MessageId (globally unique) — necessary because eth and wit reuse low msgcodes (e.g. 0x01 is both eth.NewBlockHashesMsg and wit.NewWitnessHashesMsg).
  • Per-sentry Peers() / SimplePeerCount() filter by the GrpcServer's own eth version (cached in ss.ethVersion), so the shared PeerStore doesn't N-fold-duplicate admin_peers aggregation.
  • node/eth.NodesInfo deduplicates by Enode — every sentry now returns identical NodeInfo, so the dedup is what keeps admin_nodeInfo from listing the same node N times.
  • NodeDatabase moves from per-protocol subdirs (nodes/eth68, nodes/eth69, …) to a single nodes/eth. Existing per-protocol dirs become inert on upgrade — peer discovery rebuilds from bootnodes within a few minutes.

User-visible changes

  • Port surface shrinks: only --port is opened now, not --port, --port+1, --port+2. Firewall / monitoring / Kubernetes Service rules that explicitly opened 30304 and 30305 can drop those entries.
  • --p2p.allowed-ports flag removed. Drop it from CLI args / config files; passing it now errors. --port is the only knob.
  • --maxpeers caps total connections honestly. Before this fix, with N protocols enabled, each per-protocol Server enforced maxpeers independently, so the actual ceiling was ~N×maxpeers. The new ceiling matches what the flag's docs say.
  • Default --maxpeers bumped 32 → 64 to compensate for the now-honest cap. (With 3 eth versions and the old multiplication, the effective ceiling was ~96; geth defaults to 50; 64 lands above geth with a small headroom.)
  • Standalone sentry binary (cmd/sentry) and --sentry.api.addr (remote sentry over gRPC) are functionally unchanged — neither had the bug. The cmd/sentry flag surface shrinks alongside the main binary: it no longer accepts --p2p.allowed-ports.
  • First run after upgrade loses the warm peer cache in nodes/eth{68,69,…}/; nothing on disk is deleted, the dirs are just no longer read. Discovery refills from bootnodes within a few minutes.

Why not the smaller fixes I considered

  • Force seq ordering so the highest protocol always wins the DHT race — still publishes N ENRs, still wastes ports, still N-fold counts maxpeers.
  • Disable discovery on all-but-one Server — kills inbound to the non-primary protocols entirely (fine when eth/68 is being deprecated, not fine for ETH69/70/71 living concurrently).

The shared-Server route is the structurally clean one and matches geth's model.

Verification

  • make erigon integration builds clean.
  • make lint clean (×3 — non-deterministic per CLAUDE.md).
  • go test ./p2p/sentry/... ./node/components/sentry/... ./node/eth/... all green.
  • go test -race ./p2p/sentry/ ./node/components/sentry/ clean.
  • Manual repro of the original bug confirmed: ENR seqs from both sentries decoded; in_30303=0, in_30304>0 consistently after a restart that lost the eth/69 race. With this patch applied the race goes away because there's only one ENR.

Test plan

  • Sync from snapshots on a public testnet (Hoodi / Sepolia) and confirm --maxpeers is reached.
  • Confirm only --port shows up in netstat -ltnp | grep erigon (not --port+1/+2).
  • Confirm admin_peers returns each peer once (no len(ProtocolVersion) duplication).
  • Confirm admin_nodeInfo shows one enode (one port).
  • Confirm standalone sentry binary still starts: ./build/bin/sentry --datadir=… --chain=….
  • Confirm --sentry.api.addr=… mode still connects to an external sentry process.

🤖 Generated with Claude Code

…cols

Before: each requested eth protocol version backed its own p2p.Server bound
to its own TCP/UDP port from --p2p.allowed-ports (30303/30304/30305 by
default), all sharing the same node key. The discovery DHT keeps one ENR per
Node ID (highest seq wins), so the per-Server ENRs raced at startup and only
one survived — peers ended up dialing whichever port that one ENR advertised
and inbound on the others stuck at zero. On a Hoodi node with maxpeers=80
and ProtocolVersion=[ETH69,ETH68] inbound on the eth/69 listener was 4-14
while the eth/68 listener got the attention; total peers plateaued near 30.
With ETH69/70/71 on main the race is three-way.

After: the node opens ONE TCP listener (and one UDP discovery socket) on
--port (30303 by default) that carries every configured eth protocol version
and the wit sideprotocol multiplexed per peer connection. One Node ID, one
ENR, one port — no race.

How it's wired: node/components/sentry.Provider builds the union of every
GrpcServer's Protocols (dedup'd by name+version so wit isn't registered N
times) and injects a single shared p2p.Server back into each GrpcServer via
a new SetP2PServer hook.

Side fixes that fall out of the share:
- GrpcServer.Close() leaves an externally-injected Server alone so one
  sentry shutting down doesn't tear the listener out from under the others.
- Peers() / NodeInfo() on non-reporter sentries return empty so admin_peers
  aggregation across sentries doesn't N-fold duplicate every entry.
- NodeDatabase moves from per-protocol subdirs (nodes/eth68, nodes/eth69, …)
  to a single nodes/eth — existing per-protocol dirs become inert on upgrade
  and peer discovery rebuilds from bootnodes within a few minutes.

User-visible:
- One TCP port instead of three; firewall / monitoring setups that opened
  30304/30305 specifically can drop those rules.
- --maxpeers now caps total connections honestly (was effectively N×cap).
- Standalone sentry binary and --sentry.api.addr (remote sentry) paths are
  unchanged; neither had the bug.

Co-Authored-By: Claude

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR changes Erigon’s local sentry wiring to run one shared p2p.Server (single ENR / Node ID / TCP listener) across all configured ETH protocol versions, instead of one p2p.Server per protocol/port. This addresses the discovery ENR “last writer wins” race when multiple Servers sign ENRs with the same Node ID but different ports.

Changes:

  • Add a GrpcServer.SetP2PServer(...) hook to inject an externally-managed, shared p2p.Server, plus lifecycle/peer-reporting controls for shared-server mode.
  • Update the sentry Provider to build per-protocol GrpcServer instances, merge/dedupe their p2p.Protocols, start one shared p2p.Server, and inject it into all GrpcServers.
  • Unify node discovery DB layout from per-protocol subdirs to a single nodes/eth directory in local multi-protocol mode.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
p2p/sentry/sentry_grpc_server.go Adds shared-server injection (SetP2PServer), external ownership behavior in Close, peer-reporting gating for Peers/NodeInfo, and exports DNS discovery setup.
node/components/sentry/provider.go Builds a shared p2p.Server across protocol GrpcServers, merges/dedupes protocol registrations, and centralizes node DB + listen-port selection.
Comments suppressed due to low confidence (1)

node/components/sentry/provider.go:405

  • GrpcServer.Close intentionally does not Stop() externally-injected p2p.Servers. Since the Provider is the coordinator creating this shared server, it should retain a reference and Stop() it in Provider.Close; otherwise Close can leave the TCP listener and P2P goroutines running unless the caller also cancels SentryCtx.
	for i, ss := range p.Servers {
		// First sentry (highest configured protocol version) reports peers.
		ss.SetP2PServer(srv, i == 0)
	}
	return nil

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread node/components/sentry/provider.go
Comment thread node/components/sentry/provider.go Outdated
Comment thread node/components/sentry/provider.go Outdated
Comment thread node/components/sentry/provider.go
Comment thread p2p/sentry/sentry_grpc_server.go
lystopad added 2 commits May 21, 2026 14:20
…iring

Follow-ups to the shared p2p.Server refactor in the previous commit,
addressing Copilot review feedback on PR #21335.

- Provider.Close now stops the shared p2p.Server. GrpcServer.Close is a
  no-op for externally-injected Servers (by design — the coordinator owns
  lifecycle), so without this the listener + discovery goroutines outlive
  Provider.Close until SentryCtx is cancelled.

- Protocol.Run waits up to 10s for the first SetStatus instead of
  disconnecting immediately when ss.statusData is still nil. The shared
  Server's listener comes up in Initialize; the multi-client's first
  SetStatus arrives later via StartStreamLoops in Provider.Start. A new
  GrpcServer.statusReady channel is closed on the first SetStatus, and
  awaitStatus selects on it / the timeout / ctx.Done. Closes the startup
  window where peers would have bounced off DiscProtocolError.

- buildSharedP2PConfig now errors out when every entry in AllowedPorts is
  busy. The legacy per-protocol loop had this guard; my rewrite dropped it
  and would silently keep the last listenPort and fail on bind.

- buildSharedP2PConfig probes port availability against 127.0.0.1 instead
  of an empty host. checkPortIsFree dials its target, and the common
  ":30303" form would yield "missing host" -> dial fails -> port reported
  free even when actually in use. The bug pre-existed in the legacy
  per-protocol loop too; cleaning it up here while the surrounding code is
  in motion. The ListenAddr returned to the Server keeps the original
  empty host so the listener still binds on all interfaces.

Co-Authored-By: Claude
Adds focused unit tests covering the new shared-Server wiring that landed in
this PR, addressing the two test-coverage items from the Copilot review:

p2p/sentry:
  - SetP2PServer flips external/reportsPeers correctly and panics on a
    second injection (ownership is decided up front).
  - Peers() / NodeInfo() on a non-reporter sentry return empty replies so
    admin_peers aggregation across sentries doesn't N-fold every entry.
  - Close() leaves an externally-injected Server running; Close() on a
    self-owned Server stops it.
  - awaitStatus returns immediately when statusData is already set, times
    out cleanly when it isn't, and unblocks as soon as SetStatus stores
    the first non-nil status — the startup-window fix in Protocol.Run.

node/components/sentry:
  - buildSharedP2PConfig points NodeDatabase at nodes/eth, honours an
    explicit ListenAddr when AllowedPorts is empty, picks the first free
    AllowedPorts entry, errors when all are busy, and probes 127.0.0.1
    when the configured host is empty (":port" form).
  - startSharedP2PServer dedupes Protocols by name+version, wires the
    shared p2p.Server into every GrpcServer, and marks only the first as
    the peer-list reporter.
  - Provider.Close stops the shared p2p.Server it created (the GrpcServer
    side leaves it alone by design).

Also exposes GrpcServer.IsPeerReporter for observability — the dedup test
in the components/sentry package needs to inspect which sentry got marked
as the reporter, and it lives in a different package than the unexported
field.

Co-Authored-By: Claude

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

Comment thread p2p/sentry/sentry_grpc_server.go Outdated
Comment thread p2p/sentry/sentry_grpc_server.go Outdated
Comment thread p2p/sentry/sentry_grpc_server.go Outdated

@yperbasis yperbasis left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overview

Fixes a real bug where erigon's local-mode sentry opened one p2p.Server per eth protocol version. Each Server signed its own ENR under the same Node ID but with a different listener port, racing in the discovery DHT — only the highest-seq ENR survives, so peers dial the "wrong" port and inbound peer count stalls at a fraction of --maxpeers.

The fix moves to geth's model: one p2p.Server with the union of all protocol caps multiplexed per peer connection. Side-effect cleanups (admin_peers de-duplication, MaxPeers honesty, externally-owned-Server lifecycle) fall out naturally.

Stats: +642 / −50, 4 files. Scope is well-contained to the sentry component.

What's strong about this PR

  • Real bug with measurable impact. The PR body documents the discovery DHT race precisely, with concrete millisecond-level ENR seq timing and a Hoodi repro. Not a speculative refactor.
  • Architectural alignment. Matches geth's design (one Server, multiplexed protocols). Reduces port surface and removes a class of races.
  • Clean separation. buildSharedP2PConfig (pure) and startSharedP2PServer (lifecycle) are each small and testable; tests in provider_test.go exercise both directly. SetP2PServer on GrpcServer is a minimal, well-documented injection hook.
  • Backwards-compatible. Standalone cmd/sentry and remote --sentry.api.addr mode are untouched — they continue to use the lazy-server path inside SetStatus. The new statusReady channel doesn't change behavior for those (the constructor still creates it; awaitStatus returns immediately if status is already set).
  • Good test coverage. Eleven new tests across two packages, including: dedup, reporter wiring, idempotent SetP2PServer panic, external-server lifecycle, owned-server lifecycle, awaitStatus fast/timeout/unblock paths, empty-host port probing, all-ports-busy error path. All pass under go test -race.

I ran the full ./node/components/sentry/... and ./p2p/sentry/... test packages locally with -race -count=1 — all pass cleanly.

Concerns / suggestions

1. Doc/code mismatch: which sentry reports peers (p.Servers[0])

The PR description and the startSharedP2PServer doc comment both claim:

"Exactly one GrpcServer is marked as the peer-list reporter (the first one in p.Servers, which matches the highest entry in ProtocolVersion)"

But nodecfg/defaults.go:51 has:

ProtocolVersion: []uint{direct.ETH69, direct.ETH70, direct.ETH71},

lowest first, highest last. So p.Servers[0] is ETH69, not the highest. The new test in provider_test.go puts eth/71 as first, which is the opposite of the production default ordering.

Functionally this doesn't matter — p2pServer.PeersInfo() returns the global list regardless of which sentry calls it. But the doc comment will mislead a reader. Either fix the comment ("the first configured protocol version, typically the lowest") or, if "highest" is intentional, sort p.Servers by Protocols[0].Version desc before the i == 0 reporter pick. The latter is cleaner if you care about which goodPeers map the wit/0 traffic lands in (see #4 below).

2. Unlocked reads of external / reportsPeers in Peers() / NodeInfo()

p2p/sentry/sentry_grpc_server.go Peers() and NodeInfo() read these fields without holding p2pServerLock:

p2pServer := ss.getP2PServer()    // takes RLock, releases it
if p2pServer == nil { ... }
if ss.external && !ss.reportsPeers {   // unlocked read
    return &sentryproto.PeersReply{}, nil
}

SetP2PServer writes both fields under Lock. In practice this is fine because SetP2PServer is called once during Initialize, strictly before any gRPC handler can fire — there's a happens-before edge through the multi-client wiring. go test -race didn't flag it because the tests don't drive that race.

But it's a latent risk for future changes (e.g., a hot-reconfig path). Cheap fix: capture both fields inside getP2PServer (rename to return a triple, or add a sibling getServerState() (*p2p.Server, bool, bool)).

3. SetP2PServer panics on double-call

if ss.p2pServer != nil {
    panic("sentry.GrpcServer: SetP2PServer called when p2pServer is already set")
}

The invariant ("ownership decided up front") is sound, but a panic in node startup is harsh. An error return would let the provider fail Initialize cleanly with the same diagnostic. Not a blocker — the panic message is descriptive — but worth reconsidering.

4. wit/0 dedup picks the first server's handler

In startSharedP2PServer, the dedup keeps the first wit/0 Protocol encountered (currently from p.Servers[0], i.e. the ETH69 sentry by default ordering). So every wit-protocol peer's runWitPeer runs on the ETH69 sentry's goodPeers map — even peers that negotiate eth/71 with the ETH71 sentry. That peer ends up in p.Servers[0].goodPeers with protocol=0, witProtocol=0 AND in p.Servers[2].goodPeers with protocol=71.

Effect on runPeerCountLogger: such peers get counted under protocol=0 in the ETH69 sentry's bucket and under their actual eth version in the highest sentry's bucket — double-counted across protocols in the log line. admin_peers is unaffected (gated to the reporter, uses PeersInfo()). Cosmetic logging quirk, not a correctness issue. Worth a one-line note in the comment, or — more cleanly — bind the deduped wit Protocol to the reporter sentry so wit state co-locates with the global peer view.

5. awaitStatus: 10s timeout is reasonable but undiscoverable

awaitStatusTimeout = 10 * time.Second is a package-private constant. It governs how long an inbound peer's Protocol.Run waits for the multi-client to broadcast its first SetStatus. The window between the listener coming up (in Initialize) and the multi-client calling SetStatus (after BuildMultiClient + Start) is normally sub-second, so 10s is plenty.

But if anything blocks the startup path (slow chain init, large snapshot reindex, debugger), peers connecting in the window get a silent DiscProtocolError. Worth a debug-level log on timeout so the operator can correlate "no inbound for the first N seconds" with the cause. The current code drops to nil silently and the caller logs PeerErrorLocalStatusNeeded — which doesn't make the timeout vs. the actual "core didn't send status" case distinguishable.

6. BootstrapNodes mutation lives in the wrong layer

startSharedP2PServer does:

if cfg.BootstrapNodes == nil && len(chainBootnodes) > 0 {
    bootstrapNodes, err := enode.ParseNodesFromURLs(chainBootnodes)
    ...
    cfg.BootstrapNodes = bootstrapNodes
    cfg.BootstrapNodesV5 = bootstrapNodes
}

This duplicates the exact logic in p2p/sentry/sentry_grpc_server.go makeP2PServer. Since you're already building the p2p.Server yourself in startSharedP2PServer, you can either:

  • Reuse makeP2PServer (export it, or extract a buildBootstrapNodes helper), or
  • Pull the DNS-discovery / bootnodes resolution into a single shared helper that both makeP2PServer and startSharedP2PServer call.

The current duplication is a future-divergence hazard.

7. Minor: awaitStatus doesn't return nil on ctx.Done() deterministically

select {
case <-ss.statusReady:
case <-time.After(maxWait):
case <-ss.ctx.Done():
}
return ss.GetStatus()

On ctx.Done(), the function returns whatever GetStatus() is — which could be non-nil if SetStatus ran concurrently and ctx.Done() won the select. Probably fine (we just return the latest status either way), but the doc comment says "returns nil on timeout so the caller can disconnect the peer" without mentioning the ctx.Done case. Tighten the comment or explicitly return nil when ctx is done.

8. Test gap: end-to-end Initialize with the new shared-Server path

The new unit tests cover buildSharedP2PConfig and startSharedP2PServer in isolation, but no test exercises Provider.Initialize end-to-end in local mode (which is what the bug actually lived in). I understand that requires a full ChainDB + chainspec stub which is non-trivial, but it would be the strongest regression guard for the original DHT-race scenario. At minimum, a test that asserts after Initialize with len(ProtocolVersion) >= 2:

  • p.sharedP2PServer != nil
  • len(p.Servers) == len(ProtocolVersion)
  • p.Servers[0].GetP2PServer() == p.Servers[1].GetP2PServer() == p.sharedP2PServer
  • p.Servers[0].IsPeerReporter() == true && p.Servers[1].IsPeerReporter() == false

would catch any future regression that breaks the wiring without a full network stand-up.

9. Backport plan

The PR notes that release/3.4 will need a separate cherry-pick PR landing the fix directly in node/eth/backend.go (no sentryProvider abstraction there). Given this is a peer-connectivity bug affecting users on 3.4 right now, that backport should be tracked as a follow-up immediately after merge. Per CLAUDE.md conventions, the title prefix is [r3.4].

Security / correctness

Nothing concerning. The new code:

  • Doesn't introduce new network endpoints — actually closes off two (--port+1, --port+2).
  • Doesn't change cryptographic material (same Node ID, same ENR signing).
  • Doesn't loosen any peer admission checks.
  • Bounds the new awaitStatus window so a stuck startup can't cause unbounded blocking.

Performance

  • One less p2p.Server per protocol version: less goroutine overhead, one discovery instance instead of N.
  • One DNS-discovery iterator per eth/N protocol still (created in each NewGrpcServer) — same DNS list. Could be deduped to one DNS iterator shared across protocols, but it's not visibly costly and matches geth's per-cap-iterator pattern. Out of scope for this PR.

Verdict

Solid PR. The core change is correct, well-tested, and addresses a documented production bug. My concerns are all polish (doc accuracy, future-proofing locking, factoring duplicated logic). Recommend:

  1. Fix the "highest entry in ProtocolVersion" comment in provider.go and the PR body — it's the first configured entry (which is the lowest in the default).
  2. Add a one-line debug log when awaitStatus times out.
  3. (Optional but nice) Capture external/reportsPeers under p2pServerLock in Peers/NodeInfo for hygiene.
  4. Track the [r3.4] backport as a follow-up.

Items 5-8 above are nice-to-haves that can land later. Nothing in this PR blocks merge.

Findings from Copilot (3) and yperbasis (8). The Copilot findings on Peers
routing, empty NodeInfo, and close(statusReady) are real; yperbasis
identified the matching doc/code mismatches and a few polish items.

1. Peers() / message routing (Copilot #1). The multi-sentry client uses
   Peers() to decide which sentry owns each peer and routes SendMessageById
   via that sentry's gRPC. With the previous reporter-only gating, every
   peer mapped to Servers[0] (which is the *lowest* configured protocol,
   ETH69 in the default — yperbasis #1 noted the comment said "highest").
   Servers[0]'s goodPeers doesn't have eth/70 or eth/71 peers, so
   SendMessageById silently no-op'd.

   Fix: each GrpcServer.Peers() now reports its own goodPeers, filtered to
   skip entries where both protocol and witProtocol are zero. Each peer
   ends up in exactly one eth-sentry's goodPeers (the negotiated version)
   plus, at most, the sentry hosting the wit sideprotocol, so admin_peers
   aggregation is naturally non-duplicating and routing is correct.

2. NodeInfo() (Copilot #3). Non-reporters returning empty replies
   polluted admin_nodeInfo with blank entries that sorted first. With the
   shared p2p.Server every sentry has the same Node ID and the same enode,
   so they now return identical NodeInfo. node/eth.NodesInfo deduplicates
   by Enode before sorting.

3. SetStatus's close(ss.statusReady) (Copilot #2) panicked for callers
   that construct GrpcServer outside NewGrpcServer (existing
   TestSentryServerImpl_* does this). Guarded the close with a nil check;
   awaitStatus tolerates a nil channel via the select's other cases.

4. SetP2PServer returns an error instead of panicking on double-call
   (yperbasis #3). The "ownership decided up front" invariant is still
   enforced, just propagated up the Provider.Initialize path cleanly.

5. SimplePeerCount filters protocol=0 ghosts (yperbasis #4). With wit/0
   deduped to one sentry, peers that negotiate eth/N on a different sentry
   end up as protocol=0/witProtocol=0 ghosts on the wit-hosting sentry.
   Counting them would emit a bogus eth.ProtocolToString[0] bucket in the
   GoodPeers log. The Peers() filter already drops them from admin_peers;
   this aligns SimplePeerCount.

6. awaitStatus logs a Debug line when its timeout fires (yperbasis #5)
   so operators can tell "core didn't send status in time" from "core
   never tried" when the caller disconnects the peer with
   PeerErrorLocalStatusNeeded. Doc comment clarifies the ctx.Done case
   too (yperbasis #7).

Drops the reportsPeers flag and IsPeerReporter accessor — no longer needed
once per-sentry goodPeers replaces the reporter gating. SetP2PServer
signature loses its second argument.

Tests updated for the new shape; new TestGrpcServer_PeersReturnsPerSentryGoodPeers
(per-sentry view + ghost-entry filter) and TestGrpcServer_SetStatus_NilStatusReadyIsSafe
(close-nil guard).

Deferred to follow-up:
- BootstrapNodes/DNS resolution helper to dedupe logic between
  makeP2PServer and startSharedP2PServer (yperbasis #6).
- End-to-end Provider.Initialize test in local mode (yperbasis #8).
- [r3.4] backport PR (yperbasis #9).

Co-Authored-By: Claude

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Comment thread p2p/sentry/sentry_grpc_server.go
Comment thread node/components/sentry/provider.go Outdated
Comment thread node/components/sentry/provider.go
Three new findings, all real lifecycle / probing edge cases.

1. SetP2PServer accepted a nil *p2p.Server and still flipped external=true.
   A subsequent SetStatus would then take the lazy path and build its own
   Server, but Close() would see external=true and skip Stop() — leaking
   the listener and discovery goroutines. Reject nil at the door.

2. startSharedP2PServer leaked the shared p2p.Server on inject failure:
   srv.Start succeeds and p.sharedP2PServer is assigned, but if any
   SetP2PServer call later errors, the partial Server was left running
   when Initialize returned. Stop it and clear the field on that path.

3. checkPortIsFree only handled the empty-host form of an unspecified
   bind address. 0.0.0.0:N and [::]:N would also DialTimeout-fail and
   read as "port is free" even when a listener was actually bound on
   all interfaces. Add a loopbackProbeHost helper that maps all four
   unspecified forms (empty, 0.0.0.0, ::, [::]) to a concrete loopback
   target so the probe actually exercises the listener; ListenAddr
   itself keeps the original host.

Tests added:
- TestGrpcServer_SetP2PServer_RejectsNil — nil rejection + external flag
  invariant.
- TestLoopbackProbeHost — table-driven coverage of the four unspecified
  forms and concrete-host passthrough.

Co-Authored-By: Claude

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Comment thread node/eth/backend.go Outdated
Copilot review on PR #21335: with dedupe-by-Enode in place, the previous
for-i loop over the first `limit` sentries could return fewer than `limit`
unique entries if duplicates appeared early. Treat `limit` as the cap on
*unique* results and keep iterating the rest of the sentry list past
duplicates so a hybrid layout (some sentries sharing a p2p.Server, others
external) can still fill the cap.

Co-Authored-By: Claude

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Comment thread p2p/sentry/sentry_grpc_server.go Outdated
Comment thread node/components/sentry/provider.go Outdated
Comment thread node/components/sentry/provider_test.go Outdated
CI failed on windows-2025 with:

  expected: "/data/nodes/eth"
  actual:   "\\data\\nodes\\eth"

TestBuildSharedP2PConfig_NodeDatabaseUnified hardcoded a forward-slash
expectation, but filepath.Join uses OS-native separators. Build the
expected value with filepath.Join from a t.TempDir base so the assertion
holds on Windows too.

Co-Authored-By: Claude
@lystopad lystopad enabled auto-merge May 22, 2026 07:14

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (3)

p2p/sentry/sentry_grpc_server.go:1212

  • The “already connected” checks later in getOrCreatePeer read existingPeerInfo.protocol / existingPeerInfo.witProtocol without taking existingPeerInfo.lock, while those fields are written under that lock (e.g. SetEthProtocol). This is a Go data race and can make the one-connection-per-protocol guard flaky under concurrent handshakes. Please read these fields under PeerInfo.lock (or via locked accessors).
		peerInfo := NewPeerInfo(peer)
		ss.peers.peers[peerID] = peerInfo
		return peerInfo, nil
	}

p2p/sentry/sentry_grpc_server.go:1683

  • peerInfo.protocol is read here without taking peerInfo.lock, but it’s written under that lock (via SetEthProtocol). Since Peers() can be called concurrently with ongoing handshakes, this is a Go data race. Consider reading the protocol once under peerInfo.lock and using that local value for filtering.
	ss.rangePeers(func(peerInfo *PeerInfo) bool {
		if peerInfo.protocol == 0 || peerInfo.protocol != myVersion {
			return true
		}

p2p/sentry/sentry_grpc_server.go:1716

  • SimplePeerCount reads peerInfo.protocol without peerInfo.lock, but the field is written under that lock. This can show up as a race under load (RPC calls vs handshake goroutines). Read it once under peerInfo.lock and use the local value for both the filter and the increment.
	ss.rangePeers(func(peerInfo *PeerInfo) bool {
		// Per-sentry counting: each sentry reports only the peers whose
		// negotiated eth version matches its own. Mirrors Peers(). Without
		// this filter the Provider's runPeerCountLogger would aggregate
		// every peer once per sentry (shared PeerStore makes the map
		// visible to all sentries).
		if peerInfo.protocol == 0 || peerInfo.protocol != myVersion {
			return true
		}
		counts[peerInfo.protocol]++
		return true

Comment thread p2p/sentry/sentry_grpc_server.go Outdated
@yperbasis yperbasis added this to the 3.5.0 milestone May 22, 2026
Round-10 Copilot review (3 "low confidence" comments suppressed in the
UI, but all real): pi.protocol and pi.witProtocol are written under
pi.lock (SetEthProtocol, the wit Run handover) but read in several
places without the lock — Go data races under concurrent handshakes vs
admin_peers / SimplePeerCount / the message routers.

Add locked accessors:
  - EthProtocol() uint   — mirrors SetEthProtocol
  - SetWitProtocol(uint) — new setter; wit Run now uses it
  - WitProtocol() uint

Replace bare reads at:
  - getOrCreatePeer's "already connected" checks (one-connection-per-
    protocol guard)
  - Peers() filter
  - SimplePeerCount filter+count
  - SendMessageById error message
  - SendMessageToRandomPeers protocol-version filter
  - SendMessageToAll protocol-version filter

Also dropped a duplicate `peerInfo.protocol = protocol` write in the
eth Run — SetEthProtocol on the line above already stored it under
pi.lock; the bare write was a second, unsynchronised mutation.

Verified with `go test -race ./p2p/sentry/`.

Co-Authored-By: Claude

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Comment thread node/components/sentry/provider.go
Comment thread p2p/sentry/sentry_grpc_server_test.go
yperbasis and others added 2 commits May 22, 2026 13:48
ss.peers becomes atomic.Pointer[PeerStore] so SetSharedPeerStore is
race-free under the documented "swap before srv.Start" contract — a
bare field write would trip the race detector if anyone ever reshuffled
the call ordering. Readers Load() once and use the inner mutex as
before.

Cache the eth protocol version in GrpcServer.ethVersion (set in
NewGrpcServer) so Peers / SimplePeerCount don't rescan ss.Protocols on
every call; the ethProtocolVersion() helper goes away. Both filters now
combine the cached version with #88bae21e's locked EthProtocol()
snapshot.

writePeer's nil-rw drop branch now emits a Trace log carrying the
(protocol, msgID, peerID) so an operator chasing a missing message has
a fingerprint to grep for instead of guessing.

startSharedP2PServer validates that every GrpcServer has a nil
p2pServer before mutating any of them — defends against a future change
that makes SetP2PServer fail mid-loop and leaves half the sentries
holding the shared PeerStore.

Verified with `go test -race ./p2p/sentry/ ./node/components/sentry/`
and `make lint`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rs to 64

Follow-up to the single-shared-Server change. With one p2p.Server carrying
all eth versions on --port, --p2p.allowed-ports (which selected one TCP port
per protocol from a list) no longer means anything. Removing it:

  - Drops the P2pProtocolAllowedPorts CLI flag and its registration in
    node/cli/default_flags.go.
  - Drops the allowedPorts parameter from utils.NewP2PConfig and the
    flag wiring in cmd/sentry/main.go.
  - Drops the AllowedPorts field from p2p.Config.
  - Replaces the AllowedPorts fallback loop in buildSharedP2PConfig with
    a straight ListenAddr passthrough; if --port is busy the bind now
    fails fast (matches geth).
  - Removes the obsolete port_util.go helpers (checkPortIsFree,
    loopbackProbeHost, splitAddrIntoHostAndPort) and the four
    AllowedPorts/probe-host tests in provider_test.go.
  - Strips AllowedPorts from three test helpers (polygon miner, engine
    api tester, txpool p2p client) — they relied on []uint{0} as an
    ephemeral hint, but ListenAddr already does that.

Also bumps default MaxPeers 32 -> 64. Pre-fix with 3 eth protocol Servers
the effective cap was ~3*32 = 96; geth defaults to 50. 64 lands above geth
with a small headroom so existing operators don't see a sharp drop after
upgrading.

Also fixes the stale per-protocol qualifiers in flag Usage strings
(MaxPeers, MaxPendingPeers) and the NodesDir doc comment that still
referenced eth68/eth69 subdirectories.

Docs:
  - README.md port table collapses sentry 30303/30304 -> 30303.
  - docs/site/docs/fundamentals/default-ports.md table + Sentry CLI
    section updated; --p2p.allowed-ports bullet removed.
  - docs/site/docs/fundamentals/multiple-instances.md port-allocation
    table updated.
  - ChangeLog.md gets a [3.5.0] Breaking Changes entry with a before/
    after table and migration bullets.

Verified with `go test -race ./p2p/sentry/ ./node/components/sentry/`,
`make lint`, and `make erigon integration`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The flag is gone (see prior commit) — the three skill files that
documented or scripted it (erigon-network-ports, erigon-ephemeral,
launch-devnet) would otherwise tell agents to pass a flag that
errors at startup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 1 comment.

Comment thread p2p/sentry/sentry_grpc_server.go
yperbasis and others added 2 commits May 22, 2026 14:22
The docs-site CI step `python3 docs/site/scripts/generate-llms.py --check`
caught that the committed llms-full.txt artifacts drifted from regenerated
content after the default-ports.md / multiple-instances.md edits in the
prior commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot review (#21335 r3288296443): SetStatus is a gRPC entrypoint
exposed over the wire by cmd/sentry, but it dereferences
statusData.ForkData unconditionally (genesisHash on the entry, and
HeightForks/TimeForks for the ENR entry builder). A malformed payload
with nil ForkData would crash the sentry via nil-pointer panic.

Guard at the door: return a clear error if either statusData or
statusData.ForkData is nil. Test covers both shapes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lystopad lystopad added this pull request to the merge queue May 22, 2026
Merged via the queue into main with commit a612441 May 22, 2026
71 checks passed
@lystopad lystopad deleted the feature/lystopad/p2p_unified_server branch May 22, 2026 14:26
pull Bot pushed a commit to Dustin4444/erigon that referenced this pull request May 25, 2026
…s, findPeerByMinBlock, PeerEvents (erigontech#21394)

Follow-up fixes for erigontech#21335 (merged).

## Bugs fixed

With the shared `PeerStore` introduced in erigontech#21335, every `GrpcServer` can
see **all** peers regardless of which eth protocol version they
negotiated.
Three `rangePeers` loops were missing the per-sentry version filter that
`Peers()` and `SimplePeerCount()` already apply correctly:

### 1. `findBestPeersWithPermit` / `findPeerByMinBlock`

Used by `SendMessageByMinBlock` to select target peers. Without the
filter, an eth/71 sentry could pick an eth/69 peer:

- `messageCode` encodes the request with **eth/71** wire codes.
- `writePeer` sends it over the eth/69 peer's `ethRw`.
- The remote eth/69 peer receives an unrecognised (or misinterpreted)
  code and disconnects.

Contrast: `SendMessageToRandomPeers` and `SendMessageToAll` already
filter with `protocolVersions.Contains(peerInfo.EthProtocol())`.

### 2. `PeerEvents` replay

The replay pass at the start of `PeerEvents` iterated the entire shared
store and emitted a `Connect` event for every peer. Subscribers for
eth/68 would receive `Connect` events for eth/69 and eth/71 peers.
Those peers never generate a `Disconnect` on the eth/68 sentry's
`peersStreams`, so the subscriber accumulates ghost peers that are never
cleaned up.

## Fix

Apply the same two-condition guard already used by `Peers()` and
`SimplePeerCount()`:

```go
if pv := peerInfo.EthProtocol(); pv == 0 || pv != ss.ethVersion {
    return true
}
```

## Tests

Three new tests added to `sentry_grpc_server_test.go`:

- `TestGrpcServer_FindBestPeersWithPermit_FiltersVersion`
- `TestGrpcServer_FindPeerByMinBlock_FiltersVersion`
- `TestGrpcServer_PeerEvents_ReplayFiltersByVersion`

All existing tests pass. `make lint` clean ×2.
chris-mercer pushed a commit to white-b0x/erigon that referenced this pull request Jun 3, 2026
…ies (erigontech#21597)

## Summary

Fixes the hive devp2p `BlobViolations` flake (`expected disconnect on
blob violation, got msg code: 24`) seen on the CI Gate run for erigontech#21524
(parallel leg) and on a runner-experiment branch the day before (serial
leg) — and the gossip-duplication regression behind it.

## Root cause

Since erigontech#21335 all per-eth-version sentry `GrpcServer`s share one
`p2p.Server` and one `PeerStore`. `SendMessageById` resolves the target
peer in the shared store, so callers that fan a by-id send out across
every sentry client — e.g. the txpool's new-peer pool sync
(`PropagatePooledTxnsToPeersList`) — now deliver the same frame once per
sentry: three identical `NewPooledTransactionHashes` messages per new
peer on the default eth/69+70+71 build. Before erigontech#21335 each sentry had
its own peer set, so the off-sentry sends were silent no-ops.

`SendMessageToAll` and `SendMessageToRandomPeers` already guard against
this with `protocolVersions.Contains(peerInfo.EthProtocol())`;
`SendMessageById` was the only outbound path without the
negotiated-version check.

The BlobViolations failure sequence, reproduced locally with wire-level
logging (hive `--sim devp2p --sim.limit eth` against an instrumented
image; failed on the second loop iteration with the exact CI error):

1. The test peers, announces two blob txs (one with a lying size/type),
delivers them, and waits for a disconnect.
2. The violating `PooledTransactions` reply sits in the txpool fetcher's
250 ms inbound batch before the announcement check kicks the peer.
3. If the 5 s `syncToNewPeersEvery` tick lands in that window, the
all-pool announcement (~70 KB — about 2000 hashes accumulated by earlier
suite tests) is written to the test peer three times:

```
08:16:32.606  inbound NPTH68            peer=de6008eb count=2
08:16:32.607  WRITE GET_POOLED_TXS_66   peer=de6008eb
08:16:32.607  sync-to-new-peers         entries=2002            <- 5s tick fires
08:16:32.608  WRITE NPTH68 70084 bytes  peer=de6008eb           <- one write
08:16:32.608  WRITE NPTH68 70084 bytes  peer=de6008eb           <- per
08:16:32.608  WRITE NPTH68 70084 bytes  peer=de6008eb           <- sentry
08:16:32.855  VIOLATION -> PenalizePeer peer=de6008eb           <- next 250ms batch flush
```

4. The devp2p test deliberately tolerates one pre-disconnect hash
announcement (go-ethereum commit 88c8459, Sept 2024, "sometimes we'll
get a blob transaction hashes announcement before the disconnect"); the
duplicated second copy arrives before the kick and fails the test.

In the failed CI run the timing lines up exactly: the txpool started at
06:33:04.64 and the test failed at 06:33:14.644 — the tick+10s sync.

With the duplication fixed, at most one announcement can precede the
disconnect in that window, which the test tolerates by design. Beyond
the test, every new peer was receiving the full pool sync in triplicate.

## Fix

Apply the same negotiated-eth-version guard in `SendMessageById` that
the other outbound paths use. Eth-protocol messages only; wit is
unaffected (it is deduplicated to a single `GrpcServer` at shared-server
construction). This also restores the routing contract documented at
`FetchBlockAccessLists` ("sentryIndex MUST be the sentry where peerID is
actually connected"): sends via a non-matching sentry are silent no-ops
again, as they were with per-sentry peer sets.

Not present on `release/3.4` (erigontech#21335 is main-only), so no backport is
needed.

## Testing

- New regression test
`TestGrpcServer_SendMessageById_SharedStore_NoDuplicateWrites` (TDD):
three `GrpcServer`s (eth/69/70/71) sharing a `PeerStore`, peer
negotiated eth/70, by-id fan-out across all three must produce exactly
one write. Red before the fix (`expected: 1, actual: 3`), green after.
- `go test ./p2p/sentry/... ./txnprovider/txpool/...` clean.
- Local hive validation in a loop: before the fix `not ok 18
BlobViolations` reproduced on iteration 2; after the fix 12/12
iterations green, with instrumented logs showing exactly one
`NEW_POOLED_TRANSACTION_HASHES_68` write per new peer where there were
three.
- `make lint` (repeatedly) and `make erigon integration` clean.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants