fix(legacy): remove duplicate peer registration that can wedge sync#946
Conversation
…ndidates Every non-multistream outbound peer was being registered with the sync manager twice: once from OnVersion (peer_server.go:648) and again from handleAddPeerMsg (peer_server.go:1885). Multistream peers had the same duplicate via OnProtoconf + handleAddPeerMsg. In normal operation the two newPeerMsgs landed back-to-back on the sync manager's msgChan and the second was a harmless no-op-overwrite. But if ANY operation made blockHandler slow to drain msgChan (we saw this when block 244024 took 7m23s under postgres pressure), a peer's lifecycle events would queue out of phase: a Disconnect during handshake could put newPeerMsg, donePeerMsg, newPeerMsg on msgChan for the same peer pointer. When blockHandler finally drained, the resulting Set -> Delete -> Set sequence left a permanently disconnected peer pointer in peerStates. startSync then re-elected that zombie forever, pushed getheaders into a closed socket, waited the full maxLastBlockTime (3 min) for the violation to fire, rotated to the next zombie, and so on. Mainnet sat at height 244024 for over 48h cycling between 5-6 dead peer structs while the connection manager held zero live sockets to :8333. Fix: remove the redundant NewPeer call from handleAddPeerMsg. Each peer is now registered exactly once -- in OnVersion for non-multistream peers and in OnProtoconf (after openRequiredStreams) for multistream peers -- matching the timing constraints already documented on those two sites. As defence-in-depth, also skip peers whose socket has already been torn down (peer.Connected() == false) when ranging over peerStates in startSync. If a future regression slips a dead peer into the map again, this stops it from being elected and stalling sync until the next violation tick. Tested: - go build ./services/legacy/... - go vet ./services/legacy/... - go test -race ./services/legacy/netsync/... (93 tests pass)
…hannels Follow-up to the prior commit (eliminating the duplicate NewPeer that strands zombies in peerStates). Two further hardening changes: 1. handleNewPeerMsg now early-returns when peer.Connected() is false. Pairs with the Connected() filter in startSync: dead pointers are now blocked at insertion rather than only being skipped at elect-time. Closes the remaining window during which a stale entry can sit in peerStates waiting for a donePeerMsg. 2. NewPeer/QueueTx/DonePeer all wrote unconditionally to the `done` reply channel on the shutdown path. Both production callsites of NewPeer and DonePeer pass nil, so a write during shutdown would panic. Add a nil check on each. Pre-existing latent bug noted during review; folded in because all three are the same one-line pattern. Tests: - TestHandleNewPeerMsg_NilFSMState updated to use a real connected peer (the prior zero-value stub now correctly fails the Connected() guard). - New TestHandleNewPeerMsg_SkipsDisconnectedPeer covers the rejection path. - go test -race ./services/legacy/netsync/... ./services/legacy/ (179 tests pass)
|
🤖 Claude Code Review Status: Complete Current Review: No critical issues found. The changes correctly address the duplicate peer registration race condition. Analysis: The PR successfully fixes a production issue where the legacy sync could permanently wedge due to a subtle message-ordering race. The root cause was duplicate Changes verified:
Notes:
The implementation is sound, well-tested, and addresses the root cause correctly. |
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-06-02 09:05 UTC |
oskarszoon
left a comment
There was a problem hiding this comment.
Approve. Real prod bug (48h wedged mainnet sync), clean scoped fix.
NewPeer removal correct — OnVersion:648 + OnProtoconf:698 cover non-multistream + multistream paths. Post-fix msgChan is clean Set→Delete; fast-reconnect uses fresh *peerpkg.Peer so no ghost-entry aliasing.
0f43413038 defensive add-ons both sound:
Connected()guard at insertion — prunes peers whose socket closed mid-channel-drain.- nil FSM state guard at
:736—state != nilcheck before deref handles transientGetFSMCurrentStatefailure.
TestHandleNewPeerMsg_SkipsDisconnectedPeer + TestHandleNewPeerMsg_NilFSMState cover both. Channel-serialisation model preserved, no new locks. VersionKnown() correctly gates DonePeer. Tombstone comment in handleAddPeerMsg worth keeping.
|
ordishs
left a comment
There was a problem hiding this comment.
Approve — the fix is correct, minimal, and well-reasoned, with real-world mainnet validation.
Verified statically:
- Root cause confirmed:
NewPeeris now called exactly once (OnVersionfor non-multistream,OnProtoconffor multistream); the removedhandleAddPeerMsgcall was the genuine duplicate, eliminating theSet→Delete→Setrace at the source. Connected()(connected != 0 && disconnect == 0) is the right predicate; thestartSyncguard is well-placed and thehandleNewPeerMsgskip is safe — a laterdonePeerMsgfor a skipped peer hits the graceful "unknown peer" return inhandleDonePeerMsg.- Nil-channel guard is correct: both production callers pass
nil, so the shutdown path would have panicked.
Non-blocking follow-ups:
- Add a test asserting
handleAddPeerMsgno longer enqueues anewPeerMsg— this is the primary fix, but current regression tests only cover the defence-in-depth guards, not the duplicate-registration removal itself. - Drop the unused
.Maybe()mock inTestHandleNewPeerMsg_SkipsDisconnectedPeer—handleNewPeerMsgreturns at theConnected()check before reaching the FSM fetch, soGetFSMCurrentStateis never invoked.
One behavioral change worth keeping on the radar (already flagged in the PR description): multistream peers that never send Protoconf are no longer registered as sync candidates. Risk is low since BSV nodes always send protoconf, and the peer is still added to the server peer list via AddPeer.



What goes wrong
The legacy sync gets stuck and won't recover without a restart. We hit this on mainnet — sync sat on the same block for over 48 hours.
When the sync manager gets busy (we triggered it with a block that took 7 minutes to process under Postgres pressure), a race in the peer add/remove path can leave a disconnected peer sitting in the sync manager's peer list forever. The sync manager keeps picking that dead peer, asks it for headers, gets nothing back, gives up after 3 minutes, picks another dead peer, and so on. Forever.
What's actually happening
The legacy code was calling
syncManager.NewPeertwice for every new peer — once fromOnVersion(orOnProtoconffor multistream) and a second time fromhandleAddPeerMsg. Normally harmless — the second call just overwrites the first. But if the peer disconnects between the two calls (e.g. during handshake), three messages end up queued for that peer in this order:The sync manager processes them in order: add, remove, then add the now-dead peer back. From then on, the dead peer sits in the peer map forever.
This is normally invisible because the sync manager drains messages quickly. The moment anything slows it down — a slow block, Postgres latency, GC pause — the window between calls widens and the race fires.
Direct evidence from the wedged log:
Peer IDs are monotonic, so the same id 73 reappearing in sync-peer election 48 hours later proves the same dead struct is still mapped.
The fix
Remove the duplicate registration. Each peer is now registered exactly once: in
OnVersionfor non-multistream peers, inOnProtoconffor multistream peers (matching the existing comments about stream-setup timing).Skip disconnected peers at two checkpoints, as defence-in-depth:
startSyncwhen choosing a sync peerhandleNewPeerMsgwhen inserting into the peer mapFix a pre-existing nil panic in
NewPeer/QueueTx/DonePeer— they wrote to adonereply channel on the shutdown path without checking for nil. Both production callers pass nil, so a shutdown call would panic. All three are the same one-line pattern; folded in here.Behavior to be aware of
Multistream peers that never send
Protoconfare now silently not registered with the sync manager (previously the duplicate call inhandleAddPeerMsgacted as a backup). Practical risk is low — BSV nodes always send protoconf — but flagging it.Test plan
go build ./services/legacy/...go vet ./services/legacy/...go test -race ./services/legacy/netsync/... ./services/legacy/— 179 tests pass