fix(legacy): prevent sync stalls from backpressure-triggered peer rotation and dropped orphan continuation#1067
Conversation
…nding OnBlock backpressures network reads while a block is queued or mid-validation (it blocks on blockProcessed), so zero throughput and a stale last-block-time during local validation measure our own speed, not the peer's. The detector counted those ticks as network-speed violations and rotated a healthy sync peer mid-catchup; the rotated peer's queued DATA1 blocks then failed the peerStates lookup, spamming SERVICE_ERROR (59) and forcing a re-download from the new peer. Track the local backlog (blockQueue depth plus the block inside handleBlockMsg) and skip stall checks while it is non-zero. Throughput samples keep updating, and a genuinely stalled peer still fails the checks once the backlog drains.
|
🤖 Claude Code Review Status: Complete Current Review: No issues found. Both fixes are minimal, correct, and well covered by tests.
The peer-agnostic violation counter and the unbounded suppression are acknowledged tradeoffs in the PR description, not regressions. Note: the two pre-existing inline threads on |
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-06-11 09:52 UTC |
…sync In the legacy sync protocol the peer pushes its tip inv after delivering a getblocks batch and sends nothing further until the next getblocks arrives — the orphan tip block is the batch-continuation signal. The legacy-sync/catching-blocks branch dropped orphans without responding, so the pipeline ran dry at every unexpected tip delivery and sync stalled until the stall detector rotated the sync peer. Drop the special case: answer every ErrBlockNotFound with a getblocks from our best block, as the non-legacy branch (and upstream btcd's orphan path) already did. PushGetBlocksMsg filters duplicates and the peer only invs blocks past the locator fork point, so a redundant request mid-batch costs one inv message.
|
+1 to both of Siggi's comments on the
|
fd8272c to
8fcb0a0
Compare
|
Split done: fix 3 moved to #1073 with the provenance flag reworked as an explicit On the blockBacklog notes: the missing MaxBlockDownloadTime bound is intentional — suppression only persists while blocks keep arriving and queuing, and a peer that keeps delivering isn't stalled; rotation can't fix slow local validation. Documented in the PR body now. The peer-agnostic counter is a fair point; scoping it to the sync peer's association is a small change and I'd do it as a follow-up rather than grow this PR further. |
|
ordishs
left a comment
There was a problem hiding this comment.
Approving the trimmed PR (now blockBacklog + orphan getblocks continuation, both in services/legacy/netsync/). The contested Mined/SkipPolicyChecks change was split out, so the earlier blocking objection no longer applies here.
Fix #1 (blockBacklog) — accounting is balanced (single producer/consumer site), atomic usage is correct, and the defer ordering keeps throughput samples fresh. Solid.
Fix #2 (orphan getblocks) — unifying all ErrBlockNotFound orphans onto the getblocks-from-best-block path is correct: in legacy sync the peer's tip inv after a batch is the continuation signal, so swallowing it stalled sync until the detector rotated the peer. Matches the non-legacy branch and upstream btcd. The getblocks-storm concern is bounded by PushGetBlocksMsg's consecutive-duplicate filter (begin hash advances with our best block, so real continuations aren't filtered). TestHandleBlockMsg_OrphanDuringCatchup is a genuine regression guard.
Verified at head 8fcb0a08:
go test -race -run 'TestHandleCheckSyncPeer_LocalBacklog|TestHandleBlockMsg_OrphanDuringCatchup' ./services/legacy/netsync/ → PASS
go test -race ./services/legacy/netsync/ → ok (full package)
go vet ./services/legacy/netsync/ → clean
Non-blocking follow-ups:
- Stale comment at
manager.go:1378(// Create a block locator starting from the parent hash) — the code locates from the best block. - Fix #1's backlog skip early-returns before the
MaxBlockDownloadTimecap and the counter is peer-agnostic; acceptable tradeoffs, worth a note in the comment.
…mber resolution flag to proto field 12 Conflicts resolved: - validator_api.proto: bsv-blockchain#1073 took field 11 for in_block (merged upstream first, owns the number); unconfirmed_parents_at_candidate_height moves to field 12. Both fields kept. - validator_api.pb.go: regenerated from the resolved proto. - validator/Client.go: buildValidateTxRequest carries both InBlock (upstream) and UnconfirmedParentsAtCandidateHeight (this branch). WIRE SKEW WARNING for the existing beta deployment: the beta build sends the resolution flag as field 11, which post-merge means in_block. A beta subtreevalidation talking to a post-merge validator would silently set the wrong option (wedge returns, wrong provenance). Replace beta deployments wholesale; do not mix beta and post-merge services.
…ation and dropped orphan continuation (#1067)
…-legacysyncing Eleven upstream commits, several touching legacy code this branch also changed. Resolutions: - netsync/manager.go: upstream bsv-blockchain#1067 deleted the catchingBlocks early-return in handleBlockMsg (the swallowed orphan tip is the legacy batch-continuation signal; swallowing it stalled sync). Took upstream's always-request behaviour and updated the surrounding comments to post-LEGACYSYNCING wording. - subtreevalidation/Server.go: kept upstream bsv-blockchain#1065's richer rationale for disabling block-assembly additions, on this branch's CATCHINGBLOCKS-only condition. - Re-applied the LEGACYSYNCING removal to symbols upstream reintroduced: validator publishPolicyRejectedTx gate (bsv-blockchain#799) now checks CATCHINGBLOCKS only; the new legacy gate-pinning tests (bsv-blockchain#1065) drive CATCHINGBLOCKS instead of the removed state; comment references updated in adaptivefetch, Validator_test, and the Server.go incident note.



Two fixes for legacy sync stalling mid-catchup (observed on testnet in CATCHINGBLOCKS past the checkpoints).
Fix 1: stall detector blames the sync peer for local validation backpressure
OnBlockblocks on<-sp.blockProcessed(peer_server.go:966), so the node stops reading from the sync peer while a block validates.checkSubtreeFromBlock(full validation path past the highest checkpoint). Measured throughput dropped to zero, network-speed violations accrued each 30s tick, and at 3 violationshandleCheckSyncPeerdisconnected the healthy sync peer and removed it frompeerStates.peerStateslookup (primary peer gone):SERVICE_ERROR (59): [handleBlockMsg] Received block message from unknown peer.The violation counter also ratchets: it only resets on a good-throughput tick, so violations accumulated across earlier backpressure ticks.
Fix: track the local block backlog on
SyncManager(blockBacklog: blockQueue depth plus the block insidehandleBlockMsg). While it is non-zero,handleCheckSyncPeerskips both stall checks — zero throughput during local validation measures our own validation speed, not the peer's health. The deferredupdateNetworkstill runs, so throughput samples stay fresh for the first tick after the backlog drains. A genuinely stalled peer delivers no blocks, so the backlog is empty and the detector behaves exactly as before.Tradeoffs (raised in review, intentional): the suppression is not bounded by
MaxBlockDownloadTime— rotation cannot fix slow local validation, and the suppression only persists while the sync peer keeps delivering blocks, which is the opposite of stalled. The counter is currently peer-agnostic; scoping it to the sync peer's association is a possible follow-up hardening.Fix 2: orphan blocks dropped during sync break the getblocks continuation
In the legacy sync protocol the peer pushes its tip inv after delivering a getblocks batch and sends nothing further until the next getblocks arrives — the orphan tip block IS the batch-continuation signal (upstream btcd answers any sync-peer orphan with
PushGetBlocksMsg). The legacy-sync/catching-blocks branch ofhandleBlockMsgdropped orphans with a barereturn nil:After that, the pipeline ran dry and sync stalled until the stall detector rotated the sync peer (~90s dead time plus re-download) — the rotation machinery was accidentally the only recovery path.
Fix: remove the special case — answer every
ErrBlockNotFoundwith getblocks from our best block, as the non-legacy branch already did. Safe mid-batch:PushGetBlocksMsgfilters duplicate requests,requestedBlocksdedups in-flight getdata, and the peer only invs blocks past the locator fork point, so a redundant request costs one inv message (≤500 hashes).Testing
TestHandleCheckSyncPeer_LocalBacklog: backlog pending → peer kept, no violation accrued; backlog drained → same zero-throughput state still rotates.TestHandleBlockMsg_OrphanDuringCatchup: orphan in CATCHINGBLOCKS → no error, no disconnect, and the getblocks continuation path runs (block locator requested).go test -race ./services/legacy/netsync/clean,go vetclean,golangci-lint0 issues.