fix(legacy): clear stall deadlines across BlockPriority streams#1030
Conversation
Under the BlockPriority multistream policy a peer is split into two TCP connections (GENERAL + DATA1), each a separate Peer with its own stall handler and pendingResponses map. Teranode sends getheaders/getdata on GENERAL, arming the response deadline there, while the svnode delivers the headers/block reply on DATA1. The GENERAL deadline was never cleared by the DATA1-delivered response, so the sync peer false-disconnected with a spurious "headers timeout" ~90s into a large block download. Changes: - signalReceived() fans the receive/clear signal out to every stream peer in the association, so a response on any stream clears the deadline armed on another. Non-association peers notify only themselves, preserving single-stream behaviour. Sends are guarded by each target's quit channel. - expiredStallResponse() suppresses non-block deadlines while a block fetch is in flight: headers and blocks share the DATA1 stream, so a follow-on getheaders reply is head-of-line blocked behind a multi-GB block. Liveness during a block fetch stays gated by the block's own (much longer) deadline. - On block receipt, queued deadlines are refreshed (extend-only) to close the handoff window once the block stops suppressing them. - Enable legacy_allowBlockPriority by default (settings.conf, code default and struct tag) - required to download large mainnet blocks reliably - and correct the struct-tag description, which wrongly described "block priority hints" instead of the multistream policy. Investigation note: the earlier "extended/chunked block fetch" hypothesis was ruled out. wire.SetLimits(4e9) already allows ~4GB blocks, a 1.1GB block is a single normal block message (extmsg framing only triggers above 4GB per bitcoin-sv protocol.cpp), and the BSV protoconf spec exempts block/cmpctblock from maxRecvPayloadLength. No wire-protocol change is needed. Verified: unit tests for the cross-stream fan-out (red->green) and the suppression matrix; -race on peer/netsync/settings; e2e multistream sync (TestMultistreamLegacySync and the multistream suite) with no stall-timeout regressions.
|
🤖 Claude Code Review Status: Complete Current Review: This PR fixes two critical legacy sync stalls triggered when downloading large blocks with BlockPriority enabled. The implementation is thorough and well-tested. No issues found. The code demonstrates careful engineering:
All changes align with project conventions from CLAUDE.md and the PR description accurately describes the implementation. |
The post-block deadline refresh reset every queued response to the 30s base (stallResponseTimeout), but getheaders is granted 90s (stallResponseTimeout*3) in maybeAddDeadline. A headers reply head-of-line blocked behind a block was therefore refreshed to 30s instead of its 90s budget. Practically safe (the response is already produced and queued behind the block, so only drain latency remains), but make it obviously correct: introduce responseStallBudget(cmd) and refresh each response to its own allowance.
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-06-04 14:19 UTC |
ordishs
left a comment
There was a problem hiding this comment.
Approve — root cause diagnosis is sound and the layered fix (cross-stream clear fan-out, block-in-flight suppression, extend-only refresh) is clean and well-tested. Verified locally: new tests pass under -race, gofmt clean, go vet ./services/legacy/peer/ clean.
A few comments to address (none blocking the mechanics):
1. (Medium) Refresh fires on every relayed tx, not just block completions. The refresh loop lives in the CmdBlock/CmdMerkleBlock/CmdTx/CmdNotFound receive case, so a plain mempool-relayed tx (not a response to our getdata) also refreshes all pending non-block deadlines to their full budget. A genuinely-stalled getheaders could be perpetually deferred while unrelated tx traffic flows. Impact during pure IBD is likely low, but this silently weakens stall detection on the shared handler for all peers. Consider gating the refresh on "a block-family deadline was actually pending" rather than running on any tx receive.
2. (Medium / operational) Default flip to true is the biggest blast radius. Enables multistream negotiation network-wide for every deployment and rewrites the longdesc that previously warned about priority-based attacks. Relies on graceful fallback for peers that don't advertise BlockPriority — worth explicitly confirming that path is exercised. A meaningful default-behavior change worth calling out to operators.
3. (Low) Target scenario unverified. The PR honestly notes the mainnet fat-block (804,008, ~1.1GB) path was never exercised. Unit + E2E coverage proves the deadline-clearing mechanics, but not that a real 1.1GB block now syncs end-to-end — the one claim the fix exists to satisfy. A live mainnet run past 804,008 (with HandleBlockDirect ... DONE and no headers timeout) should ideally gate merge.
4. (Low) Hot-path fan-out cost / teardown edge. Every received message now does N cross-goroutine stallControl sends (buffer 1, quit-guarded — no deadlock, just ~2x stall-control traffic). And a peer mid-RemoveStream (association set but no longer in a.streams) won't self-notify via StreamPeers(). Both narrow/low risk.
Mechanics LGTM; please address #1 and confirm the live mainnet check (#3) before relying on this for production large-block sync.
Addresses PR review (ordishs bsv-blockchain#1): the deadline refresh ran for the whole block-family receive case, so an unsolicited relayed tx (a CmdTx that was not a response to our getdata) also refreshed every pending non-block deadline. Ambient tx traffic could therefore perpetually defer a genuinely stalled getheaders, silently weakening stall detection on the shared handler for all peers. Extract clearBlockResponseGroup() and only refresh the remaining deadlines when a block-family response was actually outstanding. Also (review bsv-blockchain#4) guarantee signalReceived notifies the peer itself even when it has already been removed from the association's stream set (mid-teardown), rather than relying solely on StreamPeers(). Tests: clearBlockResponseGroup (block completion refreshes queued deadline; unsolicited tx does not; tx-as-getdata-response does) and the self-notify teardown edge.
…ress
Second stall mechanism, distinct from the per-message stall handler. The
netsync watchdog (handleCheckSyncPeer) rotates the sync peer when no
block completes within maxLastBlockTime (3 min) — and in headers-first
mode it skips only the speed check, not this one. A multi-GB block (e.g.
~3 GB at height 804157) streams in on the DATA1 stream and completes no
block within 3 min, so a healthy, actively-downloading sync peer is
disconnected mid-download ("updateSyncPeer - disconnect old sync peer"),
the header state is reset to the pre-block height, and the next peer
repeats the loop forever.
Root cause (same family as the deadline orphan): every progress signal
here is message-granular. lastBlockTime updates only on a completed
block, and updateNetwork sampled only the GENERAL peer's BytesReceived,
which excludes the DATA1 stream actually carrying the block.
Add byte-granular download progress and use it to tell an active
download apart from a stall:
- activityConn wraps the peer conn so each read advances a byte-granular
readBytes counter (and refreshes lastRecv) even while a single large
message is still streaming in. Association.ReadBytes sums it across all
streams (GENERAL + DATA1).
- netsync: sample association readBytes per tick; suppress the
last-block-time rotation while association throughput is healthy
(>= minSyncPeerNetworkSpeed, 50 KiB/s default, comfortably above
chatter and far below block-transfer rates). A genuinely stalled peer
delivers no throughput and is still rotated; a withholding-but-chatty
peer is dropped by the peer layer's block-response deadline.
- peer.go: while a block response is pending and association throughput
is healthy, extend the block deadline instead of disconnecting, so a
block that legitimately takes longer than stallResponseTimeoutBlocks
(5 min) to arrive still completes.
Tests: hasHealthyDownloadThroughput matrix; CheckSyncPeer keeps an
actively-downloading peer despite last-block-time; activityConn byte
counting; Association.ReadBytes sum.
|
Thanks. Addressed in #1 — good catch. The refresh now lives in #2 — fallback for non-BlockPriority peers is covered by #4 — #3 — this is the important one. The 1.1GB block at 804,008 now syncs end-to-end on mainnet with the deadline fix. But the next fat block (~3GB at 804,157) exposed a second, independent stall: netsync's
The 3GB/804,157 path is unit-verified but not yet proven live — I'll confirm it on mainnet the same way as 804,008 before this merges. Separately: the red |
ordishs
left a comment
There was a problem hiding this comment.
Re-review of the latest revision (through 952a2fba0). Verified locally on the PR head: go build OK, go test -race ./services/legacy/peer/ ./services/legacy/netsync/ green, gofmt clean, go vet clean. Good test coverage of the new logic.
Prior comments resolved:
- ✅ #1 (refresh on ambient tx) —
clearBlockResponseGroupnow gates the deadline refresh onblockWasPending, so an unsolicited relayed tx clears the group without deferring a genuinely-stalledgetheaders. - ✅ #5 (teardown self-notify edge) —
signalReceivednow always notifies self first, then fans to siblings (skipping self). - ✅ #3 (fat-block path) — now handled in code rather than left as a manual step. The new byte-granular progress signal (
activityConnwrappingconn.Read,Peer.ReadBytes, association-wide aggregation) is checked at both layers: the peer stall handler extends a block deadline while the association reads ≥50 KiB/s, and netsync suppresses themaxLastBlockTimerotation while throughput is healthy. This is a clean, well-tested solution to the core multi-GB-block problem.
New residual concern (Medium, security): the byte-progress gate introduces a slow-drip rotation-evasion vector. activityConn.Read counts raw socket bytes regardless of message validity, and a healthy throughput now overrides both the peer-layer block deadline and netsync's maxLastBlockTime (3 min) rotation. A malicious sync peer can declare a large block and dribble bytes at just ≥50 KiB/s without ever completing a valid block — holding the single sync-peer slot and stalling IBD until the 4 GB wire limit forces disconnect (~22 h). Previously maxLastBlockTime would rotate a peer that completed no block regardless of throughput; that backstop is now overridden by throughput, and 50 KiB/s is a low bar. The honest-peer tradeoff is reasonable, but consider an absolute wall-clock cap per in-flight block (bound the number of throughput-based extensions) so a withholding-but-chatty peer is eventually rotated.
Minor (Low): the peer-layer extension handles one expired block-family command per tick (Block/MerkleBlock/Tx/NotFound are armed together but expiredStallResponse returns the first by map iteration). No disconnect occurs, so it's functionally correct — siblings just get extended one-per-tick. Tidy-up only.
Still verified as not-done in the PR body: an actual mainnet run past block 804,008. The byte-progress logic + tests de-risk it substantially, but the live confirmation remains the real proof.
LGTM to merge once you've decided on the slow-drip cap (#new) and have the mainnet confirmation.
Code ReviewReviewed all 4 commits. The diagnosis of both stall mechanisms is well-reasoned, the commit history shows good responsiveness to earlier review rounds, and test coverage is strong. Verified locally: One design concern I'd like resolved before merge, plus a smaller correctness nit. 🟠 The two fixes together remove the hard cap on how long one peer can hold the sync slot without delivering a blockThe PR body / commit message state that a "withholding-but-chatty peer is dropped by the peer layer's block-response deadline." I don't think that holds, because the peer layer also extends on healthy throughput: // peer.go:1711
if isBlockResponseCommand(command) && healthyDownload {
pendingResponses[command] = now.Add(stallResponseTimeoutBlocks) // extend, not disconnect
}
Previously
At minimum, please correct the PR/commit narrative so the residual risk isn't documented as already mitigated. There's also currently no test exercising this extend branch. 🟡 Unsigned underflow when association
|
Addresses PR review (security, medium): the byte-progress gate let a malicious sync peer dribble bytes at just above the 50 KiB/s threshold without ever completing a valid block, holding the single sync-peer slot and stalling IBD for ~22h (until the 4 GB wire limit forced disconnect). Healthy throughput overrode both the peer-layer block deadline and netsync's maxLastBlockTime rotation with no upper bound. Add MaxBlockDownloadTime (30 min) as an absolute wall-clock ceiling on a single block fetch, shared by both layers: - peer.go tracks blockFetchStart; shouldExtendBlockDeadline extends only while throughput is healthy AND within the cap. Past it the block deadline fires and the peer is disconnected regardless of throughput. - netsync caps the last-block-time suppression at the same window: past it the sync peer is rotated regardless of throughput. A 4 GB block need only average ~2.3 MB/s to finish inside the window, so honest fat blocks are unaffected (the verified 1.1 GB block completed in under 3 min on mainnet). The cap is tunable via the shared constant. Also (review, low): the peer-layer extension now refreshes the whole block-response group together each tick rather than one command per tick by map-iteration order. Tests: shouldExtendBlockDeadline cap matrix; netsync rotates a slow-drip peer once past the cap.
…lper Addresses ordishs' fuller review: - (correctness) Association.ReadBytes sums over the streams present at sample time, so removing a stream (e.g. DATA1) between samples drops the sum. The unsigned delta subtraction would then wrap to a huge value and read as "healthy" exactly when a stream just died — wrongly extending a block deadline / suppressing rotation. Guard both deltas (peer stall handler and netsync hasHealthyDownloadThroughput) so a decrease counts as zero progress. - (cleanup) Export Peer.AssociationReadBytes and drop the duplicate netsync free function of the same name, removing the cross-package naming collision. - (cleanup) Document the buffered-1 fan-out coupling in signalReceived (no deadlock risk; stall handlers never send back). Tests: hasHealthyDownloadThroughput treats a counter decrease as unhealthy.
|
Addressed in 🟠 hard cap removed / "dropped by block deadline" claim — correct, that claim was wrong: the peer layer extends on healthy throughput too, so the 5-min ceiling wasn't actually a backstop. Fixed two ways:
🟡 unsigned underflow — real bug, fixed. 🟢 naming — exported 🟢 channel coupling — noted in the Default flip — kept independently revertible: Mainnet run past 804,157 remains the real gate and will land before merge as noted. |
Re-review — both findings resolved ✅Reviewed the two follow-up commits ( 🟠 → resolved: slow-drip liveness bound
Nice extras: the extend now refreshes the whole block-response group together (fixes the map-iteration-order nit), and the narrative/comments were corrected. The logic is unit-tested at the cap boundary ( 🟡 → resolved: unsigned underflowBoth delta sites now guard the subtraction (peer stall handler treats a decrease as 🟢 cleanups picked up
Residual note (non-blocking)The two caps measure from different origins — peer.go from LGTM pending the mainnet height-804,157 verification the PR already commits to before merge. |
|
Mainnet validation: synced past block 804,157 (~3 GB) end-to-end. ~5 min to download, then 2m17s to fully validate and process — |
ordishs
left a comment
There was a problem hiding this comment.
Approving. Both review findings are resolved cleanly:
- Slow-drip liveness hole bounded by the shared
MaxBlockDownloadTime(30 min) cap, enforced at both the peer and netsync layers. - Unsigned throughput-delta underflow guarded at both sample sites.
Verified against the PR head (2df3dcf43): go vet clean, peer + netsync tests pass under -race (including the new cap-boundary and counter-decrease cases). The default flip to legacy_allowBlockPriority = true is the highest-blast-radius change — please keep it changelog-visible and ideally independently revertible. Ship after the mainnet height-804,157 verification the PR already commits to.
…eceipt
Observed on mainnet: block 804157 (~3 GB) downloaded fine (~5.6 min — the
byte-progress gate held across the whole download), but the sync peer was
then disconnected ("updateSyncPeer - disconnect old sync peer")
mid-processing, ~7 min after sync start, and a fresh peer + header
re-download took its place. Progress wasn't lost (the block was already
downloaded, so it still got accepted), but it churned away the good, fast
sync peer on every fat block.
Two compounding causes:
1. updateLastBlockTime fired only at block *accept* — after the
multi-minute HandleBlockDirect validation — so the timer was stale
throughout processing.
2. The update was gated on `peer == syncPeer`, but under BlockPriority the
block is delivered on the DATA1 stream peer (a different Peer from the
GENERAL sync peer), so it never fired at all during multistream sync.
Fix: syncPeerStateFor matches the current sync peer OR any stream of its
association, and the sync peer's last-block-time is now refreshed at block
*receipt* (HandleBlockDirect entry) as well as at accept. A peer that just
delivered a multi-GB block is therefore not rotated while we validate it.
Residual: a block whose validation alone exceeds maxLastBlockTime (3 min)
could still trip a mid-processing rotation; 804157 validated in 2m17s.
Left as a follow-up if observed on larger blocks.
Tests: syncPeerStateFor matches the sync peer and an association sibling,
not unrelated peers.
|
Mainnet update: the ~3GB block at 804,157 synced end-to-end with the byte-progress fix — downloaded in ~5.6 min with no rotation during download, accepted, and sync advanced to 804,158. So the core fat-block path works. One residual surfaced and is fixed in Residual noted: validation alone exceeding |
Re-review —
|
|



Problem
Legacy sync stalls on large blocks once
legacy_allowBlockPriorityis enabled. After BlockPriority negotiates and the DATA1 stream opens, headers download, but the sync peer is then disconnected mid block-download. There are two distinct stall mechanisms; this PR fixes both.Both share a root cause: under the BlockPriority stream policy, block/header replies arrive on the DATA1 stream (a separate
Peerfrom the GENERAL sync peer), and every progress signal the watchdogs relied on was message-granular — so a multi-GB block that takes minutes to stream in registers as "no progress" until it fully completes.Stall 1 — per-message stall deadlines orphaned across streams
peer.go's stall handler arms a response deadline on the peer that sent the request (GENERAL), but BlockPriority delivers the response on DATA1 (a differentPeerwith its own handler). The GENERAL deadline is never cleared → spurious"<cmd> timeout"(observed: a 90sheaders timeout~90s into a 1.1GB block at height 804,008).Fix:
signalReceivedfans the receive/clear to every stream peer of the association (always notifies self, then siblings; quit-guarded).expiredStallResponsedoes not treat non-block deadlines (headers, inv) as stalls while a block response is pending (headers reply is head-of-line blocked behind the block on DATA1).responseStallBudget); gated on a block actually having been pending so unsolicited tx relay can't defer a stalledgetheaders.The "extended/chunked block fetch" theory was ruled out:
wire.SetLimits(4e9)already permits ~4GB blocks, a <4GB block is a single normalblockmessage (extmsgframing only triggers above 4GB per bitcoin-svprotocol.cpp), and the BSV protoconf spec exemptsblock/cmpctblockfrommaxRecvPayloadLength. No wire-protocol change needed.Stall 2 — netsync sync-peer rotation during a fat-block download
netsync'shandleCheckSyncPeerrotates the sync peer when no block completes withinmaxLastBlockTime(3 min); in headers-first mode it skips only the speed check, not this one. A ~3GB block at height 804,157 streams in on DATA1 and completes no block within 3 min, so the healthy, actively-downloading peer is disconnected (updateSyncPeer - disconnect old sync peer), the header state is reset to the pre-block height, and the next peer repeats the loop forever (lastBlockTimeonly updates on a completed block;updateNetworksampled only the GENERAL peer's bytes, excluding DATA1).Fix — byte-granular download progress:
activityConnwraps the peer conn so each read advances a byte-granularreadBytescounter (and refresheslastRecv) even while one large message is still streaming in.Association.ReadBytessums it across all streams.readBytesper tick and suppresses the last-block-time rotation while association throughput is healthy (≥minSyncPeerNetworkSpeed, 50 KiB/s default — above chatter, far below block-transfer rates).stallResponseTimeoutBlocks(5 min) still completes.Abuse bound (
MaxBlockDownloadTime, 30 min)healthyDownloadis gauged from association-wide read bytes, not block-stream bytes specifically — so a peer could in principle fake ≥50 KiB/s ofinv/addr/pingfiller without ever delivering a valid block. To stop that holding the single sync-peer slot indefinitely, both layers cap the throughput-based reprieve at an absolute wall-clock window (peer.MaxBlockDownloadTime, 30 min). Past it the peer is disconnected (peer layer) / rotated (netsync) regardless of throughput. So: a peer delivering no throughput is dropped immediately by the existing deadlines; a filler-faking peer is dropped within ~30 min. Honest fat blocks are unaffected (a 4 GB block need only average ~2.3 MB/s; the verified 1.1 GB block completed in <3 min on mainnet). Unsigned throughput-delta subtractions are guarded against a stream being removed mid-window.Stall 3 — sync-peer rotated mid block-validation
Even with Stall 2 fixed, a fat block that finishes downloading is then validated for minutes inside
HandleBlockDirect(extend txs / createUtxos / validate / subtree writes — ~2m17s for the 3GB block 804,157). During that window the sync peer is idle (download done → 0 throughput) andupdateLastBlockTimehad not refreshed because (a) it fired only at block accept, after validation, and (b) it was gated onpeer == syncPeer— but the block arrives on the DATA1 stream peer, a differentPeerfrom the GENERAL sync peer, so under multistream it never fired at all. Result: the 3-minmaxLastBlockTimecheck rotated a healthy, just-delivered sync peer mid-validation (updateSyncPeer - disconnect old sync peer), churning to a slower peer and re-downloading headers per fat block. Non-fatal (the block was already downloaded, so it still got accepted), but wasteful.Fix:
syncPeerStateFormatches the current sync peer or any stream of its association, so a block delivered on DATA1 counts as the sync peer's.HandleBlockDirectentry) as well as at accept, so the minutes-long validation isn't mistaken for a stall.Residual: a block whose validation alone exceeds
maxLastBlockTime(3 min) could still trip a mid-validation rotation (804,157 validated in 2m17s, under the bar); left as a follow-up if observed on larger blocks.Default flip
legacy_allowBlockPrioritynow defaults totrue(settings.conf, code default, struct tag) — required for reliable large-block download. The struct-tag description was also corrected (it wrongly described "block priority hints" rather than the multistream stream policy). Fallback for non-BlockPriority peers is covered byTestMultistreamBackwardCompatibility/TestMultistreamOnlyStandardPeer/TestMultistreamDisabledRejectsConnection.This is the highest-blast-radius change here (it alters P2P negotiation for every node on upgrade) and is kept independently revertible from the bug-fix logic — set
legacy_allowBlockPriority = falseto back it out without reverting the stall fixes. Worth release-note visibility.Verification
hasHealthyDownloadThroughputmatrix, CheckSyncPeer keeps an actively-downloading peer,activityConnbyte counting,Association.ReadBytessum.-raceonservices/legacy/peer,services/legacy/netsync,settings;go vet,gofmt,go build ./...clean.TestMultistreamLegacySync+ siblings) passes; no stall-timeout regressions.