fix(blockassembly): split catch-up from reorg in processNewBlockAnnouncement#903
Conversation
…ncement Prior to this change, any gap ≥ 2 between the assembler's current block and the blockchain tip was routed through handleReorg. If moveBack was zero (pure catch-up after downtime), handleReorg would still evaluate the large-reorg reset condition. Because moveForward could be ≥ CoinbaseMaturity after a long downtime, this incorrectly triggered a full reset of block assembly state. Fix: call getReorgBlocks once at the top of the gap ≥ 2 case, then branch on moveBack length: - moveBack == 0 → handleCatchUp: walk MoveForwardBlock one block at a time, stop on first error, never touch reset/Reorg paths. - moveBack > 0 → handleReorg: original signature unchanged; handles the existing large-reorg reset and small-reorg Reorg() paths unchanged. Adds TestProcessNewBlockAnnouncement_CatchupVsReorg with 7 subtests covering gap=0, gap=1, gap=2 catch-up, 150-block catch-up without reset, 1-block reorg, large-reorg reset guard, and mid-catch-up error propagation.
|
🤖 Claude Code Review Status: Complete SummaryThis PR correctly addresses issue #898 by splitting catch-up (moveBack=0) from genuine reorg paths. The fix is sound and well-tested. Core Fix Verified:
Previous Inline Comments Status:
Test Coverage: No issues found. The implementation matches the fix description, the tests are thorough, and the logic correctly prevents the CoinbaseMaturity reset bug during forward-only catch-ups. |
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-05-20 09:06 UTC |
f5b5637 to
2c523d4
Compare
2c523d4 to
a4cd5ef
Compare
a4cd5ef to
54b58e2
Compare
…ly-catchup-vs-reorg
|
ordishs
left a comment
There was a problem hiding this comment.
LGTM. Targets a real, reproduced production bug; the regression test (large catch-up moveBack=0 moveForward=150 no reset) directly pins the bypass; and the implementation reuses the well-understood reorgBlocks fast path rather than introducing a new one.
Verified the key claims:
SubtreeProcessor.go:2794fast path: iteratesmoveForwardBlockper block with rollback at lines 2820-2828 on mid-loop failure. ✓- Non-nil empty slice required:
nilis rejected at line 2778, fast path triggers atlen == 0on line 2794. ✓ - Error path:
processNewBlockAnnouncementreturns beforesetBestBlockHeader, so BA stays aligned with the subtree processor's rolled-back state. ✓
Suggestions (non-blocking)
-
Duplicate
getReorgBlockson the reorg branch. The new code callsgetReorgBlocksat the top of the gap-≥-2 case, thenhandleReorgcalls it again atBlockAssembler.go:1303. For an N-block catch-up that turns out to be a reorg, that's ~N extraGetBlockcalls. The PR description acknowledges this as "zero blast radius on the reorg implementation," which is defensible — but a cheap follow-up would be to thread the already-fetched slices intohandleReorg. -
handleCatchUp(_ context.Context, ...). The ctx parameter is unused (Reorg uses its own channel-based lifecycle). Either drop it or thread it through for symmetry withhandleReorg. Minor. -
Per-block notification semantics during multi-block catch-up. The fast path passes
skipNotificationsAndDequeue = truefor every block except the last, so downstream consumers only see notifications from the final catch-up block. Matches the existingreorgBlocksfast-path contract and is a clear improvement over the broken Reset path — but it's a real semantic change versus a hypothetical "loopMoveForwardBlockdirectly" approach. Worth a line in the PR description acknowledging this. -
Test resource cleanup.
injectMockStpswapssubtreeProcessoraftersetupBlockAssemblyTesthas already started the real one (line 741). The cleanup hook then callsStop()on the mock, so the real processor's goroutines survive to process exit. Not a correctness issue, but consider stopping the real processor before swapping. -
Cosmetic: the
moveForward → []*model.Blockconversion inhandleCatchUpmirrors lines 1329-1341 ofhandleReorg. Could be a helper. Not blocking.
Follow-up worth doing
Confirm in mainnet legacy sync (the bsva-ovh-teranode-ttn-eu-4 scenario from #898) that the catch-up path actually runs and the 10 GB heap profile is gone. The unit test exercises the routing; production validates the memory shape.



Closes #898.
Summary
processNewBlockAnnouncementpreviously routed any gap ≥ 2 between the assembler's current block and the blockchain tip throughhandleReorg. WhenmoveBack == 0(pure catch-up after downtime),handleReorgwould still evaluate the large-reorg reset condition; withmoveForward ≥ CoinbaseMaturityafter a long downtime, this incorrectly triggered a full reset of block assembly state.Fix
Pre-fetch reorg blocks at the top of the gap ≥ 2 case, then branch on
moveBacklength:moveBack == 0→ newhandleCatchUphelper: delegates tosubtreeProcessor.Reorg([]*model.Block{}, blocks).reorgBlocksalready has a fast path forlen(moveBackBlocks) == 0 && len(moveForwardBlocks) > 0(SubtreeProcessor.go:2794) that iteratesmoveForwardBlockper block withskipNotification = trueandskipDequeue = truefor every block except the last, and rolls back the subtree processor to pre-catchup state on any mid-loop failure (lines 2820-2828). On error,processNewBlockAnnouncementreturns early soBlockAssembler.setBestBlockHeaderis not called either; BA and the subtree processor both stay at the pre-catchup tip, aligned. Note: pass an empty (non-nil)[]*model.Block{}for moveBack —reorgBlocksrejects nil at line 2778 but takes the fast path atlen == 0.moveBack > 0→handleReorg(ctx, header, height)unchanged. The function body is byte-identical to upstream; the only cost on the reorg path is one additionalgetReorgBlockscall, accepted as the price of zero blast radius on the reorg implementation.New
prometheusBlockAssemblerCatchUpcounter mirrors the existing reorg counter for ops visibility.Test plan
TestProcessNewBlockAnnouncement_CatchupVsReorgwith 7 subtests:gap=0 no-op— same hash returns early without state change.gap=1 moving up— existing default branch, singleMoveForwardBlock.gap=2 catch-up moveBack=0— new catch-up path; stubsReorg, asserts called withlen(moveBack) == 0and the expectedmoveForwardslice.large catch-up moveBack=0 moveForward=150 no reset— pins the regression the fix addresses; 150-block forward gap routes throughReorgfast path, never throughhandleReorg's reset gate.real reorg moveBack=1— existing reorg path unchanged.large reorg triggers reset— existinghandleReorgreset gate still fires for genuine reorgs at depth ≥ CoinbaseMaturity.catch-up mid-error stops— stubsReorgto return an error; assertsCurrentBlock()returns the pre-catchup tip (becauseprocessNewBlockAnnouncementreturns early on error, leaving BA'ssetBestBlockHeaderunreached, andreorgBlocksrolls back the subtree processor's state in the same scenario).make lint-newclean,go test -race ./services/blockassembly/...passes.