fix(blockvalidation): don't short-circuit catchup on forked peer's zero headers#718
Conversation
…ro headers When the filterHeaders step leaves zero new headers to process, the catchup function was returning nil. The outer peer-selection loop treats nil as success and stops trying other peers — so a peer on a dead/shorter fork could silently block the node from reaching blocks that honest peers have. Observed on teratestnet: node stuck at height 15630 while the network was at 15637. Every new block announcement arrived on Kafka, was sent to catchup, fell through to a peer whose tip was on a 46-block-old fork, and that peer returned zero new headers after common-ancestor filtering. Catchup logged "already synced" and returned nil, leaving the peer-selection loop happy and the node frozen. Fix: before declaring success when no new headers are left, verify blockUpTo is actually in our chain. If it is, legitimate "already synced". If it is not, return an error so the peer-selection loop tries the next peer. The check is extracted into handleNoNewHeaders so it can be unit tested directly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
🤖 Claude Code Review Status: Complete Current Review: This PR fixes a peer-selection bug where a forked peer returning zero new headers would cause catchup to succeed without verifying the target block exists locally, preventing the node from trying other peers. ✅ All issues resolved:
Analysis Summary: The fix is well-reasoned: when Design note: Using The tests cover all four critical paths: target exists locally (success), target not known (error), existence check fails (error), and verification that the correct hash is being checked. No further issues found. |
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-05-14 08:48 UTC |
There was a problem hiding this comment.
Pull request overview
Fixes catchup peer-selection behavior so a forked/short peer that yields zero headers after filtering no longer short-circuits catchup as “success” unless the target block is already known locally.
Changes:
- Refactors the “zero headers after filterHeaders” early-exit into
handleNoNewHeaders. - Adds an existence check for
blockUpTobefore returning success; returns an error when the target is not known. - Adds unit tests covering the new helper behavior and arguments passed to
GetBlockExists.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
services/blockvalidation/catchup.go |
Routes the zero-headers early-exit through handleNoNewHeaders and adds the target-existence verification. |
services/blockvalidation/catchup_forked_peer_test.go |
Adds test coverage for the new handleNoNewHeaders behavior and hash used in the existence check. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| func (u *Server) handleNoNewHeaders(ctx context.Context, blockUpTo *model.Block) error { | ||
| exists, err := u.blockchainClient.GetBlockExists(ctx, blockUpTo.Hash()) | ||
| if err != nil { | ||
| return errors.NewProcessingError("[catchup][%s] peer returned no new headers and block existence check failed: %v", blockUpTo.Hash().String(), err) |
Copilot
AI
Apr 22, 2026
•
There was a problem hiding this comment.
errors.NewProcessingError treats a trailing error argument as the wrapped error and removes it from the format params. Because the format string still includes : %v, this will produce %!v(MISSING) in the message and lose the intended formatting. Prefer omitting the : %v placeholder and pass err as the wrapped error (or include err.Error() as a non-error param if you need it in the formatted message).
| return errors.NewProcessingError("[catchup][%s] peer returned no new headers and block existence check failed: %v", blockUpTo.Hash().String(), err) | |
| return errors.NewProcessingError("[catchup][%s] peer returned no new headers and block existence check failed", blockUpTo.Hash().String(), err) |
✓ Fixed - The : %v placeholder has been removed and the error is correctly passed as a wrapped error.
There was a problem hiding this comment.
Good catch — errors.NewProcessingError/errors.New strip the trailing error before calling fmt.Errorf, so : %v would have rendered as %!v(MISSING). Removed the : %v placeholder and kept err as the wrapped error (it still appears in the wrapped chain via Error()).
| // A peer on a dead or shorter fork can legitimately return zero new headers even when | ||
| // blockUpTo is unknown to us. Returning nil unconditionally would make the outer | ||
| // peer-selection loop treat this as success and stop trying other peers, leaving the | ||
| // node unable to sync past the announced block. Only treat it as "already synced" | ||
| // when blockUpTo is actually in our chain; otherwise surface an error so the caller |
Copilot
AI
Apr 22, 2026
•
There was a problem hiding this comment.
The comment says to verify blockUpTo is "in our chain", but GetBlockExists only checks for presence in the blocks table (see stores/blockchain/sql/GetBlockExists.go), which can be true even for non-main-chain blocks. Either tighten the check to main-chain membership (e.g., fetch the block meta/ID then call CheckBlockIsInCurrentChain) or adjust the wording/error messages to reflect that you're only checking local existence.
✓ Fixed - The wording has been updated to "known locally" throughout (doc comment, log message, error text) and includes a note explaining why local existence is sufficient for this use case.
There was a problem hiding this comment.
You're right — GetBlockExists is a blocks-table presence check, not main-chain membership. I went with adjusting the wording rather than tightening the check, because for this specific path (decide whether to keep trying other peers), local presence is actually what we want: if we already have the announced block in any form, including as a stale off-chain block, we have already advanced to or past it, so there is nothing to catch up to on another peer. I updated the doc comment, log message, and error text to say "known locally" instead of "in our chain" and added a note explaining why local existence is sufficient.
| // TestHandleNoNewHeaders covers the peer-selection short-circuit fix. | ||
| // When a peer returns zero new headers after common-ancestor filtering, the | ||
| // outer peer-selection loop uses the return value of catchup() to decide | ||
| // whether to try other peers. A peer on a dead fork can legitimately return | ||
| // zero new headers while blockUpTo is still unknown to us — in that case we | ||
| // must surface an error so the caller tries the next peer, not pretend we | ||
| // are already synced. |
Copilot
AI
Apr 22, 2026
•
There was a problem hiding this comment.
This test (and the surrounding comment) talks about the target being "in our chain", but the mocked call is GetBlockExists, which is an existence check and not necessarily a main-chain membership check. Consider renaming the test/subtests and adjusting the wording to "target exists locally" unless the production code is updated to check current-chain membership.
✓ Fixed - The subtests have been renamed (TargetExistsLocallyReturnsNil, TargetNotKnownLocallyReturnsError) and test comments updated to match the production semantics.
There was a problem hiding this comment.
Agreed — renamed the subtests (TargetExistsLocallyReturnsNil, TargetNotKnownLocallyReturnsError) and updated the surrounding test comments to match the production semantics (local existence, not main-chain membership).
- Remove stray `: %v` from NewProcessingError format string; the errors package strips trailing errors from params before fmt.Errorf, so `: %v` was rendering `%!v(MISSING)`. Keep err as the wrapped error. - Clarify that handleNoNewHeaders uses a local-existence check, not main- chain membership. Update doc comment, log message, error text, and test subtests/comments to say "known locally" instead of "in our chain". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
oskarszoon
left a comment
There was a problem hiding this comment.
Approve. The fix is the right shape and the prod trigger (teratestnet stuck at 15630 while honest peers were at 15637) is exactly the kind of silent freeze that the old return nil was hiding. Returning nil only when the target is locally known (main chain OR off-chain) and ProcessingError otherwise so the peer-selection loop tries the next candidate is the safe choice. Tests are tight — UsesTargetBlockHash in particular guards the right hash from being silently swapped in a refactor.
Two confirmations before merge:
testCI job is failing on the latest run while everything else is green. Same flake pattern seen on a couple of unrelated PRs this week — worth a re-run to confirm flake vs real (the 4 new tests do exercise new code paths).- Test plan's last item is unchecked: "Deploy to teratestnet and verify it advances past 15630". Has that validation been done? Given the PR claims to fix that exact incident, worth confirming the fix actually unblocks the stuck environment before merge.



Summary
When the
filterHeadersstep leaves zero new headers to process,catchup()was returningnil. The outer peer-selection loop treatsnilas success and stops trying other peers — so a peer on a dead or shorter fork can silently block the node from reaching blocks honest peers hold.Fix: before declaring success when no new headers are left, verify
blockUpTois actually in our chain. If it is, legitimate "already synced". If it is not, return an error so the peer-selection loop tries the next peer.Observed in the wild
On teratestnet, the node got stuck at height 15630 while the network advanced to 15637. Every new block announcement arrived on Kafka, was sent to catchup, fell through to a peer whose tip was on a 46-block-old fork, and that peer returned zero new headers after common-ancestor filtering:
Catchup logged "already synced" and returned nil; the peer-selection loop saw success and never tried icellan (the peer that actually announced the block).
What changed
services/blockvalidation/catchup.go: extract the zero-headers early-exit intohandleNoNewHeaders. CallGetBlockExists(ctx, blockUpTo.Hash())there. Return an error if the target is not in our chain.services/blockvalidation/catchup_forked_peer_test.go: 4 subtests covering target-in-chain, target-not-in-chain, existence-check error, and that the correct hash is being checked.Test plan
go test -race -count=1 ./services/blockvalidation/)go vet ./services/blockvalidation/...clean🤖 Generated with Claude Code