fix(blockvalidation): treat transient catchup errors as incomplete, not invalid (#1031)#1036
Conversation
…s transient, not invalid ErrTxMissingParent/ErrTxNotFound from validateBlockSubtrees are catchup-state conditions, not consensus violations. Persisting them as invalid=true permanently stalls sync. Return BlockIncomplete so catchup retries instead. Refs bsv-blockchain#1031.
…invalid getParentTxMetaBlockIDs converted a not-yet-present parent tx into BlockInvalid, which the block-validation consumer then persisted. During catchup the parent tx simply hasn't been absorbed yet. Return BlockIncomplete so callers can retry. Update the two TestCheckParentExistsOnChain assertions that pinned the old BlockInvalid classification. Refs bsv-blockchain#1031.
…k.Valid failures block.Valid now surfaces BlockIncomplete for a not-yet-present parent tx. The ValidateBlock consumer no longer stores such blocks as invalid, and the gRPC ValidateBlock handler returns BlockIncomplete instead of BlockInvalid. Refs bsv-blockchain#1031.
…ent catchup errors The catchup classifier counted ErrTxMissingParent/ErrTxNotFound as unvalidatable-block peer faults, reporting honest peers as malicious and abandoning retry. Extract the classification into isUnvalidatablePeerError and limit it to genuine consensus failures (BlockInvalid/TxInvalid). Refs bsv-blockchain#1031.
…sv-blockchain#1031 site D) The parent-cascade stays unchanged: with transient errors no longer persisting invalid=true, an invalid parent is invalid for a real consensus reason and the cascade is correct. Add a comment and a regression guard so it is not 'fixed' later.
…r transient missing parent Refs bsv-blockchain#1031.
|
🤖 Claude Code Review Status: Complete Current ReviewNo critical issues found. The fix correctly addresses the transient catchup-state problem described in #1031. Logic verification:
Test coverage:
Safety direction:
Documentation:
|
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-06-04 16:08 UTC |
|
ordishs
left a comment
There was a problem hiding this comment.
Approve. Correct fix in the right direction — transient catchup-state errors (missing parent / tx-not-found) no longer persist invalid=true and instead surface ErrBlockIncomplete, which catchup already retries against alternative peers. Safety direction is right: under-marking soft-stalls and retries, never accepts a bad block.
Verified:
- errgroup propagation preserves
errors.Is(err, ErrBlockIncomplete)up to block.Valid's caller - incomplete check is placed before storeInvalidBlock, after the existing infra filter
- both other block.Valid consumers (optimistic background, reValidateBlock) gate invalidation on ErrBlockInvalid, so they retry rather than poison
- catchup reuses the established 'abort peer / retry alternative / not malicious' path; isUnvalidatablePeerError cleanly narrows malicious flagging
- build clean; model tests pass
Test coverage is thorough (classifier table, sync + gRPC paths end-to-end, isUnvalidatablePeerError truth table, parent-cascade regression guard), and the checkParentInvalid doc-comment correctly pins the non-obvious cascade invariant.
Noted follow-ups (appropriately deferred): the gRPC ValidateBlock storage/service/processing-filter asymmetry, operator recovery for already-poisoned DBs, and the legacy/netsync blockExists re-validation gap.



Fixes #1031.
Problem
ValidateBlocktreatedErrTxMissingParentandErrTxNotFoundfromvalidateBlockSubtreesas consensus violations and persisted the block asinvalid=true. Both are catchup-state conditions — the parent tx isn't in our store yet because we haven't absorbed its block. Once persisted invalid, the block can never be re-validated; sync stalls permanently at that height across restarts, recoverable only by manualUPDATE blocks SET invalid=falsein Postgres. Hit on ttn-eu-4: one bad catchup window stamped 20 consecutive blocks invalid and wedged sync.Fix
Principle: only genuine consensus failures persist
invalid=true. Missing-data conditions returnErrBlockIncomplete, which catchup already retries against another peer. Safety direction holds — under-marking soft-stalls and retries (never accepts a bad block); over-marking permanently rejects good blocks (the bug).Sites changed:
validateBlockSubtreesclassifier (BlockValidation.go) —ErrTxInvalidstill persists invalid;ErrTxMissingParent/ErrTxNotFoundnow returnErrBlockIncomplete, not persisted.model.getParentTxMetaBlockIDs— a not-yet-present parent tx (and empty-BlockIDs) returnsErrBlockIncompleteinstead ofBlockInvalid. This also fixes the two untouchedblock.Validconsumers (optimistic-mining background path andreValidateBlock), which gate invalidation onErrBlockInvalid.block.Validconsumers — the synchronousValidateBlockhandler no longerstoreInvalidBlocks a transient; the gRPCValidateBlockhandler returnsBlockIncompleteinstead ofBlockInvalid.Server.go) — extractedisUnvalidatablePeerError(=BlockInvalid || TxInvalid); transient errors no longer flag honest peers malicious or abandon peer-retry.checkParentInvalid— cascade left unchanged, now documented: with transient errors no longer persisting invalid, an invalid parent is invalid for a real consensus reason, so cascading is correct. (BlockHeaderMetahas no reason field to classify on anyway.)Test plan
go test -race -tags testtxmetacache ./services/blockvalidation/ ./model/— greengo build ./...,go vet— cleanErrTxInvalid→ persisted); model missing-parent → incomplete; synchronous + gRPCblock.Validtransient → not persisted/returned invalid;isUnvalidatablePeerErrortable; parent-cascade regression guard.Notes
BlockIncomplete(errgroup first-error) and is retried. Fails safe — never accepted; re-validation still catches the invalid tx once data resolves, or soft-stalls otherwise.BlockIncompletemessages render%!s(MISSING)from a trailing: %swith no arg — matches the file's existingNewBlockInvalidErrorconvention; the wrapped error chain still carries the real cause. Left for consistency.Out of scope (follow-ups)
legacy/netsync/handle_block.goblockExistsearly-return: doesn't re-validate an existing-but-invalid block (compounds the stall).BlockHeaderMetainvalid-reason field that would enable richer parent-cascade classification.ValidateBlockinfra-filter asymmetry (no storage/service/processing filter on that path).