Skip to content

fix(blockvalidation): treat transient catchup errors as incomplete, not invalid (#1031)#1036

Merged
oskarszoon merged 7 commits into
bsv-blockchain:mainfrom
oskarszoon:fix/invalid-block
Jun 5, 2026
Merged

fix(blockvalidation): treat transient catchup errors as incomplete, not invalid (#1031)#1036
oskarszoon merged 7 commits into
bsv-blockchain:mainfrom
oskarszoon:fix/invalid-block

Conversation

@oskarszoon

Copy link
Copy Markdown
Contributor

Fixes #1031.

Problem

ValidateBlock treated ErrTxMissingParent and ErrTxNotFound from validateBlockSubtrees as consensus violations and persisted the block as invalid=true. Both are catchup-state conditions — the parent tx isn't in our store yet because we haven't absorbed its block. Once persisted invalid, the block can never be re-validated; sync stalls permanently at that height across restarts, recoverable only by manual UPDATE blocks SET invalid=false in Postgres. Hit on ttn-eu-4: one bad catchup window stamped 20 consecutive blocks invalid and wedged sync.

Fix

Principle: only genuine consensus failures persist invalid=true. Missing-data conditions return ErrBlockIncomplete, which catchup already retries against another peer. Safety direction holds — under-marking soft-stalls and retries (never accepts a bad block); over-marking permanently rejects good blocks (the bug).

Sites changed:

  • validateBlockSubtrees classifier (BlockValidation.go) — ErrTxInvalid still persists invalid; ErrTxMissingParent/ErrTxNotFound now return ErrBlockIncomplete, not persisted.
  • model.getParentTxMetaBlockIDs — a not-yet-present parent tx (and empty-BlockIDs) returns ErrBlockIncomplete instead of BlockInvalid. This also fixes the two untouched block.Valid consumers (optimistic-mining background path and reValidateBlock), which gate invalidation on ErrBlockInvalid.
  • block.Valid consumers — the synchronous ValidateBlock handler no longer storeInvalidBlocks a transient; the gRPC ValidateBlock handler returns BlockIncomplete instead of BlockInvalid.
  • Catchup classifier (Server.go) — extracted isUnvalidatablePeerError (= BlockInvalid || TxInvalid); transient errors no longer flag honest peers malicious or abandon peer-retry.
  • checkParentInvalid — cascade left unchanged, now documented: with transient errors no longer persisting invalid, an invalid parent is invalid for a real consensus reason, so cascading is correct. (BlockHeaderMeta has no reason field to classify on anyway.)

Test plan

  • go test -race -tags testtxmetacache ./services/blockvalidation/ ./model/ — green
  • go build ./..., go vet — clean
  • New tests: subtree-error classification (transient → incomplete + not persisted; ErrTxInvalid → persisted); model missing-parent → incomplete; synchronous + gRPC block.Valid transient → not persisted/returned invalid; isUnvalidatablePeerError table; parent-cascade regression guard.

Notes

  • Deliberate tradeoff: a block that is both genuinely-invalid and incomplete surfaces BlockIncomplete (errgroup first-error) and is retried. Fails safe — never accepted; re-validation still catches the invalid tx once data resolves, or soft-stalls otherwise.
  • Cosmetic: the new BlockIncomplete messages render %!s(MISSING) from a trailing : %s with no arg — matches the file's existing NewBlockInvalidError convention; the wrapped error chain still carries the real cause. Left for consistency.

Out of scope (follow-ups)

  • legacy/netsync/handle_block.go blockExists early-return: doesn't re-validate an existing-but-invalid block (compounds the stall).
  • Operator-facing recovery for already-poisoned DBs (the SQL stays a manual runbook).
  • BlockHeaderMeta invalid-reason field that would enable richer parent-cascade classification.
  • Pre-existing gRPC ValidateBlock infra-filter asymmetry (no storage/service/processing filter on that path).

…s transient, not invalid

ErrTxMissingParent/ErrTxNotFound from validateBlockSubtrees are catchup-state
conditions, not consensus violations. Persisting them as invalid=true permanently
stalls sync. Return BlockIncomplete so catchup retries instead. Refs bsv-blockchain#1031.
…invalid

getParentTxMetaBlockIDs converted a not-yet-present parent tx into BlockInvalid,
which the block-validation consumer then persisted. During catchup the parent tx
simply hasn't been absorbed yet. Return BlockIncomplete so callers can retry.
Update the two TestCheckParentExistsOnChain assertions that pinned the old
BlockInvalid classification. Refs bsv-blockchain#1031.
…k.Valid failures

block.Valid now surfaces BlockIncomplete for a not-yet-present parent tx. The
ValidateBlock consumer no longer stores such blocks as invalid, and the gRPC
ValidateBlock handler returns BlockIncomplete instead of BlockInvalid. Refs bsv-blockchain#1031.
…ent catchup errors

The catchup classifier counted ErrTxMissingParent/ErrTxNotFound as unvalidatable-block
peer faults, reporting honest peers as malicious and abandoning retry. Extract the
classification into isUnvalidatablePeerError and limit it to genuine consensus
failures (BlockInvalid/TxInvalid). Refs bsv-blockchain#1031.
…sv-blockchain#1031 site D)

The parent-cascade stays unchanged: with transient errors no longer persisting
invalid=true, an invalid parent is invalid for a real consensus reason and the
cascade is correct. Add a comment and a regression guard so it is not 'fixed' later.
@oskarszoon oskarszoon requested review from freemans13, icellan and ordishs and removed request for freemans13 June 4, 2026 14:11
@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

🤖 Claude Code Review

Status: Complete


Current Review

No critical issues found. The fix correctly addresses the transient catchup-state problem described in #1031.

Logic verification:

  • ✅ Error classification is sound: consensus violations (ErrTxInvalid) persist invalid; transient states (ErrTxMissingParent/ErrTxNotFound) return ErrBlockIncomplete for retry
  • ✅ All four persistence sites correctly handle ErrBlockIncomplete: validateBlockSubtrees (BlockValidation.go:1421), block.Valid consumer (BlockValidation.go:1637), gRPC ValidateBlock (Server.go:1335), and model.getParentTxMetaBlockIDs (Block.go:1129, 1140)
  • ✅ Catchup logic (Server.go:651) uses isUnvalidatablePeerError to avoid marking honest peers malicious for transient errors
  • ✅ Parent-invalid cascade intentionally unchanged (checkParentInvalid) with clear documentation justifying the design

Test coverage:

  • ✅ 193 new lines in BlockValidation_test.go cover subtree error classification, synchronous block.Valid path, and parent-cascade regression
  • ✅ 156 new lines in Server_test.go cover gRPC ValidateBlock path and isUnvalidatablePeerError table
  • ✅ 20 new lines in Block_test.go verify model.getParentTxMetaBlockIDs returns incomplete for missing parents

Safety direction:

  • ✅ Under-marks instead of over-marks: soft-stalls and retries rather than permanently rejecting valid blocks
  • ✅ No risk of accepting invalid blocks: transient errors prevent acceptance until data resolves, then full validation still runs

Documentation:

  • Clear inline comments explain the transient vs. consensus distinction at all decision points
  • PR description explicitly documents the deliberate tradeoff for simultaneous incomplete+invalid blocks

@oskarszoon oskarszoon requested a review from galt-tr June 4, 2026 14:22
@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-1036 (138e7bd)

Summary

  • Regressions: 0
  • Improvements: 0
  • Unchanged: 148
  • Significance level: p < 0.05
All benchmark results (sec/op)
Benchmark Baseline Current Change p-value
_NewBlockFromBytes-4 1.666µ 1.603µ ~ 0.100
SplitSyncedParentMap_SetIfNotExists/256_buckets-4 71.03n 71.21n ~ 0.100
SplitSyncedParentMap_SetIfNotExists/16_buckets-4 71.08n 71.51n ~ 0.400
SplitSyncedParentMap_SetIfNotExists/1_bucket-4 70.95n 70.93n ~ 0.700
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets... 33.62n 33.27n ~ 0.200
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_... 56.70n 56.56n ~ 0.700
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa... 134.3n 130.9n ~ 1.000
MiningCandidate_Stringify_Short-4 217.6n 221.9n ~ 0.100
MiningCandidate_Stringify_Long-4 1.650µ 1.676µ ~ 0.100
MiningSolution_Stringify-4 849.4n 860.1n ~ 0.100
BlockInfo_MarshalJSON-4 1.739µ 1.716µ ~ 0.200
NewFromBytes-4 156.8n 130.8n ~ 0.700
AddTxBatchColumnar_Validation-4 2.446µ 2.436µ ~ 0.700
OffsetValidationLoop-4 642.7n 638.3n ~ 1.000
Mine_EasyDifficulty-4 61.22µ 61.24µ ~ 1.000
Mine_WithAddress-4 7.326µ 7.001µ ~ 0.200
DirectSubtreeAdd/4_per_subtree-4 59.88n 61.78n ~ 0.700
DirectSubtreeAdd/64_per_subtree-4 32.04n 32.37n ~ 0.100
DirectSubtreeAdd/256_per_subtree-4 31.30n 31.64n ~ 0.400
DirectSubtreeAdd/1024_per_subtree-4 29.69n 29.94n ~ 0.400
DirectSubtreeAdd/2048_per_subtree-4 29.26n 29.36n ~ 0.700
SubtreeProcessorAdd/4_per_subtree-4 305.1n 297.4n ~ 0.200
SubtreeProcessorAdd/64_per_subtree-4 324.3n 296.2n ~ 0.100
SubtreeProcessorAdd/256_per_subtree-4 353.0n 293.5n ~ 0.100
SubtreeProcessorAdd/1024_per_subtree-4 344.5n 286.7n ~ 0.100
SubtreeProcessorAdd/2048_per_subtree-4 315.1n 285.8n ~ 0.100
SubtreeProcessorRotate/4_per_subtree-4 334.3n 283.5n ~ 0.100
SubtreeProcessorRotate/64_per_subtree-4 328.4n 290.0n ~ 0.100
SubtreeProcessorRotate/256_per_subtree-4 326.8n 294.4n ~ 0.100
SubtreeProcessorRotate/1024_per_subtree-4 352.5n 289.9n ~ 0.100
SubtreeNodeAddOnly/4_per_subtree-4 56.69n 55.72n ~ 0.100
SubtreeNodeAddOnly/64_per_subtree-4 34.85n 34.51n ~ 0.400
SubtreeNodeAddOnly/256_per_subtree-4 33.94n 33.94n ~ 0.800
SubtreeNodeAddOnly/1024_per_subtree-4 33.23n 32.99n ~ 0.100
SubtreeCreationOnly/4_per_subtree-4 120.7n 117.6n ~ 0.100
SubtreeCreationOnly/64_per_subtree-4 437.9n 429.3n ~ 0.100
SubtreeCreationOnly/256_per_subtree-4 1.485µ 1.426µ ~ 0.100
SubtreeCreationOnly/1024_per_subtree-4 4.752µ 4.576µ ~ 0.100
SubtreeCreationOnly/2048_per_subtree-4 9.308µ 8.549µ ~ 0.200
SubtreeProcessorOverheadBreakdown/64_per_subtree-4 392.3n 300.0n ~ 0.100
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4 405.6n 284.0n ~ 0.100
ParallelGetAndSetIfNotExists/1k_nodes-4 2.378m 2.287m ~ 0.100
ParallelGetAndSetIfNotExists/10k_nodes-4 6.037m 5.879m ~ 0.400
ParallelGetAndSetIfNotExists/50k_nodes-4 8.739m 7.913m ~ 0.100
ParallelGetAndSetIfNotExists/100k_nodes-4 12.72m 11.57m ~ 0.100
SequentialGetAndSetIfNotExists/1k_nodes-4 2.190m 2.015m ~ 0.100
SequentialGetAndSetIfNotExists/10k_nodes-4 7.591m 5.663m ~ 0.100
SequentialGetAndSetIfNotExists/50k_nodes-4 20.35m 15.91m ~ 0.100
SequentialGetAndSetIfNotExists/100k_nodes-4 37.54m 26.69m ~ 0.100
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4 2.403m 2.337m ~ 0.100
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4 8.902m 8.590m ~ 0.200
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4 15.52m 14.63m ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4 2.110m 2.041m ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4 10.310m 9.093m ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4 72.90m 46.49m ~ 0.100
DiskTxMap_SetIfNotExists-4 3.495µ 3.544µ ~ 0.400
DiskTxMap_SetIfNotExists_Parallel-4 3.292µ 3.318µ ~ 0.800
DiskTxMap_ExistenceOnly-4 313.0n 315.5n ~ 0.700
Queue-4 189.6n 188.3n ~ 0.100
AtomicPointer-4 4.352n 4.801n ~ 0.100
ReorgOptimizations/DedupFilterPipeline/Old/10K-4 909.5µ 888.5µ ~ 0.100
ReorgOptimizations/DedupFilterPipeline/New/10K-4 828.1µ 808.6µ ~ 0.100
ReorgOptimizations/AllMarkFalse/Old/10K-4 114.6µ 104.6µ ~ 0.100
ReorgOptimizations/AllMarkFalse/New/10K-4 62.33µ 62.15µ ~ 1.000
ReorgOptimizations/HashSlicePool/Old/10K-4 59.84µ 56.88µ ~ 0.700
ReorgOptimizations/HashSlicePool/New/10K-4 11.33µ 10.97µ ~ 0.100
ReorgOptimizations/NodeFlags/Old/10K-4 5.388µ 5.160µ ~ 0.100
ReorgOptimizations/NodeFlags/New/10K-4 2.037µ 1.936µ ~ 0.100
ReorgOptimizations/DedupFilterPipeline/Old/100K-4 9.894m 10.277m ~ 0.100
ReorgOptimizations/DedupFilterPipeline/New/100K-4 10.11m 10.21m ~ 1.000
ReorgOptimizations/AllMarkFalse/Old/100K-4 1.182m 1.213m ~ 0.100
ReorgOptimizations/AllMarkFalse/New/100K-4 681.0µ 679.3µ ~ 0.700
ReorgOptimizations/HashSlicePool/Old/100K-4 658.0µ 575.6µ ~ 0.100
ReorgOptimizations/HashSlicePool/New/100K-4 318.2µ 295.9µ ~ 0.400
ReorgOptimizations/NodeFlags/Old/100K-4 49.07µ 49.24µ ~ 0.700
ReorgOptimizations/NodeFlags/New/100K-4 17.28µ 16.87µ ~ 0.400
TxMapSetIfNotExists-4 52.22n 52.51n ~ 0.500
TxMapSetIfNotExistsDuplicate-4 40.31n 39.83n ~ 0.100
ChannelSendReceive-4 604.6n 622.8n ~ 0.100
BlockAssembler_AddTx-4 0.02599n 0.02951n ~ 1.000
AddNode-4 10.92 10.68 ~ 0.400
AddNodeWithMap-4 11.92 11.23 ~ 0.400
CalcBlockWork-4 519.8n 523.0n ~ 0.700
CalculateWork-4 713.0n 713.0n ~ 1.000
CheckOldBlockIDs/on-chain-prefetch/1000-4 56.84µ 54.76µ ~ 0.100
CheckOldBlockIDs/off-chain-prefetch/1000-4 46.73µ 56.05µ ~ 0.200
CheckOldBlockIDs/on-chain-prefetch/10000-4 432.3µ 420.5µ ~ 0.100
CheckOldBlockIDs/off-chain-prefetch/10000-4 339.5µ 338.3µ ~ 0.100
BuildBlockLocatorString_Helpers/Size_10-4 1.409µ 1.355µ ~ 0.100
BuildBlockLocatorString_Helpers/Size_100-4 13.59µ 13.05µ ~ 0.100
BuildBlockLocatorString_Helpers/Size_1000-4 132.6µ 129.3µ ~ 0.100
CatchupWithHeaderCache-4 104.5m 104.6m ~ 1.000
_BufferPoolAllocation/16KB-4 5.680µ 4.062µ ~ 0.200
_BufferPoolAllocation/32KB-4 9.277µ 9.619µ ~ 0.200
_BufferPoolAllocation/64KB-4 18.57µ 23.09µ ~ 0.400
_BufferPoolAllocation/128KB-4 38.46µ 38.38µ ~ 1.000
_BufferPoolAllocation/512KB-4 126.1µ 129.9µ ~ 0.100
_BufferPoolConcurrent/32KB-4 20.67µ 22.89µ ~ 0.100
_BufferPoolConcurrent/64KB-4 35.13µ 36.08µ ~ 0.400
_BufferPoolConcurrent/512KB-4 168.7µ 165.9µ ~ 0.200
_SubtreeDeserializationWithBufferSizes/16KB-4 641.0µ 685.7µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/32KB-4 652.2µ 670.6µ ~ 0.200
_SubtreeDeserializationWithBufferSizes/64KB-4 645.0µ 674.6µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/128KB-4 643.8µ 674.7µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/512KB-4 642.3µ 681.5µ ~ 0.700
_SubtreeDataDeserializationWithBufferSizes/16KB-4 36.86m 37.61m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/32KB-4 37.04m 37.82m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/64KB-4 37.29m 37.86m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/128KB-4 37.34m 38.09m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/512KB-4 37.36m 37.61m ~ 0.400
_PooledVsNonPooled/Pooled-4 746.3n 744.1n ~ 0.400
_PooledVsNonPooled/NonPooled-4 9.438µ 9.262µ ~ 0.200
_MemoryFootprint/Current_512KB_32concurrent-4 8.017µ 7.201µ ~ 0.100
_MemoryFootprint/Proposed_32KB_32concurrent-4 11.01µ 10.07µ ~ 0.100
_MemoryFootprint/Alternative_64KB_32concurrent-4 11.370µ 9.567µ ~ 0.100
_prepareTxsPerLevel-4 403.4m 401.7m ~ 1.000
_prepareTxsPerLevelOrdered-4 3.589m 3.461m ~ 0.100
_prepareTxsPerLevel_Comparison/Original-4 404.9m 394.6m ~ 0.200
_prepareTxsPerLevel_Comparison/Optimized-4 3.623m 3.480m ~ 0.100
SubtreeSizes/10k_tx_4_per_subtree-4 1.470m 1.454m ~ 1.000
SubtreeSizes/10k_tx_16_per_subtree-4 334.9µ 330.2µ ~ 0.700
SubtreeSizes/10k_tx_64_per_subtree-4 80.95µ 80.67µ ~ 0.700
SubtreeSizes/10k_tx_256_per_subtree-4 20.25µ 20.24µ ~ 1.000
SubtreeSizes/10k_tx_512_per_subtree-4 10.15µ 10.02µ ~ 0.400
SubtreeSizes/10k_tx_1024_per_subtree-4 4.972µ 5.002µ ~ 0.300
SubtreeSizes/10k_tx_2k_per_subtree-4 2.473µ 2.480µ ~ 1.000
BlockSizeScaling/10k_tx_64_per_subtree-4 78.67µ 78.27µ ~ 0.700
BlockSizeScaling/10k_tx_256_per_subtree-4 20.17µ 20.07µ ~ 0.400
BlockSizeScaling/10k_tx_1024_per_subtree-4 4.987µ 4.932µ ~ 0.200
BlockSizeScaling/50k_tx_64_per_subtree-4 403.2µ 404.6µ ~ 1.000
BlockSizeScaling/50k_tx_256_per_subtree-4 98.53µ 98.27µ ~ 0.200
BlockSizeScaling/50k_tx_1024_per_subtree-4 24.69µ 24.98µ ~ 0.400
SubtreeAllocations/small_subtrees_exists_check-4 161.1µ 162.2µ ~ 0.400
SubtreeAllocations/small_subtrees_data_fetch-4 171.2µ 173.6µ ~ 0.100
SubtreeAllocations/small_subtrees_full_validation-4 326.5µ 330.6µ ~ 0.200
SubtreeAllocations/medium_subtrees_exists_check-4 9.729µ 9.783µ ~ 0.700
SubtreeAllocations/medium_subtrees_data_fetch-4 10.79µ 10.86µ ~ 0.100
SubtreeAllocations/medium_subtrees_full_validation-4 20.31µ 20.67µ ~ 0.100
SubtreeAllocations/large_subtrees_exists_check-4 2.410µ 2.424µ ~ 0.700
SubtreeAllocations/large_subtrees_data_fetch-4 2.671µ 2.726µ ~ 0.100
SubtreeAllocations/large_subtrees_full_validation-4 5.145µ 5.141µ ~ 1.000
StoreBlock_Sequential/BelowCSVHeight-4 326.5µ 327.9µ ~ 0.700
StoreBlock_Sequential/AboveCSVHeight-4 329.5µ 327.6µ ~ 0.400
GetUtxoHashes-4 273.2n 272.0n ~ 1.000
GetUtxoHashes_ManyOutputs-4 43.09µ 43.59µ ~ 0.200
_NewMetaDataFromBytes-4 226.7n 227.0n ~ 1.000
_Bytes-4 396.2n 397.3n ~ 0.400
_MetaBytes-4 136.0n 136.5n ~ 0.500

Threshold: >10% with p < 0.05 | Generated: 2026-06-04 16:08 UTC

@sonarqubecloud

sonarqubecloud Bot commented Jun 4, 2026

Copy link
Copy Markdown

@oskarszoon oskarszoon self-assigned this Jun 4, 2026

@ordishs ordishs left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve. Correct fix in the right direction — transient catchup-state errors (missing parent / tx-not-found) no longer persist invalid=true and instead surface ErrBlockIncomplete, which catchup already retries against alternative peers. Safety direction is right: under-marking soft-stalls and retries, never accepts a bad block.

Verified:

  • errgroup propagation preserves errors.Is(err, ErrBlockIncomplete) up to block.Valid's caller
  • incomplete check is placed before storeInvalidBlock, after the existing infra filter
  • both other block.Valid consumers (optimistic background, reValidateBlock) gate invalidation on ErrBlockInvalid, so they retry rather than poison
  • catchup reuses the established 'abort peer / retry alternative / not malicious' path; isUnvalidatablePeerError cleanly narrows malicious flagging
  • build clean; model tests pass

Test coverage is thorough (classifier table, sync + gRPC paths end-to-end, isUnvalidatablePeerError truth table, parent-cascade regression guard), and the checkParentInvalid doc-comment correctly pins the non-obvious cascade invariant.

Noted follow-ups (appropriately deferred): the gRPC ValidateBlock storage/service/processing-filter asymmetry, operator recovery for already-poisoned DBs, and the legacy/netsync blockExists re-validation gap.

@oskarszoon oskarszoon merged commit a079083 into bsv-blockchain:main Jun 5, 2026
34 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

blockvalidation: ErrTxMissingParent/ErrTxNotFound during subtree validation wrongly persists block as invalid (permanent sync stall)

3 participants