Skip to content

fix(blockchain): reject FSM RUN below highest network checkpoint#844

Merged
oskarszoon merged 4 commits into
bsv-blockchain:mainfrom
oskarszoon:fix/fsm-no-running-below-checkpoint
May 12, 2026
Merged

fix(blockchain): reject FSM RUN below highest network checkpoint#844
oskarszoon merged 4 commits into
bsv-blockchain:mainfrom
oskarszoon:fix/fsm-no-running-below-checkpoint

Conversation

@oskarszoon

@oskarszoon oskarszoon commented May 11, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Block the FSMEventType_RUN transition in Blockchain.SendFSMEvent whenever the local chain tip is below the network's highest hard-coded checkpoint. Catchup callers continue to log the rejection and re-enter CATCHINGBLOCKS, so the node stops bouncing between states while mid-IBD.
  • Remove the unconditional Init-time RUN flip in blockvalidation.Server.Init (added in #3672). It bypassed the gate and was the load-bearing reason the node could lie about being RUNNING during IBD. Crash recovery from a stuck CATCHINGBLOCKS state is already covered by legacy startSync and restoreFSMState (both now subject to the gate).
  • Networks with no checkpoints (regtest, brand-new chains) bypass the gate. Mainnet's highest is currently 938000, testnet3's is 1700000 (go-chaincfg.MainNetParams.Checkpoints / TestNet3Params.Checkpoints).

Why

Follow-up to #843. While investigating that ban incident on bsva-ovh-teranode-eu-3 (chain tip ~293k, mainnet highest checkpoint 938000), Coralogix showed 30 Sending Run event blockvalidation/Server (CATCHINGBLOCKS => Run) log lines in 2.5 hours, all from services/blockvalidation/catchup.go:973 restoreFSMState. The defer u.restoreFSMState(ctx, catchupCtx) at catchup.go:790 flips state back to RUN unconditionally on every catchup chunk exit, regardless of whether the chain is actually caught up to network tip. Each RUN window let the legacy service emit at least one P2SH-bearing tx inv to seed-mainnet-eu-1 and earn bad-txns-vout-p2sh BAN THRESHOLD EXCEEDED (the re-ban at 18:01:38 lines up exactly with the same-second RUN event).

Investigation also surfaced a second path: blockvalidation.Server.Init (added by PR #3672) checks at startup whether FSM is in CATCHINGBLOCKS and, if so, sends RUN with no consultation of chain height. Comment in the original PR: "this is probably a remnant of a crash". Intent was reasonable (don't pin the node in CATCHINGBLOCKS if the prior process crashed mid-catchup), but the implementation lies about progress whenever the chain genuinely is mid-IBD on restart. The right invariant is the one this PR enforces in SendFSMEvent.

Implementation

services/blockchain/Server.go:

  • New guardRunBelowHighestCheckpoint(ctx) helper. Reads b.store.GetBestBlockHeader(ctx); compares meta height against blockchain.HighestCheckpointHeight(b.settings.ChainCfgParams.Checkpoints). Returns errors.NewStateError(...) when below.
  • New exported HighestCheckpointHeight([]chaincfg.Checkpoint) uint32 helper. Replaces the private duplicate previously in services/blockvalidation/catchup.go.
  • SendFSMEvent calls the guard for FSMEventType_RUN. The existing CATCHINGBLOCKS-cannot-be-manually-exited gate is preserved.

services/blockvalidation/Server.go:

  • Remove the Init-time RUN flip (lines 527-541 before this PR). With the gate in place it was actively dangerous; without the gate it was the source of the eu-3 ban incident. Crash recovery for a stuck CATCHINGBLOCKS state still works through:
    • services/legacy/netsync/manager.go:546 startSync — issues RUN when a peer reports the same height as the local tip.
    • services/blockvalidation/catchup.go:973 restoreFSMState — defers RUN at the end of every catchup chunk.
      Both now go through the gate, so neither can fire while the local tip is below the highest checkpoint.

services/blockvalidation/catchup.go, services/blockvalidation/BlockValidation.go:

  • Drop the private getHighestCheckpointHeight; rewire the three callsites (catchup.go:682, BlockValidation.go:1339, BlockValidation.go:1764) to the shared blockchain.HighestCheckpointHeight.

Caller behaviour for Run rejection:

  • services/legacy/netsync/manager.go:546,1345,1973 — already log+continue.
  • services/blockvalidation/catchup.go:973 — already log+continue.
  • services/blockvalidation/Server.go — Init flip removed entirely.

No caller crashes on the new rejection; they just stay in CATCHINGBLOCKS until catchup genuinely passes the highest checkpoint.

Edge cases

  • Networks with no checkpoints (regtest, brand-new chains): HighestCheckpointHeight returns 0 → guard returns nil → RUN allowed.
  • Missing settings.ChainCfgParams: guard returns nil (defensive; production always populates).
  • Store read error or nil meta: fails closed (StateError) so the node retries later rather than slipping into RUN with unknown chain state.

Test plan

services/blockchain/server_fsm_run_gate_test.go:

  • TestHighestCheckpointHeight — nil/empty/unordered + MainNet (938000) + RegTest (0).

  • TestGuardRunBelowHighestCheckpoint — table-driven: below/at/above checkpoint, no-checkpoint network permits, store error fails closed, nil meta fails closed.

  • TestGuardRunBelowHighestCheckpoint_NilSettings — defensive paths.

  • go build ./...

  • go test -count=1 -race ./services/blockchain/ → 609 pass

  • go test -count=1 -race ./services/blockvalidation/ → 562 pass

  • go test -count=1 -race ./services/blockvalidation/ ./services/legacy/netsync/ → 605 pass (downstream callers unaffected)

  • go vet ./services/blockchain/...

  • Manual: starting testnet daemon from a mid-IBD chain state (height 1481322) no longer crashes Init; node comes up in CATCHINGBLOCKS and catchup runs.

Follow-ups

  • The deeper issue (restoreFSMState issues RUN even when more catchup is needed) is mitigated by the gate but worth a separate change to make restoreFSMState consult "are we actually at tip?" before transitioning. Avoids the constant rejection log spam during IBD.
  • Mainnet checkpoint list in go-chaincfg is currently stale (highest = 938000, real tip ~948677). Update the checkpoints periodically so the gate threshold tracks reality.
  • Pair with fix(legacy): suppress tx relay while FSM != RUNNING #843 to fully close the SV-node ban vector: fix(legacy): suppress tx relay while FSM != RUNNING #843 stops outbound tx relay while not RUNNING; this PR stops bogus RUN transitions while below checkpoint.

Refuse `FSMEventType_RUN` whenever the local chain tip is below the
network's highest hard-coded checkpoint. Pre-checkpoint heights are
guaranteed to be deep history (mainnet's highest is 938000), so a node
sitting below them is mid-IBD even if a catchup worker thinks it has
finished its current chunk. Going to RUNNING in that state lets the
mempool/validator operate under pre-Genesis output rules and the
legacy service relay tx invs that current peers ban on sight
(`bad-txns-vout-p2sh BAN THRESHOLD EXCEEDED`).

Observed on mainnet eu-3 (chain tip ~293k, mainnet highest checkpoint
938000): 30 RUN events fired in 2.5 hours from
`services/blockvalidation/catchup.go:973 restoreFSMState`. The
unconditional `defer u.restoreFSMState(ctx, catchupCtx)` flips state
back to RUN on every catchup chunk exit, regardless of whether the
chain is anywhere near the network tip. Each RUN window let the legacy
service emit at least one P2SH-bearing tx inv to seed-mainnet-eu-1 and
collected an instant ban.

The gate sits in `SendFSMEvent` so it covers every caller of
`blockchainClient.Run(...)` (`legacy/netsync/manager.go` startSync,
handleBlockMsg, blockHandler; `blockvalidation/Server.go` Start;
`blockvalidation/catchup.go` restoreFSMState). Existing callers log
the rejection and continue; FSM stays in CATCHINGBLOCKS until catchup
genuinely passes the checkpoint.

Fails closed: store read errors or missing meta also reject RUN.
Networks with no checkpoints (regtest, brand-new nets) bypass the
gate. The helper has no production-config dependency beyond
`settings.ChainCfgParams.Checkpoints`.
@github-actions

github-actions Bot commented May 11, 2026

Copy link
Copy Markdown
Contributor

🤖 Claude Code Review

Status: Complete

Current Review:
No issues found. The PR correctly implements a safety gate for FSM RUN transitions.

Key Changes Verified:

  • Gate blocks RUN when chain tip < highest checkpoint (services/blockchain/Server.go:2601-2606)
  • Removed Init-time RUN flip that bypassed the gate (services/blockvalidation/Server.go)
  • Consolidated duplicate HighestCheckpointHeight implementations into shared function
  • Comprehensive test coverage including edge cases (nil settings, store errors, regtest)

History:

  • ✅ Fixed: Code duplication issue resolved - all call sites now use shared blockchain.HighestCheckpointHeight

The implementation follows defensive patterns (fail-closed on errors), preserves regtest compatibility, and includes proper test coverage per AGENTS.md requirements.

Comment thread services/blockchain/Server.go Outdated
@github-actions

github-actions Bot commented May 11, 2026

Copy link
Copy Markdown
Contributor

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-844 (855b349)

Summary

  • Regressions: 0
  • Improvements: 0
  • Unchanged: 142
  • Significance level: p < 0.05
All benchmark results (sec/op)
Benchmark Baseline Current Change p-value
_NewBlockFromBytes-4 1.546µ 1.564µ ~ 0.100
SplitSyncedParentMap_SetIfNotExists/256_buckets-4 71.25n 71.16n ~ 0.400
SplitSyncedParentMap_SetIfNotExists/16_buckets-4 71.18n 71.75n ~ 0.400
SplitSyncedParentMap_SetIfNotExists/1_bucket-4 71.15n 71.25n ~ 0.700
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets... 32.62n 33.01n ~ 0.400
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_... 57.14n 55.07n ~ 0.200
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa... 127.6n 126.4n ~ 0.100
MiningCandidate_Stringify_Short-4 218.4n 218.7n ~ 0.700
MiningCandidate_Stringify_Long-4 1.637µ 1.646µ ~ 0.100
MiningSolution_Stringify-4 847.4n 853.5n ~ 0.100
BlockInfo_MarshalJSON-4 1.725µ 1.721µ ~ 0.300
NewFromBytes-4 131.5n 149.2n ~ 0.100
Mine_EasyDifficulty-4 60.88µ 61.46µ ~ 0.400
Mine_WithAddress-4 6.954µ 6.879µ ~ 0.700
BlockAssembler_AddTx-4 0.03373n 0.02895n ~ 0.400
AddNode-4 10.53 10.63 ~ 1.000
AddNodeWithMap-4 11.31 10.68 ~ 0.200
DirectSubtreeAdd/4_per_subtree-4 61.58n 65.79n ~ 0.200
DirectSubtreeAdd/64_per_subtree-4 31.35n 31.62n ~ 0.400
DirectSubtreeAdd/256_per_subtree-4 30.63n 30.37n ~ 0.400
DirectSubtreeAdd/1024_per_subtree-4 29.07n 29.23n ~ 0.200
DirectSubtreeAdd/2048_per_subtree-4 28.58n 28.67n ~ 0.100
SubtreeProcessorAdd/4_per_subtree-4 278.6n 281.3n ~ 1.000
SubtreeProcessorAdd/64_per_subtree-4 276.2n 273.3n ~ 0.200
SubtreeProcessorAdd/256_per_subtree-4 276.5n 274.6n ~ 0.700
SubtreeProcessorAdd/1024_per_subtree-4 269.0n 266.4n ~ 0.100
SubtreeProcessorAdd/2048_per_subtree-4 269.6n 265.8n ~ 0.100
SubtreeProcessorRotate/4_per_subtree-4 274.5n 272.2n ~ 0.400
SubtreeProcessorRotate/64_per_subtree-4 272.9n 270.7n ~ 0.100
SubtreeProcessorRotate/256_per_subtree-4 273.9n 271.4n ~ 0.200
SubtreeProcessorRotate/1024_per_subtree-4 270.8n 269.0n ~ 0.400
SubtreeNodeAddOnly/4_per_subtree-4 54.04n 54.70n ~ 0.200
SubtreeNodeAddOnly/64_per_subtree-4 34.39n 34.27n ~ 0.700
SubtreeNodeAddOnly/256_per_subtree-4 33.28n 33.56n ~ 0.100
SubtreeNodeAddOnly/1024_per_subtree-4 32.80n 32.71n ~ 1.000
SubtreeCreationOnly/4_per_subtree-4 112.7n 112.8n ~ 0.700
SubtreeCreationOnly/64_per_subtree-4 400.2n 395.3n ~ 0.400
SubtreeCreationOnly/256_per_subtree-4 1.325µ 1.419µ ~ 0.100
SubtreeCreationOnly/1024_per_subtree-4 4.377µ 4.546µ ~ 0.200
SubtreeCreationOnly/2048_per_subtree-4 8.055µ 8.152µ ~ 0.100
SubtreeProcessorOverheadBreakdown/64_per_subtree-4 267.9n 268.2n ~ 0.800
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4 270.6n 268.2n ~ 0.100
ParallelGetAndSetIfNotExists/1k_nodes-4 586.4µ 588.9µ ~ 0.400
ParallelGetAndSetIfNotExists/10k_nodes-4 1.327m 1.405m ~ 0.100
ParallelGetAndSetIfNotExists/50k_nodes-4 6.623m 6.753m ~ 0.100
ParallelGetAndSetIfNotExists/100k_nodes-4 13.22m 13.38m ~ 0.100
SequentialGetAndSetIfNotExists/1k_nodes-4 658.4µ 671.2µ ~ 0.100
SequentialGetAndSetIfNotExists/10k_nodes-4 2.751m 2.920m ~ 0.200
SequentialGetAndSetIfNotExists/50k_nodes-4 10.32m 10.42m ~ 0.200
SequentialGetAndSetIfNotExists/100k_nodes-4 19.90m 19.91m ~ 0.400
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4 636.1µ 647.9µ ~ 0.100
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4 4.253m 4.202m ~ 0.400
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4 16.48m 16.41m ~ 1.000
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4 710.3µ 702.4µ ~ 0.700
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4 5.906m 5.891m ~ 0.700
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4 37.97m 39.55m ~ 0.100
DiskTxMap_SetIfNotExists-4 3.581µ 3.482µ ~ 0.100
DiskTxMap_SetIfNotExists_Parallel-4 3.293µ 3.367µ ~ 0.200
DiskTxMap_ExistenceOnly-4 312.4n 296.6n ~ 0.100
Queue-4 199.1n 195.7n ~ 0.100
AtomicPointer-4 4.803n 5.073n ~ 0.100
ReorgOptimizations/DedupFilterPipeline/Old/10K-4 898.6µ 869.2µ ~ 0.700
ReorgOptimizations/DedupFilterPipeline/New/10K-4 909.8µ 898.6µ ~ 0.200
ReorgOptimizations/AllMarkFalse/Old/10K-4 115.3µ 129.8µ ~ 0.100
ReorgOptimizations/AllMarkFalse/New/10K-4 62.60µ 61.88µ ~ 0.400
ReorgOptimizations/HashSlicePool/Old/10K-4 64.92µ 59.88µ ~ 0.100
ReorgOptimizations/HashSlicePool/New/10K-4 11.33µ 11.96µ ~ 0.100
ReorgOptimizations/NodeFlags/Old/10K-4 5.446µ 5.316µ ~ 0.400
ReorgOptimizations/NodeFlags/New/10K-4 1.829µ 1.910µ ~ 1.000
ReorgOptimizations/DedupFilterPipeline/Old/100K-4 9.559m 10.222m ~ 0.100
ReorgOptimizations/DedupFilterPipeline/New/100K-4 9.739m 9.723m ~ 1.000
ReorgOptimizations/AllMarkFalse/Old/100K-4 1.154m 1.136m ~ 1.000
ReorgOptimizations/AllMarkFalse/New/100K-4 683.1µ 689.3µ ~ 0.100
ReorgOptimizations/HashSlicePool/Old/100K-4 630.2µ 721.7µ ~ 0.700
ReorgOptimizations/HashSlicePool/New/100K-4 320.0µ 291.7µ ~ 0.700
ReorgOptimizations/NodeFlags/Old/100K-4 57.58µ 54.94µ ~ 0.100
ReorgOptimizations/NodeFlags/New/100K-4 19.79µ 19.65µ ~ 1.000
TxMapSetIfNotExists-4 51.41n 51.57n ~ 0.400
TxMapSetIfNotExistsDuplicate-4 38.08n 38.14n ~ 1.000
ChannelSendReceive-4 615.3n 620.9n ~ 0.100
CalcBlockWork-4 517.4n 507.7n ~ 0.100
CalculateWork-4 706.7n 761.6n ~ 0.700
BuildBlockLocatorString_Helpers/Size_10-4 1.287µ 1.306µ ~ 0.700
BuildBlockLocatorString_Helpers/Size_100-4 14.59µ 15.53µ ~ 0.400
BuildBlockLocatorString_Helpers/Size_1000-4 122.4µ 124.6µ ~ 0.700
CatchupWithHeaderCache-4 104.1m 104.2m ~ 0.200
_prepareTxsPerLevel-4 411.6m 412.4m ~ 1.000
_prepareTxsPerLevelOrdered-4 3.564m 3.495m ~ 1.000
_prepareTxsPerLevel_Comparison/Original-4 418.5m 414.6m ~ 1.000
_prepareTxsPerLevel_Comparison/Optimized-4 3.548m 3.557m ~ 1.000
_BufferPoolAllocation/16KB-4 3.299µ 3.221µ ~ 0.400
_BufferPoolAllocation/32KB-4 6.746µ 8.789µ ~ 0.100
_BufferPoolAllocation/64KB-4 16.63µ 16.58µ ~ 1.000
_BufferPoolAllocation/128KB-4 30.43µ 28.46µ ~ 0.700
_BufferPoolAllocation/512KB-4 106.9µ 107.6µ ~ 1.000
_BufferPoolConcurrent/32KB-4 18.80µ 18.13µ ~ 0.100
_BufferPoolConcurrent/64KB-4 26.14µ 29.32µ ~ 0.100
_BufferPoolConcurrent/512KB-4 138.0µ 138.2µ ~ 0.700
_SubtreeDeserializationWithBufferSizes/16KB-4 622.0µ 608.3µ ~ 0.700
_SubtreeDeserializationWithBufferSizes/32KB-4 637.1µ 634.7µ ~ 1.000
_SubtreeDeserializationWithBufferSizes/64KB-4 636.8µ 619.9µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/128KB-4 631.8µ 632.0µ ~ 1.000
_SubtreeDeserializationWithBufferSizes/512KB-4 629.2µ 631.5µ ~ 0.700
_SubtreeDataDeserializationWithBufferSizes/16KB-4 35.14m 35.69m ~ 0.400
_SubtreeDataDeserializationWithBufferSizes/32KB-4 35.29m 35.14m ~ 1.000
_SubtreeDataDeserializationWithBufferSizes/64KB-4 35.24m 35.03m ~ 0.700
_SubtreeDataDeserializationWithBufferSizes/128KB-4 34.93m 35.13m ~ 0.200
_SubtreeDataDeserializationWithBufferSizes/512KB-4 34.65m 34.76m ~ 1.000
_PooledVsNonPooled/Pooled-4 828.7n 828.9n ~ 1.000
_PooledVsNonPooled/NonPooled-4 7.147µ 6.413µ ~ 0.100
_MemoryFootprint/Current_512KB_32concurrent-4 6.938µ 6.735µ ~ 0.100
_MemoryFootprint/Proposed_32KB_32concurrent-4 11.156µ 9.245µ ~ 0.100
_MemoryFootprint/Alternative_64KB_32concurrent-4 9.836µ 8.665µ ~ 0.100
SubtreeSizes/10k_tx_4_per_subtree-4 1.286m 1.271m ~ 0.400
SubtreeSizes/10k_tx_16_per_subtree-4 305.3µ 304.2µ ~ 0.700
SubtreeSizes/10k_tx_64_per_subtree-4 71.05µ 70.90µ ~ 0.400
SubtreeSizes/10k_tx_256_per_subtree-4 17.90µ 17.91µ ~ 0.700
SubtreeSizes/10k_tx_512_per_subtree-4 8.844µ 8.973µ ~ 0.400
SubtreeSizes/10k_tx_1024_per_subtree-4 4.356µ 4.429µ ~ 0.100
SubtreeSizes/10k_tx_2k_per_subtree-4 2.173µ 2.177µ ~ 1.000
BlockSizeScaling/10k_tx_64_per_subtree-4 69.67µ 69.25µ ~ 0.700
BlockSizeScaling/10k_tx_256_per_subtree-4 17.51µ 18.04µ ~ 0.200
BlockSizeScaling/10k_tx_1024_per_subtree-4 4.358µ 4.417µ ~ 0.100
BlockSizeScaling/50k_tx_64_per_subtree-4 381.0µ 369.8µ ~ 0.700
BlockSizeScaling/50k_tx_256_per_subtree-4 87.54µ 88.88µ ~ 0.700
BlockSizeScaling/50k_tx_1024_per_subtree-4 21.63µ 21.88µ ~ 1.000
SubtreeAllocations/small_subtrees_exists_check-4 152.2µ 151.7µ ~ 0.700
SubtreeAllocations/small_subtrees_data_fetch-4 160.7µ 158.1µ ~ 0.400
SubtreeAllocations/small_subtrees_full_validation-4 311.8µ 308.4µ ~ 0.700
SubtreeAllocations/medium_subtrees_exists_check-4 8.891µ 8.851µ ~ 0.400
SubtreeAllocations/medium_subtrees_data_fetch-4 9.486µ 9.357µ ~ 0.700
SubtreeAllocations/medium_subtrees_full_validation-4 17.75µ 17.78µ ~ 1.000
SubtreeAllocations/large_subtrees_exists_check-4 2.117µ 2.067µ ~ 0.100
SubtreeAllocations/large_subtrees_data_fetch-4 2.224µ 2.233µ ~ 0.200
SubtreeAllocations/large_subtrees_full_validation-4 4.358µ 4.347µ ~ 0.400
StoreBlock_Sequential/BelowCSVHeight-4 319.5µ 330.6µ ~ 0.200
StoreBlock_Sequential/AboveCSVHeight-4 332.2µ 333.3µ ~ 1.000
GetUtxoHashes-4 254.2n 254.5n ~ 1.000
GetUtxoHashes_ManyOutputs-4 49.42µ 50.19µ ~ 0.700
_NewMetaDataFromBytes-4 238.1n 238.1n ~ 1.000
_Bytes-4 632.9n 634.9n ~ 0.400
_MetaBytes-4 590.5n 576.1n ~ 0.100

Threshold: >10% with p < 0.05 | Generated: 2026-05-12 08:25 UTC

Address review feedback on bsv-blockchain#844: the new helper duplicates
`getHighestCheckpointHeight` in services/blockvalidation/catchup.go.

- Export the helper as `blockchain.HighestCheckpointHeight` and keep
  the negative-height guard from the new copy (cp.Height is signed).
- Delete the private duplicate from blockvalidation/catchup.go.
- Update the three callsites in blockvalidation
  (BlockValidation.go:1339, :1764, catchup.go:682) to use the shared
  function.
- Rename Server_fsm_run_gate_test.go → server_fsm_run_gate_test.go to
  satisfy SonarQube's snake_case.go rule (per PR feedback).
oskarszoon added a commit that referenced this pull request May 12, 2026
Block the FSMEventType_RUN transition in Blockchain.SendFSMEvent
whenever the local chain tip is below the network's highest hard-coded
checkpoint. Catchup callers continue to log the rejection and re-enter
CATCHINGBLOCKS, so the node stops bouncing between states while mid-IBD.

Networks with no checkpoints (regtest, brand-new chains) bypass the
gate. Mainnet's highest is currently 938000.

Follow-up to #843. The ban incident on bsva-ovh-teranode-eu-3 (chain
tip ~293k, mainnet highest checkpoint 938000) showed 30 RUN events in
2.5 hours from blockvalidation/catchup's restoreFSMState. The defer
unconditionally transitions to RUN on every catchup chunk exit,
regardless of whether the chain is caught up. Each RUN window allowed
the legacy service to emit at least one P2SH-bearing tx inv to seed-
mainnet-eu-1, earning bad-txns-vout-p2sh BAN THRESHOLD EXCEEDED.

The right surgical fix is at the FSM chokepoint rather than rewriting
the catchup state machine. Pair with #843: 843 stops outbound tx relay
while not RUNNING; this stops bogus RUN transitions while below
checkpoint.

Includes a refactor pulling HighestCheckpointHeight out of blockchain
and blockvalidation into a single shared helper.
blockvalidation's Init crash-recovery path (services/blockvalidation/
Server.go:528, added in #3672) unconditionally flips FSM from
CATCHINGBLOCKS to RUNNING on startup, on the assumption that a stuck
CATCHINGBLOCKS state is a remnant of a prior crash. With the new
SendFSMEvent gate (bsv-blockchain#844) the blockchain server legitimately refuses
RUN when the local chain tip is below the network's highest
checkpoint, which currently happens whenever a node is doing IBD on
testnet (highest checkpoint 1700000) or mainnet (highest checkpoint
938000) from a low height.

Previously a refused RUN returned ServiceError which the daemon
treated as a fatal Init failure and shut the node down:
  refusing RUN: chain tip height 1481322 is below highest checkpoint
  1700000 for testnet

Match the behaviour of the other blockchainClient.Run callers
(legacy/netsync/manager.go:546,1345,1973;
blockvalidation/catchup.go:973): log the rejection and continue. The
node stays in CATCHINGBLOCKS, blocks keep flowing via the Kafka
consumer (unconditional Start at Server.go:1053) and the legacy/p2p
sync paths, catchup runs as normal, and once the chain crosses the
checkpoint the next restoreFSMState will get a permit.

The Init flip is a workaround for a stuck FSM after a crash, not a
declaration that we are caught up. The gate is the real invariant.
The crash-recovery path added in #3672 unconditionally transitioned
FSM from CATCHINGBLOCKS to RUNNING at blockvalidation startup, on the
assumption that a persisted CATCHINGBLOCKS state was a remnant of a
prior crash. With the SendFSMEvent checkpoint gate in this PR the
flip can't actually transition when the local chain is mid-IBD, so
the previous commit just downgraded its error from fatal to a
warning. That's still wrong in spirit: the flip claimed catchup was
complete, when it wasn't.

Stuck-after-crash recovery is covered by existing runtime triggers
that issue RUN under conditions the gate accepts:

  services/legacy/netsync/manager.go:546   startSync, when a peer
                                            reports the same height as
                                            the local tip
  services/blockvalidation/catchup.go:973  restoreFSMState, deferred
                                            at the end of every
                                            catchup chunk

Both run on the normal block-arrival cadence, so a crashed-at-tip
node recovers as soon as its first peer handshake completes, and a
crashed-mid-IBD node continues catching up.
oskarszoon added a commit that referenced this pull request May 12, 2026
Block the FSMEventType_RUN transition in Blockchain.SendFSMEvent
whenever the local chain tip is below the network's highest hard-coded
checkpoint. Catchup callers continue to log the rejection and re-enter
CATCHINGBLOCKS, so the node stops bouncing between states while
mid-IBD. Networks with no checkpoints (regtest, brand-new chains)
bypass the gate. Mainnet's highest is currently 938000, testnet3's is
1700000.

Also removes the unconditional Init-time RUN flip in
blockvalidation.Server.Init (added in #3672). It bypassed the gate and
was the load-bearing reason the node could lie about being RUNNING
during IBD. Crash recovery from a stuck CATCHINGBLOCKS state is
already covered by legacy startSync and restoreFSMState (both now
subject to the gate).

Follow-up to #843. The ban incident on bsva-ovh-teranode-eu-3 (chain
tip ~293k, mainnet highest checkpoint 938000) showed 30 RUN events in
2.5 hours from blockvalidation/catchup's restoreFSMState. Each RUN
window allowed the legacy service to emit at least one P2SH-bearing tx
inv to seed-mainnet-eu-1, earning bad-txns-vout-p2sh BAN THRESHOLD
EXCEEDED.

The right surgical fix is at the FSM chokepoint plus removing the
known-bad Init flip, rather than rewriting the catchup state machine.
Pair with #843: 843 stops outbound tx relay while not RUNNING; this
stops bogus RUN transitions while below checkpoint.

Includes a refactor pulling HighestCheckpointHeight out of blockchain
and blockvalidation into a single shared blockchain.HighestCheckpointHeight
helper.
@sonarqubecloud

Copy link
Copy Markdown

@oskarszoon oskarszoon merged commit f8903b7 into bsv-blockchain:main May 12, 2026
25 checks passed
@oskarszoon oskarszoon self-assigned this May 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants