fix(blockchain): reject FSM RUN below highest network checkpoint#844
Merged
oskarszoon merged 4 commits intoMay 12, 2026
Merged
Conversation
Refuse `FSMEventType_RUN` whenever the local chain tip is below the network's highest hard-coded checkpoint. Pre-checkpoint heights are guaranteed to be deep history (mainnet's highest is 938000), so a node sitting below them is mid-IBD even if a catchup worker thinks it has finished its current chunk. Going to RUNNING in that state lets the mempool/validator operate under pre-Genesis output rules and the legacy service relay tx invs that current peers ban on sight (`bad-txns-vout-p2sh BAN THRESHOLD EXCEEDED`). Observed on mainnet eu-3 (chain tip ~293k, mainnet highest checkpoint 938000): 30 RUN events fired in 2.5 hours from `services/blockvalidation/catchup.go:973 restoreFSMState`. The unconditional `defer u.restoreFSMState(ctx, catchupCtx)` flips state back to RUN on every catchup chunk exit, regardless of whether the chain is anywhere near the network tip. Each RUN window let the legacy service emit at least one P2SH-bearing tx inv to seed-mainnet-eu-1 and collected an instant ban. The gate sits in `SendFSMEvent` so it covers every caller of `blockchainClient.Run(...)` (`legacy/netsync/manager.go` startSync, handleBlockMsg, blockHandler; `blockvalidation/Server.go` Start; `blockvalidation/catchup.go` restoreFSMState). Existing callers log the rejection and continue; FSM stays in CATCHINGBLOCKS until catchup genuinely passes the checkpoint. Fails closed: store read errors or missing meta also reject RUN. Networks with no checkpoints (regtest, brand-new nets) bypass the gate. The helper has no production-config dependency beyond `settings.ChainCfgParams.Checkpoints`.
Contributor
|
🤖 Claude Code Review Status: Complete Current Review: Key Changes Verified:
History:
The implementation follows defensive patterns (fail-closed on errors), preserves regtest compatibility, and includes proper test coverage per AGENTS.md requirements. |
Contributor
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-05-12 08:25 UTC |
Address review feedback on bsv-blockchain#844: the new helper duplicates `getHighestCheckpointHeight` in services/blockvalidation/catchup.go. - Export the helper as `blockchain.HighestCheckpointHeight` and keep the negative-height guard from the new copy (cp.Height is signed). - Delete the private duplicate from blockvalidation/catchup.go. - Update the three callsites in blockvalidation (BlockValidation.go:1339, :1764, catchup.go:682) to use the shared function. - Rename Server_fsm_run_gate_test.go → server_fsm_run_gate_test.go to satisfy SonarQube's snake_case.go rule (per PR feedback).
oskarszoon
added a commit
that referenced
this pull request
May 12, 2026
Block the FSMEventType_RUN transition in Blockchain.SendFSMEvent whenever the local chain tip is below the network's highest hard-coded checkpoint. Catchup callers continue to log the rejection and re-enter CATCHINGBLOCKS, so the node stops bouncing between states while mid-IBD. Networks with no checkpoints (regtest, brand-new chains) bypass the gate. Mainnet's highest is currently 938000. Follow-up to #843. The ban incident on bsva-ovh-teranode-eu-3 (chain tip ~293k, mainnet highest checkpoint 938000) showed 30 RUN events in 2.5 hours from blockvalidation/catchup's restoreFSMState. The defer unconditionally transitions to RUN on every catchup chunk exit, regardless of whether the chain is caught up. Each RUN window allowed the legacy service to emit at least one P2SH-bearing tx inv to seed- mainnet-eu-1, earning bad-txns-vout-p2sh BAN THRESHOLD EXCEEDED. The right surgical fix is at the FSM chokepoint rather than rewriting the catchup state machine. Pair with #843: 843 stops outbound tx relay while not RUNNING; this stops bogus RUN transitions while below checkpoint. Includes a refactor pulling HighestCheckpointHeight out of blockchain and blockvalidation into a single shared helper.
blockvalidation's Init crash-recovery path (services/blockvalidation/ Server.go:528, added in #3672) unconditionally flips FSM from CATCHINGBLOCKS to RUNNING on startup, on the assumption that a stuck CATCHINGBLOCKS state is a remnant of a prior crash. With the new SendFSMEvent gate (bsv-blockchain#844) the blockchain server legitimately refuses RUN when the local chain tip is below the network's highest checkpoint, which currently happens whenever a node is doing IBD on testnet (highest checkpoint 1700000) or mainnet (highest checkpoint 938000) from a low height. Previously a refused RUN returned ServiceError which the daemon treated as a fatal Init failure and shut the node down: refusing RUN: chain tip height 1481322 is below highest checkpoint 1700000 for testnet Match the behaviour of the other blockchainClient.Run callers (legacy/netsync/manager.go:546,1345,1973; blockvalidation/catchup.go:973): log the rejection and continue. The node stays in CATCHINGBLOCKS, blocks keep flowing via the Kafka consumer (unconditional Start at Server.go:1053) and the legacy/p2p sync paths, catchup runs as normal, and once the chain crosses the checkpoint the next restoreFSMState will get a permit. The Init flip is a workaround for a stuck FSM after a crash, not a declaration that we are caught up. The gate is the real invariant.
The crash-recovery path added in #3672 unconditionally transitioned
FSM from CATCHINGBLOCKS to RUNNING at blockvalidation startup, on the
assumption that a persisted CATCHINGBLOCKS state was a remnant of a
prior crash. With the SendFSMEvent checkpoint gate in this PR the
flip can't actually transition when the local chain is mid-IBD, so
the previous commit just downgraded its error from fatal to a
warning. That's still wrong in spirit: the flip claimed catchup was
complete, when it wasn't.
Stuck-after-crash recovery is covered by existing runtime triggers
that issue RUN under conditions the gate accepts:
services/legacy/netsync/manager.go:546 startSync, when a peer
reports the same height as
the local tip
services/blockvalidation/catchup.go:973 restoreFSMState, deferred
at the end of every
catchup chunk
Both run on the normal block-arrival cadence, so a crashed-at-tip
node recovers as soon as its first peer handshake completes, and a
crashed-mid-IBD node continues catching up.
oskarszoon
added a commit
that referenced
this pull request
May 12, 2026
Block the FSMEventType_RUN transition in Blockchain.SendFSMEvent whenever the local chain tip is below the network's highest hard-coded checkpoint. Catchup callers continue to log the rejection and re-enter CATCHINGBLOCKS, so the node stops bouncing between states while mid-IBD. Networks with no checkpoints (regtest, brand-new chains) bypass the gate. Mainnet's highest is currently 938000, testnet3's is 1700000. Also removes the unconditional Init-time RUN flip in blockvalidation.Server.Init (added in #3672). It bypassed the gate and was the load-bearing reason the node could lie about being RUNNING during IBD. Crash recovery from a stuck CATCHINGBLOCKS state is already covered by legacy startSync and restoreFSMState (both now subject to the gate). Follow-up to #843. The ban incident on bsva-ovh-teranode-eu-3 (chain tip ~293k, mainnet highest checkpoint 938000) showed 30 RUN events in 2.5 hours from blockvalidation/catchup's restoreFSMState. Each RUN window allowed the legacy service to emit at least one P2SH-bearing tx inv to seed-mainnet-eu-1, earning bad-txns-vout-p2sh BAN THRESHOLD EXCEEDED. The right surgical fix is at the FSM chokepoint plus removing the known-bad Init flip, rather than rewriting the catchup state machine. Pair with #843: 843 stops outbound tx relay while not RUNNING; this stops bogus RUN transitions while below checkpoint. Includes a refactor pulling HighestCheckpointHeight out of blockchain and blockvalidation into a single shared blockchain.HighestCheckpointHeight helper.
icellan
approved these changes
May 12, 2026
|
freemans13
approved these changes
May 12, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.



Summary
FSMEventType_RUNtransition inBlockchain.SendFSMEventwhenever the local chain tip is below the network's highest hard-coded checkpoint. Catchup callers continue to log the rejection and re-enter CATCHINGBLOCKS, so the node stops bouncing between states while mid-IBD.blockvalidation.Server.Init(added in #3672). It bypassed the gate and was the load-bearing reason the node could lie about being RUNNING during IBD. Crash recovery from a stuck CATCHINGBLOCKS state is already covered by legacystartSyncandrestoreFSMState(both now subject to the gate).938000, testnet3's is1700000(go-chaincfg.MainNetParams.Checkpoints/TestNet3Params.Checkpoints).Why
Follow-up to #843. While investigating that ban incident on
bsva-ovh-teranode-eu-3(chain tip ~293k, mainnet highest checkpoint 938000), Coralogix showed 30Sending Run event blockvalidation/Server (CATCHINGBLOCKS => Run)log lines in 2.5 hours, all fromservices/blockvalidation/catchup.go:973 restoreFSMState. Thedefer u.restoreFSMState(ctx, catchupCtx)atcatchup.go:790flips state back to RUN unconditionally on every catchup chunk exit, regardless of whether the chain is actually caught up to network tip. Each RUN window let the legacy service emit at least one P2SH-bearing tx inv to seed-mainnet-eu-1 and earnbad-txns-vout-p2sh BAN THRESHOLD EXCEEDED(the re-ban at 18:01:38 lines up exactly with the same-second RUN event).Investigation also surfaced a second path:
blockvalidation.Server.Init(added by PR #3672) checks at startup whether FSM is in CATCHINGBLOCKS and, if so, sends RUN with no consultation of chain height. Comment in the original PR: "this is probably a remnant of a crash". Intent was reasonable (don't pin the node in CATCHINGBLOCKS if the prior process crashed mid-catchup), but the implementation lies about progress whenever the chain genuinely is mid-IBD on restart. The right invariant is the one this PR enforces inSendFSMEvent.Implementation
services/blockchain/Server.go:guardRunBelowHighestCheckpoint(ctx)helper. Readsb.store.GetBestBlockHeader(ctx); compares meta height againstblockchain.HighestCheckpointHeight(b.settings.ChainCfgParams.Checkpoints). Returnserrors.NewStateError(...)when below.HighestCheckpointHeight([]chaincfg.Checkpoint) uint32helper. Replaces the private duplicate previously inservices/blockvalidation/catchup.go.SendFSMEventcalls the guard forFSMEventType_RUN. The existing CATCHINGBLOCKS-cannot-be-manually-exited gate is preserved.services/blockvalidation/Server.go:services/legacy/netsync/manager.go:546 startSync— issues RUN when a peer reports the same height as the local tip.services/blockvalidation/catchup.go:973 restoreFSMState— defers RUN at the end of every catchup chunk.Both now go through the gate, so neither can fire while the local tip is below the highest checkpoint.
services/blockvalidation/catchup.go,services/blockvalidation/BlockValidation.go:getHighestCheckpointHeight; rewire the three callsites (catchup.go:682,BlockValidation.go:1339,BlockValidation.go:1764) to the sharedblockchain.HighestCheckpointHeight.Caller behaviour for
Runrejection:services/legacy/netsync/manager.go:546,1345,1973— already log+continue.services/blockvalidation/catchup.go:973— already log+continue.services/blockvalidation/Server.go— Init flip removed entirely.No caller crashes on the new rejection; they just stay in CATCHINGBLOCKS until catchup genuinely passes the highest checkpoint.
Edge cases
HighestCheckpointHeightreturns 0 → guard returns nil → RUN allowed.settings.ChainCfgParams: guard returns nil (defensive; production always populates).StateError) so the node retries later rather than slipping into RUN with unknown chain state.Test plan
services/blockchain/server_fsm_run_gate_test.go:TestHighestCheckpointHeight— nil/empty/unordered + MainNet (938000) + RegTest (0).TestGuardRunBelowHighestCheckpoint— table-driven: below/at/above checkpoint, no-checkpoint network permits, store error fails closed, nil meta fails closed.TestGuardRunBelowHighestCheckpoint_NilSettings— defensive paths.go build ./...go test -count=1 -race ./services/blockchain/→ 609 passgo test -count=1 -race ./services/blockvalidation/→ 562 passgo test -count=1 -race ./services/blockvalidation/ ./services/legacy/netsync/→ 605 pass (downstream callers unaffected)go vet ./services/blockchain/...Manual: starting testnet daemon from a mid-IBD chain state (height 1481322) no longer crashes Init; node comes up in CATCHINGBLOCKS and catchup runs.
Follow-ups
restoreFSMStateissues RUN even when more catchup is needed) is mitigated by the gate but worth a separate change to makerestoreFSMStateconsult "are we actually at tip?" before transitioning. Avoids the constant rejection log spam during IBD.go-chaincfgis currently stale (highest = 938000, real tip ~948677). Update the checkpoints periodically so the gate threshold tracks reality.