fix(blockchain): skip FSM RUN gate when transitioning from IDLE#855
Conversation
PR bsv-blockchain#844 added a checkpoint gate to SendFSMEvent that refuses RUN while the local chain tip is below the network's highest hard-coded checkpoint. On a fresh node the gate also blocks the boot path (IDLE -> RUNNING), because tip height is 0 and the highest mainnet checkpoint is 945000: STATE_ERROR (109): refusing RUN: chain tip height 0 is below highest checkpoint 945000 for mainnet Make the gate consider the source state. The dangerous transitions the gate exists to block are LEGACYSYNCING -> RUNNING and CATCHINGBLOCKS -> RUNNING, both of which claim "I'm caught up". IDLE -> RUNNING is the boot path: no tip exists yet, downstream services (legacy, p2p) need the FSM in RUNNING to start syncing, and the legacy service already suppresses tx relay while FSM != RUNNING (PR bsv-blockchain#843) so the original bad-txns-vout-p2sh ban concern does not regress. Regression test exercises SendFSMEvent end-to-end with mainnet params across all RUN-permitted source states.
|
🤖 Claude Code Review Status: Complete Review Summary: No issues found. The fix correctly addresses the boot-path regression while preserving the original safety guarantee. Analysis: The change adds a state-aware guard to the RUN gate introduced in PR #844:
The implementation is correct:
The PR description accurately reflects the fix and includes verification results. |
|
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-05-13 10:26 UTC |
ordishs
left a comment
There was a problem hiding this comment.
Approve
Correct, minimal, well-tested fix for the boot-path regression from #844. The gate-skip is keyed on the right discriminator (priorState == IDLE), preserving protection on the two source states (LEGACYSYNCING, CATCHINGBLOCKS) that actually carry an "I'm caught up" claim.
Strengths
- Surgical change. 8 lines of fix, ~120 lines of test — the right ratio for a state-machine gate.
- Test coverage hits the right matrix. All three RUN-permitted source states × (below-checkpoint, at-or-above-checkpoint), plus an assertion that the FSM remains in the source state on rejection — that last one would catch a future regression if someone flipped the order of gate vs.
Event(). - Test plumbing is correct.
newTestBlockchainForGatenow providesnotifications,stats(needed forSendNotification'stracing.WithParentStat(b.stats)),initPrometheusMetrics()(needed for theenter_stategauge write), and a buffered-channel drainer so the FSM callback can publish without blocking..Maybe()on the store mock is the right call — IDLE sources never reachGetBestBlockHeader.
Optional follow-ups (non-blocking)
1. Comment misrepresents the safety mechanism. The new comment on Server.go:2598-2604 says:
…the legacy service already suppresses tx relay while FSM != RUNNING (#843) so the original
bad-txns-vout-p2shban concern does not regress.
But after this fix the FSM does reach RUNNING, at which point peer_server.go:2612's gate (IsFSMCurrentState(RUNNING)) permits relay. The real safety argument is structural: while FSM was IDLE, #843 prevented mempool accumulation, so the mempool is empty at the moment of transition — there is nothing pre-Genesis to relay. Worth rewording so future readers don't think the cited mechanism is what protects the new path.
2. Operator-bypass surface worth acknowledging. Idle() emits STOP, which moves RUNNING/LEGACYSYNCING → IDLE while preserving the tip. An operator calling setfsmstate idle then setfsmstate running on a partially-synced node (e.g., tip 600000) bypasses the gate — explicitly accepted by the "IDLE with tip 100 still below checkpoint succeeds" case. This is a deliberate design choice (and #843's mempool clearing during IDLE makes it tolerable), but the test name doesn't flag it as such. Consider renaming to something like "IDLE bypasses gate even with non-zero tip (operator override path)" or expanding the test-block comment.
3. Minor: drainer goroutine lifetime. t.Cleanup(func() { close(done) }) signals shutdown but doesn't wait. Harmless today (each subtest gets its own Blockchain), but a future refactor that shares state across subtests would inherit cross-talk. Not a blocker.
Verdict: ship it. The two comment-level suggestions can land in a follow-up if convenient.



Problem
PR #844 added a checkpoint gate to
SendFSMEventthat refusesRUNwhile the local chain tip is below the network's highest hard-coded checkpoint. On a fresh node the gate also blocks the boot path (IDLE→RUNNING), because tip height is 0 and the highest mainnet checkpoint is 945000:Repro:
./cli.sh setfsmstate --fsmstate RUNNINGon a freshly-initialised node.Fix
Skip the gate when the prior FSM state is
IDLE. The dangerous transitions the gate exists to block areLEGACYSYNCING→RUNNINGandCATCHINGBLOCKS→RUNNING, both of which claim "I'm caught up".IDLE→RUNNINGis the boot path: no tip exists yet, downstream services (legacy, p2p) need the FSM inRUNNINGto start syncing, and the legacy service already suppresses tx relay while FSM ≠RUNNING(#843), so the originalbad-txns-vout-p2shban concern from #844 does not regress.Test
TestSendFSMEvent_RunGate_SourceStateexercisesSendFSMEventend-to-end with mainnet params:IDLE+ tip 0 → RUN succeeds (regression for this bug)IDLE+ tip below checkpoint → RUN succeedsLEGACYSYNCING+ tip below checkpoint → RUN rejectedLEGACYSYNCING+ tip at checkpoint → RUN succeedsCATCHINGBLOCKS+ tip below checkpoint → RUN rejectedCATCHINGBLOCKS+ tip above checkpoint → RUN succeedsVerification