fix(blockassembly): purge-conflicting-unmined + FSM IDLE enforcement #704

Closed
icellan wants to merge 52 commits into main from fix/repair-conflicts

@icellan icellan commented Apr 15, 2026

Contributor

Summary

Fixes BlockAssembler startup failures caused by unmined transactions in a locally-inconsistent state (Conflicting=true + UnminedSince>0 records, plus non-conflicting children referencing them). The iterator filters Conflicting=true, so the parent is absent from the processing list and validateParentChain rejects the child, parking the FSM in IDLE.

The branch started out as a repair tool that tried to reconstruct intended state via classification (Case A / C / D). Every iteration on mainnet uncovered a new graph shape and either added hours of runtime or still left the offending tx stuck — because the writers (SetConflicting, reorg handlers) don't clean up after themselves (stale conflictingCs, nil SpendingData, etc).

Replaced with a surgical purge: the unmined set is ephemeral by design (propagation re-arrives valid txs; next block sweeps them up), so a single-pass delete of every (Conflicting=true, UnminedSince>0) record is all BA needs to start cleanly. validateParentChain was relaxed to tolerate missing parents — non-conflicting children whose parents just got purged are harmlessly skipped, and get mined or pruned on their own.

New: teranode-cli purge-conflicting-unmined

Online CLI, run while the node is up and the FSM has parked in IDLE:

teranode-cli purge-conflicting-unmined [--dry-run] [--skip-unmined-since-scan]
  • Step 0 — full-store consistency scan, re-marks mined-on-best-chain txs that still carry UnminedSince. Reused from the repair era. --skip-unmined-since-scan skips it on re-runs once it has completed cleanly.
  • Step 1 — same single scan (no second pass) collects every record with Conflicting=true + UnminedSince>0.
  • Step 2 — batch Delete over the collected set. Aerospike record goes; external .tx/.outputs blobs are cleaned by the existing pruner on delete_at_height.
  • --dry-run counts without writing.
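
The collect-then-delete pass in steps 1 and 2 can be sketched roughly as below. This is a minimal illustration against a hypothetical in-memory map, not the Aerospike or SQL implementation in this PR; the `Record`, `Store` and `PurgeReport` types and the function signature are assumptions made for the example, and step 0 is left out.

```go
package main

import "fmt"

// Record is a hypothetical, simplified view of a UTXO-store record.
type Record struct {
	Conflicting  bool
	UnminedSince uint32 // 0 means the tx is mined
}

// Store is a hypothetical in-memory stand-in for the UTXO store, keyed by tx hash.
type Store map[string]Record

// PurgeReport summarises one run (illustrative fields only).
type PurgeReport struct {
	Scanned    int
	Candidates int
	Deleted    int
}

// PurgeConflictingUnmined scans the store once, collects every record with
// Conflicting=true and UnminedSince>0, and deletes the collected set unless
// dryRun is set. Mined records and non-conflicting records are never touched.
func PurgeConflictingUnmined(s Store, dryRun bool) PurgeReport {
	var report PurgeReport
	var candidates []string

	for hash, rec := range s {
		report.Scanned++
		if rec.Conflicting && rec.UnminedSince > 0 {
			candidates = append(candidates, hash)
		}
	}
	report.Candidates = len(candidates)

	if dryRun {
		return report // count only, no writes
	}
	for _, hash := range candidates {
		delete(s, hash)
		report.Deleted++
	}
	return report
}

func main() {
	s := Store{
		"conflicting-unmined": {Conflicting: true, UnminedSince: 100},
		"plain-unmined":       {Conflicting: false, UnminedSince: 100},
		"conflicting-mined":   {Conflicting: true, UnminedSince: 0},
	}
	fmt.Printf("dry run: %+v\n", PurgeConflictingUnmined(s, true))
	fmt.Printf("purge:   %+v\n", PurgeConflictingUnmined(s, false))
	fmt.Printf("re-run:  %+v\n", PurgeConflictingUnmined(s, false))
}
```

Because the predicate only matches records that are both conflicting and unmined, a second run over the same store finds no candidates, which is the idempotency property the test plan checks.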

Dropped ~850 lines of classification / cascade / chase-up / cache machinery.

FSM IDLE enforcement (from earlier work on this branch)

validateParentChain sets the FSM to IDLE when it detects integrity problems. Four service hot paths early-return when FSM is IDLE so no new work reaches a half-initialised block assembler:

| Service | Function |
| --- | --- |
| Block validation | blockHandler (Kafka consumer) |
| Subtree validation | CheckSubtreeFromBlock |
| Propagation | processTransaction |
| Legacy | HandleBlockDirect |

The blockchain FSM now accepts STOP from CATCHINGBLOCKS so a repair-needed error detected mid-catchup can actually park the node in IDLE (previously the transition was rejected and the node crash-looped).
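
As a rough picture of the transition change, the sketch below models the FSM as a lookup table. Only the state names IDLE, RUNNING and CATCHINGBLOCKS and the events RUN and STOP come from this PR; the `CATCHUPBLOCKS` and `LEGACYSYNC` event names, the `edge` type and the `Next` helper are assumptions for illustration, and the real FSM has more states than shown.

```go
package main

import (
	"errors"
	"fmt"
)

type State string
type Event string

const (
	Idle           State = "IDLE"
	Running        State = "RUNNING"
	CatchingBlocks State = "CATCHINGBLOCKS"
)

const (
	Run           Event = "RUN"
	Stop          Event = "STOP"
	CatchupBlocks Event = "CATCHUPBLOCKS" // hypothetical event name
	LegacySync    Event = "LEGACYSYNC"    // hypothetical event name
)

// ErrRejected is returned when an event is not allowed from the current state.
var ErrRejected = errors.New("fsm: transition rejected")

// edge identifies a (source state, event) pair.
type edge struct {
	from State
	ev   Event
}

// transitions lists the allowed moves in this simplified model.
var transitions = map[edge]State{
	{Idle, Run}:              Running,
	{Running, Stop}:          Idle,
	{Running, CatchupBlocks}: CatchingBlocks,
	{CatchingBlocks, Run}:    Running, // normal exit when catchup completes
	{CatchingBlocks, Stop}:   Idle,    // safety valve so the node can park for repair
}

// Next returns the state reached by ev, or the unchanged state and an error.
func Next(cur State, ev Event) (State, error) {
	next, ok := transitions[edge{cur, ev}]
	if !ok {
		return cur, fmt.Errorf("%w: %s from %s", ErrRejected, ev, cur)
	}
	return next, nil
}

func main() {
	s, err := Next(CatchingBlocks, Stop)
	fmt.Println(s, err)
	s, err = Next(CatchingBlocks, LegacySync)
	fmt.Println(s, err)
}
```

Without the `{CatchingBlocks, Stop}` entry, a STOP issued mid-catchup would be rejected and the caller would have no way to reach IDLE, which is the crash-loop shape described above.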

Block assembly freezes its gRPC entry points via an atomic frozenForRepair flag and spawns a watcher goroutine that retries loadUnminedTransactions the next time the FSM leaves IDLE — so after the purge completes the operator flips the FSM out of IDLE and BA resumes live without a node restart.

Operator flow

  1. BA startup hits validateParentChain → trips on parent is unmined but not in processing list → FSM → IDLE. Log: Run 'teranode-cli purge-conflicting-unmined' to fix.
  2. Operator runs teranode-cli purge-conflicting-unmined --skip-unmined-since-scan (first run without the skip flag, subsequent iterations with).
  3. Operator calls teranode-cli setfsmstate --fsmstate RUNNING. BA's watcher retries, parent is gone, child is harmlessly null-skipped, BA unfreezes.
  4. Mempool state re-populates over propagation.

Test plan

Purge suite (stores/utxo/tests/purge_conflicting_unmined_test.go):

  • TestPurgeConflictingUnmined_CleanState — empty store yields zeroed report
  • TestPurgeConflictingUnmined_DeletesConflictingUnmined: (Conflicting=true, UnminedSince>0) record deleted
  • TestPurgeConflictingUnmined_LeavesNonConflictingUnminedAlone — child with dangling parent ref untouched
  • TestPurgeConflictingUnmined_LeavesMinedTxAlone: UnminedSince=0 records protected
  • TestPurgeConflictingUnmined_DryRun — candidates counted, no writes
  • TestPurgeConflictingUnmined_SkipUnminedSinceScan — step 0 skipped, steps 1+2 still run
  • TestPurgeConflictingUnmined_UnminedSinceFix — step 0 clears stray UnminedSince on mined-on-best-chain
  • TestPurgeConflictingUnmined_Idempotent — second run finds nothing
  • TestPurgeConflictingUnmined_DeleteForwardsThroughStore — store wrapper sees Delete per purged hash (TxMetaCache eviction hook)

Infrastructure:

  • TestValidateParentChain_* — the parent-missing case is now tolerated and returns nil (the expected post-purge shape); all other integrity checks still trip idleAndError
  • Test_NewFiniteStateMachine — STOP from CATCHINGBLOCKS allowed
  • All propagation, subtreevalidation, legacy/netsync, blockassembly unit tests pass (553 in blockassembly packages alone)

Commits in the purge pivot

  • b510ebc00 — rename files + exports (repair_conflicts.go → purge_conflicting_unmined.go, RepairConflictingChains → PurgeConflictingUnmined, etc.). Pure rename, no logic change.
  • eea0468b2 — replace Case A/C/D classification with surgical purge. Extends InconsistentTxRecord with Conflicting bool so one scan seeds both step 0 and step 1. SQL ScanInconsistentUnminedTxs now implemented (was a no-op). validateParentChain skips missing parents.
  • 701287ebf — 9-test purge suite.
  • 3aebb81e6 — CLI subcommand rename, 11 operator-facing strings, settings doc, new docs/howto/recovery-from-idle.md runbook.

Earlier commits in the branch (repair era) are kept rather than rebased away — full history is useful post-mortem on why the classification approach was abandoned.

Not addressed / out of scope

  • Orphaned subtree blobs in subtree-store — existing pruner concern, not touched.
  • External .tx/.outputs blob cleanup — handled by existing pruner on delete_at_height.
  • Stale ConflictingChildren back-refs on purged parents — unread by any consumer after purge; no-op.
  • ErrRepairNeeded error type kept as-is (neutral semantics, ripple cost too high for marginal clarity).

icellan added 8 commits April 15, 2026 13:21
…nvalid chains

Update test for in-memory sort path to use no-input transactions so
validateParentChain passes trivially, since the function now hard-fails
instead of silently filtering transactions with unknown parents.
…rentChain

- stores/utxo/tests/repair_conflicts_test.go: 5 tests exercising RepairConflictingChains
  with a file-based SQLite store (WAL mode required to avoid SetConflicting deadlock
  in the SQL store). Covers clean state, Case A detection+fix, cascade to children,
  dry run, and step-0 UnminedSince no-op for SQL.
- services/blockassembly/validate_parent_chain_test.go: 3 tests for validateParentChain
  using a real sqlitememory UTXO store and blockchain.Mock for FSM event assertions.
  Covers hard-fail+FSMEventIDLE on unmined parent not in list, clean mined-parent success,
  and success when unmined parent precedes child in the processing list.

github-actions Bot commented Apr 15, 2026

Contributor

🤖 Claude Code Review

Status: Complete


Summary

This PR implements a surgical cleanup mechanism for inconsistent unmined transaction state that causes BlockAssembler startup failures. The approach is sound: instead of attempting complex state reconstruction, it deletes ephemeral unmined records that can be re-propagated, while preserving mined data. The implementation is well-tested, thoroughly documented, and integrates cleanly with the existing FSM IDLE infrastructure.

Findings

No critical issues found. The implementation demonstrates good engineering practices:

Strengths

  1. Clear separation of concerns: Cleanup logic (stores/utxo/cleanup_unmined.go) is decoupled from command interface (cmd/cleanupunmined/) and service integration
  2. Comprehensive test coverage: 9 tests covering clean state, deletion logic, dry-run, idempotency, and store integration
  3. Accurate documentation: recovery-from-idle.md correctly documents the operator workflow and safety guarantees
  4. Robust FSM integration: All four entry points (blockvalidation, subtreevalidation, propagation, legacy) correctly check FSM IDLE and pause processing
  5. Live repair without restart: watchForRepairCompletion goroutine enables resume after cleanup without node restart
  6. Safe deletion policy: Missing parents are tolerated post-cleanup; orphan-parent classification uses BA's CurrentBlock as anchor to prevent chain-drift issues

Documentation Accuracy Verified

  • Line 56 claim about SubtreeValidation.processMissingTransactions refetch safety: Accurate (services/subtreevalidation/processTxMetaUsingStore.go:140-148 treats TX_NOT_FOUND as miss counter, not fatal)
  • recovery-from-idle.md operator flow: Accurate (matches BlockAssembler.watchForRepairCompletion implementation)
  • blockassembly_settings.md FSM IDLE guidance: Accurate (teranode-cli cleanup-unmined is correct command name)

Architecture Notes

  • Three-step cleanup (unmined_since fix, conflicting-unmined purge, orphan-parent deletion) is well-justified in PR description
  • blockchainAdapter uses BA's CurrentBlock rather than blockchain service's best chain—correct choice to avoid reorg drift
  • SQL iterator implementation (stores/utxo/sql/unmined_iterator.go) matches Aerospike 1024-record batch size for consistency

Review Complete

No inline comments required. All previously reported issues have been resolved.

@icellan icellan self-assigned this Apr 15, 2026

github-actions Bot commented Apr 15, 2026

Contributor

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-704 (aed97f2)

Summary

  • Regressions: 0
  • Improvements: 0
  • Unchanged: 142
  • Significance level: p < 0.05
All benchmark results (sec/op)
Benchmark Baseline Current Change p-value
_NewBlockFromBytes-4 1.671µ 1.697µ ~ 0.100
SplitSyncedParentMap_SetIfNotExists/256_buckets-4 59.42n 59.51n ~ 1.000
SplitSyncedParentMap_SetIfNotExists/16_buckets-4 62.47n 59.40n ~ 0.700
SplitSyncedParentMap_SetIfNotExists/1_bucket-4 59.34n 59.36n ~ 1.000
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets... 33.83n 34.14n ~ 0.700
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_... 59.43n 57.71n ~ 0.200
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa... 146.3n 149.2n ~ 0.100
MiningCandidate_Stringify_Short-4 248.5n 247.0n ~ 0.200
MiningCandidate_Stringify_Long-4 1.714µ 1.720µ ~ 0.200
MiningSolution_Stringify-4 888.1n 889.6n ~ 0.700
BlockInfo_MarshalJSON-4 1.713µ 1.718µ ~ 1.000
NewFromBytes-4 129.2n 142.8n ~ 0.100
Mine_EasyDifficulty-4 58.43µ 58.49µ ~ 0.700
Mine_WithAddress-4 4.749µ 4.791µ ~ 0.700
BlockAssembler_AddTx-4 0.02909n 0.03044n ~ 0.400
AddNode-4 11.69 11.29 ~ 0.100
AddNodeWithMap-4 11.72 11.46 ~ 0.100
DirectSubtreeAdd/4_per_subtree-4 60.10n 61.78n ~ 0.400
DirectSubtreeAdd/64_per_subtree-4 31.14n 28.63n ~ 0.100
DirectSubtreeAdd/256_per_subtree-4 30.44n 27.07n ~ 0.100
DirectSubtreeAdd/1024_per_subtree-4 28.91n 26.09n ~ 0.100
DirectSubtreeAdd/2048_per_subtree-4 28.53n 25.75n ~ 0.100
SubtreeProcessorAdd/4_per_subtree-4 311.1n 315.7n ~ 0.100
SubtreeProcessorAdd/64_per_subtree-4 306.1n 317.4n ~ 0.100
SubtreeProcessorAdd/256_per_subtree-4 308.0n 320.8n ~ 0.100
SubtreeProcessorAdd/1024_per_subtree-4 310.0n 316.4n ~ 0.200
SubtreeProcessorAdd/2048_per_subtree-4 307.8n 316.3n ~ 0.200
SubtreeProcessorRotate/4_per_subtree-4 311.9n 316.7n ~ 1.000
SubtreeProcessorRotate/64_per_subtree-4 314.9n 317.8n ~ 0.300
SubtreeProcessorRotate/256_per_subtree-4 317.8n 319.1n ~ 1.000
SubtreeProcessorRotate/1024_per_subtree-4 309.6n 302.2n ~ 0.100
SubtreeNodeAddOnly/4_per_subtree-4 67.37n 66.61n ~ 0.200
SubtreeNodeAddOnly/64_per_subtree-4 40.16n 39.01n ~ 0.100
SubtreeNodeAddOnly/256_per_subtree-4 37.62n 37.72n ~ 1.000
SubtreeNodeAddOnly/1024_per_subtree-4 37.11n 37.24n ~ 0.400
SubtreeCreationOnly/4_per_subtree-4 165.7n 165.2n ~ 1.000
SubtreeCreationOnly/64_per_subtree-4 634.8n 631.9n ~ 0.400
SubtreeCreationOnly/256_per_subtree-4 2.010µ 2.003µ ~ 0.700
SubtreeCreationOnly/1024_per_subtree-4 5.206µ 5.204µ ~ 0.700
SubtreeCreationOnly/2048_per_subtree-4 8.211µ 9.307µ ~ 0.700
SubtreeProcessorOverheadBreakdown/64_per_subtree-4 308.8n 304.7n ~ 0.200
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4 309.4n 309.5n ~ 0.700
ParallelGetAndSetIfNotExists/1k_nodes-4 958.6µ 969.7µ ~ 0.200
ParallelGetAndSetIfNotExists/10k_nodes-4 1.932m 1.903m ~ 0.100
ParallelGetAndSetIfNotExists/50k_nodes-4 8.872m 8.640m ~ 0.200
ParallelGetAndSetIfNotExists/100k_nodes-4 17.62m 17.36m ~ 0.200
SequentialGetAndSetIfNotExists/1k_nodes-4 765.1µ 753.8µ ~ 0.100
SequentialGetAndSetIfNotExists/10k_nodes-4 2.958m 2.935m ~ 0.700
SequentialGetAndSetIfNotExists/50k_nodes-4 10.88m 10.83m ~ 0.700
SequentialGetAndSetIfNotExists/100k_nodes-4 20.64m 20.34m ~ 0.100
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4 1.042m 1.016m ~ 0.700
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4 4.737m 4.695m ~ 0.400
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4 19.40m 18.94m ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4 840.6µ 830.4µ ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4 6.027m 6.035m ~ 1.000
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4 39.86m 39.53m ~ 0.100
DiskTxMap_SetIfNotExists-4 3.462µ 3.601µ ~ 1.000
DiskTxMap_SetIfNotExists_Parallel-4 3.311µ 3.289µ ~ 0.100
DiskTxMap_ExistenceOnly-4 298.6n 298.3n ~ 1.000
Queue-4 194.5n 191.7n ~ 0.100
AtomicPointer-4 4.901n 4.883n ~ 1.000
ReorgOptimizations/DedupFilterPipeline/Old/10K-4 847.6µ 843.1µ ~ 1.000
ReorgOptimizations/DedupFilterPipeline/New/10K-4 813.1µ 817.2µ ~ 1.000
ReorgOptimizations/AllMarkFalse/Old/10K-4 115.0µ 113.2µ ~ 0.700
ReorgOptimizations/AllMarkFalse/New/10K-4 62.02µ 61.71µ ~ 0.100
ReorgOptimizations/HashSlicePool/Old/10K-4 68.09µ 72.07µ ~ 0.700
ReorgOptimizations/HashSlicePool/New/10K-4 11.40µ 11.44µ ~ 1.000
ReorgOptimizations/NodeFlags/Old/10K-4 5.522µ 6.130µ ~ 0.100
ReorgOptimizations/NodeFlags/New/10K-4 1.809µ 2.472µ ~ 0.100
ReorgOptimizations/DedupFilterPipeline/Old/100K-4 9.460m 10.136m ~ 0.400
ReorgOptimizations/DedupFilterPipeline/New/100K-4 9.500m 10.088m ~ 0.700
ReorgOptimizations/AllMarkFalse/Old/100K-4 1.119m 1.179m ~ 0.200
ReorgOptimizations/AllMarkFalse/New/100K-4 679.5µ 681.5µ ~ 1.000
ReorgOptimizations/HashSlicePool/Old/100K-4 711.2µ 666.1µ ~ 0.100
ReorgOptimizations/HashSlicePool/New/100K-4 306.4µ 338.2µ ~ 0.100
ReorgOptimizations/NodeFlags/Old/100K-4 52.57µ 56.32µ ~ 0.100
ReorgOptimizations/NodeFlags/New/100K-4 19.69µ 19.41µ ~ 0.700
TxMapSetIfNotExists-4 51.33n 51.54n ~ 0.400
TxMapSetIfNotExistsDuplicate-4 38.53n 37.91n ~ 0.700
ChannelSendReceive-4 621.6n 589.6n ~ 0.100
CalcBlockWork-4 468.2n 469.8n ~ 0.400
CalculateWork-4 631.6n 632.5n ~ 1.000
BuildBlockLocatorString_Helpers/Size_10-4 1.652µ 1.618µ ~ 1.000
BuildBlockLocatorString_Helpers/Size_100-4 12.37µ 12.52µ ~ 0.100
BuildBlockLocatorString_Helpers/Size_1000-4 123.0µ 122.3µ ~ 0.700
CatchupWithHeaderCache-4 104.2m 104.1m ~ 0.700
_BufferPoolAllocation/16KB-4 3.349µ 3.437µ ~ 0.100
_BufferPoolAllocation/32KB-4 7.429µ 8.171µ ~ 0.100
_BufferPoolAllocation/64KB-4 16.82µ 16.59µ ~ 0.700
_BufferPoolAllocation/128KB-4 28.21µ 32.63µ ~ 0.100
_BufferPoolAllocation/512KB-4 111.7µ 106.4µ ~ 0.100
_BufferPoolConcurrent/32KB-4 19.15µ 19.23µ ~ 1.000
_BufferPoolConcurrent/64KB-4 30.81µ 29.94µ ~ 0.400
_BufferPoolConcurrent/512KB-4 147.0µ 147.6µ ~ 0.400
_SubtreeDeserializationWithBufferSizes/16KB-4 619.3µ 631.2µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/32KB-4 611.0µ 626.5µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/64KB-4 610.8µ 620.5µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/128KB-4 597.5µ 619.3µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/512KB-4 625.5µ 629.0µ ~ 0.700
_SubtreeDataDeserializationWithBufferSizes/16KB-4 35.04m 34.96m ~ 0.700
_SubtreeDataDeserializationWithBufferSizes/32KB-4 34.57m 35.04m ~ 0.200
_SubtreeDataDeserializationWithBufferSizes/64KB-4 34.61m 34.77m ~ 0.400
_SubtreeDataDeserializationWithBufferSizes/128KB-4 34.68m 35.19m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/512KB-4 34.77m 34.58m ~ 0.700
_PooledVsNonPooled/Pooled-4 736.5n 737.6n ~ 0.700
_PooledVsNonPooled/NonPooled-4 7.139µ 7.422µ ~ 0.100
_MemoryFootprint/Current_512KB_32concurrent-4 6.585µ 6.632µ ~ 0.400
_MemoryFootprint/Proposed_32KB_32concurrent-4 9.846µ 9.929µ ~ 0.700
_MemoryFootprint/Alternative_64KB_32concurrent-4 9.089µ 10.113µ ~ 0.100
_prepareTxsPerLevel-4 399.9m 417.0m ~ 0.200
_prepareTxsPerLevelOrdered-4 3.495m 3.469m ~ 0.400
_prepareTxsPerLevel_Comparison/Original-4 412.2m 415.0m ~ 0.200
_prepareTxsPerLevel_Comparison/Optimized-4 3.482m 3.530m ~ 0.100
SubtreeSizes/10k_tx_4_per_subtree-4 1.264m 1.281m ~ 0.200
SubtreeSizes/10k_tx_16_per_subtree-4 295.8µ 298.8µ ~ 0.700
SubtreeSizes/10k_tx_64_per_subtree-4 70.75µ 71.79µ ~ 0.400
SubtreeSizes/10k_tx_256_per_subtree-4 17.56µ 17.70µ ~ 0.400
SubtreeSizes/10k_tx_512_per_subtree-4 8.675µ 8.855µ ~ 0.200
SubtreeSizes/10k_tx_1024_per_subtree-4 4.324µ 4.318µ ~ 0.400
SubtreeSizes/10k_tx_2k_per_subtree-4 2.176µ 2.177µ ~ 0.700
BlockSizeScaling/10k_tx_64_per_subtree-4 68.65µ 69.61µ ~ 0.200
BlockSizeScaling/10k_tx_256_per_subtree-4 17.27µ 17.41µ ~ 0.400
BlockSizeScaling/10k_tx_1024_per_subtree-4 4.298µ 4.348µ ~ 0.200
BlockSizeScaling/50k_tx_64_per_subtree-4 362.5µ 364.3µ ~ 0.700
BlockSizeScaling/50k_tx_256_per_subtree-4 86.43µ 86.90µ ~ 1.000
BlockSizeScaling/50k_tx_1024_per_subtree-4 21.29µ 21.24µ ~ 1.000
SubtreeAllocations/small_subtrees_exists_check-4 147.1µ 147.8µ ~ 0.700
SubtreeAllocations/small_subtrees_data_fetch-4 156.2µ 157.7µ ~ 0.400
SubtreeAllocations/small_subtrees_full_validation-4 304.8µ 307.7µ ~ 1.000
SubtreeAllocations/medium_subtrees_exists_check-4 8.760µ 8.824µ ~ 0.700
SubtreeAllocations/medium_subtrees_data_fetch-4 9.162µ 9.201µ ~ 0.700
SubtreeAllocations/medium_subtrees_full_validation-4 17.17µ 17.33µ ~ 0.100
SubtreeAllocations/large_subtrees_exists_check-4 2.073µ 2.080µ ~ 0.700
SubtreeAllocations/large_subtrees_data_fetch-4 2.198µ 2.205µ ~ 0.700
SubtreeAllocations/large_subtrees_full_validation-4 4.336µ 4.399µ ~ 0.400
StoreBlock_Sequential/BelowCSVHeight-4 315.4µ 306.0µ ~ 0.400
StoreBlock_Sequential/AboveCSVHeight-4 306.9µ 307.3µ ~ 1.000
GetUtxoHashes-4 207.8n 209.0n ~ 1.000
GetUtxoHashes_ManyOutputs-4 36.35µ 39.33µ ~ 0.100
_NewMetaDataFromBytes-4 179.1n 178.7n ~ 0.200
_Bytes-4 474.1n 473.0n ~ 1.000
_MetaBytes-4 446.9n 427.4n ~ 0.100

Threshold: >10% with p < 0.05 | Generated: 2026-04-19 15:23 UTC

@icellan icellan requested review from ordishs and oskarszoon April 15, 2026 14:09
…sed metric; add nil-guards to subtreeHandler

validateParentChain no longer filters — it hard-fails with FSM IDLE.
The old setting and its prometheus counter were left behind as dead code.
subtreeHandler.go lacked nil-guards on blockchainClient and FSM state
that other services already had, risking a panic in edge cases.
icellan and others added 7 commits April 16, 2026 11:03
… repair can run

validateParentChain errors were propagating as fatal, killing all services
including blockchain gRPC — making repair-conflicts unreachable. Now the
error is caught in Start(), FSM stays IDLE, and the node stays up.

Also switch idleAndError from SendFSMEvent(STOP) to Idle() which handles
already-IDLE state gracefully instead of logging a spurious error.
…rogress

- Replace brittle string matching with typed ErrRepairNeeded error
  (ERR_REPAIR_NEEDED=102 in proto) for validateParentChain → Start() flow
- Fix "teranodecli" → "teranode-cli" typo across all services
- Add progress logging to RepairConflictingChains so it's not silent
  for minutes during large UTXO store repairs
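
The move from string matching to a typed error might look something like the following. It is a sketch only: the sentinel here is a plain Go error standing in for the proto-backed ERR_REPAIR_NEEDED, the function bodies are simplified, and the `inconsistent` parameter and the message text are assumptions for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrRepairNeeded is a plain sentinel standing in for the typed error.
var ErrRepairNeeded = errors.New("repair needed")

// validateParentChain is a stub: it reports a repair-needed condition when
// the caller says the unmined set is inconsistent.
func validateParentChain(inconsistent bool) error {
	if inconsistent {
		return fmt.Errorf("%w: unmined parent chain failed validation", ErrRepairNeeded)
	}
	return nil
}

// loadUnminedTransactions wraps the validation error with extra context,
// as intermediate layers typically do.
func loadUnminedTransactions(inconsistent bool) error {
	if err := validateParentChain(inconsistent); err != nil {
		return fmt.Errorf("load unmined transactions: %w", err)
	}
	return nil
}

// start reports parked=true when the node should stay up in IDLE for repair.
// errors.Is sees through the wrapping, so no message text is inspected.
func start(inconsistent bool) (parked bool, err error) {
	err = loadUnminedTransactions(inconsistent)
	switch {
	case err == nil:
		return false, nil
	case errors.Is(err, ErrRepairNeeded):
		return true, nil // swallow: keep services up so the repair CLI can connect
	default:
		return false, err
	}
}

func main() {
	fmt.Println(start(true))
	fmt.Println(start(false))
}
```

The benefit over matching on the message is that rewording a log string, or adding another layer of wrapping, no longer changes control flow.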
… repair progress

Idle() sent gRPC but never updated the local fmsState cache, leaving it
stale at RUNNING. GetMiningCandidate then passed the FSM guard and
returned empty block templates while the node needed repair.

Also switch repair progress from batch counts to record counts via
TotalScanned() for meaningful output on large UTXO stores.
oskarszoon and others added 18 commits April 17, 2026 08:50
Blocking STOP from CATCHINGBLOCKS traps the node in a crash loop when a
data-integrity check fails during catchup: BlockAssembler's
validateParentChain calls Idle() to move to IDLE for repair, the FSM
rejects the event, BA.Start returns StorageError, ServiceManager stops
BA, the node exits, and on restart the FSM is still persisted as
CATCHINGBLOCKS — so the same thing happens again. The operator has no
window to run teranode-cli repair-conflicts.

Adds CATCHINGBLOCKS to STOP's Src list and widens the SendFSMEvent
guard to permit STOP alongside RUN. RUN is still the normal exit when
catchup completes; STOP is a safety valve for repair. Updates the
state-machine diagram and fsm_test.

The guard's original intent — preventing accidental transitions that
would abandon catchup to RUNNING/LEGACYSYNCING — is preserved: those
events remain rejected from CATCHINGBLOCKS.
After unmarking an orphan-conflicting parent P in step 4, any
grandparent that (a) is Conflicting=true, (b) either names P as the
recorded spender per its own SpendingData or has no SpendingData for
the relevant vout at all (common when the grandparent was already
conflicting when P's Spend ran, so the write was skipped), and (c) has
no *active* conflicting children (counting only entries still
Conflicting=true) is itself an orphan. Enqueue grandparents into the
step-4 worklist so chains of stacked orphan-conflicting ancestors are
resolved in a single repair run.

Also broadens step-1 detection: when the scanned child's parent has
SpendingData==nil for the relevant vout but is flagged conflicting
with no active conflicting children, treat the parent as a Case D
candidate. Without this, the child's input walk would stop at the nil
SpendingData and the parent would only be reachable via chase-up
started from yet another candidate.

Adds hasActiveConflictingChildren helper to consistently ignore stale
back-references in ConflictingChildren left over from unmarking.

Observed on mainnet-eu-1: tx 8dacf3...f464 got unmarked on the first
repair pass, but its grandparent 217494d8...17ef stayed conflicting
with conflictingCs=[8dacf3] (stale) and spentUtxos=0 (no SpendingData
ever recorded), so validateParentChain kept tripping on the next
restart.
Mainnet-eu-1 run missed a parent (4557bdc6) that aerospikereader shows
matches Case D criteria exactly (Conflicting=true, SD for the relevant
vout is nil, ConflictingChildren=[5d12221c], and 5d12221c is currently
non-conflicting). Step 1 report said only 2 Case D candidates. No
report.Errors were raised for that parent, so the path that skipped it
is unclear.

Adds targeted logProgress lines inside step 1 that fire only when:
  - s.Get(parent) returned an error or nil,
  - vout is out of range of parent's SpendingDatas,
  - parent.Conflicting is true (rare — bounded noise),
  - hasActiveConflictingChildren bails due to a child Get error or a
    still-conflicting child (with the child hash).

Next repair run will either show 4557bdc6 being added to the orphan
list (implying the real miss is elsewhere), or reveal the exact reason
detection skipped it. No behavior change; only logging.
… dedup

A dirty UTXO store is never acceptable in Teranode, so any DB read or
write error during repair must halt the run rather than be swallowed
and reported as "non-fatal". Every `report.Errors = append(...);
continue` path in RepairConflictingChains now returns the error and
aborts — the only errors treated as benign are TX_NOT_FOUND responses
for external references that were never stored (parent's grandparent,
pruned ancestors), which are a legitimate outcome rather than a
failure. The RepairReport.Errors slice is removed.

Other correctness / safety changes in the same pass:

- Case C dedup: the previous key (pair.loser) silently dropped distinct
  winners that happened to share a loser, leaving their Conflicting=true
  flag set. Dedup is now by winner — each distinct real-winner has
  ProcessConflicting called exactly once, using a fresh dedup map per
  call so dry-run no longer mutates shared state.
- cascadeConflictingViaSpendingData: add cascadeMaxVisited cap so a
  corrupted or pathological SpendingData graph cannot grow the visited
  set without bound. SetConflicting is now issued once per frontier
  level instead of once per child, cutting N round trips to one per
  level.
- hasActiveConflictingChildren returns (bool, error) and is invoked
  with a logReason callback so step-1 diagnostics can record why a
  parent wasn't enqueued as a Case D candidate.
- Out-of-range vout is now a ProcessingError (genuine store corruption)
  rather than a silent skip.
- progressFn is a single parameter, not variadic — the "optional slice
  that only reads [0]" shape was a footgun.
- Step log tags are consistent at /4 throughout.
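
The capped, level-batched cascade from the list above can be pictured as a breadth-first walk. The sketch below is an illustration under assumptions: the child graph is a plain map rather than SpendingData lookups, the cap is passed as a parameter because the actual cascadeMaxVisited value is not stated here, and `setConflicting` is a callback standing in for the store call.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrCascadeTooLarge signals that the visited set hit the cap.
var ErrCascadeTooLarge = errors.New("cascade exceeded visited cap")

// cascade walks descendants of root level by level. It calls setConflicting
// once per frontier level (not once per child) and aborts when the visited
// set would grow past maxVisited. It returns how many descendants were
// handed to setConflicting.
func cascade(children map[string][]string, root string, maxVisited int,
	setConflicting func(batch []string)) (int, error) {

	visited := map[string]bool{root: true}
	frontier := []string{root}
	marked := 0

	for len(frontier) > 0 {
		var next []string
		for _, hash := range frontier {
			for _, child := range children[hash] {
				if visited[child] {
					continue // already reached via another parent
				}
				if len(visited) >= maxVisited {
					return marked, ErrCascadeTooLarge
				}
				visited[child] = true
				next = append(next, child)
			}
		}
		if len(next) > 0 {
			setConflicting(next) // one round trip for the whole level
			marked += len(next)
		}
		frontier = next
	}
	return marked, nil
}

// demo runs the cascade over a small diamond-shaped graph and counts
// how many batched calls were issued.
func demo(maxVisited int) (marked, calls int, err error) {
	graph := map[string][]string{
		"root": {"a", "b"},
		"a":    {"c"},
		"b":    {"c"},
		"c":    {"d"},
	}
	marked, err = cascade(graph, "root", maxVisited, func([]string) { calls++ })
	return
}

func main() {
	fmt.Println(demo(100))
	fmt.Println(demo(3))
}
```

In the diamond graph, four descendants are marked with three calls (one per level) instead of four, and a cap of 3 stops the walk after the first level.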

CLI: drop the "non-fatal errors" section, nothing to iterate anymore —
the single abort error is surfaced via the wrapper ProcessingError.

Tests updated for the new signature and removed field; 13 repair tests
still pass and cover Case A, Case C, Case D with dry-run, cascades,
chained orphans and legit-conflict safety checks.
…very on FSM leave IDLE

When loadUnminedTransactions returns ErrRepairNeeded the assembler used
to return nil from Start() but skipped subtreeProcessor.Start and
startChannelListeners entirely — leaving the gRPC server accepting
calls against a half-initialised assembler. The FSM IDLE guards in
upstream services are best-effort and miss some paths, so a call could
still reach AddTx / GetMiningCandidate / SubmitMiningSolution and
either hang on an unreferenced channel or touch an uninitialised
processor.

A new atomic frozenForRepair flag is set in that path and exposed via
FrozenForRepair(). The gRPC methods most likely to be invoked
(AddTx, AddTxBatch, AddTxBatchColumnar, RemoveTx, GetMiningCandidate,
SubmitMiningSolution, GetCandidateBlock, ResetBlockAssembly) now call
ba.assertNotFrozenForRepair() up front and return ErrRepairNeeded —
defence-in-depth alongside the FSM IDLE checks in upstream services.

Recovery is live rather than requiring a restart, to match the
pause/resume semantics of blockvalidation and subtreevalidation:

- Start() spawns watchForRepairCompletion as a wg-tracked goroutine.
- The watcher blocks on WaitUntilFSMTransitionFromIdleState and retries
  loadUnminedTransactions once the operator moves the FSM out of IDLE
  (after running teranode-cli repair-conflicts).
- On success it runs startAfterLoadUnmined (subtreeProcessor.Start,
  startChannelListeners, height metric) and clears the frozen flag —
  gated gRPC methods start accepting traffic without a node restart.
- If loadUnminedTransactions returns ErrRepairNeeded again, idleAndError
  has already put the FSM back to IDLE; the watcher loops and waits
  for the next transition.
- Any other error stops the watcher and keeps the assembler frozen —
  non-repair failures are outside this recovery path's remit.
- Cleanly exits on context.Canceled during shutdown.
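
A condensed sketch of the freeze-and-watch flow described above follows. The names `frozenForRepair`, `watchForRepairCompletion` and `loadUnminedTransactions` come from this commit; everything else is an assumption for illustration: the `Assembler` struct is reduced to three fields, a channel stands in for WaitUntilFSMTransitionFromIdleState, `load` is an injected function, and startAfterLoadUnmined is omitted.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync/atomic"
)

var ErrRepairNeeded = errors.New("repair needed")

// Assembler is a pared-down stand-in for the block assembler.
type Assembler struct {
	frozenForRepair atomic.Bool
	load            func() error    // stand-in for loadUnminedTransactions
	leftIdle        <-chan struct{} // fires each time the FSM leaves IDLE
}

// AddTx is one gated entry point: it refuses work while frozen.
func (a *Assembler) AddTx() error {
	if a.frozenForRepair.Load() {
		return ErrRepairNeeded
	}
	return nil
}

// watchForRepairCompletion retries load every time the FSM leaves IDLE.
// Success clears the frozen flag; ErrRepairNeeded waits for the next
// transition; any other error stops the watcher and leaves the flag set.
func (a *Assembler) watchForRepairCompletion(ctx context.Context) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-a.leftIdle:
		}
		err := a.load()
		switch {
		case err == nil:
			a.frozenForRepair.Store(false)
			return nil
		case errors.Is(err, ErrRepairNeeded):
			continue
		default:
			return err
		}
	}
}

// demo simulates two FSM transitions out of IDLE: the first load still
// needs repair, the second succeeds.
func demo() (rejectedWhileFrozen bool, attempts int, acceptedAfter bool) {
	leftIdle := make(chan struct{}, 2)
	leftIdle <- struct{}{}
	leftIdle <- struct{}{}

	a := &Assembler{leftIdle: leftIdle}
	a.frozenForRepair.Store(true)
	a.load = func() error {
		attempts++
		if attempts == 1 {
			return fmt.Errorf("still inconsistent: %w", ErrRepairNeeded)
		}
		return nil
	}

	rejectedWhileFrozen = errors.Is(a.AddTx(), ErrRepairNeeded)
	err := a.watchForRepairCompletion(context.Background())
	acceptedAfter = err == nil && a.AddTx() == nil
	return
}

func main() {
	fmt.Println(demo())
}
```

In the real service the watcher runs as a goroutine tracked by a wait group; it is called synchronously here only to keep the example deterministic.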

Unrelated cleanup in the same area: subtreeprocessor reset's
clear-processed-at errgroup now has SetLimit(16) so a reset spanning
hundreds of moveBack blocks doesn't launch hundreds of concurrent
SetBlockProcessedAt writes against the blockchain store.
…routine

Every handler that checked for FSM IDLE used to log-and-fall-through on
a check failure, spawn a fresh resume goroutine per invocation, and
(in blockvalidation's case) wait on context.Background() so the
goroutine couldn't exit on service shutdown. Under load that's
hundreds of routines racing to ResumeAll, and a transient FSM-check
error silently bypasses the guard entirely.

Changes applied to blockvalidation/Server.go, subtreevalidation/
subtreeHandler.go, subtreevalidation/txmetaHandler.go, legacy/netsync/
handle_block.go and propagation/Server.go:

- FSM-check errors now return an error rather than logging and
  continuing. Fail closed: if we can't confirm the FSM is not IDLE,
  don't admit the block / subtree / tx while the node may be in repair.
- A new idleConsumerPaused atomic.Bool (on blockvalidation.Server and
  subtreevalidation.Server) guards the pause/resume transition. Only
  the first IDLE-observed call PauseAll's the consumers and spawns a
  single watcher; concurrent handler invocations short-circuit via
  CompareAndSwap. The watcher defers Store(false) on completion so
  the next IDLE episode re-arms cleanly.
- blockvalidation's resume goroutine now uses the service context
  plumbed through from consumerMessageHandler instead of
  context.Background(), so it exits on shutdown.
- blockHandler returns ErrServiceError when IDLE instead of nil so the
  Kafka offset is not advanced and the in-hand message is retried after
  the FSM leaves IDLE — matching what the log claims.
- txmetaHandler operator hint updated to reference the repair CLI,
  matching the other handlers.
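
The single-watcher guard and the fail-closed check can be illustrated with a small model. The field name `idleConsumerPaused` is taken from this commit; the `Server` shape, the `fsmIsIdle` hook, the `pauseCalls` counter (standing in for PauseAll plus spawning the watcher) and `watcherDone` are assumptions made for the sketch.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

var ErrIdle = errors.New("fsm is IDLE")

// Server is a minimal stand-in for a service with a Kafka-style handler.
type Server struct {
	idleConsumerPaused atomic.Bool
	pauseCalls         atomic.Int32         // counts pause + watcher spawns
	fsmIsIdle          func() (bool, error) // hypothetical FSM state check
}

// handle is invoked once per message.
func (s *Server) handle() error {
	idle, err := s.fsmIsIdle()
	if err != nil {
		// Fail closed: if the state is unknown, do not admit the message.
		return fmt.Errorf("fsm state check failed: %w", err)
	}
	if !idle {
		return nil
	}
	// Only the first caller to observe IDLE pauses consumers and starts a
	// watcher; concurrent callers lose the CompareAndSwap and skip it.
	if s.idleConsumerPaused.CompareAndSwap(false, true) {
		s.pauseCalls.Add(1)
	}
	return ErrIdle // non-nil so the message is retried later
}

// watcherDone models the watcher's deferred reset after the FSM leaves IDLE.
func (s *Server) watcherDone() { s.idleConsumerPaused.Store(false) }

// demo fires two bursts of concurrent handler calls, one per IDLE episode,
// then simulates a failing FSM check.
func demo() (firstEpisode, secondEpisode int32, failedClosed bool) {
	s := &Server{fsmIsIdle: func() (bool, error) { return true, nil }}

	burst := func() {
		var wg sync.WaitGroup
		for i := 0; i < 100; i++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				_ = s.handle()
			}()
		}
		wg.Wait()
	}

	burst()
	firstEpisode = s.pauseCalls.Load()

	s.watcherDone() // re-arm for the next IDLE episode
	burst()
	secondEpisode = s.pauseCalls.Load()

	s.fsmIsIdle = func() (bool, error) { return false, errors.New("rpc unavailable") }
	failedClosed = s.handle() != nil
	return
}

func main() {
	fmt.Println(demo())
}
```

A hundred concurrent calls in one IDLE episode produce a single pause, and the reset lets the next episode arm the guard again.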

pruner's triggerInitialPruning hash-lookup comment rewritten to be
accurate about reorg semantics: GetBlockHeadersByHeight returns the
current main-chain hash at the persisted height, which may differ
from the hash that was actually persisted on an older fork; pruning
is by height so this only affects the log line, not the work
performed.
…se D

Mainnet-eu-1 run surfaced a mutual-blocker pathology where a parent's
ConflictingChildren list names a child that is itself orphan-conflicting
(Conflicting=true with no credible reason — grandparent SpendingData
does not show a legit loss for the child either). The old
hasActiveConflictingChildren check saw Conflicting=true on the child
and classified the parent as having active conflicts, so the parent
was never added as a Case D candidate. The orphan child is invisible
to the unmined iterator (conflicting filter), so it is never reached
either. The pair stays stuck forever across repair runs.

Replace hasActiveConflictingChildren with classifyConflictingChildren,
which recursively checks each still-conflicting entry in the list:

  - stale back-reference (child.Conflicting=false now) → ignore
  - child.Conflicting=true AND some grandparent.SpendingData names a
    different spender for one of child's inputs → legit loser, parent
    is not a Case D candidate
  - child.Conflicting=true AND every reachable grandparent either
    names the child itself or has nil SpendingData → orphan, return
    alongside bool hasLegit=false

Step 1 and step-4 chase-up enqueue any orphan children they find so
they are unmarked in the same pass as the parent they were blocking.
Legit-loser detection is unchanged.
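
The three-way classification above can be sketched as follows. This is a simplified, non-recursive reading of the rules, not the code in this commit: the `Tx`, `Input` and `Store` types are hypothetical, SpendingData is reduced to one spender hash per output (empty string for nil), and an out-of-range vout is skipped here although the repair code treats it as a ProcessingError.

```go
package main

import "fmt"

// Input names the output a transaction spends.
type Input struct {
	Parent string
	Vout   int
}

// Tx is a hypothetical, reduced transaction record.
type Tx struct {
	Conflicting         bool
	Inputs              []Input
	SpendingData        []string // per output: recorded spender hash, "" when nil
	ConflictingChildren []string
}

// Store is an in-memory stand-in keyed by tx hash.
type Store map[string]*Tx

// isLegitLoser reports whether any input source of tx records a different
// spender for the output tx tries to spend.
func isLegitLoser(s Store, hash string, tx *Tx) bool {
	for _, in := range tx.Inputs {
		src, ok := s[in.Parent]
		if !ok || in.Vout < 0 || in.Vout >= len(src.SpendingData) {
			continue
		}
		if spender := src.SpendingData[in.Vout]; spender != "" && spender != hash {
			return true
		}
	}
	return false
}

// classifyConflictingChildren walks parent.ConflictingChildren. Stale
// back-references are ignored; a legit loser makes hasLegit true; anything
// else still marked conflicting is returned as an orphan.
func classifyConflictingChildren(s Store, parent *Tx) (hasLegit bool, orphans []string) {
	for _, childHash := range parent.ConflictingChildren {
		child, ok := s[childHash]
		if !ok || !child.Conflicting {
			continue // stale back-reference
		}
		if isLegitLoser(s, childHash, child) {
			return true, nil
		}
		orphans = append(orphans, childHash)
	}
	return false, orphans
}

func main() {
	s := Store{
		"parentX": {Conflicting: true, SpendingData: []string{""},
			ConflictingChildren: []string{"stale", "blocker"}},
		"stale":   {Conflicting: false},
		"blocker": {Conflicting: true, Inputs: []Input{{Parent: "parentX", Vout: 0}}},
	}
	fmt.Println(classifyConflictingChildren(s, s["parentX"]))
}
```

With this shape, `parentX` is no longer blocked by `blocker`: the child is reported as an orphan to be unmarked in the same pass rather than counted as an active conflict.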

Adds TestRepairConflictingChains_CaseD_OrphanBlocksParentDetection
covering the exact shape from mainnet: parentX (orphan) with
conflictingCs=[blocker] where blocker is itself orphan, plus a
non-conflicting goodChild spending a different output. One repair
pass unmarks both and leaves goodChild untouched.
SetConflicting(false) clears the child's Conflicting flag but leaves
the back-reference in every parent's ConflictingChildren list — the
SQL updateParentConflictingChildren helper only ever INSERTs. If such
a stale sibling is also on the best chain, the Case C scan would enqueue
it as the "real winner", ProcessConflicting would reject it with "tx is
not conflicting", and — now that DB errors are fatal — the whole repair
would abort before Case D even starts.

Filter stale entries by checking sibling.Conflicting=true alongside the
best-chain check. A BlockIDs-only Get is not enough.

Observed on mainnet-eu-1: step 1 correctly identified 2716 Case D
orphans (after the previous orphan-blocker fix lit up detection) but
step 2 aborted on tx 1e541f1… which is on the best chain but had been
unmarked in an earlier repair run.

Adds TestRepairConflictingChains_CaseC_StaleSiblingSkipped.
… empty

A legit-losing parent whose outputs were never recorded as spent
(spentUtxos=0, all SpendingDatas nil) is a real shape on mainnet: the
parent was already Conflicting=true at the time its children ran
Spend, and some code paths skip the SpendingData write for conflicting
parents. cascadeConflictingViaSpendingData(parent) then walks an empty
SD list and marks zero descendants — the real non-conflicting children
(whose inputs do name the parent) stay visible to the unmined iterator
and validateParentChain keeps tripping across restarts.

Step 1 now records the direct children that spend each orphan-candidate
parent in caseDDirectChildren. When step 4 classifies a parent as a
legit loser, the cascade seeds from those tracked children in addition
to whatever parent.SD turns up. The children's own SpendingDatas are
properly populated, so the subsequent walk propagates correctly.
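
The seeding idea can be illustrated with a small breadth-first walk. This is a sketch under simplifying assumptions: `tx.Spenders` is a hypothetical flattening of the non-nil SpendingData entries, and the store is an in-memory map.

```go
package main

import "fmt"

// tx is a hypothetical record; Spenders lists the hashes found in its
// non-nil SpendingData entries.
type tx struct {
	Conflicting bool
	Spenders    []string
}

// cascade marks every descendant reachable from parent as conflicting and
// returns the hashes it newly marked. seeds are direct children recorded
// during the scan; they cover a parent whose SpendingData was never written.
func cascade(store map[string]*tx, parent string, seeds []string) []string {
	queue := append([]string{}, seeds...)
	if p, ok := store[parent]; ok {
		queue = append(queue, p.Spenders...)
	}
	seen := map[string]bool{parent: true}
	var marked []string
	for len(queue) > 0 {
		h := queue[0]
		queue = queue[1:]
		if seen[h] {
			continue
		}
		seen[h] = true
		t, ok := store[h]
		if !ok {
			continue
		}
		if !t.Conflicting {
			t.Conflicting = true
			marked = append(marked, h)
		}
		queue = append(queue, t.Spenders...)
	}
	return marked
}

func main() {
	store := map[string]*tx{
		"parentLoser":  {Conflicting: true}, // SpendingData all nil
		"childOfLoser": {Spenders: []string{"grandchild"}},
		"grandchild":   {},
	}
	fmt.Println(cascade(store, "parentLoser", nil))
	fmt.Println(cascade(store, "parentLoser", []string{"childOfLoser"}))
}
```

Without seeds the walk has nothing to follow from `parentLoser`; with the tracked child as a seed it reaches the grandchild through the child's own spender list.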

Observed on mainnet-eu-1: parent 4557bdc6 is a legit loser against its
grandparent rootTx (rootTx.SD[0] names a10bd058, not 4557bdc6). Repair
correctly identified the legit-conflict path and called the cascade,
but 4557bdc6.SD is all nil, so child 5d12221c, which spends
4557bdc6[1], never got its Conflicting=true mark.

Adds TestRepairConflictingChains_CaseD_LegitCascadeWithNilParentSD
covering the exact shape (parentLoser with nil SD, childOfLoser whose
input still names the parent).
…tep 1

Mainnet run showed step 1 stalling for hours. One parent accumulated
341 entries in its ConflictingChildren list, and ~14k non-conflicting
unmined txs had inputs pointing at it. Without caching, each visit
refetched the parent (external tx = file store hit) and re-ran
classifyConflictingChildren over all 341 entries, each of which does a
child Get + a Get per input → grandparent SD check. Tens of millions
of Gets, many external.

Step 1 does no writes, so the parent's metadata and the classification
result are stable for the duration of the scan. Add two scoped caches:

  - parentMetaCache (hash → *meta.Data, plus parentMetaNotFound for
    negative caching) behind fetchParent, so each distinct parent is
    fetched once regardless of how many children reference it.
  - classificationCache (parent hash → {orphans, hasLegit}) behind
    classifyCached, reusing the expensive recursion over the
    ConflictingChildren list across every child visit of the same
    parent.

Also drops the per-input debug log lines that flooded output when a
parent's ConflictingChildren was large (hundreds of hashes per line,
>100KB per visit). logProgress callback is still threaded into
classifyConflictingChildren for diagnostics from within the helper.

Both caches are local to step 1 — step 4 writes invalidate them, so
step 4 continues to call classifyConflictingChildren directly without
the cache.
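
A parent cache with negative caching might look roughly like this. The names (`parentCache`, `meta`, `fetch`) are illustrative only, and the sketch assumes a single-goroutine, read-only scan as described above.

```go
package main

import "fmt"

// meta is a hypothetical stand-in for the parent's metadata.
type meta struct {
	ConflictingChildren []string
}

// parentCache memoizes lookups for the duration of a read-only scan.
// A nil entry records "not found" so misses are not refetched either.
type parentCache struct {
	fetch   func(hash string) (*meta, bool) // the expensive store lookup
	entries map[string]*meta
	fetches int // number of calls that reached the store
}

func newParentCache(fetch func(string) (*meta, bool)) *parentCache {
	return &parentCache{fetch: fetch, entries: make(map[string]*meta)}
}

func (c *parentCache) get(hash string) (*meta, bool) {
	if m, ok := c.entries[hash]; ok {
		return m, m != nil
	}
	c.fetches++
	m, found := c.fetch(hash)
	if !found || m == nil {
		c.entries[hash] = nil // negative cache entry
		return nil, false
	}
	c.entries[hash] = m
	return m, true
}

func main() {
	backing := map[string]*meta{"bigParent": {ConflictingChildren: make([]string, 341)}}
	cache := newParentCache(func(h string) (*meta, bool) {
		m, ok := backing[h]
		return m, ok
	})
	for i := 0; i < 14000; i++ {
		cache.get("bigParent")
		cache.get("missingParent")
	}
	fmt.Println("store fetches:", cache.fetches)
}
```

The same shape works for the classification cache by swapping the value type for the `{orphans, hasLegit}` result.
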
Step 0 is a full-store consistency scan (hundreds of millions of
records on a production node) and is almost always a no-op after it
has completed cleanly once. Iterating on Case A / C / D fixes currently
pays that cost on every run, which has turned into hours per attempt.

Add RepairOptions.SkipUnminedSinceScan and a --skip-unmined-since-scan
CLI flag on teranode-cli repair-conflicts. When set, step 0 is
announced as skipped and the run jumps straight to scanning unmined
transactions. Best-chain header data (needed by Case A / Case C) is
fetched unconditionally up front, outside the skip gate.

Defaults unchanged — a fresh run still does the full scan.
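
The gating can be sketched with the standard `flag` package. Only the flag name and the `SkipUnminedSinceScan` field come from this commit; the `options` struct, `parseOptions`, and `plan` are hypothetical, and the real CLI may wire its flags differently.

```go
package main

import (
	"flag"
	"fmt"
)

// options is a hypothetical stand-in for RepairOptions.
type options struct {
	SkipUnminedSinceScan bool
}

func parseOptions(args []string) (options, error) {
	var opts options
	fs := flag.NewFlagSet("repair-conflicts", flag.ContinueOnError)
	fs.BoolVar(&opts.SkipUnminedSinceScan, "skip-unmined-since-scan", false,
		"skip the step 0 full-store consistency scan")
	err := fs.Parse(args)
	return opts, err
}

// plan lists the phases a run would execute for the given options.
func plan(opts options) []string {
	// Best-chain headers are fetched outside the skip gate.
	steps := []string{"fetch best-chain headers"}
	if opts.SkipUnminedSinceScan {
		steps = append(steps, "step 0: skipped")
	} else {
		steps = append(steps, "step 0: full-store consistency scan")
	}
	return append(steps, "step 1: scan unmined transactions")
}

func main() {
	opts, err := parseOptions([]string{"--skip-unmined-since-scan"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, s := range plan(opts) {
		fmt.Println(s)
	}
}
```
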
Mainnet run stalled in step 4: each distinct parent triggered
classifyConflictingChildren which did a Get+Tx on every entry in the
parent's ConflictingChildren list, plus Gets for each grandparent for
every input of every child. Two "blocker" children were shared across
~2700 parents — each parent re-classified them from scratch — and the
blockers are external txs with 2001 utxos each, so every Get hit the
file store.

Split classifyConflictingChildren into:

  - classifyChild (new): per-child classification returning
    {exists, conflicting, legit}. Stable for the lifetime of a single
    repair run so long as SetConflicting(h, false) is not later called
    on the same h — and if it is, treating h as still-conflicting in
    other parents' lists just produces an acceptable stale back-ref,
    not an incorrect Case D decision.
  - classifyConflictingChildren: now takes an optional childCache map
    and memoizes per-child results across calls.

A single childClassCache is shared by step 1 (via classifyCached) and
step 4 (both the fresh-parent legit check and the chase-up grandparent
check). Each distinct child is fetched + grandparent-walked once per
repair run regardless of how many parents reference it.

Also add appendCaseDOrphan dedup at step-1 append time — the raw slice
grew to ~40k entries on mainnet while the unique orphan count was a
few hundred, because the same parents and blocker children are pushed
for every non-conflicting child that visits them. Step 4's seenCaseD
still dedups but traversing the bloated slice was wasted time.
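
Append-time dedup of this kind is typically a set alongside the slice. A minimal sketch, with `orphanList` as a hypothetical name rather than the helper added in this commit:

```go
package main

import "fmt"

// orphanList keeps insertion order while rejecting duplicates at append time.
type orphanList struct {
	seen  map[string]struct{}
	items []string
}

func newOrphanList() *orphanList {
	return &orphanList{seen: make(map[string]struct{})}
}

// add appends hash unless it is already present and reports whether it was added.
func (l *orphanList) add(hash string) bool {
	if _, dup := l.seen[hash]; dup {
		return false
	}
	l.seen[hash] = struct{}{}
	l.items = append(l.items, hash)
	return true
}

func main() {
	l := newOrphanList()
	for _, h := range []string{"parentX", "blocker", "parentX", "blocker", "parentY"} {
		l.add(h)
	}
	fmt.Println(len(l.items), l.items)
}
```
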
Step 1 only logged progress after each iterator batch and only when
the aggregate scanned count had moved by 10,000. On mainnet the
iterator delivers ~14k unmined txs in a small number of batches, and
the first encounter with a big conflicting parent can stall a single
tx for minutes while classifyChild populates the cache from external
storage — so the whole scan went silent for over an hour with no
output at all.

Add a time-based gate (30s) and an intra-batch trigger (every 500 txs
within a batch). maybeLogProgress fires as soon as either threshold is
crossed. The log line is unchanged; it is just emitted more reliably.

No behavior change to the classification logic itself.
The careful Case D classification is taking hours on mainnet — each
first-contact with a big conflicting parent runs hundreds of
sequential external-store Gets to populate the child-class cache, and
the main goroutine is blocked on futex for the duration. The most
recent run unmarked 383 orphans and cascaded 4, but tx 4557bdc6,
referenced by 5d12221c (the original offender), is still stuck,
meaning either the direct-children seeding or the cascade path has
a subtle bug we haven't tracked down.

Unmined txs are ephemeral: valid txs propagate back in minutes and
the next block sweeps them up anyway. A coarse pass that marks every
non-conflicting unmined child of a (Conflicting=true, UnminedSince>0)
parent as Conflicting=true does what BlockAssembler actually needs
(descendants of a conflicting ancestor must not be in the iterator)
without any classification. Valid children that happen to reference
a wrongly-flagged parent get pruned at delete_at_height and re-enter
via propagation.

Add RepairOptions.AggressiveCascade and a CLI flag
--aggressive-cascade. When set, step 1 collects candidates into
aggressiveCascadeChildren and writes them all Conflicting=true in one
SetConflicting batch before the Case C sweep. Step 4 is skipped
entirely. Case A and Case C detection run as normal — they're cheap
and strictly correct.

Also add a heartbeat ticker that prints the current phase every 15s
via atomic.Pointer, so a repair stuck in one deep Get still reports
liveness. Replaces the per-500-tx progress check that could go silent
for tens of minutes when a single tx stalled on external fetches.

Default behavior unchanged.
…ting-unmined

Pure rename commit, no logic change. Sets up the following commit which
replaces the classification machinery with a surgical purge of records
where Conflicting=true and UnminedSince>0.

- stores/utxo/repair_conflicts.go → purge_conflicting_unmined.go
- stores/utxo/tests/repair_conflicts_test.go → purge_conflicting_unmined_test.go
- cmd/repairconflicts/ → cmd/purgeconflictingunmined/
- RepairConflictingChains → PurgeConflictingUnmined
- RepairReport → PurgeReport
- RepairOptions → PurgeOptions
- RepairProgressFunc → PurgeProgressFunc
- cmd wrapper RepairConflicts → PurgeConflictingUnmined

The "repair-conflicts" CLI subcommand keeps its name here; a later commit
renames it to "purge-conflicting-unmined" along with the operator-facing
log strings.

errors.ErrRepairNeeded / NewRepairNeededError are intentionally retained
— the error semantic ("operator intervention required") is unchanged and
renaming would ripple through ~10 test files for no functional gain.
… purge

Replaces the Case A/C/D classification machinery with a single-pass delete of
every (Conflicting=true, UnminedSince>0) record. The unmined set is ephemeral
by design, so propagation and the next block are enough to restore any valid
tx the purge removes; there is no need to reverse-engineer correct state from
a graph whose writers never fully clean up after themselves.
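
The collect-then-delete shape can be sketched as follows. This is a simplified illustration: `record`, `purgeReport`, and the `del` callback are hypothetical stand-ins for the store types, and deletes are issued one at a time rather than batched.

```go
package main

import "fmt"

// record is a hypothetical row from the consistency scan.
type record struct {
	Hash         string
	Conflicting  bool
	UnminedSince uint32
}

type purgeReport struct {
	ConflictingUnminedPurged int
}

// purge collects every (Conflicting=true, UnminedSince>0) record in one pass
// and then deletes the collected hashes. With dryRun set, candidates are
// counted but del is never called.
func purge(scan []record, del func(hash string) error, dryRun bool) (purgeReport, error) {
	var candidates []string
	for _, r := range scan {
		if r.Conflicting && r.UnminedSince > 0 {
			candidates = append(candidates, r.Hash)
		}
	}
	var rep purgeReport
	for _, h := range candidates {
		if !dryRun {
			if err := del(h); err != nil {
				return rep, fmt.Errorf("delete %s: %w", h, err)
			}
		}
		rep.ConflictingUnminedPurged++
	}
	return rep, nil
}

func main() {
	scan := []record{
		{Hash: "conflictingUnmined", Conflicting: true, UnminedSince: 945000},
		{Hash: "conflictingMined", Conflicting: true, UnminedSince: 0},
		{Hash: "plainUnmined", Conflicting: false, UnminedSince: 945001},
	}
	rep, err := purge(scan, func(h string) error {
		fmt.Println("delete", h)
		return nil
	}, false)
	fmt.Println(rep.ConflictingUnminedPurged, err)
}
```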

stores/utxo/purge_conflicting_unmined.go
- Single scan over ScanInconsistentUnminedTxs combines step 0 (unmined_since
  fixup for mined txs still carrying the marker) and step 1 (collect
  conflicting-unmined hashes).
- Step 2 batches Delete(ctx, hash) over the collected set.
- PurgeReport fields: UnminedSinceFixed, ConflictingUnminedPurged.
- PurgeOptions fields: SkipUnminedSinceScan (AggressiveCascade removed, moot).
- Drops ~850 lines of classification helpers
  (classifyChild/classifyConflictingChildren/cascadeConflictingViaSpendingData
  and their caches).

stores/utxo/UnminedTxIterator.go + aerospike/consistency_scan.go
- InconsistentTxRecord gains a Conflicting bool so the single scan can seed
  both step 0 and step 1.
- Aerospike scan fetches the conflicting bin and extracts it in
  parseConsistencyRecord.

stores/utxo/sql/unmined_iterator.go
- ScanInconsistentUnminedTxs is no longer a no-op on SQL; it now iterates
  every record with unmined_since IS NOT NULL and returns hash, block_ids,
  unmined_since, conflicting. Required so SQLite-backed tests exercise the
  purge logic through the same code path.

services/blockassembly/BlockAssembler.go
- validateParentChain now skips parents that are not in the UTXO store
  instead of parking FSM in IDLE. This is the load-bearing change that makes
  the surgical purge viable: non-conflicting children whose parents get
  deleted remain harmlessly in the iterator and get mined or pruned.

cmd/purgeconflictingunmined/purge_conflicting_unmined.go +
cmd/teranodecli/teranodecli/cli.go
- Drop --aggressive-cascade flag and rewrite the report output to the two
  remaining counters.

The "repair-conflicts" CLI subcommand name is kept here; a later commit
renames it to "purge-conflicting-unmined" alongside the operator-facing log
strings.

Tests: the repair-era Case A/C/D tests are replaced with a single clean-state
smoke test in this commit; the full purge test suite lands in the next commit.
Replaces the smoke test from the previous commit with comprehensive coverage
of PurgeConflictingUnmined behavior against a real SQLite-backed utxo.Store:

- CleanState: empty store yields a zeroed report.
- DeletesConflictingUnmined: (Conflicting=true, UnminedSince>0) record is
  removed from the store.
- LeavesNonConflictingUnminedAlone: purging a parent does not touch its
  non-conflicting unmined child; BA's validateParentChain change handles the
  dangling ref.
- LeavesMinedTxAlone: records with Conflicting=true but UnminedSince=0 are
  protected by the UnminedSince>0 filter.
- DryRun: candidates are counted but not deleted.
- SkipUnminedSinceScan: step 0 is skipped while step 1 still runs.
- UnminedSinceFix: step 0 re-marks mined-on-best-chain records still carrying
  UnminedSince.
- Idempotent: second run finds nothing to delete.
- DeleteForwardsThroughStore: a store wrapper sees Delete called once per
  purged hash — this is the hook a live node's TxMetaCache uses to evict
  stale entries.

setupSQLiteFileStore now pre-creates the shared parent Tx as mined so
SetConflicting on child transactions can resolve the conflicting_children FK.
- Rename the CLI subcommand from repair-conflicts to purge-conflicting-unmined
  and update its help text to describe the new behavior (delete, not repair).
- Update all 11 operator-facing log/error strings across blockassembly,
  blockvalidation, subtreevalidation, propagation, and legacy/netsync so the
  suggested fix points at the new command.
- Update docs/references/settings/services/blockassembly_settings.md.
- Add docs/howto/recovery-from-idle.md with a short runbook entry describing
  the IDLE → purge → FSM RUNNING recovery flow and why non-conflicting
  children are intentionally left alone.
- Update the validateParentChain test comment; the test itself asserts on
  errors.ErrRepairNeeded (unchanged), so the test logic needs no change.

errors.ErrRepairNeeded and NewRepairNeededError keep their names — the error
semantic ("operator intervention required") is unchanged, only the fix
command name is renamed, and ripple-renaming would touch ~10 test files for
no functional gain.
@oskarszoon oskarszoon changed the title fix(blockassembly): repair conflicting tx chains + FSM IDLE enforcement fix(blockassembly): purge-conflicting-unmined + FSM IDLE enforcement Apr 19, 2026
The command is about to grow beyond deleting conflicting-unmined records —
it will also remove non-conflicting unmined transactions whose parents are
mined on an orphaned fork (surfaced on mainnet-eu-1 after the first run of
purge-conflicting-unmined unfroze BA but left stale orphan-mined parent
references tripping validateParentChain). Rename now to keep the next
commit's diff focused on the new logic.

- stores/utxo/purge_conflicting_unmined.go → cleanup_unmined.go
- stores/utxo/tests/purge_conflicting_unmined_test.go → cleanup_unmined_test.go
- cmd/purgeconflictingunmined/ → cmd/cleanupunmined/
- PurgeConflictingUnmined → CleanupUnmined
- PurgeReport → CleanupReport
- PurgeOptions → CleanupOptions
- PurgeProgressFunc → CleanupProgressFunc
- CLI subcommand: purge-conflicting-unmined → cleanup-unmined
- All 11 operator log/error strings updated.
- docs/howto/recovery-from-idle.md + blockassembly_settings.md updated.

No logic change in this commit.
The first mainnet-eu-1 run of the prior purge-conflicting-unmined command
unfroze Block Assembly long enough to reveal a second inconsistency class
the tool had not touched: non-conflicting unmined transactions whose parent
is mined on a block that is no longer on the best chain (an orphaned fork).
Example: parent 1aebda16... was mined in block id 945137 at height 945052
but that block is off the current best chain, so BA's validateParentChain
trips with "parent is on wrong chain" on every load of the unmined set.

Step 3 now iterates GetUnminedTxIterator (non-conflicting unmined) in
batches, BatchDecorate-fetches parent BlockIDs + UnminedSince + Conflicting,
and deletes children whose parent is:

  - Conflicting=true (the child is dangling)
  - UnminedSince=0 with empty BlockIDs (inconsistency)
  - UnminedSince=0 with BlockIDs all off the best chain (orphan-mined)

Missing parents are tolerated (BA's validateParentChain skips those since
the purge rewrite). Parents with UnminedSince>0 remain visible to BA's
iterator and are therefore valid — no child delete.
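
The per-parent decision in step 3 can be sketched as a single predicate. `parentMeta` and the best-chain set are hypothetical simplifications of what BatchDecorate returns; a nil parent models a missing one.

```go
package main

import "fmt"

// parentMeta is a hypothetical view of the fields fetched for each parent.
type parentMeta struct {
	Conflicting  bool
	UnminedSince uint32
	BlockIDs     []uint32
}

// shouldDeleteChild reports whether a non-conflicting unmined child must be
// deleted because of the state of one of its parents.
func shouldDeleteChild(parent *parentMeta, bestChain map[uint32]bool) bool {
	if parent == nil {
		return false // missing parent: tolerated by validateParentChain
	}
	if parent.Conflicting {
		return true // child is dangling
	}
	if parent.UnminedSince > 0 {
		return false // parent is itself a valid unmined tx
	}
	if len(parent.BlockIDs) == 0 {
		return true // neither mined nor unmined: inconsistent
	}
	for _, id := range parent.BlockIDs {
		if bestChain[id] {
			return false // mined on the best chain
		}
	}
	return true // mined only on orphaned forks
}

func main() {
	best := map[uint32]bool{945200: true}
	orphanMined := &parentMeta{BlockIDs: []uint32{945137}}
	onMain := &parentMeta{BlockIDs: []uint32{945200}}
	fmt.Println(shouldDeleteChild(orphanMined, best), shouldDeleteChild(onMain, best))
}
```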

CleanupReport:
- Renames ConflictingUnminedPurged → ConflictingUnminedDeleted.
- Adds OrphanParentUnminedDeleted counter.

Tests (3 new, alongside updated helpers):
- DeletesOrphanParentChildren: parent mined off-best-chain, non-conflicting
  unmined child is deleted.
- LeavesMainChainParentChildrenAlone: parent on best chain, child untouched.
- OrphanParentDryRun: dry run counts but does not delete.

Existing tests updated: newQuerier() now publishes the shared parent Tx's
block id as on the best chain so step 3 does not accidentally flag
newTestTx-derived children in scenarios that are not exercising orphan-mined
behavior.

Runbook (docs/howto/recovery-from-idle.md) updated to describe step 3 and
to note that unmined subtree blobs are left to the pruner/TTL (content-
addressed, unique by hash, stale blob costs only disk).
Clarify in the IDLE-recovery runbook that cleanup-unmined's step 3 deletions
are safe even if a peer has the deleted tx in a blessed subtree. If a later
block arrives referencing such a subtree, block validation does not hard-
fail: SubtreeValidation.processMissingTransactions refetches the tx bytes
from the peer and reconstructs the UTXO metadata as part of normal
validation. BatchDecorate TX_NOT_FOUND is treated as a miss counter, not a
fatal error.

Cite the relevant source locations so future readers can verify the safety
argument without having to re-trace the path.
… anchor

Mainnet-eu-1 run found 0 orphan-parent children even though Block Assembly
then tripped validateParentChain with "parent is on wrong chain (blocks:
[945137])". Root cause: cleanup classified against the blockchain service's
best chain (via GetBestBlockHeader), but BA uses its own persisted
CurrentBlock from the blockchain DB state table on startup. After a reorg
the two views diverge; cleanup saw 945137 as on main while BA did not.

Adapter's GetBestBlockHeaderInfo now calls
blockassembly.Client.GetBlockAssemblyState and uses CurrentHash /
CurrentHeight as the walk anchor for GetBlockHeaderIDs. BA is now a
required dependency of cleanup; its gRPC stays reachable in IDLE (only
write entry points are gated by frozenForRepair). If BA is unreachable
or returns an invalid hash, we fail loudly rather than silently cleaning
against a drifted chain view.
@sonarqubecloud


Quality Gate failed

Failed conditions
61.5% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

@oskarszoon

Contributor

Even after all the proper cleanup, we still had incorrect conflicting marks in subtrees from already-received blocks. Because of the previous bugs, there isn't an easy recovery path for Teranodes in this state, and resetting/reseeding is preferred.

@oskarszoon oskarszoon closed this Apr 20, 2026