fix(blockassembly): purge-conflicting-unmined + FSM IDLE enforcement #704
icellan wants to merge 52 commits into
…nvalid chains
Update test for in-memory sort path to use no-input transactions so validateParentChain passes trivially, since the function now hard-fails instead of silently filtering transactions with unknown parents.
…rentChain
- stores/utxo/tests/repair_conflicts_test.go: 5 tests exercising RepairConflictingChains with a file-based SQLite store (WAL mode required to avoid SetConflicting deadlock in the SQL store). Covers clean state, Case A detection+fix, cascade to children, dry run, and step-0 UnminedSince no-op for SQL.
- services/blockassembly/validate_parent_chain_test.go: 3 tests for validateParentChain using a real sqlitememory UTXO store and blockchain.Mock for FSM event assertions. Covers hard-fail+FSMEventIDLE on unmined parent not in list, clean mined-parent success, and success when unmined parent precedes child in the processing list.
…ion, propagation, legacy
🤖 Claude Code Review
Status: Complete
Summary
This PR implements a surgical cleanup mechanism for inconsistent unmined transaction state that causes BlockAssembler startup failures. The approach is sound: instead of attempting complex state reconstruction, it deletes ephemeral unmined records that can be re-propagated, while preserving mined data. The implementation is well-tested, thoroughly documented, and integrates cleanly with the existing FSM IDLE infrastructure.
Findings
No critical issues found. The implementation demonstrates good engineering practices:
Strengths
Documentation Accuracy Verified
Architecture Notes
Review Complete
No inline comments required. All previously reported issues have been resolved.
Benchmark Comparison Report (all results in sec/op)
Threshold: >10% with p < 0.05. Generated: 2026-04-19 15:23 UTC.
…rror log before FSM IDLE transition
…sed metric; add nil-guards to subtreeHandler
validateParentChain no longer filters; it hard-fails with FSM IDLE. The old setting and its Prometheus counter were left behind as dead code. subtreeHandler.go lacked nil-guards on blockchainClient and FSM state that other services already had, risking a panic in edge cases.
f6d9449 to 2fbd137
…add 3 Case C tests
…x reviewer issues
… repair can run
validateParentChain errors were propagating as fatal, killing all services including blockchain gRPC and making repair-conflicts unreachable. Now the error is caught in Start(), FSM stays IDLE, and the node stays up. Also switch idleAndError from SendFSMEvent(STOP) to Idle(), which handles already-IDLE state gracefully instead of logging a spurious error.
…rogress
- Replace brittle string matching with typed ErrRepairNeeded error (ERR_REPAIR_NEEDED=102 in proto) for validateParentChain → Start() flow
- Fix "teranodecli" → "teranode-cli" typo across all services
- Add progress logging to RepairConflictingChains so it's not silent for minutes during large UTXO store repairs
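A minimal sketch of the typed-error flow this commit describes, assuming a plain sentinel error. `ErrRepairNeeded` here is a local stand-in (the real one lives in the project's errors package), and `validateParentChain` and `start` are simplified hypothetical signatures, not the actual Teranode functions.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrRepairNeeded stands in for the typed error named in the commit; the
// real definition lives in the project's errors package.
var ErrRepairNeeded = errors.New("repair needed")

// validateParentChain is a hypothetical stand-in that fails with a wrapped
// ErrRepairNeeded when the unmined set is inconsistent.
func validateParentChain(consistent bool) error {
	if consistent {
		return nil
	}
	return fmt.Errorf("unmined parent missing from processing list: %w", ErrRepairNeeded)
}

// start returns needsRepair=true with a nil error for repair-needed
// failures so the caller stays up; any other error is returned unchanged.
func start(consistent bool) (needsRepair bool, err error) {
	verr := validateParentChain(consistent)
	switch {
	case verr == nil:
		return false, nil
	case errors.Is(verr, ErrRepairNeeded):
		return true, nil
	default:
		return false, verr
	}
}

func main() {
	needsRepair, err := start(false)
	fmt.Println("needs repair:", needsRepair, "fatal error:", err)
}
```

Matching with `errors.Is` keeps working when the error is wrapped with extra context, which string matching on the message does not.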
… repair progress
Idle() sent gRPC but never updated the local fmsState cache, leaving it stale at RUNNING. GetMiningCandidate then passed the FSM guard and returned empty block templates while the node needed repair. Also switch repair progress from batch counts to record counts via TotalScanned() for meaningful output on large UTXO stores.
Blocking STOP from CATCHINGBLOCKS traps the node in a crash loop when a data-integrity check fails during catchup: BlockAssembler's validateParentChain calls Idle() to move to IDLE for repair, the FSM rejects the event, BA.Start returns StorageError, ServiceManager stops BA, the node exits, and on restart the FSM is still persisted as CATCHINGBLOCKS — so the same thing happens again. The operator has no window to run teranode-cli repair-conflicts. Adds CATCHINGBLOCKS to STOP's Src list and widens the SendFSMEvent guard to permit STOP alongside RUN. RUN is still the normal exit when catchup completes; STOP is a safety valve for repair. Updates the state-machine diagram and fsm_test. The guard's original intent — preventing accidental transitions that would abandon catchup to RUNNING/LEGACYSYNCING — is preserved: those events remain rejected from CATCHINGBLOCKS.
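The widened guard can be pictured as a small allow-list. This is only an illustrative sketch: the state and event names RUN, STOP, IDLE, RUNNING and CATCHINGBLOCKS come from the commit message, while `EventLegacySync`, the types, and `sendFSMEvent`'s signature are hypothetical and do not mirror the real FSM code.

```go
package main

import "fmt"

type State string
type Event string

const (
	StateIdle           State = "IDLE"
	StateRunning        State = "RUNNING"
	StateCatchingBlocks State = "CATCHINGBLOCKS"

	EventRun        Event = "RUN"
	EventStop       Event = "STOP"
	EventLegacySync Event = "LEGACYSYNC" // hypothetical name, only used to show a rejected event
)

// fromCatchingBlocks lists the only events admitted while catching up:
// RUN is the normal exit, STOP is the safety valve that parks the FSM in IDLE.
var fromCatchingBlocks = map[Event]State{
	EventRun:  StateRunning,
	EventStop: StateIdle,
}

// sendFSMEvent models just the CATCHINGBLOCKS guard; other source states
// are out of scope for this sketch and are rejected.
func sendFSMEvent(current State, ev Event) (State, error) {
	if current != StateCatchingBlocks {
		return current, fmt.Errorf("state %s not modelled in this sketch", current)
	}
	next, ok := fromCatchingBlocks[ev]
	if !ok {
		return current, fmt.Errorf("event %s rejected while in %s", ev, current)
	}
	return next, nil
}

func main() {
	for _, ev := range []Event{EventRun, EventStop, EventLegacySync} {
		next, err := sendFSMEvent(StateCatchingBlocks, ev)
		fmt.Println(ev, "->", next, err)
	}
}
```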
After unmarking an orphan-conflicting parent P in step 4, any grandparent that (a) is Conflicting=true, (b) either names P as the recorded spender per its own SpendingData or has no SpendingData for the relevant vout at all (common when the grandparent was already conflicting when P's Spend ran, so the write was skipped), and (c) has no *active* conflicting children (counting only entries still Conflicting=true) is itself an orphan. Enqueue grandparents into the step-4 worklist so chains of stacked orphan-conflicting ancestors are resolved in a single repair run. Also broadens step-1 detection: when the scanned child's parent has SpendingData==nil for the relevant vout but is flagged conflicting with no active conflicting children, treat the parent as a Case D candidate. Without this, the child's input walk would stop at the nil SpendingData and the parent would only be reachable via chase-up started from yet another candidate. Adds hasActiveConflictingChildren helper to consistently ignore stale back-references in ConflictingChildren left over from unmarking. Observed on mainnet-eu-1: tx 8dacf3...f464 got unmarked on the first repair pass, but its grandparent 217494d8...17ef stayed conflicting with conflictingCs=[8dacf3] (stale) and spentUtxos=0 (no SpendingData ever recorded), so validateParentChain kept tripping on the next restart.
Mainnet-eu-1 run missed a parent (4557bdc6) that aerospikereader shows
matches Case D criteria exactly (Conflicting=true, SD for the relevant
vout is nil, ConflictingChildren=[5d12221c], and 5d12221c is currently
non-conflicting). Step 1 report said only 2 Case D candidates. No
report.Errors were raised for that parent, so the path that skipped it
is unclear.
Adds targeted logProgress lines inside step 1 that fire only when:
- s.Get(parent) returned an error or nil,
- vout is out of range of parent's SpendingDatas,
- parent.Conflicting is true (rare — bounded noise),
- hasActiveConflictingChildren bails due to a child Get error or a
still-conflicting child (with the child hash).
Next repair run will either show 4557bdc6 being added to the orphan
list (implying the real miss is elsewhere), or reveal the exact reason
detection skipped it. No behavior change; only logging.
… dedup
A dirty UTXO store is never acceptable in Teranode, so any DB read or write error during repair must halt the run rather than be swallowed and reported as "non-fatal". Every `report.Errors = append(...); continue` path in RepairConflictingChains now returns the error and aborts. The only errors treated as benign are TX_NOT_FOUND responses for external references that were never stored (parent's grandparent, pruned ancestors), which are a legitimate outcome rather than a failure. The RepairReport.Errors slice is removed.
Other correctness / safety changes in the same pass:
- Case C dedup: the previous key (pair.loser) silently dropped distinct winners that happened to share a loser, leaving their Conflicting=true flag set. Dedup is now by winner: each distinct real-winner has ProcessConflicting called exactly once, using a fresh dedup map per call so dry-run no longer mutates shared state.
- cascadeConflictingViaSpendingData: add cascadeMaxVisited cap so a corrupted or pathological SpendingData graph cannot grow the visited set without bound. SetConflicting is now issued once per frontier level instead of once per child, cutting N round trips to one per level.
- hasActiveConflictingChildren returns (bool, error) and is invoked with a logReason callback so step-1 diagnostics can record why a parent wasn't enqueued as a Case D candidate.
- Out-of-range vout is now a ProcessingError (genuine store corruption) rather than a silent skip.
- progressFn is a single parameter, not variadic; the "optional slice that only reads [0]" shape was a footgun.
- Step log tags are consistent at /4 throughout.
CLI: drop the "non-fatal errors" section, nothing to iterate anymore; the single abort error is surfaced via the wrapper ProcessingError.
Tests updated for the new signature and removed field; 13 repair tests still pass and cover Case A, Case C, Case D with dry-run, cascades, chained orphans and legit-conflict safety checks.
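The Case C dedup change can be illustrated with a toy example. The `pair` type and `winnersToProcess` helper below are hypothetical names chosen for the sketch; only the idea (key the dedup on the winner, with a fresh map per call) is taken from the commit message.

```go
package main

import "fmt"

// pair is a hypothetical (winner, loser) record produced by the Case C scan.
type pair struct {
	winner string
	loser  string
}

// winnersToProcess returns each distinct winner once, in first-seen order.
// The dedup map is created per call, so a dry run cannot leak state into
// a later real run.
func winnersToProcess(pairs []pair) []string {
	seen := make(map[string]struct{}, len(pairs))
	var out []string
	for _, p := range pairs {
		if _, dup := seen[p.winner]; dup {
			continue
		}
		seen[p.winner] = struct{}{}
		out = append(out, p.winner)
	}
	return out
}

func main() {
	pairs := []pair{
		{winner: "w1", loser: "shared"},
		{winner: "w2", loser: "shared"}, // keying on the loser would have dropped w2
		{winner: "w1", loser: "other"},
	}
	fmt.Println(winnersToProcess(pairs))
}
```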
…very on FSM leave IDLE
When loadUnminedTransactions returns ErrRepairNeeded the assembler used to return nil from Start() but skipped subtreeProcessor.Start and startChannelListeners entirely, leaving the gRPC server accepting calls against a half-initialised assembler. The FSM IDLE guards in upstream services are best-effort and miss some paths, so a call could still reach AddTx / GetMiningCandidate / SubmitMiningSolution and either hang on an unreferenced channel or touch an uninitialised processor.
A new atomic frozenForRepair flag is set in that path and exposed via FrozenForRepair(). The gRPC methods most likely to be invoked (AddTx, AddTxBatch, AddTxBatchColumnar, RemoveTx, GetMiningCandidate, SubmitMiningSolution, GetCandidateBlock, ResetBlockAssembly) now call ba.assertNotFrozenForRepair() up front and return ErrRepairNeeded, as defence-in-depth alongside the FSM IDLE checks in upstream services.
Recovery is live rather than requiring a restart, to match the pause/resume semantics of blockvalidation and subtreevalidation:
- Start() spawns watchForRepairCompletion as a wg-tracked goroutine.
- The watcher blocks on WaitUntilFSMTransitionFromIdleState and retries loadUnminedTransactions once the operator moves the FSM out of IDLE (after running teranode-cli repair-conflicts).
- On success it runs startAfterLoadUnmined (subtreeProcessor.Start, startChannelListeners, height metric) and clears the frozen flag, so gated gRPC methods start accepting traffic without a node restart.
- If loadUnminedTransactions returns ErrRepairNeeded again, idleAndError has already put the FSM back to IDLE; the watcher loops and waits for the next transition.
- Any other error stops the watcher and keeps the assembler frozen; non-repair failures are outside this recovery path's remit.
- Cleanly exits on context.Canceled during shutdown.
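As a rough sketch of the gate described above, assuming an `atomic.Bool` flag: the names `frozenForRepair`, `assertNotFrozenForRepair`, `AddTx` and `ErrRepairNeeded` come from the commit message, but the struct, signatures and error value below are simplified stand-ins rather than the real service code.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// ErrRepairNeeded is a local stand-in for the project's typed error.
var ErrRepairNeeded = errors.New("repair needed")

// BlockAssembly is a reduced stand-in holding only the frozen flag.
type BlockAssembly struct {
	frozenForRepair atomic.Bool
}

// assertNotFrozenForRepair is called first by every gated entry point.
func (ba *BlockAssembly) assertNotFrozenForRepair() error {
	if ba.frozenForRepair.Load() {
		return ErrRepairNeeded
	}
	return nil
}

// AddTx shows the gate on one entry point; the real work is omitted.
func (ba *BlockAssembly) AddTx(txID string) error {
	if err := ba.assertNotFrozenForRepair(); err != nil {
		return err
	}
	return nil
}

func main() {
	ba := &BlockAssembly{}
	ba.frozenForRepair.Store(true) // set when loading the unmined set needs repair
	fmt.Println("while frozen:", ba.AddTx("tx1"))

	ba.frozenForRepair.Store(false) // cleared by the watcher after a successful reload
	fmt.Println("after recovery:", ba.AddTx("tx1"))
}
```

Because the flag is a single atomic, the watcher can clear it from its own goroutine without taking a lock shared with the request handlers.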
Unrelated cleanup in the same area: subtreeprocessor reset's clear-processed-at errgroup now has SetLimit(16) so a reset spanning hundreds of moveBack blocks doesn't launch hundreds of concurrent SetBlockProcessedAt writes against the blockchain store.
…routine
Every handler that checked for FSM IDLE used to log-and-fall-through on a check failure, spawn a fresh resume goroutine per invocation, and (in blockvalidation's case) wait on context.Background() so the goroutine couldn't exit on service shutdown. Under load that's hundreds of routines racing to ResumeAll, and a transient FSM-check error silently bypasses the guard entirely.
Changes applied to blockvalidation/Server.go, subtreevalidation/subtreeHandler.go, subtreevalidation/txmetaHandler.go, legacy/netsync/handle_block.go and propagation/Server.go:
- FSM-check errors now return an error rather than logging and continuing. Fail closed: if we can't confirm the FSM is not IDLE, don't admit the block / subtree / tx while the node may be in repair.
- A new idleConsumerPaused atomic.Bool (on blockvalidation.Server and subtreevalidation.Server) guards the pause/resume transition. Only the first IDLE-observed call PauseAll's the consumers and spawns a single watcher; concurrent handler invocations short-circuit via CompareAndSwap. The watcher defers Store(false) on completion so the next IDLE episode re-arms cleanly.
- blockvalidation's resume goroutine now uses the service context plumbed through from consumerMessageHandler instead of context.Background(), so it exits on shutdown.
- blockHandler returns ErrServiceError when IDLE instead of nil so the Kafka offset is not advanced and the in-hand message is retried after the FSM leaves IDLE, matching what the log claims.
- txmetaHandler operator hint updated to reference the repair CLI, matching the other handlers.
pruner's triggerInitialPruning hash-lookup comment rewritten to be accurate about reorg semantics: GetBlockHeadersByHeight returns the current main-chain hash at the persisted height, which may differ from the hash that was actually persisted on an older fork; pruning is by height so this only affects the log line, not the work performed.
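The single-watcher pattern can be sketched roughly as follows. `idleConsumerPaused` is the name from the commit message; the `Server` struct, the `pauses` counter, `onIdleObserved`, and the `leftIdle` channel (standing in for the FSM wait) are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Server is a reduced stand-in for a service with pausable consumers.
type Server struct {
	idleConsumerPaused atomic.Bool
	pauses             atomic.Int32   // how many times the pause transition actually ran
	watchers           sync.WaitGroup // tracks the single resume watcher
}

// onIdleObserved is what each handler would call when it sees FSM IDLE.
// Only the caller that wins the CompareAndSwap pauses consumers and spawns
// the watcher; everyone else short-circuits. leftIdle is a hypothetical
// channel that is closed when the FSM leaves IDLE.
func (s *Server) onIdleObserved(leftIdle <-chan struct{}) bool {
	if !s.idleConsumerPaused.CompareAndSwap(false, true) {
		return false
	}
	s.pauses.Add(1) // PauseAll would go here
	s.watchers.Add(1)
	go func() {
		defer s.watchers.Done()
		defer s.idleConsumerPaused.Store(false) // re-arm for the next IDLE episode
		<-leftIdle
		// ResumeAll would go here
	}()
	return true
}

func main() {
	s := &Server{}
	leftIdle := make(chan struct{})

	var handlers sync.WaitGroup
	for i := 0; i < 100; i++ {
		handlers.Add(1)
		go func() {
			defer handlers.Done()
			s.onIdleObserved(leftIdle)
		}()
	}
	handlers.Wait()

	close(leftIdle)
	s.watchers.Wait()
	fmt.Println("pause transitions:", s.pauses.Load())
}
```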
…se D
Mainnet-eu-1 run surfaced a mutual-blocker pathology where a parent's
ConflictingChildren list names a child that is itself orphan-conflicting
(Conflicting=true with no credible reason — grandparent SpendingData
does not show a legit loss for the child either). The old
hasActiveConflictingChildren check saw Conflicting=true on the child
and classified the parent as having active conflicts, so the parent
was never added as a Case D candidate. The orphan child is invisible
to the unmined iterator (conflicting filter), so it is never reached
either. The pair stays stuck forever across repair runs.
Replace hasActiveConflictingChildren with classifyConflictingChildren,
which recursively checks each still-conflicting entry in the list:
- stale back-reference (child.Conflicting=false now) → ignore
- child.Conflicting=true AND some grandparent.SpendingData names a
different spender for one of child's inputs → legit loser, parent
is not a Case D candidate
- child.Conflicting=true AND every reachable grandparent either
names the child itself or has nil SpendingData → orphan, return
alongside bool hasLegit=false
Step 1 and step-4 chase-up enqueue any orphan children they find so
they are unmarked in the same pass as the parent they were blocking.
Legit-loser detection is unchanged.
Adds TestRepairConflictingChains_CaseD_OrphanBlocksParentDetection
covering the exact shape from mainnet: parentX (orphan) with
conflictingCs=[blocker] where blocker is itself orphan, plus a
non-conflicting goodChild spending a different output. One repair
pass unmarks both and leaves goodChild untouched.
SetConflicting(false) clears the child's Conflicting flag but leaves the back-reference in every parent's ConflictingChildren list — the SQL updateParentConflictingChildren helper only ever INSERTs. If such a stale sibling is also on the best chain, the Case C scan would enqueue it as the "real winner", ProcessConflicting would reject it with "tx is not conflicting", and — now that DB errors are fatal — the whole repair would abort before Case D even starts. Filter stale entries by checking sibling.Conflicting=true alongside the best-chain check. A BlockIDs-only Get is not enough. Observed on mainnet-eu-1: step 1 correctly identified 2716 Case D orphans (after the previous orphan-blocker fix lit up detection) but step 2 aborted on tx 1e541f1… which is on the best chain but had been unmarked in an earlier repair run. Adds TestRepairConflictingChains_CaseC_StaleSiblingSkipped.
… empty A legit-losing parent whose outputs were never recorded as spent (spentUtxos=0, all SpendingDatas nil) is a real shape on mainnet: the parent was already Conflicting=true at the time its children ran Spend, and some code paths skip the SpendingData write for conflicting parents. cascadeConflictingViaSpendingData(parent) then walks an empty SD list and marks zero descendants — the real non-conflicting children (whose inputs do name the parent) stay visible to the unmined iterator and validateParentChain keeps tripping across restarts. Step 1 now records the direct children that spend each orphan-candidate parent in caseDDirectChildren. When step 4 classifies a parent as a legit loser, the cascade seeds from those tracked children in addition to whatever parent.SD turns up. The children's own SpendingDatas are properly populated, so the subsequent walk propagates correctly. Observed on mainnet-eu-1: parent 4557bdc6 is a legit loser of rootTx grandparent (rootTx.SD[0] names a10bd058, not 4557bdc6), repair correctly identified the legit-conflict path and called the cascade, but 4557bdc6.SD is all nil so child 5d12221c — which spends 4557bdc6[1] — never got its Conflicting=true mark. Adds TestRepairConflictingChains_CaseD_LegitCascadeWithNilParentSD covering the exact shape (parentLoser with nil SD, childOfLoser whose input still names the parent).
…tep 1
Mainnet run showed step 1 stalling for hours. One parent accumulated
341 entries in its ConflictingChildren list, and ~14k non-conflicting
unmined txs had inputs pointing at it. Without caching each visit
refetched the parent (external tx = file store hit) and re-ran
classifyConflictingChildren over all 341 entries, each of which does a
child Get + a Get per input → grandparent SD check. Tens of millions
of Gets, many external.
Step 1 does no writes, so the parent's metadata and the classification
result are stable for the duration of the scan. Add two scoped caches:
- parentMetaCache (hash → *meta.Data, plus parentMetaNotFound for
negative caching) behind fetchParent, so each distinct parent is
fetched once regardless of how many children reference it.
- classificationCache (parent hash → {orphans, hasLegit}) behind
classifyCached, reusing the expensive recursion over the
ConflictingChildren list across every child visit of the same
parent.
Also drops the per-input debug log lines that flooded output when a
parent's ConflictingChildren was large (hundreds of hashes per line,
>100KB per visit). logProgress callback is still threaded into
classifyConflictingChildren for diagnostics from within the helper.
Both caches are local to step 1 — step 4 writes invalidate them, so
step 4 continues to call classifyConflictingChildren directly without
the cache.
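A minimal sketch of the fetch-once idea behind parentMetaCache, including negative caching for parents that are not in the store. The `Meta` type, `parentCache` struct and `fetch` callback are hypothetical simplifications; the real code caches `*meta.Data` and is scoped to step 1.

```go
package main

import "fmt"

// Meta is a reduced stand-in for the parent's metadata record.
type Meta struct {
	Conflicting bool
}

// parentCache memoizes lookups for the duration of a read-only scan.
// Misses are remembered too, so a missing parent is also fetched only once.
type parentCache struct {
	found    map[string]*Meta
	notFound map[string]struct{}
	fetch    func(hash string) (*Meta, bool) // underlying store lookup
	fetches  int                             // number of real store hits
}

func newParentCache(fetch func(string) (*Meta, bool)) *parentCache {
	return &parentCache{
		found:    make(map[string]*Meta),
		notFound: make(map[string]struct{}),
		fetch:    fetch,
	}
}

// get returns the cached result when present and falls back to the store
// exactly once per distinct hash.
func (c *parentCache) get(hash string) (*Meta, bool) {
	if m, ok := c.found[hash]; ok {
		return m, true
	}
	if _, ok := c.notFound[hash]; ok {
		return nil, false
	}
	c.fetches++
	m, ok := c.fetch(hash)
	if ok {
		c.found[hash] = m
	} else {
		c.notFound[hash] = struct{}{}
	}
	return m, ok
}

func main() {
	store := map[string]*Meta{"parent": {Conflicting: true}}
	cache := newParentCache(func(h string) (*Meta, bool) {
		m, ok := store[h]
		return m, ok
	})
	for i := 0; i < 1000; i++ {
		cache.get("parent")
		cache.get("missing")
	}
	fmt.Println("store fetches:", cache.fetches)
}
```

This is only safe because the scan performs no writes; once a later step starts mutating records, the cached entries can no longer be trusted.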
Step 0 is a full-store consistency scan (hundreds of millions of records on a production node) and is almost always a no-op once it has run cleanly once. Iterating on Case A / C / D fixes currently pays that cost on every run, which has turned into hours per attempt. Add RepairOptions.SkipUnminedSinceScan and a --skip-unmined-since-scan CLI flag on teranode-cli repair-conflicts. When set, step 0 is announced as skipped and the run jumps straight to scanning unmined transactions. Best-chain header data (needed by Case A / Case C) is fetched unconditionally up front, outside the skip gate. Defaults unchanged — a fresh run still does the full scan.
Mainnet run stalled in step 4: each distinct parent triggered
classifyConflictingChildren which did a Get+Tx on every entry in the
parent's ConflictingChildren list, plus Gets for each grandparent for
every input of every child. Two "blocker" children were shared across
~2700 parents — each parent re-classified them from scratch — and the
blockers are external txs with 2001 utxos each, so every Get hit the
file store.
Split classifyConflictingChildren into:
- classifyChild (new): per-child classification returning
{exists, conflicting, legit}. Stable for the lifetime of a single
repair run so long as SetConflicting(h, false) is not later called
on the same h — and if it is, treating h as still-conflicting in
other parents' lists just produces an acceptable stale back-ref,
not an incorrect Case D decision.
- classifyConflictingChildren: now takes an optional childCache map
and memoizes per-child results across calls.
A single childClassCache is shared by step 1 (via classifyCached) and
step 4 (both the fresh-parent legit check and the chase-up grandparent
check). Each distinct child is fetched + grandparent-walked once per
repair run regardless of how many parents reference it.
Also add appendCaseDOrphan dedup at step-1 append time — the raw slice
grew to ~40k entries on mainnet while the unique orphan count was a
few hundred, because the same parents and blocker children are pushed
for every non-conflicting child that visits them. Step 4's seenCaseD
still dedups but traversing the bloated slice was wasted time.
Step 1 only logged progress after each iterator batch and only when the aggregate scanned count had moved by 10,000. On mainnet the iterator delivers ~14k unmined txs in a small number of batches, and the first encounter with a big conflicting parent can stall a single tx for minutes while classifyChild populates the cache from external storage — so the whole scan went silent for over an hour with no output at all. Add a time-based gate (30s) and an intra-batch trigger (every 500 txs within a batch). maybeLogProgress fires whichever way the threshold is crossed first. The log line is unchanged; just called more reliably. No behavior change to the classification logic itself.
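One way to picture the combined gate is a small struct that fires on whichever threshold is crossed first. The 30s and 500-tx values come from the commit message; the `progressGate` type, its `tick` method and the `countLogs` simulation helper are hypothetical and take the clock as a parameter so the behaviour can be checked without sleeping.

```go
package main

import (
	"fmt"
	"time"
)

// progressGate decides when a progress line should be emitted.
type progressGate struct {
	interval time.Duration // time-based threshold
	every    int           // count-based threshold
	lastLog  time.Time
	sinceLog int
}

// tick records one processed tx and reports whether to log now.
func (g *progressGate) tick(now time.Time) bool {
	g.sinceLog++
	if g.sinceLog >= g.every || now.Sub(g.lastLog) >= g.interval {
		g.lastLog = now
		g.sinceLog = 0
		return true
	}
	return false
}

// countLogs simulates a scan where each tx takes secondsPerTx seconds and
// returns how many progress lines the gate would have emitted.
func countLogs(txs, secondsPerTx int) int {
	start := time.Unix(0, 0)
	g := &progressGate{interval: 30 * time.Second, every: 500, lastLog: start}
	logs := 0
	for i := 1; i <= txs; i++ {
		now := start.Add(time.Duration(i*secondsPerTx) * time.Second)
		if g.tick(now) {
			logs++
		}
	}
	return logs
}

func main() {
	fmt.Println("fast scan, 1000 txs:", countLogs(1000, 0))
	fmt.Println("slow scan, 3 txs at 20s each:", countLogs(3, 20))
}
```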
The careful Case D classification is taking hours on mainnet: each first-contact with a big conflicting parent runs hundreds of sequential external-store Gets to populate the child-class cache, and the main goroutine is blocked on futex for the duration. The most recent run unmarked 383 orphans + cascaded 4 but still left tx 4557bdc6 pointed at by 5d12221c (the original offender) stuck, meaning either the direct-children seeding or the cascade path has a subtle bug we haven't tracked down.
Unmined txs are ephemeral: valid txs propagate back in minutes and the next block sweeps them up anyway. A coarse "mark every non-conflicting unmined child of a Conflicting=true+UnminedSince>0 parent Conflicting=true" pass does what BlockAssembler actually needs (descendants of a conflicting ancestor must not be in the iterator) without any classification. Valid children that happen to reference a wrongly-flagged parent get pruned at delete_at_height and re-enter via propagation.
Add RepairOptions.AggressiveCascade and a CLI flag --aggressive-cascade. When set, step 1 collects candidates into aggressiveCascadeChildren and writes them all Conflicting=true in one SetConflicting batch before the Case C sweep. Step 4 is skipped entirely. Case A and Case C detection run as normal; they're cheap and strictly correct.
Also add a heartbeat ticker that prints the current phase every 15s via atomic.Pointer, so a repair stuck in one deep Get still reports liveness. Replaces the per-500-tx progress check that could go silent for tens of minutes when a single tx stalled on external fetches.
Default behavior unchanged.
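The heartbeat idea can be sketched as a ticker goroutine reading a phase string through `atomic.Pointer`. The `heartbeat` type, its methods and the default "starting" phase are assumptions for illustration; the commit only states that a ticker prints the current phase every 15s (the demo below uses a much shorter interval).

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// heartbeat publishes the current phase so a separate goroutine can report
// liveness even while the main goroutine is blocked in a slow call.
type heartbeat struct {
	phase atomic.Pointer[string]
}

// set records the phase the worker is entering.
func (h *heartbeat) set(p string) { h.phase.Store(&p) }

// current returns the last published phase, or a default before any set.
func (h *heartbeat) current() string {
	if p := h.phase.Load(); p != nil {
		return *p
	}
	return "starting"
}

// run reports the current phase on every tick until ctx is cancelled.
func (h *heartbeat) run(ctx context.Context, every time.Duration, report func(string)) {
	t := time.NewTicker(every)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			report(h.current())
		}
	}
}

func main() {
	h := &heartbeat{}
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan struct{})
	go func() {
		defer close(done)
		h.run(ctx, 10*time.Millisecond, func(p string) {
			fmt.Println("still alive, phase:", p)
		})
	}()

	h.set("step 1: scanning unmined txs")
	time.Sleep(35 * time.Millisecond) // stands in for a long blocking Get
	cancel()
	<-done
}
```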
…ting-unmined
Pure rename commit, no logic change. Sets up the following commit which
replaces the classification machinery with a surgical purge of records
where Conflicting=true and UnminedSince>0.
- stores/utxo/repair_conflicts.go → purge_conflicting_unmined.go
- stores/utxo/tests/repair_conflicts_test.go → purge_conflicting_unmined_test.go
- cmd/repairconflicts/ → cmd/purgeconflictingunmined/
- RepairConflictingChains → PurgeConflictingUnmined
- RepairReport → PurgeReport
- RepairOptions → PurgeOptions
- RepairProgressFunc → PurgeProgressFunc
- cmd wrapper RepairConflicts → PurgeConflictingUnmined
The "repair-conflicts" CLI subcommand keeps its name here; a later commit
renames it to "purge-conflicting-unmined" along with the operator-facing
log strings.
errors.ErrRepairNeeded / NewRepairNeededError are intentionally retained
— the error semantic ("operator intervention required") is unchanged and
renaming would ripple through ~10 test files for no functional gain.
… purge
Replaces the Case A/C/D classification machinery with a single-pass delete of every (Conflicting=true, UnminedSince>0) record. The unmined set is ephemeral by design, so propagation and the next block are enough to restore any valid tx the purge removes; there is no need to reverse-engineer correct state from a graph whose writers never fully clean up after themselves.
stores/utxo/purge_conflicting_unmined.go
- Single scan over ScanInconsistentUnminedTxs combines step 0 (unmined_since fixup for mined txs still carrying the marker) and step 1 (collect conflicting-unmined hashes).
- Step 2 batches Delete(ctx, hash) over the collected set.
- PurgeReport fields: UnminedSinceFixed, ConflictingUnminedPurged.
- PurgeOptions fields: SkipUnminedSinceScan (AggressiveCascade removed, moot).
- Drops ~850 lines of classification helpers (classifyChild/classifyConflictingChildren/cascadeConflictingViaSpendingData and their caches).
stores/utxo/UnminedTxIterator.go + aerospike/consistency_scan.go
- InconsistentTxRecord gains a Conflicting bool so the single scan can seed both step 0 and step 1.
- Aerospike scan fetches the conflicting bin and extracts it in parseConsistencyRecord.
stores/utxo/sql/unmined_iterator.go
- ScanInconsistentUnminedTxs is no longer a no-op on SQL; it now iterates every record with unmined_since IS NOT NULL and returns hash, block_ids, unmined_since, conflicting. Required so SQLite-backed tests exercise the purge logic through the same code path.
services/blockassembly/BlockAssembler.go
- validateParentChain now skips parents that are not in the UTXO store instead of parking FSM in IDLE. This is the load-bearing change that makes the surgical purge viable: non-conflicting children whose parents get deleted remain harmlessly in the iterator and get mined or pruned.
cmd/purgeconflictingunmined/purge_conflicting_unmined.go + cmd/teranodecli/teranodecli/cli.go
- Drop --aggressive-cascade flag and rewrite the report output to the two remaining counters. The "repair-conflicts" CLI subcommand name is kept here; a later commit renames it to "purge-conflicting-unmined" alongside the operator-facing log strings.
Tests: the repair-era Case A/C/D tests are replaced with a single clean-state smoke test in this commit; the full purge test suite lands in the next commit.
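The collect-then-delete part of the purge (steps 1 and 2) reduces to a filter plus a delete loop, sketched below under simplifying assumptions. `InconsistentTxRecord`, `PurgeReport` and `ConflictingUnminedPurged` are names from the commit message, but the fields are trimmed, the `del` callback replaces the real store's Delete(ctx, hash), and the step-0 fixup is left out.

```go
package main

import "fmt"

// InconsistentTxRecord is a trimmed stand-in for the scan record.
type InconsistentTxRecord struct {
	Hash         string
	UnminedSince uint32
	Conflicting  bool
}

// PurgeReport carries only the counter this sketch needs.
type PurgeReport struct {
	ConflictingUnminedPurged int
}

// purgeConflictingUnmined collects every (Conflicting=true, UnminedSince>0)
// record and deletes it via del. In dry-run mode candidates are counted
// but del is never called. Any delete error aborts the run.
func purgeConflictingUnmined(records []InconsistentTxRecord, dryRun bool, del func(hash string) error) (PurgeReport, error) {
	var report PurgeReport

	// Step 1: collect candidates.
	var hashes []string
	for _, r := range records {
		if r.Conflicting && r.UnminedSince > 0 {
			hashes = append(hashes, r.Hash)
		}
	}

	// Step 2: delete the collected set.
	for _, h := range hashes {
		if !dryRun {
			if err := del(h); err != nil {
				return report, err
			}
		}
		report.ConflictingUnminedPurged++
	}
	return report, nil
}

func main() {
	store := map[string]bool{"a": true, "b": true, "c": true}
	records := []InconsistentTxRecord{
		{Hash: "a", UnminedSince: 10, Conflicting: true},  // purged
		{Hash: "b", UnminedSince: 0, Conflicting: true},   // mined: protected by the UnminedSince>0 filter
		{Hash: "c", UnminedSince: 12, Conflicting: false}, // non-conflicting unmined: left alone
	}
	report, err := purgeConflictingUnmined(records, false, func(h string) error {
		delete(store, h)
		return nil
	})
	fmt.Println(report.ConflictingUnminedPurged, len(store), err)
}
```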
Replaces the smoke test from the previous commit with comprehensive coverage of PurgeConflictingUnmined behavior against a real SQLite-backed utxo.Store:
- CleanState: empty store yields a zeroed report.
- DeletesConflictingUnmined: (Conflicting=true, UnminedSince>0) record is removed from the store.
- LeavesNonConflictingUnminedAlone: purging a parent does not touch its non-conflicting unmined child; BA's validateParentChain change handles the dangling ref.
- LeavesMinedTxAlone: records with Conflicting=true but UnminedSince=0 are protected by the UnminedSince>0 filter.
- DryRun: candidates are counted but not deleted.
- SkipUnminedSinceScan: step 0 is skipped while step 1 still runs.
- UnminedSinceFix: step 0 re-marks mined-on-best-chain records still carrying UnminedSince.
- Idempotent: second run finds nothing to delete.
- DeleteForwardsThroughStore: a store wrapper sees Delete called once per purged hash; this is the hook a live node's TxMetaCache uses to evict stale entries.
setupSQLiteFileStore now pre-creates the shared parent Tx as mined so SetConflicting on child transactions can resolve the conflicting_children FK.
- Rename the CLI subcommand from repair-conflicts to purge-conflicting-unmined
and update its help text to describe the new behavior (delete, not repair).
- Update all 11 operator-facing log/error strings across blockassembly,
blockvalidation, subtreevalidation, propagation, and legacy/netsync so the
suggested fix points at the new command.
- Update docs/references/settings/services/blockassembly_settings.md.
- Add docs/howto/recovery-from-idle.md with a short runbook entry describing
the IDLE → purge → FSM RUNNING recovery flow and why non-conflicting
children are intentionally left alone.
- Update the validateParentChain test comment; the test itself asserts on
errors.ErrRepairNeeded (unchanged) so no behavior change is needed.
errors.ErrRepairNeeded and NewRepairNeededError keep their names — the error
semantic ("operator intervention required") is unchanged, only the fix
command name is renamed, and ripple-renaming would touch ~10 test files for
no functional gain.
The command is about to grow beyond deleting conflicting-unmined records: it will also remove non-conflicting unmined transactions whose parents are mined on an orphaned fork (surfaced on mainnet-eu-1 after the first run of purge-conflicting-unmined unfroze BA but left stale orphan-mined parent references tripping validateParentChain). Rename now to keep the next commit's diff focused on the new logic.
- stores/utxo/purge_conflicting_unmined.go → cleanup_unmined.go
- stores/utxo/tests/purge_conflicting_unmined_test.go → cleanup_unmined_test.go
- cmd/purgeconflictingunmined/ → cmd/cleanupunmined/
- PurgeConflictingUnmined → CleanupUnmined
- PurgeReport → CleanupReport
- PurgeOptions → CleanupOptions
- PurgeProgressFunc → CleanupProgressFunc
- CLI subcommand: purge-conflicting-unmined → cleanup-unmined
- All 11 operator log/error strings updated.
- docs/howto/recovery-from-idle.md + blockassembly_settings.md updated.
No logic change in this commit.
The first mainnet-eu-1 run of the prior purge-conflicting-unmined command unfroze Block Assembly long enough to reveal a second inconsistency class the tool had not touched: non-conflicting unmined transactions whose parent is mined on a block that is no longer on the best chain (an orphaned fork). Example: parent 1aebda16... was mined in block id 945137 at height 945052 but that block is off the current best chain, so BA's validateParentChain trips with "parent is on wrong chain" on every load of the unmined set.
Step 3 now iterates GetUnminedTxIterator (non-conflicting unmined) in batches, BatchDecorate-fetches parent BlockIDs + UnminedSince + Conflicting, and deletes children whose parent is:
- Conflicting=true (the child is dangling)
- UnminedSince=0 with empty BlockIDs (inconsistency)
- UnminedSince=0 with BlockIDs all off the best chain (orphan-mined)
Missing parents are tolerated (BA's validateParentChain skips those since the purge rewrite). Parents with UnminedSince>0 remain visible to BA's iterator and are therefore valid, so no child delete.
CleanupReport:
- Renames ConflictingUnminedPurged → ConflictingUnminedDeleted.
- Adds OrphanParentUnminedDeleted counter.
Tests (3 new, alongside updated helpers):
- DeletesOrphanParentChildren: parent mined off-best-chain, non-conflicting unmined child is deleted.
- LeavesMainChainParentChildrenAlone: parent on best chain, child untouched.
- OrphanParentDryRun: dry run counts but does not delete.
Existing tests updated: newQuerier() now publishes the shared parent Tx's block id as on the best chain so step 3 does not accidentally flag newTestTx-derived children in scenarios that are not exercising orphan-mined behavior.
Runbook (docs/howto/recovery-from-idle.md) updated to describe step 3 and to note that unmined subtree blobs are left to the pruner/TTL (content-addressed, unique by hash, stale blob costs only disk).
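The step 3 decision for a single parent can be written as a small predicate. This is a sketch of the rules as listed in the commit message, evaluated in that order; `ParentMeta`, `shouldDeleteChild` and the best-chain lookup map are hypothetical shapes, not the real store types.

```go
package main

import "fmt"

// ParentMeta is a reduced stand-in for the fields fetched per parent.
type ParentMeta struct {
	BlockIDs     []uint32
	UnminedSince uint32
	Conflicting  bool
}

// shouldDeleteChild reports whether a non-conflicting unmined child must be
// deleted because of the state of one of its parents. A nil parent means
// the parent is not in the store, which is tolerated.
func shouldDeleteChild(parent *ParentMeta, onBestChain map[uint32]bool) bool {
	if parent == nil {
		return false // missing parent: tolerated
	}
	if parent.Conflicting {
		return true // child is dangling
	}
	if parent.UnminedSince > 0 {
		return false // parent is itself unmined and visible to the iterator
	}
	for _, id := range parent.BlockIDs {
		if onBestChain[id] {
			return false // mined on the best chain
		}
	}
	// UnminedSince=0 with no BlockIDs, or with every block off the best chain.
	return true
}

func main() {
	best := map[uint32]bool{945200: true} // illustrative best-chain block id
	orphanMined := &ParentMeta{BlockIDs: []uint32{945137}}
	fmt.Println("orphan-mined parent, delete child:", shouldDeleteChild(orphanMined, best))
}
```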
Clarify in the IDLE-recovery runbook that cleanup-unmined's step 3 deletions are safe even if a peer has the deleted tx in a blessed subtree. If a later block arrives referencing such a subtree, block validation does not hard-fail: SubtreeValidation.processMissingTransactions refetches the tx bytes from the peer and reconstructs the UTXO metadata as part of normal validation. BatchDecorate TX_NOT_FOUND is treated as a miss counter, not a fatal error. Cite the relevant source locations so future readers can verify the safety argument without having to re-trace the path.
… anchor
Mainnet-eu-1 run found 0 orphan-parent children even though Block Assembly then tripped validateParentChain with "parent is on wrong chain (blocks: [945137])". Root cause: cleanup classified against the blockchain service's best chain (via GetBestBlockHeader), but BA uses its own persisted CurrentBlock from the blockchain DB state table on startup. After a reorg the two views diverge; cleanup saw 945137 as on main while BA did not.
Adapter's GetBestBlockHeaderInfo now calls blockassembly.Client.GetBlockAssemblyState and uses CurrentHash / CurrentHeight as the walk anchor for GetBlockHeaderIDs. BA is now a required dependency of cleanup; its gRPC stays reachable in IDLE (only write entry points are gated by frozenForRepair). If BA is unreachable or returns an invalid hash we fail loudly rather than silently cleaning against a drifted chain view.
Even after all the proper cleanup, we still had incorrect conflicting marks in subtrees from already received blocks. Because of the previous bugs, there isn't an easy recovery path for Teranodes in this state, and resetting/reseeding is preferred.


Summary
Fixes BlockAssembler startup failures caused by unmined transactions in a locally-inconsistent state (`Conflicting=true` + `UnminedSince>0` records, plus non-conflicting children referencing them). The iterator filters `Conflicting=true`, so the parent is absent from the processing list and `validateParentChain` rejects the child, parking the FSM in IDLE.

The branch started out as a repair tool that tried to reconstruct intended state via classification (Case A / C / D). Every iteration on mainnet uncovered a new graph shape and either added hours of runtime or still left the offending tx stuck, because the writers (SetConflicting, reorg handlers) don't clean up after themselves (stale `conflictingCs`, nil `SpendingData`, etc).

Replaced with a surgical purge: the unmined set is ephemeral by design (propagation re-arrives valid txs; next block sweeps them up), so a single-pass delete of every `(Conflicting=true, UnminedSince>0)` record is all BA needs to start cleanly. `validateParentChain` was relaxed to tolerate missing parents: non-conflicting children whose parents just got purged are harmlessly skipped, and get mined or pruned on their own.

New: `teranode-cli purge-conflicting-unmined`

Online CLI, run while the node is up and the FSM has parked in IDLE:
- Step 0: `UnminedSince` scan. Reused from the repair era. `--skip-unmined-since-scan` skips it on re-runs once it has completed cleanly.
- Step 1: collect every `Conflicting=true + UnminedSince>0` record.
- Step 2: `Delete` over the collected set. Aerospike record goes; external `.tx` / `.outputs` blobs are cleaned by the existing pruner on `delete_at_height`.

`--dry-run` counts without writing.

Dropped ~850 lines of classification / cascade / chase-up / cache machinery.
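The collect-then-delete shape of the purge can be sketched against an in-memory map. This is only an illustration under stated assumptions: `Record`, `Report`, and `purgeConflictingUnmined` are hypothetical stand-ins for this example, not the types or signatures used in the PR, and the real store is Aerospike or SQL rather than a Go map.

```go
package main

import "fmt"

// Record is a hypothetical, minimal unmined-set entry for this sketch.
type Record struct {
	Conflicting  bool
	UnminedSince uint32 // 0 is treated here as "mined"
}

// Report is a hypothetical summary of one purge run.
type Report struct {
	Candidates int // records matching Conflicting=true and UnminedSince>0
	Deleted    int // records actually removed (always 0 on a dry run)
}

// purgeConflictingUnmined collects every conflicting unmined record in a
// single pass, then deletes the collected set unless dryRun is set.
func purgeConflictingUnmined(store map[string]Record, dryRun bool) Report {
	var victims []string
	for hash, rec := range store {
		if rec.Conflicting && rec.UnminedSince > 0 {
			victims = append(victims, hash)
		}
	}

	report := Report{Candidates: len(victims)}
	if dryRun {
		return report // count only, no writes
	}
	for _, hash := range victims {
		delete(store, hash)
		report.Deleted++
	}
	return report
}

func main() {
	store := map[string]Record{
		"conflicting-unmined": {Conflicting: true, UnminedSince: 10},
		"plain-unmined":       {UnminedSince: 10},
		"mined":               {},
	}
	fmt.Printf("dry run: %+v\n", purgeConflictingUnmined(store, true))
	fmt.Printf("purge:   %+v\n", purgeConflictingUnmined(store, false))
}
```

Because the predicate only looks at the record itself, a second run over the same store finds no candidates, which is the idempotence the test plan checks for.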
FSM IDLE enforcement (from earlier work on this branch)
`validateParentChain` sets the FSM to IDLE when it detects integrity problems. Four service hot paths early-return when FSM is IDLE so no new work reaches a half-initialised block assembler:
- `blockHandler` (Kafka consumer)
- `CheckSubtreeFromBlock`
- `processTransaction`
- `HandleBlockDirect`

The blockchain FSM now accepts `STOP` from `CATCHINGBLOCKS` so a repair-needed error detected mid-catchup can actually park the node in IDLE (previously the transition was rejected and the node crash-looped).

Block assembly freezes its gRPC entry points via an atomic `frozenForRepair` flag and spawns a watcher goroutine that retries `loadUnminedTransactions` the next time the FSM leaves IDLE, so after the purge completes the operator flips the FSM out of IDLE and BA resumes live without a node restart.

Operator flow
1. `validateParentChain` → trips on `parent is unmined but not in processing list` → FSM → IDLE. Log: `Run 'teranode-cli purge-conflicting-unmined' to fix.`
2. `teranode-cli purge-conflicting-unmined --skip-unmined-since-scan` (first run without the skip flag, subsequent iterations with).
3. `teranode-cli setfsmstate --fsmstate RUNNING`. BA's watcher retries, parent is gone, child is harmlessly null-skipped, BA unfreezes.

Test plan
Purge suite (`stores/utxo/tests/purge_conflicting_unmined_test.go`):
- `TestPurgeConflictingUnmined_CleanState`: empty store yields zeroed report
- `TestPurgeConflictingUnmined_DeletesConflictingUnmined`: `(Conflicting=true, UnminedSince>0)` record deleted
- `TestPurgeConflictingUnmined_LeavesNonConflictingUnminedAlone`: child with dangling parent ref untouched
- `TestPurgeConflictingUnmined_LeavesMinedTxAlone`: `UnminedSince=0` records protected
- `TestPurgeConflictingUnmined_DryRun`: candidates counted, no writes
- `TestPurgeConflictingUnmined_SkipUnminedSinceScan`: step 0 skipped, steps 1+2 still run
- `TestPurgeConflictingUnmined_UnminedSinceFix`: step 0 clears stray UnminedSince on mined-on-best-chain
- `TestPurgeConflictingUnmined_Idempotent`: second run finds nothing
- `TestPurgeConflictingUnmined_DeleteForwardsThroughStore`: store wrapper sees Delete per purged hash (TxMetaCache eviction hook)

Infrastructure:
- `TestValidateParentChain_*`: the parent-missing case is now tolerated and returns nil (post-purge expected shape); all other integrity checks still trip `idleAndError`
- `Test_NewFiniteStateMachine`: STOP from CATCHINGBLOCKS allowed

Commits in the purge pivot
- `b510ebc00`: rename files + exports (`repair_conflicts.go` → `purge_conflicting_unmined.go`, `RepairConflictingChains` → `PurgeConflictingUnmined`, etc.). Pure rename, no logic change.
- `eea0468b2`: replace Case A/C/D classification with surgical purge. Extends `InconsistentTxRecord` with `Conflicting` bool so one scan seeds both step 0 and step 1. SQL `ScanInconsistentUnminedTxs` now implemented (was a no-op). `validateParentChain` skips missing parents.
- `701287ebf`: 9-test purge suite.
- `3aebb81e6`: CLI subcommand rename, 11 operator-facing strings, settings doc, new `docs/howto/recovery-from-idle.md` runbook.

Earlier commits in the branch (repair era) are kept rather than rebased away; the full history is useful post-mortem on why the classification approach was abandoned.
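For readers reviewing `eea0468b2`, the relaxed parent check can be pictured roughly as below. This is a sketch under assumptions, not the function in the PR: `Tx`, `ParentState`, and this `validateParentChain` signature are hypothetical, and "in the processing list" is interpreted here as "appears earlier in the list than the child".

```go
package main

import "fmt"

// ParentState is a hypothetical classification of a parent record.
type ParentState int

const (
	Mined   ParentState = iota // parent is mined
	Unmined                    // parent exists but is not mined yet
)

// Tx is a hypothetical, minimal unmined transaction for this sketch.
type Tx struct {
	Hash    string
	Parents []string
}

// validateParentChain walks the processing list in order. A parent that is
// absent from the lookup is skipped (the post-purge shape); an unmined parent
// that has not appeared earlier in the list is still an integrity error.
func validateParentChain(list []Tx, parents map[string]ParentState) error {
	seen := make(map[string]bool, len(list))
	for _, tx := range list {
		for _, p := range tx.Parents {
			state, found := parents[p]
			if !found {
				continue // missing parent: tolerated
			}
			if state == Unmined && !seen[p] {
				return fmt.Errorf("tx %s: parent %s is unmined but not in processing list", tx.Hash, p)
			}
		}
		seen[tx.Hash] = true
	}
	return nil
}

func main() {
	// The child references a parent that was purged, so it is absent from the lookup.
	list := []Tx{{Hash: "child", Parents: []string{"purged-parent"}}}
	fmt.Println(validateParentChain(list, map[string]ParentState{}))
}
```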
Not addressed / out of scope
- `.tx` / `.outputs` blob cleanup: handled by existing pruner on `delete_at_height`.
- `ConflictingChildren` back-refs on purged parents: unread by any consumer after purge; no-op.
- `ErrRepairNeeded` error type kept as-is (neutral semantics, ripple cost too high for marginal clarity).