fix(utxo/aerospike): clear locked flag on pagination records when mining (#1037)#1039
Conversation
…ing (bsv-blockchain#1037) SetMinedMulti — the "mined => spendable" safety net — only cleared the `locked` flag on the master UTXO record. For external/paginated transactions (>utxoBatchSize outputs) the pagination/extra records kept locked=true forever. A tx is created WithLocked(true) on EVERY record by the block-validation quick-validate pipeline and the validator 2PC. Those paths unlock via SetLocked(false), which does recurse into child records — but only on success. quick_validate documents that on error the unlock is never called and recovery comes from SetMinedMulti on retry. Because SetMinedMulti was keyed on the master record only, the pagination records were never unlocked, so a child spending an output that lives on a pagination record (vout >= utxoBatchSize) failed permanently with TX_LOCKED, wedging legacy IBD in a reset loop. Fix: SetMinedMulti now clears locked on all of a mined tx's records. - teranode.lua setMined surfaces totalExtraRecs as childCount on every mine so the client knows how many pagination records to unlock (lua version bumped to v60 so the UDF re-registers). - New clearLockedOnRecordsMulti helper: unfiltered UPDATE_ONLY Locked=false batch over the pagination records, tolerant of KEY_NOT_FOUND. The lock-clear I/O is performed by the public SetMinedMulti / SetMinedMultiWithExpressions methods so the result processors stay free of follow-up writes and remain unit-testable. - Expression path also clears the master lock on FILTERED_OUT records (blockID already present), where the blockID filter would otherwise skip the Locked=false op entirely. Regression test (set_mined_locked_test.go) covers both the Lua UDF and filter-expression paths plus the FILTERED_OUT case; it fails without this change.
|
🤖 Claude Code Review Status: Complete SummaryThis PR correctly fixes issue #1037 where legacy IBD wedged forever at block 604,315 with FindingsNo issues found. The implementation demonstrates careful engineering:
The fix properly maintains the invariant that mining a transaction makes ALL outputs spendable by clearing locks on master + pagination records. |
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-06-04 21:06 UTC |
…re-mine (bsv-blockchain#1037) Follow-up to the pagination-lock fix. Two gaps closed: 1. Expression path, FILTERED_OUT (blockID already present): the batch write — and all its reads — are skipped, so neither the child count nor the master's lock state are known. The previous attempt only cleared the master, leaving pagination records locked, so an already-orphaned paginated tx never healed on re-mine. Now such records go through SetLocked(false), which reads the child count from the master and clears every record (master + all pagination records). 2. The lock-clearing follow-up is now returned from the result processors as a lockClearWork value and executed by the public SetMinedMulti / SetMinedMultiWithExpressions methods via applyLockClearWork, so the processors do no follow-up I/O and stay unit-testable. Adds TestSetMinedMultiHealsAlreadyMinedLockedPaginationRecord, which reproduces the wedged state (master mined, pagination record still locked) and re-mines with an already-present blockID; it failed on the expression path before this change.
…in#1037 pagination-lock fix Findings from a mixture-of-experts review of PR bsv-blockchain#1039. P1 — close the unit-coverage gap on the load-bearing FILTERED_OUT unlock: - TestProcessBatchResultsForSetMinedExpressions_FilteredOutSynthesizesMapEntry now binds the lockClearWork return and asserts work.fullUnlock contains the hash (and work.items is empty). Previously it discarded the return, so deleting the work.fullUnlock append in set_mined_expressions.go left the test green; only the container integration test (skipped without Docker) guarded it. P2 / nits: - teranode.lua: document the childCount == totalExtraRecs invariant behind the unconditional FIELD_CHILD_COUNT overwrite on a mine (comment only, no version bump). - set_mined.go applyLockClearWork: assign the first non-nil error directly instead of errors.Join(nil, err), which would drop the rich *errors.Error type. - set_mined.go clearLockedOnRecordsMulti: guard the child-index loop against a non-positive childCount (uint32 wrap), and document the reliance on per-record (not top-level) KEY_NOT_FOUND reporting. - set_mined_expressions.go: collapse the nested `if nrErrors > 0 { if errs != nil }` to the single guard the Lua path uses (`errs != nil || nrErrors > 0`). New integration test TestSetMinedMultiUnlocksDespiteSiblingError (lua + expression): a paginated tx that mines successfully alongside a sibling hash that errors must still have its pagination records unlocked — guards the "runs even when err != nil" contract that prevents a single bad tx from stranding a healthy parent's lock.
|
ordishs
left a comment
There was a problem hiding this comment.
Approved.
Verified the fix against the actual record semantics rather than the description:
- Expression path writes
Locked=falseon the master (set_mined_expressions.go:276) and reads backTotalExtraRecs(:266). - Lua
childCountinvariant holds:setDeleteAtHeightreturns childCount ==BIN_TOTAL_EXTRA_RECSfor external records, and the new unconditional overwrite sits outside the signal block, so DAH follow-up work is unaffected. fullUnlockviaSetLocked(false)recurses into pagination records (locked.go:177-204).GetAerospikeBatchWritePolicyreturns a fresh policy, so mutatingRecordExistsAction = UPDATE_ONLYis local (no shared-state race).clearLockedOnRecordsMultimirrors the acceptedSetDAHForChildRecordsMultipattern.
Strengths: thorough root-cause evidence, testable processor/IO separation, and TestSetMinedMultiUnlocksDespiteSiblingError pins the load-bearing 'run lock-clear even on sibling error' behavior.
Non-blocking notes:
- Write amplification: every mine of an external tx now issues an extra batch write to pagination records unconditionally (heavier on the expression FILTERED_OUT path = full SetLocked per tx). Inherent to the approach; worth monitoring SetMinedBatch metrics + Aerospike write load post-deploy.
- Ensure CI actually runs the aerospike-tagged tests — all new behavioral tests skip under -short, so they're the only end-to-end protection for this fix.
- Carry the 'already-wedged tip needs reset+resync or invalidateblock/reconsiderblock on 604,314' caveat into the deploy runbook.
freemans13
left a comment
There was a problem hiding this comment.
Approved.
Correct, well-reasoned fix for a real production wedge (#1037). Root-cause analysis is thorough and the fix matches it: SetMinedMulti now clears locked on all of a mined tx's records (master + pagination), restoring the documented mined-⇒-spendable invariant. Clean separation — result processors stay write-free and return []lockClearItem, the public methods perform the lock-clear I/O. Lua v59→v60 bump is present. Build passes locally.
Verified the one open question (expression-path FILTERED_OUT only clears the master, not pagination) is not a forward-operation bug: under normal operation the first mine always passes the filter and clears pagination, so the 'blockID present + pagination still locked' state is unreachable except from pre-fix data. The reported incident is on the default lua path, which self-heals correctly on re-mine. Residual gap is a bounded self-heal limitation for EE expression-path nodes carrying pre-fix wedged txs — acceptable to address as a follow-up.
Non-blocking follow-ups for later:
- Close the expression-path
FILTERED_OUTpagination gap (e.g. route filtered txs through the lua UDF for the heal), or document the lua-path remediation. - Add a non-integration unit test for
clearLockedOnRecordsMulticounting / KEY_NOT_FOUND tolerance (current new tests are Aerospike integration-only, skipped inmake test).



Summary
Fixes #1037 — fresh legacy IBD wedges forever at testnet block 604,315 with
TX_LOCKEDon two parent UTXOs that read as unlocked.Root cause:
SetMinedMulti(the "mined ⇒ spendable" safety net) only cleared thelockedflag on the master UTXO record. For external/paginated transactions (>utxoBatchSizeoutputs) the pagination/extra records keptlocked=trueforever, so a child spending an output that lives on a pagination record (vout ≥utxoBatchSize) failed permanently withTX_LOCKED.Causal chain (verified on the wedged node)
WithLocked(true)on every record (master + pagination) by the block-validation quick-validate pipeline (quick_validate.go:1147) and the validator 2PC.SetLocked(false), which recurses into pagination records) only runs on success.quick_validate.go:440-443documents that on error the unlock is never called; recovery is expected fromSetMinedMultion retry (block 604,314 = internal id 635313 fails atSetMinedMulti, then permanently at block-addBLOCK_EXISTS, so the unlock never runs).SetMinedMulti(luasetMinedUDFteranode.lua:647and the Go expression batchset_mined_expressions.go:276) is keyed onhash[:]= master only, so the documented recovery never touches the pagination records.Live
aerospikereaderon both stuck parents:lockedfalsetrueChildren spend
cdb5409c…vout 827 and1519b4cd…vout 575 — both>512, both on the locked pagination record. The master reads unlocked whileSPEND_BATCH_LUAreports the parent locked. Node runs the lua path (EnableSetMinedFilterExpressions=false).Fix
SetMinedMultinow clearslockedon all of a mined tx's records, restoring the documented invariant.teranode.luasetMinedsurfacestotalExtraRecsaschildCounton every mine so the client knows how many pagination records to unlock. Lua version bumpedv59→v60so the UDF re-registers on startup.clearLockedOnRecordsMultihelper (unfilteredUPDATE_ONLYLocked=false, tolerant ofKEY_NOT_FOUND) clears pagination records. The lock-clearing follow-up is returned from the result processors as alockClearWorkvalue and executed by the publicSetMinedMulti/SetMinedMultiWithExpressionsmethods (applyLockClearWork), so the processors do no follow-up I/O and stay unit-testable.Locked=falseand the reads): fully unlocked viaSetLocked(false), which reads the child count from the master and clears every record. Closes the filter-gating bug.Testing
go build ./...,go vet ./stores/utxo/aerospike/,gofmt -l: clean.set_mined_locked_test.go, lua + expression paths):TestSetMinedMultiClearsLockOnPaginationRecords— mine clears all records.TestSetMinedMultiWithExpressionsClearsLockWhenBlockIDFilteredOut— master unlocked on a FILTERED_OUT write.TestSetMinedMultiHealsAlreadyMinedLockedPaginationRecord— re-mine of an already-mined tx whose pagination record is still locked heals it (this failed on the expression path before the FILTERED_OUT fix).stores/utxo/aerospikepackage suite:ok.Deploy / remediation — important
This fix is triggered by
SetMinedMulti. It prevents the permanent lock and heals an orphan wheneverSetMinedMultire-runs on the affected tx (i.e. while its block is (re)processed during IBD).It does not passively heal an instance that is already wedged with the affected block as the accepted tip: on such a node only the failing block (604,315) loops, its parents' block (604,314) is never reprocessed, so
SetMinedMultiis never called on the parents and nothing clears the stale pagination lock. To clear an already-wedged instance, do one of:SetMinedMultipath heals the orphan as the block is applied — no permanent wedge.invalidateblock <604314 hash>thenreconsiderblock <604314 hash>, which re-runsSetMinedMultion the parents and clears the pagination locks.Requires the v60 bump (included) so the UDF re-registers on startup.
Server investigation was read-only (
docker ps/inspect/logs,asinfo,aerospikereader,getfsmstate, settings dump) — no writes, no FSM changes, no restarts.