fix(setMinedMulti): enforce coverage invariant across store, model, cache, and block validation#850
Conversation
…ache, and block validation
Make it impossible for a block to reach SetBlockMinedSet without every
non-coinbase transaction being durably tagged with the current block ID
in the UTXO txmeta store.
- aerospike UDF path: handleBatchRecordError no longer swallows
KEY_NOT_FOUND_ERROR for normal set-mined; only the UnsetMined path
tolerates a missing record.
- aerospike expression path: FILTERED_OUT now synthesizes {hash: [blockID]}
in the returned map (the filter's truth condition already proves the
blockID is durably present), so downstream coverage checks see it.
- both Aerospike paths: batch-coverage postcondition verifies every
submitted hash is in the returned map and each slice contains
minedBlockInfo.BlockID for non-UnsetMined calls.
- UpdateTxMinedStatus: defensive caller-level postcondition mirrors the
store invariant via verifyMinedBatchCoverage and existing
setMinedErrorCount accounting.
- TxMetaCache.SetMinedMulti: delegates to the underlying utxo.Store
first; cache update failure after store success returns an error to
force retry instead of silently masking missing durable state.
- BlockValidation.processBlockMinedNotSet: the "processed block mined
and set mined_set" log now only fires when setTxMinedStatus succeeded
(previously emitted unconditionally even after requeue).
- utxo.Store interface: SetMinedMulti docstring documents the new
postcondition contract for all implementations.
- MockUtxostore.SetMinedMulti accepts a function return so tests can
synthesize realistic blockIDsMaps from submitted hashes.
Tests cover store-level KEY_NOT_FOUND semantics, caller-level coverage
gaps (missing hash and missing blockID), UnsetMined tolerance, and
TxMetaCache delegation + error propagation.
|
🤖 Claude Code Review Status: Complete All previously reported issues have been resolved in this version. History:
Current Review:
The PR successfully closes the gap that could allow blocks to reach MinedSet=true without proper transaction tagging. |
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-05-18 09:55 UTC |
…odel, cache, and block validation
…odel, cache, and block validation
There was a problem hiding this comment.
Pull request overview
This PR tightens the “tx mined tagging” invariants so block validation fails closed unless every non-coinbase tx is durably marked as mined in the UTXO txmeta store for the current blockID, and aligns cache behavior/logging with that goal.
Changes:
- Tighten Aerospike SetMinedMulti error handling (KEY_NOT_FOUND is only tolerated for
UnsetMined) and improve expression-mode handling ofFILTERED_OUT. - Add a model-layer postcondition check to detect coverage gaps in
SetMinedMultiresults and fail block processing closed. - Fix
TxMetaCache.SetMinedMultito delegate to the underlying store and evict from cache; adjust block-validation logging on failure.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
stores/utxo/mock.go |
Extends mock to allow function-based return values for realistic blockIDsMap synthesis in tests. |
stores/utxo/Interface.go |
Documents a stronger SetMinedMulti postcondition/contract. |
stores/utxo/aerospike/set_mined.go |
Makes KEY_NOT_FOUND a hard error for normal set-mined (no-op only for unset-mined). |
stores/utxo/aerospike/set_mined_expressions.go |
Treats FILTERED_OUT as covered by synthesizing a map entry. |
stores/utxo/aerospike/set_mined_expressions_test.go |
Adds coverage for parsing an explicitly empty blockIDs bin. |
stores/utxo/aerospike/refactored_methods_test.go |
Adds regression coverage for KEY_NOT_FOUND handling behavior. |
stores/txmetacache/txmetacache.go |
Ensures SetMinedMulti delegates to store then evicts cache entries. |
stores/txmetacache/txmetacache_test.go |
Adds tests proving delegation/eviction and store-error short-circuiting. |
services/blockvalidation/BlockValidation.go |
Prevents success log emission when setTxMinedStatus fails and block is requeued. |
model/update-tx-mined.go |
Adds coverage-gap verification logic after SetMinedMulti. |
model/update-tx-mined_test.go |
Updates existing tests and adds new subtests for coverage-gap erroring and unset-mined tolerance. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
oskarszoon
left a comment
There was a problem hiding this comment.
The 5-layer defense-in-depth shape is right and the fix correctly transforms a silent postcondition violation into a surfaced error. Store returns full blockIDs map → checkBatchResults walks coverage per submitted hash → cache pass-through (clean) → setTxMinedStatus surfaces as ProcessingError → setMinedChan retry loop. No peer-controllable DoS path; SQL and Lua backends produce equivalent semantics; cache is pure forwarding.
One operational blocker before merge:
Unbounded 1Hz retry loop on setMinedChan. BlockValidation.go:561-567 retries forever at 1-second intervals on any non-ErrBlockNotFound error. After this PR lands, a node carrying historical-corrupt block_ids data (records that lost a blockID via a prior Lua signal-handling crash, a pre-PR-4561-class bug, or an Aerospike eviction race) will hit the new coverage check on first restart and spin at ~86,400 ProcessingError log lines per day per affected block, with no way to drain the channel and no metric for operators to see "I'm stuck on this block" until log volume catches their eye. The fix is exactly the class of bug this PR is designed to surface, but the response shape is too aggressive for an unrecoverable error. Three required changes:
- Add a max-retry counter on
setMinedChanre-queues (e.g. 10 attempts with exponential backoff 1s→16s), then log WARN + drop with amanual_intervention_requiredmetric/log marker. - Add
block_validation_setmined_retry_total{blockhash}Prometheus counter so the stuck state is visible in dashboards. - Document the new failure mode in
docs/topics/services/blockValidation.mdso on-call understand "this loop is normal during corrupt-state recovery; here's how to escalate".
Pre-deploy operator coordination separately: worth a one-shot diagnostic scan across block_ids for recently-processed blocks before flipping this on for any environment with historical state. If drift exists, repair-then-deploy.
Non-gating follow-ups: differentiate setMinedErrorCount into typed buckets (storage / conversion / coverage) so the retry loop can apply policy; push the coverage invariant down into Aerospike Lua + SQL for atomic enforcement; add a fixture-driven table test for checkBatchResults exercising mixed-mode batch results.
Happy to re-approve once the bounded retry + Prometheus counter are in.
UpdateTxMinedStatus lumped two distinct failures under a single setMinedErrorCount, surfacing both as "failed to set mined status for N batches". An operator triaging a stalled block could not tell a store I/O blip from a postcondition violation (SetMinedMulti returned nil error but did not durably tag every submitted tx). Track them in separate atomics — setMinedErrorCount for I/O failures, coverageGapCount for postcondition violations — and switch on the two counts at the end to emit a cause-specific error. The combined-failure variant reports both counts so a mixed batch is still legible. Adds three subtests pinning each error variant (pure I/O, pure coverage, combined).
…_FOUND behavior processBatchResultsForSetMinedExpressions synthesizes the result-map entry on FILTERED_OUT (proof, from the filter condition, that the current blockID is already in the durable list) and treats it as a successful idempotent retry. There was no direct test covering that branch, so a regression that dropped the synthesis or changed it to an error would only have surfaced as a coverage-gap downstream. Adds a focused test that drives the function with a fake BatchWrite returning FILTERED_OUT and asserts the synthesized [blockID] slice. Adds a paired regression guard confirming KEY_NOT_FOUND on the same path remains a hard error and does not synthesize a map entry.
The SQL backend (postgres, sqlite, sqlitememory) silently dropped any submitted hash that was absent from the transactions table: existence was probed up front, the INSERT/SELECT path then operated only on the intersection, and the missing hashes never appeared in the returned map. The Interface.go postcondition added in PR-850 says the opposite — for normal set-mined every submitted hash MUST be in the result. The model layer's coverage check already caught this (the hash showed up as a coverage gap and the block failed closed on retry), so the visible behavior was correct. But the contract was wrong, and a caller outside UpdateTxMinedStatus (or a future refactor that thins the model-layer guard) would re-introduce the silent drop. Bring SQL into parity with Aerospike's KEY_NOT_FOUND handling: when !UnsetMined and any submitted hash is missing from transactions, return TxNotFoundError and let the deferred rollback fire. UnsetMined still tolerates missing rows (the tx is already gone). Updates TestStore_SetMinedMultiChunk_NoExistingTransactions (which previously pinned the silent-drop) to expect the new error, and adds two new tests: one for UnsetMined=true tolerance, one for the partial-miss case where some hashes exist and some don't.
|
Reviewed the Aerospike expression path postcondition enforcement: Issue: Impact: Violates the Suggested fix: Add postcondition check before line 463 return: if \!minedBlockInfo.UnsetMined {
for _, h := range hashes {
blockIDs, exists := blockIDs[*h]
if \!exists || \!containsBlockID(blockIDs, minedBlockInfo.BlockID) {
errs = errors.Join(errs, errors.NewTxNotFoundError("coverage gap", h.String()))
nrErrors++
}
}
}Note: The model layer defensive check (model/update-tx-mined.go:289-297) catches this, but store implementations should enforce their own contracts per the interface docstring. |
ordishs
left a comment
There was a problem hiding this comment.
Approving on the fix correctness — the 5-layer defense-in-depth is sound, backend parity (SQL ↔ Aerospike) is in place, and the test matrix covers the important paths (empty map, partial omit, FILTERED_OUT synthesis, KEY_NOT_FOUND as hard error, mixed I/O + coverage error reporting). I built the branch, ran go vet, and ran the affected unit tests including -race on model — all green.
One operational concern worth a follow-up (not gating my approval, but flagging it before deploy):
The unbounded 1-second retry on setMinedChan (BlockValidation.go:553-567) is a pre-existing pattern, but this PR materially changes its risk profile. Before: failures were transient I/O. After: failures include "this block's historical state can never satisfy the new postcondition" — genuinely unrecoverable through retry alone. A node carrying any pre-existing corrupt block_ids data will spin at ~86,400 log lines/day per affected block on first restart, with no metric for operators to spot it.
Worth doing before this hits any environment with historical state:
- Bounded retries on
setMinedChanre-queue (e.g. exponential 1s→16s, drop after N attempts with amanual_intervention_requiredlog marker). block_validation_setmined_retry_total{blockhash}Prometheus counter for dashboard visibility.- A short note in
docs/topics/services/blockValidation.mddescribing the new failure class for on-call. - Pre-deploy: a one-shot diagnostic scan over recent
block_idsto catch any drift before flipping the stricter check on.
Happy to land this as-is if the retry hardening is queued as a follow-up PR with a clear owner.
Minor housekeeping for the description before merge: the TxMetaCache.SetMinedMulti blurb still says "cache failure after a successful store write returns an error" — the code now evicts via Delete, which can't fail. The "Batch-coverage postcondition (both Aerospike paths)" line is also slightly misleading; the postcondition is enforced at the model layer, with the Aerospike paths contributing the inputs (KEY_NOT_FOUND as hard error, FILTERED_OUT synthesis). Doesn't change correctness, but easier on future readers.
Align trailing comments on the new setMinedErrorCount / coverageGapCount declarations with the longer oldBlockIDs line so the gci formatter is happy. Pure whitespace, no behavior change.
…xMetaCache The SetMinedMulti interface contract added in this PR promises that, on a nil error with !UnsetMined, every submitted hash is present in the result map and every slice contains minedBlockInfo.BlockID. SQL already enforced this; Aerospike and TxMetaCache silently passed through coverage gaps. Aerospike: add an end-of-loop postcondition check in processBatchResultsForSetMinedExpressions that promotes uncovered hashes to a NewTxNotFoundError (or NewProcessingError when the slice exists but lacks the current blockID). Covers the nil-record/bins path and the empty-BlockIDs-bin path that previously returned nil. UnsetMined still tolerates gaps per the interface contract. TxMetaCache: re-verify coverage in SetMinedMulti before evicting cache entries so a regression in any wrapped backend cannot slip through the cache layer. Route SetMined through SetMinedMulti to share the check. Tests pin the new behaviour: - Aerospike: CoverageGap_NilRecord, CoverageGap_EmptyBlockIDs, UnsetMinedToleratesGap. - TxMetaCache: PostconditionMissingHash, PostconditionMissingBlockID, UnsetMinedToleratesGap. - Stale comment on TestParseSetMinedState_ExplicitEmptyList rewritten to reflect the new store-side enforcement.
…odel, cache, and block validation
|
oskarszoon
left a comment
There was a problem hiding this comment.
All three blockers from the previous review are closed:
- Bounded retry on
setMinedChan:setMinedMaxRetries=10with exponential backoff 1s→16s and amanual_intervention_requiredlog marker on final drop (99098a1fc). - Prometheus counters:
setmined_retry_total{blockhash}andsetmined_drops_total{blockhash}inmetrics.go. - Operator-runbook section in
docs/topics/services/blockValidation.mdcovering the new failure class, backoff table, counters, and repair procedure.
TestSetMinedRetryBackoff pins the 111s total budget assertion correctly. PR description is updated. ordishs's typed-error-bucket split (e3d35e68f) and SQL postcondition (a941c7475) are clean; icellan's routing of SetMined through SetMinedMulti and the TxMetaCache defensive check in 05aed56ec are correct.
Two non-blocking follow-ups worth a separate ticket:
- The non-expressions Aerospike path (
processBatchResultsForSetMined) has no end-of-loop coverage check, unlike the expressions path. Covered downstream by the TxMetaCache and model-layer postcondition checks, so safe, but the asymmetry between the two batch paths is worth closing. time.Sleep(backoff)blocks the singlesetMinedChanworker for up to 16s per retry. Operationally acceptable given retries should be rare, but worth a one-line note in the runbook ("during recovery the setMined worker is single-threaded; concurrent stuck blocks serialise").
Conflict: main's #850 (setMinedMulti coverage invariant) reworked handleBatchRecordError to branch on a new unsetMined flag, while the PR side had added a batchRecord parameter for richer diagnostic logging. Merged the two: signature is now handleBatchRecordError(batchRecord, err, hash, unsetMined) - KEY_NOT_FOUND_ERROR returns nil under unsetMined (main's invariant) or TxNotFoundError otherwise. - All other errors keep the PR's batchRecord-aware storage error message. Single call site in set_mined.go updated; refactored_methods_test.go calls updated to pass nil batchRecord (the unit-level tests don't need the diagnostic context). Also retargeted main's new test-file imports from upstream aerospike-client-go to the BSV fork, matching the rest of the PR.



Summary
Closes the gap that allowed a block to reach
SetBlockMinedSet(true)without every non-coinbase transaction being durably tagged with the currentblockIDin the UTXO txmeta store. Adds a defence-in-depth invariant enforced at every layer (store, cache wrapper, caller, block-validation), plus a bounded retry loop so a node carrying historical-corrupt state cannot spin forever on the new check.Coverage invariant — enforced at every layer
Contract (
stores/utxo/Interface.go): whenUnsetMined=falseandSetMinedMultireturns nil, every submitted hash MUST be a key in the returned map and every returned slice MUST containminedBlockInfo.BlockID. Implementations that cannot prove this MUST return a non-nil error.Implementations:
handleBatchRecordErrorpromotesKEY_NOT_FOUND_ERRORtoTxNotFoundErrorfor normal set-mined (only theUnsetMinedpath tolerates a missing record).FILTERED_OUTsynthesizes{hash: [blockID]}(the filter condition already proves the durable list contains the current blockID). A new end-of-loop coverage check promotes nil-record/empty-BlockIDs paths toTxNotFoundError/ProcessingErrorso the expression backend fails closed identically to the UDF + SQL backends.SetMinedMultire-verifies coverage on the wrapped store's result before touching the cache, then evicts each hash from the cache (mined txs are not cacheable per theGetMetapolicy:len(BlockIDs) == 0 && !Conflicting).SetMinednow routes throughSetMinedMultito share the check.model.UpdateTxMinedStatusre-verifies coverage in a closure that piggybacks on the existingblockIDsMapiteration (no new nested loop). Two separate counters distinguish I/O failures (setMinedErrorCount) from postcondition gaps (coverageGapCount); the final error message reports both.BlockValidation.processBlockMinedNotSetno longer emits the "processed block mined and set mined_set" log on the failure path. The primarysetTxMinedStatusflow already ordersUpdateTxMinedStatusstrictly beforeSetBlockMinedSet.Bounded
setMinedChanretry (operational gate per @oskarszoon)Pre-PR, the
setMinedChanworker retried failures forever at 1Hz. After this PR the postcondition check surfaces previously-silent failures, so an unbounded loop becomes operationally hostile on nodes carrying historical-corruptblock_idsdata.ERRORlog containing themanual_intervention_requiredmarker. Counter resets on success,MinedSetshortcut,ErrBlockNotFound, or drop.teranode_blockvalidation_setmined_retry_total{blockhash}— alert when a single label series approaches the retry ceiling.teranode_blockvalidation_setmined_drops_total{blockhash}— any non-zero value is page-worthy.docs/topics/services/blockValidation.md§ 2.3 covers the new failure mode (how to identify, repair, and re-trigger; cluster-wide drop rate is a stop-the-rollout signal).Files changed
stores/utxo/Interface.gostores/utxo/aerospike/set_mined.goKEY_NOT_FOUND→TxNotFoundErrorfor set-minedstores/utxo/aerospike/set_mined_expressions.goFILTERED_OUTsynthesizes map entry; end-of-loop coverage check for all backendsstores/utxo/mock.goMockUtxostore.SetMinedMultiaccepts a function return for realistic test mapsstores/txmetacache/txmetacache.goSetMinedMultire-verifies + evicts;SetMinedroutes throughSetMinedMultimodel/update-tx-mined.gocheckBatchResultsclosure folds coverage check into the existing double-spend pass; split counters for I/O vs coverageservices/blockvalidation/BlockValidation.gosetMinedChanwith exponential backoff +manual_intervention_requiredmarker; success log only on success pathservices/blockvalidation/metrics.gosetmined_retry_total/setmined_drops_totalcountersdocs/topics/services/blockValidation.mdTests
TestHandleBatchRecordError_KeyNotFound(set / unset),TestParseSetMinedState_ExplicitEmptyList,TestProcessBatchResultsForSetMinedExpressions_FilteredOutSynthesizesMapEntry,_KeyNotFoundIsHardError,_CoverageGap_NilRecord,_CoverageGap_EmptyBlockIDs,_UnsetMinedToleratesGap.Test_updateTxMinedStatus_Internalwith subtests for empty-map coverage gap, partial omission (expression path empty BlockIDs scenario), missing-blockID-in-slice, andUnsetMinedtolerance.TestTxMetaCacheSetMinedMulti_DelegatesToStoreAndEvictsproves underlying-store delegation + cache eviction;_PropagatesStoreErrorproves short-circuit on store error and cache preservation.TestSetMinedRetryBackoffpins the backoff curve and the worst-case budget.Verification
TestContainers-based integration tests (Aerospike, Postgres) require Docker which isn't available in dev env — CI will exercise them.
Risks
block_idsdrift will hit the new check on first restart. The bounded retry +manual_intervention_requiredmarker + Prometheus drops counter make this visible to operators instead of an unbounded log flood. Pre-deploy: a one-shot diagnostic scan over recentblock_idsis advisable to catch drift before flipping the stricter check on across a cluster.count(blockID in blockIDs) == 0). The inline comment ties the assumption to the filter, and the end-of-loop coverage check would catch a regression if the filter expression ever changes.