Skip to content

fix(setMinedMulti): enforce coverage invariant across store, model, cache, and block validation#850

Merged
icellan merged 12 commits into
mainfrom
fix/set-tx-mined-invariant
May 18, 2026
Merged

fix(setMinedMulti): enforce coverage invariant across store, model, cache, and block validation#850
icellan merged 12 commits into
mainfrom
fix/set-tx-mined-invariant

Conversation

@icellan

@icellan icellan commented May 12, 2026

Copy link
Copy Markdown
Contributor

Summary

Closes the gap that allowed a block to reach SetBlockMinedSet(true) without every non-coinbase transaction being durably tagged with the current blockID in the UTXO txmeta store. Adds a defence-in-depth invariant enforced at every layer (store, cache wrapper, caller, block-validation), plus a bounded retry loop so a node carrying historical-corrupt state cannot spin forever on the new check.

Coverage invariant — enforced at every layer

Contract (stores/utxo/Interface.go): when UnsetMined=false and SetMinedMulti returns nil, every submitted hash MUST be a key in the returned map and every returned slice MUST contain minedBlockInfo.BlockID. Implementations that cannot prove this MUST return a non-nil error.

Implementations:

  • Aerospike UDF pathhandleBatchRecordError promotes KEY_NOT_FOUND_ERROR to TxNotFoundError for normal set-mined (only the UnsetMined path tolerates a missing record).
  • Aerospike expression pathFILTERED_OUT synthesizes {hash: [blockID]} (the filter condition already proves the durable list contains the current blockID). A new end-of-loop coverage check promotes nil-record/empty-BlockIDs paths to TxNotFoundError / ProcessingError so the expression backend fails closed identically to the UDF + SQL backends.
  • TxMetaCache wrapperSetMinedMulti re-verifies coverage on the wrapped store's result before touching the cache, then evicts each hash from the cache (mined txs are not cacheable per the GetMeta policy: len(BlockIDs) == 0 && !Conflicting). SetMined now routes through SetMinedMulti to share the check.
  • Callermodel.UpdateTxMinedStatus re-verifies coverage in a closure that piggybacks on the existing blockIDsMap iteration (no new nested loop). Two separate counters distinguish I/O failures (setMinedErrorCount) from postcondition gaps (coverageGapCount); the final error message reports both.
  • Block validationBlockValidation.processBlockMinedNotSet no longer emits the "processed block mined and set mined_set" log on the failure path. The primary setTxMinedStatus flow already orders UpdateTxMinedStatus strictly before SetBlockMinedSet.

Bounded setMinedChan retry (operational gate per @oskarszoon)

Pre-PR, the setMinedChan worker retried failures forever at 1Hz. After this PR the postcondition check surfaces previously-silent failures, so an unbounded loop becomes operationally hostile on nodes carrying historical-corrupt block_ids data.

  • Bounded exponential backoff per block hash: 1s → 2s → 4s → 8s → 16s × 6 (capped). After 10 consecutive failures (worst-case 111s total) the worker drops the block hash with an ERROR log containing the manual_intervention_required marker. Counter resets on success, MinedSet shortcut, ErrBlockNotFound, or drop.
  • Prometheus metrics:
    • teranode_blockvalidation_setmined_retry_total{blockhash} — alert when a single label series approaches the retry ceiling.
    • teranode_blockvalidation_setmined_drops_total{blockhash} — any non-zero value is page-worthy.
  • Operator runbook in docs/topics/services/blockValidation.md § 2.3 covers the new failure mode (how to identify, repair, and re-trigger; cluster-wide drop rate is a stop-the-rollout signal).
  • Per-blockhash label cardinality is intentional: an alert on a single stuck block is the only useful operator signal.

Files changed

File Change
stores/utxo/Interface.go Documents the SetMinedMulti coverage postcondition
stores/utxo/aerospike/set_mined.go UDF KEY_NOT_FOUNDTxNotFoundError for set-mined
stores/utxo/aerospike/set_mined_expressions.go FILTERED_OUT synthesizes map entry; end-of-loop coverage check for all backends
stores/utxo/mock.go MockUtxostore.SetMinedMulti accepts a function return for realistic test maps
stores/txmetacache/txmetacache.go SetMinedMulti re-verifies + evicts; SetMined routes through SetMinedMulti
model/update-tx-mined.go checkBatchResults closure folds coverage check into the existing double-spend pass; split counters for I/O vs coverage
services/blockvalidation/BlockValidation.go Bounded retry on setMinedChan with exponential backoff + manual_intervention_required marker; success log only on success path
services/blockvalidation/metrics.go Adds setmined_retry_total / setmined_drops_total counters
docs/topics/services/blockValidation.md Coverage postcondition + retry loop + operator runbook

Tests

  • Aerospike (unit)TestHandleBatchRecordError_KeyNotFound (set / unset), TestParseSetMinedState_ExplicitEmptyList, TestProcessBatchResultsForSetMinedExpressions_FilteredOutSynthesizesMapEntry, _KeyNotFoundIsHardError, _CoverageGap_NilRecord, _CoverageGap_EmptyBlockIDs, _UnsetMinedToleratesGap.
  • Model — extended Test_updateTxMinedStatus_Internal with subtests for empty-map coverage gap, partial omission (expression path empty BlockIDs scenario), missing-blockID-in-slice, and UnsetMined tolerance.
  • TxMetaCacheTestTxMetaCacheSetMinedMulti_DelegatesToStoreAndEvicts proves underlying-store delegation + cache eviction; _PropagatesStoreError proves short-circuit on store error and cache preservation.
  • BlockValidationTestSetMinedRetryBackoff pins the backoff curve and the worst-case budget.

Verification

go build ./...                                                     # clean
go vet ./model ./stores/utxo/aerospike ./stores/txmetacache ./services/blockvalidation  # clean
go test ./model                                                    # pass (incl. race)
go test ./services/blockvalidation -run TestSetMined               # pass
go test ./stores/txmetacache -run TestTxMetaCache                  # pass
go test ./stores/utxo/aerospike -run "TestHandle|TestParse|TestProcess"  # pass

TestContainers-based integration tests (Aerospike, Postgres) require Docker which isn't available in dev env — CI will exercise them.

Risks

  • Historical corrupt state will now fail loudly. A node carrying pre-existing block_ids drift will hit the new check on first restart. The bounded retry + manual_intervention_required marker + Prometheus drops counter make this visible to operators instead of an unbounded log flood. Pre-deploy: a one-shot diagnostic scan over recent block_ids is advisable to catch drift before flipping the stricter check on across a cluster.
  • Expression-mode coverage synthesis is sound under the current filter condition (count(blockID in blockIDs) == 0). The inline comment ties the assumption to the filter, and the end-of-loop coverage check would catch a regression if the filter expression ever changes.
  • Per-blockhash Prometheus label cardinality is bounded by the number of historically-stuck blocks; once repaired, series fall out of recent windows.

…ache, and block validation

Make it impossible for a block to reach SetBlockMinedSet without every
non-coinbase transaction being durably tagged with the current block ID
in the UTXO txmeta store.

- aerospike UDF path: handleBatchRecordError no longer swallows
  KEY_NOT_FOUND_ERROR for normal set-mined; only the UnsetMined path
  tolerates a missing record.
- aerospike expression path: FILTERED_OUT now synthesizes {hash: [blockID]}
  in the returned map (the filter's truth condition already proves the
  blockID is durably present), so downstream coverage checks see it.
- both Aerospike paths: batch-coverage postcondition verifies every
  submitted hash is in the returned map and each slice contains
  minedBlockInfo.BlockID for non-UnsetMined calls.
- UpdateTxMinedStatus: defensive caller-level postcondition mirrors the
  store invariant via verifyMinedBatchCoverage and existing
  setMinedErrorCount accounting.
- TxMetaCache.SetMinedMulti: delegates to the underlying utxo.Store
  first; cache update failure after store success returns an error to
  force retry instead of silently masking missing durable state.
- BlockValidation.processBlockMinedNotSet: the "processed block mined
  and set mined_set" log now only fires when setTxMinedStatus succeeded
  (previously emitted unconditionally even after requeue).
- utxo.Store interface: SetMinedMulti docstring documents the new
  postcondition contract for all implementations.
- MockUtxostore.SetMinedMulti accepts a function return so tests can
  synthesize realistic blockIDsMaps from submitted hashes.

Tests cover store-level KEY_NOT_FOUND semantics, caller-level coverage
gaps (missing hash and missing blockID), UnsetMined tolerance, and
TxMetaCache delegation + error propagation.
@github-actions

github-actions Bot commented May 12, 2026

Copy link
Copy Markdown
Contributor

🤖 Claude Code Review

Status: Complete

All previously reported issues have been resolved in this version.

History:

  • ✅ Fixed: Separate counters for I/O failures vs coverage gaps (model/update-tx-mined.go)
  • ✅ Fixed: Postcondition enforcement in Aerospike expression path
  • ✅ Fixed: Coverage validation in TxMetaCache wrapper
  • ✅ Fixed: All store implementations now enforce the SetMinedMulti contract

Current Review:
No issues found. The implementation is well-structured and comprehensive:

  • Coverage invariant properly enforced across all layers (store, cache, model, block validation)
  • Bounded retry logic with exponential backoff prevents operational issues
  • Comprehensive test coverage includes edge cases for coverage gaps, mixed failures, and retry behavior
  • Documentation accurately reflects implementation behavior
  • Metrics and observability provide clear operator signals for intervention

The PR successfully closes the gap that could allow blocks to reach MinedSet=true without proper transaction tagging.

@github-actions

github-actions Bot commented May 12, 2026

Copy link
Copy Markdown
Contributor

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-850 (202cfe5)

Summary

  • Regressions: 0
  • Improvements: 0
  • Unchanged: 142
  • Significance level: p < 0.05
All benchmark results (sec/op)
Benchmark Baseline Current Change p-value
_NewBlockFromBytes-4 1.612µ 1.554µ ~ 0.100
SplitSyncedParentMap_SetIfNotExists/256_buckets-4 71.25n 71.11n ~ 0.700
SplitSyncedParentMap_SetIfNotExists/16_buckets-4 71.15n 71.25n ~ 0.200
SplitSyncedParentMap_SetIfNotExists/1_bucket-4 71.01n 71.01n ~ 1.000
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets... 31.86n 32.19n ~ 0.100
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_... 53.51n 54.43n ~ 0.100
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa... 129.6n 129.5n ~ 1.000
MiningCandidate_Stringify_Short-4 223.2n 216.7n ~ 0.100
MiningCandidate_Stringify_Long-4 1.595µ 1.602µ ~ 0.600
MiningSolution_Stringify-4 830.9n 840.1n ~ 0.100
BlockInfo_MarshalJSON-4 1.704µ 1.701µ ~ 0.300
NewFromBytes-4 127.2n 127.7n ~ 0.600
Mine_EasyDifficulty-4 60.66µ 60.50µ ~ 0.700
Mine_WithAddress-4 6.753µ 6.782µ ~ 0.400
DirectSubtreeAdd/4_per_subtree-4 56.89n 61.09n ~ 0.200
DirectSubtreeAdd/64_per_subtree-4 28.39n 28.42n ~ 1.000
DirectSubtreeAdd/256_per_subtree-4 27.77n 27.41n ~ 0.700
DirectSubtreeAdd/1024_per_subtree-4 26.10n 26.25n ~ 0.100
DirectSubtreeAdd/2048_per_subtree-4 25.78n 26.31n ~ 0.100
SubtreeProcessorAdd/4_per_subtree-4 279.2n 283.3n ~ 1.000
SubtreeProcessorAdd/64_per_subtree-4 272.8n 272.2n ~ 0.400
SubtreeProcessorAdd/256_per_subtree-4 275.5n 274.9n ~ 1.000
SubtreeProcessorAdd/1024_per_subtree-4 267.1n 267.0n ~ 1.000
SubtreeProcessorAdd/2048_per_subtree-4 271.1n 266.5n ~ 0.200
SubtreeProcessorRotate/4_per_subtree-4 273.9n 282.7n ~ 0.100
SubtreeProcessorRotate/64_per_subtree-4 271.2n 272.1n ~ 0.500
SubtreeProcessorRotate/256_per_subtree-4 269.5n 271.5n ~ 0.200
SubtreeProcessorRotate/1024_per_subtree-4 273.6n 270.6n ~ 0.100
SubtreeNodeAddOnly/4_per_subtree-4 54.03n 54.64n ~ 0.200
SubtreeNodeAddOnly/64_per_subtree-4 34.22n 34.35n ~ 0.200
SubtreeNodeAddOnly/256_per_subtree-4 33.20n 33.52n ~ 0.100
SubtreeNodeAddOnly/1024_per_subtree-4 32.66n 33.00n ~ 0.100
SubtreeCreationOnly/4_per_subtree-4 112.2n 115.2n ~ 0.400
SubtreeCreationOnly/64_per_subtree-4 401.4n 393.3n ~ 0.700
SubtreeCreationOnly/256_per_subtree-4 1.309µ 1.322µ ~ 0.400
SubtreeCreationOnly/1024_per_subtree-4 4.335µ 4.347µ ~ 1.000
SubtreeCreationOnly/2048_per_subtree-4 7.832µ 7.816µ ~ 1.000
SubtreeProcessorOverheadBreakdown/64_per_subtree-4 270.1n 271.9n ~ 0.100
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4 268.8n 269.6n ~ 0.100
ParallelGetAndSetIfNotExists/1k_nodes-4 800.0µ 798.7µ ~ 1.000
ParallelGetAndSetIfNotExists/10k_nodes-4 1.527m 1.531m ~ 1.000
ParallelGetAndSetIfNotExists/50k_nodes-4 6.692m 6.536m ~ 0.200
ParallelGetAndSetIfNotExists/100k_nodes-4 13.54m 13.78m ~ 0.400
SequentialGetAndSetIfNotExists/1k_nodes-4 648.3µ 664.8µ ~ 0.100
SequentialGetAndSetIfNotExists/10k_nodes-4 2.843m 2.844m ~ 1.000
SequentialGetAndSetIfNotExists/50k_nodes-4 10.44m 10.54m ~ 0.100
SequentialGetAndSetIfNotExists/100k_nodes-4 19.91m 20.23m ~ 0.100
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4 617.4µ 622.8µ ~ 0.100
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4 4.147m 4.117m ~ 0.400
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4 16.44m 16.40m ~ 0.700
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4 683.3µ 688.5µ ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4 5.944m 5.685m ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4 39.36m 38.11m ~ 0.100
BlockAssembler_AddTx-4 0.02856n 0.03087n ~ 0.400
AddNode-4 11.58 11.83 ~ 1.000
AddNodeWithMap-4 11.82 12.28 ~ 0.400
DiskTxMap_SetIfNotExists-4 4.220µ 4.244µ ~ 1.000
DiskTxMap_SetIfNotExists_Parallel-4 4.008µ 3.893µ ~ 0.100
DiskTxMap_ExistenceOnly-4 449.1n 539.7n ~ 0.700
Queue-4 205.9n 201.0n ~ 0.100
AtomicPointer-4 8.179n 8.139n ~ 0.100
ReorgOptimizations/DedupFilterPipeline/Old/10K-4 794.8µ 793.0µ ~ 0.700
ReorgOptimizations/DedupFilterPipeline/New/10K-4 742.5µ 737.9µ ~ 0.400
ReorgOptimizations/AllMarkFalse/Old/10K-4 111.7µ 112.1µ ~ 1.000
ReorgOptimizations/AllMarkFalse/New/10K-4 58.33µ 58.13µ ~ 0.700
ReorgOptimizations/HashSlicePool/Old/10K-4 58.59µ 56.98µ ~ 0.100
ReorgOptimizations/HashSlicePool/New/10K-4 11.80µ 11.81µ ~ 1.000
ReorgOptimizations/NodeFlags/Old/10K-4 5.230µ 5.031µ ~ 0.100
ReorgOptimizations/NodeFlags/New/10K-4 1.836µ 1.728µ ~ 0.100
ReorgOptimizations/DedupFilterPipeline/Old/100K-4 10.98m 11.39m ~ 0.400
ReorgOptimizations/DedupFilterPipeline/New/100K-4 12.30m 12.03m ~ 0.100
ReorgOptimizations/AllMarkFalse/Old/100K-4 1.185m 1.155m ~ 0.100
ReorgOptimizations/AllMarkFalse/New/100K-4 737.0µ 728.9µ ~ 0.400
ReorgOptimizations/HashSlicePool/Old/100K-4 580.8µ 556.9µ ~ 0.100
ReorgOptimizations/HashSlicePool/New/100K-4 332.0µ 331.6µ ~ 0.400
ReorgOptimizations/NodeFlags/Old/100K-4 50.25µ 49.53µ ~ 0.100
ReorgOptimizations/NodeFlags/New/100K-4 16.78µ 17.16µ ~ 0.200
TxMapSetIfNotExists-4 51.77n 50.30n ~ 0.200
TxMapSetIfNotExistsDuplicate-4 43.66n 43.27n ~ 0.100
ChannelSendReceive-4 673.0n 694.3n ~ 0.100
CalcBlockWork-4 503.6n 511.4n ~ 0.100
CalculateWork-4 684.5n 681.8n ~ 1.000
BuildBlockLocatorString_Helpers/Size_10-4 1.197µ 1.052µ ~ 0.100
BuildBlockLocatorString_Helpers/Size_100-4 9.958µ 10.124µ ~ 0.100
BuildBlockLocatorString_Helpers/Size_1000-4 98.71µ 110.63µ ~ 0.100
CatchupWithHeaderCache-4 103.8m 103.8m ~ 1.000
_BufferPoolAllocation/16KB-4 3.646µ 3.884µ ~ 0.100
_BufferPoolAllocation/32KB-4 7.506µ 7.423µ ~ 0.700
_BufferPoolAllocation/64KB-4 15.84µ 18.82µ ~ 0.700
_BufferPoolAllocation/128KB-4 26.64µ 24.08µ ~ 0.400
_BufferPoolAllocation/512KB-4 115.8µ 107.0µ ~ 0.400
_BufferPoolConcurrent/32KB-4 18.52µ 17.88µ ~ 0.200
_BufferPoolConcurrent/64KB-4 27.65µ 29.99µ ~ 0.100
_BufferPoolConcurrent/512KB-4 142.1µ 154.1µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/16KB-4 609.3µ 652.9µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/32KB-4 611.1µ 647.3µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/64KB-4 616.0µ 657.8µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/128KB-4 595.3µ 655.1µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/512KB-4 612.0µ 645.8µ ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/16KB-4 36.74m 36.39m ~ 0.200
_SubtreeDataDeserializationWithBufferSizes/32KB-4 36.45m 36.52m ~ 1.000
_SubtreeDataDeserializationWithBufferSizes/64KB-4 36.46m 36.62m ~ 0.700
_SubtreeDataDeserializationWithBufferSizes/128KB-4 36.32m 36.12m ~ 0.700
_SubtreeDataDeserializationWithBufferSizes/512KB-4 36.47m 36.07m ~ 0.100
_PooledVsNonPooled/Pooled-4 736.3n 742.7n ~ 0.100
_PooledVsNonPooled/NonPooled-4 7.057µ 7.872µ ~ 0.100
_MemoryFootprint/Current_512KB_32concurrent-4 6.528µ 6.800µ ~ 0.100
_MemoryFootprint/Proposed_32KB_32concurrent-4 9.371µ 11.356µ ~ 0.100
_MemoryFootprint/Alternative_64KB_32concurrent-4 9.143µ 10.202µ ~ 0.100
_prepareTxsPerLevel-4 424.4m 428.3m ~ 0.400
_prepareTxsPerLevelOrdered-4 4.109m 4.556m ~ 0.700
_prepareTxsPerLevel_Comparison/Original-4 422.8m 428.0m ~ 0.200
_prepareTxsPerLevel_Comparison/Optimized-4 3.930m 3.904m ~ 0.400
SubtreeSizes/10k_tx_4_per_subtree-4 1.472m 1.402m ~ 0.100
SubtreeSizes/10k_tx_16_per_subtree-4 334.4µ 326.9µ ~ 0.700
SubtreeSizes/10k_tx_64_per_subtree-4 80.85µ 79.19µ ~ 0.100
SubtreeSizes/10k_tx_256_per_subtree-4 19.94µ 19.39µ ~ 0.100
SubtreeSizes/10k_tx_512_per_subtree-4 9.810µ 9.596µ ~ 0.100
SubtreeSizes/10k_tx_1024_per_subtree-4 4.903µ 4.743µ ~ 0.100
SubtreeSizes/10k_tx_2k_per_subtree-4 2.431µ 2.375µ ~ 0.100
BlockSizeScaling/10k_tx_64_per_subtree-4 77.38µ 76.68µ ~ 0.400
BlockSizeScaling/10k_tx_256_per_subtree-4 19.56µ 19.13µ ~ 0.400
BlockSizeScaling/10k_tx_1024_per_subtree-4 4.826µ 4.729µ ~ 0.100
BlockSizeScaling/50k_tx_64_per_subtree-4 402.7µ 399.4µ ~ 0.400
BlockSizeScaling/50k_tx_256_per_subtree-4 97.27µ 94.03µ ~ 0.100
BlockSizeScaling/50k_tx_1024_per_subtree-4 23.98µ 23.56µ ~ 0.100
SubtreeAllocations/small_subtrees_exists_check-4 164.6µ 163.6µ ~ 0.700
SubtreeAllocations/small_subtrees_data_fetch-4 171.0µ 168.2µ ~ 0.400
SubtreeAllocations/small_subtrees_full_validation-4 340.1µ 329.1µ ~ 0.100
SubtreeAllocations/medium_subtrees_exists_check-4 9.876µ 9.554µ ~ 0.100
SubtreeAllocations/medium_subtrees_data_fetch-4 10.064µ 9.766µ ~ 0.100
SubtreeAllocations/medium_subtrees_full_validation-4 19.76µ 19.06µ ~ 0.100
SubtreeAllocations/large_subtrees_exists_check-4 2.336µ 2.274µ ~ 0.100
SubtreeAllocations/large_subtrees_data_fetch-4 2.459µ 2.395µ ~ 0.100
SubtreeAllocations/large_subtrees_full_validation-4 4.877µ 4.785µ ~ 0.100
StoreBlock_Sequential/BelowCSVHeight-4 322.5µ 329.1µ ~ 1.000
StoreBlock_Sequential/AboveCSVHeight-4 321.2µ 332.3µ ~ 0.200
GetUtxoHashes-4 254.9n 255.6n ~ 0.800
GetUtxoHashes_ManyOutputs-4 42.24µ 42.19µ ~ 0.700
_NewMetaDataFromBytes-4 234.3n 234.6n ~ 1.000
_Bytes-4 606.0n 601.5n ~ 0.100
_MetaBytes-4 547.5n 547.0n ~ 0.400

Threshold: >10% with p < 0.05 | Generated: 2026-05-18 09:55 UTC

@icellan icellan requested review from Copilot, freemans13 and ordishs and removed request for ordishs May 12, 2026 16:57

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR tightens the “tx mined tagging” invariants so block validation fails closed unless every non-coinbase tx is durably marked as mined in the UTXO txmeta store for the current blockID, and aligns cache behavior/logging with that goal.

Changes:

  • Tighten Aerospike SetMinedMulti error handling (KEY_NOT_FOUND is only tolerated for UnsetMined) and improve expression-mode handling of FILTERED_OUT.
  • Add a model-layer postcondition check to detect coverage gaps in SetMinedMulti results and fail block processing closed.
  • Fix TxMetaCache.SetMinedMulti to delegate to the underlying store and evict from cache; adjust block-validation logging on failure.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
stores/utxo/mock.go Extends mock to allow function-based return values for realistic blockIDsMap synthesis in tests.
stores/utxo/Interface.go Documents a stronger SetMinedMulti postcondition/contract.
stores/utxo/aerospike/set_mined.go Makes KEY_NOT_FOUND a hard error for normal set-mined (no-op only for unset-mined).
stores/utxo/aerospike/set_mined_expressions.go Treats FILTERED_OUT as covered by synthesizing a map entry.
stores/utxo/aerospike/set_mined_expressions_test.go Adds coverage for parsing an explicitly empty blockIDs bin.
stores/utxo/aerospike/refactored_methods_test.go Adds regression coverage for KEY_NOT_FOUND handling behavior.
stores/txmetacache/txmetacache.go Ensures SetMinedMulti delegates to store then evicts cache entries.
stores/txmetacache/txmetacache_test.go Adds tests proving delegation/eviction and store-error short-circuiting.
services/blockvalidation/BlockValidation.go Prevents success log emission when setTxMinedStatus fails and block is requeued.
model/update-tx-mined.go Adds coverage-gap verification logic after SetMinedMulti.
model/update-tx-mined_test.go Updates existing tests and adds new subtests for coverage-gap erroring and unset-mined tolerance.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread model/update-tx-mined.go
Comment thread stores/utxo/aerospike/set_mined_expressions.go
Comment thread stores/utxo/Interface.go
Comment thread stores/utxo/aerospike/set_mined_expressions_test.go Outdated
Comment thread stores/txmetacache/txmetacache.go

@oskarszoon oskarszoon left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 5-layer defense-in-depth shape is right and the fix correctly transforms a silent postcondition violation into a surfaced error. Store returns full blockIDs map → checkBatchResults walks coverage per submitted hash → cache pass-through (clean) → setTxMinedStatus surfaces as ProcessingError → setMinedChan retry loop. No peer-controllable DoS path; SQL and Lua backends produce equivalent semantics; cache is pure forwarding.

One operational blocker before merge:

Unbounded 1Hz retry loop on setMinedChan. BlockValidation.go:561-567 retries forever at 1-second intervals on any non-ErrBlockNotFound error. After this PR lands, a node carrying historical-corrupt block_ids data (records that lost a blockID via a prior Lua signal-handling crash, a pre-PR-4561-class bug, or an Aerospike eviction race) will hit the new coverage check on first restart and spin at ~86,400 ProcessingError log lines per day per affected block, with no way to drain the channel and no metric for operators to see "I'm stuck on this block" until log volume catches their eye. The fix is exactly the class of bug this PR is designed to surface, but the response shape is too aggressive for an unrecoverable error. Three required changes:

  • Add a max-retry counter on setMinedChan re-queues (e.g. 10 attempts with exponential backoff 1s→16s), then log WARN + drop with a manual_intervention_required metric/log marker.
  • Add block_validation_setmined_retry_total{blockhash} Prometheus counter so the stuck state is visible in dashboards.
  • Document the new failure mode in docs/topics/services/blockValidation.md so on-call understand "this loop is normal during corrupt-state recovery; here's how to escalate".

Pre-deploy operator coordination separately: worth a one-shot diagnostic scan across block_ids for recently-processed blocks before flipping this on for any environment with historical state. If drift exists, repair-then-deploy.

Non-gating follow-ups: differentiate setMinedErrorCount into typed buckets (storage / conversion / coverage) so the retry loop can apply policy; push the coverage invariant down into Aerospike Lua + SQL for atomic enforcement; add a fixture-driven table test for checkBatchResults exercising mixed-mode batch results.

Happy to re-approve once the bounded retry + Prometheus counter are in.

ordishs added 4 commits May 14, 2026 15:19
UpdateTxMinedStatus lumped two distinct failures under a single
setMinedErrorCount, surfacing both as "failed to set mined status for
N batches". An operator triaging a stalled block could not tell a
store I/O blip from a postcondition violation (SetMinedMulti returned
nil error but did not durably tag every submitted tx).

Track them in separate atomics — setMinedErrorCount for I/O failures,
coverageGapCount for postcondition violations — and switch on the two
counts at the end to emit a cause-specific error. The combined-failure
variant reports both counts so a mixed batch is still legible.

Adds three subtests pinning each error variant (pure I/O, pure
coverage, combined).
…_FOUND behavior

processBatchResultsForSetMinedExpressions synthesizes the result-map
entry on FILTERED_OUT (proof, from the filter condition, that the
current blockID is already in the durable list) and treats it as a
successful idempotent retry. There was no direct test covering that
branch, so a regression that dropped the synthesis or changed it to an
error would only have surfaced as a coverage-gap downstream.

Adds a focused test that drives the function with a fake BatchWrite
returning FILTERED_OUT and asserts the synthesized [blockID] slice.
Adds a paired regression guard confirming KEY_NOT_FOUND on the same
path remains a hard error and does not synthesize a map entry.
The SQL backend (postgres, sqlite, sqlitememory) silently dropped any
submitted hash that was absent from the transactions table: existence
was probed up front, the INSERT/SELECT path then operated only on the
intersection, and the missing hashes never appeared in the returned
map. The Interface.go postcondition added in PR-850 says the opposite
— for normal set-mined every submitted hash MUST be in the result.

The model layer's coverage check already caught this (the hash showed
up as a coverage gap and the block failed closed on retry), so the
visible behavior was correct. But the contract was wrong, and a caller
outside UpdateTxMinedStatus (or a future refactor that thins the
model-layer guard) would re-introduce the silent drop.

Bring SQL into parity with Aerospike's KEY_NOT_FOUND handling: when
!UnsetMined and any submitted hash is missing from transactions,
return TxNotFoundError and let the deferred rollback fire. UnsetMined
still tolerates missing rows (the tx is already gone).

Updates TestStore_SetMinedMultiChunk_NoExistingTransactions (which
previously pinned the silent-drop) to expect the new error, and adds
two new tests: one for UnsetMined=true tolerance, one for the
partial-miss case where some hashes exist and some don't.
@github-actions

Copy link
Copy Markdown
Contributor

Reviewed the Aerospike expression path postcondition enforcement:

Issue: processBatchResultsForSetMinedExpressions (stores/utxo/aerospike/set_mined_expressions.go:328-464) lacks postcondition validation. When batchRec.Record == nil || batchRec.Record.Bins == nil (lines 383-384), the hash is silently dropped from the result map without error tracking.

Impact: Violates the SetMinedMulti interface contract (stores/utxo/Interface.go:261-264) which requires every submitted hash to appear in the returned map with the current blockID when \!UnsetMined.

Suggested fix: Add postcondition check before line 463 return:

if \!minedBlockInfo.UnsetMined {
  for _, h := range hashes {
    blockIDs, exists := blockIDs[*h]
    if \!exists || \!containsBlockID(blockIDs, minedBlockInfo.BlockID) {
      errs = errors.Join(errs, errors.NewTxNotFoundError("coverage gap", h.String()))
      nrErrors++
    }
  }
}

Note: The model layer defensive check (model/update-tx-mined.go:289-297) catches this, but store implementations should enforce their own contracts per the interface docstring.

@ordishs ordishs left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving on the fix correctness — the 5-layer defense-in-depth is sound, backend parity (SQL ↔ Aerospike) is in place, and the test matrix covers the important paths (empty map, partial omit, FILTERED_OUT synthesis, KEY_NOT_FOUND as hard error, mixed I/O + coverage error reporting). I built the branch, ran go vet, and ran the affected unit tests including -race on model — all green.

One operational concern worth a follow-up (not gating my approval, but flagging it before deploy):

The unbounded 1-second retry on setMinedChan (BlockValidation.go:553-567) is a pre-existing pattern, but this PR materially changes its risk profile. Before: failures were transient I/O. After: failures include "this block's historical state can never satisfy the new postcondition" — genuinely unrecoverable through retry alone. A node carrying any pre-existing corrupt block_ids data will spin at ~86,400 log lines/day per affected block on first restart, with no metric for operators to spot it.

Worth doing before this hits any environment with historical state:

  • Bounded retries on setMinedChan re-queue (e.g. exponential 1s→16s, drop after N attempts with a manual_intervention_required log marker).
  • block_validation_setmined_retry_total{blockhash} Prometheus counter for dashboard visibility.
  • A short note in docs/topics/services/blockValidation.md describing the new failure class for on-call.
  • Pre-deploy: a one-shot diagnostic scan over recent block_ids to catch any drift before flipping the stricter check on.

Happy to land this as-is if the retry hardening is queued as a follow-up PR with a clear owner.

Minor housekeeping for the description before merge: the TxMetaCache.SetMinedMulti blurb still says "cache failure after a successful store write returns an error" — the code now evicts via Delete, which can't fail. The "Batch-coverage postcondition (both Aerospike paths)" line is also slightly misleading; the postcondition is enforced at the model layer, with the Aerospike paths contributing the inputs (KEY_NOT_FOUND as hard error, FILTERED_OUT synthesis). Doesn't change correctness, but easier on future readers.

icellan and others added 4 commits May 15, 2026 17:44
Align trailing comments on the new setMinedErrorCount / coverageGapCount
declarations with the longer oldBlockIDs line so the gci formatter is
happy. Pure whitespace, no behavior change.
…xMetaCache

The SetMinedMulti interface contract added in this PR promises that, on a
nil error with !UnsetMined, every submitted hash is present in the result
map and every slice contains minedBlockInfo.BlockID. SQL already enforced
this; Aerospike and TxMetaCache silently passed through coverage gaps.

Aerospike: add an end-of-loop postcondition check in
processBatchResultsForSetMinedExpressions that promotes uncovered hashes
to a NewTxNotFoundError (or NewProcessingError when the slice exists but
lacks the current blockID). Covers the nil-record/bins path and the
empty-BlockIDs-bin path that previously returned nil. UnsetMined still
tolerates gaps per the interface contract.

TxMetaCache: re-verify coverage in SetMinedMulti before evicting cache
entries so a regression in any wrapped backend cannot slip through the
cache layer. Route SetMined through SetMinedMulti to share the check.

Tests pin the new behaviour:
- Aerospike: CoverageGap_NilRecord, CoverageGap_EmptyBlockIDs,
  UnsetMinedToleratesGap.
- TxMetaCache: PostconditionMissingHash, PostconditionMissingBlockID,
  UnsetMinedToleratesGap.
- Stale comment on TestParseSetMinedState_ExplicitEmptyList rewritten to
  reflect the new store-side enforcement.
@sonarqubecloud

Copy link
Copy Markdown

@oskarszoon oskarszoon left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All three blockers from the previous review are closed:

  • Bounded retry on setMinedChan: setMinedMaxRetries=10 with exponential backoff 1s→16s and a manual_intervention_required log marker on final drop (99098a1fc).
  • Prometheus counters: setmined_retry_total{blockhash} and setmined_drops_total{blockhash} in metrics.go.
  • Operator-runbook section in docs/topics/services/blockValidation.md covering the new failure class, backoff table, counters, and repair procedure.

TestSetMinedRetryBackoff pins the 111s total budget assertion correctly. PR description is updated. ordishs's typed-error-bucket split (e3d35e68f) and SQL postcondition (a941c7475) are clean; icellan's routing of SetMined through SetMinedMulti and the TxMetaCache defensive check in 05aed56ec are correct.

Two non-blocking follow-ups worth a separate ticket:

  1. The non-expressions Aerospike path (processBatchResultsForSetMined) has no end-of-loop coverage check, unlike the expressions path. Covered downstream by the TxMetaCache and model-layer postcondition checks, so safe, but the asymmetry between the two batch paths is worth closing.
  2. time.Sleep(backoff) blocks the single setMinedChan worker for up to 16s per retry. Operationally acceptable given retries should be rare, but worth a one-line note in the runbook ("during recovery the setMined worker is single-threaded; concurrent stuck blocks serialise").

@icellan icellan merged commit fe3e73a into main May 18, 2026
30 checks passed
@icellan icellan deleted the fix/set-tx-mined-invariant branch May 18, 2026 10:12
icellan added a commit that referenced this pull request May 18, 2026
Conflict: main's #850 (setMinedMulti coverage invariant) reworked
handleBatchRecordError to branch on a new unsetMined flag, while the PR
side had added a batchRecord parameter for richer diagnostic logging.
Merged the two: signature is now
  handleBatchRecordError(batchRecord, err, hash, unsetMined)
- KEY_NOT_FOUND_ERROR returns nil under unsetMined (main's invariant) or
  TxNotFoundError otherwise.
- All other errors keep the PR's batchRecord-aware storage error message.
Single call site in set_mined.go updated; refactored_methods_test.go
calls updated to pass nil batchRecord (the unit-level tests don't need
the diagnostic context). Also retargeted main's new test-file imports
from upstream aerospike-client-go to the BSV fork, matching the rest
of the PR.
@icellan icellan self-assigned this May 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants