Skip to content

fix(utxo/aerospike): exclude data-state errors from spend circuit breaker#957

Merged
icellan merged 4 commits into
bsv-blockchain:mainfrom
icellan:fix/spend-circuit-breaker-data-state-953
May 28, 2026
Merged

fix(utxo/aerospike): exclude data-state errors from spend circuit breaker#957
icellan merged 4 commits into
bsv-blockchain:mainfrom
icellan:fix/spend-circuit-breaker-data-state-953

Conversation

@icellan

@icellan icellan commented May 27, 2026

Copy link
Copy Markdown
Contributor

Summary

  • The spend circuit breaker was counting every per-record batch error as an infrastructure failure. During catch-up sync, aerospike returns KEY_NOT_FOUND_ERROR for descendant transactions whose parents are not yet present locally — the exact signal the orphanage is designed to absorb. With ~15k missing-parent results in a single block, the default 10-failure threshold tripped within seconds and every subsequent Spend short-circuited with SERVICE_UNAVAILABLE: [SPEND] circuit breaker open before ever reaching aerospike, stalling IBD until services were restarted.
  • Adds isInfrastructureFailure(err) in circuit_breaker.go that returns true only for an allow-list of aerospike infra ResultCodes (TIMEOUT, NETWORK_ERROR, NO_RESPONSE, MAX_RETRIES_EXCEEDED, NO_AVAILABLE_CONNECTIONS_TO_NODE, SERVER_NOT_AVAILABLE, SERVER_MEM_ERROR, SERVER_ERROR, DEVICE_OVERLOAD, BATCH_FAILED, GRPC_ERROR) plus stdlib context.DeadlineExceeded and net.Error timeouts. Gates the two RecordFailure() call sites — spend.go:handleBatchError and spend_expressions.go:processSpendBatchResultsExpressions — with it so KEY_NOT_FOUND and FILTERED_OUT no longer contribute to the failure counter.
  • Bias on unknown errors is non-infra: a false trip stalls sync, while a missed signal is recoverable by SpendWaitTimeout and health checks. The package doc-comment already documented this contract; the code now matches.
  • Updates docs/references/settings/stores/utxo_settings.md with the explicit infra trigger set so operators tuning SpendCircuitBreakerFailureCount know which errors actually increment it.

Closes #953

Test plan

  • go test -race -run 'TestIsInfrastructureFailure|TestCircuitBreaker' ./stores/utxo/aerospike/ — pass (new tests cover nil, 6 data-state codes stay non-infra, 11 infra codes including GRPC_ERROR trip, stdlib timeout / non-timeout net.Error, plain error, wrapped via Unwrap(), regression test driving 1000 KEY_NOT_FOUND_ERROR through the gated guard keeps the breaker CLOSED, and a safety-net test confirming 3 consecutive TIMEOUT still opens it)
  • make lint-new, go vet ./stores/utxo/aerospike/..., staticcheck ./stores/utxo/aerospike/ — clean
  • go build ./... — clean
  • Manual reproduction on a test node (issue's recipe): restart legacy + validator + subtreevalidation + blockvalidation against a peer ahead of local UTXO state; confirm catch-up proceeds without [SPEND] circuit breaker open and orphanage counts climb instead of stalling
  • Manual: inject a real infra failure (block aerospike port) and confirm the breaker still trips as expected — safety net preserved

Out of scope

icellan added 2 commits May 27, 2026 11:38
…aker

The spend circuit breaker was counting every per-record batch error as an
infrastructure failure. During catch-up sync, aerospike returns KEY_NOT_FOUND
for descendant transactions whose parents are not yet present locally — the
exact signal the orphanage is designed to absorb. With 15k missing-parent
results in a single block, the default 10-failure threshold tripped within
seconds and every subsequent Spend short-circuited before reaching aerospike,
stalling IBD until services were restarted.

Add isInfrastructureFailure() that returns true only for a known allow-list
of aerospike infra ResultCodes (TIMEOUT, NETWORK_ERROR, NO_RESPONSE,
MAX_RETRIES_EXCEEDED, NO_AVAILABLE_CONNECTIONS_TO_NODE, SERVER_NOT_AVAILABLE,
SERVER_MEM_ERROR, SERVER_ERROR, DEVICE_OVERLOAD, BATCH_FAILED) plus stdlib
context.DeadlineExceeded and net.Error timeouts. Gate the two RecordFailure
sites (handleBatchError in spend.go, processSpendBatchResultsExpressions in
spend_expressions.go) with it so KEY_NOT_FOUND and FILTERED_OUT no longer
contribute to the failure counter.

Bias on unknown errors is non-infra: a false trip stalls sync, while a
missed signal is recoverable by higher-level timeouts and health checks.
The package doc-comment already documented this contract; this change
makes the code match.

Closes bsv-blockchain#953
…GRPC_ERROR

- Document the infra-only failure-counter contract under
  SpendCircuitBreaker* in utxo_settings.md so operators tuning the
  threshold know which errors actually increment it.
- Append the same note to the circuit_breaker.go package doc-comment
  pointing to isInfrastructureFailure for the exact allow-list.
- Include types.GRPC_ERROR in the infra allow-list and matching test
  for future-proofing against aerospike gRPC mode.
- Drop the dead `_ = errors.NewStorageError` stub and unused teranode
  errors import in the test file.
@icellan icellan requested a review from Copilot May 27, 2026 10:28
@github-actions

github-actions Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor

🤖 Claude Code Review

Status: Complete


Current Review: No issues found.

This PR correctly fixes issue #953 by preventing data-state errors (KEY_NOT_FOUND_ERROR, FILTERED_OUT) from tripping the spend circuit breaker during catch-up sync.

Implementation Quality:

  • Properly uses aerospike.Error interface to match both regular errors and const sentinels
  • Comprehensive test coverage including regression tests and edge cases
  • Clear documentation of infrastructure error allowlist in both code and settings docs
  • Correctly gates both call sites (spend.go:744 and spend_expressions.go:419-420)

Previous Issues (Now Resolved):

  • ✅ Copilot comment on circuit_breaker.go:93 - Fixed by using interface type instead of concrete type
  • ✅ Copilot comment on utxo_settings.md:34 - Fixed by moving section after table ends

The changes are minimal, well-tested, and correctly address the root cause without introducing new risks.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR narrows spend circuit breaker failure accounting so expected Aerospike data-state errors during catch-up sync do not trip the breaker and block orphanage recovery.

Changes:

  • Adds isInfrastructureFailure classification for spend breaker failures.
  • Gates Lua and expression spend breaker failure recording on the new classifier.
  • Adds regression/safety tests and documents the intended trigger set.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
stores/utxo/aerospike/circuit_breaker.go Adds infrastructure-error classifier and updates breaker documentation.
stores/utxo/aerospike/spend.go Gates Lua batch error breaker recording on infrastructure failures only.
stores/utxo/aerospike/spend_expressions.go Tracks infra errors separately from data-state spend errors.
stores/utxo/aerospike/circuit_breaker_test.go Adds classifier and breaker regression tests.
docs/references/settings/stores/utxo_settings.md Documents spend circuit breaker trigger semantics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread stores/utxo/aerospike/circuit_breaker.go Outdated
Comment thread docs/references/settings/stores/utxo_settings.md Outdated
@github-actions

github-actions Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-957 (a0b5d7e)

Summary

  • Regressions: 0
  • Improvements: 0
  • Unchanged: 144
  • Significance level: p < 0.05
All benchmark results (sec/op)
Benchmark Baseline Current Change p-value
_NewBlockFromBytes-4 1.752µ 1.761µ ~ 0.100
SplitSyncedParentMap_SetIfNotExists/256_buckets-4 61.54n 61.54n ~ 0.800
SplitSyncedParentMap_SetIfNotExists/16_buckets-4 61.49n 61.59n ~ 0.200
SplitSyncedParentMap_SetIfNotExists/1_bucket-4 61.49n 61.36n ~ 0.300
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets... 29.83n 29.59n ~ 0.400
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_... 50.99n 50.50n ~ 0.700
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa... 107.3n 107.8n ~ 0.200
MiningCandidate_Stringify_Short-4 258.7n 257.6n ~ 0.500
MiningCandidate_Stringify_Long-4 1.895µ 1.897µ ~ 1.000
MiningSolution_Stringify-4 965.3n 966.4n ~ 0.700
BlockInfo_MarshalJSON-4 1.792µ 1.792µ ~ 0.700
NewFromBytes-4 129.4n 132.3n ~ 0.200
AddTxBatchColumnar_Validation-4 2.562µ 2.462µ ~ 0.400
OffsetValidationLoop-4 640.2n 642.4n ~ 0.200
Mine_EasyDifficulty-4 52.49µ 52.38µ ~ 0.700
Mine_WithAddress-4 5.829µ 5.831µ ~ 1.000
DirectSubtreeAdd/4_per_subtree-4 57.50n 57.90n ~ 1.000
DirectSubtreeAdd/64_per_subtree-4 31.49n 31.78n ~ 0.100
DirectSubtreeAdd/256_per_subtree-4 30.32n 30.96n ~ 0.700
DirectSubtreeAdd/1024_per_subtree-4 29.02n 29.35n ~ 0.100
DirectSubtreeAdd/2048_per_subtree-4 28.60n 29.00n ~ 0.100
SubtreeProcessorAdd/4_per_subtree-4 288.0n 281.5n ~ 0.700
SubtreeProcessorAdd/64_per_subtree-4 277.8n 283.7n ~ 0.700
SubtreeProcessorAdd/256_per_subtree-4 279.6n 282.0n ~ 0.400
SubtreeProcessorAdd/1024_per_subtree-4 276.2n 270.5n ~ 0.200
SubtreeProcessorAdd/2048_per_subtree-4 272.8n 271.6n ~ 0.400
SubtreeProcessorRotate/4_per_subtree-4 276.5n 277.7n ~ 1.000
SubtreeProcessorRotate/64_per_subtree-4 275.4n 276.4n ~ 1.000
SubtreeProcessorRotate/256_per_subtree-4 275.6n 277.2n ~ 0.800
SubtreeProcessorRotate/1024_per_subtree-4 278.3n 277.6n ~ 1.000
SubtreeNodeAddOnly/4_per_subtree-4 54.48n 54.41n ~ 0.700
SubtreeNodeAddOnly/64_per_subtree-4 34.40n 34.20n ~ 0.200
SubtreeNodeAddOnly/256_per_subtree-4 33.42n 33.20n ~ 0.200
SubtreeNodeAddOnly/1024_per_subtree-4 32.53n 32.64n ~ 1.000
SubtreeCreationOnly/4_per_subtree-4 115.2n 115.7n ~ 0.600
SubtreeCreationOnly/64_per_subtree-4 407.0n 411.1n ~ 1.000
SubtreeCreationOnly/256_per_subtree-4 1.370µ 1.367µ ~ 0.400
SubtreeCreationOnly/1024_per_subtree-4 4.437µ 4.479µ ~ 0.700
SubtreeCreationOnly/2048_per_subtree-4 8.178µ 8.378µ ~ 0.200
SubtreeProcessorOverheadBreakdown/64_per_subtree-4 271.7n 279.0n ~ 0.200
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4 271.0n 272.3n ~ 0.100
ParallelGetAndSetIfNotExists/1k_nodes-4 2.201m 2.198m ~ 1.000
ParallelGetAndSetIfNotExists/10k_nodes-4 5.441m 5.569m ~ 0.400
ParallelGetAndSetIfNotExists/50k_nodes-4 7.400m 7.853m ~ 0.400
ParallelGetAndSetIfNotExists/100k_nodes-4 10.09m 10.41m ~ 0.100
SequentialGetAndSetIfNotExists/1k_nodes-4 1.945m 1.939m ~ 1.000
SequentialGetAndSetIfNotExists/10k_nodes-4 4.383m 4.343m ~ 0.100
SequentialGetAndSetIfNotExists/50k_nodes-4 12.58m 12.11m ~ 0.100
SequentialGetAndSetIfNotExists/100k_nodes-4 22.24m 21.88m ~ 0.400
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4 2.263m 2.251m ~ 0.700
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4 8.328m 8.303m ~ 1.000
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4 13.35m 13.35m ~ 1.000
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4 1.979m 1.987m ~ 1.000
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4 7.897m 7.653m ~ 1.000
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4 40.72m 42.79m ~ 0.400
DiskTxMap_SetIfNotExists-4 3.715µ 3.601µ ~ 0.400
DiskTxMap_SetIfNotExists_Parallel-4 3.346µ 3.361µ ~ 1.000
DiskTxMap_ExistenceOnly-4 311.1n 338.6n ~ 0.100
Queue-4 188.8n 188.5n ~ 0.400
AtomicPointer-4 4.428n 4.622n ~ 1.000
ReorgOptimizations/DedupFilterPipeline/Old/10K-4 879.0µ 899.3µ ~ 0.100
ReorgOptimizations/DedupFilterPipeline/New/10K-4 784.3µ 793.5µ ~ 0.200
ReorgOptimizations/AllMarkFalse/Old/10K-4 103.0µ 103.2µ ~ 0.400
ReorgOptimizations/AllMarkFalse/New/10K-4 62.21µ 61.73µ ~ 0.200
ReorgOptimizations/HashSlicePool/Old/10K-4 54.62µ 57.62µ ~ 0.200
ReorgOptimizations/HashSlicePool/New/10K-4 11.76µ 12.12µ ~ 1.000
ReorgOptimizations/NodeFlags/Old/10K-4 5.105µ 5.320µ ~ 0.400
ReorgOptimizations/NodeFlags/New/10K-4 1.918µ 2.183µ ~ 0.100
ReorgOptimizations/DedupFilterPipeline/Old/100K-4 10.15m 10.52m ~ 0.200
ReorgOptimizations/DedupFilterPipeline/New/100K-4 10.38m 10.04m ~ 0.400
ReorgOptimizations/AllMarkFalse/Old/100K-4 1.092m 1.080m ~ 0.200
ReorgOptimizations/AllMarkFalse/New/100K-4 689.8µ 680.3µ ~ 0.400
ReorgOptimizations/HashSlicePool/Old/100K-4 664.7µ 587.7µ ~ 0.100
ReorgOptimizations/HashSlicePool/New/100K-4 322.7µ 315.3µ ~ 1.000
ReorgOptimizations/NodeFlags/Old/100K-4 46.04µ 50.30µ ~ 0.200
ReorgOptimizations/NodeFlags/New/100K-4 15.02µ 17.85µ ~ 0.100
TxMapSetIfNotExists-4 52.41n 52.38n ~ 0.400
TxMapSetIfNotExistsDuplicate-4 40.13n 39.87n ~ 0.400
ChannelSendReceive-4 603.6n 633.4n ~ 0.100
BlockAssembler_AddTx-4 0.02567n 0.02823n ~ 0.700
AddNode-4 10.72 10.39 ~ 1.000
AddNodeWithMap-4 11.57 10.87 ~ 0.400
CalcBlockWork-4 481.0n 479.7n ~ 0.500
CalculateWork-4 636.1n 645.6n ~ 0.100
BuildBlockLocatorString_Helpers/Size_10-4 1.533µ 1.344µ ~ 1.000
BuildBlockLocatorString_Helpers/Size_100-4 12.75µ 12.82µ ~ 0.400
BuildBlockLocatorString_Helpers/Size_1000-4 125.9µ 126.5µ ~ 0.100
CatchupWithHeaderCache-4 104.2m 104.2m ~ 1.000
SubtreeSizes/10k_tx_4_per_subtree-4 720.7µ 721.5µ ~ 1.000
SubtreeSizes/10k_tx_16_per_subtree-4 177.0µ 177.5µ ~ 0.400
SubtreeSizes/10k_tx_64_per_subtree-4 42.12µ 43.26µ ~ 0.200
SubtreeSizes/10k_tx_256_per_subtree-4 10.59µ 10.95µ ~ 0.100
SubtreeSizes/10k_tx_512_per_subtree-4 5.223µ 5.440µ ~ 0.100
SubtreeSizes/10k_tx_1024_per_subtree-4 2.653µ 2.696µ ~ 0.700
SubtreeSizes/10k_tx_2k_per_subtree-4 1.371µ 1.347µ ~ 0.700
BlockSizeScaling/10k_tx_64_per_subtree-4 41.64µ 42.90µ ~ 0.200
BlockSizeScaling/10k_tx_256_per_subtree-4 10.74µ 10.84µ ~ 0.700
BlockSizeScaling/10k_tx_1024_per_subtree-4 2.680µ 2.701µ ~ 0.400
BlockSizeScaling/50k_tx_64_per_subtree-4 217.6µ 223.0µ ~ 0.400
BlockSizeScaling/50k_tx_256_per_subtree-4 52.84µ 54.32µ ~ 0.100
BlockSizeScaling/50k_tx_1024_per_subtree-4 13.23µ 13.06µ ~ 0.700
SubtreeAllocations/small_subtrees_exists_check-4 80.29µ 79.68µ ~ 0.700
SubtreeAllocations/small_subtrees_data_fetch-4 85.49µ 84.23µ ~ 0.100
SubtreeAllocations/small_subtrees_full_validation-4 179.7µ 180.3µ ~ 0.400
SubtreeAllocations/medium_subtrees_exists_check-4 5.304µ 5.144µ ~ 0.100
SubtreeAllocations/medium_subtrees_data_fetch-4 5.517µ 5.305µ ~ 0.100
SubtreeAllocations/medium_subtrees_full_validation-4 10.84µ 10.70µ ~ 0.200
SubtreeAllocations/large_subtrees_exists_check-4 1.277µ 1.261µ ~ 0.100
SubtreeAllocations/large_subtrees_data_fetch-4 1.367µ 1.296µ ~ 0.100
SubtreeAllocations/large_subtrees_full_validation-4 2.709µ 2.722µ ~ 1.000
_prepareTxsPerLevel-4 395.8m 393.7m ~ 1.000
_prepareTxsPerLevelOrdered-4 3.410m 3.527m ~ 0.100
_prepareTxsPerLevel_Comparison/Original-4 390.3m 396.3m ~ 0.100
_prepareTxsPerLevel_Comparison/Optimized-4 3.472m 3.429m ~ 0.400
_BufferPoolAllocation/16KB-4 4.055µ 4.017µ ~ 0.800
_BufferPoolAllocation/32KB-4 12.228µ 7.820µ ~ 0.100
_BufferPoolAllocation/64KB-4 17.66µ 19.49µ ~ 0.700
_BufferPoolAllocation/128KB-4 32.09µ 28.94µ ~ 0.100
_BufferPoolAllocation/512KB-4 144.9µ 143.4µ ~ 0.400
_BufferPoolConcurrent/32KB-4 22.38µ 20.56µ ~ 0.100
_BufferPoolConcurrent/64KB-4 37.39µ 32.16µ ~ 0.100
_BufferPoolConcurrent/512KB-4 176.2µ 181.1µ ~ 0.400
_SubtreeDeserializationWithBufferSizes/16KB-4 751.6µ 776.8µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/32KB-4 729.7µ 755.5µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/64KB-4 735.1µ 753.4µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/128KB-4 728.7µ 761.4µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/512KB-4 759.9µ 735.4µ ~ 0.200
_SubtreeDataDeserializationWithBufferSizes/16KB-4 39.97m 39.14m ~ 0.200
_SubtreeDataDeserializationWithBufferSizes/32KB-4 39.45m 39.47m ~ 1.000
_SubtreeDataDeserializationWithBufferSizes/64KB-4 39.59m 39.33m ~ 0.700
_SubtreeDataDeserializationWithBufferSizes/128KB-4 39.31m 39.18m ~ 0.400
_SubtreeDataDeserializationWithBufferSizes/512KB-4 39.21m 39.31m ~ 0.400
_PooledVsNonPooled/Pooled-4 830.0n 832.7n ~ 0.700
_PooledVsNonPooled/NonPooled-4 7.822µ 7.739µ ~ 0.400
_MemoryFootprint/Current_512KB_32concurrent-4 8.889µ 9.470µ ~ 0.700
_MemoryFootprint/Proposed_32KB_32concurrent-4 12.07µ 12.82µ ~ 0.100
_MemoryFootprint/Alternative_64KB_32concurrent-4 11.74µ 11.92µ ~ 0.200
StoreBlock_Sequential/BelowCSVHeight-4 251.8µ 251.9µ ~ 0.700
StoreBlock_Sequential/AboveCSVHeight-4 253.6µ 254.0µ ~ 1.000
GetUtxoHashes-4 259.6n 255.4n ~ 0.700
GetUtxoHashes_ManyOutputs-4 42.59µ 44.41µ ~ 0.200
_NewMetaDataFromBytes-4 225.6n 227.2n ~ 0.200
_Bytes-4 397.7n 402.6n ~ 0.200
_MetaBytes-4 136.1n 136.4n ~ 1.000

Threshold: >10% with p < 0.05 | Generated: 2026-05-28 09:16 UTC

@oskarszoon oskarszoon left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve. The catch-up-trips-breaker bug is correctly diagnosed and the allow-list approach is the right shape. Bias toward non-infra-on-unknown matches the documented contract — false trip stalls sync; missed signal recoverable via SpendWaitTimeout + health checks. 37 tests -race clean.

Worth adding while you're here

Three Aerospike infra ResultCodes are missing from the allow-list. All node-level failure modes; currently they fall through to the default: return false branch and silently fail to trip the breaker:

  • MAX_ERROR_RATE (-15) — node error rate ceiling exceeded
  • PARTITION_UNAVAILABLE (11) — partition unreachable during rebalance or node loss
  • INVALID_NODE_ERROR (-3) — chosen node no longer active

Either add them, or add an explicit comment in isInfrastructureFailure documenting why they're deliberately excluded.

Follow-up (not blocking)

isInfrastructureFailure's default: return false branch is silent. A future Aerospike client version adding a ResultCode for a real infra condition would be invisible — the breaker wouldn't trip and operators wouldn't know. Adding a Warnf("unrecognised aerospike error: %v", err) at the unknown branch would surface them. The function is package-level with no logger; would need threading a logger in or converting to a method.

icellan and others added 2 commits May 27, 2026 18:38
…es (PR bsv-blockchain#957 review)

Address Copilot bot and oskarszoon review feedback on PR bsv-blockchain#957:

- Match the aerospike.Error interface instead of *AerospikeError so the
  client's const-sentinel errors (*constAerospikeError: ErrTimeout,
  ErrNetwork, ErrMaxRetriesExceeded, ErrConnectionPoolEmpty, etc.) are
  classified consistently. The previous *AerospikeError type assertion
  silently treated those per-record sentinels as non-infra — same class of
  bug as send_store_batch_test.go:128-140 documents. Use aerospike.Error's
  Matches() method, which is the documented public API.
- Add three node-level infra ResultCodes that were missing from the
  allow-list: MAX_ERROR_RATE (-15), PARTITION_UNAVAILABLE (11),
  INVALID_NODE_ERROR (-3).
- Move the trigger-set documentation out of the middle of the settings
  table in utxo_settings.md (the inserted heading + paragraph terminated
  the table and broke rendering of subsequent rows) to its own section
  after the table.
- Add ConstSentinelErrorsAreClassified test covering both infra
  (ErrTimeout/Network/MaxRetriesExceeded/ConnectionPoolEmpty) and
  data-state (ErrKeyNotFound/FilteredOut) sentinels.
@sonarqubecloud

Copy link
Copy Markdown

@ordishs ordishs left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve. Correct fix, well-scoped, well-tested. The allow-list + bias-toward-false-negatives is the right design for the IBD-critical path.

Follow-ups before merge

  1. PR description drift. The body enumerates 11 infra codes; the implementation ships 14 (MAX_ERROR_RATE, INVALID_NODE_ERROR, PARTITION_UNAVAILABLE are also classified as infra). Docs file is already correct — please sync the PR description.
  2. Manual reproduction. The two unchecked items in the test plan (catch-up sync on a real node + real-infra-failure injection) are the behavioural validations for #953. Please confirm both were exercised before merge.

Optional / non-blocking

  • BATCH_FAILED worth a quick sanity check — confirm the client only emits it for cluster-level batch-dispatch failure, not as a wholesale code when all per-record results are data-state. If the latter is possible, the gate re-enters the failure mode at the wrapper level.
  • The regression test simulates the gated call site rather than exercising processSpendBatchResultsExpressions end-to-end. Consider one integration-style test that drives the function with a fabricated KEY_NOT_FOUND batch record so the gate placement is locked in against future refactors.
  • MAX_ERROR_RATE is per-node and self-recovers; including it is defensible but a flaky single node can briefly open the breaker. Flag for ops awareness.

@icellan icellan merged commit 92db8ca into bsv-blockchain:main May 28, 2026
25 checks passed
@icellan icellan self-assigned this May 28, 2026
icellan added a commit that referenced this pull request May 28, 2026
PR #957 landed on main using the stock aerospike-client-go/v8 import
path. pr-828's whole utxo/aerospike package uses the BSV fork
(github.com/bsv-blockchain/aerospike-client-go/v8), which adds the
TeranodeModifyOp wire opcode that the native-ops path needs. Mixing
stock and fork in the same package would break type compatibility
across files.

Same surgical edit as was applied to spend_expressions.go during the
prior conflict resolution. The BSV fork is API-superset of stock for
everything circuit_breaker.go touches, so the change is import-path-only.
icellan added a commit to icellan/teranode that referenced this pull request May 28, 2026
stores/utxo/aerospike/circuit_breaker.go and circuit_breaker_test.go
landed on main (PR bsv-blockchain#957) after this branch forked, and the subsequent
merge brought them in with the upstream aerospike-client-go/v8 import
path. Apply the same rename to keep the module-path swap consistent.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

spend circuit breaker counts KEY_NOT_FOUND as infrastructure failure, defeats orphanage during catch-up

4 participants