fix(utxo/aerospike): exclude data-state errors from spend circuit breaker#957
Conversation
…aker The spend circuit breaker was counting every per-record batch error as an infrastructure failure. During catch-up sync, aerospike returns KEY_NOT_FOUND for descendant transactions whose parents are not yet present locally — the exact signal the orphanage is designed to absorb. With 15k missing-parent results in a single block, the default 10-failure threshold tripped within seconds and every subsequent Spend short-circuited before reaching aerospike, stalling IBD until services were restarted. Add isInfrastructureFailure() that returns true only for a known allow-list of aerospike infra ResultCodes (TIMEOUT, NETWORK_ERROR, NO_RESPONSE, MAX_RETRIES_EXCEEDED, NO_AVAILABLE_CONNECTIONS_TO_NODE, SERVER_NOT_AVAILABLE, SERVER_MEM_ERROR, SERVER_ERROR, DEVICE_OVERLOAD, BATCH_FAILED) plus stdlib context.DeadlineExceeded and net.Error timeouts. Gate the two RecordFailure sites (handleBatchError in spend.go, processSpendBatchResultsExpressions in spend_expressions.go) with it so KEY_NOT_FOUND and FILTERED_OUT no longer contribute to the failure counter. Bias on unknown errors is non-infra: a false trip stalls sync, while a missed signal is recoverable by higher-level timeouts and health checks. The package doc-comment already documented this contract; this change makes the code match. Closes bsv-blockchain#953
…GRPC_ERROR - Document the infra-only failure-counter contract under SpendCircuitBreaker* in utxo_settings.md so operators tuning the threshold know which errors actually increment it. - Append the same note to the circuit_breaker.go package doc-comment pointing to isInfrastructureFailure for the exact allow-list. - Include types.GRPC_ERROR in the infra allow-list and matching test for future-proofing against aerospike gRPC mode. - Drop the dead `_ = errors.NewStorageError` stub and unused teranode errors import in the test file.
|
🤖 Claude Code Review Status: Complete Current Review: No issues found. This PR correctly fixes issue #953 by preventing data-state errors (KEY_NOT_FOUND_ERROR, FILTERED_OUT) from tripping the spend circuit breaker during catch-up sync. Implementation Quality:
Previous Issues (Now Resolved):
The changes are minimal, well-tested, and correctly address the root cause without introducing new risks. |
There was a problem hiding this comment.
Pull request overview
This PR narrows spend circuit breaker failure accounting so expected Aerospike data-state errors during catch-up sync do not trip the breaker and block orphanage recovery.
Changes:
- Adds
isInfrastructureFailureclassification for spend breaker failures. - Gates Lua and expression spend breaker failure recording on the new classifier.
- Adds regression/safety tests and documents the intended trigger set.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
stores/utxo/aerospike/circuit_breaker.go |
Adds infrastructure-error classifier and updates breaker documentation. |
stores/utxo/aerospike/spend.go |
Gates Lua batch error breaker recording on infrastructure failures only. |
stores/utxo/aerospike/spend_expressions.go |
Tracks infra errors separately from data-state spend errors. |
stores/utxo/aerospike/circuit_breaker_test.go |
Adds classifier and breaker regression tests. |
docs/references/settings/stores/utxo_settings.md |
Documents spend circuit breaker trigger semantics. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-05-28 09:16 UTC |
oskarszoon
left a comment
There was a problem hiding this comment.
Approve. The catch-up-trips-breaker bug is correctly diagnosed and the allow-list approach is the right shape. Bias toward non-infra-on-unknown matches the documented contract — false trip stalls sync; missed signal recoverable via SpendWaitTimeout + health checks. 37 tests -race clean.
Worth adding while you're here
Three Aerospike infra ResultCodes are missing from the allow-list. All node-level failure modes; currently they fall through to the default: return false branch and silently fail to trip the breaker:
MAX_ERROR_RATE(-15) — node error rate ceiling exceededPARTITION_UNAVAILABLE(11) — partition unreachable during rebalance or node lossINVALID_NODE_ERROR(-3) — chosen node no longer active
Either add them, or add an explicit comment in isInfrastructureFailure documenting why they're deliberately excluded.
Follow-up (not blocking)
isInfrastructureFailure's default: return false branch is silent. A future Aerospike client version adding a ResultCode for a real infra condition would be invisible — the breaker wouldn't trip and operators wouldn't know. Adding a Warnf("unrecognised aerospike error: %v", err) at the unknown branch would surface them. The function is package-level with no logger; would need threading a logger in or converting to a method.
…es (PR bsv-blockchain#957 review) Address Copilot bot and oskarszoon review feedback on PR bsv-blockchain#957: - Match the aerospike.Error interface instead of *AerospikeError so the client's const-sentinel errors (*constAerospikeError: ErrTimeout, ErrNetwork, ErrMaxRetriesExceeded, ErrConnectionPoolEmpty, etc.) are classified consistently. The previous *AerospikeError type assertion silently treated those per-record sentinels as non-infra — same class of bug as send_store_batch_test.go:128-140 documents. Use aerospike.Error's Matches() method, which is the documented public API. - Add three node-level infra ResultCodes that were missing from the allow-list: MAX_ERROR_RATE (-15), PARTITION_UNAVAILABLE (11), INVALID_NODE_ERROR (-3). - Move the trigger-set documentation out of the middle of the settings table in utxo_settings.md (the inserted heading + paragraph terminated the table and broke rendering of subsequent rows) to its own section after the table. - Add ConstSentinelErrorsAreClassified test covering both infra (ErrTimeout/Network/MaxRetriesExceeded/ConnectionPoolEmpty) and data-state (ErrKeyNotFound/FilteredOut) sentinels.
|
ordishs
left a comment
There was a problem hiding this comment.
Approve. Correct fix, well-scoped, well-tested. The allow-list + bias-toward-false-negatives is the right design for the IBD-critical path.
Follow-ups before merge
- PR description drift. The body enumerates 11 infra codes; the implementation ships 14 (
MAX_ERROR_RATE,INVALID_NODE_ERROR,PARTITION_UNAVAILABLEare also classified as infra). Docs file is already correct — please sync the PR description. - Manual reproduction. The two unchecked items in the test plan (catch-up sync on a real node + real-infra-failure injection) are the behavioural validations for #953. Please confirm both were exercised before merge.
Optional / non-blocking
BATCH_FAILEDworth a quick sanity check — confirm the client only emits it for cluster-level batch-dispatch failure, not as a wholesale code when all per-record results are data-state. If the latter is possible, the gate re-enters the failure mode at the wrapper level.- The regression test simulates the gated call site rather than exercising
processSpendBatchResultsExpressionsend-to-end. Consider one integration-style test that drives the function with a fabricatedKEY_NOT_FOUNDbatch record so the gate placement is locked in against future refactors. MAX_ERROR_RATEis per-node and self-recovers; including it is defensible but a flaky single node can briefly open the breaker. Flag for ops awareness.
PR #957 landed on main using the stock aerospike-client-go/v8 import path. pr-828's whole utxo/aerospike package uses the BSV fork (github.com/bsv-blockchain/aerospike-client-go/v8), which adds the TeranodeModifyOp wire opcode that the native-ops path needs. Mixing stock and fork in the same package would break type compatibility across files. Same surgical edit as was applied to spend_expressions.go during the prior conflict resolution. The BSV fork is API-superset of stock for everything circuit_breaker.go touches, so the change is import-path-only.
stores/utxo/aerospike/circuit_breaker.go and circuit_breaker_test.go landed on main (PR bsv-blockchain#957) after this branch forked, and the subsequent merge brought them in with the upstream aerospike-client-go/v8 import path. Apply the same rename to keep the module-path swap consistent.



Summary
KEY_NOT_FOUND_ERRORfor descendant transactions whose parents are not yet present locally — the exact signal the orphanage is designed to absorb. With ~15k missing-parent results in a single block, the default 10-failure threshold tripped within seconds and every subsequentSpendshort-circuited withSERVICE_UNAVAILABLE: [SPEND] circuit breaker openbefore ever reaching aerospike, stalling IBD until services were restarted.isInfrastructureFailure(err)incircuit_breaker.gothat returns true only for an allow-list of aerospike infraResultCodes (TIMEOUT,NETWORK_ERROR,NO_RESPONSE,MAX_RETRIES_EXCEEDED,NO_AVAILABLE_CONNECTIONS_TO_NODE,SERVER_NOT_AVAILABLE,SERVER_MEM_ERROR,SERVER_ERROR,DEVICE_OVERLOAD,BATCH_FAILED,GRPC_ERROR) plus stdlibcontext.DeadlineExceededandnet.Errortimeouts. Gates the twoRecordFailure()call sites —spend.go:handleBatchErrorandspend_expressions.go:processSpendBatchResultsExpressions— with it soKEY_NOT_FOUNDandFILTERED_OUTno longer contribute to the failure counter.SpendWaitTimeoutand health checks. The package doc-comment already documented this contract; the code now matches.docs/references/settings/stores/utxo_settings.mdwith the explicit infra trigger set so operators tuningSpendCircuitBreakerFailureCountknow which errors actually increment it.Closes #953
Test plan
go test -race -run 'TestIsInfrastructureFailure|TestCircuitBreaker' ./stores/utxo/aerospike/— pass (new tests cover nil, 6 data-state codes stay non-infra, 11 infra codes includingGRPC_ERRORtrip, stdlib timeout / non-timeoutnet.Error, plain error, wrapped viaUnwrap(), regression test driving 1000KEY_NOT_FOUND_ERRORthrough the gated guard keeps the breaker CLOSED, and a safety-net test confirming 3 consecutiveTIMEOUTstill opens it)make lint-new,go vet ./stores/utxo/aerospike/...,staticcheck ./stores/utxo/aerospike/— cleango build ./...— cleanlegacy + validator + subtreevalidation + blockvalidationagainst a peer ahead of local UTXO state; confirm catch-up proceeds without[SPEND] circuit breaker openand orphanage counts climb instead of stallingOut of scope
BatchOperateerrors inexecuteSpendBatch(not currently recorded against the breaker)processTransactionsInLevelsrefuses to add to orphanage (issue calls this out as "related but separate")createUtxoschunking (legacy: createUtxos calls SetMinedMulti with unbounded slice — stalls aerospike on fat blocks (regression from #854) #936)