Summary
The aerospike spend circuit breaker (stores/utxo/aerospike/circuit_breaker.go) counts KEY_NOT_FOUND results from Spend batch operations as consecutive infrastructure failures. During catch-up sync — when descendant transactions are being validated before their parents are present in the local UTXO store — this causes the breaker to trip within seconds. Once open, all Spend calls fast-fail with SERVICE_UNAVAILABLE: [SPEND] circuit breaker open, the orphanage mechanism is bypassed entirely, and sync stalls indefinitely on whatever block was being processed.
The breaker is meant to track infrastructure failures (timeout, network error, server memory error, max-retries-exceeded). KEY_NOT_FOUND is data state, not infrastructure — it's the expected signal that orphanage is supposed to handle.
Observed
Host: bsva-ovh-teranode-eu-1, /mnt/data/teranode-quickstart deployment, mainnet sync.
Sequence (2026-05-27):
- 08:02:32: legacy container restarted.
- 08:02:39: first "[SPEND] circuit breaker open, rejecting request" (7 seconds in). Block 950,880, 15,974 txs, peer 51.222.249.3:8333. prepareSubtrees finished in 2.38s with 15,166 errors logged from validateSubtree.
- 08:04:54: FSM transitioned to RUNNING for subtreevalidation, validator, blob-deletion, stores/utxo/factory, etc. Before this point, subtreevalidation logs show "Transaction X missing parent, but FSM not in RUNNING state - not adding to orphanage" for thousands of txs.
- Every subsequent attempt at block 950,880, including ones with FSM=RUNNING, hits the open breaker and exits immediately without speaking to aerospike.
Aerospike itself is healthy throughout:
- stop_writes=false, hwm_breached=false
- 456,714,549 objects, 428 GB / 1024 GB used (42%)
- cache_read_pct=93%
- client_connections=181 (low, no leak)
So nothing about the actual aerospike infrastructure is failing. The breaker trips, and stays tripped, purely on its own miscategorisation of KEY_NOT_FOUND results.
The breaker chain
[validateSubtree][af3c…] found 15166 errors while processing subtree, added 0 to orphanage
→ [validateSubtree] failed to bless missing transaction: 507fd1f3…
→ [blessMissingTransaction][000…000/af3c…][507fd1f3…] failed to validate transaction
→ [Validate][507fd1f3…] error spending utxos
→ validator: UTXO Store spend failed for 507fd1f3…
→ UTXO_ERROR (79): error in aerospike spend (batched mode) - errors
→ SERVICE_UNAVAILABLE (50): [SPEND] circuit breaker open, rejecting request
The inner cause when the breaker first tripped was a flood of aerospike batched-mode results where the parent records were absent (i.e. KEY_NOT_FOUND for the parent outpoint key). The aerospike client returns these as per-record errors inside the batch result; the wrapper's failure counter doesn't filter them out before incrementing.
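The counting behaviour described above can be reproduced with a toy model. This is a minimal sketch, not the code in circuit_breaker.go: the naiveBreaker type, its threshold field and the recordBatch helper are hypothetical names, the result codes are local stand-ins for the client's, and the threshold of 5 is an arbitrary placeholder for SpendCircuitBreakerFailureCount.

```go
package main

import "fmt"

// resultCode is a local stand-in for the aerospike client's per-record
// result code. The values are illustrative only.
type resultCode int

const (
	ok resultCode = iota
	keyNotFound
	timeout
)

// naiveBreaker counts consecutive failures and opens at threshold.
// Hypothetical type that models the behaviour described in this issue.
type naiveBreaker struct {
	threshold   int
	consecutive int
	open        bool
}

// recordBatch feeds every per-record result into the counter without
// looking at what kind of error it is. This is the miscategorisation.
func (b *naiveBreaker) recordBatch(results []resultCode) {
	for _, rc := range results {
		if rc == ok {
			b.consecutive = 0
			continue
		}
		b.consecutive++
		if b.consecutive >= b.threshold {
			b.open = true
		}
	}
}

func main() {
	b := &naiveBreaker{threshold: 5} // assumed threshold, not the real default

	// A batch where every parent record is simply absent.
	batch := make([]resultCode, 100)
	for i := range batch {
		batch[i] = keyNotFound
	}

	b.recordBatch(batch)
	fmt.Println("breaker open after missing-parent batch:", b.open)
}
```

With a counter like this, a single batch of missing parents is enough to open the breaker even though the server answered every request.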
Why this defeats orphanage
The orphanage mechanism (in processTransactionsInLevels: missing-parent txs are logged as "not adding to orphanage" when FSM ≠ RUNNING, and are otherwise queued for retry after their parents arrive) exists precisely to handle the catch-up case where descendants are seen before their parents. If the spend circuit breaker trips on the very condition orphanage is designed to absorb, sync can never make forward progress on any block whose descendants arrive before parents, which is most blocks during initial block download (IBD).
In the eu-1 incident:
- Legacy starts processing block 950,880 immediately
- Subtreevalidation calls Spend on 15,974 txs whose parents aren't yet in the store
- Aerospike returns 15,146 KEY_NOT_FOUND results inside the batch
- Wrapper counts these as consecutive failures
- Breaker opens after SpendCircuitBreakerFailureCount consecutive failures (default value)
- The remaining ~800 txs that might have succeeded now fail with SERVICE_UNAVAILABLE
- Next retry: same block, same flood, breaker stays tripped through the cooldown probe
Suggested fix
In the wrapper that translates aerospike batch results into the failure counter, filter by ResultCode before incrementing:
    // Only count infrastructure failures toward the breaker.
    // Constant names follow the aerospike Go client's types package,
    // where KEY_NOT_FOUND is spelled KEY_NOT_FOUND_ERROR.
    switch resultCode {
    case types.TIMEOUT, types.NETWORK_ERROR, types.SERVER_MEM_ERROR,
        types.MAX_RETRIES_EXCEEDED, types.NO_AVAILABLE_CONNECTIONS_TO_NODE,
        types.SERVER_ERROR:
        breaker.RecordFailure()
    case types.KEY_NOT_FOUND_ERROR, types.FILTERED_OUT:
        // data state, not infrastructure: let the caller handle it (orphanage etc.)
    default:
        // unknown: record cautiously
        breaker.RecordFailure()
    }
The exact set of "infrastructure" result codes should match what the breaker is documented to protect against (the comment block at the top of circuit_breaker.go already enumerates them — make the code match the comment).
Related but separate
The catch-up scenario was made worse by legacy beginning block ingestion before downstream services finished their FSM transition to RUNNING — for the eu-1 incident, that gap was ~2 minutes (08:02:32 → 08:04:54). During that gap, missing-parent txs are explicitly not added to orphanage. The breaker tripping during this window is what got us stuck. Worth a separate issue, not in scope here.
Reproduction
- Restart legacy + validator/subtreevalidation/blockvalidation on a node that has UTXO data older than the chain tip (or wipe utxo-store, then point at a peer well ahead).
- Allow legacy to start consuming blocks before downstream services FSM-transition to RUNNING.
- Observe [SPEND] circuit breaker open within seconds.
Alternatively, unit test directly: feed recordBatchResult with N consecutive KEY_NOT_FOUND results and assert breaker stays closed.
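The unit-test idea could look roughly like the following. It is written as a small standalone program so it runs on its own; in the repository the same checks would live in a Test function using t.Fatalf. Everything here is a stand-in: the breaker type, the signature given to recordBatchResult, the result codes and the threshold of 5 are assumptions, since the real signature isn't shown in this issue.

```go
package main

import (
	"errors"
	"fmt"
)

// resultCode is a local stand-in for the client's per-record result code.
type resultCode int

const (
	ok resultCode = iota
	keyNotFound
	timeout
)

// breaker is a hypothetical consecutive-failure breaker.
type breaker struct {
	threshold   int
	consecutive int
	open        bool
}

func (b *breaker) recordFailure() {
	b.consecutive++
	if b.consecutive >= b.threshold {
		b.open = true
	}
}

func (b *breaker) recordSuccess() { b.consecutive = 0 }

// recordBatchResult is a guess at the fixed wrapper: data-state results
// are ignored, everything else that is not OK counts as a failure.
func recordBatchResult(b *breaker, results []resultCode) {
	for _, rc := range results {
		switch rc {
		case ok:
			b.recordSuccess()
		case keyNotFound:
			// data state: neither a success nor a failure for the breaker
		default:
			b.recordFailure()
		}
	}
}

// repeat builds a batch of n identical results.
func repeat(rc resultCode, n int) []resultCode {
	out := make([]resultCode, n)
	for i := range out {
		out[i] = rc
	}
	return out
}

// checkBreaker asserts the two properties the fix should have.
func checkBreaker(n int) error {
	b := &breaker{threshold: 5} // assumed threshold
	recordBatchResult(b, repeat(keyNotFound, n))
	if b.open {
		return errors.New("breaker opened on KEY_NOT_FOUND results")
	}
	recordBatchResult(b, repeat(timeout, b.threshold))
	if !b.open {
		return errors.New("breaker stayed closed on consecutive timeouts")
	}
	return nil
}

func main() {
	if err := checkBreaker(15166); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("PASS")
}
```

The second assertion matters as much as the first: it guards against a fix that stops the breaker from opening on genuine infrastructure errors. Whether a KEY_NOT_FOUND result should also reset the counter, since it proves the server responded, is a design choice this sketch leaves open.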
Verification
Affected hosts (current incidents)
- bsva-ovh-teranode-eu-1: the workaround was "docker compose restart legacy blockvalidation subtreevalidation" after FSM had transitioned to RUNNING; on the next sync the orphanage absorbed the missing parents and block 950,880 went through.
Related
- createUtxos chunking (different code path, but same general class of "stalled-on-one-block" symptoms)