spend circuit breaker counts KEY_NOT_FOUND as infrastructure failure, defeats orphanage during catch-up

## Summary

The aerospike spend circuit breaker (`stores/utxo/aerospike/circuit_breaker.go`) counts `KEY_NOT_FOUND` results from `Spend` batch operations as consecutive infrastructure failures. During catch-up sync — when descendant transactions are being validated before their parents are present in the local UTXO store — this causes the breaker to trip within seconds. Once open, all `Spend` calls fast-fail with `SERVICE_UNAVAILABLE: [SPEND] circuit breaker open`, the orphanage mechanism is bypassed entirely, and sync stalls indefinitely on whatever block was being processed.

The breaker is meant to track *infrastructure* failures (timeout, network error, server memory error, max-retries-exceeded). `KEY_NOT_FOUND` is data state, not infrastructure — it's the expected signal that orphanage is supposed to handle.

## Observed

Host: `bsva-ovh-teranode-eu-1`, `/mnt/data/teranode-quickstart` deployment, mainnet sync.

Sequence (2026-05-27):

- `08:02:32` — legacy container restarted
- `08:02:39` — first `[SPEND] circuit breaker open, rejecting request` (7 seconds in). Block 950,880, 15,974 txs, peer `51.222.249.3:8333`. `prepareSubtrees` finished in 2.38s with 15,166 errors logged from `validateSubtree`.
- `08:04:54` — FSM transitioned to RUNNING for subtreevalidation, validator, blob-deletion, stores/utxo/factory, etc. Subtreevalidation logs show `Transaction X missing parent, but FSM not in RUNNING state - not adding to orphanage` for thousands of txs before this point.
- Every subsequent attempt at block 950,880 — including ones with FSM=RUNNING — hits the open breaker and exits immediately without speaking to aerospike.

Aerospike itself is healthy throughout:
- `stop_writes=false`, `hwm_breached=false`
- 456,714,549 objects, 428 GB / 1024 GB used (42%)
- `cache_read_pct=93%`
- `client_connections=181` (low, no leak)

So nothing about the actual aerospike infrastructure is failing. The breaker is closed-loop tripping on its own miscategorisation.

## The breaker chain

```
[validateSubtree][af3c…] found 15166 errors while processing subtree, added 0 to orphanage
  → [validateSubtree] failed to bless missing transaction: 507fd1f3…
    → [blessMissingTransaction][000…000/af3c…][507fd1f3…] failed to validate transaction
      → [Validate][507fd1f3…] error spending utxos
        → validator: UTXO Store spend failed for 507fd1f3…
          → UTXO_ERROR (79): error in aerospike spend (batched mode) - errors
            → SERVICE_UNAVAILABLE (50): [SPEND] circuit breaker open, rejecting request
```

The inner cause when the breaker first tripped was a flood of aerospike batched-mode results where the parent records were absent (i.e. `KEY_NOT_FOUND` for the parent outpoint key). The aerospike client returns these as per-record errors inside the batch result; the wrapper's failure counter doesn't filter them out before incrementing.

## Why this defeats orphanage

The orphanage mechanism (`processTransactionsInLevels` → `not adding to orphanage` when FSM ≠ RUNNING; otherwise queue for retry after parents arrive) exists *precisely* to handle the catch-up case where descendants are seen before their parents. If the spend circuit breaker trips on the very condition orphanage is designed to absorb, sync can never make forward progress on any block whose descendants arrive before parents — which is most blocks during IBD.

In the eu-1 incident:
1. Legacy starts processing block 950,880 immediately
2. Subtreevalidation calls `Spend` on 15,974 txs whose parents aren't yet in the store
3. Aerospike returns 15,146 KEY_NOT_FOUND results inside the batch
4. Wrapper counts these as consecutive failures
5. Breaker opens after `SpendCircuitBreakerFailureCount` (default value)
6. The remaining ~800 txs that *might* have succeeded now fail with SERVICE_UNAVAILABLE
7. Next retry: same block, same flood, breaker stays tripped through the cooldown probe

## Suggested fix

In the wrapper that translates aerospike batch results into the failure counter, filter by `ResultCode` before incrementing:

```go
// Only count infrastructure failures toward the breaker
switch resultCode {
case types.TIMEOUT, types.NETWORK_ERROR, types.SERVER_MEM_ERROR,
     types.MAX_RETRIES_EXCEEDED, types.NO_AVAILABLE_CONNECTIONS_TO_NODE,
     types.SERVER_ERROR:
    breaker.RecordFailure()
case types.KEY_NOT_FOUND, types.FILTERED_OUT, types.RECORD_NOT_FOUND:
    // data state, not infrastructure — let the caller handle (orphanage etc.)
default:
    // unknown — record cautiously
}
```

The exact set of "infrastructure" result codes should match what the breaker is documented to protect against (the comment block at the top of `circuit_breaker.go` already enumerates them — make the code match the comment).

## Related but separate

The catch-up scenario was made worse by legacy beginning block ingestion before downstream services finished their FSM transition to RUNNING — for the eu-1 incident, that gap was ~2 minutes (08:02:32 → 08:04:54). During that gap, missing-parent txs are explicitly *not* added to orphanage. The breaker tripping during this window is what got us stuck. Worth a separate issue, not in scope here.

## Reproduction

1. Restart legacy + validator/subtreevalidation/blockvalidation on a node that has UTXO data older than the chain tip (or wipe utxo-store, then point at a peer well ahead).
2. Allow legacy to start consuming blocks before downstream services FSM-transition to RUNNING.
3. Observe `[SPEND] circuit breaker open` within seconds.

Alternatively, unit test directly: feed `recordBatchResult` with N consecutive KEY_NOT_FOUND results and assert breaker stays closed.

## Verification

- [ ] New unit test: KEY_NOT_FOUND-only batch results don't increment failure counter
- [ ] Existing tests for genuine infrastructure failures (timeout, server-mem) still trip the breaker as expected
- [ ] Manual: replay the eu-1 scenario on a test deployment, confirm breaker stays closed and txs flow through orphanage instead

## Affected hosts (current incidents)

- `bsva-ovh-teranode-eu-1` — workaround was `docker compose restart legacy blockvalidation subtreevalidation` after FSM had transitioned to RUNNING; on next sync the orphanage absorbed the missing parents and block 950,880 went through

## Related

- #936 — `createUtxos` chunking (different code path, but same general class of "stalled-on-one-block" symptoms)
- #941 — connection pool sizing (a separate vector that can also indirectly trigger the breaker if pool saturation causes TIMEOUT — distinct from the KEY_NOT_FOUND case here)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

spend circuit breaker counts KEY_NOT_FOUND as infrastructure failure, defeats orphanage during catch-up #953

Summary

Observed

The breaker chain

Why this defeats orphanage

Suggested fix

Related but separate

Reproduction

Verification

Affected hosts (current incidents)

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

spend circuit breaker counts KEY_NOT_FOUND as infrastructure failure, defeats orphanage during catch-up #953

Description

Summary

Observed

The breaker chain

Why this defeats orphanage

Suggested fix

Related but separate

Reproduction

Verification

Affected hosts (current incidents)

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions