Skip to content

spend circuit breaker counts KEY_NOT_FOUND as infrastructure failure, defeats orphanage during catch-up #953

@oskarszoon

Description

@oskarszoon

Summary

The aerospike spend circuit breaker (stores/utxo/aerospike/circuit_breaker.go) counts KEY_NOT_FOUND results from Spend batch operations as consecutive infrastructure failures. During catch-up sync — when descendant transactions are being validated before their parents are present in the local UTXO store — this causes the breaker to trip within seconds. Once open, all Spend calls fast-fail with SERVICE_UNAVAILABLE: [SPEND] circuit breaker open, the orphanage mechanism is bypassed entirely, and sync stalls indefinitely on whatever block was being processed.

The breaker is meant to track infrastructure failures (timeout, network error, server memory error, max-retries-exceeded). KEY_NOT_FOUND is data state, not infrastructure — it's the expected signal that orphanage is supposed to handle.

Observed

Host: bsva-ovh-teranode-eu-1, /mnt/data/teranode-quickstart deployment, mainnet sync.

Sequence (2026-05-27):

  • 08:02:32 — legacy container restarted
  • 08:02:39 — first [SPEND] circuit breaker open, rejecting request (7 seconds in). Block 950,880, 15,974 txs, peer 51.222.249.3:8333. prepareSubtrees finished in 2.38s with 15,166 errors logged from validateSubtree.
  • 08:04:54 — FSM transitioned to RUNNING for subtreevalidation, validator, blob-deletion, stores/utxo/factory, etc. Subtreevalidation logs show Transaction X missing parent, but FSM not in RUNNING state - not adding to orphanage for thousands of txs before this point.
  • Every subsequent attempt at block 950,880 — including ones with FSM=RUNNING — hits the open breaker and exits immediately without speaking to aerospike.

Aerospike itself is healthy throughout:

  • stop_writes=false, hwm_breached=false
  • 456,714,549 objects, 428 GB / 1024 GB used (42%)
  • cache_read_pct=93%
  • client_connections=181 (low, no leak)

So nothing about the actual aerospike infrastructure is failing. The breaker is closed-loop tripping on its own miscategorisation.

The breaker chain

[validateSubtree][af3c…] found 15166 errors while processing subtree, added 0 to orphanage
  → [validateSubtree] failed to bless missing transaction: 507fd1f3…
    → [blessMissingTransaction][000…000/af3c…][507fd1f3…] failed to validate transaction
      → [Validate][507fd1f3…] error spending utxos
        → validator: UTXO Store spend failed for 507fd1f3…
          → UTXO_ERROR (79): error in aerospike spend (batched mode) - errors
            → SERVICE_UNAVAILABLE (50): [SPEND] circuit breaker open, rejecting request

The inner cause when the breaker first tripped was a flood of aerospike batched-mode results where the parent records were absent (i.e. KEY_NOT_FOUND for the parent outpoint key). The aerospike client returns these as per-record errors inside the batch result; the wrapper's failure counter doesn't filter them out before incrementing.

Why this defeats orphanage

The orphanage mechanism (processTransactionsInLevelsnot adding to orphanage when FSM ≠ RUNNING; otherwise queue for retry after parents arrive) exists precisely to handle the catch-up case where descendants are seen before their parents. If the spend circuit breaker trips on the very condition orphanage is designed to absorb, sync can never make forward progress on any block whose descendants arrive before parents — which is most blocks during IBD.

In the eu-1 incident:

  1. Legacy starts processing block 950,880 immediately
  2. Subtreevalidation calls Spend on 15,974 txs whose parents aren't yet in the store
  3. Aerospike returns 15,146 KEY_NOT_FOUND results inside the batch
  4. Wrapper counts these as consecutive failures
  5. Breaker opens after SpendCircuitBreakerFailureCount (default value)
  6. The remaining ~800 txs that might have succeeded now fail with SERVICE_UNAVAILABLE
  7. Next retry: same block, same flood, breaker stays tripped through the cooldown probe

Suggested fix

In the wrapper that translates aerospike batch results into the failure counter, filter by ResultCode before incrementing:

// Only count infrastructure failures toward the breaker
switch resultCode {
case types.TIMEOUT, types.NETWORK_ERROR, types.SERVER_MEM_ERROR,
     types.MAX_RETRIES_EXCEEDED, types.NO_AVAILABLE_CONNECTIONS_TO_NODE,
     types.SERVER_ERROR:
    breaker.RecordFailure()
case types.KEY_NOT_FOUND, types.FILTERED_OUT, types.RECORD_NOT_FOUND:
    // data state, not infrastructure — let the caller handle (orphanage etc.)
default:
    // unknown — record cautiously
}

The exact set of "infrastructure" result codes should match what the breaker is documented to protect against (the comment block at the top of circuit_breaker.go already enumerates them — make the code match the comment).

Related but separate

The catch-up scenario was made worse by legacy beginning block ingestion before downstream services finished their FSM transition to RUNNING — for the eu-1 incident, that gap was ~2 minutes (08:02:32 → 08:04:54). During that gap, missing-parent txs are explicitly not added to orphanage. The breaker tripping during this window is what got us stuck. Worth a separate issue, not in scope here.

Reproduction

  1. Restart legacy + validator/subtreevalidation/blockvalidation on a node that has UTXO data older than the chain tip (or wipe utxo-store, then point at a peer well ahead).
  2. Allow legacy to start consuming blocks before downstream services FSM-transition to RUNNING.
  3. Observe [SPEND] circuit breaker open within seconds.

Alternatively, unit test directly: feed recordBatchResult with N consecutive KEY_NOT_FOUND results and assert breaker stays closed.

Verification

  • New unit test: KEY_NOT_FOUND-only batch results don't increment failure counter
  • Existing tests for genuine infrastructure failures (timeout, server-mem) still trip the breaker as expected
  • Manual: replay the eu-1 scenario on a test deployment, confirm breaker stays closed and txs flow through orphanage instead

Affected hosts (current incidents)

  • bsva-ovh-teranode-eu-1 — workaround was docker compose restart legacy blockvalidation subtreevalidation after FSM had transitioned to RUNNING; on next sync the orphanage absorbed the missing parents and block 950,880 went through

Related

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions