Skip to content

fix(utxo): drain in-flight batched writes on shutdown#928

Merged
freemans13 merged 12 commits into
bsv-blockchain:mainfrom
freemans13:stu/utxo-graceful-shutdown
Jun 5, 2026
Merged

fix(utxo): drain in-flight batched writes on shutdown#928
freemans13 merged 12 commits into
bsv-blockchain:mainfrom
freemans13:stu/utxo-graceful-shutdown

Conversation

@freemans13

Copy link
Copy Markdown
Collaborator

The bug

Restarting teranode during legacy catchup can leave the UTXO store inconsistent. After redeploying a testnet node mid-catchup we observed:

```
ERROR | netsync/handle_block.go:1018 | legacy| [extendTransactions] called for block ... height 54494 DONE in 817.351µs with error: PROCESSING (4): failed to decorate previous outputs for tx a1f6a4ff...c948afcca -> PROCESSING (4): failed to decorate previous outputs: 1 missing
INFO | legacy/netsync/manager.go:886 | legacy| Resetting header sync state at height 54493
```

Looping forever: peer sends block → tx references a parent UTXO that's not in our store → fail → header-resync → retry → same error. The only recovery is a DB wipe.

The cause

All `utxo.Store` backends (aerospike, sql, and the queue/postgres store on a sibling PR) front their writes with `go-batcher` in `background=true` mode. Every `Create`/`Spend`/`Get`/`Unlock` call enqueues an item and returns immediately; the actual DB or Aerospike write happens in a worker goroutine.

When the process receives SIGTERM:

  1. `util/servicemanager/service_manager.go:66` cancels the global context.
  2. Services unwind via context cancellation and `sm.Wait()` returns.
  3. `daemon/daemon.go:376` takes the `case err = <-waitErr` branch.
  4. The function returns; the process exits.

Nothing in this path closes the utxo store, so the batcher workers are killed mid-flight. Items still in the input channel — and any batch dispatched but not yet committed by Aerospike/Postgres — are lost. The caller had already received a successful `Create`/`Spend` return (background mode acks on enqueue), the parent block had been committed to the blockchain store via gRPC, and on restart the chain advances past blocks whose UTXOs never reached durable storage. Subsequent blocks that spend those outputs fail with `failed to decorate previous outputs: N missing`.

Note that the existing `closeStores` in `daemon.go` only ran on the `<-d.doneCh` shutdown branch (explicit `daemon.Stop()`) and only closed the tx/subtree/temp blob stores. The OS-signal path never called it at all.

The fix

  • Add `Close(ctx context.Context) error` to `utxo.Store` (matching blob store naming).
  • Implement `Close` on every backend:
    • aerospike: closes all 7 batchers in dependency order (metadata writers first, durable state-mutating writers last), then the Aerospike client. `batcherIfc` extended with `Close()`.
    • sql: closes the 4 batchers, then `s.db.Close()`.
    • nullstore: no-op (owns no batchers).
    • logger / txmetacache wrappers: delegate to the inner store.
    • mocks updated.
  • Daemon shutdown:
    • `defer d.closeStores(logger)` at the top of the main body so it runs on BOTH shutdown paths.
    • `closeStores` now uses its own `context.WithTimeout(30s)` instead of the already-cancelled `sm.Ctx` — without this the drain returns immediately with `context.Canceled`.
    • `closeStores` now closes the utxo store before the blob stores so utxo writes referencing subtree blob data finish in the right order.

Each batcher's `Close()` blocks until the input channel is drained and the residual batch dispatched (go-batcher v2 batcher.go:437–449: the worker selects on done, closes the input channel, drains it, and dispatches the remainder with reason `shutdown` before returning).

Test plan

  • `go build ./...` clean (also with `-tags aerospike`)
  • 1221 unit tests pass across `stores/utxo/...`, `stores/txmetacache`, `services/blockpersister`, `daemon/...`
  • Validate on a live mainnet node: redeploy via the usual SCP+restart path, confirm log shows the new "closing utxo store" debug line and the next-block-after-restart processes without missing-output errors.

Loss-of-write on shutdown — the smoking gun for missing-UTXO errors
during legacy catchup after a teranode restart.

All utxo.Store backends (aerospike, sql, postgres on the
queue-utxo-store branch) front their writes with go-batcher in
background=true mode. Every Create/Spend/Get/Unlock call enqueues an
item and returns immediately; the actual DB or Aerospike write happens
in a worker goroutine that consumes the batcher's channel.

When the process receives SIGTERM:

  1. util/servicemanager/service_manager.go:66 cancels the global
     context.
  2. Services unwind via context cancellation and sm.Wait() returns.
  3. daemon/daemon.go:376 takes the 'case err = <-waitErr' branch.
  4. The function returns; the process exits.

Nothing in this path closes the utxo store, so the batcher workers
are killed mid-flight. Items still in the input channel — and any
batch dispatched but not yet acked by Aerospike/Postgres — are lost.
The caller had already received a successful Create/Spend return, the
parent block had been committed to the blockchain store via gRPC, and
on restart the chain advances past blocks whose UTXOs never reached
durable storage. Subsequent blocks that spend those outputs fail with
'failed to decorate previous outputs: N missing' and the node is
stuck in an unrecoverable header-resync retry loop until the DB is
wiped.

This commit fixes the shutdown path:

- Adds Close(ctx) to the utxo.Store interface (consistent naming with
  blob stores' existing Close(ctx)).

- Implements Close on every backend:
  * aerospike: closes all 7 batchers in dependency order (metadata
    writers first, durable state-mutating writers last), then the
    Aerospike client.
  * sql: closes the 4 batchers, then s.db.Close().
  * nullstore: no-op (owns no batchers).
  * logger / txmetacache wrappers: delegate to the inner store.
  * mock / test mocks updated.

- Daemon shutdown changes:
  * daemon/daemon.go now defers closeStores() at the top of the main
    body so it runs on BOTH shutdown paths (signal-driven and
    explicit Stop()), not just doneCh.
  * closeStores allocates a fresh 30 s context instead of reusing
    the already-cancelled sm.Ctx — the cancelled context is what
    drove the service exit in the first place, and we want the drains
    to actually run.
  * closeStores now closes the utxo store before tx/subtree/temp
    stores so that any utxo writes referencing blob-stored subtree
    data finish before the blob stores tear down.

Each batcher's Close() blocks until the input channel is drained and
the residual batch has been dispatched (see go-batcher v2 batcher.go
~line 437–449: the worker selects on done, closes the input channel,
drains it into the current batch, and dispatches the remainder with
reason 'shutdown' before returning).

Tests: 1221 unit tests across stores/utxo/..., stores/txmetacache,
services/blockpersister, daemon/... still pass.
@github-actions

github-actions Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

🤖 Claude Code Review

Status: Complete


Summary:

This PR addresses a critical data loss bug where in-flight batched UTXO writes were lost on shutdown, leading to corrupted state and stuck nodes. The fix adds Close() to all UTXO stores, ensures proper drain of background workers, and updates the daemon shutdown path.

Code Quality: The implementation is thorough, well-documented, and includes regression tests. The fixes from multiple review rounds (batcher close order, self-requeue elimination, go-batcher v2.0.4 upgrade) demonstrate careful iteration.

Findings:

Several documentation clarity issues remain from previous review rounds. Most are minor wording improvements. The core shutdown logic is sound.

See existing inline comment threads for details on:

  • CloseAerospikeClient parameter documentation (util/aerospike.go)
  • Shared client coupling concern (util/aerospike.go:121)
  • Comment accuracy in Close implementations

Recommendation: The critical shutdown bug fix is correct and well-tested. Outstanding items are documentation improvements that do not block functionality.

return nil
case <-ctx.Done():
return ctx.Err()
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Close() implementations in aerospike.go and sql.go handle context timeout by returning ctx.Err() while allowing the drain to continue in the background goroutine. However, this creates a potential issue: if the context times out, the function returns an error but the goroutine continues running and may still attempt to close s.client or s.db after the function has already returned.

This could lead to:

  1. Races if the caller doesn't expect resources to be closed after an error return
  2. Delayed resource cleanup with no visibility to the caller

Consider either:

  • Canceling the drain when context times out (accept partial loss with timeout)
  • Waiting for drain completion regardless of context timeout, but logging the timeout
  • Returning success even on timeout if drain completes (current behavior is arguably correct but undocumented)

The current behavior may be intentional ("best-effort drain"), but it should be documented in the function comment that resource cleanup continues in the background even after error return.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved: This concern has been addressed. Both implementations now document the best-effort drain behavior:

  • aerospike.go:476-479: "if the context deadline expires first, an error is returned but draining continues best-effort"
  • sql.go:302-304: "If the context expires, the function returns its error but the drain continues best-effort"

The behavior is intentional and now properly documented.

freemans13 added a commit to freemans13/teranode that referenced this pull request May 21, 2026
Integration-only adapter. stu/utxo-graceful-shutdown (PR bsv-blockchain#928) added
Close(ctx) to the utxo.Store interface. The postgres store on this
branch (queue-utxo-store PR bsv-blockchain#684) still had the older Stop() name.
Rename + adapt signature + update the cleanup callsite.

Will be dropped/folded when both PRs land upstream.
@github-actions

github-actions Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-928 (8b0754b)

Summary

  • Regressions: 0
  • Improvements: 0
  • Unchanged: 148
  • Significance level: p < 0.05
All benchmark results (sec/op)
Benchmark Baseline Current Change p-value
_NewBlockFromBytes-4 1.756µ 1.782µ ~ 0.200
SplitSyncedParentMap_SetIfNotExists/256_buckets-4 61.73n 61.86n ~ 0.100
SplitSyncedParentMap_SetIfNotExists/16_buckets-4 61.85n 61.83n ~ 1.000
SplitSyncedParentMap_SetIfNotExists/1_bucket-4 61.69n 61.80n ~ 0.100
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets... 31.06n 31.08n ~ 1.000
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_... 51.27n 52.41n ~ 0.200
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa... 106.8n 107.7n ~ 0.400
MiningCandidate_Stringify_Short-4 258.1n 258.3n ~ 0.700
MiningCandidate_Stringify_Long-4 1.895µ 1.897µ ~ 0.800
MiningSolution_Stringify-4 969.3n 979.5n ~ 0.100
BlockInfo_MarshalJSON-4 1.804µ 1.824µ ~ 0.700
NewFromBytes-4 129.1n 130.2n ~ 0.200
AddTxBatchColumnar_Validation-4 2.481µ 2.468µ ~ 1.000
OffsetValidationLoop-4 641.6n 635.7n ~ 0.400
Mine_EasyDifficulty-4 66.95µ 67.89µ ~ 0.400
Mine_WithAddress-4 6.959µ 6.954µ ~ 1.000
DiskTxMap_SetIfNotExists-4 3.654µ 3.809µ ~ 0.100
DiskTxMap_SetIfNotExists_Parallel-4 3.572µ 3.417µ ~ 0.400
DiskTxMap_ExistenceOnly-4 342.5n 415.3n ~ 0.700
Queue-4 191.8n 189.9n ~ 0.100
AtomicPointer-4 5.156n 4.856n ~ 0.100
ReorgOptimizations/DedupFilterPipeline/Old/10K-4 888.5µ 891.8µ ~ 1.000
ReorgOptimizations/DedupFilterPipeline/New/10K-4 808.3µ 808.6µ ~ 1.000
ReorgOptimizations/AllMarkFalse/Old/10K-4 122.8µ 109.0µ ~ 0.100
ReorgOptimizations/AllMarkFalse/New/10K-4 62.44µ 62.27µ ~ 0.700
ReorgOptimizations/HashSlicePool/Old/10K-4 56.67µ 63.84µ ~ 0.100
ReorgOptimizations/HashSlicePool/New/10K-4 11.75µ 11.59µ ~ 0.400
ReorgOptimizations/NodeFlags/Old/10K-4 4.877µ 5.379µ ~ 0.100
ReorgOptimizations/NodeFlags/New/10K-4 1.674µ 2.152µ ~ 0.100
ReorgOptimizations/DedupFilterPipeline/Old/100K-4 10.08m 10.47m ~ 0.200
ReorgOptimizations/DedupFilterPipeline/New/100K-4 10.27m 10.95m ~ 0.100
ReorgOptimizations/AllMarkFalse/Old/100K-4 1.151m 1.239m ~ 0.100
ReorgOptimizations/AllMarkFalse/New/100K-4 682.5µ 678.6µ ~ 0.700
ReorgOptimizations/HashSlicePool/Old/100K-4 611.3µ 668.1µ ~ 0.700
ReorgOptimizations/HashSlicePool/New/100K-4 320.6µ 337.7µ ~ 0.700
ReorgOptimizations/NodeFlags/Old/100K-4 51.38µ 54.50µ ~ 0.100
ReorgOptimizations/NodeFlags/New/100K-4 18.20µ 18.77µ ~ 0.200
TxMapSetIfNotExists-4 52.43n 52.55n ~ 0.700
TxMapSetIfNotExistsDuplicate-4 40.41n 40.50n ~ 0.100
ChannelSendReceive-4 627.1n 623.6n ~ 1.000
BlockAssembler_AddTx-4 0.02965n 0.02930n ~ 0.700
AddNode-4 12.44 12.63 ~ 0.400
AddNodeWithMap-4 13.16 12.56 ~ 0.400
DirectSubtreeAdd/4_per_subtree-4 77.19n 74.41n ~ 0.700
DirectSubtreeAdd/64_per_subtree-4 41.57n 41.61n ~ 1.000
DirectSubtreeAdd/256_per_subtree-4 40.22n 40.38n ~ 0.100
DirectSubtreeAdd/1024_per_subtree-4 38.81n 38.86n ~ 0.300
DirectSubtreeAdd/2048_per_subtree-4 38.45n 38.38n ~ 0.700
SubtreeProcessorAdd/4_per_subtree-4 386.3n 381.2n ~ 0.700
SubtreeProcessorAdd/64_per_subtree-4 374.6n 373.6n ~ 1.000
SubtreeProcessorAdd/256_per_subtree-4 359.7n 365.3n ~ 0.400
SubtreeProcessorAdd/1024_per_subtree-4 360.0n 363.5n ~ 0.200
SubtreeProcessorAdd/2048_per_subtree-4 364.3n 379.8n ~ 0.200
SubtreeProcessorRotate/4_per_subtree-4 368.2n 393.8n ~ 0.100
SubtreeProcessorRotate/64_per_subtree-4 360.6n 371.0n ~ 0.100
SubtreeProcessorRotate/256_per_subtree-4 359.9n 388.2n ~ 0.200
SubtreeProcessorRotate/1024_per_subtree-4 358.5n 382.3n ~ 0.100
SubtreeNodeAddOnly/4_per_subtree-4 87.84n 88.42n ~ 0.100
SubtreeNodeAddOnly/64_per_subtree-4 65.06n 65.18n ~ 1.000
SubtreeNodeAddOnly/256_per_subtree-4 64.02n 64.53n ~ 0.100
SubtreeNodeAddOnly/1024_per_subtree-4 63.56n 63.83n ~ 0.100
SubtreeCreationOnly/4_per_subtree-4 149.7n 148.3n ~ 0.400
SubtreeCreationOnly/64_per_subtree-4 533.4n 539.0n ~ 0.400
SubtreeCreationOnly/256_per_subtree-4 1.985µ 1.978µ ~ 0.700
SubtreeCreationOnly/1024_per_subtree-4 6.359µ 6.431µ ~ 0.700
SubtreeCreationOnly/2048_per_subtree-4 11.55µ 11.76µ ~ 0.100
SubtreeProcessorOverheadBreakdown/64_per_subtree-4 362.1n 376.9n ~ 0.100
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4 363.0n 371.7n ~ 0.200
ParallelGetAndSetIfNotExists/1k_nodes-4 2.424m 2.446m ~ 0.100
ParallelGetAndSetIfNotExists/10k_nodes-4 6.973m 7.178m ~ 0.100
ParallelGetAndSetIfNotExists/50k_nodes-4 8.778m 9.005m ~ 0.200
ParallelGetAndSetIfNotExists/100k_nodes-4 12.08m 12.43m ~ 0.200
SequentialGetAndSetIfNotExists/1k_nodes-4 2.041m 2.006m ~ 0.100
SequentialGetAndSetIfNotExists/10k_nodes-4 6.018m 5.675m ~ 0.100
SequentialGetAndSetIfNotExists/50k_nodes-4 18.28m 17.18m ~ 0.100
SequentialGetAndSetIfNotExists/100k_nodes-4 32.49m 32.33m ~ 0.400
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4 2.480m 2.510m ~ 0.100
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4 10.21m 10.04m ~ 0.200
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4 14.76m 15.29m ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4 2.072m 2.077m ~ 1.000
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4 10.02m 10.05m ~ 1.000
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4 61.74m 61.44m ~ 1.000
CalcBlockWork-4 550.9n 547.6n ~ 0.700
CalculateWork-4 752.1n 734.9n ~ 0.100
CheckOldBlockIDs/on-chain-prefetch/1000-4 63.51µ 64.18µ ~ 0.100
CheckOldBlockIDs/off-chain-prefetch/1000-4 51.46µ 52.68µ ~ 0.400
CheckOldBlockIDs/on-chain-prefetch/10000-4 493.7µ 475.3µ ~ 1.000
CheckOldBlockIDs/off-chain-prefetch/10000-4 352.3µ 361.2µ ~ 0.400
BuildBlockLocatorString_Helpers/Size_10-4 1.346µ 1.451µ ~ 0.100
BuildBlockLocatorString_Helpers/Size_100-4 12.77µ 13.82µ ~ 0.100
BuildBlockLocatorString_Helpers/Size_1000-4 126.9µ 136.0µ ~ 0.100
CatchupWithHeaderCache-4 104.5m 104.8m ~ 0.100
_prepareTxsPerLevel-4 408.5m 407.3m ~ 0.700
_prepareTxsPerLevelOrdered-4 4.053m 3.780m ~ 0.700
_prepareTxsPerLevel_Comparison/Original-4 410.0m 414.1m ~ 0.200
_prepareTxsPerLevel_Comparison/Optimized-4 3.659m 3.649m ~ 0.400
SubtreeSizes/10k_tx_4_per_subtree-4 1.415m 1.341m ~ 0.400
SubtreeSizes/10k_tx_16_per_subtree-4 335.2µ 325.1µ ~ 0.100
SubtreeSizes/10k_tx_64_per_subtree-4 78.71µ 78.85µ ~ 1.000
SubtreeSizes/10k_tx_256_per_subtree-4 19.89µ 19.53µ ~ 0.100
SubtreeSizes/10k_tx_512_per_subtree-4 9.853µ 9.708µ ~ 0.200
SubtreeSizes/10k_tx_1024_per_subtree-4 4.843µ 4.790µ ~ 0.100
SubtreeSizes/10k_tx_2k_per_subtree-4 2.441µ 2.402µ ~ 0.200
BlockSizeScaling/10k_tx_64_per_subtree-4 76.22µ 77.71µ ~ 0.100
BlockSizeScaling/10k_tx_256_per_subtree-4 19.44µ 19.21µ ~ 0.400
BlockSizeScaling/10k_tx_1024_per_subtree-4 4.879µ 4.747µ ~ 0.100
BlockSizeScaling/50k_tx_64_per_subtree-4 404.1µ 394.9µ ~ 0.100
BlockSizeScaling/50k_tx_256_per_subtree-4 97.61µ 95.48µ ~ 0.100
BlockSizeScaling/50k_tx_1024_per_subtree-4 23.63µ 23.20µ ~ 0.400
SubtreeAllocations/small_subtrees_exists_check-4 165.3µ 158.8µ ~ 0.100
SubtreeAllocations/small_subtrees_data_fetch-4 168.2µ 172.5µ ~ 0.100
SubtreeAllocations/small_subtrees_full_validation-4 333.4µ 327.9µ ~ 0.400
SubtreeAllocations/medium_subtrees_exists_check-4 9.936µ 9.427µ ~ 0.100
SubtreeAllocations/medium_subtrees_data_fetch-4 10.58µ 10.35µ ~ 0.700
SubtreeAllocations/medium_subtrees_full_validation-4 19.36µ 19.08µ ~ 0.100
SubtreeAllocations/large_subtrees_exists_check-4 2.389µ 2.262µ ~ 0.100
SubtreeAllocations/large_subtrees_data_fetch-4 2.463µ 2.536µ ~ 0.100
SubtreeAllocations/large_subtrees_full_validation-4 4.846µ 4.770µ ~ 0.100
_BufferPoolAllocation/16KB-4 3.874µ 3.889µ ~ 0.700
_BufferPoolAllocation/32KB-4 8.224µ 9.218µ ~ 1.000
_BufferPoolAllocation/64KB-4 16.18µ 21.12µ ~ 0.100
_BufferPoolAllocation/128KB-4 32.73µ 35.74µ ~ 0.100
_BufferPoolAllocation/512KB-4 111.3µ 136.3µ ~ 0.100
_BufferPoolConcurrent/32KB-4 18.86µ 18.49µ ~ 0.400
_BufferPoolConcurrent/64KB-4 29.83µ 29.61µ ~ 0.700
_BufferPoolConcurrent/512KB-4 143.0µ 140.7µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/16KB-4 640.9µ 676.7µ ~ 0.200
_SubtreeDeserializationWithBufferSizes/32KB-4 701.1µ 698.6µ ~ 0.700
_SubtreeDeserializationWithBufferSizes/64KB-4 703.9µ 695.8µ ~ 0.700
_SubtreeDeserializationWithBufferSizes/128KB-4 658.4µ 704.2µ ~ 0.700
_SubtreeDeserializationWithBufferSizes/512KB-4 604.9µ 613.2µ ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/16KB-4 36.29m 36.44m ~ 0.700
_SubtreeDataDeserializationWithBufferSizes/32KB-4 36.34m 36.63m ~ 0.700
_SubtreeDataDeserializationWithBufferSizes/64KB-4 36.24m 35.99m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/128KB-4 36.31m 36.15m ~ 0.700
_SubtreeDataDeserializationWithBufferSizes/512KB-4 35.81m 36.01m ~ 0.400
_PooledVsNonPooled/Pooled-4 832.9n 831.2n ~ 0.200
_PooledVsNonPooled/NonPooled-4 7.771µ 8.075µ ~ 0.400
_MemoryFootprint/Current_512KB_32concurrent-4 6.566µ 6.468µ ~ 0.100
_MemoryFootprint/Proposed_32KB_32concurrent-4 9.386µ 9.370µ ~ 1.000
_MemoryFootprint/Alternative_64KB_32concurrent-4 8.874µ 9.163µ ~ 0.100
StoreBlock_Sequential/BelowCSVHeight-4 320.6µ 315.3µ ~ 0.700
StoreBlock_Sequential/AboveCSVHeight-4 315.8µ 316.3µ ~ 1.000
GetUtxoHashes-4 278.5n 278.6n ~ 0.800
GetUtxoHashes_ManyOutputs-4 50.46µ 45.74µ ~ 0.100
_NewMetaDataFromBytes-4 213.0n 213.4n ~ 0.100
_Bytes-4 396.4n 395.8n ~ 1.000
_MetaBytes-4 137.4n 141.3n ~ 0.200

Threshold: >10% with p < 0.05 | Generated: 2026-06-05 14:36 UTC

@oskarszoon oskarszoon left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve. Drain on shutdown is the right fix and improves on the pre-fix "no close signal at all" behaviour.

Three items worth flagging:

  1. batcher.Close() is non-blocking — it just close(b.done) and returns. Store.Close() does NOT wait for drain completion. Drain goroutines in aerospike/sql call batcher.Close() on each batcher sequentially (instant), then exit and signal done. So Store.Close() can return while workers still dispatch their final batches. go-batcher v2 doesn't expose a Wait() or Done() channel. Practical risk is low (services have exited by then) but worth knowing.

  2. SQL drain order: unlock → get → spend → create. Comment says "state-mutating writers last" but spend (deletion) drains BEFORE create (insertion). For UTXO consistency the order is creates first, then spends. Low real-world impact given async nature, but the comment claim is wrong.

  3. Aerospike timeout path: if the 30s deadline fires before drain finishes, Store.Close returns early but the drain goroutine keeps running. s.client.Close() is only called in the <-done branch — on timeout the Aerospike client is never explicitly closed. Not a problem on process exit but worth documenting.

Test changes are interface-compliance stubs only — no test of the drain behaviour itself. Not blocking given production evidence, but a future regression here would be silent.

@freemans13 freemans13 self-assigned this May 22, 2026
Comment thread stores/utxo/sql/sql.go
done := make(chan struct{})
go func() {
defer close(done)
// Drain in dependency order: state-mutating writers last so they

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Minor] The comment says "state-mutating writers last" but the actual order drains spend (line 318) before create (line 321). For UTXO correctness, creates should logically complete before spends. While the async batching may make this order difference irrelevant in practice, the comment claim is backwards.

Consider either:

  • Reordering to create then spend to match the comment
  • Clarifying the comment describes timeout-prioritization strategy rather than UTXO consistency ordering

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spend-before-create order is consistent across both sql.go and aerospike.go implementations, and is correct for this use case.

During shutdown, these batchers are draining independent queues from different blocks/transactions that were processed earlier. The spend batcher isn't spending outputs that the create batcher is currently creating — those cross-transaction dependencies were already resolved during block validation.

The comment "state-mutating writers last" means both spend and create (the durable state writers) are drained after the metadata/cache writers (unlock, get). The specific order between spend and create doesn't affect correctness here since they operate on disjoint UTXO sets from different historical transactions.

No change needed — the implementation and comment are both correct.

…own data race

closeStores (the deferred shutdown drain added for the batched-write fix) reads
d.daemonStores.main*Store under globalStoreMutex.RLock. Stores.Cleanup nils those
same fields during test-daemon teardown but took no lock, so the two ran
concurrently and the -race detector failed TestStore_TwoPhaseCommit and every
daemon-based integration suite (smoketest/sequential/prunertest/chainintegrity/
legacy-sync). Take the write lock in Cleanup so the store-pointer reads and writes
are synchronised. The drain-on-shutdown behaviour itself is unchanged.
…cess restart

util.GetAerospikeClient hands out a process-wide client cached by host. The new
Store.Close() (drain-on-shutdown) called s.client.Close() in place, leaving a
CLOSED client in the cache. A subsequent store for the same host -- e.g. an
in-process daemon restart that reuses the same Aerospike container -- then
'reused' the closed client and failed startup with INVALID_NODE_ERROR
(Failed to register udfLUA / check index existence), which is why
TestBlockAssemblyRestartWithExternalTransactionsAerospike failed on restart.

Add util.CloseAerospikeClient(host) to close AND evict the cached client, and
call it from Store.Close, so the next GetAerospikeClient builds a fresh client.
The batcher-drain behaviour (the actual write-loss fix) is unchanged.
@freemans13 freemans13 requested a review from ordishs June 2, 2026 18:20
…hutdown

# Conflicts:
#	stores/utxo/aerospike/aerospike.go

@ordishs ordishs left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review of the shutdown-drain fix. The root-cause diagnosis and surrounding plumbing (deferring closeStores onto both paths, the fresh 30s context, the Cleanup() mutex fix) are correct. But two blocking issues mean the fix does not actually drain writes as written, and the aerospike path can panic-crash shutdown. Details inline.

// client and a later store for the same host (e.g. an in-process
// daemon restart) would reuse it and fail with INVALID_NODE_ERROR.
if s.client != nil {
util.CloseAerospikeClient(s.url.Host)

@ordishs ordishs Jun 3, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Resolved] This has been fixed by upgrading to go-batcher v2.0.4 (see go.mod line 197 and latest commit 954f4ec).

go-batcher v2.0.4 implements blocking Close: the worker drains the input channel and dispatches the residual batch before returning (see batcher.go:Close). The select/done/30s-context now correctly waits for the drain to complete before tearing down the client.


Original concern (now fixed):
The pinned go-batcher/v2 v2.0.3 had fire-and-forget Close (just close(b.done), returns immediately). The worker drained asynchronously, so Close returned before workers finished and the client was torn down mid-flight, losing writes.

Comment thread stores/utxo/aerospike/aerospike.go Outdated
// state writers and are drained last to maximise their chance of
// committing before the deadline.
if s.setDAHBatcher != nil {
s.setDAHBatcher.Close()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High — close order is inverted and can panic-crash shutdown. The spend batcher's dispatch callback feeds the setDAH and increment batchers: sendSpendBatchLuaprocessSpendBatchResults (spend.go:769/779/786) → SetDAHForChildRecords (s.setDAHBatcher.Put, spend.go:905) and IncrementSpentRecords (s.incrementBatcher.Put, spend.go:1089).

Here setDAHBatcher and incrementBatcher are closed before spendBatcher. When spend drains a residual batch that completes a tx (very likely during catchup shutdown), it Puts into the already-closed setDAH/increment channels. go-batcher docs: "any further send will panic with 'send on closed channel'." Worse, those Puts run in bare go func()s, so the panic escapes go-batcher's dispatchAndRecord recovery and crashes the process — meaning storeBatcher (closed last) never drains. Strictly worse than today for the create path.

Producers must close before the batchers they feed. Reorder to drain store/get/spend first, then setDAH/increment/outpoint/locked. (The inline rationale "drained last to maximise their chance of committing before the deadline" is also backwards.) Note setLockedBatch re-enqueues into lockedBatcher (locked.go:76) — a similar smaller self-requeue hazard.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Fixed] in b5d3a31.

Close now drains in dependency order — producers before consumers: storespendsetDAH/incrementget/outpoint/locked. The spend drain callback (sendSpendBatchLuahandleExtraRecords/handleSpendSignal) enqueues into setDAHBatcher/incrementBatcher, which are now still open when spend drains.

Added TestClose_DrainsQueuedSpendCrossBatcher, which fully spends a multi-record tx through a spend batcher kept queued until Close so the cross-batcher enqueues run during the drain. Verified against a real Aerospike testcontainer: it reproduces panic: send on closed channel on the pre-fix close order and passes on the new one.

Also fixed a related self-referential edge in the same commit: setLockedBatch re-Put child records into its own lockedBatcher (same panic class, plus a drain deadlock). It now writes child records inline within the callback, matching the create path. Covered by TestClose_DrainsQueuedSetLockedMultiRecord.

util.CloseAerospikeClient(s.url.Host)
}
return nil
case <-ctx.Done():

@ordishs ordishs Jun 3, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Resolved] This has been fixed. The client close/evict now runs inside the drain goroutine (aerospike.go:559-561), so it happens even when ctx expires first.

The drain goroutine always completes its work (drains batchers, closes external store, evicts Aerospike client) regardless of whether the parent context times out. If ctx.Done() fires first, Close returns ctx.Err() to signal the timeout, but the background cleanup continues and completes.


Original concern (now fixed):
If ctx.Done() won the select race, the drain goroutine kept running but the Aerospike client was never closed or evicted (it was only closed on the done path). Combined with the async-Close issue, this meant timeout = silent dirty teardown.

Comment thread stores/utxo/sql/sql.go Outdated
select {
case <-done:
if s.db != nil {
return s.db.Close()

@ordishs ordishs Jun 3, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Resolved] This has been fixed by upgrading to go-batcher v2.0.4 (see go.mod line 197 and latest commit 954f4ec).

go-batcher v2.0.4 implements blocking Close that drains synchronously. The done channel now closes only after all batchers have finished draining and the db.Close() has been called (sql.go:349-351 shows db.Close runs inside the drain goroutine after all batchers complete).


Original concern (now fixed):
go-batcher v2.0.3 Close was fire-and-forget, so done closed before workers finished. The same async-Close problem that affected aerospike also affected sql.

Comment thread util/aerospike.go
// example an in-process daemon restart that reuses the same Aerospike container)
// would "reuse" the closed client and fail with INVALID_NODE_ERROR. Evicting
// here forces the next GetAerospikeClient to build a fresh client.
func CloseAerospikeClient(host string) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — shared-client coupling. This closes and evicts the process-wide cached client for the host. If any other store/namespace on the same host shares this client, closing it breaks them too. Fine for a single-utxo-store daemon today, but worth a guard or an explicit comment that this assumes sole ownership of the host client.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Addressed] The current code at util/aerospike.go:115-120 now includes explicit documentation of the sole-ownership assumption:

"IMPORTANT — assumes sole ownership of the host's client. Because the cache is keyed per host with no reference counting, this closes the client for ALL users of that host. That is correct for the current single-UTXO-store daemon (one store owns the host), but if more than one store/namespace on the same host ever shares this client, closing it here would break the others. Add reference counting before relying on shared ownership."

This makes the limitation explicit for future maintainers.

return http.StatusOK, "OK", nil
}

func (m *MockUTXOStore) Close(context.Context) error { return nil }

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "1221 unit tests pass" claim doesn't cover what's broken: there's no test that a populated store actually drains before its connection closes, nor (for aerospike) that a tx-completing spend survives Close without panicking. A regression test — enqueue writes, Close, assert they landed in the backing store; and drive a DAH-triggering spend through aerospike Close — would have caught both blocking findings.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Addressed] Comprehensive drain tests were added in subsequent commits:

Aerospike (commit b5d3a31, stores/utxo/aerospike/close_drain_test.go):

  • TestClose_DrainsQueuedSetLockedMultiRecord: verifies multi-record SetLocked is drained during Close
  • TestClose_DrainsQueuedSpendCrossBatcher: verifies spend drain callbacks that enqueue into setDAH/increment complete successfully

SQL (commit 954f4ec, stores/utxo/sql/close_drain_postgres_test.go):

  • TestClose_DrainsQueuedCreate_Postgres: verifies queued Create is persisted durably after Close drains

All tests run against real backend containers (Aerospike/Postgres) and verify persistence after Close.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a shutdown durability bug where UTXO writes queued in background batchers could be lost on SIGTERM, leaving the UTXO store inconsistent after restart (e.g., “failed to decorate previous outputs: N missing”). It does so by introducing an explicit Close(ctx) lifecycle on utxo.Store implementations and ensuring the daemon closes stores on all shutdown paths with a fresh timeout context.

Changes:

  • Add Close(ctx context.Context) error to the utxo.Store interface and implement it across UTXO backends and wrappers (aerospike/sql/nullstore/logger/txmetacache) plus update mocks.
  • Update daemon shutdown to always close stores (including UTXO) on both OS-signal and daemon.Stop() paths, using a dedicated 30s timeout context and closing UTXO before blob stores.
  • Add Aerospike client cache eviction helper (util.CloseAerospikeClient) to avoid reusing closed shared clients.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
util/aerospike.go Adds CloseAerospikeClient() to close+evict cached Aerospike clients.
stores/utxo/Interface.go Extends utxo.Store with Close(ctx) and documents shutdown expectations.
stores/utxo/aerospike/aerospike.go Implements Close(ctx) draining for Aerospike-backed UTXO store.
stores/utxo/sql/sql.go Implements Close(ctx) draining for SQL-backed UTXO store.
stores/utxo/nullstore/nullstore.go Adds no-op Close(ctx) for null store.
stores/utxo/logger/logger.go Adds Close(ctx) passthrough with debug logging.
stores/txmetacache/txmetacache.go Delegates Close(ctx) to wrapped UTXO store.
stores/utxo/mock.go Updates UTXO mock to implement new Close(ctx) method.
stores/utxo/logger/logger_test.go Updates logger-store test mock to implement Close(ctx).
stores/utxo/factory/utxo_test.go Updates factory test mock to implement Close(ctx).
services/blockpersister/Server_test.go Updates blockpersister test mock to implement Close(ctx).
daemon/daemon.go Defers store shutdown on all exit paths; adds timeout context and closes UTXO first.
daemon/daemon_stores.go Synchronizes Cleanup() with shutdown close via globalStoreMutex.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread stores/utxo/sql/sql.go Outdated
Comment on lines +306 to +309
// The batchers run in background=true mode for create/get/unlock and
// background=false for spend — every Put returns to the caller before the
// underlying SQL write commits. Closing each batcher invokes go-batcher's
// shutdown drain (see batcher.go:Close), which closes the input channel,

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed the doc. The spend batcher is background=false, so its Put blocks until the batch callback finishes and nothing can be lost on its input channel. Reworded to state that only create/get/unlock are background=true (Put-acks-on-enqueue), while draining the spend batcher here just flushes a partial batch and releases its worker.

Comment thread stores/utxo/sql/sql.go Outdated
Comment on lines +339 to +343
select {
case <-done:
if s.db != nil {
return s.db.Close()
}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. db.Close() now runs inside the drain goroutine, after the batchers drain, so the connection pool is closed even when ctx has already expired (it no longer hangs off the <-done branch only). The db.Close error is captured in a dbErr var and returned when the drain completes within the deadline; on ctx timeout we still return ctx.Err() but the db is closed best-effort in the background.

Comment thread daemon/daemon.go Outdated
Comment on lines +376 to +379
// finished using the stores. Deferring it here, after the select that
// waits for service exit, gives us a single ordered teardown point
// that runs even if the body returns through a signal-driven exit
// rather than an explicit Stop() call.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworded the comment to be accurate: the defer is registered before the select but runs on function return, i.e. after the select has observed service exit on either path.

Comment thread stores/utxo/Interface.go Outdated
Comment on lines +256 to +260
// Implementations MUST wait for outstanding batched writes to complete
// before returning. Returning before pending writes have committed risks
// silently losing UTXO state: callers (block validation, legacy sync)
// will have already received successful responses for those writes and
// will have committed the parent block, but on restart the UTXOs will

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the contract wording to explicitly allow ctx-driven early return: implementations MUST wait for outstanding writes unless ctx expires first, and a returned ctx error means 'drain not confirmed complete' (the underlying drain/release may continue best-effort).

Comment thread stores/utxo/aerospike/aerospike.go Outdated
Comment on lines +540 to +544
select {
case <-done:
// Drains complete; close the Aerospike client. The client is shared
// per host via util's connection cache, so close-and-evict it rather
// than closing in place — otherwise the cache would keep a closed

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed both. The client close/evict (util.CloseAerospikeClient) now runs inside the drain goroutine instead of on the <-done branch, so it happens even when ctx expires first — avoiding the stale cache entry that would break a later in-process restart with INVALID_NODE_ERROR. Also added s.externalStore.Close(ctx) after the batchers drain so the external blob store created in Store.New is no longer leaked.

- sql Close: close DB inside drain goroutine so the connection pool is not
  leaked on ctx timeout; correct doc re spend batcher (background=false).
- aerospike Close: close/evict client and external blob store inside the
  drain goroutine so they run even when ctx expires first; close the
  external store that Store.New creates.
- utxo Interface Close: reconcile MUST-wait wording with ctx-driven early
  return.
- daemon: correct misleading comment about defer placement vs the select.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 7 comments.

Comment thread stores/utxo/sql/sql.go
Comment on lines +333 to +334
// Drain in dependency order: state-mutating writers last so they
// have the best chance of committing before the deadline.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a minor wording imprecision. The comment says "state-mutating writers last" but unlock (SetLocked) is closed first. However, the actual order is sound - unlock is a best-effort metadata operation while create/spend are the critical durable UTXO writes that should have maximum time to complete. Suggested clarification: "drain best-effort metadata operations first, then critical create/spend writes last".

Comment thread stores/utxo/sql/sql.go
Comment on lines +306 to +308
// The create/get/unlock batchers run in background=true mode — their Put
// returns to the caller before the underlying SQL write commits, so a
// SIGTERM mid-flight would silently lose queued writes without this drain.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This remains a valid concern. The comment at sql.go:306-308 states operations "return to the caller before the underlying SQL write commits" but Create/Get/SetLocked actually wait on done channels for batch completion. The actual issue being fixed is flushing partially-filled batches during shutdown, not preventing premature caller success. Suggested rewording for accuracy.

Comment on lines +496 to +500
// "shutdown"). Without this drain, a SIGTERM mid-flight silently loses
// the in-channel items: the caller has already received a successful
// Create/Spend/etc. return because background batchers ack on enqueue,
// not on dispatch, so the parent block gets committed elsewhere but the
// UTXO write never reaches Aerospike. On restart, blocks that spend

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This remains a valid concern. The comment at aerospike.go:529 states "background batchers ack on enqueue" but the actual Create/Spend implementations wait on per-item result channels (e.g., Create waits on resultCh, Spend waits on errCh). The real risk is partial/unflushed batches during shutdown, not that callers got premature success responses. Suggested rewording for accuracy.

Comment thread stores/utxo/Interface.go
Comment on lines +258 to +262
// before pending writes have committed risks silently losing UTXO state:
// callers (block validation, legacy sync) will have already received
// successful responses for those writes and will have committed the
// parent block, but on restart the UTXOs will be missing — breaking
// subsequent blocks that spend them.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This remains a valid concern. The Interface.Close contract at line 259 states "callers … will have already received successful responses" but this is not universally true - both Aerospike and SQL implementations wait for batch callbacks before returning to callers. Suggested rewording: "callers may have observed success" or "higher-level state may have been committed" to maintain accuracy across implementations.

Comment thread util/aerospike.go
Comment on lines +103 to +106
// CloseAerospikeClient closes the cached aerospike client for the given host and
// removes it from the process-wide connection cache. GetAerospikeClient hands
// out a shared client per host, so closing the client without evicting it would
// leave a closed client in the cache: a subsequent store for the same host (for

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Addressed] The current code at util/aerospike.go:111-113 now includes explicit documentation clarifying that host MUST be url.Host (may include port), not just a hostname. The comment warns: "pass the same url.Host, not a bare hostname, or the eviction silently no-ops."

Comment on lines +512 to +569
func (s *Store) Close(ctx context.Context) error {
done := make(chan struct{})

var extErr error

go func() {
defer close(done)
// Order is intentional: stop accepting new write ops first (no-op
// here because Put/PutCtx don't have a separate close-gate), then
// drain in dependency order. setDAH/locked/outpoint/increment are
// best-effort metadata writers; store/spend are the durable UTXO
// state writers and are drained last to maximise their chance of
// committing before the deadline.
if s.setDAHBatcher != nil {
s.setDAHBatcher.Close()
}
if s.lockedBatcher != nil {
s.lockedBatcher.Close()
}
if s.outpointBatcher != nil {
s.outpointBatcher.Close()
}
if s.incrementBatcher != nil {
s.incrementBatcher.Close()
}
if s.getBatcher != nil {
s.getBatcher.Close()
}
if s.spendBatcher != nil {
s.spendBatcher.Close()
}
if s.storeBatcher != nil {
s.storeBatcher.Close()
}

// Drains complete; close the external blob store (created in
// Store.New) so its handles/connections are not leaked.
if s.externalStore != nil {
extErr = s.externalStore.Close(ctx)
}

// Close the Aerospike client. The client is shared per host via
// util's connection cache, so close-and-evict it rather than closing
// in place — otherwise the cache would keep a closed client and a
// later store for the same host (e.g. an in-process daemon restart)
// would reuse it and fail with INVALID_NODE_ERROR. Done inside the
// goroutine so it still runs even when ctx has already expired.
if s.client != nil {
util.CloseAerospikeClient(s.url.Host)
}
}()

select {
case <-done:
return extErr
case <-ctx.Done():
return ctx.Err()
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Addressed] Test coverage was added in commit 954f4ec and b5d3a31:

  • stores/utxo/aerospike/close_drain_test.go added TestClose_DrainsQueuedSetLockedMultiRecord and TestClose_DrainsQueuedSpendCrossBatcher (lines 31-172)
  • stores/utxo/sql/close_drain_postgres_test.go added TestClose_DrainsQueuedCreate_Postgres (lines 62-133)

These tests verify Close() drains queued writes before connection teardown.

Comment thread stores/utxo/sql/sql.go
Comment on lines +326 to +360
func (s *Store) Close(ctx context.Context) error {
done := make(chan struct{})

var dbErr error

go func() {
defer close(done)
// Drain in dependency order: state-mutating writers last so they
// have the best chance of committing before the deadline.
if s.unlockBatcher != nil {
s.unlockBatcher.Close()
}
if s.getBatcher != nil {
s.getBatcher.Close()
}
if s.spendBatcher != nil {
s.spendBatcher.Close()
}
if s.createBatcher != nil {
s.createBatcher.Close()
}
// Always close the DB after the batchers drain, even if ctx has
// already expired, so the connection pool is not leaked.
if s.db != nil {
dbErr = s.db.Close()
}
}()

select {
case <-done:
return dbErr
case <-ctx.Done():
return ctx.Err()
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Addressed] Test coverage was added in commit 954f4ec. See stores/utxo/sql/close_drain_postgres_test.go:62-133, TestClose_DrainsQueuedCreate_Postgres. This test verifies that:

  1. A queued Create in the background batcher is drained by Close()
  2. The transaction persists durably (verified by reopening the DB)
  3. Close() completes without error even when the batcher would otherwise timeout

@socket-security

socket-security Bot commented Jun 4, 2026

Copy link
Copy Markdown

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff Package Supply Chain
Security
Vulnerability Quality Maintenance License
Updatedgolang/​github.com/​bsv-blockchain/​go-batcher/​v2@​v2.0.3 ⏵ v2.0.410010010010080

View full report

The graceful-shutdown fix relies on batcher.Close() blocking until queued
writes are drained. go-batcher v2.0.3 Close() was fire-and-forget, so the
store closed its DB/Aerospike connection before the batcher worker
dispatched the residual batch — the very writes this PR aims to preserve
could still be lost.

go-batcher v2.0.4 (bsv-blockchain/go-batcher#111) makes Close() block until
the worker has drained the channel, dispatched the residual batch through
fn, and all background dispatch goroutines have finished; it is also now
idempotent. With the bump, the store Close() methods already in this PR are
correct as written.

Adds TestClose_DrainsQueuedCreate_Postgres: configures the create batcher so
it only flushes on shutdown, enqueues a Create, calls Close, then re-opens
the database and confirms the transaction persisted. Runs under the Postgres
testcontainer (CI/GoFortress); skipped in -short.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@freemans13 freemans13 force-pushed the stu/utxo-graceful-shutdown branch from ed55281 to 954f4ec Compare June 4, 2026 14:06
Comment thread stores/utxo/aerospike/aerospike.go Outdated
// best-effort metadata writers; store/spend are the durable UTXO
// state writers and are drained last to maximise their chance of
// committing before the deadline.
if s.setDAHBatcher != nil {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Critical] Batcher close order inverted — will panic-crash on shutdown

The spend batcher dispatch callback calls SetDAHForChildRecords and IncrementSpentRecords (spend.go:905, 1089), which enqueue into setDAHBatcher and incrementBatcher. Closing these consumer batchers (lines 525-536) BEFORE the producer spendBatcher (line 540-542) means when spend drains its residual batch, it will Put into already-closed channels.

go-batcher behavior: when a batcher is closed, subsequent Put calls panic with send on closed channel. Those Puts run in bare goroutines without recovery, so the panic escapes and crashes the process.

Fix: Reverse the order. Close producers (store/spend/get) BEFORE consumers (setDAH/increment/outpoint/locked).

This was previously flagged by ordishs in comment 3349987293 but hasn't been addressed yet.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Fixed] in b5d3a31.

Close now drains in dependency order — producers before consumers: storespendsetDAH/incrementget/outpoint/locked. The spend drain callback (sendSpendBatchLuahandleExtraRecords/handleSpendSignal) enqueues into setDAHBatcher/incrementBatcher, which are now still open when spend drains.

Added TestClose_DrainsQueuedSpendCrossBatcher, which fully spends a multi-record tx through a spend batcher kept queued until Close so the cross-batcher enqueues run during the drain. Verified against a real Aerospike testcontainer: it reproduces panic: send on closed channel on the pre-fix close order and passes on the new one.

Also fixed a related self-referential edge in the same commit: setLockedBatch re-Put child records into its own lockedBatcher (same panic class, plus a drain deadlock). It now writes child records inline within the callback, matching the create path. Covered by TestClose_DrainsQueuedSetLockedMultiRecord.

Addresses review findings on PR bsv-blockchain#928 surfaced once go-batcher v2.0.4 made
Close block until drain. With a real draining Close, a batcher whose drain
callback enqueues into another batcher panics ("send on closed channel") if
the downstream batcher is already closed.

1. Close order (Critical, flagged by reviewers): the spend batcher's callback
   (sendSpendBatchLua -> handleExtraRecords/handleSpendSignal) enqueues into
   setDAHBatcher and incrementBatcher on a fully-spent tx. Close closed those
   consumers before spend. Reordered to drain producers first (store, spend),
   then spend's consumers (setDAH, increment), then the independent batchers
   (get, outpoint, locked).

2. lockedBatcher self-requeue (panic + deadlock): setLockedBatch re-Put child
   records into its own batcher for multi-record txs and blocked on the
   result. During a draining Close that Put hits a closed channel and could
   never be serviced. Fixed by handling child records inline within the
   callback (one extra BatchOperate), matching how the create path writes a
   tx's extra/external records. Removes the only self-referential batcher edge.

3. CloseAerospikeClient: documented the sole-ownership assumption (closes the
   process-wide shared client for the host) and that the arg must be url.Host.

Adds aerospike Close regression tests (verified against a real Aerospike
testcontainer; each confirmed to panic on the pre-fix code):
- TestClose_DrainsQueuedSetLockedMultiRecord
- TestClose_DrainsQueuedSpendCrossBatcher

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@freemans13 freemans13 requested a review from ordishs June 4, 2026 17:33

@ordishs ordishs left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve — solid, careful fix for a real data-integrity bug

Verified the critical claims against the codebase rather than the PR description alone. The hard parts hold up.

Verified correct

  • Aerospike close-order. Traced the edges: sendSpendBatchLua → processSpendBatchResults → handleSpendSignal → handleExtraRecords/SetDAHForChildRecords enqueue into setDAHBatcher (spend.go:905) and incrementBatcher (spend.go:1089), so spend must close before them — which it does. sendStoreBatch/sendSetDAHBatch/sendIncrementBatch have no outbound batcher edges, so the "store feeds nothing; setDAH/increment are independent leaves" reasoning is accurate.
  • SQL close-order is safe — all four PutCtx calls are caller-driven, none in a batch callback, so no cross-batcher drain edges.
  • Deferred teardown runs strictly after services finish on both paths; old in-select call correctly removed (no double-close).
  • Cleanup() race fix closes a genuine read/write race on the store pointers.
  • locked.go re-entrancy fix is the right approach; errCh signalled exactly once per item.

Follow-ups (non-blocking)

  1. (Cheap, in-scope) locked.go:110-113 — the unparseable-response branch continues without signalling errCh, so SetLocked's return <-errCh leaks that goroutine forever. Pre-existing, but you're rewriting this exact branch; signalling a ProcessingError here would close the leak.
  2. Drain-order regression tests are integration-only (testing.Short() skip + testcontainers), so the consensus-relevant ordering isn't covered by make test. Acceptable, but a future reorder of Close() won't be caught by unit CI.
  3. Confirm go-batcher PutCtx doesn't drop on a cancelled ctx. During a real SIGTERM drain, handleExtraRecords (spend.go:980) calls setDAHBatcher.PutCtx(ctx, …) with the original spend request's (likely-cancelled) ctx. Low risk since the primary setDAH/increment writes use ctx-less .Put, but the integration test only exercises the non-cancelled path.
  4. Confirm externalStore is a distinct instance from the subtree/tx/temp blob stores — if the 30s ctx expires, the aerospike best-effort goroutine keeps running and calls externalStore.Close(ctx) while the daemon proceeds to close the blob stores.
  5. Shared 30s deadline across all four stores: a slow utxo drain eats the blob stores' budget. Consider a setting rather than a hardcoded literal.
  6. CloseAerospikeClient per-host eviction has no refcount — documented and correct for the single-store daemon; just ensure no test shares an Aerospike host across two live stores.

Also: the live-mainnet validation in the test plan is still unchecked. Given this is a state-corruption fix, confirming the "closing utxo store" log line + clean next-block-after-restart on a real node before merge is worth it.

Resolves a semantic conflict in stores/utxo/aerospike/locked.go where both
branches rewrote setLockedBatch:

- upstream bsv-blockchain#1025 ("stop caller-side leak"): trySignal, handled[] tracking,
  panic-recovery defer (signalBatchPanic), placeholderKey key-error handling,
  batchOperate wrapper, and bounded waitForLockedResult — but kept the child
  self-requeue (lockedBatcher.PutCtx from inside the callback).
- this branch (bsv-blockchain#928): removed the self-requeue, handling child records inline,
  because bsv-blockchain#928 closes the lockedBatcher on shutdown and the self-requeue then
  Puts into a closed channel (panic in an errgroup goroutine -> process crash),
  which upstream's bounded wait does not prevent.

Resolution: keep ALL of upstream's scaffolding and replace the child
self-requeue with the inline child-record BatchOperate (using batchOperate +
trySignal). Verified: build + go vet (incl. -tags aerospike); aerospike
close-drain + lock/multirecord/duplicate-spend tests pass against a real
container; daemon closeStores tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@sonarqubecloud

sonarqubecloud Bot commented Jun 5, 2026

Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
7.5% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

@freemans13 freemans13 merged commit f68b48c into bsv-blockchain:main Jun 5, 2026
36 of 37 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants