fix(utxo): drain in-flight batched writes on shutdown by freemans13 · Pull Request #928 · bsv-blockchain/teranode

freemans13 · 2026-05-21T19:54:07Z

The bug

Restarting teranode during legacy catchup can leave the UTXO store inconsistent. After redeploying a testnet node mid-catchup we observed:

```
ERROR | netsync/handle_block.go:1018 | legacy| [extendTransactions] called for block ... height 54494 DONE in 817.351µs with error: PROCESSING (4): failed to decorate previous outputs for tx a1f6a4ff...c948afcca -> PROCESSING (4): failed to decorate previous outputs: 1 missing
INFO | legacy/netsync/manager.go:886 | legacy| Resetting header sync state at height 54493
```

Looping forever: peer sends block → tx references a parent UTXO that's not in our store → fail → header-resync → retry → same error. The only recovery is a DB wipe.

The cause

All `utxo.Store` backends (aerospike, sql, and the queue/postgres store on a sibling PR) front their writes with `go-batcher` in `background=true` mode. Every `Create`/`Spend`/`Get`/`Unlock` call enqueues an item and returns immediately; the actual DB or Aerospike write happens in a worker goroutine.

When the process receives SIGTERM:

`util/servicemanager/service_manager.go:66` cancels the global context.
Services unwind via context cancellation and `sm.Wait()` returns.
`daemon/daemon.go:376` takes the `case err = <-waitErr` branch.
The function returns; the process exits.

Nothing in this path closes the utxo store, so the batcher workers are killed mid-flight. Items still in the input channel — and any batch dispatched but not yet committed by Aerospike/Postgres — are lost. The caller had already received a successful `Create`/`Spend` return (background mode acks on enqueue), the parent block had been committed to the blockchain store via gRPC, and on restart the chain advances past blocks whose UTXOs never reached durable storage. Subsequent blocks that spend those outputs fail with `failed to decorate previous outputs: N missing`.

Note that the existing `closeStores` in `daemon.go` only ran on the `<-d.doneCh` shutdown branch (explicit `daemon.Stop()`) and only closed the tx/subtree/temp blob stores. The OS-signal path never called it at all.

The fix

Add `Close(ctx context.Context) error` to `utxo.Store` (matching blob store naming).
Implement `Close` on every backend:
- aerospike: closes all 7 batchers in dependency order (metadata writers first, durable state-mutating writers last), then the Aerospike client. `batcherIfc` extended with `Close()`.
- sql: closes the 4 batchers, then `s.db.Close()`.
- nullstore: no-op (owns no batchers).
- logger / txmetacache wrappers: delegate to the inner store.
- mocks updated.
Daemon shutdown:
- `defer d.closeStores(logger)` at the top of the main body so it runs on BOTH shutdown paths.
- `closeStores` now uses its own `context.WithTimeout(30s)` instead of the already-cancelled `sm.Ctx` — without this the drain returns immediately with `context.Canceled`.
- `closeStores` now closes the utxo store before the blob stores so utxo writes referencing subtree blob data finish in the right order.

Each batcher's `Close()` blocks until the input channel is drained and the residual batch dispatched (go-batcher v2 batcher.go:437–449: the worker selects on done, closes the input channel, drains it, and dispatches the remainder with reason `shutdown` before returning).

Test plan

`go build ./...` clean (also with `-tags aerospike`)
1221 unit tests pass across `stores/utxo/...`, `stores/txmetacache`, `services/blockpersister`, `daemon/...`
Validate on a live mainnet node: redeploy via the usual SCP+restart path, confirm log shows the new "closing utxo store" debug line and the next-block-after-restart processes without missing-output errors.

Loss-of-write on shutdown — the smoking gun for missing-UTXO errors during legacy catchup after a teranode restart. All utxo.Store backends (aerospike, sql, postgres on the queue-utxo-store branch) front their writes with go-batcher in background=true mode. Every Create/Spend/Get/Unlock call enqueues an item and returns immediately; the actual DB or Aerospike write happens in a worker goroutine that consumes the batcher's channel. When the process receives SIGTERM: 1. util/servicemanager/service_manager.go:66 cancels the global context. 2. Services unwind via context cancellation and sm.Wait() returns. 3. daemon/daemon.go:376 takes the 'case err = <-waitErr' branch. 4. The function returns; the process exits. Nothing in this path closes the utxo store, so the batcher workers are killed mid-flight. Items still in the input channel — and any batch dispatched but not yet acked by Aerospike/Postgres — are lost. The caller had already received a successful Create/Spend return, the parent block had been committed to the blockchain store via gRPC, and on restart the chain advances past blocks whose UTXOs never reached durable storage. Subsequent blocks that spend those outputs fail with 'failed to decorate previous outputs: N missing' and the node is stuck in an unrecoverable header-resync retry loop until the DB is wiped. This commit fixes the shutdown path: - Adds Close(ctx) to the utxo.Store interface (consistent naming with blob stores' existing Close(ctx)). - Implements Close on every backend: * aerospike: closes all 7 batchers in dependency order (metadata writers first, durable state-mutating writers last), then the Aerospike client. * sql: closes the 4 batchers, then s.db.Close(). * nullstore: no-op (owns no batchers). * logger / txmetacache wrappers: delegate to the inner store. * mock / test mocks updated. - Daemon shutdown changes: * daemon/daemon.go now defers closeStores() at the top of the main body so it runs on BOTH shutdown paths (signal-driven and explicit Stop()), not just doneCh. * closeStores allocates a fresh 30 s context instead of reusing the already-cancelled sm.Ctx — the cancelled context is what drove the service exit in the first place, and we want the drains to actually run. * closeStores now closes the utxo store before tx/subtree/temp stores so that any utxo writes referencing blob-stored subtree data finish before the blob stores tear down. Each batcher's Close() blocks until the input channel is drained and the residual batch has been dispatched (see go-batcher v2 batcher.go ~line 437–449: the worker selects on done, closes the input channel, drains it into the current batch, and dispatches the remainder with reason 'shutdown' before returning). Tests: 1221 unit tests across stores/utxo/..., stores/txmetacache, services/blockpersister, daemon/... still pass.

github-actions · 2026-05-21T19:55:37Z

🤖 Claude Code Review

Status: Complete

Summary:

This PR addresses a critical data loss bug where in-flight batched UTXO writes were lost on shutdown, leading to corrupted state and stuck nodes. The fix adds Close() to all UTXO stores, ensures proper drain of background workers, and updates the daemon shutdown path.

Code Quality: The implementation is thorough, well-documented, and includes regression tests. The fixes from multiple review rounds (batcher close order, self-requeue elimination, go-batcher v2.0.4 upgrade) demonstrate careful iteration.

Findings:

Several documentation clarity issues remain from previous review rounds. Most are minor wording improvements. The core shutdown logic is sound.

See existing inline comment threads for details on:

CloseAerospikeClient parameter documentation (util/aerospike.go)
Shared client coupling concern (util/aerospike.go:121)
Comment accuracy in Close implementations

Recommendation: The critical shutdown bug fix is correct and well-tested. Outstanding items are documentation improvements that do not block functionality.

github-actions · 2026-05-21T19:56:46Z

+		return nil
+	case <-ctx.Done():
+		return ctx.Err()
+	}


The Close() implementations in aerospike.go and sql.go handle context timeout by returning ctx.Err() while allowing the drain to continue in the background goroutine. However, this creates a potential issue: if the context times out, the function returns an error but the goroutine continues running and may still attempt to close s.client or s.db after the function has already returned.

This could lead to:

Races if the caller doesn't expect resources to be closed after an error return

Delayed resource cleanup with no visibility to the caller

Consider either:

Canceling the drain when context times out (accept partial loss with timeout)

Waiting for drain completion regardless of context timeout, but logging the timeout

Returning success even on timeout if drain completes (current behavior is arguably correct but undocumented)

The current behavior may be intentional ("best-effort drain"), but it should be documented in the function comment that resource cleanup continues in the background even after error return.

Resolved: This concern has been addressed. Both implementations now document the best-effort drain behavior:

aerospike.go:476-479: "if the context deadline expires first, an error is returned but draining continues best-effort"

sql.go:302-304: "If the context expires, the function returns its error but the drain continues best-effort"

The behavior is intentional and now properly documented.

Integration-only adapter. stu/utxo-graceful-shutdown (PR bsv-blockchain#928) added Close(ctx) to the utxo.Store interface. The postgres store on this branch (queue-utxo-store PR bsv-blockchain#684) still had the older Stop() name. Rename + adapt signature + update the cleanup callsite. Will be dropped/folded when both PRs land upstream.

github-actions · 2026-05-21T20:09:43Z

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-928 (8b0754b)

Summary

Regressions: 0
Improvements: 0
Unchanged: 148
Significance level: p < 0.05

All benchmark results (sec/op)

Benchmark	Baseline	Current	Change	p-value
_NewBlockFromBytes-4	1.756µ	1.782µ	~	0.200
SplitSyncedParentMap_SetIfNotExists/256_buckets-4	61.73n	61.86n	~	0.100
SplitSyncedParentMap_SetIfNotExists/16_buckets-4	61.85n	61.83n	~	1.000
SplitSyncedParentMap_SetIfNotExists/1_bucket-4	61.69n	61.80n	~	0.100
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets...	31.06n	31.08n	~	1.000
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_...	51.27n	52.41n	~	0.200
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa...	106.8n	107.7n	~	0.400
MiningCandidate_Stringify_Short-4	258.1n	258.3n	~	0.700
MiningCandidate_Stringify_Long-4	1.895µ	1.897µ	~	0.800
MiningSolution_Stringify-4	969.3n	979.5n	~	0.100
BlockInfo_MarshalJSON-4	1.804µ	1.824µ	~	0.700
NewFromBytes-4	129.1n	130.2n	~	0.200
AddTxBatchColumnar_Validation-4	2.481µ	2.468µ	~	1.000
OffsetValidationLoop-4	641.6n	635.7n	~	0.400
Mine_EasyDifficulty-4	66.95µ	67.89µ	~	0.400
Mine_WithAddress-4	6.959µ	6.954µ	~	1.000
DiskTxMap_SetIfNotExists-4	3.654µ	3.809µ	~	0.100
DiskTxMap_SetIfNotExists_Parallel-4	3.572µ	3.417µ	~	0.400
DiskTxMap_ExistenceOnly-4	342.5n	415.3n	~	0.700
Queue-4	191.8n	189.9n	~	0.100
AtomicPointer-4	5.156n	4.856n	~	0.100
ReorgOptimizations/DedupFilterPipeline/Old/10K-4	888.5µ	891.8µ	~	1.000
ReorgOptimizations/DedupFilterPipeline/New/10K-4	808.3µ	808.6µ	~	1.000
ReorgOptimizations/AllMarkFalse/Old/10K-4	122.8µ	109.0µ	~	0.100
ReorgOptimizations/AllMarkFalse/New/10K-4	62.44µ	62.27µ	~	0.700
ReorgOptimizations/HashSlicePool/Old/10K-4	56.67µ	63.84µ	~	0.100
ReorgOptimizations/HashSlicePool/New/10K-4	11.75µ	11.59µ	~	0.400
ReorgOptimizations/NodeFlags/Old/10K-4	4.877µ	5.379µ	~	0.100
ReorgOptimizations/NodeFlags/New/10K-4	1.674µ	2.152µ	~	0.100
ReorgOptimizations/DedupFilterPipeline/Old/100K-4	10.08m	10.47m	~	0.200
ReorgOptimizations/DedupFilterPipeline/New/100K-4	10.27m	10.95m	~	0.100
ReorgOptimizations/AllMarkFalse/Old/100K-4	1.151m	1.239m	~	0.100
ReorgOptimizations/AllMarkFalse/New/100K-4	682.5µ	678.6µ	~	0.700
ReorgOptimizations/HashSlicePool/Old/100K-4	611.3µ	668.1µ	~	0.700
ReorgOptimizations/HashSlicePool/New/100K-4	320.6µ	337.7µ	~	0.700
ReorgOptimizations/NodeFlags/Old/100K-4	51.38µ	54.50µ	~	0.100
ReorgOptimizations/NodeFlags/New/100K-4	18.20µ	18.77µ	~	0.200
TxMapSetIfNotExists-4	52.43n	52.55n	~	0.700
TxMapSetIfNotExistsDuplicate-4	40.41n	40.50n	~	0.100
ChannelSendReceive-4	627.1n	623.6n	~	1.000
BlockAssembler_AddTx-4	0.02965n	0.02930n	~	0.700
AddNode-4	12.44	12.63	~	0.400
AddNodeWithMap-4	13.16	12.56	~	0.400
DirectSubtreeAdd/4_per_subtree-4	77.19n	74.41n	~	0.700
DirectSubtreeAdd/64_per_subtree-4	41.57n	41.61n	~	1.000
DirectSubtreeAdd/256_per_subtree-4	40.22n	40.38n	~	0.100
DirectSubtreeAdd/1024_per_subtree-4	38.81n	38.86n	~	0.300
DirectSubtreeAdd/2048_per_subtree-4	38.45n	38.38n	~	0.700
SubtreeProcessorAdd/4_per_subtree-4	386.3n	381.2n	~	0.700
SubtreeProcessorAdd/64_per_subtree-4	374.6n	373.6n	~	1.000
SubtreeProcessorAdd/256_per_subtree-4	359.7n	365.3n	~	0.400
SubtreeProcessorAdd/1024_per_subtree-4	360.0n	363.5n	~	0.200
SubtreeProcessorAdd/2048_per_subtree-4	364.3n	379.8n	~	0.200
SubtreeProcessorRotate/4_per_subtree-4	368.2n	393.8n	~	0.100
SubtreeProcessorRotate/64_per_subtree-4	360.6n	371.0n	~	0.100
SubtreeProcessorRotate/256_per_subtree-4	359.9n	388.2n	~	0.200
SubtreeProcessorRotate/1024_per_subtree-4	358.5n	382.3n	~	0.100
SubtreeNodeAddOnly/4_per_subtree-4	87.84n	88.42n	~	0.100
SubtreeNodeAddOnly/64_per_subtree-4	65.06n	65.18n	~	1.000
SubtreeNodeAddOnly/256_per_subtree-4	64.02n	64.53n	~	0.100
SubtreeNodeAddOnly/1024_per_subtree-4	63.56n	63.83n	~	0.100
SubtreeCreationOnly/4_per_subtree-4	149.7n	148.3n	~	0.400
SubtreeCreationOnly/64_per_subtree-4	533.4n	539.0n	~	0.400
SubtreeCreationOnly/256_per_subtree-4	1.985µ	1.978µ	~	0.700
SubtreeCreationOnly/1024_per_subtree-4	6.359µ	6.431µ	~	0.700
SubtreeCreationOnly/2048_per_subtree-4	11.55µ	11.76µ	~	0.100
SubtreeProcessorOverheadBreakdown/64_per_subtree-4	362.1n	376.9n	~	0.100
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4	363.0n	371.7n	~	0.200
ParallelGetAndSetIfNotExists/1k_nodes-4	2.424m	2.446m	~	0.100
ParallelGetAndSetIfNotExists/10k_nodes-4	6.973m	7.178m	~	0.100
ParallelGetAndSetIfNotExists/50k_nodes-4	8.778m	9.005m	~	0.200
ParallelGetAndSetIfNotExists/100k_nodes-4	12.08m	12.43m	~	0.200
SequentialGetAndSetIfNotExists/1k_nodes-4	2.041m	2.006m	~	0.100
SequentialGetAndSetIfNotExists/10k_nodes-4	6.018m	5.675m	~	0.100
SequentialGetAndSetIfNotExists/50k_nodes-4	18.28m	17.18m	~	0.100
SequentialGetAndSetIfNotExists/100k_nodes-4	32.49m	32.33m	~	0.400
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4	2.480m	2.510m	~	0.100
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4	10.21m	10.04m	~	0.200
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4	14.76m	15.29m	~	0.100
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4	2.072m	2.077m	~	1.000
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4	10.02m	10.05m	~	1.000
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4	61.74m	61.44m	~	1.000
CalcBlockWork-4	550.9n	547.6n	~	0.700
CalculateWork-4	752.1n	734.9n	~	0.100
CheckOldBlockIDs/on-chain-prefetch/1000-4	63.51µ	64.18µ	~	0.100
CheckOldBlockIDs/off-chain-prefetch/1000-4	51.46µ	52.68µ	~	0.400
CheckOldBlockIDs/on-chain-prefetch/10000-4	493.7µ	475.3µ	~	1.000
CheckOldBlockIDs/off-chain-prefetch/10000-4	352.3µ	361.2µ	~	0.400
BuildBlockLocatorString_Helpers/Size_10-4	1.346µ	1.451µ	~	0.100
BuildBlockLocatorString_Helpers/Size_100-4	12.77µ	13.82µ	~	0.100
BuildBlockLocatorString_Helpers/Size_1000-4	126.9µ	136.0µ	~	0.100
CatchupWithHeaderCache-4	104.5m	104.8m	~	0.100
_prepareTxsPerLevel-4	408.5m	407.3m	~	0.700
_prepareTxsPerLevelOrdered-4	4.053m	3.780m	~	0.700
_prepareTxsPerLevel_Comparison/Original-4	410.0m	414.1m	~	0.200
_prepareTxsPerLevel_Comparison/Optimized-4	3.659m	3.649m	~	0.400
SubtreeSizes/10k_tx_4_per_subtree-4	1.415m	1.341m	~	0.400
SubtreeSizes/10k_tx_16_per_subtree-4	335.2µ	325.1µ	~	0.100
SubtreeSizes/10k_tx_64_per_subtree-4	78.71µ	78.85µ	~	1.000
SubtreeSizes/10k_tx_256_per_subtree-4	19.89µ	19.53µ	~	0.100
SubtreeSizes/10k_tx_512_per_subtree-4	9.853µ	9.708µ	~	0.200
SubtreeSizes/10k_tx_1024_per_subtree-4	4.843µ	4.790µ	~	0.100
SubtreeSizes/10k_tx_2k_per_subtree-4	2.441µ	2.402µ	~	0.200
BlockSizeScaling/10k_tx_64_per_subtree-4	76.22µ	77.71µ	~	0.100
BlockSizeScaling/10k_tx_256_per_subtree-4	19.44µ	19.21µ	~	0.400
BlockSizeScaling/10k_tx_1024_per_subtree-4	4.879µ	4.747µ	~	0.100
BlockSizeScaling/50k_tx_64_per_subtree-4	404.1µ	394.9µ	~	0.100
BlockSizeScaling/50k_tx_256_per_subtree-4	97.61µ	95.48µ	~	0.100
BlockSizeScaling/50k_tx_1024_per_subtree-4	23.63µ	23.20µ	~	0.400
SubtreeAllocations/small_subtrees_exists_check-4	165.3µ	158.8µ	~	0.100
SubtreeAllocations/small_subtrees_data_fetch-4	168.2µ	172.5µ	~	0.100
SubtreeAllocations/small_subtrees_full_validation-4	333.4µ	327.9µ	~	0.400
SubtreeAllocations/medium_subtrees_exists_check-4	9.936µ	9.427µ	~	0.100
SubtreeAllocations/medium_subtrees_data_fetch-4	10.58µ	10.35µ	~	0.700
SubtreeAllocations/medium_subtrees_full_validation-4	19.36µ	19.08µ	~	0.100
SubtreeAllocations/large_subtrees_exists_check-4	2.389µ	2.262µ	~	0.100
SubtreeAllocations/large_subtrees_data_fetch-4	2.463µ	2.536µ	~	0.100
SubtreeAllocations/large_subtrees_full_validation-4	4.846µ	4.770µ	~	0.100
_BufferPoolAllocation/16KB-4	3.874µ	3.889µ	~	0.700
_BufferPoolAllocation/32KB-4	8.224µ	9.218µ	~	1.000
_BufferPoolAllocation/64KB-4	16.18µ	21.12µ	~	0.100
_BufferPoolAllocation/128KB-4	32.73µ	35.74µ	~	0.100
_BufferPoolAllocation/512KB-4	111.3µ	136.3µ	~	0.100
_BufferPoolConcurrent/32KB-4	18.86µ	18.49µ	~	0.400
_BufferPoolConcurrent/64KB-4	29.83µ	29.61µ	~	0.700
_BufferPoolConcurrent/512KB-4	143.0µ	140.7µ	~	0.100
_SubtreeDeserializationWithBufferSizes/16KB-4	640.9µ	676.7µ	~	0.200
_SubtreeDeserializationWithBufferSizes/32KB-4	701.1µ	698.6µ	~	0.700
_SubtreeDeserializationWithBufferSizes/64KB-4	703.9µ	695.8µ	~	0.700
_SubtreeDeserializationWithBufferSizes/128KB-4	658.4µ	704.2µ	~	0.700
_SubtreeDeserializationWithBufferSizes/512KB-4	604.9µ	613.2µ	~	0.100
_SubtreeDataDeserializationWithBufferSizes/16KB-4	36.29m	36.44m	~	0.700
_SubtreeDataDeserializationWithBufferSizes/32KB-4	36.34m	36.63m	~	0.700
_SubtreeDataDeserializationWithBufferSizes/64KB-4	36.24m	35.99m	~	0.100
_SubtreeDataDeserializationWithBufferSizes/128KB-4	36.31m	36.15m	~	0.700
_SubtreeDataDeserializationWithBufferSizes/512KB-4	35.81m	36.01m	~	0.400
_PooledVsNonPooled/Pooled-4	832.9n	831.2n	~	0.200
_PooledVsNonPooled/NonPooled-4	7.771µ	8.075µ	~	0.400
_MemoryFootprint/Current_512KB_32concurrent-4	6.566µ	6.468µ	~	0.100
_MemoryFootprint/Proposed_32KB_32concurrent-4	9.386µ	9.370µ	~	1.000
_MemoryFootprint/Alternative_64KB_32concurrent-4	8.874µ	9.163µ	~	0.100
StoreBlock_Sequential/BelowCSVHeight-4	320.6µ	315.3µ	~	0.700
StoreBlock_Sequential/AboveCSVHeight-4	315.8µ	316.3µ	~	1.000
GetUtxoHashes-4	278.5n	278.6n	~	0.800
GetUtxoHashes_ManyOutputs-4	50.46µ	45.74µ	~	0.100
_NewMetaDataFromBytes-4	213.0n	213.4n	~	0.100
_Bytes-4	396.4n	395.8n	~	1.000
_MetaBytes-4	137.4n	141.3n	~	0.200

Threshold: >10% with p < 0.05 | Generated: 2026-06-05 14:36 UTC

oskarszoon

Approve. Drain on shutdown is the right fix and improves on the pre-fix "no close signal at all" behaviour.

Three items worth flagging:

batcher.Close() is non-blocking — it just close(b.done) and returns. Store.Close() does NOT wait for drain completion. Drain goroutines in aerospike/sql call batcher.Close() on each batcher sequentially (instant), then exit and signal done. So Store.Close() can return while workers still dispatch their final batches. go-batcher v2 doesn't expose a Wait() or Done() channel. Practical risk is low (services have exited by then) but worth knowing.
SQL drain order: unlock → get → spend → create. Comment says "state-mutating writers last" but spend (deletion) drains BEFORE create (insertion). For UTXO consistency the order is creates first, then spends. Low real-world impact given async nature, but the comment claim is wrong.
Aerospike timeout path: if the 30s deadline fires before drain finishes, Store.Close returns early but the drain goroutine keeps running. s.client.Close() is only called in the <-done branch — on timeout the Aerospike client is never explicitly closed. Not a problem on process exit but worth documenting.

Test changes are interface-compliance stubs only — no test of the drain behaviour itself. Not blocking given production evidence, but a future regression here would be silent.

github-actions · 2026-06-02T08:54:44Z

+	done := make(chan struct{})
+	go func() {
+		defer close(done)
+		// Drain in dependency order: state-mutating writers last so they


[Minor] The comment says "state-mutating writers last" but the actual order drains spend (line 318) before create (line 321). For UTXO correctness, creates should logically complete before spends. While the async batching may make this order difference irrelevant in practice, the comment claim is backwards.

Consider either:

Reordering to create then spend to match the comment

Clarifying the comment describes timeout-prioritization strategy rather than UTXO consistency ordering

The spend-before-create order is consistent across both sql.go and aerospike.go implementations, and is correct for this use case.

During shutdown, these batchers are draining independent queues from different blocks/transactions that were processed earlier. The spend batcher isn't spending outputs that the create batcher is currently creating — those cross-transaction dependencies were already resolved during block validation.

The comment "state-mutating writers last" means both spend and create (the durable state writers) are drained after the metadata/cache writers (unlock, get). The specific order between spend and create doesn't affect correctness here since they operate on disjoint UTXO sets from different historical transactions.

No change needed — the implementation and comment are both correct.

…own data race closeStores (the deferred shutdown drain added for the batched-write fix) reads d.daemonStores.main*Store under globalStoreMutex.RLock. Stores.Cleanup nils those same fields during test-daemon teardown but took no lock, so the two ran concurrently and the -race detector failed TestStore_TwoPhaseCommit and every daemon-based integration suite (smoketest/sequential/prunertest/chainintegrity/ legacy-sync). Take the write lock in Cleanup so the store-pointer reads and writes are synchronised. The drain-on-shutdown behaviour itself is unchanged.

…cess restart util.GetAerospikeClient hands out a process-wide client cached by host. The new Store.Close() (drain-on-shutdown) called s.client.Close() in place, leaving a CLOSED client in the cache. A subsequent store for the same host -- e.g. an in-process daemon restart that reuses the same Aerospike container -- then 'reused' the closed client and failed startup with INVALID_NODE_ERROR (Failed to register udfLUA / check index existence), which is why TestBlockAssemblyRestartWithExternalTransactionsAerospike failed on restart. Add util.CloseAerospikeClient(host) to close AND evict the cached client, and call it from Store.Close, so the next GetAerospikeClient builds a fresh client. The batcher-drain behaviour (the actual write-loss fix) is unchanged.

…hutdown # Conflicts: # stores/utxo/aerospike/aerospike.go

ordishs

Review of the shutdown-drain fix. The root-cause diagnosis and surrounding plumbing (deferring closeStores onto both paths, the fresh 30s context, the Cleanup() mutex fix) are correct. But two blocking issues mean the fix does not actually drain writes as written, and the aerospike path can panic-crash shutdown. Details inline.

ordishs · 2026-06-03T15:53:59Z

+		// client and a later store for the same host (e.g. an in-process
+		// daemon restart) would reuse it and fail with INVALID_NODE_ERROR.
+		if s.client != nil {
+			util.CloseAerospikeClient(s.url.Host)


[Resolved] This has been fixed by upgrading to go-batcher v2.0.4 (see go.mod line 197 and latest commit 954f4ec).

go-batcher v2.0.4 implements blocking Close: the worker drains the input channel and dispatches the residual batch before returning (see batcher.go:Close). The select/done/30s-context now correctly waits for the drain to complete before tearing down the client.

Original concern (now fixed):
The pinned go-batcher/v2 v2.0.3 had fire-and-forget Close (just close(b.done), returns immediately). The worker drained asynchronously, so Close returned before workers finished and the client was torn down mid-flight, losing writes.

ordishs · 2026-06-03T15:54:00Z

+		// state writers and are drained last to maximise their chance of
+		// committing before the deadline.
+		if s.setDAHBatcher != nil {
+			s.setDAHBatcher.Close()


High — close order is inverted and can panic-crash shutdown. The spend batcher's dispatch callback feeds the setDAH and increment batchers: sendSpendBatchLua → processSpendBatchResults (spend.go:769/779/786) → SetDAHForChildRecords (s.setDAHBatcher.Put, spend.go:905) and IncrementSpentRecords (s.incrementBatcher.Put, spend.go:1089).

Here setDAHBatcher and incrementBatcher are closed before spendBatcher. When spend drains a residual batch that completes a tx (very likely during catchup shutdown), it Puts into the already-closed setDAH/increment channels. go-batcher docs: "any further send will panic with 'send on closed channel'." Worse, those Puts run in bare go func()s, so the panic escapes go-batcher's dispatchAndRecord recovery and crashes the process — meaning storeBatcher (closed last) never drains. Strictly worse than today for the create path.

Producers must close before the batchers they feed. Reorder to drain store/get/spend first, then setDAH/increment/outpoint/locked. (The inline rationale "drained last to maximise their chance of committing before the deadline" is also backwards.) Note setLockedBatch re-enqueues into lockedBatcher (locked.go:76) — a similar smaller self-requeue hazard.

[Fixed] in b5d3a31.

Close now drains in dependency order — producers before consumers: store → spend → setDAH/increment → get/outpoint/locked. The spend drain callback (sendSpendBatchLua → handleExtraRecords/handleSpendSignal) enqueues into setDAHBatcher/incrementBatcher, which are now still open when spend drains.

Added TestClose_DrainsQueuedSpendCrossBatcher, which fully spends a multi-record tx through a spend batcher kept queued until Close so the cross-batcher enqueues run during the drain. Verified against a real Aerospike testcontainer: it reproduces panic: send on closed channel on the pre-fix close order and passes on the new one.

Also fixed a related self-referential edge in the same commit: setLockedBatch re-Put child records into its own lockedBatcher (same panic class, plus a drain deadlock). It now writes child records inline within the callback, matching the create path. Covered by TestClose_DrainsQueuedSetLockedMultiRecord.

ordishs · 2026-06-03T15:54:00Z

+			util.CloseAerospikeClient(s.url.Host)
+		}
+		return nil
+	case <-ctx.Done():


[Resolved] This has been fixed. The client close/evict now runs inside the drain goroutine (aerospike.go:559-561), so it happens even when ctx expires first.

The drain goroutine always completes its work (drains batchers, closes external store, evicts Aerospike client) regardless of whether the parent context times out. If ctx.Done() fires first, Close returns ctx.Err() to signal the timeout, but the background cleanup continues and completes.

Original concern (now fixed):
If ctx.Done() won the select race, the drain goroutine kept running but the Aerospike client was never closed or evicted (it was only closed on the done path). Combined with the async-Close issue, this meant timeout = silent dirty teardown.

ordishs · 2026-06-03T15:54:00Z

+	select {
+	case <-done:
+		if s.db != nil {
+			return s.db.Close()


[Resolved] This has been fixed by upgrading to go-batcher v2.0.4 (see go.mod line 197 and latest commit 954f4ec).

go-batcher v2.0.4 implements blocking Close that drains synchronously. The done channel now closes only after all batchers have finished draining and the db.Close() has been called (sql.go:349-351 shows db.Close runs inside the drain goroutine after all batchers complete).

Original concern (now fixed):
go-batcher v2.0.3 Close was fire-and-forget, so done closed before workers finished. The same async-Close problem that affected aerospike also affected sql.

ordishs · 2026-06-03T15:54:00Z

+// example an in-process daemon restart that reuses the same Aerospike container)
+// would "reuse" the closed client and fail with INVALID_NODE_ERROR. Evicting
+// here forces the next GetAerospikeClient to build a fresh client.
+func CloseAerospikeClient(host string) {


Medium — shared-client coupling. This closes and evicts the process-wide cached client for the host. If any other store/namespace on the same host shares this client, closing it breaks them too. Fine for a single-utxo-store daemon today, but worth a guard or an explicit comment that this assumes sole ownership of the host client.

[Addressed] The current code at util/aerospike.go:115-120 now includes explicit documentation of the sole-ownership assumption:

"IMPORTANT — assumes sole ownership of the host's client. Because the cache is keyed per host with no reference counting, this closes the client for ALL users of that host. That is correct for the current single-UTXO-store daemon (one store owns the host), but if more than one store/namespace on the same host ever shares this client, closing it here would break the others. Add reference counting before relying on shared ownership."

This makes the limitation explicit for future maintainers.

ordishs · 2026-06-03T15:54:00Z

 	return http.StatusOK, "OK", nil
 }

+func (m *MockUTXOStore) Close(context.Context) error { return nil }


The "1221 unit tests pass" claim doesn't cover what's broken: there's no test that a populated store actually drains before its connection closes, nor (for aerospike) that a tx-completing spend survives Close without panicking. A regression test — enqueue writes, Close, assert they landed in the backing store; and drive a DAH-triggering spend through aerospike Close — would have caught both blocking findings.

[Addressed] Comprehensive drain tests were added in subsequent commits:

Aerospike (commit b5d3a31, stores/utxo/aerospike/close_drain_test.go):

TestClose_DrainsQueuedSetLockedMultiRecord: verifies multi-record SetLocked is drained during Close

TestClose_DrainsQueuedSpendCrossBatcher: verifies spend drain callbacks that enqueue into setDAH/increment complete successfully

SQL (commit 954f4ec, stores/utxo/sql/close_drain_postgres_test.go):

TestClose_DrainsQueuedCreate_Postgres: verifies queued Create is persisted durably after Close drains

All tests run against real backend containers (Aerospike/Postgres) and verify persistence after Close.

…hutdown

Copilot

Pull request overview

This PR fixes a shutdown durability bug where UTXO writes queued in background batchers could be lost on SIGTERM, leaving the UTXO store inconsistent after restart (e.g., “failed to decorate previous outputs: N missing”). It does so by introducing an explicit Close(ctx) lifecycle on utxo.Store implementations and ensuring the daemon closes stores on all shutdown paths with a fresh timeout context.

Changes:

Add Close(ctx context.Context) error to the utxo.Store interface and implement it across UTXO backends and wrappers (aerospike/sql/nullstore/logger/txmetacache) plus update mocks.
Update daemon shutdown to always close stores (including UTXO) on both OS-signal and daemon.Stop() paths, using a dedicated 30s timeout context and closing UTXO before blob stores.
Add Aerospike client cache eviction helper (util.CloseAerospikeClient) to avoid reusing closed shared clients.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
util/aerospike.go	Adds `CloseAerospikeClient()` to close+evict cached Aerospike clients.
stores/utxo/Interface.go	Extends `utxo.Store` with `Close(ctx)` and documents shutdown expectations.
stores/utxo/aerospike/aerospike.go	Implements `Close(ctx)` draining for Aerospike-backed UTXO store.
stores/utxo/sql/sql.go	Implements `Close(ctx)` draining for SQL-backed UTXO store.
stores/utxo/nullstore/nullstore.go	Adds no-op `Close(ctx)` for null store.
stores/utxo/logger/logger.go	Adds `Close(ctx)` passthrough with debug logging.
stores/txmetacache/txmetacache.go	Delegates `Close(ctx)` to wrapped UTXO store.
stores/utxo/mock.go	Updates UTXO mock to implement new `Close(ctx)` method.
stores/utxo/logger/logger_test.go	Updates logger-store test mock to implement `Close(ctx)`.
stores/utxo/factory/utxo_test.go	Updates factory test mock to implement `Close(ctx)`.
services/blockpersister/Server_test.go	Updates blockpersister test mock to implement `Close(ctx)`.
daemon/daemon.go	Defers store shutdown on all exit paths; adds timeout context and closes UTXO first.
daemon/daemon_stores.go	Synchronizes `Cleanup()` with shutdown close via `globalStoreMutex`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

freemans13 · 2026-06-04T08:29:45Z

+// The batchers run in background=true mode for create/get/unlock and
+// background=false for spend — every Put returns to the caller before the
+// underlying SQL write commits. Closing each batcher invokes go-batcher's
+// shutdown drain (see batcher.go:Close), which closes the input channel,


Good catch — fixed the doc. The spend batcher is background=false, so its Put blocks until the batch callback finishes and nothing can be lost on its input channel. Reworded to state that only create/get/unlock are background=true (Put-acks-on-enqueue), while draining the spend batcher here just flushes a partial batch and releases its worker.

freemans13 · 2026-06-04T08:29:47Z

+	select {
+	case <-done:
+		if s.db != nil {
+			return s.db.Close()
+		}


Fixed. db.Close() now runs inside the drain goroutine, after the batchers drain, so the connection pool is closed even when ctx has already expired (it no longer hangs off the <-done branch only). The db.Close error is captured in a dbErr var and returned when the drain completes within the deadline; on ctx timeout we still return ctx.Err() but the db is closed best-effort in the background.

freemans13 · 2026-06-04T08:29:48Z

+	// finished using the stores. Deferring it here, after the select that
+	// waits for service exit, gives us a single ordered teardown point
+	// that runs even if the body returns through a signal-driven exit
+	// rather than an explicit Stop() call.


Reworded the comment to be accurate: the defer is registered before the select but runs on function return, i.e. after the select has observed service exit on either path.

freemans13 · 2026-06-04T08:29:50Z

+	// Implementations MUST wait for outstanding batched writes to complete
+	// before returning. Returning before pending writes have committed risks
+	// silently losing UTXO state: callers (block validation, legacy sync)
+	// will have already received successful responses for those writes and
+	// will have committed the parent block, but on restart the UTXOs will


Updated the contract wording to explicitly allow ctx-driven early return: implementations MUST wait for outstanding writes unless ctx expires first, and a returned ctx error means 'drain not confirmed complete' (the underlying drain/release may continue best-effort).

freemans13 · 2026-06-04T08:29:51Z

+	select {
+	case <-done:
+		// Drains complete; close the Aerospike client. The client is shared
+		// per host via util's connection cache, so close-and-evict it rather
+		// than closing in place — otherwise the cache would keep a closed


Fixed both. The client close/evict (util.CloseAerospikeClient) now runs inside the drain goroutine instead of on the <-done branch, so it happens even when ctx expires first — avoiding the stale cache entry that would break a later in-process restart with INVALID_NODE_ERROR. Also added s.externalStore.Close(ctx) after the batchers drain so the external blob store created in Store.New is no longer leaked.

- sql Close: close DB inside drain goroutine so the connection pool is not leaked on ctx timeout; correct doc re spend batcher (background=false). - aerospike Close: close/evict client and external blob store inside the drain goroutine so they run even when ctx expires first; close the external store that Store.New creates. - utxo Interface Close: reconcile MUST-wait wording with ctx-driven early return. - daemon: correct misleading comment about defer placement vs the select. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 7 comments.

github-actions · 2026-06-05T14:25:51Z

+		// Drain in dependency order: state-mutating writers last so they
+		// have the best chance of committing before the deadline.


This is a minor wording imprecision. The comment says "state-mutating writers last" but unlock (SetLocked) is closed first. However, the actual order is sound - unlock is a best-effort metadata operation while create/spend are the critical durable UTXO writes that should have maximum time to complete. Suggested clarification: "drain best-effort metadata operations first, then critical create/spend writes last".

github-actions · 2026-06-05T14:25:28Z

+// The create/get/unlock batchers run in background=true mode — their Put
+// returns to the caller before the underlying SQL write commits, so a
+// SIGTERM mid-flight would silently lose queued writes without this drain.


This remains a valid concern. The comment at sql.go:306-308 states operations "return to the caller before the underlying SQL write commits" but Create/Get/SetLocked actually wait on done channels for batch completion. The actual issue being fixed is flushing partially-filled batches during shutdown, not preventing premature caller success. Suggested rewording for accuracy.

github-actions · 2026-06-05T14:25:26Z

+// "shutdown"). Without this drain, a SIGTERM mid-flight silently loses
+// the in-channel items: the caller has already received a successful
+// Create/Spend/etc. return because background batchers ack on enqueue,
+// not on dispatch, so the parent block gets committed elsewhere but the
+// UTXO write never reaches Aerospike. On restart, blocks that spend


This remains a valid concern. The comment at aerospike.go:529 states "background batchers ack on enqueue" but the actual Create/Spend implementations wait on per-item result channels (e.g., Create waits on resultCh, Spend waits on errCh). The real risk is partial/unflushed batches during shutdown, not that callers got premature success responses. Suggested rewording for accuracy.

github-actions · 2026-06-05T14:25:30Z

+	// before pending writes have committed risks silently losing UTXO state:
+	// callers (block validation, legacy sync) will have already received
+	// successful responses for those writes and will have committed the
+	// parent block, but on restart the UTXOs will be missing — breaking
+	// subsequent blocks that spend them.


This remains a valid concern. The Interface.Close contract at line 259 states "callers … will have already received successful responses" but this is not universally true - both Aerospike and SQL implementations wait for batch callbacks before returning to callers. Suggested rewording: "callers may have observed success" or "higher-level state may have been committed" to maintain accuracy across implementations.

github-actions · 2026-06-05T14:24:47Z

+// CloseAerospikeClient closes the cached aerospike client for the given host and
+// removes it from the process-wide connection cache. GetAerospikeClient hands
+// out a shared client per host, so closing the client without evicting it would
+// leave a closed client in the cache: a subsequent store for the same host (for


[Addressed] The current code at util/aerospike.go:111-113 now includes explicit documentation clarifying that host MUST be url.Host (may include port), not just a hostname. The comment warns: "pass the same url.Host, not a bare hostname, or the eviction silently no-ops."

github-actions · 2026-06-05T14:25:02Z

+func (s *Store) Close(ctx context.Context) error {
+	done := make(chan struct{})
+
+	var extErr error
+
+	go func() {
+		defer close(done)
+		// Order is intentional: stop accepting new write ops first (no-op
+		// here because Put/PutCtx don't have a separate close-gate), then
+		// drain in dependency order. setDAH/locked/outpoint/increment are
+		// best-effort metadata writers; store/spend are the durable UTXO
+		// state writers and are drained last to maximise their chance of
+		// committing before the deadline.
+		if s.setDAHBatcher != nil {
+			s.setDAHBatcher.Close()
+		}
+		if s.lockedBatcher != nil {
+			s.lockedBatcher.Close()
+		}
+		if s.outpointBatcher != nil {
+			s.outpointBatcher.Close()
+		}
+		if s.incrementBatcher != nil {
+			s.incrementBatcher.Close()
+		}
+		if s.getBatcher != nil {
+			s.getBatcher.Close()
+		}
+		if s.spendBatcher != nil {
+			s.spendBatcher.Close()
+		}
+		if s.storeBatcher != nil {
+			s.storeBatcher.Close()
+		}
+
+		// Drains complete; close the external blob store (created in
+		// Store.New) so its handles/connections are not leaked.
+		if s.externalStore != nil {
+			extErr = s.externalStore.Close(ctx)
+		}
+
+		// Close the Aerospike client. The client is shared per host via
+		// util's connection cache, so close-and-evict it rather than closing
+		// in place — otherwise the cache would keep a closed client and a
+		// later store for the same host (e.g. an in-process daemon restart)
+		// would reuse it and fail with INVALID_NODE_ERROR. Done inside the
+		// goroutine so it still runs even when ctx has already expired.
+		if s.client != nil {
+			util.CloseAerospikeClient(s.url.Host)
+		}
+	}()
+
+	select {
+	case <-done:
+		return extErr
+	case <-ctx.Done():
+		return ctx.Err()
+	}


[Addressed] Test coverage was added in commit 954f4ec and b5d3a31:

stores/utxo/aerospike/close_drain_test.go added TestClose_DrainsQueuedSetLockedMultiRecord and TestClose_DrainsQueuedSpendCrossBatcher (lines 31-172)

stores/utxo/sql/close_drain_postgres_test.go added TestClose_DrainsQueuedCreate_Postgres (lines 62-133)

These tests verify Close() drains queued writes before connection teardown.

github-actions · 2026-06-05T14:25:05Z

+func (s *Store) Close(ctx context.Context) error {
+	done := make(chan struct{})
+
+	var dbErr error
+
+	go func() {
+		defer close(done)
+		// Drain in dependency order: state-mutating writers last so they
+		// have the best chance of committing before the deadline.
+		if s.unlockBatcher != nil {
+			s.unlockBatcher.Close()
+		}
+		if s.getBatcher != nil {
+			s.getBatcher.Close()
+		}
+		if s.spendBatcher != nil {
+			s.spendBatcher.Close()
+		}
+		if s.createBatcher != nil {
+			s.createBatcher.Close()
+		}
+		// Always close the DB after the batchers drain, even if ctx has
+		// already expired, so the connection pool is not leaked.
+		if s.db != nil {
+			dbErr = s.db.Close()
+		}
+	}()
+
+	select {
+	case <-done:
+		return dbErr
+	case <-ctx.Done():
+		return ctx.Err()
+	}
+}


[Addressed] Test coverage was added in commit 954f4ec. See stores/utxo/sql/close_drain_postgres_test.go:62-133, TestClose_DrainsQueuedCreate_Postgres. This test verifies that:

A queued Create in the background batcher is drained by Close()

The transaction persists durably (verified by reopening the DB)

Close() completes without error even when the batcher would otherwise timeout

socket-security · 2026-06-04T13:50:58Z

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff	Package	Supply Chain Security	Vulnerability	Quality	Maintenance	License
	golang/github.com/bsv-blockchain/go-batcher/v2@v2.0.3 ⏵ v2.0.4

View full report

The graceful-shutdown fix relies on batcher.Close() blocking until queued writes are drained. go-batcher v2.0.3 Close() was fire-and-forget, so the store closed its DB/Aerospike connection before the batcher worker dispatched the residual batch — the very writes this PR aims to preserve could still be lost. go-batcher v2.0.4 (bsv-blockchain/go-batcher#111) makes Close() block until the worker has drained the channel, dispatched the residual batch through fn, and all background dispatch goroutines have finished; it is also now idempotent. With the bump, the store Close() methods already in this PR are correct as written. Adds TestClose_DrainsQueuedCreate_Postgres: configures the create batcher so it only flushes on shutdown, enqueues a Create, calls Close, then re-opens the database and confirms the transaction persisted. Runs under the Postgres testcontainer (CI/GoFortress); skipped in -short. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

github-actions · 2026-06-04T14:10:40Z

+		// best-effort metadata writers; store/spend are the durable UTXO
+		// state writers and are drained last to maximise their chance of
+		// committing before the deadline.
+		if s.setDAHBatcher != nil {


[Critical] Batcher close order inverted — will panic-crash on shutdown

The spend batcher dispatch callback calls SetDAHForChildRecords and IncrementSpentRecords (spend.go:905, 1089), which enqueue into setDAHBatcher and incrementBatcher. Closing these consumer batchers (lines 525-536) BEFORE the producer spendBatcher (line 540-542) means when spend drains its residual batch, it will Put into already-closed channels.

go-batcher behavior: when a batcher is closed, subsequent Put calls panic with send on closed channel. Those Puts run in bare goroutines without recovery, so the panic escapes and crashes the process.

Fix: Reverse the order. Close producers (store/spend/get) BEFORE consumers (setDAH/increment/outpoint/locked).

This was previously flagged by ordishs in comment 3349987293 but hasn't been addressed yet.

[Fixed] in b5d3a31.

Close now drains in dependency order — producers before consumers: store → spend → setDAH/increment → get/outpoint/locked. The spend drain callback (sendSpendBatchLua → handleExtraRecords/handleSpendSignal) enqueues into setDAHBatcher/incrementBatcher, which are now still open when spend drains.

Added TestClose_DrainsQueuedSpendCrossBatcher, which fully spends a multi-record tx through a spend batcher kept queued until Close so the cross-batcher enqueues run during the drain. Verified against a real Aerospike testcontainer: it reproduces panic: send on closed channel on the pre-fix close order and passes on the new one.

Also fixed a related self-referential edge in the same commit: setLockedBatch re-Put child records into its own lockedBatcher (same panic class, plus a drain deadlock). It now writes child records inline within the callback, matching the create path. Covered by TestClose_DrainsQueuedSetLockedMultiRecord.

Addresses review findings on PR bsv-blockchain#928 surfaced once go-batcher v2.0.4 made Close block until drain. With a real draining Close, a batcher whose drain callback enqueues into another batcher panics ("send on closed channel") if the downstream batcher is already closed. 1. Close order (Critical, flagged by reviewers): the spend batcher's callback (sendSpendBatchLua -> handleExtraRecords/handleSpendSignal) enqueues into setDAHBatcher and incrementBatcher on a fully-spent tx. Close closed those consumers before spend. Reordered to drain producers first (store, spend), then spend's consumers (setDAH, increment), then the independent batchers (get, outpoint, locked). 2. lockedBatcher self-requeue (panic + deadlock): setLockedBatch re-Put child records into its own batcher for multi-record txs and blocked on the result. During a draining Close that Put hits a closed channel and could never be serviced. Fixed by handling child records inline within the callback (one extra BatchOperate), matching how the create path writes a tx's extra/external records. Removes the only self-referential batcher edge. 3. CloseAerospikeClient: documented the sole-ownership assumption (closes the process-wide shared client for the host) and that the arg must be url.Host. Adds aerospike Close regression tests (verified against a real Aerospike testcontainer; each confirmed to panic on the pre-fix code): - TestClose_DrainsQueuedSetLockedMultiRecord - TestClose_DrainsQueuedSpendCrossBatcher Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

ordishs

Approve — solid, careful fix for a real data-integrity bug

Verified the critical claims against the codebase rather than the PR description alone. The hard parts hold up.

Verified correct

Aerospike close-order. Traced the edges: sendSpendBatchLua → processSpendBatchResults → handleSpendSignal → handleExtraRecords/SetDAHForChildRecords enqueue into setDAHBatcher (spend.go:905) and incrementBatcher (spend.go:1089), so spend must close before them — which it does. sendStoreBatch/sendSetDAHBatch/sendIncrementBatch have no outbound batcher edges, so the "store feeds nothing; setDAH/increment are independent leaves" reasoning is accurate.
SQL close-order is safe — all four PutCtx calls are caller-driven, none in a batch callback, so no cross-batcher drain edges.
Deferred teardown runs strictly after services finish on both paths; old in-select call correctly removed (no double-close).
Cleanup() race fix closes a genuine read/write race on the store pointers.
locked.go re-entrancy fix is the right approach; errCh signalled exactly once per item.

Follow-ups (non-blocking)

(Cheap, in-scope) locked.go:110-113 — the unparseable-response branch continues without signalling errCh, so SetLocked's return <-errCh leaks that goroutine forever. Pre-existing, but you're rewriting this exact branch; signalling a ProcessingError here would close the leak.
Drain-order regression tests are integration-only (testing.Short() skip + testcontainers), so the consensus-relevant ordering isn't covered by make test. Acceptable, but a future reorder of Close() won't be caught by unit CI.
Confirm go-batcher PutCtx doesn't drop on a cancelled ctx. During a real SIGTERM drain, handleExtraRecords (spend.go:980) calls setDAHBatcher.PutCtx(ctx, …) with the original spend request's (likely-cancelled) ctx. Low risk since the primary setDAH/increment writes use ctx-less .Put, but the integration test only exercises the non-cancelled path.
Confirm externalStore is a distinct instance from the subtree/tx/temp blob stores — if the 30s ctx expires, the aerospike best-effort goroutine keeps running and calls externalStore.Close(ctx) while the daemon proceeds to close the blob stores.
Shared 30s deadline across all four stores: a slow utxo drain eats the blob stores' budget. Consider a setting rather than a hardcoded literal.
CloseAerospikeClient per-host eviction has no refcount — documented and correct for the single-store daemon; just ensure no test shares an Aerospike host across two live stores.

Also: the live-mainnet validation in the test plan is still unchecked. Given this is a state-corruption fix, confirming the "closing utxo store" log line + clean next-block-after-restart on a real node before merge is worth it.

Resolves a semantic conflict in stores/utxo/aerospike/locked.go where both branches rewrote setLockedBatch: - upstream bsv-blockchain#1025 ("stop caller-side leak"): trySignal, handled[] tracking, panic-recovery defer (signalBatchPanic), placeholderKey key-error handling, batchOperate wrapper, and bounded waitForLockedResult — but kept the child self-requeue (lockedBatcher.PutCtx from inside the callback). - this branch (bsv-blockchain#928): removed the self-requeue, handling child records inline, because bsv-blockchain#928 closes the lockedBatcher on shutdown and the self-requeue then Puts into a closed channel (panic in an errgroup goroutine -> process crash), which upstream's bounded wait does not prevent. Resolution: keep ALL of upstream's scaffolding and replace the child self-requeue with the inline child-record BatchOperate (using batchOperate + trySignal). Verified: build + go vet (incl. -tags aerospike); aerospike close-drain + lock/multirecord/duplicate-spend tests pass against a real container; daemon closeStores tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

sonarqubecloud · 2026-06-05T14:53:53Z

Quality Gate failed

Failed conditions
7.5% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

github-actions Bot reviewed May 21, 2026

View reviewed changes

oskarszoon approved these changes May 21, 2026

View reviewed changes

freemans13 self-assigned this May 22, 2026

freemans13 added 2 commits May 29, 2026 12:58

Merge remote-tracking branch 'upstream/main' into HEAD

24bc2b5

Merge upstream/main into stu/utxo-graceful-shutdown

1e7ed8f

github-actions Bot reviewed Jun 2, 2026

View reviewed changes

freemans13 added 3 commits June 2, 2026 11:13

Merge remote-tracking branch 'upstream/main' into HEAD

26b20c2

freemans13 requested a review from ordishs June 2, 2026 18:20

Merge remote-tracking branch 'upstream/main' into stu/utxo-graceful-s…

7f11a44

…hutdown # Conflicts: # stores/utxo/aerospike/aerospike.go

ordishs reviewed Jun 3, 2026

View reviewed changes

Merge remote-tracking branch 'upstream/main' into stu/utxo-graceful-s…

9c8138a

…hutdown

freemans13 requested a review from Copilot June 4, 2026 08:19

Copilot started reviewing on behalf of freemans13 June 4, 2026 08:19 View session

Copilot AI reviewed Jun 4, 2026

View reviewed changes

freemans13 requested a review from Copilot June 4, 2026 08:45

Copilot started reviewing on behalf of freemans13 June 4, 2026 08:45 View session

Copilot AI reviewed Jun 4, 2026

View reviewed changes

freemans13 force-pushed the stu/utxo-graceful-shutdown branch from ed55281 to 954f4ec Compare June 4, 2026 14:06

github-actions Bot reviewed Jun 4, 2026

View reviewed changes

freemans13 requested a review from ordishs June 4, 2026 17:33

ordishs approved these changes Jun 5, 2026

View reviewed changes

freemans13 merged commit f68b48c into bsv-blockchain:main Jun 5, 2026
36 of 37 checks passed

		// Drain in dependency order: state-mutating writers last so they
		// have the best chance of committing before the deadline.

Conversation

freemans13 commented May 21, 2026

The bug

The cause

The fix

Test plan

Uh oh!

github-actions Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmark Comparison Report

Summary

Uh oh!

oskarszoon left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

ordishs left a comment

Choose a reason for hiding this comment

Uh oh!

ordishs Jun 3, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ordishs Jun 3, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ordishs Jun 3, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

github-actions Bot Jun 5, 2026

Choose a reason for hiding this comment

github-actions Bot commented May 21, 2026 •

edited

Loading

github-actions Bot commented May 21, 2026 •

edited

Loading

ordishs Jun 3, 2026 •

edited by github-actions Bot

Loading

ordishs Jun 3, 2026 •

edited by github-actions Bot

Loading

ordishs Jun 3, 2026 •

edited by github-actions Bot

Loading

socket-security Bot commented Jun 4, 2026 •

edited

Loading