fix(utxo): drain in-flight batched writes on shutdown#928
Conversation
Loss-of-write on shutdown — the smoking gun for missing-UTXO errors
during legacy catchup after a teranode restart.
All utxo.Store backends (aerospike, sql, postgres on the
queue-utxo-store branch) front their writes with go-batcher in
background=true mode. Every Create/Spend/Get/Unlock call enqueues an
item and returns immediately; the actual DB or Aerospike write happens
in a worker goroutine that consumes the batcher's channel.
When the process receives SIGTERM:
1. util/servicemanager/service_manager.go:66 cancels the global
context.
2. Services unwind via context cancellation and sm.Wait() returns.
3. daemon/daemon.go:376 takes the 'case err = <-waitErr' branch.
4. The function returns; the process exits.
Nothing in this path closes the utxo store, so the batcher workers
are killed mid-flight. Items still in the input channel — and any
batch dispatched but not yet acked by Aerospike/Postgres — are lost.
The caller had already received a successful Create/Spend return, the
parent block had been committed to the blockchain store via gRPC, and
on restart the chain advances past blocks whose UTXOs never reached
durable storage. Subsequent blocks that spend those outputs fail with
'failed to decorate previous outputs: N missing' and the node is
stuck in an unrecoverable header-resync retry loop until the DB is
wiped.
This commit fixes the shutdown path:
- Adds Close(ctx) to the utxo.Store interface (consistent naming with
blob stores' existing Close(ctx)).
- Implements Close on every backend:
* aerospike: closes all 7 batchers in dependency order (metadata
writers first, durable state-mutating writers last), then the
Aerospike client.
* sql: closes the 4 batchers, then s.db.Close().
* nullstore: no-op (owns no batchers).
* logger / txmetacache wrappers: delegate to the inner store.
* mock / test mocks updated.
- Daemon shutdown changes:
* daemon/daemon.go now defers closeStores() at the top of the main
body so it runs on BOTH shutdown paths (signal-driven and
explicit Stop()), not just doneCh.
* closeStores allocates a fresh 30 s context instead of reusing
the already-cancelled sm.Ctx — the cancelled context is what
drove the service exit in the first place, and we want the drains
to actually run.
* closeStores now closes the utxo store before tx/subtree/temp
stores so that any utxo writes referencing blob-stored subtree
data finish before the blob stores tear down.
Each batcher's Close() blocks until the input channel is drained and
the residual batch has been dispatched (see go-batcher v2 batcher.go
~line 437–449: the worker selects on done, closes the input channel,
drains it into the current batch, and dispatches the remainder with
reason 'shutdown' before returning).
Tests: 1221 unit tests across stores/utxo/..., stores/txmetacache,
services/blockpersister, daemon/... still pass.
|
🤖 Claude Code Review Status: Complete Summary: This PR addresses a critical data loss bug where in-flight batched UTXO writes were lost on shutdown, leading to corrupted state and stuck nodes. The fix adds Close() to all UTXO stores, ensures proper drain of background workers, and updates the daemon shutdown path. Code Quality: The implementation is thorough, well-documented, and includes regression tests. The fixes from multiple review rounds (batcher close order, self-requeue elimination, go-batcher v2.0.4 upgrade) demonstrate careful iteration. Findings: Several documentation clarity issues remain from previous review rounds. Most are minor wording improvements. The core shutdown logic is sound. See existing inline comment threads for details on:
Recommendation: The critical shutdown bug fix is correct and well-tested. Outstanding items are documentation improvements that do not block functionality. |
| return nil | ||
| case <-ctx.Done(): | ||
| return ctx.Err() | ||
| } |
There was a problem hiding this comment.
The Close() implementations in aerospike.go and sql.go handle context timeout by returning ctx.Err() while allowing the drain to continue in the background goroutine. However, this creates a potential issue: if the context times out, the function returns an error but the goroutine continues running and may still attempt to close s.client or s.db after the function has already returned.
This could lead to:
- Races if the caller doesn't expect resources to be closed after an error return
- Delayed resource cleanup with no visibility to the caller
Consider either:
- Canceling the drain when context times out (accept partial loss with timeout)
- Waiting for drain completion regardless of context timeout, but logging the timeout
- Returning success even on timeout if drain completes (current behavior is arguably correct but undocumented)
The current behavior may be intentional ("best-effort drain"), but it should be documented in the function comment that resource cleanup continues in the background even after error return.
There was a problem hiding this comment.
Resolved: This concern has been addressed. Both implementations now document the best-effort drain behavior:
- aerospike.go:476-479: "if the context deadline expires first, an error is returned but draining continues best-effort"
- sql.go:302-304: "If the context expires, the function returns its error but the drain continues best-effort"
The behavior is intentional and now properly documented.
Integration-only adapter. stu/utxo-graceful-shutdown (PR bsv-blockchain#928) added Close(ctx) to the utxo.Store interface. The postgres store on this branch (queue-utxo-store PR bsv-blockchain#684) still had the older Stop() name. Rename + adapt signature + update the cleanup callsite. Will be dropped/folded when both PRs land upstream.
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-06-05 14:36 UTC |
oskarszoon
left a comment
There was a problem hiding this comment.
Approve. Drain on shutdown is the right fix and improves on the pre-fix "no close signal at all" behaviour.
Three items worth flagging:
-
batcher.Close()is non-blocking — it justclose(b.done)and returns.Store.Close()does NOT wait for drain completion. Drain goroutines in aerospike/sql callbatcher.Close()on each batcher sequentially (instant), then exit and signaldone. SoStore.Close()can return while workers still dispatch their final batches. go-batcher v2 doesn't expose aWait()orDone()channel. Practical risk is low (services have exited by then) but worth knowing. -
SQL drain order: unlock → get → spend → create. Comment says "state-mutating writers last" but spend (deletion) drains BEFORE create (insertion). For UTXO consistency the order is creates first, then spends. Low real-world impact given async nature, but the comment claim is wrong.
-
Aerospike timeout path: if the 30s deadline fires before drain finishes,
Store.Closereturns early but the drain goroutine keeps running.s.client.Close()is only called in the<-donebranch — on timeout the Aerospike client is never explicitly closed. Not a problem on process exit but worth documenting.
Test changes are interface-compliance stubs only — no test of the drain behaviour itself. Not blocking given production evidence, but a future regression here would be silent.
| done := make(chan struct{}) | ||
| go func() { | ||
| defer close(done) | ||
| // Drain in dependency order: state-mutating writers last so they |
There was a problem hiding this comment.
[Minor] The comment says "state-mutating writers last" but the actual order drains spend (line 318) before create (line 321). For UTXO correctness, creates should logically complete before spends. While the async batching may make this order difference irrelevant in practice, the comment claim is backwards.
Consider either:
- Reordering to create then spend to match the comment
- Clarifying the comment describes timeout-prioritization strategy rather than UTXO consistency ordering
There was a problem hiding this comment.
The spend-before-create order is consistent across both sql.go and aerospike.go implementations, and is correct for this use case.
During shutdown, these batchers are draining independent queues from different blocks/transactions that were processed earlier. The spend batcher isn't spending outputs that the create batcher is currently creating — those cross-transaction dependencies were already resolved during block validation.
The comment "state-mutating writers last" means both spend and create (the durable state writers) are drained after the metadata/cache writers (unlock, get). The specific order between spend and create doesn't affect correctness here since they operate on disjoint UTXO sets from different historical transactions.
No change needed — the implementation and comment are both correct.
…own data race closeStores (the deferred shutdown drain added for the batched-write fix) reads d.daemonStores.main*Store under globalStoreMutex.RLock. Stores.Cleanup nils those same fields during test-daemon teardown but took no lock, so the two ran concurrently and the -race detector failed TestStore_TwoPhaseCommit and every daemon-based integration suite (smoketest/sequential/prunertest/chainintegrity/ legacy-sync). Take the write lock in Cleanup so the store-pointer reads and writes are synchronised. The drain-on-shutdown behaviour itself is unchanged.
…cess restart util.GetAerospikeClient hands out a process-wide client cached by host. The new Store.Close() (drain-on-shutdown) called s.client.Close() in place, leaving a CLOSED client in the cache. A subsequent store for the same host -- e.g. an in-process daemon restart that reuses the same Aerospike container -- then 'reused' the closed client and failed startup with INVALID_NODE_ERROR (Failed to register udfLUA / check index existence), which is why TestBlockAssemblyRestartWithExternalTransactionsAerospike failed on restart. Add util.CloseAerospikeClient(host) to close AND evict the cached client, and call it from Store.Close, so the next GetAerospikeClient builds a fresh client. The batcher-drain behaviour (the actual write-loss fix) is unchanged.
…hutdown # Conflicts: # stores/utxo/aerospike/aerospike.go
ordishs
left a comment
There was a problem hiding this comment.
Review of the shutdown-drain fix. The root-cause diagnosis and surrounding plumbing (deferring closeStores onto both paths, the fresh 30s context, the Cleanup() mutex fix) are correct. But two blocking issues mean the fix does not actually drain writes as written, and the aerospike path can panic-crash shutdown. Details inline.
| // client and a later store for the same host (e.g. an in-process | ||
| // daemon restart) would reuse it and fail with INVALID_NODE_ERROR. | ||
| if s.client != nil { | ||
| util.CloseAerospikeClient(s.url.Host) |
There was a problem hiding this comment.
[Resolved] This has been fixed by upgrading to go-batcher v2.0.4 (see go.mod line 197 and latest commit 954f4ec).
go-batcher v2.0.4 implements blocking Close: the worker drains the input channel and dispatches the residual batch before returning (see batcher.go:Close). The select/done/30s-context now correctly waits for the drain to complete before tearing down the client.
Original concern (now fixed):
The pinned go-batcher/v2 v2.0.3 had fire-and-forget Close (just close(b.done), returns immediately). The worker drained asynchronously, so Close returned before workers finished and the client was torn down mid-flight, losing writes.
| // state writers and are drained last to maximise their chance of | ||
| // committing before the deadline. | ||
| if s.setDAHBatcher != nil { | ||
| s.setDAHBatcher.Close() |
There was a problem hiding this comment.
High — close order is inverted and can panic-crash shutdown. The spend batcher's dispatch callback feeds the setDAH and increment batchers: sendSpendBatchLua → processSpendBatchResults (spend.go:769/779/786) → SetDAHForChildRecords (s.setDAHBatcher.Put, spend.go:905) and IncrementSpentRecords (s.incrementBatcher.Put, spend.go:1089).
Here setDAHBatcher and incrementBatcher are closed before spendBatcher. When spend drains a residual batch that completes a tx (very likely during catchup shutdown), it Puts into the already-closed setDAH/increment channels. go-batcher docs: "any further send will panic with 'send on closed channel'." Worse, those Puts run in bare go func()s, so the panic escapes go-batcher's dispatchAndRecord recovery and crashes the process — meaning storeBatcher (closed last) never drains. Strictly worse than today for the create path.
Producers must close before the batchers they feed. Reorder to drain store/get/spend first, then setDAH/increment/outpoint/locked. (The inline rationale "drained last to maximise their chance of committing before the deadline" is also backwards.) Note setLockedBatch re-enqueues into lockedBatcher (locked.go:76) — a similar smaller self-requeue hazard.
There was a problem hiding this comment.
[Fixed] in b5d3a31.
Close now drains in dependency order — producers before consumers: store → spend → setDAH/increment → get/outpoint/locked. The spend drain callback (sendSpendBatchLua → handleExtraRecords/handleSpendSignal) enqueues into setDAHBatcher/incrementBatcher, which are now still open when spend drains.
Added TestClose_DrainsQueuedSpendCrossBatcher, which fully spends a multi-record tx through a spend batcher kept queued until Close so the cross-batcher enqueues run during the drain. Verified against a real Aerospike testcontainer: it reproduces panic: send on closed channel on the pre-fix close order and passes on the new one.
Also fixed a related self-referential edge in the same commit: setLockedBatch re-Put child records into its own lockedBatcher (same panic class, plus a drain deadlock). It now writes child records inline within the callback, matching the create path. Covered by TestClose_DrainsQueuedSetLockedMultiRecord.
| util.CloseAerospikeClient(s.url.Host) | ||
| } | ||
| return nil | ||
| case <-ctx.Done(): |
There was a problem hiding this comment.
[Resolved] This has been fixed. The client close/evict now runs inside the drain goroutine (aerospike.go:559-561), so it happens even when ctx expires first.
The drain goroutine always completes its work (drains batchers, closes external store, evicts Aerospike client) regardless of whether the parent context times out. If ctx.Done() fires first, Close returns ctx.Err() to signal the timeout, but the background cleanup continues and completes.
Original concern (now fixed):
If ctx.Done() won the select race, the drain goroutine kept running but the Aerospike client was never closed or evicted (it was only closed on the done path). Combined with the async-Close issue, this meant timeout = silent dirty teardown.
| select { | ||
| case <-done: | ||
| if s.db != nil { | ||
| return s.db.Close() |
There was a problem hiding this comment.
[Resolved] This has been fixed by upgrading to go-batcher v2.0.4 (see go.mod line 197 and latest commit 954f4ec).
go-batcher v2.0.4 implements blocking Close that drains synchronously. The done channel now closes only after all batchers have finished draining and the db.Close() has been called (sql.go:349-351 shows db.Close runs inside the drain goroutine after all batchers complete).
Original concern (now fixed):
go-batcher v2.0.3 Close was fire-and-forget, so done closed before workers finished. The same async-Close problem that affected aerospike also affected sql.
| // example an in-process daemon restart that reuses the same Aerospike container) | ||
| // would "reuse" the closed client and fail with INVALID_NODE_ERROR. Evicting | ||
| // here forces the next GetAerospikeClient to build a fresh client. | ||
| func CloseAerospikeClient(host string) { |
There was a problem hiding this comment.
Medium — shared-client coupling. This closes and evicts the process-wide cached client for the host. If any other store/namespace on the same host shares this client, closing it breaks them too. Fine for a single-utxo-store daemon today, but worth a guard or an explicit comment that this assumes sole ownership of the host client.
There was a problem hiding this comment.
[Addressed] The current code at util/aerospike.go:115-120 now includes explicit documentation of the sole-ownership assumption:
"IMPORTANT — assumes sole ownership of the host's client. Because the cache is keyed per host with no reference counting, this closes the client for ALL users of that host. That is correct for the current single-UTXO-store daemon (one store owns the host), but if more than one store/namespace on the same host ever shares this client, closing it here would break the others. Add reference counting before relying on shared ownership."
This makes the limitation explicit for future maintainers.
| return http.StatusOK, "OK", nil | ||
| } | ||
|
|
||
| func (m *MockUTXOStore) Close(context.Context) error { return nil } |
There was a problem hiding this comment.
The "1221 unit tests pass" claim doesn't cover what's broken: there's no test that a populated store actually drains before its connection closes, nor (for aerospike) that a tx-completing spend survives Close without panicking. A regression test — enqueue writes, Close, assert they landed in the backing store; and drive a DAH-triggering spend through aerospike Close — would have caught both blocking findings.
There was a problem hiding this comment.
[Addressed] Comprehensive drain tests were added in subsequent commits:
Aerospike (commit b5d3a31, stores/utxo/aerospike/close_drain_test.go):
- TestClose_DrainsQueuedSetLockedMultiRecord: verifies multi-record SetLocked is drained during Close
- TestClose_DrainsQueuedSpendCrossBatcher: verifies spend drain callbacks that enqueue into setDAH/increment complete successfully
SQL (commit 954f4ec, stores/utxo/sql/close_drain_postgres_test.go):
- TestClose_DrainsQueuedCreate_Postgres: verifies queued Create is persisted durably after Close drains
All tests run against real backend containers (Aerospike/Postgres) and verify persistence after Close.
There was a problem hiding this comment.
Pull request overview
This PR fixes a shutdown durability bug where UTXO writes queued in background batchers could be lost on SIGTERM, leaving the UTXO store inconsistent after restart (e.g., “failed to decorate previous outputs: N missing”). It does so by introducing an explicit Close(ctx) lifecycle on utxo.Store implementations and ensuring the daemon closes stores on all shutdown paths with a fresh timeout context.
Changes:
- Add
Close(ctx context.Context) errorto theutxo.Storeinterface and implement it across UTXO backends and wrappers (aerospike/sql/nullstore/logger/txmetacache) plus update mocks. - Update daemon shutdown to always close stores (including UTXO) on both OS-signal and
daemon.Stop()paths, using a dedicated 30s timeout context and closing UTXO before blob stores. - Add Aerospike client cache eviction helper (
util.CloseAerospikeClient) to avoid reusing closed shared clients.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| util/aerospike.go | Adds CloseAerospikeClient() to close+evict cached Aerospike clients. |
| stores/utxo/Interface.go | Extends utxo.Store with Close(ctx) and documents shutdown expectations. |
| stores/utxo/aerospike/aerospike.go | Implements Close(ctx) draining for Aerospike-backed UTXO store. |
| stores/utxo/sql/sql.go | Implements Close(ctx) draining for SQL-backed UTXO store. |
| stores/utxo/nullstore/nullstore.go | Adds no-op Close(ctx) for null store. |
| stores/utxo/logger/logger.go | Adds Close(ctx) passthrough with debug logging. |
| stores/txmetacache/txmetacache.go | Delegates Close(ctx) to wrapped UTXO store. |
| stores/utxo/mock.go | Updates UTXO mock to implement new Close(ctx) method. |
| stores/utxo/logger/logger_test.go | Updates logger-store test mock to implement Close(ctx). |
| stores/utxo/factory/utxo_test.go | Updates factory test mock to implement Close(ctx). |
| services/blockpersister/Server_test.go | Updates blockpersister test mock to implement Close(ctx). |
| daemon/daemon.go | Defers store shutdown on all exit paths; adds timeout context and closes UTXO first. |
| daemon/daemon_stores.go | Synchronizes Cleanup() with shutdown close via globalStoreMutex. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // The batchers run in background=true mode for create/get/unlock and | ||
| // background=false for spend — every Put returns to the caller before the | ||
| // underlying SQL write commits. Closing each batcher invokes go-batcher's | ||
| // shutdown drain (see batcher.go:Close), which closes the input channel, |
There was a problem hiding this comment.
Good catch — fixed the doc. The spend batcher is background=false, so its Put blocks until the batch callback finishes and nothing can be lost on its input channel. Reworded to state that only create/get/unlock are background=true (Put-acks-on-enqueue), while draining the spend batcher here just flushes a partial batch and releases its worker.
| select { | ||
| case <-done: | ||
| if s.db != nil { | ||
| return s.db.Close() | ||
| } |
There was a problem hiding this comment.
Fixed. db.Close() now runs inside the drain goroutine, after the batchers drain, so the connection pool is closed even when ctx has already expired (it no longer hangs off the <-done branch only). The db.Close error is captured in a dbErr var and returned when the drain completes within the deadline; on ctx timeout we still return ctx.Err() but the db is closed best-effort in the background.
| // finished using the stores. Deferring it here, after the select that | ||
| // waits for service exit, gives us a single ordered teardown point | ||
| // that runs even if the body returns through a signal-driven exit | ||
| // rather than an explicit Stop() call. |
There was a problem hiding this comment.
Reworded the comment to be accurate: the defer is registered before the select but runs on function return, i.e. after the select has observed service exit on either path.
| // Implementations MUST wait for outstanding batched writes to complete | ||
| // before returning. Returning before pending writes have committed risks | ||
| // silently losing UTXO state: callers (block validation, legacy sync) | ||
| // will have already received successful responses for those writes and | ||
| // will have committed the parent block, but on restart the UTXOs will |
There was a problem hiding this comment.
Updated the contract wording to explicitly allow ctx-driven early return: implementations MUST wait for outstanding writes unless ctx expires first, and a returned ctx error means 'drain not confirmed complete' (the underlying drain/release may continue best-effort).
| select { | ||
| case <-done: | ||
| // Drains complete; close the Aerospike client. The client is shared | ||
| // per host via util's connection cache, so close-and-evict it rather | ||
| // than closing in place — otherwise the cache would keep a closed |
There was a problem hiding this comment.
Fixed both. The client close/evict (util.CloseAerospikeClient) now runs inside the drain goroutine instead of on the <-done branch, so it happens even when ctx expires first — avoiding the stale cache entry that would break a later in-process restart with INVALID_NODE_ERROR. Also added s.externalStore.Close(ctx) after the batchers drain so the external blob store created in Store.New is no longer leaked.
- sql Close: close DB inside drain goroutine so the connection pool is not leaked on ctx timeout; correct doc re spend batcher (background=false). - aerospike Close: close/evict client and external blob store inside the drain goroutine so they run even when ctx expires first; close the external store that Store.New creates. - utxo Interface Close: reconcile MUST-wait wording with ctx-driven early return. - daemon: correct misleading comment about defer placement vs the select. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| // Drain in dependency order: state-mutating writers last so they | ||
| // have the best chance of committing before the deadline. |
There was a problem hiding this comment.
This is a minor wording imprecision. The comment says "state-mutating writers last" but unlock (SetLocked) is closed first. However, the actual order is sound - unlock is a best-effort metadata operation while create/spend are the critical durable UTXO writes that should have maximum time to complete. Suggested clarification: "drain best-effort metadata operations first, then critical create/spend writes last".
| // The create/get/unlock batchers run in background=true mode — their Put | ||
| // returns to the caller before the underlying SQL write commits, so a | ||
| // SIGTERM mid-flight would silently lose queued writes without this drain. |
There was a problem hiding this comment.
This remains a valid concern. The comment at sql.go:306-308 states operations "return to the caller before the underlying SQL write commits" but Create/Get/SetLocked actually wait on done channels for batch completion. The actual issue being fixed is flushing partially-filled batches during shutdown, not preventing premature caller success. Suggested rewording for accuracy.
| // "shutdown"). Without this drain, a SIGTERM mid-flight silently loses | ||
| // the in-channel items: the caller has already received a successful | ||
| // Create/Spend/etc. return because background batchers ack on enqueue, | ||
| // not on dispatch, so the parent block gets committed elsewhere but the | ||
| // UTXO write never reaches Aerospike. On restart, blocks that spend |
There was a problem hiding this comment.
This remains a valid concern. The comment at aerospike.go:529 states "background batchers ack on enqueue" but the actual Create/Spend implementations wait on per-item result channels (e.g., Create waits on resultCh, Spend waits on errCh). The real risk is partial/unflushed batches during shutdown, not that callers got premature success responses. Suggested rewording for accuracy.
| // before pending writes have committed risks silently losing UTXO state: | ||
| // callers (block validation, legacy sync) will have already received | ||
| // successful responses for those writes and will have committed the | ||
| // parent block, but on restart the UTXOs will be missing — breaking | ||
| // subsequent blocks that spend them. |
There was a problem hiding this comment.
This remains a valid concern. The Interface.Close contract at line 259 states "callers … will have already received successful responses" but this is not universally true - both Aerospike and SQL implementations wait for batch callbacks before returning to callers. Suggested rewording: "callers may have observed success" or "higher-level state may have been committed" to maintain accuracy across implementations.
| // CloseAerospikeClient closes the cached aerospike client for the given host and | ||
| // removes it from the process-wide connection cache. GetAerospikeClient hands | ||
| // out a shared client per host, so closing the client without evicting it would | ||
| // leave a closed client in the cache: a subsequent store for the same host (for |
There was a problem hiding this comment.
[Addressed] The current code at util/aerospike.go:111-113 now includes explicit documentation clarifying that host MUST be url.Host (may include port), not just a hostname. The comment warns: "pass the same url.Host, not a bare hostname, or the eviction silently no-ops."
| func (s *Store) Close(ctx context.Context) error { | ||
| done := make(chan struct{}) | ||
|
|
||
| var extErr error | ||
|
|
||
| go func() { | ||
| defer close(done) | ||
| // Order is intentional: stop accepting new write ops first (no-op | ||
| // here because Put/PutCtx don't have a separate close-gate), then | ||
| // drain in dependency order. setDAH/locked/outpoint/increment are | ||
| // best-effort metadata writers; store/spend are the durable UTXO | ||
| // state writers and are drained last to maximise their chance of | ||
| // committing before the deadline. | ||
| if s.setDAHBatcher != nil { | ||
| s.setDAHBatcher.Close() | ||
| } | ||
| if s.lockedBatcher != nil { | ||
| s.lockedBatcher.Close() | ||
| } | ||
| if s.outpointBatcher != nil { | ||
| s.outpointBatcher.Close() | ||
| } | ||
| if s.incrementBatcher != nil { | ||
| s.incrementBatcher.Close() | ||
| } | ||
| if s.getBatcher != nil { | ||
| s.getBatcher.Close() | ||
| } | ||
| if s.spendBatcher != nil { | ||
| s.spendBatcher.Close() | ||
| } | ||
| if s.storeBatcher != nil { | ||
| s.storeBatcher.Close() | ||
| } | ||
|
|
||
| // Drains complete; close the external blob store (created in | ||
| // Store.New) so its handles/connections are not leaked. | ||
| if s.externalStore != nil { | ||
| extErr = s.externalStore.Close(ctx) | ||
| } | ||
|
|
||
| // Close the Aerospike client. The client is shared per host via | ||
| // util's connection cache, so close-and-evict it rather than closing | ||
| // in place — otherwise the cache would keep a closed client and a | ||
| // later store for the same host (e.g. an in-process daemon restart) | ||
| // would reuse it and fail with INVALID_NODE_ERROR. Done inside the | ||
| // goroutine so it still runs even when ctx has already expired. | ||
| if s.client != nil { | ||
| util.CloseAerospikeClient(s.url.Host) | ||
| } | ||
| }() | ||
|
|
||
| select { | ||
| case <-done: | ||
| return extErr | ||
| case <-ctx.Done(): | ||
| return ctx.Err() | ||
| } |
There was a problem hiding this comment.
[Addressed] Test coverage was added in commit 954f4ec and b5d3a31:
- stores/utxo/aerospike/close_drain_test.go added TestClose_DrainsQueuedSetLockedMultiRecord and TestClose_DrainsQueuedSpendCrossBatcher (lines 31-172)
- stores/utxo/sql/close_drain_postgres_test.go added TestClose_DrainsQueuedCreate_Postgres (lines 62-133)
These tests verify Close() drains queued writes before connection teardown.
| func (s *Store) Close(ctx context.Context) error { | ||
| done := make(chan struct{}) | ||
|
|
||
| var dbErr error | ||
|
|
||
| go func() { | ||
| defer close(done) | ||
| // Drain in dependency order: state-mutating writers last so they | ||
| // have the best chance of committing before the deadline. | ||
| if s.unlockBatcher != nil { | ||
| s.unlockBatcher.Close() | ||
| } | ||
| if s.getBatcher != nil { | ||
| s.getBatcher.Close() | ||
| } | ||
| if s.spendBatcher != nil { | ||
| s.spendBatcher.Close() | ||
| } | ||
| if s.createBatcher != nil { | ||
| s.createBatcher.Close() | ||
| } | ||
| // Always close the DB after the batchers drain, even if ctx has | ||
| // already expired, so the connection pool is not leaked. | ||
| if s.db != nil { | ||
| dbErr = s.db.Close() | ||
| } | ||
| }() | ||
|
|
||
| select { | ||
| case <-done: | ||
| return dbErr | ||
| case <-ctx.Done(): | ||
| return ctx.Err() | ||
| } | ||
| } |
There was a problem hiding this comment.
[Addressed] Test coverage was added in commit 954f4ec. See stores/utxo/sql/close_drain_postgres_test.go:62-133, TestClose_DrainsQueuedCreate_Postgres. This test verifies that:
- A queued Create in the background batcher is drained by Close()
- The transaction persists durably (verified by reopening the DB)
- Close() completes without error even when the batcher would otherwise timeout
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
|
The graceful-shutdown fix relies on batcher.Close() blocking until queued writes are drained. go-batcher v2.0.3 Close() was fire-and-forget, so the store closed its DB/Aerospike connection before the batcher worker dispatched the residual batch — the very writes this PR aims to preserve could still be lost. go-batcher v2.0.4 (bsv-blockchain/go-batcher#111) makes Close() block until the worker has drained the channel, dispatched the residual batch through fn, and all background dispatch goroutines have finished; it is also now idempotent. With the bump, the store Close() methods already in this PR are correct as written. Adds TestClose_DrainsQueuedCreate_Postgres: configures the create batcher so it only flushes on shutdown, enqueues a Create, calls Close, then re-opens the database and confirms the transaction persisted. Runs under the Postgres testcontainer (CI/GoFortress); skipped in -short. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ed55281 to
954f4ec
Compare
| // best-effort metadata writers; store/spend are the durable UTXO | ||
| // state writers and are drained last to maximise their chance of | ||
| // committing before the deadline. | ||
| if s.setDAHBatcher != nil { |
There was a problem hiding this comment.
[Critical] Batcher close order inverted — will panic-crash on shutdown
The spend batcher dispatch callback calls SetDAHForChildRecords and IncrementSpentRecords (spend.go:905, 1089), which enqueue into setDAHBatcher and incrementBatcher. Closing these consumer batchers (lines 525-536) BEFORE the producer spendBatcher (line 540-542) means when spend drains its residual batch, it will Put into already-closed channels.
go-batcher behavior: when a batcher is closed, subsequent Put calls panic with send on closed channel. Those Puts run in bare goroutines without recovery, so the panic escapes and crashes the process.
Fix: Reverse the order. Close producers (store/spend/get) BEFORE consumers (setDAH/increment/outpoint/locked).
This was previously flagged by ordishs in comment 3349987293 but hasn't been addressed yet.
There was a problem hiding this comment.
[Fixed] in b5d3a31.
Close now drains in dependency order — producers before consumers: store → spend → setDAH/increment → get/outpoint/locked. The spend drain callback (sendSpendBatchLua → handleExtraRecords/handleSpendSignal) enqueues into setDAHBatcher/incrementBatcher, which are now still open when spend drains.
Added TestClose_DrainsQueuedSpendCrossBatcher, which fully spends a multi-record tx through a spend batcher kept queued until Close so the cross-batcher enqueues run during the drain. Verified against a real Aerospike testcontainer: it reproduces panic: send on closed channel on the pre-fix close order and passes on the new one.
Also fixed a related self-referential edge in the same commit: setLockedBatch re-Put child records into its own lockedBatcher (same panic class, plus a drain deadlock). It now writes child records inline within the callback, matching the create path. Covered by TestClose_DrainsQueuedSetLockedMultiRecord.
Addresses review findings on PR bsv-blockchain#928 surfaced once go-batcher v2.0.4 made Close block until drain. With a real draining Close, a batcher whose drain callback enqueues into another batcher panics ("send on closed channel") if the downstream batcher is already closed. 1. Close order (Critical, flagged by reviewers): the spend batcher's callback (sendSpendBatchLua -> handleExtraRecords/handleSpendSignal) enqueues into setDAHBatcher and incrementBatcher on a fully-spent tx. Close closed those consumers before spend. Reordered to drain producers first (store, spend), then spend's consumers (setDAH, increment), then the independent batchers (get, outpoint, locked). 2. lockedBatcher self-requeue (panic + deadlock): setLockedBatch re-Put child records into its own batcher for multi-record txs and blocked on the result. During a draining Close that Put hits a closed channel and could never be serviced. Fixed by handling child records inline within the callback (one extra BatchOperate), matching how the create path writes a tx's extra/external records. Removes the only self-referential batcher edge. 3. CloseAerospikeClient: documented the sole-ownership assumption (closes the process-wide shared client for the host) and that the arg must be url.Host. Adds aerospike Close regression tests (verified against a real Aerospike testcontainer; each confirmed to panic on the pre-fix code): - TestClose_DrainsQueuedSetLockedMultiRecord - TestClose_DrainsQueuedSpendCrossBatcher Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ordishs
left a comment
There was a problem hiding this comment.
Approve — solid, careful fix for a real data-integrity bug
Verified the critical claims against the codebase rather than the PR description alone. The hard parts hold up.
Verified correct
- Aerospike close-order. Traced the edges:
sendSpendBatchLua → processSpendBatchResults → handleSpendSignal → handleExtraRecords/SetDAHForChildRecordsenqueue intosetDAHBatcher(spend.go:905) andincrementBatcher(spend.go:1089), so spend must close before them — which it does.sendStoreBatch/sendSetDAHBatch/sendIncrementBatchhave no outbound batcher edges, so the "store feeds nothing; setDAH/increment are independent leaves" reasoning is accurate. - SQL close-order is safe — all four
PutCtxcalls are caller-driven, none in a batch callback, so no cross-batcher drain edges. - Deferred teardown runs strictly after services finish on both paths; old in-
selectcall correctly removed (no double-close). Cleanup()race fix closes a genuine read/write race on the store pointers.locked.gore-entrancy fix is the right approach; errCh signalled exactly once per item.
Follow-ups (non-blocking)
- (Cheap, in-scope)
locked.go:110-113— the unparseable-response branchcontinues without signallingerrCh, soSetLocked'sreturn <-errChleaks that goroutine forever. Pre-existing, but you're rewriting this exact branch; signalling aProcessingErrorhere would close the leak. - Drain-order regression tests are integration-only (
testing.Short()skip + testcontainers), so the consensus-relevant ordering isn't covered bymake test. Acceptable, but a future reorder ofClose()won't be caught by unit CI. - Confirm go-batcher
PutCtxdoesn't drop on a cancelled ctx. During a real SIGTERM drain,handleExtraRecords(spend.go:980) callssetDAHBatcher.PutCtx(ctx, …)with the original spend request's (likely-cancelled) ctx. Low risk since the primary setDAH/increment writes use ctx-less.Put, but the integration test only exercises the non-cancelled path. - Confirm
externalStoreis a distinct instance from the subtree/tx/temp blob stores — if the 30s ctx expires, the aerospike best-effort goroutine keeps running and callsexternalStore.Close(ctx)while the daemon proceeds to close the blob stores. - Shared 30s deadline across all four stores: a slow utxo drain eats the blob stores' budget. Consider a setting rather than a hardcoded literal.
CloseAerospikeClientper-host eviction has no refcount — documented and correct for the single-store daemon; just ensure no test shares an Aerospike host across two live stores.
Also: the live-mainnet validation in the test plan is still unchecked. Given this is a state-corruption fix, confirming the "closing utxo store" log line + clean next-block-after-restart on a real node before merge is worth it.
Resolves a semantic conflict in stores/utxo/aerospike/locked.go where both branches rewrote setLockedBatch: - upstream bsv-blockchain#1025 ("stop caller-side leak"): trySignal, handled[] tracking, panic-recovery defer (signalBatchPanic), placeholderKey key-error handling, batchOperate wrapper, and bounded waitForLockedResult — but kept the child self-requeue (lockedBatcher.PutCtx from inside the callback). - this branch (bsv-blockchain#928): removed the self-requeue, handling child records inline, because bsv-blockchain#928 closes the lockedBatcher on shutdown and the self-requeue then Puts into a closed channel (panic in an errgroup goroutine -> process crash), which upstream's bounded wait does not prevent. Resolution: keep ALL of upstream's scaffolding and replace the child self-requeue with the inline child-record BatchOperate (using batchOperate + trySignal). Verified: build + go vet (incl. -tags aerospike); aerospike close-drain + lock/multirecord/duplicate-spend tests pass against a real container; daemon closeStores tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|


The bug
Restarting teranode during legacy catchup can leave the UTXO store inconsistent. After redeploying a testnet node mid-catchup we observed:
```
ERROR | netsync/handle_block.go:1018 | legacy| [extendTransactions] called for block ... height 54494 DONE in 817.351µs with error: PROCESSING (4): failed to decorate previous outputs for tx a1f6a4ff...c948afcca -> PROCESSING (4): failed to decorate previous outputs: 1 missing
INFO | legacy/netsync/manager.go:886 | legacy| Resetting header sync state at height 54493
```
Looping forever: peer sends block → tx references a parent UTXO that's not in our store → fail → header-resync → retry → same error. The only recovery is a DB wipe.
The cause
All `utxo.Store` backends (aerospike, sql, and the queue/postgres store on a sibling PR) front their writes with `go-batcher` in `background=true` mode. Every `Create`/`Spend`/`Get`/`Unlock` call enqueues an item and returns immediately; the actual DB or Aerospike write happens in a worker goroutine.
When the process receives SIGTERM:
Nothing in this path closes the utxo store, so the batcher workers are killed mid-flight. Items still in the input channel — and any batch dispatched but not yet committed by Aerospike/Postgres — are lost. The caller had already received a successful `Create`/`Spend` return (background mode acks on enqueue), the parent block had been committed to the blockchain store via gRPC, and on restart the chain advances past blocks whose UTXOs never reached durable storage. Subsequent blocks that spend those outputs fail with `failed to decorate previous outputs: N missing`.
Note that the existing `closeStores` in `daemon.go` only ran on the `<-d.doneCh` shutdown branch (explicit `daemon.Stop()`) and only closed the tx/subtree/temp blob stores. The OS-signal path never called it at all.
The fix
Each batcher's `Close()` blocks until the input channel is drained and the residual batch dispatched (go-batcher v2 batcher.go:437–449: the worker selects on done, closes the input channel, drains it, and dispatches the remainder with reason `shutdown` before returning).
Test plan