fix(utxo): propagate panic recovery as error in Spend#762
Conversation
|
🤖 Claude Code Review Status: Complete Current Review:
Known limitation (documented, acceptable for merge): Security assessment: The fix correctly closes the UTXO corruption window described in #4560 by surfacing panics instead of returning silent success. |
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-05-12 09:50 UTC |
oskarszoon
left a comment
There was a problem hiding this comment.
Thanks for tackling this. The signalling fix itself is correct (named returns + if *err == nil guard in both stores, no other recover() sites in production paths). But there are real blockers before merge.
Blockers
1. CI is red on this branch. Commit 2d4481771 traded a depguard violation for a forbidigo one — fmt.Errorf is banned project-wide. Hits stores/utxo/aerospike/spend_panic_test.go:35 and stores/utxo/sql/spend_panic_test.go:50. Switch to teranode's own errors.New(errors.ERR_UNKNOWN, ...).
2. Goroutine panics are not caught. Both Spend impls do most of the actual work inside g.Go(...) lambdas (PutCtx, channel reads, slice writes, errSpent parsing). The deferred recover only catches panics in the parent goroutine. If issue #4560's reported panic originated inside one of those workers, this PR does not fix it — the process still crashes. Each worker lambda needs its own recover (or errgroup wrapping with a recover helper) before this can claim to close #4560.
3. errors.NewProcessingError collides with the block-validation retry classifier. services/blockvalidation/BlockValidation.go:1511-1513 treats ErrProcessing as transient and retries; same in services/legacy/netsync/handle_block.go:953,1037,1292. A recovered panic is not transient — repeated retries will hammer a broken path. Either introduce a distinct ErrPanic error code, or exclude the panic-derived error from the retry classifier.
Other real issues
- No Aerospike end-to-end regression test. SQL has both the helper unit test and a real
Spend(ctx, nil, h)regression. Aerospike only has the helper unit test, so the defer placement, named returns, and&errplumbing on the actualSpendaren't covered. Add a parallel Aerospike test. - Partial-mutation state remains unmitigated. None of the three callers (
validator.spendUtxos,quick_validate.createAndSpendUTXOsForBatch,process_conflicting.ProcessConflicting) callUnspendonSpenderror. The PR's own internal rollback (theneedsSpendRollbackblock) is bypassed by the panic path because control transfers to the deferred recover beforeg.Wait(). The PR makes the failure visible but leaves a half-spent UTXO set on disk. At minimum, document this in the commit so a follow-up can address rollback semantics. - Defer asymmetry between stores. SQL puts the
blockHeight == 0check before the defer (stores/utxo/sql/sql.go:1637+); Aerospike puts the defer first (stores/utxo/aerospike/spend.go:285+). Aerospike's order is structurally safer — future code added at the top is protected by default. Align both to defer-first.
Nit
- The logged message uses
%von the recovered value only — no stack trace. For real-world triage adddebug.Stack()to the log line. Keep the error message text unchanged so the test substring assertion still passes.
Happy to re-approve once CI is green and goroutine recovery + retry-classifier handling are addressed.
errgroup.Go does not recover panics in worker bodies, so a panic inside a spend worker would crash the process even though the parent had a recover. Add a per-worker defer that funnels through handleSpendPanic and writes into spends[idx].Err. Reclassify recovered panics as ERR_UNKNOWN rather than ERR_PROCESSING so the block-validation retry classifier does not treat them as transient infrastructure errors and retry indefinitely. Log a stack trace alongside the recovered value so the source is recoverable post-mortem. SQL: move the blockHeight==0 guard below the recover so an early panic is still captured.
|
oskarszoon
left a comment
There was a problem hiding this comment.
All three blockers from prior review resolved in 438c0df:
- Lint clean —
fmt.Errorfremoved,golangci-lint run ./stores/utxo/...passes locally. - Worker-goroutine panics now caught — per-worker
defer handleSpendPanicin both Aerospike and SQL stores. - Reclassified to
ErrUnknown(notErrProcessing) so theBlockValidation.gotransient-retry classifier won't loop on a panic.
Plus the defer-asymmetry and stack-trace nits — both addressed. Aerospike end-to-end regression test added (TestStore_SpendNilTxPanicRecovery).
Two low-severity nits left, non-blocking: require.Contains(err.Error(), "panic") could be errors.Is(err, ErrUnknown); worker-panic path doesn't trigger needsSpendRollback (pre-existing, just flagging).
LGTM.




Closes #4560.
Summary
Both Aerospike and SQL
Spend()implementations wrap their body in adefer recover()that logs the panic but does not surface it. With unnamed returns, a recovered panic returns(nil, nil)— success — even though UTXO state may be partially mutated. Validator / ProcessConflicting / reorg paths then proceed as if the spend completed, opening a UTXO-corruption / double-spend window.Fix
Convert both
Spendsignatures to named returns and seterrin the deferred recover:func (s *Store) Spend(...) (spends []*utxo.Spend, err error)if err == nil { err = errors.NewProcessingError("panic in Spend: %v", r) }Applied identically in
stores/utxo/aerospike/spend.goandstores/utxo/sql/sql.go. The recover body is extracted into a smallhandleSpendPanichelper per package so the contract can be exercised directly in unit tests.Test plan
Spend; asserts the function returns a non-nil error containing "panic".go test -race ./stores/utxo/sql/...passes.go test -race -tags aerospike ./stores/utxo/aerospike/...passes.