Skip to content

feat(utxo): add a PostgreSQL-native UTXO store#684

Open
freemans13 wants to merge 127 commits into
bsv-blockchain:mainfrom
freemans13:stu/queue-utxo-store
Open

feat(utxo): add a PostgreSQL-native UTXO store#684
freemans13 wants to merge 127 commits into
bsv-blockchain:mainfrom
freemans13:stu/queue-utxo-store

Conversation

@freemans13

@freemans13 freemans13 commented Apr 11, 2026

Copy link
Copy Markdown
Collaborator

Summary

A PostgreSQL-native UTXO store (stores/utxo/postgres/) implementing the utxo.Store interface with pgx v5, batched operations, and deferred (off-hot-path) pruning. Intended as an alternative backend alongside the existing SQL and Aerospike stores.

Status: functional and passing the store's unit tests. Benchmarked so far only on a single developer machine, so the figures below are indicative of relative behaviour on that setup — not production numbers — and carry meaningful run-to-run variance. Not yet exercised through the validator service or the sequential/integration suites.

Design

  • Transaction-centric layout. A txs row carries its own per-output state as parallel arrays (utxo hash, spendable/frozen flags, coinbase maturity), so a spend validates from a single row read. Spends are recorded as rows in an append-only spends table ("is this output spent?" = row present). There is no separate outputs table.
  • Hash-partitioned txs/spends (8 partitions) to spread autovacuum and prune work.
  • Deferred delete-at-height. Inline spend / set-mined only tag heights (spent_at_height / mined_at_height, BRIN-indexed). A background sweep recomputes "mined and fully spent" off the hot path and stamps delete_at_height; a continuous pruner reclaims rows whose height has been reached. Reorg-safe (Unspend clears DAH directly; a keyspace-sliced backstop reconciles misses; a durable watermark makes the cursor crash-safe).
  • synchronous_commit=on on every connection (group commit amortises the fsyncs).

Schema — tables and the thinking behind them

Two data tables plus a one-row cursor. Both data tables are LOGGED (durable) and PARTITION BY HASH into 8 partitions on the same tx-hash key with the same modulus, so a tx's row and its spend records co-locate in aligned leaves (txs_pNNspends_pNN).

txs — one row per transaction (PARTITION BY HASH(hash))

Holds everything needed to validate a spend against this tx in a single row:

  • Identity / metadata: hash (PK), version, lock_time, fee, size_in_bytes, coinbase, raw_tx (extended bytes, LZ4-compressed), inserted_at.
  • Mutable state / flags: locked, conflicting, frozen, unmined_since (NULL ⇒ on the longest chain), mined_at_height, delete_at_height (the deferred prune stamp), preserve_until.
  • Block membership (parallel arrays): block_ids, block_heights, subtree_idxs — a tx can be in more than one block, so membership accumulates.
  • Conflicts: conflicting_children (child hashes).
  • Per-output UTXO state (parallel arrays, indexed by output number): utxo_hashes, out_spendables, out_frozens, coinbase_spending_heights, spendable_ins.

Thinking: the per-output arrays are the "fold" — they replace a separate outputs table, so a spend reads one row instead of joining, and there is no per-output INSERT, no second random-hash index, and no extra cascade-delete table. hash is the natural lookup and partition key. The validator hot path UPDATEs locked (and later the block arrays and the DAH stamp), which churns roughly one dead tuple per tx, so the txs leaves use a low fillfactor to keep those updates HOT-prunable and are hash-partitioned so autovacuum can keep up in parallel.

spends — one row per spent output (PARTITION BY HASH(prev_tx_hash))

prev_tx_hash, prev_output_idx, spending_data (spending txid + input index), spent_at_height, with UNIQUE (prev_tx_hash, prev_output_idx).

Thinking: the presence of a row is the "this output is spent" fact, and the UNIQUE constraint is the double-spend arbiter — a second attempt to spend the same output fails the insert, and the conflicting spender is read back from the existing row. Keeping spends as append-only INSERTs (rather than a nullable spending_data column on the output) means the spend hot path never UPDATEs and so never bloats. The table is hash-partitioned on the same key/modulus as txs, so a parent's spends live in the aligned leaf — the prune cascade and the DAH sweep operate per-partition with no cross-partition contention.

dah_watermark — single-row sweep cursor

id (always 1), last_swept_height. Records how far the background delete-at-height sweep has advanced, so it resumes after a restart instead of rescanning, and never advances past heights it has not actually swept. Rewound on reorg so re-org'd ranges are re-evaluated.

Indexes (and why each exists)

  • txs PK on hash — the lookup key and the partition key.
  • Partial px_unmined_since / px_delete_at_height — partial so they stay tiny (near-empty in steady state); they drive the unmined-transaction and prunable-transaction iterators without scanning the table.
  • Per-leaf BRIN on txs.mined_at_height and spends.spent_at_height — both columns are written in increasing height order, so BRIN is near-free on insert (a summary per heap range, no per-row entries) and lets the deferred sweep enumerate only the recent height window rather than the whole table.

Measured behaviour (single M-series macOS dev laptop, local PostgreSQL, synchronous_commit=on)

Indicative only — one developer machine, untuned beyond the notes in TUNING.md, with notable run-to-run variance.

  • Validator hot path (Get + Spend + Create + SetLocked per tx), no pruning: median ≈ 92K tx/s at 10K workers, high variance (bursts higher, dips lower).
  • Create while the pruner runs, table held bounded: ≈ 65K tx/s sustained, low variance, with reclaim keeping pace with ingest over the run.

Reproduce with stores/utxo/throughput_test.go and stores/utxo/throughput_pruned_test.go. Required PostgreSQL GUCs and the dev-box caveats (including a macOS shared-memory / shared_buffers ceiling under a heavy co-located client) are documented in stores/utxo/postgres/TUNING.md.

Notes on the DAH sweep

The sweep's candidate-enumeration query has a large row-count under-estimate; with nested-loop joins the plan can degrade badly under load and stall stamping (so the table grows and reclaim stops). Forcing hash joins in the sweep transaction (SET LOCAL enable_nestloop = off) avoids that. The sweep also runs per hash-partition in parallel and advances its watermark in bounded height steps so it does not silently skip candidates. With these in place, concurrent create + prune holds the table bounded on the dev box.

Store layout

stores/utxo/
  aerospike/    — Aerospike key-value store
  postgres/     — PostgreSQL store (this PR)
  sql/          — SQLite/PostgreSQL portable store

The "postgres" URL scheme routes to this store; the SQL store keeps "postgressql", "sqlite", "sqlitememory".

Test plan

  • Store unit tests pass (go test ./stores/utxo/postgres/), -race clean
  • Concurrent create + prune holds the table bounded on the dev box (reclaim keeps pace)
  • synchronous_commit=on enforced per connection
  • Integration tests with the validator service
  • Sequential tests — postgres-backed suite: 84 passed, 0 failed, 4 pre-existing skips (double_spend, longest_chain, large_tx_reorg, conflict resolution, frozen-tx, orphan-return, deep reorg). The 4 skips are author-disabled in the test source (incl. a known BlockAssembler.getReorgBlockHeaders bug, unrelated to this store).
  • Benchmarking on representative (non-laptop) hardware

Review fixes (consensus correctness — aerospike store as gold standard)

A multi-agent review surfaced divergences from the production aerospike store in the reorg / atomicity / ownership paths. These have been fixed (commit 13ec557) and validated by the postgres-backed sequential suite above.

Critical / high (consensus, data-loss):

  • Reorg DAH-clearunsetMinedMulti now clears delete_at_height (and locked) when a tx falls off the longest chain, and rewinds the DAH sweep watermark to the reorged height. Previously a reorged-out, fully-spent tx kept its stale prune stamp and was deleted by the pruner, destroying still-live UTXOs.
  • Ownership-checked, atomic UnspendUnspend is now a single transaction (delete + DAH-clear + lock) and matches the stored spending_data token before removing a spend (nil token rejected). A non-owning caller (a stale reorg record whose output was re-spent) no longer wipes the live spender. Mirrors aerospike/sql.
  • Per-output freezeFreezeUTXOs no longer sets the transaction-level frozen flag, so freezing one output of a multi-output tx no longer blocks spending its siblings (the tx-level flag remains the create-time whole-tx freeze gate).
  • SetMinedMulti postcondition — every input hash must appear in the result with the block id (else TxNotFound), rows.Err() is checked, and mined_at_height is bound as the block height parameter instead of an interpolated stale chain tip.
  • Pruner delete race — the cascade DELETE re-checks delete_at_height at execution time, closing the window where a concurrent reorg Unspend revives a tx between the tombstone scan and the delete.

Lower severity:

  • batchDecorateChunk checks rows.Err() (no spurious TxNotFound on a truncated result set).
  • DAH sweep + backstop full-spend test counts only spends of spendable outputs.
  • DAH watermark advance is forward-only under concurrent sweepers; background DAH cursor runs under a store-owned context cancelled by Stop/Close.
  • SetConflicting checks RowsAffected; DAH height arithmetic widened to int64 before adding.
  • Regression tests added: per-output freeze isolation, Unspend ownership, reorg DAH-clear.

Known follow-ups (not in this change):

  • ProcessExpiredPreservations keeps aerospike's unconditional DAH stamp on expiry; aerospike's safety gate is the pruner's (optional) defensive child-verification, which this postgres pruner does not yet implement.
  • Iterator context threading (Get*Iterator use context.Background()) is deferred — it requires a utxo.Store interface change across all backends.

Cross-backend consistency verification (aerospike as the reference oracle)

Correctness of this new store is verified differentially against the production Aerospike store as the gold-standard oracle, with the SQL store as a second reference. The shared, backend-agnostic compliance suite in stores/utxo/tests/ runs the same utxo.Store test against every backend, so any behavioural divergence shows up as a test that passes on Aerospike (and SQL) but fails on Postgres — pinpointing exactly where this store deviates from the reference semantics. New shared tests added here:

  • SetMinedUnsetOnMissingTx
  • UnsetMinedPreservesUnminedSinceWhenNonLCBlocksRemain
  • RemoveBlockIDsKeepsParallelArraysAligned (postgres + sql; Aerospike has the same gap, out of scope)
  • UnspendFlagAsLockedLocksParent (postgres + sql; pins existing correct behaviour)

Each fix below was confirmed by running these against a real Aerospike container, the SQLite store, and the live PostgreSQL backend — Postgres failed before the fix and matches the oracle after.

Additional review fixes (consensus / robustness)

A second multi-agent review pass (every finding independently verified against the Aerospike/SQL reference code before any change) produced the following fixes:

  • SetMinedMulti(UnsetMined=true) on a missing tx is now a no-op instead of returning a StorageError (Interface.go:295-303). Previously any reorg whose block referenced an already-pruned tx was aborted. (commit 6d927d9d)
  • unmined_since is preserved on reorg when a tx still has non-longest-chain block entries. The ELSE NULL branch was clobbering the marker to 0, disabling the DAH-sweep unmined guard and exposing the tx to premature pruning / fork-choice misclassification. (commit 6d927d9d)
  • RemoveBlockIDs keeps the parallel arrays aligned — it now trims block_ids/block_heights/subtree_idxs together (UNNEST WITH ORDINALITY), where it previously trimmed only block_ids, misaligning every subsequent positional read. (commit 6d927d9d)
  • Out-of-bounds spend index no longer panicsgetInternal bounds-checks prev_output_idx (an unbounded BIGINT) before subscripting SpendingDatas; a corrupt/orphaned spends row now yields a graceful error instead of an index-out-of-range panic reachable from any caller-supplied tx hash. (commit db7bd795)
  • Atomic freeze/reassign fallback — the UnFreezeUTXOs/ReAssignUTXO two-step error fallbacks now run in a single transaction, removing the window where out_frozens[idx]=false but txs.frozen still held the old value (which a concurrent spend would read on the tx_frozen gate). (commit db7bd795)
  • preserve_until partial index added so ProcessExpiredPreservations no longer full-scans txs each prune tick. (commit db7bd795)
  • Bounded pruner tombstone deletiondeleteTombstonedPartition now selects + cascade-deletes in bounded LIMIT batches instead of loading every tombstoned hash into memory (OOM risk at scale); this also closes the prior Go-side SELECT/DELETE race window. (commit a204f97c)

Verified non-issues (checked against the Aerospike/SQL reference, intentionally not changed):

  • SetConflicting "stale DAH" — Aerospike's setDeleteAtHeight Lua sets a conflicting tx's delete_at_height only when not already set, identical to Postgres's COALESCE; refreshing it would diverge from the reference.
  • block_heights INT[] "truncation" — the uint32 → int32 → uint32 round-trip is lossless; no observable difference.

Known follow-ups (deferred, not in this change):

  • SetConflicting does a per-hash Get() outside the write transaction (N+1 + a narrow TOCTOU; the 0-row case is already detected and returns TxNotFound).
  • spendBatched does not roll back a spend that the batcher commits after the request context is cancelled — needs a fix plus a -race concurrency regression test before merging.

🤖 Generated with Claude Code

freemans13 and others added 18 commits April 10, 2026 14:33
All-direct architecture: no queue tables, no stored procedures, no pg_cron,
no pg_notify. Every operation writes directly to snapshot tables.

Key changes from v3:
- Spends are direct INSERTs into new append-only spends table
- SetMined is direct INSERT block_ids + UPDATE tx_state
- Creates are direct INSERTs via unnest arrays
- New tx_state table holds all mutable flags (HOT updates)
- outputs/transactions tables are fully immutable after create
- Materializer approach retained as documented fallback for creates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
14 tasks covering: schema, store struct, create, get, spend, setlocked,
setmined, delete/unspend, iterators, alert system, preservation, pruner,
metrics, and throughput benchmarking.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rewrite the queue UTXO store for v4 architecture:

- 7 tables (transactions, tx_state, inputs, outputs, spends, block_ids,
  conflicting_children) each with 64 hash partitions
- Separate immutable data (transactions) from mutable state (tx_state)
- New spends table replaces spending_data column on outputs
- fillfactor=100 on immutable partitions, 50 on tx_state for HOT updates
- Partial indexes on tx_state for unmined_since and delete_at_height
- Simplified Store struct: pgxpool + sql.DB, no batcher/materializer
- Start() is a no-op (no background goroutines)
- All interface methods have stubs returning "not implemented"
- TestSchemaCreation verifies all 7 tables with 64 partitions each

Removes: queue tables, stored procedures, pg_cron, pg_notify,
batcher.go, listener.go, buffer.go

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Create: direct INSERT via unnest arrays into transactions, tx_state,
inputs, outputs, and block_ids tables within a single pgx transaction.
Uses CopyFrom for mega-transactions (>100 inputs/outputs).

Get: JOIN-based queries against transactions + tx_state for metadata,
separate queries for inputs, outputs, block_ids, and conflicting_children.
GetSpend JOINs outputs + tx_state + spends table for spend status.
BatchDecorate, PreviousOutputsDecorate, and BatchPreviousOutputsDecorate
use bulk IN-clause queries chunked to 400 hashes per query.

Also fixes block_ids DDL to include block_height and subtree_idx columns.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…irect-write store

Spend uses a validation CTE + INSERT INTO spends with ON CONFLICT DO NOTHING,
falling back to a diagnostic query when the INSERT returns 0 rows to determine
the exact failure reason (double-spend, frozen, locked, conflicting, etc.).
Idempotent re-spends (same spending data) are treated as success.

SetLocked/SetConflicting directly UPDATE tx_state with chunked IN clauses.
SetConflicting also inserts conflicting_children and returns affected parent
spends + spending child tx hashes. MarkTransactionsOnLongestChain updates
unmined_since with resilient error handling for large reorgs.

Includes basic Unspend (DELETE FROM spends) for rollback support.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…direct-write store

Task 7: SetMinedMulti with bulk INSERT into block_ids via unnest (chunked
by 500), bulk UPDATE tx_state (locked=false, optionally clear unmined_since),
UnsetMined reorg path with per-hash DELETE and unmined_since restoration.

Task 8: Delete removes all associated data across 7 tables in a single pgx
transaction. setDAH helper manages delete_at_height based on all-outputs-spent,
has-block_ids, and on-longest-chain conditions (mirrors aerospike Lua logic).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…uner, and metrics for v4 store

Tasks 9-12: Rewrite iterator stubs with JOIN tx_state queries and pgx.Rows-based
batched iteration. Implement FreezeUTXOs/UnFreezeUTXOs/ReAssignUTXO for the alert
system. Add PreserveTransactions/ProcessExpiredPreservations and a pruner service
that cascade-deletes tombstoned transactions. Add Prometheus metrics for Create,
Spend, and SetMinedMulti operations. All 49 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reduce per-operation round-trips by adding go-batcher-based batching:

- Create: single CTE statement (26 params with UNNEST arrays) replaces
  4 separate INSERTs; pgx.SendBatch pipelines N creates in 1 flush
- Spend: bulk SELECT (VALUES list) + bulk INSERT replaces per-input
  validation CTE; in-memory validation between phases
- Start() initializes both batchers; Stop() drains and closes them
- Without Start(), both operations work in direct (unbatched) mode
  so all 49 existing unit tests pass unchanged

Throughput at 500 workers: ~4,900 TPS (matches SQL store's ~4,965 TPS)
Throughput at 1000 workers: ~6,100 TPS (Queue store only)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Key techniques: COPY protocol for bulk creates, UNLOGGED tables,
3-table schema (txs + outputs + spends), in-process LRU cache,
100-conn pool, synchronous_commit=off.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
11 tasks: schema (3 UNLOGGED tables), store (pool 100, cache), create
(COPY staging), get (single-table + cache), spend (txs join), mined
(array append), conflicting (UPDATE txs), delete/iterators/alert/
preservation/pruner, tests, benchmark, integration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…in-process cache

Collapse 7 tables (transactions, tx_state, inputs, outputs, spends, block_ids,
conflicting_children) to 3 UNLOGGED tables (txs, outputs, spends) with 16
partitions each. Store full serialized tx as raw_tx blob (replacing inputs
table), block_ids/conflicting_children as arrays on txs row.

Key changes:
- schema.go: 3 UNLOGGED tables, 16 partitions, fillfactor tuning
- store.go: pool size 100, synchronous_commit=off, bounded txCache
- create.go: COPY to staging tables + INSERT...SELECT for batched creates
- get.go: single SELECT from txs, raw_tx deserialization, no JOINs
- spend.go: JOIN txs instead of tx_state
- mined.go: array append UPDATE instead of separate block_ids table
- conflicting.go: UPDATE txs, BYTEA[] array for children
- delete.go: 3-table cascade delete
- iterators.go: query txs directly, deserialize raw_tx
- alert_system.go: UPDATE txs.frozen
- preservation.go: UPDATE txs columns
- pruner_provider.go: 3-table cascade delete

All 49 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two fixes:
1. LIKE txs EXCLUDING ALL — don't inherit PK/constraints on staging tables
2. CopyFrom inside transaction — prevents ON COMMIT DELETE ROWS from
   clearing staging data before INSERT...SELECT can read it

v5 benchmark: 9,087 TPS at 1000 workers (2x SQL store at 500)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace N individual fetchBlockIDs calls with single
SELECT hash, block_ids FROM txs WHERE hash = ANY($1).

SetMined at 10K txs: 7,992 → 95,849 TPS (12x improvement, 4x faster than SQL store)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reverted chunked fetch — single bulk query is faster.
SetMined slowdown at 210K txs is from the UPDATE touching 210K rows
across 16 partitions, not from the fetch.

v5 at 10K workers: 16,515 TPS validator, 39,207 TPS SetMined

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Eliminates per-chunk round-trip latency by batching all UPDATE chunks +
the fetch query into a single pgx.SendBatch flush.

SetMined at 10K txs: 96,135 TPS (3.2x SQL store)
SetMined at 210K txs: 41,788 TPS (1.8x SQL store)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ce loss

UNLOGGED vs LOGGED benchmark shows <10% difference on local SSD with
synchronous_commit=off. LOGGED tables provide crash recovery (WAL
preserved) vs UNLOGGED which truncates all data on crash.

For financial transaction data, durability is non-negotiable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cache eliminates Get DB round-trip (100% hit rate for same-process data).
1ms batcher interval reduces Spend+Create wait from 20ms to 2ms.

v5 at 1K workers: 14,538 TPS (1.5x SQL store, up from 9,579)
v5 at 10K workers: 19,735 TPS (1.6x SQL store, up from 15,653)
v5 at 1 worker: 162 TPS (3.3x SQL store, up from 31)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ect writes

Rewrite of stores/utxo/queue from 7-table materialized design to 3-table
direct-write architecture. Key changes:

- 3 tables (txs, outputs, spends) down from 7 — inputs stored as raw_tx blob,
  block_ids as arrays, tx_state merged into txs
- COPY protocol for batched creates (staging tables + INSERT...SELECT)
- Direct INSERT for spends with bulk validation CTE
- Pipelined SetMined via SendBatch with array append
- Connection pool 100 with synchronous_commit=off
- 16 hash partitions per table with fillfactor tuning
- Drain mode batcher (go-batcher v1.2.10)

Benchmark results (local Mac, single postgres):
- Validator hot path: 9,579 TPS at 1K workers (1.4x SQL store)
- SetMined: 96,135 TPS at 10K txs (3.2x SQL store)
- All 49 unit tests + 9 sequential integration tests passing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented Apr 11, 2026

Copy link
Copy Markdown
Contributor

🤖 Claude Code Review

Status: Complete

A substantial, well-engineered addition — a PostgreSQL-native UTXO store (~12k LOC) with a thoughtful packed-column layout, deferred off-hot-path DAH pruning, and extensive inline rationale. The double-spend core is sound (UNIQUE(prev_tx_hash, prev_output_idx) + ON CONFLICT DO NOTHING serializes concurrent spenders), the packed bitmap/substr encoding round-trips correctly, and queries are parameterized (no injection). The CheckBlockIsInCurrentChain refactor (flag fast-path + flag-free parent_id CTE fallback) is carefully reasoned and avoids false-negative invalidations.

Findings (current review):

  • 🔴 [Critical] get.go:397GetSpend dereferences spend.UTXOHash[:] unconditionally, but it is legitimately nil on the (txid, vout) lookup path used by the public /api/v1/utxos endpoint. Panics where the SQL store guards with != nil. Clear, reachable.
  • 🟠 [Major — verify] dah_sweep.go:248 — Under READ COMMITTED the DAH stamping UPDATE can re-stamp a row whose spend a concurrent reorg Unspend just cleared (the IS DISTINCT FROM guard does not re-assert fully-spent-ness). Likely self-healing via watermark-rewind + retention buffer — please confirm both invariants hold, else a still-spendable UTXO could be pruned.
  • 🟡 [Minor] mined.go:55 — blind array append yields duplicate block_ids on crash-recovery re-mine; readers surface them. Aerospike store de-dups.
  • 🟡 [Minor] spend.go diagnostic-failure check order diverges from the SQL store (sql.go:2132-2172), so the two backends can return different error types for the same multi-condition outpoint; since needsSpendRollback keys off the type, rollback/conflict-marking behaviour can differ. Worth aligning.

The Critical should be fixed before merge; the DAH race warrants an explicit confirmation of the rewind/retention ordering.

Reviewed statically against the SQL and Aerospike reference stores. Note: the store has not yet been exercised through the validator service or the sequential/integration suites (per the PR description) — that coverage would catch issues a static read cannot.

@github-actions

github-actions Bot commented Apr 11, 2026

Copy link
Copy Markdown
Contributor

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-684 (570e5d9)

Summary

  • Regressions: 0
  • Improvements: 0
  • Unchanged: 138
  • Significance level: p < 0.05
All benchmark results (sec/op)
Benchmark Baseline Current Change p-value
_NewBlockFromBytes-4 1.758µ 1.799µ ~ 0.700
Block_ValidOrderAndBlessed_DiskVsMemory/leaves=1024/memory-4 12.10m 12.09m ~ 1.000
Block_ValidOrderAndBlessed_DiskVsMemory/leaves=1024/disk_1-4 12.67m 12.74m ~ 0.700
Block_ValidOrderAndBlessed_DiskVsMemory/leaves=1024/disk_2-4 12.92m 12.90m ~ 0.700
Block_ValidOrderAndBlessed_DiskVsMemory/leaves=16384/memo... 27.42m 27.84m ~ 0.200
Block_ValidOrderAndBlessed_DiskVsMemory/leaves=16384/disk... 35.26m 34.93m ~ 0.100
Block_ValidOrderAndBlessed_DiskVsMemory/leaves=16384/disk... 35.95m 36.38m ~ 0.700
SplitSyncedParentMap_SetIfNotExists/256_buckets-4 64.07n 64.13n ~ 0.500
SplitSyncedParentMap_SetIfNotExists/16_buckets-4 63.97n 64.03n ~ 0.700
SplitSyncedParentMap_SetIfNotExists/1_bucket-4 64.10n 64.15n ~ 0.700
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets... 29.54n 31.59n ~ 0.200
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_... 50.18n 53.45n ~ 0.100
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa... 106.1n 105.9n ~ 0.800
MiningCandidate_Stringify_Short-4 251.1n 248.7n ~ 0.500
MiningCandidate_Stringify_Long-4 1.655µ 1.643µ ~ 0.200
MiningSolution_Stringify-4 857.0n 845.2n ~ 0.400
BlockInfo_MarshalJSON-4 1.623µ 1.622µ ~ 1.000
NewFromBytes-4 102.50n 98.18n ~ 0.100
AddTxBatchColumnar_Validation-4 1.828µ 1.966µ ~ 0.100
OffsetValidationLoop-4 555.9n 422.6n ~ 0.100
Mine_EasyDifficulty-4 67.27µ 68.00µ ~ 0.700
Mine_WithAddress-4 7.168µ 7.213µ ~ 0.700
DiskTxMap_SetIfNotExists-4 3.423µ 3.449µ ~ 0.400
DiskTxMap_SetIfNotExists_Parallel-4 3.417µ 3.365µ ~ 1.000
DiskTxMap_ExistenceOnly-4 300.8n 304.7n ~ 0.700
Queue-4 188.3n 188.2n ~ 1.000
AtomicPointer-4 4.569n 4.768n ~ 0.100
TxMapSetIfNotExists-4 52.31n 52.70n ~ 0.400
TxMapSetIfNotExistsDuplicate-4 40.45n 39.98n ~ 0.400
ChannelSendReceive-4 590.2n 587.8n ~ 0.700
BlockAssembler_AddTx-4 0.03191n 0.02810n ~ 0.200
AddNode-4 11.32 11.20 ~ 1.000
AddNodeWithMap-4 11.51 11.64 ~ 0.700
DirectSubtreeAdd/4_per_subtree-4 75.00n 75.64n ~ 1.000
DirectSubtreeAdd/64_per_subtree-4 41.67n 42.18n ~ 1.000
DirectSubtreeAdd/256_per_subtree-4 39.61n 39.67n ~ 1.000
DirectSubtreeAdd/1024_per_subtree-4 38.40n 38.40n ~ 1.000
DirectSubtreeAdd/2048_per_subtree-4 37.84n 38.00n ~ 0.200
SubtreeProcessorAdd/4_per_subtree-4 248.4n 258.8n ~ 0.400
SubtreeProcessorAdd/64_per_subtree-4 240.6n 253.5n ~ 0.100
SubtreeProcessorAdd/256_per_subtree-4 244.5n 251.5n ~ 0.100
SubtreeProcessorAdd/1024_per_subtree-4 236.0n 243.1n ~ 0.400
SubtreeProcessorAdd/2048_per_subtree-4 235.9n 241.9n ~ 0.100
SubtreeProcessorRotate/4_per_subtree-4 241.5n 236.6n ~ 0.700
SubtreeProcessorRotate/64_per_subtree-4 245.9n 241.9n ~ 0.100
SubtreeProcessorRotate/256_per_subtree-4 244.3n 244.7n ~ 0.800
SubtreeProcessorRotate/1024_per_subtree-4 240.4n 241.7n ~ 0.700
SubtreeNodeAddOnly/4_per_subtree-4 87.41n 87.87n ~ 0.100
SubtreeNodeAddOnly/64_per_subtree-4 64.47n 64.39n ~ 0.100
SubtreeNodeAddOnly/256_per_subtree-4 63.53n 63.45n ~ 0.700
SubtreeNodeAddOnly/1024_per_subtree-4 63.08n 63.13n ~ 1.000
SubtreeCreationOnly/4_per_subtree-4 144.1n 144.5n ~ 0.700
SubtreeCreationOnly/64_per_subtree-4 538.6n 536.7n ~ 1.000
SubtreeCreationOnly/256_per_subtree-4 1.879µ 1.867µ ~ 0.100
SubtreeCreationOnly/1024_per_subtree-4 6.160µ 6.122µ ~ 1.000
SubtreeCreationOnly/2048_per_subtree-4 11.17µ 11.18µ ~ 1.000
SubtreeProcessorOverheadBreakdown/64_per_subtree-4 244.4n 237.2n ~ 0.100
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4 248.9n 234.8n ~ 0.100
ParallelGetAndSetIfNotExists/1k_nodes-4 13.19m 13.50m ~ 0.400
ParallelGetAndSetIfNotExists/10k_nodes-4 17.57m 17.84m ~ 0.200
ParallelGetAndSetIfNotExists/50k_nodes-4 19.69m 19.84m ~ 0.100
ParallelGetAndSetIfNotExists/100k_nodes-4 22.95m 23.44m ~ 0.200
SequentialGetAndSetIfNotExists/1k_nodes-4 11.85m 13.13m ~ 0.100
SequentialGetAndSetIfNotExists/10k_nodes-4 15.12m 16.62m ~ 0.100
SequentialGetAndSetIfNotExists/50k_nodes-4 23.47m 23.93m ~ 0.700
SequentialGetAndSetIfNotExists/100k_nodes-4 28.93m 27.98m ~ 0.200
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4 12.88m 15.03m ~ 0.100
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4 21.32m 18.66m ~ 0.100
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4 23.08m 23.00m ~ 1.000
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4 12.54m 14.86m ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4 18.77m 17.59m ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4 52.50m 52.10m ~ 0.400
CalcBlockWork-4 545.8n 524.2n ~ 0.100
CalculateWork-4 712.2n 708.0n ~ 0.700
CheckOldBlockIDs/on-chain-prefetch/1000-4 66.32µ 68.16µ ~ 1.000
CheckOldBlockIDs/off-chain-prefetch/1000-4 51.31µ 51.05µ ~ 0.400
CheckOldBlockIDs/on-chain-prefetch/10000-4 463.9µ 454.5µ ~ 0.100
CheckOldBlockIDs/off-chain-prefetch/10000-4 355.7µ 347.1µ ~ 0.100
BuildBlockLocatorString_Helpers/Size_10-4 1.365µ 1.340µ ~ 0.100
BuildBlockLocatorString_Helpers/Size_100-4 13.00µ 12.82µ ~ 0.100
BuildBlockLocatorString_Helpers/Size_1000-4 128.7µ 126.8µ ~ 0.100
CatchupWithHeaderCache-4 104.5m 104.5m ~ 0.200
_BufferPoolAllocation/16KB-4 3.940µ 3.796µ ~ 0.200
_BufferPoolAllocation/32KB-4 9.730µ 9.622µ ~ 1.000
_BufferPoolAllocation/64KB-4 20.78µ 19.22µ ~ 1.000
_BufferPoolAllocation/128KB-4 29.63µ 32.89µ ~ 0.100
_BufferPoolAllocation/512KB-4 114.9µ 112.8µ ~ 1.000
_BufferPoolConcurrent/32KB-4 18.61µ 18.34µ ~ 1.000
_BufferPoolConcurrent/64KB-4 30.16µ 30.10µ ~ 1.000
_BufferPoolConcurrent/512KB-4 144.1µ 146.0µ ~ 1.000
_SubtreeDeserializationWithBufferSizes/16KB-4 646.3µ 652.0µ ~ 1.000
_SubtreeDeserializationWithBufferSizes/32KB-4 627.3µ 621.9µ ~ 0.700
_SubtreeDeserializationWithBufferSizes/64KB-4 624.2µ 618.0µ ~ 1.000
_SubtreeDeserializationWithBufferSizes/128KB-4 617.6µ 612.9µ ~ 1.000
_SubtreeDeserializationWithBufferSizes/512KB-4 628.3µ 616.3µ ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/16KB-4 37.04m 35.98m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/32KB-4 36.85m 35.55m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/64KB-4 36.73m 35.38m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/128KB-4 36.14m 35.31m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/512KB-4 36.20m 35.16m ~ 0.100
_PooledVsNonPooled/Pooled-4 670.3n 829.6n ~ 0.100
_PooledVsNonPooled/NonPooled-4 7.580µ 7.701µ ~ 0.400
_MemoryFootprint/Current_512KB_32concurrent-4 7.097µ 6.996µ ~ 0.700
_MemoryFootprint/Proposed_32KB_32concurrent-4 9.675µ 9.467µ ~ 0.100
_MemoryFootprint/Alternative_64KB_32concurrent-4 9.404µ 9.032µ ~ 0.100
_prepareTxsPerLevel-4 405.1m 410.6m ~ 0.700
_prepareTxsPerLevelOrdered-4 3.618m 3.622m ~ 0.700
_prepareTxsPerLevel_Comparison/Original-4 402.2m 414.5m ~ 0.100
_prepareTxsPerLevel_Comparison/Optimized-4 4.004m 3.684m ~ 0.100
SubtreeSizes/10k_tx_4_per_subtree-4 1.445m 1.470m ~ 0.400
SubtreeSizes/10k_tx_16_per_subtree-4 332.3µ 337.6µ ~ 0.100
SubtreeSizes/10k_tx_64_per_subtree-4 82.54µ 81.18µ ~ 0.100
SubtreeSizes/10k_tx_256_per_subtree-4 20.66µ 20.27µ ~ 0.100
SubtreeSizes/10k_tx_512_per_subtree-4 10.25µ 10.07µ ~ 0.100
SubtreeSizes/10k_tx_1024_per_subtree-4 5.084µ 4.948µ ~ 0.100
SubtreeSizes/10k_tx_2k_per_subtree-4 2.530µ 2.499µ ~ 0.400
BlockSizeScaling/10k_tx_64_per_subtree-4 79.87µ 78.35µ ~ 0.100
BlockSizeScaling/10k_tx_256_per_subtree-4 20.37µ 19.97µ ~ 0.100
BlockSizeScaling/10k_tx_1024_per_subtree-4 5.032µ 4.915µ ~ 0.100
BlockSizeScaling/50k_tx_64_per_subtree-4 403.6µ 397.6µ ~ 1.000
BlockSizeScaling/50k_tx_256_per_subtree-4 100.48µ 98.78µ ~ 0.400
BlockSizeScaling/50k_tx_1024_per_subtree-4 25.13µ 24.56µ ~ 0.100
SubtreeAllocations/small_subtrees_exists_check-4 165.2µ 163.0µ ~ 0.200
SubtreeAllocations/small_subtrees_data_fetch-4 175.1µ 171.9µ ~ 0.400
SubtreeAllocations/small_subtrees_full_validation-4 330.2µ 327.2µ ~ 0.100
SubtreeAllocations/medium_subtrees_exists_check-4 10.148µ 9.836µ ~ 0.100
SubtreeAllocations/medium_subtrees_data_fetch-4 10.88µ 10.77µ ~ 0.200
SubtreeAllocations/medium_subtrees_full_validation-4 20.55µ 20.13µ ~ 0.100
SubtreeAllocations/large_subtrees_exists_check-4 2.494µ 2.406µ ~ 0.100
SubtreeAllocations/large_subtrees_data_fetch-4 2.695µ 2.671µ ~ 0.100
SubtreeAllocations/large_subtrees_full_validation-4 5.225µ 5.088µ ~ 0.100
StoreBlock_Sequential/BelowCSVHeight-4 335.0µ 341.5µ ~ 0.100
StoreBlock_Sequential/AboveCSVHeight-4 337.1µ 341.0µ ~ 0.700
GetUtxoHashes-4 267.6n 263.1n ~ 0.300
GetUtxoHashes_ManyOutputs-4 43.71µ 48.64µ ~ 0.100
_NewMetaDataFromBytes-4 228.7n 229.5n ~ 0.400
_Bytes-4 401.7n 399.1n ~ 1.000
_MetaBytes-4 137.7n 139.7n ~ 0.100

Threshold: >10% with p < 0.05 | Generated: 2026-06-12 14:58 UTC

freemans13 and others added 2 commits April 11, 2026 01:30
…tions

One row per tx (like Aerospike). spend_utxo() stored function validates
and applies spends atomically server-side. All output data stored as
parallel arrays. Zero JOINs, zero multi-table transactions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

**Goal:** Replace the v3 queue+materializer UTXO store with an all-direct, append-only architecture that eliminates queue tables, stored procedures, pg_cron, and pg_notify.

**Architecture:** Every operation writes directly to snapshot tables. Spends are append-only INSERTs into a new `spends` table (no UPDATE on outputs). All mutable state lives in a narrow `tx_state` table with HOT updates. Creates use unnest arrays for multi-row inserts in a single transaction.

@github-actions github-actions Bot Apr 11, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Resolved] The implementation correctly matches v5 spec.

The v4 plan document is included for historical reference. This PR implements v5 (consolidated design) as documented in the v5 design spec.

v5 spec (lines 107-130) describes the consolidated txs table with embedded state columns (locked, conflicting, frozen, etc.), which matches the implementation in schema.go:84-103.

The plan document serves as a historical record of the v4 design iteration, which was superseded by v5 consolidated approach for better performance.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The v4 plan document is included for historical reference. This PR actually implements v5 (consolidated design), documented in docs/superpowers/specs/2026-04-10-turbo-utxo-store-v5-design.md. The implementation correctly matches the v5 spec.


**Architecture:** Every operation writes directly to snapshot tables. Spends are append-only INSERTs into a new `spends` table (no UPDATE on outputs). All mutable state lives in a narrow `tx_state` table with HOT updates. Creates use unnest arrays for multi-row inserts in a single transaction.

**Tech Stack:** Go, pgx/v5 (pgxpool), PostgreSQL 17 (64-way hash partitioning, fillfactor tuning, HOT updates)

@github-actions github-actions Bot Apr 11, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Resolved] 16 partitions matches v5 spec.

v5 spec (line 167) explicitly states: Partition count 16 instead of 64. Less routing overhead for current data volumes.

Code (schema.go:24): const numPartitions = 16

The implementation correctly matches the v5 design decision to use 16 partitions (not 64) for reduced routing overhead. The v4 plan reference to 64-way partitioning was superseded by the v5 optimization.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 16 partitions (not 64) is correct for v5. The v4 plan with 64 partitions was found to have partition routing overhead. V5 intentionally reduces to 3 tables × 16 partitions for better performance.

Comment thread stores/utxo/queue/store.go Outdated

// Set synchronous_commit = off on each connection for write performance.
pgxConfig.AfterConnect = func(ctx context.Context, conn *pgx.Conn) error {
_, err := conn.Exec(ctx, "SET synchronous_commit = off")

@github-actions github-actions Bot Apr 11, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Resolved] Issue fixed in latest commit.

The code now uses synchronous_commit=on (the PostgreSQL default) for full durability, as shown in store.go:104-105. The implementation correctly balances performance (UNLOGGED tables, batching) with acceptable durability for financial data.

freemans13 and others added 3 commits April 11, 2026 02:20
From postgres-1m-tps-playbook.md:
- LZ4 compression on raw_tx column (faster than default pglz at volume)
- Aggressive autovacuum on txs table (scale_factor=0.01, cost_delay=2ms)

These help at sustained production load, not in short benchmarks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Outputs stored as parallel arrays on txs row. Reduces Create from
2-table write to 1-table write. Spend validation JOINs 2 tables
(txs + spends) instead of 3 (txs + outputs + spends).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
frozen BOOLEAN DEFAULT FALSE,
spendable_in INT,
PRIMARY KEY (tx_hash, idx)
) PARTITION BY HASH(tx_hash);

@github-actions github-actions Bot Apr 11, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Resolved] Schema now correctly implements 3-table v5 design.

v5 spec (lines 107-153) describes 3 UNLOGGED tables: txs, outputs, spends.
Code (schema.go:84-131) implements all 3 tables exactly as specified.

The implementation and documentation are now fully aligned.

…hmark

- Replace 3-phase transaction spend batcher (bulk SELECT → Go validation →
  bulk INSERT) with pipelined validation CTEs via SendBatch — each CTE does
  validate+insert in one query, all sent in one network flush
- Enable parallel spend batcher (background=true) — safe with pipelined CTEs
  since no transaction/deadlock risk
- Reduce pool to 100 connections (fewer = less PG overhead)
- Increase batch sizes to 500 items with 5ms intervals
- Add aerospike throughput benchmark for 3-way comparison
- Results: Queue 18K TPS vs SQL 6K TPS vs Aerospike 86K TPS at 15K workers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
freemans13 and others added 5 commits June 9, 2026 14:57
…ith gold standard

Three divergences found while reviewing the new PostgreSQL UTXO store, each
backed by a cross-backend regression test that passes on aerospike/sql and
previously failed on postgres:

- UnsetMined on a missing tx now no-ops instead of returning a StorageError,
  per Interface.go:295-303 (was aborting any reorg that touches an
  already-pruned tx).
- unsetMinedMulti preserves unmined_since when non-longest-chain block entries
  remain (the unmined_since CASE used ELSE NULL, clobbering the marker and
  exposing the tx to premature DAH pruning / fork-choice misclassification).
- RemoveBlockIDs trims block_ids/block_heights/subtree_idxs together via
  UNNEST WITH ORDINALITY so the parallel arrays stay index-aligned (was
  trimming block_ids only, misaligning every subsequent positional read).

Adds shared utxo.Store tests in stores/utxo/tests wired across backends:
- SetMinedUnsetOnMissingTx
- UnsetMinedPreservesUnminedSinceWhenNonLCBlocksRemain
- RemoveBlockIDsKeepsParallelArraysAligned (postgres + sql; sql is the correct reference)
- UnspendFlagAsLockedLocksParent (postgres + sql; pins existing correct behavior)

The last two are intentionally NOT wired to aerospike, which has matching
gaps (RemoveBlockIDs leaves parallel bins stale; Unspend silently discards
flagAsLocked) that are out of scope for this change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…llback, preserve_until index

Three robustness/perf fixes from the PR bsv-blockchain#684 review:

- get.go: bounds-check prev_output_idx before using it to subscript
  data.SpendingDatas in getInternal. The column is an unbounded BIGINT; a
  corrupt/orphaned spends row could carry an index past the tx output count,
  turning a Get into an index-out-of-range panic reachable from any
  caller-supplied tx hash. Now returns a processing error instead.
  Regression test (get_bounds_test.go) injects a malformed spends row and
  asserts Get does not panic.

- alert_system.go: wrap the UnFreezeUTXOs and ReAssignUTXO two-step error
  fallbacks in a single transaction. Previously they issued two separate
  pool.Exec calls, exposing a torn state where out_frozens[idx]=false but
  txs.frozen still held the old value — which a concurrent Spend would read
  on the tx_frozen gate and wrongly reject.

- schema.go: add partial index px_preserve_until on txs(preserve_until) WHERE
  preserve_until IS NOT NULL, so ProcessExpiredPreservations no longer full-scans
  the txs table on every prune tick.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
deleteTombstonedPartition previously SELECTed every tombstoned hash in a
partition into a [][]byte before deleting any — at millions of tombstoned rows
per partition (x8 partitions) this could OOM the node. Replace the load-all
scan with a bounded loop: a single statement selects up to pruneDeleteBatchSize
doomed hashes (LIMIT) and cascade-deletes them from the aligned spends/txs
leaves, looping until a batch comes back short. Peak memory is now constant
regardless of tombstone backlog.

The doomed CTE applies the delete_at_height predicate in the same statement and
snapshot as the DELETE, so a concurrent reorg Unspend that clears a stamp is
honoured (the revived tx is not selected and keeps its spends); each iteration
re-evaluates the predicate, so revived rows are excluded from later batches.
This also removes the prior Go-side SELECT/DELETE race window entirely.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…-step bounded DAH sweep, interleaved prune, SetMinedMulti RETURNING

Sustained-with-prune profiling (pg_stat_statements + wait events) showed the
bench is postgres-server-CPU-bound (~12 of 16 cores, 0% idle), not disk-bound:
stretching checkpoints 90s->30min cut FPIs ~20x with zero TPS change. The
reclaim pipeline (sweep enumerate + stamp + doomed scan + cascade delete) was
~25-30% of all server CPU and its serial sweep-then-delete cycles caused a
metastable slow regime (bimodal ~88K vs ~65K medians on identical code).

- schema.go: px_unmined_since + px_delete_at_height partial btrees -> BRIN.
  Both columns are modified by per-tx UPDATEs (SetMinedMulti nulls
  unmined_since; the DAH sweep stamps delete_at_height); btrees there
  disqualified HOT (measured 33.7% HOT ratio -> 83.2% with BRIN).
- dah_sweep.go: split sweepDAHRangePartition into enumerate (BRIN range scan,
  bounded) + stamp via bytea[] param under force_custom_plan, replacing the
  enable_nestloop=off hack that forced full-spends-partition hash joins every
  call (~30% of server CPU). Cap per-call candidates (derived from partition
  count) so one call can never become a multi-second monolith. Add
  sweepDAHStep for bounded slices.
- pruner_provider.go: Prune now interleaves ONE bounded sweep slice with a
  bounded delete slice per call instead of sweeping to tip then draining
  fully — stamping and deleting pipeline instead of starving each other.
- mined.go: fold the post-UPDATE verification SELECT into the UPDATE via
  RETURNING hash, block_ids (drops a full second =ANY probe pass).
- throughput_pruned_test.go: prunedMiners 3->12 (3 serial miners WERE the
  measured plateau; production runs MaxMinedRoutines=128), mine-channel depth
  telemetry, and a live-table-size creation gate (1.5M rows) so reclaim
  falling behind reads as back-pressure, not unbounded table growth.
- throughput_designd_spike_test.go: Phase-4A throwaway A/B instrument for
  generational DROP-partition reclaim vs cascade DELETE.

Measured (M3 Max 16c/64GB, PG18.2, 10K workers, table bounded):
sustained-with-prune 63.1K -> ~75K honest / 88.4K best (CV 4.2%); no-prune
ceiling 92K -> 111.8K median. Full postgres suite green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… not disk-bound; 100K verdict superseded

The 'shared NVMe write budget' model and the '100K not reachable here' verdict
were measured under GUC drift (max_wal_size=1GB with min_wal_size=2GB) and
three since-fixed CPU sinks. New measured truth: no-prune ceiling ~112K on the
same host, balanced ~75K honest / 88K best, gap = the per-row reclaim
pipeline's CPU (~25-30%), route to >=100K = generational DROP-partition
reclaim. checkpoint_timeout recommendation 90s -> 30min (FPI volume, measured
~20x), with a min<max WAL-size warning.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Comment thread stores/utxo/postgres/create.go Outdated
if err != nil {
return outputArrayParams{}, err
}
p.idx = append(p.idx, int64(i))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Major] Per-output arrays are compacted but read by absolute output index — misalignment when any output is nil

buildOutputArrays skips nil outputs with continue, so utxo_hashes/out_spendables/out_frozens/coinbase_spending_heights/spendable_ins are written positionally compacted. The true output index is captured in p.idx (line 84) but is never stored — the txs table has no output-index column, and the bulk path only uses idx for ORDER BY, not to gap-fill positions.

Every reader, however, subscripts these arrays by the absolute output index: spend.go uses t.utxo_hashes[$2::int + 1] (= vout+1) and dah_sweep.go uses out_spendables[s.prev_output_idx::int + 1]. The model requires array position i+1 == output index i. The moment any output is nil, every later output reads its UTXO hash / spendable / frozen / coinbase-maturity / spendable_in from the wrong slot → spend validates against the wrong output, and the DAH “fully spent” count is wrong (premature or never-pruning).

Both reference stores preserve the mapping: the sql store compacts identically but stores the real index in the outputs.idx column (sql.go:888) and reads by idx = vout; aerospike writes utxos[i] at the true index slot (aerospike/create.go:824-833), leaving nil slots nil. This store dropped that mapping. Fix: gap-fill the arrays to full len(tx.Outputs) (NULL/placeholder for nil/non-stored slots) so position == output index.

(Reachability depends on whether Create ever receives a tx with nil outputs — the nil-skip guard exists in all three stores, so the codebase clearly anticipates it.)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Resolved by commit a5c7d3b4 (packed per-output columns). buildOutputArrays now writes each output at its true index — copy(p.utxoHashes[i*32:], …) and setPackedBit(p.spendableBits, i) (create.go:125,131) — leaving nil outputs as a zeroed 32-byte slot with the spendable bit clear, so array position == output index. All readers (spend.go, get.go, dah_sweep.go, delete.go) subscript by absolute index via substr(utxo_hashes, idx*32+1, 32) / get_bit(out_spendables, idx), consistent with the write. The misalignment-on-nil-output hazard is closed.

Comment thread stores/utxo/postgres/mined.go Outdated
func (s *Store) unsetMinedMulti(ctx context.Context, hashes []*chainhash.Hash, blockID uint32, blockHeight uint32) (map[chainhash.Hash][]uint32, error) {
resultMap := make(map[chainhash.Hash][]uint32, len(hashes))

currentBlockHeight := int64(s.blockHeight.Load())

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Major] unmined_since off-by-one on reorg (UnsetMined) — missing +1

currentBlockHeight := int64(s.blockHeight.Load()) is bound as $3 and written into unmined_since (line 229-231) when a tx is fully reorged out. Both gold-standard stores stamp unmined_since with blockHeight.Load() + 1:

  • sql: currentBlockHeight := s.blockHeight.Load() + 1 (sql.go:3276), comment: “mirrors the aerospike Lua which sets unmined_since = currentBlockHeight”
  • aerospike: thisBlockHeight := s.blockHeight.Load() + 1 (set_mined.go:208)

The comment here (lines 227-228) claims this matches aerospike teranode.lua:637-644 and sql.go:3268-3308, but it diverges by one. A reorged-out tx is marked unmined one block too early, so the prunable-unmined iterator / QueryOldUnminedTransactions (which compare unmined_since <= cutoff) make it eligible for deletion one block sooner than the reference — premature pruning of still-live data after a reorg.

Fix: currentBlockHeight := int64(s.blockHeight.Load()) + 1.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still unaddressed in the current code. `mined.go:189` binds `unmined_since` to `int32(s.blockHeight.Load())` with no `+1`. The new in-code comment (lines 226-232) claims parity with `sql.go:3268-3308` and aerospike `teranode.lua:637-644`, but both references add one: `sql.go:3276` uses `s.blockHeight.Load() + 1` and aerospike `set_mined.go:208` uses `s.blockHeight.Load() + 1`. A reorged-out tx is therefore stamped one block lower than the reference oracle. Fix: `int32(s.blockHeight.Load()) + 1`.

// concurrent SetBlockHeight, and (b) preserves a single prepared-plan
// cache entry across heights (a literal would re-plan per height).
updateSQL = `UPDATE txs SET
block_ids = COALESCE(block_ids, '{}') || $2::int[],

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Major] SetMinedMulti appends block membership non-idempotently — duplicates on re-processing

The UPDATE unconditionally does block_ids = COALESCE(block_ids, '{}') || $2::int[] (and the parallel block_heights / subtree_idxs) with no existence guard. Re-processing the same block for the same tx (crash-recovery replay, retry, or a duplicate call) appends duplicate (block_id, block_height, subtree_idx) triples. The end-of-function postcondition only checks block_id presence, so it cannot catch this.

Both references dedup:

  • sql: INSERT INTO block_ids (...) ... ON CONFLICT DO NOTHING on the unique (transaction_id, block_id) (sql.go:3204-3209)
  • aerospike: checks blockExists before appending (teranode.lua:611-629)

Consequences of duplicate entries: Get/BatchDecorate return duplicated BlockIDs/BlockHeights, and any block-membership-count or longest-chain logic over these arrays is corrupted until a removal cleans it. Note the companion unsetMinedMulti uses array_remove (removes all copies), so a single later removal would wipe every duplicate at once — but the duplicated state is still observable in between.

Fix: guard the append, e.g. WHERE hash = ANY($1) AND NOT (COALESCE(block_ids,'{}') @> $2::int[]), or unnest-and-rebuild to keep the triples unique.

LIMIT $2
),
del_spends AS (DELETE FROM %[2]s WHERE prev_tx_hash IN (SELECT hash FROM doomed) RETURNING 1)
DELETE FROM %[1]s WHERE hash IN (SELECT hash FROM doomed)`, txsLeaf, spendsLeaf)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Major] Cascade DELETE does not re-check delete_at_height on the live tuple — the claimed reorg-race fix is not actually in force

The comment (lines 222-226) and the PR description state the cascade “re-checks delete_at_height at execution time, closing the window where a concurrent reorg Unspend revives a tx.” But the predicate lives only in the doomed CTE (line 230); the final DELETE FROM txs WHERE hash IN (SELECT hash FROM doomed) (line 234) re-checks only hash IN doomed.

Under READ COMMITTED (the pool default — no explicit isolation / FOR UPDATE / advisory lock), the doomed CTE is materialized once against the statement snapshot. If a concurrent Unspend / unsetMinedMulti sets delete_at_height = NULL and commits after that snapshot, PostgreSQL’s EvalPlanQual recheck on the DELETE re-applies only the DELETE’s own qual (hash IN doomed, still true) — not the CTE’s delete_at_height predicate — so the revived row is deleted while del_spends has already removed its spend markers. That is the exact data-loss race the comment claims to prevent.

The window is narrow and, in practice, further bounded by the prune retention (a reorg deep enough to revive a prune-eligible tx is near the safety horizon), so real-world reachability is low — but the code asserts a guarantee it does not provide.

Fix: repeat the predicate on the live tuple in the outer DELETE so EPQ re-checks it:
DELETE FROM <txs> WHERE hash IN (SELECT hash FROM doomed) AND delete_at_height IS NOT NULL AND delete_at_height <= $1.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still stands after re-review of the current cascadeSQL (lines 227–234). The in-code comment ("applies the delete_at_height predicate in the SAME statement and snapshot as the DELETE, so a concurrent reorg Unspend that clears the stamp is honoured") conflates snapshot visibility with the EvalPlanQual (EPQ) recheck, and they are not the same thing:

  • Under READ COMMITTED the whole statement runs on one snapshot S0. A reorg Unspend that sets delete_at_height = NULL and commits after S0 is invisible to that snapshot, so the doomed CTE still contains the revived hash.
  • When DELETE FROM txs … WHERE hash IN (SELECT hash FROM doomed) reaches that concurrently-updated row, PostgreSQL triggers EPQ: it refetches the latest tuple and re-evaluates only the DELETE node's own qual against it. That qual is hash IN doomeddoomed is a materialized CteScan, the hash is unchanged, so it still matches and the row is deleted. delete_at_height is never re-checked on the live tuple because it lives inside the (already-materialized) CTE, not on the outer DELETE.

Net: a reorg Unspend racing the prune window can still delete a revived tx (and del_spends removes its spend records unconditionally too). To actually honour the predicate via EPQ, repeat it on the outer DELETE:

DELETE FROM txs_pNN
WHERE hash IN (SELECT hash FROM doomed)
  AND delete_at_height IS NOT NULL AND delete_at_height <= $1

and apply the equivalent guard to the del_spends cascade. Low reachability (narrow window), but consensus-critical (loss of a live UTXO row + its spend proofs), so flagging it as still-open [Major].

…ut harness

Adds TestThroughput_QueueStorePruned2Shard: the exact pruned validator harness
driven through a bench-only router that splits the hot path across TWO local
postgres instances (ports 5432/5433) by txid first bit. runPrunedValidator now
takes a small prunedBenchStore interface so both *postgres.Store and the router
fit.

Measured (M3 Max 16-core, both instances sharing the same cores):
- single instance:        ~77K honest median / 88.4K best, bimodal at the edge
- 2-shard (8 partitions): 92.4K / 92.1K median, CV 3.7-6.6%, NO bimodality —
  table bounded ~254K rows/shard, reclaim tracking mined continuously.
The +20% with zero added hardware quantifies the intra-instance contention
component (8 fixed WALInsert slots, one WAL stream/checkpointer per instance)
and validates the client-side hash-sharding direction: two real hosts project
to ~150K+.

Negative result kept for the record: numPartitions=4 per shard reaches ~107K
for the first ~80s, then the live-table gate trips (reclaim halves with the
partition streams) and it settles at ~89.5K over 40 reps — rejected; 8
partitions per shard stays.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
)

func init() {
availableDatabases["postgres"] = func(ctx context.Context, logger ulogger.Logger, tSettings *settings.Settings, storeURL *url.URL) (utxo.Store, error) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Major] postgres:// scheme silently remapped from the SQL store to this new store

Before this PR, availableDatabases["postgres"] routed to the mature sql.Store (PostgreSQL backend). This PR moves the SQL store to "postgressql" and points "postgres" at the new, by-the-PR's-own-admission not-yet-integration-tested store.

This is a silent behavioral change for any existing config that uses a postgres:// UTXO-store URL — it switches backends with no error and no migration path. One such config already lives in the repo at settings.conf:1228 (utxostore.docker.ss.teranode1 = postgres://...). That docker config — and any operator config using postgres:// — now binds to the new store rather than the SQL store it was written against, and the two schemas are not interchangeable.

Suggest one of: keep postgres:// on the SQL store and give the new store its own distinct scheme (e.g. pgutxo://); or, if the remap is intended, update settings.conf:1228 and the utxostore_settings.go longdesc (which still documents postgres:// generically) and call the break out explicitly in the PR/changelog.

freemans13 and others added 2 commits June 10, 2026 01:55
…ght/vout columns

Array packing (the headline): the 5 per-output parallel array columns
(utxo_hashes BYTEA[], out_spendables/out_frozens BOOLEAN[],
coinbase_spending_heights BIGINT[], spendable_ins INT[]) become a flat
32-byte-stride utxo_hashes BYTEA + get_bit-compatible bitmaps + out_count/
spendable_count scalars + a single coinbase_spending_height (the per-output
array was redundant — every output of a coinbase shares one maturity height).
This deletes the out_arr array_agg GROUP BY re-aggregation CTE from the bulk
create path (a top-4 server-CPU statement) and the per-output param fan-out;
per-output access in the spend validation CTEs is substr()/get_bit() byte
arithmetic. Encoding agreement between the Go packers and SQL
substr()/get_bit() is locked in by packed_outputs_test.go against live
postgres. The deferred-DAH fully-spent comparand reads the spendable_count
scalar instead of array_positions().

INT4 narrowing: spends.prev_output_idx + spent_at_height and the txs height
columns (unmined_since, delete_at_height, preserve_until, mined_at_height,
coinbase_spending_height) BIGINT -> INT, with explicit vout overflow guards
where uint32 enters. Spends UNIQUE-index entries -13%, heap rows -10%.

Measured (M3 Max, 10K workers, sustained-with-prune, table bounded, honesty
gate quiet): 2-shard 92.4K -> 95.3/95.7K median (CV 2.9-3.9%) from packing;
INT4 neutral-within-drift on this CPU-bound box (87.9/92.8/94.4K across three
runs, late reps touching 100.5K) — kept for the production cache-footprint
win (narrower rows = more of the working set resident). Two burst-shaped
knobs were A/B'd and REJECTED with rationale comments left in place: 2ms
batcher ticks (95.7K -> 88K) and minedChunkSize=4000 (95.7K -> 86K) — both
break the create/reclaim interleaving.

Schema is not backward compatible with v7-array databases (fresh-schema
deployments only; migration note in schema.go). Full postgres suite green
incl. -race; oracle invariants unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Router now takes any number of postgres stores (txid-first-byte modulo
routing); adds TestThroughput_QueueStorePruned3Shard against a third local
instance (port 5434).

Measured (M3 Max 16-core, packed schema, 10K workers, table bounded, honesty
gate quiet): 3-shard 94.6-98.5K median across 4 runs (GOGC=400 neutral),
vs 2-shard 95.3/95.7K — the third WAL stream buys only ~1-3%, pinning the
local wall as TOTAL CPU rather than intra-instance contention beyond two
instances. Individual reps regularly cross 100-106K. Per-host takeaway:
~96K balanced sustained-with-prune is this box's ceiling; the sharding
multiplier is real but saturates when shards split the same cores.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@bsv-blockchain bsv-blockchain deleted a comment from github-actions Bot Jun 10, 2026
freemans13 and others added 4 commits June 10, 2026 10:47
…no-op (match aerospike)

UnFreezeUTXOs and ReAssignUTXO did a separate frozen-check read and then ran an
UPDATE whose result was discarded. If the row was concurrently deleted/unfrozen
between the read and the UPDATE, the UPDATE matched 0 rows and the call returned
nil — silently reporting success when nothing changed (for ReAssignUTXO, the new
utxo_hash/spendable_in were never applied, so a later spend would fail on a hash
mismatch). FreezeUTXOs already guarded this via RowsAffected.

Make the guarded UPDATE the sole source of truth: require the frozen bit in the
WHERE (get_bit = 1), drop the racy pre-check, and on a 0-row result diagnose and
return a typed error (TxNotFound / UtxoFrozen) via frozenWriteRejectReason. This
removes the TOCTOU entirely and mirrors the aerospike gold standard, whose
unfreeze/reassign Lua UDFs check atomically within the record and return
TX_NOT_FOUND / UTXO_NOT_FROZEN rather than silently succeeding.

Adds shared test UnfreezeAndReassignNotFrozenErr (wired to aerospike, sql, and
postgres) asserting that unfreeze/reassign of a non-frozen output errors on every
backend — pinning cross-backend consistency.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The merge applied cleanly textually but two signatures gained parameters
independently, so callers no longer matched:

- util.SafeSetLimit gained a leading logger arg (upstream); update the postgres
  BatchPreviousOutputsDecorate caller to pass s.logger.
- validator.New (policyRejectedTx producer, upstream bsv-blockchain#799) and
  subtreevalidation.New (p2pClient, this branch) each gained a param; the merge
  auto-combined both, leaving upstream's new legacy_* integration tests one arg
  short. Pass nil for the missing client, matching the existing tests.

Caught by golangci-lint/go vet (test files; go build alone does not compile them).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A cancelled client context does not reliably abort a Postgres backend already
deep in a disk-bound scan. On an undersized node (DB >> shared_buffers, slow
single disk) a stuck per-partition sweep can grind for hours while each interval
stacks another generation on top, saturating the disk in a self-reinforcing loop
(observed on teratestnet: 16+ concurrent sweep backends, the oldest 2.3h old).
Add SET LOCAL statement_timeout = '120s' to the sweep transaction so Postgres
self-aborts a runaway server-side. Healthy bounded sweeps finish in ~hundreds of
ms; 120s only trips a true runaway, with headroom for the cold-cache window right
after a reset.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
LIMIT $2
),
del_spends AS (DELETE FROM %[2]s WHERE prev_tx_hash IN (SELECT hash FROM doomed) RETURNING 1)
DELETE FROM %[1]s WHERE hash IN (SELECT hash FROM doomed)`, txsLeaf, spendsLeaf)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Major] Pruner cascade does not re-validate delete_at_height against a concurrent reorg — the in-comment safety claim does not hold under PostgreSQL READ COMMITTED

The comment (lines 222–226) asserts that "the doomed CTE applies the delete_at_height predicate in the SAME statement and snapshot as the DELETE, so a concurrent reorg Unspend that clears the stamp is honoured: a revived tx is simply not selected."

That reasoning is incorrect. doomed is referenced twice (by del_spends and by the final DELETE FROM txs), so PostgreSQL materialises it once. The two DELETEs then qualify on hash IN (SELECT hash FROM doomed) — the delete_at_height predicate is not part of either DELETE’s own qualification, and doomed takes no row lock (FOR UPDATE).

Failure scenario (the exact reorg data-loss case the design must survive):

  1. Pruner statement starts (snapshot S0); tx T is stamped (delete_at_height <= H) and lands in doomed.
  2. A reorg Unspend/unsetMinedMulti commits after S0: it deletes T’s spend rows and sets T.delete_at_height = NULLT is back on no chain and its outputs are live again.
  3. The pruner reaches T. Because T was updated by a now-committed txn, PostgreSQL runs EvalPlanQual, re-checking the DELETE’s own qual (hash IN doomed) against the new row version — hash is unchanged, so it still matches. delete_at_height is never re-tested. T is deleted, destroying a tx whose outputs are now live. del_spends is keyed off the same stale doomed set, so it also deletes any fresh spend rows a re-spend created.

Suggested fix: re-assert the predicate on the parent delete so EPQ re-checks the live row, e.g.

DELETE FROM txs_pNN t
WHERE t.hash IN (SELECT hash FROM doomed)
  AND t.delete_at_height IS NOT NULL AND t.delete_at_height <= $1

and gate del_spends on the same re-checked set (ideally SELECT ... FOR UPDATE SKIP LOCKED in doomed) so a revived tx loses neither its row nor its spends.

UPDATE txs
SET utxo_hashes = overlay(utxo_hashes placing $3::bytea from $2::int * 32 + 1 for 32),
out_frozens = set_bit(out_frozens, $2::int, 0),
spendable_ins[$2::int + 1] = $4,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Major] ReAssignUTXO drops the spendable-after-reassign restriction for every output with vout ≥ 1

spendable_ins is NULL on create and is written only here. In PostgreSQL, assigning to a subscript of a NULL array does not zero-fill from index 1 — arr[N] = v on a NULL array yields an array with lower bound N and array_length(arr,1) = 1 (e.g. reassigning vout=5 produces [6:6]={si}).

The spend-validation CTE reads it positionally (spend.go:63 and :516):

CASE WHEN array_length(t.spendable_ins, 1) >= $2::int + 1 THEN t.spendable_ins[$2::int + 1] END AS spendable_in

For a reassigned output at vout ≥ 1 (with spendable_ins previously NULL — the normal case), array_length is 1, so the guard 1 >= vout+1 is false and spendable_in resolves to NULL. The block_height < spendable_in immaturity check is silently skipped, so the reassigned UTXO becomes immediately spendable instead of after ReAssignedUtxoSpendableAfterBlocks.

Only vout=0 happens to work ([1:1] satisfies 1 >= 1), which is exactly the index the single test (store_test.go:1330) exercises — and it never asserts the immaturity gate, so the bug is uncaught. The sql store enforces this via a per-row outputs.spendableIn column; aerospike via its reassign UDF.

Fix: build/extend spendable_ins to a contiguous 1-based array so position i+1 always maps to output i (e.g. COALESCE(spendable_ins, array_fill(NULL::int, ARRAY[out_count])) before the subscript assignment), matching the schema comment’s stated [i+1] contract.

tag, err := s.pool.Exec(ctx, `
UPDATE txs
SET out_frozens = set_bit(out_frozens, $2::int, 0),
frozen = (set_bit(out_frozens, $2::int, 0) <> decode(repeat('00', length(out_frozens)), 'hex'))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Major] Recomputing the tx-level frozen flag from out_frozens breaks the per-output freeze isolation that FreezeUTXOs maintains

FreezeUTXOs deliberately does not touch the tx-level frozen column (see its comment, lines 34–40) — it is the whole-tx freeze gate set only at create via WithFrozen, and the spend CTE blocks every output of the tx when it is true (NOT tx_frozen, spend.go:71/:523). Per-output freeze lives solely in the out_frozens bitmap.

But UnFreezeUTXOs (line 129) and ReAssignUTXO (line 174) recompute frozen = (out_frozens has any bit set). This is asymmetric with the freeze path and incorrect:

  1. Create tx (not whole-tx frozen → frozen=false).
  2. FreezeUTXOs output 0 and output 1 per-output (out_frozens bits 0,1 set, frozen stays false). Spending output 2 still works.
  3. UnFreezeUTXOs output 0 → bit 0 cleared, bit 1 still set → recompute sets frozen = TRUE.
  4. The spend CTE now reads tx_frozen=true and blocks all outputs of the tx — including output 2, which was never frozen.

So unfreezing one individually-frozen output silently freezes the entire transaction. Aerospike/sql operate purely per-output and never derive a whole-tx flag from individual outputs.

Fix: drop the frozen = (...) recompute from both UnFreezeUTXOs and ReAssignUTXO (line 174) — leave frozen as the create-time whole-tx gate, exactly as FreezeUTXOs does on the way in.


func init() {
availableDatabases["postgres"] = func(ctx context.Context, logger ulogger.Logger, tSettings *settings.Settings, url *url.URL) (utxo.Store, error) {
availableDatabases["postgressql"] = func(ctx context.Context, logger ulogger.Logger, tSettings *settings.Settings, url *url.URL) (utxo.Store, error) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Major] SQL store registered under "postgressql", but sql.New/InitSQLDB never learned that scheme — the SQL Postgres backend is now unreachable

Commit f5ffac449 moved the "postgres" registration to the new store and re-registered the SQL store under "postgressql". But the SQL store only recognizes the "postgres" scheme:

  • util/sql.go:34 (InitSQLDB, first call in sql.New): switch storeURL.Scheme { case "postgres": ... case "sqlite","sqlitememory": ... } — no "postgressql" case → returns unknown scheme: postgressql.
  • stores/utxo/sql/sql.go:169-181: same switch with default:unknown database engine: postgressql.
  • stores/utxo/sql/sql.go:189: engine: storeURL.Scheme would store "postgressql", so every s.engine == "postgres" fast-path (batched INSERT, pipelined SetLocked, decorate) is silently disabled even if init somehow succeeded.

Net: a postgressql://... URL fails at DB-init — the SQL-backed Postgres store is reachable through no working scheme. "postgressql" is referenced nowhere else in the repo, so this escape hatch is dead on arrival.

Fix: add "postgressql" cases to InitSQLDB and sql.New (normalizing engine to "postgres"), or rewrite the scheme to "postgres" before handing off to sql.New.

freemans13 and others added 4 commits June 12, 2026 14:42
rebuildOffChainSet's fast path runs SELECT id FROM blocks WHERE on_main_chain = false.
The only on_main_chain index (idx_on_main_chain_height) is partial WHERE on_main_chain
= true, so the false lookup falls back to a full sequential scan of the entire blocks
table. rebuildOffChainSet runs on every fork detection, block invalidation/revalidation,
and on a periodic background refresh, so on a large, disk-bound node this recurs
constantly: observed on a testnet node at ~627k blocks as a 634 MB parallel seq scan
that returns 0 rows and competes for disk bandwidth with block-validation reads.

Add idx_off_main_chain, a partial index scoped to on_main_chain = false. Because the
off-chain set is tiny (a few hundred fork/orphan blocks across all of mainnet history,
often zero) the index holds only those rows, never the ~99% on-chain rows — currently
an 8 KB empty metapage. The fast-path query flips from a 44,670-cost parallel seq scan
to a 1.24-cost index-only scan. Added to both the PostgreSQL and SQLite schema paths
and to the expected-index probe used by isBlockchainSchemaCurrent.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… on_main_chain flag

A transiently-false on_main_chain flag on a block that IS on the best chain was
being turned into a PERMANENT block invalidation. The flag self-heals (a later
reconcile/rebuild fixes it), but the invalidation it causes does not: ValidateBlock
stores the candidate invalid=true and never revisits it, freezing the chain.

How the flag goes transiently wrong: StoreBlock's slow path writes on_main_chain=
false and relies on reconcileOnMainChain to fix it, but on failure it only logs and
continues; the every-boot startup rebuildOnMainChainFlag gives up after its retries/
timeout. Both leave a wrong flag with mainChainRebuilding==0, i.e. on the 'flags are
reliable' fast path. A cold-cache post-restart catchup is exactly when these fail.

checkOldBlockIDs (block validation) asks 'is this tx's parent block on the current
chain?' and escalates a negative to invalidation. Both routes it used trusted the
flag: GetBlockHeaderIDs' window query (WHERE on_main_chain=true) and
CheckBlockIsInCurrentChain (SELECT ... AND on_main_chain=true; in-memory variant
filters by the off-chain set, which is itself rebuilt from the flag). So one bad
flag → false negative → permanent invalidation. Observed on a testnet node after a
restart: block 627586 invalidated because parent 627418 (on_main_chain=true in the
DB now) was momentarily seen off-chain; 11-block cascade froze the chain.

Fix — make the NEGATIVE answer flag-independent:
- checkBlockIsInCurrentChainSQL: a flag fast-path miss now falls through to the
  authoritative parent_id recursive CTE (previously only used while rebuilding)
  before returning false. Positives still short-circuit on the indexed flag, so the
  CTE only runs on the rare about-to-reject path. Batched (cteBlockIDBatch=400) so
  the block_ids UNION ALL stays under sqlite's 500-term compound-SELECT limit.
- CheckBlockIsInCurrentChain in-memory wrapper: stop using the flag-derived
  off-chain set as an unconfirmed negative; confirm all committed-id candidates via
  the SQL route. The id>maxBlockID reject stays a pure in-memory, query-free path
  (consensus-critical for allocated-but-uncommitted ids).

Regression test reproduces the exact scenario (on-chain block with a forced
on_main_chain=false must still report on-chain) for both the SQL and in-memory
paths; it fails on the pre-fix code and passes after.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
}

// Validate UTXO hash matches.
if !bytes.Equal(utxoHashBytes, spend.UTXOHash[:]) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Critical] Nil-pointer dereference in GetSpend — externally reachable panic

spend.UTXOHash is *chainhash.Hash and is legitimately nil for callers that locate a UTXO by (txid, vout). This line unconditionally dereferences it:

if !bytes.Equal(utxoHashBytes, spend.UTXOHash[:]) {

The SQL reference store guards this exact case at stores/utxo/sql/sql.go:3409:

if spend.UTXOHash != nil && !bytes.Equal(utxoHash, spend.UTXOHash[:]) {

The path is reachable from the public /api/v1/utxos HTTP endpoint: services/asset/httpimpl/GetUTXOs.go:199 constructs &utxo.Spend{... UTXOHash: nil}, and services/asset/repository/repository.go:922 calls repo.UtxoStore.GetSpend(ctx, spend) — with a comment at line 917 explicitly noting spend.UTXOHash may be nil. With the postgres store selected, this panics.

Fix: add the spend.UTXOHash != nil guard to match the SQL store.

UPDATE %[2]s t SET delete_at_height = st.new_dah
FROM state st
WHERE t.hash = st.hash
AND t.delete_at_height IS DISTINCT FROM st.new_dah`,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Major — please verify] DAH re-stamp race vs. concurrent Unspend (reorg safety)

This stamping UPDATE is a single statement, so spend_agg/state and the write share one READ COMMITTED snapshot — good. But the snapshot of spends is taken at statement start, while the target txs row is locked later. Consider:

  1. Sweep statement starts; spend_agg sees all spendable outputs of hash H spent → new_dah = future height.
  2. A reorg Unspend commits (separate txn): DELETE FROM spends ... then UPDATE txs SET delete_at_height = NULL (spend.go:686,708).
  3. The sweep UPDATE reaches H. Under EPQ it re-fetches the latest txs row (now delete_at_height = NULL), but st.new_dah is fixed from the stale CTE. The only guard is t.delete_at_height IS DISTINCT FROM st.new_dahNULL vs future height is distinct, so it re-stamps a DAH on a tx whose spend was just removed.

The IS DISTINCT FROM guard does not re-assert fully-spent-ness against live spends.

Mitigations that likely make this non-fatal in practice (please confirm each holds):

  • The watermark rewind on reorg should re-cover H and a corrective sweep pass would recompute spent_count < spendable_count → new_dah = NULL, clearing the bad stamp.
  • The retention buffer (new_dah = completion + 1 + retention) must exceed reorg-detection latency so the pruner cannot reach the bad DAH before the corrective sweep runs.

If both hold the window is self-healing; if either can be violated, a still-spendable UTXO can be pruned. Consider re-asserting eligibility in the UPDATE ... WHERE (re-check live spend rows) rather than relying solely on IS DISTINCT FROM.

// Queue UPDATE chunks.
// Simple array append — no idempotency check. Duplicates only occur on crash
// recovery (same block re-processed). UnsetMined removes all matching entries
// so duplicates are harmless.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Minor] Blind array append produces duplicate block_ids that readers surface

The comment says duplicates "are harmless" because UnsetMined removes all matching entries — true for removal, but readers are not protected. On crash-recovery re-processing of the same block, block_ids / block_heights / subtree_idxs each grow by one duplicate element, and Get/BatchDecorate copy them verbatim into data.BlockIDs (get.go:135-148), so consumers see duplicate IDs (e.g. block id 7 twice). The Aerospike store de-duplicates on mine, so this is a cross-store divergence.

Consider a NOT (block_ids @> $2) guard (the conflicting-children insert at create.go:600 already uses this pattern) to keep the arrays duplicate-free.

@sonarqubecloud

Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
33.4% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants