Skip to content

Replace BadgerDB with off-heap mmap hash table in block-validation maps#1053

Merged
icellan merged 27 commits into
bsv-blockchain:mainfrom
icellan:claude-block-validator-ram
Jun 12, 2026
Merged

Replace BadgerDB with off-heap mmap hash table in block-validation maps#1053
icellan merged 27 commits into
bsv-blockchain:mainfrom
icellan:claude-block-validator-ram

Conversation

@icellan

@icellan icellan commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Summary

Block validation's two disk-backed maps — DiskParentSpendsMap (duplicate-input detection) and DiskTxMapUint64 (duplicate-tx detection + parent ordering) — were backed by BadgerDB. Profiling a 410M-tx block on the scaling cluster showed the disk path's per-block cost was dominated by Badger machinery (memtable arenas, store open/close, write batching), not the actual workload, and Badger's durability/compaction apparatus is dead weight for what is a per-block ephemeral scratch structure.

This replaces Badger with a purpose-built off-heap structure:

  • New stores/mmaphash package — a fixed-slot, open-addressing (linear-probe) hash table over one sparse, immediately-unlinked mmap file, split into independently-locked segments (probing wraps within a segment, so a probe never crosses a lock boundary). Contents live in the OS page cache (off the Go heap, not GC-scanned) and spill to NVMe under pressure; cleanup is an unlink.
  • Both disk maps rewritten as thin multi-disk wrappers over mmaphash.Table. ~700 lines of Badger/cuckoo/writer-goroutine machinery removed.
  • ParentSpendsMap.SetIfNotExists widened to (bool, error) so a table-capacity overflow halts validation with a processing error instead of being silently misreported as a duplicate input. The txMap's Put already returns an error, so overflow there surfaces as a non-ErrTxExists error that halts (a real duplicate still returns ErrTxExists → block invalid).

Public interfaces and call sites are otherwise unchanged; the in-memory maps and the no-block_diskMapDirs default path are untouched.

Results (BenchmarkBlock_ValidOrderAndBlessed_DiskVsMemory, realistic SHA-256 keys)

Case Badger (before) mmap (after)
1024 leaves / 1 disk ~70 ms, ~180 MB/op ~11.5 ms, ~0.48 MB/op
16384 leaves / 1 disk ~125 ms, ~280 MB/op ~20 ms, ~7 MB/op
16384 leaves / in-memory ~19 ms, ~14 MB/op (reference)

~6× faster and ~40× lower allocation than Badger, and the off-heap disk path now essentially matches in-memory speed while keeping map contents off the Go heap.

Test Plan

  • go test -race -tags testtxmetacache ./model/ ./stores/mmaphash/ — unit, parity (disk vs in-memory), fuzz, concurrency exactly-once, collision/overflow
  • go test -tags testtxmetacache -run TestTableScale ./stores/mmaphash/ — 50M-entry scale test (skipped under -short)
  • go vet / staticcheck clean on both packages
  • Block-level integration: TestBlock_Valid_DupTxDetected_DiskMapDirs, TestBlock_ValidOrderAndBlessed_DiskMapDirs (in-memory / single-disk / multi-disk)
  • Validate RAM-overflow behavior at true 1B-inpoint scale on the scaling cluster before promoting block_diskMapDirs to a default

Notes / residual risk

The table is fixed-capacity (sized from TransactionCount); overflow fails safe (halts) and, with uniformly-distributed tx hashes at load factor 0.5, cannot occur for a valid block. The big win holds when the table is mostly page-cache-resident (the scaling pods have the RAM); the RAM-overflow regime should be validated on-cluster.

icellan added 19 commits May 29, 2026 16:20
makeHash now varies bytes [16:17] (the disk-routing window) alongside
[0:3], so TestDiskTxMapUint64_MultiDisk and TestDiskParentSpendsMap_MultiDisk
with numDisks=3 actually distribute entries across all disks instead of
piling everything on disk 0.

Both MultiDisk tests now assert nonEmpty>=2 and total==n, so a regression
in the routing logic will be caught immediately.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR replaces BadgerDB-backed, per-block “disk maps” used during block validation with a purpose-built off-heap mmap open-addressing hash table to reduce CPU overhead and Go heap/GC pressure while keeping the disk-backed validation path fast.

Changes:

  • Introduces stores/mmaphash (segmented, immediately-unlinked, mmap-backed hash table) plus unit/bench coverage.
  • Rewrites DiskTxMapUint64 and DiskParentSpendsMap to use one mmaphash.Table per configured disk path and removes Badger/cuckoo machinery.
  • Updates ParentSpendsMap.SetIfNotExists to return (bool, error) and wires error handling + integration tests/benchmarks + settings/metrics help text updates.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
stores/mmaphash/table.go New mmap-backed segmented hash table implementation (core logic + lifecycle).
stores/mmaphash/table_test.go Unit tests for sizing, semantics, collisions, overflow, and concurrency.
stores/mmaphash/table_bench_test.go Micro-benchmark for Upsert performance/allocation.
model/disk_tx_map_uint64.go Replaces Badger+cuckoo disk tx map with mmap-backed tables.
model/disk_tx_map_uint64_test.go Updates tx map tests for new backend and stats semantics.
model/disk_parent_spends_map.go Replaces Badger+cuckoo parent-spends map with mmap-backed tables; returns errors on capacity/storage failure.
model/disk_parent_spends_map_test.go Updates parent-spends tests; adds overflow + parity + fuzz coverage.
model/map.go Changes ParentSpendsMap.SetIfNotExists signature to (bool, error) and updates in-memory implementation.
model/map_test.go Updates in-memory map tests/benchmarks for new signature.
model/Block.go Handles ParentSpendsMap.SetIfNotExists errors during duplicate-input detection.
model/Block_test.go Adds disk-map-dirs integration tests and a disk-vs-memory benchmark harness; adjusts existing tests for new signature.
settings/block_settings.go Updates block_diskMapDirs documentation to describe mmap backend.
model/metrics.go Updates Prometheus metric help strings to reflect new mmap semantics (no in-RAM filter, “logical bytes populated”).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread stores/mmaphash/table.go
Comment thread stores/mmaphash/table.go
Comment thread stores/mmaphash/table.go
Comment thread stores/mmaphash/table.go
Comment thread model/disk_tx_map_uint64.go Outdated
Comment thread stores/mmaphash/table.go Outdated
…r-ram

# Conflicts:
#	model/Block_test.go
#	model/disk_parent_spends_map.go
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-1053 (92b5ed1)

Summary

  • Regressions: 0
  • Improvements: 0
  • Unchanged: 132
  • Significance level: p < 0.05
All benchmark results (sec/op)
Benchmark Baseline Current Change p-value
_NewBlockFromBytes-4 1.613µ 1.618µ ~ 1.000
SplitSyncedParentMap_SetIfNotExists/256_buckets-4 71.06n 71.00n ~ 0.100
SplitSyncedParentMap_SetIfNotExists/16_buckets-4 71.30n 71.03n ~ 0.700
SplitSyncedParentMap_SetIfNotExists/1_bucket-4 71.11n 71.07n ~ 1.000
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets... 33.90n 32.87n ~ 0.200
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_... 56.47n 54.95n ~ 0.200
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa... 134.4n 131.8n ~ 0.400
MiningCandidate_Stringify_Short-4 219.1n 202.3n ~ 0.100
MiningCandidate_Stringify_Long-4 1.617µ 1.378µ ~ 0.100
MiningSolution_Stringify-4 842.3n 742.5n ~ 0.100
BlockInfo_MarshalJSON-4 1.734µ 1.534µ ~ 0.100
NewFromBytes-4 129.1n 130.0n ~ 0.100
AddTxBatchColumnar_Validation-4 2.493µ 2.612µ ~ 0.700
OffsetValidationLoop-4 636.3n 637.4n ~ 0.200
Mine_EasyDifficulty-4 61.05µ 60.87µ ~ 1.000
Mine_WithAddress-4 6.802µ 7.003µ ~ 0.200
BlockAssembler_AddTx-4 0.02750n 0.02858n ~ 0.700
AddNode-4 10.27 10.63 ~ 0.700
AddNodeWithMap-4 11.39 11.03 ~ 0.100
DiskTxMap_SetIfNotExists-4 3.845µ 3.786µ ~ 1.000
DiskTxMap_SetIfNotExists_Parallel-4 3.560µ 3.569µ ~ 1.000
DiskTxMap_ExistenceOnly-4 332.4n 336.4n ~ 1.000
Queue-4 186.6n 185.6n ~ 0.100
AtomicPointer-4 3.266n 3.652n ~ 0.100
TxMapSetIfNotExists-4 49.37n 49.40n ~ 0.500
TxMapSetIfNotExistsDuplicate-4 41.33n 41.45n ~ 0.700
ChannelSendReceive-4 562.3n 564.0n ~ 0.700
DirectSubtreeAdd/4_per_subtree-4 83.66n 77.07n ~ 0.100
DirectSubtreeAdd/64_per_subtree-4 42.05n 41.29n ~ 0.200
DirectSubtreeAdd/256_per_subtree-4 40.59n 40.30n ~ 0.700
DirectSubtreeAdd/1024_per_subtree-4 39.16n 38.72n ~ 0.100
DirectSubtreeAdd/2048_per_subtree-4 39.08n 38.24n ~ 0.400
SubtreeProcessorAdd/4_per_subtree-4 255.6n 261.5n ~ 0.400
SubtreeProcessorAdd/64_per_subtree-4 247.3n 253.2n ~ 0.400
SubtreeProcessorAdd/256_per_subtree-4 246.6n 256.8n ~ 0.200
SubtreeProcessorAdd/1024_per_subtree-4 238.8n 249.9n ~ 0.200
SubtreeProcessorAdd/2048_per_subtree-4 235.7n 246.7n ~ 0.100
SubtreeProcessorRotate/4_per_subtree-4 240.7n 257.2n ~ 0.100
SubtreeProcessorRotate/64_per_subtree-4 239.7n 251.9n ~ 0.100
SubtreeProcessorRotate/256_per_subtree-4 240.5n 254.0n ~ 0.100
SubtreeProcessorRotate/1024_per_subtree-4 242.2n 252.8n ~ 0.100
SubtreeNodeAddOnly/4_per_subtree-4 89.14n 91.09n ~ 0.100
SubtreeNodeAddOnly/64_per_subtree-4 65.62n 66.18n ~ 0.100
SubtreeNodeAddOnly/256_per_subtree-4 64.49n 65.21n ~ 0.100
SubtreeNodeAddOnly/1024_per_subtree-4 63.77n 64.68n ~ 0.100
SubtreeCreationOnly/4_per_subtree-4 151.9n 154.6n ~ 0.100
SubtreeCreationOnly/64_per_subtree-4 559.2n 573.9n ~ 0.400
SubtreeCreationOnly/256_per_subtree-4 1.987µ 1.921µ ~ 0.200
SubtreeCreationOnly/1024_per_subtree-4 6.405µ 6.455µ ~ 0.400
SubtreeCreationOnly/2048_per_subtree-4 11.57µ 11.78µ ~ 0.100
SubtreeProcessorOverheadBreakdown/64_per_subtree-4 251.1n 250.4n ~ 0.400
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4 252.1n 254.6n ~ 0.100
ParallelGetAndSetIfNotExists/1k_nodes-4 14.50m 14.81m ~ 0.400
ParallelGetAndSetIfNotExists/10k_nodes-4 18.97m 19.78m ~ 0.400
ParallelGetAndSetIfNotExists/50k_nodes-4 21.26m 21.84m ~ 0.700
ParallelGetAndSetIfNotExists/100k_nodes-4 26.70m 27.63m ~ 0.100
SequentialGetAndSetIfNotExists/1k_nodes-4 13.06m 14.28m ~ 0.200
SequentialGetAndSetIfNotExists/10k_nodes-4 17.85m 19.72m ~ 0.100
SequentialGetAndSetIfNotExists/50k_nodes-4 29.20m 29.04m ~ 1.000
SequentialGetAndSetIfNotExists/100k_nodes-4 34.18m 41.69m ~ 0.400
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4 13.83m 15.64m ~ 0.100
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4 20.27m 20.23m ~ 0.700
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4 25.25m 25.73m ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4 15.91m 15.76m ~ 1.000
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4 19.29m 19.71m ~ 0.400
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4 65.42m 77.74m ~ 0.100
CalcBlockWork-4 502.3n 517.4n ~ 0.100
CalculateWork-4 695.3n 692.2n ~ 0.700
CheckOldBlockIDs/on-chain-prefetch/1000-4 64.38µ 63.53µ ~ 0.400
CheckOldBlockIDs/off-chain-prefetch/1000-4 51.12µ 51.07µ ~ 1.000
CheckOldBlockIDs/on-chain-prefetch/10000-4 479.9µ 494.1µ ~ 0.700
CheckOldBlockIDs/off-chain-prefetch/10000-4 343.5µ 350.2µ ~ 0.400
BuildBlockLocatorString_Helpers/Size_10-4 1.336µ 1.342µ ~ 0.400
BuildBlockLocatorString_Helpers/Size_100-4 12.75µ 12.84µ ~ 0.200
BuildBlockLocatorString_Helpers/Size_1000-4 126.3µ 126.2µ ~ 1.000
CatchupWithHeaderCache-4 104.2m 104.6m ~ 0.100
_prepareTxsPerLevel-4 421.2m 418.3m ~ 1.000
_prepareTxsPerLevelOrdered-4 3.952m 3.665m ~ 0.200
_prepareTxsPerLevel_Comparison/Original-4 422.9m 440.2m ~ 0.400
_prepareTxsPerLevel_Comparison/Optimized-4 4.262m 3.696m ~ 0.100
SubtreeSizes/10k_tx_4_per_subtree-4 1.379m 1.338m ~ 1.000
SubtreeSizes/10k_tx_16_per_subtree-4 309.6µ 310.8µ ~ 0.400
SubtreeSizes/10k_tx_64_per_subtree-4 71.88µ 72.63µ ~ 0.400
SubtreeSizes/10k_tx_256_per_subtree-4 17.89µ 18.41µ ~ 0.100
SubtreeSizes/10k_tx_512_per_subtree-4 9.046µ 9.230µ ~ 0.700
SubtreeSizes/10k_tx_1024_per_subtree-4 4.412µ 4.474µ ~ 0.400
SubtreeSizes/10k_tx_2k_per_subtree-4 2.194µ 2.197µ ~ 1.000
BlockSizeScaling/10k_tx_64_per_subtree-4 69.74µ 70.18µ ~ 0.700
BlockSizeScaling/10k_tx_256_per_subtree-4 17.53µ 17.70µ ~ 0.700
BlockSizeScaling/10k_tx_1024_per_subtree-4 4.328µ 4.386µ ~ 0.700
BlockSizeScaling/50k_tx_64_per_subtree-4 365.3µ 374.0µ ~ 0.200
BlockSizeScaling/50k_tx_256_per_subtree-4 87.43µ 89.86µ ~ 0.200
BlockSizeScaling/50k_tx_1024_per_subtree-4 21.55µ 21.86µ ~ 0.700
SubtreeAllocations/small_subtrees_exists_check-4 149.9µ 150.0µ ~ 1.000
SubtreeAllocations/small_subtrees_data_fetch-4 157.6µ 159.5µ ~ 0.200
SubtreeAllocations/small_subtrees_full_validation-4 311.3µ 311.4µ ~ 0.700
SubtreeAllocations/medium_subtrees_exists_check-4 8.859µ 8.909µ ~ 1.000
SubtreeAllocations/medium_subtrees_data_fetch-4 9.340µ 9.461µ ~ 0.700
SubtreeAllocations/medium_subtrees_full_validation-4 17.42µ 17.42µ ~ 1.000
SubtreeAllocations/large_subtrees_exists_check-4 2.087µ 2.082µ ~ 1.000
SubtreeAllocations/large_subtrees_data_fetch-4 2.235µ 2.235µ ~ 0.800
SubtreeAllocations/large_subtrees_full_validation-4 4.353µ 4.345µ ~ 1.000
_BufferPoolAllocation/16KB-4 3.871µ 5.695µ ~ 0.200
_BufferPoolAllocation/32KB-4 10.380µ 8.891µ ~ 0.200
_BufferPoolAllocation/64KB-4 19.99µ 17.77µ ~ 0.100
_BufferPoolAllocation/128KB-4 34.74µ 35.47µ ~ 1.000
_BufferPoolAllocation/512KB-4 122.9µ 126.2µ ~ 1.000
_BufferPoolConcurrent/32KB-4 21.68µ 19.76µ ~ 0.100
_BufferPoolConcurrent/64KB-4 34.06µ 30.98µ ~ 0.100
_BufferPoolConcurrent/512KB-4 169.6µ 164.4µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/16KB-4 685.2µ 663.5µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/32KB-4 664.0µ 621.5µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/64KB-4 669.4µ 625.6µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/128KB-4 656.2µ 629.4µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/512KB-4 664.9µ 633.6µ ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/16KB-4 37.97m 38.01m ~ 0.700
_SubtreeDataDeserializationWithBufferSizes/32KB-4 38.19m 38.02m ~ 0.700
_SubtreeDataDeserializationWithBufferSizes/64KB-4 38.19m 38.07m ~ 0.700
_SubtreeDataDeserializationWithBufferSizes/128KB-4 38.36m 37.97m ~ 0.400
_SubtreeDataDeserializationWithBufferSizes/512KB-4 38.47m 38.28m ~ 0.400
_PooledVsNonPooled/Pooled-4 828.7n 821.0n ~ 0.100
_PooledVsNonPooled/NonPooled-4 8.793µ 7.501µ ~ 0.100
_MemoryFootprint/Current_512KB_32concurrent-4 8.774µ 8.311µ ~ 0.100
_MemoryFootprint/Proposed_32KB_32concurrent-4 12.52µ 10.98µ ~ 0.100
_MemoryFootprint/Alternative_64KB_32concurrent-4 12.50µ 10.37µ ~ 0.100
StoreBlock_Sequential/BelowCSVHeight-4 309.1µ 309.2µ ~ 1.000
StoreBlock_Sequential/AboveCSVHeight-4 311.9µ 311.5µ ~ 1.000
GetUtxoHashes-4 278.4n 297.3n ~ 0.400
GetUtxoHashes_ManyOutputs-4 46.56µ 45.89µ ~ 0.200
_NewMetaDataFromBytes-4 213.8n 215.0n ~ 0.300
_Bytes-4 397.8n 400.0n ~ 0.700
_MetaBytes-4 138.7n 140.2n ~ 0.100

Threshold: >10% with p < 0.05 | Generated: 2026-06-12 11:15 UTC

@ordishs

ordishs commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator

Capacity sizing — parentSpendsMap is now a hard cap

Flagging one item from review on the DiskParentSpendsMap sizing in validOrderAndBlessed (model/Block.go):

FilterCapacity: uint(b.TransactionCount * 3),

Under Badger this was effectively unbounded and grew on demand. With mmaphash it's a hard cap: the table holds one entry per input, and once a segment fills, SetIfNotExists returns an error that halts validation. The *3 encodes an assumption of ≤3 inputs/tx on average.

The failure mode is safe-but-disruptive: it won't mark a valid block invalid or accept an invalid one, but a block exceeding ~3 inputs/tx on average — consolidation-heavy traffic is entirely normal on BSV — can overflow the table and stall the node on an otherwise-valid block. Two things make this more likely than the global "load factor 0.5" argument suggests:

  1. Overflow is per-segment, not global — a single hot segment can fill while others are near-empty.
  2. Inpoints sharing the same parent hash but different output index land in the same segment and same start bucket (the index lives in key[32:36], outside both the segment window key[8:16] and bucket window key[0:8]), so sweeping/splitting many outputs of one parent forms a probe chain and raises local occupancy. Correctness holds (full-key compare), but it pushes a segment toward overflow earlier.

Suggestions before promoting block_diskMapDirs to a default:

  • Derive capacity from the actual inpoint count where known, or add headroom (*4*5) and make the multiplier a setting; and/or
  • On ErrTableFull, transparently grow/re-create a larger table instead of halting.
  • Add a test that inserts many inpoints sharing one parent hash (same-parent clustering).

The PR notes already call for validating RAM-overflow behaviour at true scale on the cluster — that's exactly the right gate. I've added an inline comment at the call site documenting the assumption and risk.

Otherwise this is a clean, well-tested refactor (parity/fuzz/concurrency/overflow/scale coverage is strong) and a solid performance win. go build / go vet / go test -race on stores/mmaphash and the model disk-map tests all pass locally on darwin.

@freemans13 freemans13 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. Clean, well-scoped, well-tested replacement of BadgerDB with the off-heap mmap hash table — ~6× faster and ~40× lower allocation, with sound per-segment locking and fail-safe overflow semantics.

Two follow-ups worth addressing (non-blocking given the opt-in flag):

  • txMap close leak: in Block.Valid the disk txMap is closed sequentially after validOrderAndBlessed rather than via defer, so an error from the flush or validOrderAndBlessed skips the close and leaks the mmap RSS. The parentSpendsMap is already defer-closed — txMap should match.
  • Build constraint: stores/mmaphash/table.go imports golang.org/x/sys/unix with no //go:build tag, while txmetacache uses an OS-split. Add a build tag or confirm Windows is unsupported.

Residual risk to document: locate() derives both segment and start bucket from the tx-hash bytes only (not the output index), so all inpoints spending the same parent tx cluster into one segment — a liveness/halt risk on valid blocks with very-high-output-count parents, beyond the uniform-load overflow modeled in the summary.

Please confirm the test-plan checkboxes (especially -race on ./model/ and the multi-disk integration tests) are green before merge.

icellan added 2 commits June 10, 2026 11:36
… comments

- TestTableScale: skip unless RUN_MMAPHASH_SCALE=true (the default make test runs
  with -race and no -short, so 50M race-tracked ops blew the 10m package timeout
  and starved neighbouring packages)
- New: reject Expected > 2^44 (avoids nextPow2 non-termination / int64 overflow)
- Close: clear t.data only after a successful munmap
- Upsert/Lookup: reject wrong key length instead of panicking/silent-corrupting
- DiskTxMapUint64.Get: surface a Lookup error as a miss rather than ignoring it
- fix stale locate() disk-routing comment
Comment thread model/disk_parent_spends_map.go
@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

🤖 Claude Code Review

Status: Complete

Reviewed the BadgerDB → off-heap mmaphash replacement for block validation’s two ephemeral maps. This is consensus-critical hot-path code, so the focus was correctness, concurrency, and the fail-safe behavior.

Consensus safety — looks sound. Identical keys always route to the same disk (key[16:18]), segment (key[8:16]), and start bucket (key[0:8]), and the full key is stored and compared with bytes.Equal (no truncated-hash false positives), so duplicate detection cannot produce a false negative. Capacity overflow returns ErrTableFull → a processing/storage error that halts validation rather than being misreported as a duplicate or a miss; a genuine duplicate still surfaces as ErrTxExists → block invalid. The widened SetIfNotExists (bool, error) is correctly consumed at checkDuplicateInputs. Segments own disjoint regions of one mmap and all access is serialized by the per-segment RWMutex, so reads never observe torn writes. Lifecycle/cleanup (releaseTxMap, deferred Close) is nil-safe and idempotent.

Current Review:

  • One additive [Minor] residual-risk note (inline): at clamped-numSeg block scale, all inpoints of a single high-fan-out parent pin to one segment, giving an O(K²) probe chain and a per-segment overflow ceiling that raising FilterCapacity cannot lift — distinct from the already-acknowledged TransactionCount-sizing tradeoff. Fail-safe and extreme-fan-out only.
  • Most prior Copilot findings appear already addressed in the current revision: Upsert/Lookup now length-check the key before locate (no panic), computeLayout guards loadFactor>1/NaN, both maps reject FilterCapacity==0, Close checks Munmap before nil-ing data, and the locate routing comment is corrected.
  • The disk_parent_spends_map.go:93 TransactionCount-sizing thread is acknowledged by the author as a documented tradeoff — left as-is.

Well-tested: concurrency exactly-once, segment-full overflow, same-parent clustering parity, and disk-vs-memory parity. No blocking issues found; advisory only.

os.CreateTemp does not create parent directories. The Badger-backed
implementation this replaced did the equivalent os.MkdirAll, so pointing
block_diskMapDirs at a not-yet-existing path now made every block fail
validation (checkDuplicateTransactions/validOrderAndBlessed could not create
the disk maps). Restore MkdirAll(opts.Dir) before os.CreateTemp.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.

Comment thread stores/mmaphash/table.go
Comment thread model/disk_tx_map_uint64.go
Comment thread model/disk_parent_spends_map.go
@icellan icellan self-assigned this Jun 10, 2026
icellan added 3 commits June 11, 2026 18:15
Addresses Copilot review comments:
- computeLayout: fall back to default for loadFactor outside (0,1], incl NaN/Inf
  (previously only <=0 was guarded; >1 undersized the table)
- NewDiskTxMapUint64 / NewDiskParentSpendsMap: reject FilterCapacity==0 up-front
  (fixed-capacity table; zero would build a minimal table that overflows at once)
…t 2x)

Addresses review (ordishs, PR bsv-blockchain#1053): the parentSpendsMap is now a fixed
capacity under block_diskMapDirs, so the previously-hardcoded TransactionCount*3
sizing is a hard cap. Replace it with block_parentSpendsCapacityMultiplier
(default 2) so operators can add headroom for consolidation-heavy blocks. A 0
value is treated as 1. Also add a same-parent-clustering test: inpoints sharing
one parent hash but differing in output index land in the same segment+bucket
(index is outside the segment/bucket windows), and must insert and dedupe
correctly under that probe-chain clustering.
@icellan

icellan commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

@ordishs thanks — you're right, and I verified the clustering point against the code: an inpoint key is hash[0:32] ++ index[32:36], so inpoints sharing one parent hash but differing only in output index share key[0:8] (bucket) and key[8:16] (segment) → same segment, same start bucket, forming a probe chain. Consolidation/sweep traffic concentrates there exactly as you describe.

Addressed in this PR:

  1. Capacity is now configurableblock_parentSpendsCapacityMultiplier (commit 3d72b4f) replaces the hardcoded TransactionCount * 3. Default is 2 (kept modest to avoid over-allocating RAM for everyone; operators can raise it for consolidation-heavy deployments). 0 is treated as 1.
  2. Same-parent-clustering testTestDiskParentSpendsMap_SameParentClustering inserts many (sameParentHash, i) inpoints, asserting they all insert once and dedupe correctly under the single-segment probe chain.
  3. Fail-safe halt is unchanged — overflow surfaces as a processing/storage error that halts validation; never a BlockInvalidError, never silent acceptance.

Tracked as a follow-up (not in this PR): #1080 — grow the parentSpendsMap on ErrTableFull instead of halting, which turns the hard cap back into a soft hint and removes the stall-on-valid-block risk entirely. As you and the PR notes say, that plus on-cluster RAM-overflow validation is the gate before block_diskMapDirs is ever promoted to a default.

One honest caveat on the multiplier default: 2 is slightly below the previous hardcoded 3, so in the interim (before #1080) it trims the headroom a little — the rationale is that it's now tunable and the real robustness fix is #1080 rather than padding the multiplier for everyone.

Comment thread stores/mmaphash/table.go
@sonarqubecloud

Copy link
Copy Markdown

@ordishs ordishs left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving — this is a well-executed replacement. The fail-safe error taxonomy is correct end-to-end (ErrTableFull → halt vs ErrTxExists/inserted=false → block invalid, discriminated correctly in both checkDuplicateTransactionsInSubtree and checkDuplicateInputs), the SetIfNotExists widening fixes a latent correctness bug in the old design, the old triple-locked slow path is replaced by something actually verifiable, and the test coverage (parity, fuzz, overflow, clustering, multi-disk distribution, block-level integration) is unusually strong. Verified locally: go test -race -short ./stores/mmaphash/ and the disk-map/block integration tests in ./model/ all pass.

One substantive request before block_diskMapDirs goes anywhere near a default, plus recommendations:

Request: same-parent fan-in has a hard per-segment ceiling (liveness risk)

All inpoints spending outputs of one parent tx share the segment window (key[8:16]), the bucket window (key[0:8]), and the disk-routing window (key[16:18]) — only the 4-byte index at key[32:36] differs, and it participates in nothing. So they land on one disk, in one segment, in one probe chain:

  • Hard ceiling: a valid block spending more than slotsPerSeg outputs of a single parent overflows that segment regardless of how empty the rest of the table is. At 410M-tx geometry that's ~131K (4 disks) to ~524K (1 disk) spends of one parent. Nothing in consensus prevents a parent with 500K outputs being swept in one block — such a block would permanently fail validation on disk-backed nodes (halt, retry, halt again) while in-memory nodes accept it. That's a chain-stall vector for the configuration, and raising the multiplier only buys linear headroom.
  • O(k²) probing: k same-bucket inserts cost ~k²/2 slot scans before the ceiling is even hit (~10⁹ probe steps at k=131K), under one segment lock.

TestDiskParentSpendsMap_SameParentClustering proves correctness under clustering but sizes capacity to the load, so it doesn't exercise the ceiling at realistic geometry. The "cannot occur for a valid block" claim assumes uniformly-distributed inpoints, which same-parent fan-in violates.

Suggested fix: mix the key tail into locate — e.g. bucketHash = key[0:8] ^ key[len-8:] and segHash = key[8:16] ^ bswap(key[len-8:]). For the 36-byte inpoint key, key[28:36] contains the index, spreading same-parent inpoints across all segments and buckets; for the 32-byte txMap key the tail is just more uniform hash bytes, so it's free. Two XORs remove both the ceiling and the quadratic probing. If you'd rather not change the hashing now, the limitation needs a loud note on block_diskMapDirs and the per-segment ceiling should be part of the on-cluster validation in the test plan.

Recommendations

  1. ENOSPC becomes SIGBUS — node crash, not a block error. The backing file is sparse (Truncate, never fallocated); pages allocate at first-write fault, and disk-full at that moment is SIGBUS → unrecoverable process crash mid-validation. Suggest unix.Fallocate(fd, 0, 0, fileBytes) on Linux at New() — fast metadata-only on ext4/xfs, converts disk-full into a clean StorageError at table creation (fall back to sparse where unsupported, e.g. darwin for dev).
  2. Default headroom quietly dropped 3× → 2×. The old code sized at TransactionCount * 3; the new default multiplier is 2, at the same moment the cap changed from soft (Badger grows) to hard (halt). Consider defaulting to 3 until the cluster-scale validation is done.

Minor

  • Upsert never updates an existing key's value — it's insert-if-absent-else-read. GetOrInsert would be the honest name.
  • Table.Close() nils t.data without synchronizing against in-flight Upsert/Lookup; safe under the current per-block create→use→defer Close lifecycle, but worth a one-line doc note since the package looks reusable.
  • FilterCapacity uint truncates on 32-bit platforms, and golang.org/x/sys/unix makes model Linux/darwin-only — both fine for our targets, just noting the portability change.
  • Stats().DiskBytesWritten is now a synthetic entries × slot size, not actual I/O — metric help text was updated accordingly, dashboards just need to know the meaning shifted.

@icellan icellan merged commit d946da6 into bsv-blockchain:main Jun 12, 2026
34 checks passed
icellan added a commit to icellan/teranode that referenced this pull request Jun 12, 2026
…hase

Bump go-tx-map to v1.4.0, which adds the Freeze/Clear lifecycle to the map
interfaces, and use it in Block.Valid.

Block.Valid is two-phase over b.txMap: checkDuplicateTransactions fills it
(one Put per tx), then validOrderAndBlessed only reads it — one Get per tx
plus one per parent, fanned out across every core. Each Get takes a per-bucket
RWMutex.RLock, whose reader-counter atomic add/sub cache-line ping-pongs across
cores and dominates the read phase on many-core validation nodes (the same
contention bsv-blockchain#1078 removed in block assembly's moveForwardBlock).

Block.Valid now calls b.txMap.Freeze() after the duplicate-check phase has
completed and flushed, before validOrderAndBlessed. On the pooled in-memory
SplitSwissMapUint64 this makes every read skip the RLock; releaseTxMap's
PutTxMap -> Clear path un-freezes the map on its way back to the pool.
checkDuplicateTransactions' internal errgroup.Wait is the happens-before edge
guaranteeing all writes precede the Freeze.

DiskTxMapUint64 (the mmap-backed alternative from bsv-blockchain#1053, used when DiskMapDirs
is set) implements the two new TxMap methods: Freeze is a write guard (Put
returns txmap.ErrMapFrozen); Clear only lifts the freeze, since the disk map is
ephemeral and single-use (released via Close, not pooled). Its mmap reads are
not made lock-free — the in-memory split map is the contended read path Freeze
targets.
icellan added a commit that referenced this pull request Jun 12, 2026
GetAndValidateSubtrees derives TransactionCount from the subtree lengths,
so a coinbase-only (empty) block has TransactionCount == 0. The mmap-backed
maps introduced in #1053 reject FilterCapacity == 0, so with
block_diskMapDirs configured checkDuplicateTransactions failed with
'DiskTxMapUint64: FilterCapacity must be > 0' on every empty block,
halting validation of valid blocks. validOrderAndBlessed had the same
problem sizing the parent-spends map (0 * multiplier = 0).

Clamp the derived capacity to a floor of 1 at both call sites, mirroring
the existing 'multiplier 0 is treated as 1' convention. The constructors
keep their strict zero-capacity guard for direct misuse.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants