Replace BadgerDB with off-heap mmap hash table in block-validation maps#1053
Conversation
makeHash now varies bytes [16:17] (the disk-routing window) alongside [0:3], so TestDiskTxMapUint64_MultiDisk and TestDiskParentSpendsMap_MultiDisk with numDisks=3 actually distribute entries across all disks instead of piling everything on disk 0. Both MultiDisk tests now assert nonEmpty>=2 and total==n, so a regression in the routing logic will be caught immediately.
…etrics; drop unused cuckoo dep
There was a problem hiding this comment.
Pull request overview
This PR replaces BadgerDB-backed, per-block “disk maps” used during block validation with a purpose-built off-heap mmap open-addressing hash table to reduce CPU overhead and Go heap/GC pressure while keeping the disk-backed validation path fast.
Changes:
- Introduces
stores/mmaphash(segmented, immediately-unlinked, mmap-backed hash table) plus unit/bench coverage. - Rewrites
DiskTxMapUint64andDiskParentSpendsMapto use onemmaphash.Tableper configured disk path and removes Badger/cuckoo machinery. - Updates
ParentSpendsMap.SetIfNotExiststo return(bool, error)and wires error handling + integration tests/benchmarks + settings/metrics help text updates.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
stores/mmaphash/table.go |
New mmap-backed segmented hash table implementation (core logic + lifecycle). |
stores/mmaphash/table_test.go |
Unit tests for sizing, semantics, collisions, overflow, and concurrency. |
stores/mmaphash/table_bench_test.go |
Micro-benchmark for Upsert performance/allocation. |
model/disk_tx_map_uint64.go |
Replaces Badger+cuckoo disk tx map with mmap-backed tables. |
model/disk_tx_map_uint64_test.go |
Updates tx map tests for new backend and stats semantics. |
model/disk_parent_spends_map.go |
Replaces Badger+cuckoo parent-spends map with mmap-backed tables; returns errors on capacity/storage failure. |
model/disk_parent_spends_map_test.go |
Updates parent-spends tests; adds overflow + parity + fuzz coverage. |
model/map.go |
Changes ParentSpendsMap.SetIfNotExists signature to (bool, error) and updates in-memory implementation. |
model/map_test.go |
Updates in-memory map tests/benchmarks for new signature. |
model/Block.go |
Handles ParentSpendsMap.SetIfNotExists errors during duplicate-input detection. |
model/Block_test.go |
Adds disk-map-dirs integration tests and a disk-vs-memory benchmark harness; adjusts existing tests for new signature. |
settings/block_settings.go |
Updates block_diskMapDirs documentation to describe mmap backend. |
model/metrics.go |
Updates Prometheus metric help strings to reflect new mmap semantics (no in-RAM filter, “logical bytes populated”). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…r-ram # Conflicts: # model/Block_test.go # model/disk_parent_spends_map.go
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-06-12 11:15 UTC |
Capacity sizing —
|
freemans13
left a comment
There was a problem hiding this comment.
Approved. Clean, well-scoped, well-tested replacement of BadgerDB with the off-heap mmap hash table — ~6× faster and ~40× lower allocation, with sound per-segment locking and fail-safe overflow semantics.
Two follow-ups worth addressing (non-blocking given the opt-in flag):
- txMap close leak: in Block.Valid the disk txMap is closed sequentially after validOrderAndBlessed rather than via defer, so an error from the flush or validOrderAndBlessed skips the close and leaks the mmap RSS. The parentSpendsMap is already defer-closed — txMap should match.
- Build constraint: stores/mmaphash/table.go imports golang.org/x/sys/unix with no //go:build tag, while txmetacache uses an OS-split. Add a build tag or confirm Windows is unsupported.
Residual risk to document: locate() derives both segment and start bucket from the tx-hash bytes only (not the output index), so all inpoints spending the same parent tx cluster into one segment — a liveness/halt risk on valid blocks with very-high-output-count parents, beyond the uniform-load overflow modeled in the summary.
Please confirm the test-plan checkboxes (especially -race on ./model/ and the multi-disk integration tests) are green before merge.
… comments - TestTableScale: skip unless RUN_MMAPHASH_SCALE=true (the default make test runs with -race and no -short, so 50M race-tracked ops blew the 10m package timeout and starved neighbouring packages) - New: reject Expected > 2^44 (avoids nextPow2 non-termination / int64 overflow) - Close: clear t.data only after a successful munmap - Upsert/Lookup: reject wrong key length instead of panicking/silent-corrupting - DiskTxMapUint64.Get: surface a Lookup error as a miss rather than ignoring it - fix stale locate() disk-routing comment
|
🤖 Claude Code Review Status: Complete Reviewed the BadgerDB → off-heap Consensus safety — looks sound. Identical keys always route to the same disk ( Current Review:
Well-tested: concurrency exactly-once, segment-full overflow, same-parent clustering parity, and disk-vs-memory parity. No blocking issues found; advisory only. |
os.CreateTemp does not create parent directories. The Badger-backed implementation this replaced did the equivalent os.MkdirAll, so pointing block_diskMapDirs at a not-yet-existing path now made every block fail validation (checkDuplicateTransactions/validOrderAndBlessed could not create the disk maps). Restore MkdirAll(opts.Dir) before os.CreateTemp.
Addresses Copilot review comments: - computeLayout: fall back to default for loadFactor outside (0,1], incl NaN/Inf (previously only <=0 was guarded; >1 undersized the table) - NewDiskTxMapUint64 / NewDiskParentSpendsMap: reject FilterCapacity==0 up-front (fixed-capacity table; zero would build a minimal table that overflows at once)
…t 2x) Addresses review (ordishs, PR bsv-blockchain#1053): the parentSpendsMap is now a fixed capacity under block_diskMapDirs, so the previously-hardcoded TransactionCount*3 sizing is a hard cap. Replace it with block_parentSpendsCapacityMultiplier (default 2) so operators can add headroom for consolidation-heavy blocks. A 0 value is treated as 1. Also add a same-parent-clustering test: inpoints sharing one parent hash but differing in output index land in the same segment+bucket (index is outside the segment/bucket windows), and must insert and dedupe correctly under that probe-chain clustering.
|
@ordishs thanks — you're right, and I verified the clustering point against the code: an inpoint key is Addressed in this PR:
Tracked as a follow-up (not in this PR): #1080 — grow the parentSpendsMap on One honest caveat on the multiplier default: 2 is slightly below the previous hardcoded 3, so in the interim (before #1080) it trims the headroom a little — the rationale is that it's now tunable and the real robustness fix is #1080 rather than padding the multiplier for everyone. |
|
ordishs
left a comment
There was a problem hiding this comment.
Approving — this is a well-executed replacement. The fail-safe error taxonomy is correct end-to-end (ErrTableFull → halt vs ErrTxExists/inserted=false → block invalid, discriminated correctly in both checkDuplicateTransactionsInSubtree and checkDuplicateInputs), the SetIfNotExists widening fixes a latent correctness bug in the old design, the old triple-locked slow path is replaced by something actually verifiable, and the test coverage (parity, fuzz, overflow, clustering, multi-disk distribution, block-level integration) is unusually strong. Verified locally: go test -race -short ./stores/mmaphash/ and the disk-map/block integration tests in ./model/ all pass.
One substantive request before block_diskMapDirs goes anywhere near a default, plus recommendations:
Request: same-parent fan-in has a hard per-segment ceiling (liveness risk)
All inpoints spending outputs of one parent tx share the segment window (key[8:16]), the bucket window (key[0:8]), and the disk-routing window (key[16:18]) — only the 4-byte index at key[32:36] differs, and it participates in nothing. So they land on one disk, in one segment, in one probe chain:
- Hard ceiling: a valid block spending more than
slotsPerSegoutputs of a single parent overflows that segment regardless of how empty the rest of the table is. At 410M-tx geometry that's ~131K (4 disks) to ~524K (1 disk) spends of one parent. Nothing in consensus prevents a parent with 500K outputs being swept in one block — such a block would permanently fail validation on disk-backed nodes (halt, retry, halt again) while in-memory nodes accept it. That's a chain-stall vector for the configuration, and raising the multiplier only buys linear headroom. - O(k²) probing: k same-bucket inserts cost ~k²/2 slot scans before the ceiling is even hit (~10⁹ probe steps at k=131K), under one segment lock.
TestDiskParentSpendsMap_SameParentClustering proves correctness under clustering but sizes capacity to the load, so it doesn't exercise the ceiling at realistic geometry. The "cannot occur for a valid block" claim assumes uniformly-distributed inpoints, which same-parent fan-in violates.
Suggested fix: mix the key tail into locate — e.g. bucketHash = key[0:8] ^ key[len-8:] and segHash = key[8:16] ^ bswap(key[len-8:]). For the 36-byte inpoint key, key[28:36] contains the index, spreading same-parent inpoints across all segments and buckets; for the 32-byte txMap key the tail is just more uniform hash bytes, so it's free. Two XORs remove both the ceiling and the quadratic probing. If you'd rather not change the hashing now, the limitation needs a loud note on block_diskMapDirs and the per-segment ceiling should be part of the on-cluster validation in the test plan.
Recommendations
- ENOSPC becomes SIGBUS — node crash, not a block error. The backing file is sparse (
Truncate, never fallocated); pages allocate at first-write fault, and disk-full at that moment is SIGBUS → unrecoverable process crash mid-validation. Suggestunix.Fallocate(fd, 0, 0, fileBytes)on Linux atNew()— fast metadata-only on ext4/xfs, converts disk-full into a cleanStorageErrorat table creation (fall back to sparse where unsupported, e.g. darwin for dev). - Default headroom quietly dropped 3× → 2×. The old code sized at
TransactionCount * 3; the new default multiplier is 2, at the same moment the cap changed from soft (Badger grows) to hard (halt). Consider defaulting to 3 until the cluster-scale validation is done.
Minor
Upsertnever updates an existing key's value — it's insert-if-absent-else-read.GetOrInsertwould be the honest name.Table.Close()nilst.datawithout synchronizing against in-flightUpsert/Lookup; safe under the current per-block create→use→defer Closelifecycle, but worth a one-line doc note since the package looks reusable.FilterCapacity uinttruncates on 32-bit platforms, andgolang.org/x/sys/unixmakesmodelLinux/darwin-only — both fine for our targets, just noting the portability change.Stats().DiskBytesWrittenis now a syntheticentries × slot size, not actual I/O — metric help text was updated accordingly, dashboards just need to know the meaning shifted.
…hase Bump go-tx-map to v1.4.0, which adds the Freeze/Clear lifecycle to the map interfaces, and use it in Block.Valid. Block.Valid is two-phase over b.txMap: checkDuplicateTransactions fills it (one Put per tx), then validOrderAndBlessed only reads it — one Get per tx plus one per parent, fanned out across every core. Each Get takes a per-bucket RWMutex.RLock, whose reader-counter atomic add/sub cache-line ping-pongs across cores and dominates the read phase on many-core validation nodes (the same contention bsv-blockchain#1078 removed in block assembly's moveForwardBlock). Block.Valid now calls b.txMap.Freeze() after the duplicate-check phase has completed and flushed, before validOrderAndBlessed. On the pooled in-memory SplitSwissMapUint64 this makes every read skip the RLock; releaseTxMap's PutTxMap -> Clear path un-freezes the map on its way back to the pool. checkDuplicateTransactions' internal errgroup.Wait is the happens-before edge guaranteeing all writes precede the Freeze. DiskTxMapUint64 (the mmap-backed alternative from bsv-blockchain#1053, used when DiskMapDirs is set) implements the two new TxMap methods: Freeze is a write guard (Put returns txmap.ErrMapFrozen); Clear only lifts the freeze, since the disk map is ephemeral and single-use (released via Close, not pooled). Its mmap reads are not made lock-free — the in-memory split map is the contended read path Freeze targets.
GetAndValidateSubtrees derives TransactionCount from the subtree lengths, so a coinbase-only (empty) block has TransactionCount == 0. The mmap-backed maps introduced in #1053 reject FilterCapacity == 0, so with block_diskMapDirs configured checkDuplicateTransactions failed with 'DiskTxMapUint64: FilterCapacity must be > 0' on every empty block, halting validation of valid blocks. validOrderAndBlessed had the same problem sizing the parent-spends map (0 * multiplier = 0). Clamp the derived capacity to a floor of 1 at both call sites, mirroring the existing 'multiplier 0 is treated as 1' convention. The constructors keep their strict zero-capacity guard for direct misuse.



Summary
Block validation's two disk-backed maps —
DiskParentSpendsMap(duplicate-input detection) andDiskTxMapUint64(duplicate-tx detection + parent ordering) — were backed by BadgerDB. Profiling a 410M-tx block on the scaling cluster showed the disk path's per-block cost was dominated by Badger machinery (memtable arenas, store open/close, write batching), not the actual workload, and Badger's durability/compaction apparatus is dead weight for what is a per-block ephemeral scratch structure.This replaces Badger with a purpose-built off-heap structure:
stores/mmaphashpackage — a fixed-slot, open-addressing (linear-probe) hash table over one sparse, immediately-unlinkedmmapfile, split into independently-locked segments (probing wraps within a segment, so a probe never crosses a lock boundary). Contents live in the OS page cache (off the Go heap, not GC-scanned) and spill to NVMe under pressure; cleanup is anunlink.mmaphash.Table. ~700 lines of Badger/cuckoo/writer-goroutine machinery removed.ParentSpendsMap.SetIfNotExistswidened to(bool, error)so a table-capacity overflow halts validation with a processing error instead of being silently misreported as a duplicate input. The txMap'sPutalready returns an error, so overflow there surfaces as a non-ErrTxExistserror that halts (a real duplicate still returnsErrTxExists→ block invalid).Public interfaces and call sites are otherwise unchanged; the in-memory maps and the no-
block_diskMapDirsdefault path are untouched.Results (BenchmarkBlock_ValidOrderAndBlessed_DiskVsMemory, realistic SHA-256 keys)
~6× faster and ~40× lower allocation than Badger, and the off-heap disk path now essentially matches in-memory speed while keeping map contents off the Go heap.
Test Plan
go test -race -tags testtxmetacache ./model/ ./stores/mmaphash/— unit, parity (disk vs in-memory), fuzz, concurrency exactly-once, collision/overflowgo test -tags testtxmetacache -run TestTableScale ./stores/mmaphash/— 50M-entry scale test (skipped under-short)go vet/staticcheckclean on both packagesTestBlock_Valid_DupTxDetected_DiskMapDirs,TestBlock_ValidOrderAndBlessed_DiskMapDirs(in-memory / single-disk / multi-disk)block_diskMapDirsto a defaultNotes / residual risk
The table is fixed-capacity (sized from
TransactionCount); overflow fails safe (halts) and, with uniformly-distributed tx hashes at load factor 0.5, cannot occur for a valid block. The big win holds when the table is mostly page-cache-resident (the scaling pods have the RAM); the RAM-overflow regime should be validated on-cluster.