perf(txmeta): v2 wire format + partition-aware receiver + pooled buffers by icellan · Pull Request #912 · bsv-blockchain/teranode

icellan · 2026-05-20T12:36:44Z

Summary

End-to-end optimisation of the txmeta path between the validator (producer) and the subtree-validation receiver. Targeting 10M tx/s/pod for the txmeta consume path.

Bench delivers 11.1M tx/s through the full producer→receiver pipeline at batch=1000 with 32 partitions on M3 Max against a real production-config Native cache, up from 749K-1.14M on the shard-fanout baseline.

What's in this PR

Wire format v2 + partition-aware producer

New v2 Kafka payload (magic 0xFF) carrying the producer-computed xxhash per entry so the receiver skips its own xxhash on the hot path.
Validator routes each tx to partition = (xxhash(hash) % BucketsCount) / (BucketsCount / NumPartitions). Each Kafka partition now maps to a disjoint contiguous range of receiver cache buckets — receivers writing concurrently across partitions take no cross-partition cache-bucket locks.
Receiver auto-detects via the magic byte; v1 messages continue to work.
Producer settings: validator_txmeta_wireFormat (v1|v2, default v1) and validator_txmeta_numPartitions (default 32, must divide BucketsCount).

Receiver rewrite

Drops the 256-way shard fan-out + worker pool + caught-up latch in txmetaHandler.go.
Parses v1 and v2 inline; dispatches all ADDs in one SetCacheMultiSequentialWithHashes call per Kafka message; DELETEs inline.
Parallelism comes from the per-partition Kafka consumer goroutines (one per partition per fetch from util/kafka/kafka_consumer.go), not from internal fan-out.

Cache: new sequential paths + inner-shard collapse

SetMultiSequential / SetMultiSequentialWithHashes on ImprovedCache — partition-aware twins of SetMulti. Same per-bucket grouping but no errgroup fan-out (the caller already has parallelism via the Kafka per-partition consumer goroutines).
bucketNative and bucketTrimmed per-bucket maps were previously constructed with NewNativeSplitLockFreeMapUint64(cap, BucketsCount) → 8192 outer × 8192 inner sub-maps ≈ 67M sub-maps per cache instance. cleanLockedMap rebuilt that on every chunk wrap. Inner shard count is now 1. deleteFromNativeSplitMapShard / deleteFromSwissSplitMapShard updated in lockstep.
Eviction behaviour unchanged: the cache's gen+idx-based rule operates on map values regardless of inner shard count.
bucketUnallocated and bucketPreallocated use plain Go maps without inner sharding — unaffected. Production today runs bucketUnallocated, so this win lands only when the bucket type is switched (see next).

TxMetaCacheBucketType setting

New subtreevalidation_txMetaCacheBucketType setting (unallocated|preallocated|trimmed|native, default unallocated = current prod behaviour).
Unknown values fall back to unallocated with a Warn log so a typo never silently changes the deployed allocator.
subtreevalidation_txMetaCacheTrimRatio also exposed as a typed setting (currently only affects Preallocated).

Pooled buffers

txmetaParseBuffers (receiver) pools keys/values/hashes/deletes slice headers plus single-arena keysBuf and contentBuf — per-entry hash/content allocations collapse to one append into a pre-sized arena.
txmetaPartitionsScratch (validator) pools the partitions outer slice + per-partition inner slices for sendTxMetaBatchV2.
Byte buffers published to franz-go are NOT pooled: no Publish callback hook yet for safe return.

Kafka producer manual partitioning

New Message.Partition field + KafkaProducerConfig.ManualPartitioning + WithManualPartitioning option. When enabled, registers kgo.ManualPartitioner and flushBuffered honours record.Partition. Off by default; existing StickyKeyPartitioner behaviour preserved.
daemon/daemon_kafka.go wires it for the txmeta producer when wireFormat=v2.

In-memory broker test surface

Adds InMemoryBroker.HasConsumer and DropTopic so e2e tests can wait for consumer-group registration and release retained-messages state at teardown. TruncateTopic also added for long bench runs.

Tests

Unit coverage for v2 serialization byte layout, partition routing rule, v2-format receiver parse, mixed ADD/DELETE handling, bucket-type resolver, end-to-end pipeline via the in-memory broker (txmeta_e2e_test.go).

Bench numbers

M3 Max, 32 partitions, 256-byte payload, real production-config Native cache (sync.Once-shared across bench invocations):

Benchmark	tx/s	allocs/op	bytes/op
`BenchmarkTxmetaProducerReceiver_V2_Batch100`	3.96M	33	2.1 KB
`BenchmarkTxmetaProducerReceiver_V2_Batch1000`	11.10M	70	4.4 KB

For context, baseline at start of optimisation work was 749K (batch=100) / 1.14M (batch=1000) tx/s with ~2400 allocs/op.

The bench is now scheduler-bound (~60% pthread_cond_wait, ~20% pthread_cond_signal). Cache writes, parse, and serialize allocations no longer appear in the top consumers.

Production rollout

Default behaviour is unchanged. Existing deploys keep bucketUnallocated, wireFormat=v1, no manual partitioning. New settings are opt-in.

To unlock the wins in prod (recommended sequence, one knob per canary):

Receiver supports v2 already (auto-detect). Roll the new build to subtree-validation pods first.
Switch validator emit to v2: set validator_txmeta_wireFormat=v2 and validator_txmeta_numPartitions to match the actual txmeta topic partition count (must divide BucketsCount=8192). Confirm receiver lag stays flat.
Switch cache bucket type: set subtreevalidation_txMetaCacheBucketType=native on one canary subtree-validation pod with production TxMetaCacheMaxMB. Watch RSS, cleanLockedMap allocation rate (heap pprof), subtreevalidation_set_tx_meta_cache_kafka histogram. Roll out once a chunk-rotation cycle is complete with healthy memory.

Settings honoured as today: TxMetaCacheMaxMB, txMetaCacheTrimRatio (still only affects Preallocated), TxMetaCacheEnabled.

Test plan

go build ./... clean
go vet clean across touched packages
Unit tests pass: handler v1/v2 parse, mixed actions, bucket type resolver, validator serializer, partition routing
E2E test pass (in-memory broker + real Native cache, both wire formats)
Cache unit tests pass under testtxmetacache build tag
Canary one subtree-validation pod with subtreevalidation_txMetaCacheBucketType=native at production cache size
Canary one validator pod with validator_txmeta_wireFormat=v2
Verify subtreevalidation_set_tx_meta_cache_kafka histogram + RSS in the canary

Known follow-ups (separate PR)

bucketUnallocated.cleanLockedMap still allocates a fresh map[uint64]uint64 on every chunk wrap (held 17 GB in-use on scaling-2). Worth a sync.Pool+clear(m) pass for the prod default bucket type.
Pool the v2 byte buffer published to franz-go via a producer callback hook (current code allocates per partition per batch).
Per-fetch goroutine spawn in kafka_consumer.go's EachPartition could be replaced with persistent partition workers fed by channels. Bench is scheduler-bound; production isn't (fetch rate is ~30/sec vs bench ~13k iter/sec), so this is a "bank headroom" change.

End-to-end optimisation of the txmeta path between the validator (producer) and the subtree-validation receiver. Bench delivers 11.1M tx/s through the full producer→receiver pipeline at batch=1000 with 32 partitions on M3 Max against a real production-config Native cache, up from 749K-1.14M on the shard-fanout baseline. What's in this change: 1. Wire format v2 (services/validator/Validator.go, services/subtreevalidation/txmetaHandler.go). Backwards-compatible alongside v1; receiver auto-detects via magic byte 0xFF. [1 byte] magic = 0xFF [1 byte] version = 0x02 [2 bytes] reserved [4 bytes] entry count per entry: [8 bytes] xxhash(tx hash) [32 bytes] tx hash [1 byte] action (0=ADD, 1=DELETE) [4 bytes] content length [N bytes] content Producer groups items by partition using bucketIdx = xxhash(hash) % BucketsCount bucketsPerPartition = BucketsCount / NumPartitions partition = bucketIdx / bucketsPerPartition so each Kafka partition's records map to a disjoint contiguous range of receiver cache buckets. Concurrent per-partition receiver writers touch disjoint buckets and run lock-free relative to each other. Producer setting validator_txmeta_wireFormat (v1|v2, default v1) and validator_txmeta_numPartitions (default 32, must divide BucketsCount). 2. Receiver rewrite (txmetaHandler.go). Drops the 256-way shard fan-out + worker pool + caught-up latch. Parses v1 and v2 inline, dispatches all ADDs in one SetCacheMultiSequentialWithHashes call (or SetCacheMultiSequential on v1) per Kafka message, DELETEs inline. Parallelism comes from the per-partition Kafka consumer goroutines (one per partition per fetch from util/kafka/ kafka_consumer.go), not from internal fan-out. 3. Cache: SetMultiSequential family (stores/txmetacache/ improved_cache.go + txmetacache.go). Partition-aware twin of SetMulti — same per-bucket grouping but no errgroup fan-out. The With-Hashes variant takes caller-supplied xxhash values so the receiver passes the on-wire v2 hash straight through. 4. Cache: inner-shard collapse (improved_cache.go). bucketNative and bucketTrimmed maps were constructed with NewNativeSplitLockFreeMapUint64(cap, BucketsCount), meaning every bucket map had 8192 inner sub-shards. 8192 outer * 8192 inner ~ 67M sub-maps per cache instance. cleanLockedMap rebuilt the inner-sharded map on every chunk wrap; that pattern dominated CPU (24% in NewNativeSplitLockFreeMapUint64) and heap (34.5 GB / 64% of bench allocations; 80% of in-use memory on the scaling-2 prod pod). Outer ImprovedCache already shards across BucketsCount outer buckets via h%BucketsCount, and b.mu serialises writes to one bucket. Inner shard count is now 1. deleteFromNativeSplitMapShard and deleteFromSwissSplitMapShard updated in lockstep — they previously routed via h%BucketsCount, which no longer matches the map's internal hash%nrOfBuckets dispatch with nrOfBuckets=1. Eviction behaviour unchanged: the cache's gen+idx-based rule operates on map values regardless of inner shard count; IterAll visits every entry the same way. 5. TxMetaCacheBucketType setting (settings/ subtreevalidation_settings.go, services/subtreevalidation/ Server.go). Bucket type was hard-coded to Unallocated; now selectable via subtreevalidation_txMetaCacheBucketType (unallocated|preallocated|trimmed|native, default unallocated). Unknown values fall back to Unallocated with a Warn log so a typo never silently changes the allocator. Production default is unchanged; canary one pod on "native" with production TxMetaCacheMaxMB before rolling out. Also exposes subtreevalidation_txMetaCacheTrimRatio as a typed setting (currently only affects Preallocated; gocore key was already read inside the cache constructor). 6. Pooled producer + receiver buffers (validator.go, txmetaHandler.go). txmetaParseBuffers (receiver) pools keys/values/hashes/deletes slice headers plus single-arena keysBuf and contentBuf so per-entry hash/content allocations collapse to one append into a pre-sized buffer. txmetaPartitionsScratch (validator) pools the partitions outer slice + per-partition inner slices for sendTxMetaBatchV2. The byte buffer published to franz-go is NOT pooled: no Publish callback hook yet for safe return. 7. Kafka producer manual partitioning (util/kafka/ kafka_producer_async.go). New Message.Partition field + KafkaProducerConfig.ManualPartitioning. When true, registers kgo.ManualPartitioner and flushBuffered honours record.Partition. Off by default — existing StickyKeyPartitioner behaviour preserved. New WithManualPartitioning ProducerOption; daemon/daemon_kafka.go passes it for the txmeta producer when wireFormat=v2. 8. In-memory broker test surface (util/kafka/in_memory_kafka/ in_memory_kafka.go). Adds InMemoryBroker.HasConsumer and DropTopic so e2e tests can wait for consumer-group registration (the broker drops messages produced to a topic with no consumers) and release retained-messages state at teardown. TruncateTopic also added for long bench runs that would otherwise pin gigabytes in the broker's per-topic message slice. 9. Tests (Validator_test.go, txmetaHandler_test.go, Server_coverage_test.go, txmeta_e2e_test.go). Unit coverage for v2 serialization (byte layout), partition routing rule, v2-format receiver parse, mixed ADD/DELETE handling, bucket type resolver, end-to-end pipeline via the in-memory broker. Bench numbers (M3 Max, 32 partitions, 256-byte payload, real production-config Native cache, sync.Once-shared across bench invocations): BenchmarkTxmetaProducerReceiver_V2_Batch100 3.96M tx/s 33 allocs/op BenchmarkTxmetaProducerReceiver_V2_Batch1000 11.10M tx/s 70 allocs/op Bench is now scheduler-bound (~60% pthread_cond_wait, ~20% pthread_cond_signal). Cache writes, parse, and serialize allocations no longer appear in the top consumers. Operational notes for production rollout: - Default behaviour unchanged. Existing deploys keep bucketUnallocated, wireFormat=v1, no manual partitioning. New setting opt-in. - To benefit from #1+#2 in prod: set validator_txmeta_wireFormat=v2 and validator_txmeta_numPartitions to match the actual txmeta topic partition count. - To benefit from #4 in prod: set subtreevalidation_txMetaCacheBucketType=native. Canary first; the deployed pod's bucketUnallocated path is unaffected. - TxMetaCacheMaxMB=73728, txMetaCacheTrimRatio=5 settings are honoured as today.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

github-actions · 2026-05-20T12:37:49Z

🤖 Claude Code Review

Status: Complete

This is a well-engineered performance optimization PR targeting 10M+ tx/s throughput for the txmeta cache pipeline. The changes are substantial but carefully structured with strong test coverage and backward compatibility.

Architecture Changes:

Introduces v2 wire format with producer-computed xxhash values to eliminate receiver-side rehashing
Removes 256-way shard fan-out worker pool in favor of Kafka's native per-partition parallelism
Collapses cache inner-shard count from 8192→1 per bucket (eliminates ~67M sub-maps)
Adds partition-aware routing so each Kafka partition maps to disjoint cache bucket ranges

Code Quality:

Comprehensive test coverage including unit tests, partition routing tests, and end-to-end tests
Proper handling of the testtxmetacache build tag edge cases (BucketsCount=8 vs 8192)
Backward compatible: v1 format still supported, new features are opt-in via settings
Extensive documentation in commit messages and code comments
Proper use of sync.Pool for buffer reuse

Safety & Rollout:

Default behavior unchanged (v1 format, unallocated bucket type)
Settings validation with fallback for unknown bucket types
Canary rollout plan documented in PR description
No breaking changes to existing deployments

Performance Results:

Benchmark shows 11.1M tx/s at batch=1000 (up from 749K-1.14M baseline)
Allocations reduced from ~2400 to 70 allocs/op
Memory efficiency improved significantly

No critical issues found. The implementation follows project conventions in CLAUDE.md and includes proper error handling, input validation, and defensive programming practices.

github-actions · 2026-05-20T12:52:25Z

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-912 (3a0be2c)

Summary

Regressions: 0
Improvements: 0
Unchanged: 144
Significance level: p < 0.05

All benchmark results (sec/op)

Benchmark	Baseline	Current	Change	p-value
_NewBlockFromBytes-4	1.835µ	1.831µ	~	0.700
SplitSyncedParentMap_SetIfNotExists/256_buckets-4	59.44n	59.33n	~	1.000
SplitSyncedParentMap_SetIfNotExists/16_buckets-4	59.33n	59.14n	~	0.200
SplitSyncedParentMap_SetIfNotExists/1_bucket-4	59.40n	59.18n	~	0.100
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets...	40.40n	37.02n	~	0.100
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_...	69.83n	64.76n	~	0.200
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa...	190.5n	177.0n	~	0.100
MiningCandidate_Stringify_Short-4	261.5n	262.2n	~	1.000
MiningCandidate_Stringify_Long-4	1.852µ	1.791µ	~	0.100
MiningSolution_Stringify-4	936.3n	921.0n	~	0.400
BlockInfo_MarshalJSON-4	1.820µ	1.776µ	~	0.200
NewFromBytes-4	141.9n	126.4n	~	0.100
AddTxBatchColumnar_Validation-4	2.472µ	2.481µ	~	0.100
OffsetValidationLoop-4	635.1n	637.7n	~	1.000
Mine_EasyDifficulty-4	60.49µ	60.73µ	~	0.400
Mine_WithAddress-4	7.423µ	6.842µ	~	0.400
BlockAssembler_AddTx-4	0.02640n	0.02611n	~	1.000
AddNode-4	10.59	10.59	~	1.000
AddNodeWithMap-4	11.80	10.39	~	0.200
DirectSubtreeAdd/4_per_subtree-4	57.96n	58.92n	~	0.400
DirectSubtreeAdd/64_per_subtree-4	29.29n	29.35n	~	1.000
DirectSubtreeAdd/256_per_subtree-4	28.36n	28.08n	~	0.300
DirectSubtreeAdd/1024_per_subtree-4	26.62n	26.80n	~	0.200
DirectSubtreeAdd/2048_per_subtree-4	26.22n	26.32n	~	0.100
SubtreeProcessorAdd/4_per_subtree-4	295.1n	296.8n	~	0.700
SubtreeProcessorAdd/64_per_subtree-4	291.4n	289.4n	~	1.000
SubtreeProcessorAdd/256_per_subtree-4	293.7n	289.9n	~	0.200
SubtreeProcessorAdd/1024_per_subtree-4	288.3n	277.7n	~	0.200
SubtreeProcessorAdd/2048_per_subtree-4	284.0n	277.9n	~	0.100
SubtreeProcessorRotate/4_per_subtree-4	283.3n	284.5n	~	1.000
SubtreeProcessorRotate/64_per_subtree-4	277.8n	281.6n	~	0.700
SubtreeProcessorRotate/256_per_subtree-4	286.7n	281.7n	~	0.100
SubtreeProcessorRotate/1024_per_subtree-4	287.1n	280.5n	~	0.100
SubtreeNodeAddOnly/4_per_subtree-4	55.37n	55.60n	~	0.400
SubtreeNodeAddOnly/64_per_subtree-4	36.26n	36.27n	~	1.000
SubtreeNodeAddOnly/256_per_subtree-4	35.35n	35.20n	~	0.100
SubtreeNodeAddOnly/1024_per_subtree-4	34.85n	34.65n	~	0.400
SubtreeCreationOnly/4_per_subtree-4	112.4n	113.5n	~	0.700
SubtreeCreationOnly/64_per_subtree-4	360.2n	362.8n	~	0.700
SubtreeCreationOnly/256_per_subtree-4	1.277µ	1.272µ	~	0.500
SubtreeCreationOnly/1024_per_subtree-4	3.947µ	3.875µ	~	0.200
SubtreeCreationOnly/2048_per_subtree-4	7.208µ	7.162µ	~	0.400
SubtreeProcessorOverheadBreakdown/64_per_subtree-4	288.2n	283.6n	~	0.100
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4	288.5n	282.0n	~	0.100
ParallelGetAndSetIfNotExists/1k_nodes-4	2.023m	2.031m	~	0.700
ParallelGetAndSetIfNotExists/10k_nodes-4	5.271m	5.260m	~	1.000
ParallelGetAndSetIfNotExists/50k_nodes-4	7.662m	7.233m	~	0.100
ParallelGetAndSetIfNotExists/100k_nodes-4	10.21m	10.01m	~	0.200
SequentialGetAndSetIfNotExists/1k_nodes-4	1.781m	1.799m	~	0.100
SequentialGetAndSetIfNotExists/10k_nodes-4	4.560m	4.477m	~	0.700
SequentialGetAndSetIfNotExists/50k_nodes-4	13.89m	13.76m	~	0.700
SequentialGetAndSetIfNotExists/100k_nodes-4	25.73m	25.70m	~	0.700
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4	2.063m	2.063m	~	1.000
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4	8.489m	8.454m	~	0.700
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4	13.76m	13.43m	~	0.100
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4	1.807m	1.799m	~	0.700
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4	9.229m	8.297m	~	0.100
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4	47.77m	44.67m	~	0.100
DiskTxMap_SetIfNotExists-4	3.671µ	3.620µ	~	1.000
DiskTxMap_SetIfNotExists_Parallel-4	3.263µ	3.420µ	~	0.100
DiskTxMap_ExistenceOnly-4	311.8n	307.0n	~	0.700
Queue-4	183.5n	184.4n	~	0.200
AtomicPointer-4	3.655n	3.246n	~	0.100
ReorgOptimizations/DedupFilterPipeline/Old/10K-4	800.1µ	813.2µ	~	0.100
ReorgOptimizations/DedupFilterPipeline/New/10K-4	765.3µ	766.4µ	~	1.000
ReorgOptimizations/AllMarkFalse/Old/10K-4	114.2µ	102.1µ	~	0.100
ReorgOptimizations/AllMarkFalse/New/10K-4	64.51µ	64.31µ	~	1.000
ReorgOptimizations/HashSlicePool/Old/10K-4	51.19µ	51.27µ	~	1.000
ReorgOptimizations/HashSlicePool/New/10K-4	11.17µ	11.10µ	~	0.200
ReorgOptimizations/NodeFlags/Old/10K-4	4.320µ	4.379µ	~	0.100
ReorgOptimizations/NodeFlags/New/10K-4	1.452µ	1.506µ	~	0.100
ReorgOptimizations/DedupFilterPipeline/Old/100K-4	9.162m	9.194m	~	0.700
ReorgOptimizations/DedupFilterPipeline/New/100K-4	9.537m	9.684m	~	0.400
ReorgOptimizations/AllMarkFalse/Old/100K-4	1.061m	1.067m	~	0.700
ReorgOptimizations/AllMarkFalse/New/100K-4	708.7µ	705.9µ	~	0.400
ReorgOptimizations/HashSlicePool/Old/100K-4	516.8µ	489.6µ	~	0.100
ReorgOptimizations/HashSlicePool/New/100K-4	201.9µ	205.2µ	~	0.700
ReorgOptimizations/NodeFlags/Old/100K-4	45.60µ	46.87µ	~	0.100
ReorgOptimizations/NodeFlags/New/100K-4	16.32µ	16.32µ	~	1.000
TxMapSetIfNotExists-4	49.43n	49.04n	~	0.400
TxMapSetIfNotExistsDuplicate-4	41.57n	41.47n	~	0.600
ChannelSendReceive-4	606.4n	596.1n	~	0.700
CalcBlockWork-4	502.2n	500.8n	~	1.000
CalculateWork-4	739.9n	730.8n	~	1.000
BuildBlockLocatorString_Helpers/Size_10-4	1.312µ	1.425µ	~	0.100
BuildBlockLocatorString_Helpers/Size_100-4	12.51µ	12.33µ	~	0.100
BuildBlockLocatorString_Helpers/Size_1000-4	157.0µ	124.9µ	~	0.100
CatchupWithHeaderCache-4	103.9m	103.9m	~	0.400
_prepareTxsPerLevel-4	413.5m	416.3m	~	1.000
_prepareTxsPerLevelOrdered-4	3.723m	4.144m	~	0.200
_prepareTxsPerLevel_Comparison/Original-4	421.5m	425.4m	~	1.000
_prepareTxsPerLevel_Comparison/Optimized-4	3.807m	4.258m	~	0.100
SubtreeSizes/10k_tx_4_per_subtree-4	1.380m	1.352m	~	1.000
SubtreeSizes/10k_tx_16_per_subtree-4	321.4µ	325.1µ	~	0.200
SubtreeSizes/10k_tx_64_per_subtree-4	76.82µ	78.46µ	~	0.100
SubtreeSizes/10k_tx_256_per_subtree-4	19.41µ	19.37µ	~	1.000
SubtreeSizes/10k_tx_512_per_subtree-4	9.547µ	9.659µ	~	0.200
SubtreeSizes/10k_tx_1024_per_subtree-4	4.746µ	4.806µ	~	0.400
SubtreeSizes/10k_tx_2k_per_subtree-4	2.378µ	2.364µ	~	1.000
BlockSizeScaling/10k_tx_64_per_subtree-4	75.38µ	76.49µ	~	0.200
BlockSizeScaling/10k_tx_256_per_subtree-4	19.09µ	19.08µ	~	1.000
BlockSizeScaling/10k_tx_1024_per_subtree-4	4.764µ	4.776µ	~	1.000
BlockSizeScaling/50k_tx_64_per_subtree-4	399.5µ	401.4µ	~	1.000
BlockSizeScaling/50k_tx_256_per_subtree-4	95.44µ	96.30µ	~	0.700
BlockSizeScaling/50k_tx_1024_per_subtree-4	23.53µ	23.84µ	~	0.400
SubtreeAllocations/small_subtrees_exists_check-4	159.1µ	162.6µ	~	0.100
SubtreeAllocations/small_subtrees_data_fetch-4	166.2µ	167.9µ	~	0.400
SubtreeAllocations/small_subtrees_full_validation-4	324.7µ	330.9µ	~	0.100
SubtreeAllocations/medium_subtrees_exists_check-4	9.304µ	9.564µ	~	0.100
SubtreeAllocations/medium_subtrees_data_fetch-4	9.527µ	9.942µ	~	0.100
SubtreeAllocations/medium_subtrees_full_validation-4	19.06µ	19.32µ	~	0.100
SubtreeAllocations/large_subtrees_exists_check-4	2.229µ	2.258µ	~	0.100
SubtreeAllocations/large_subtrees_data_fetch-4	2.325µ	2.407µ	~	0.100
SubtreeAllocations/large_subtrees_full_validation-4	4.769µ	4.864µ	~	0.100
_BufferPoolAllocation/16KB-4	4.648µ	3.915µ	~	0.700
_BufferPoolAllocation/32KB-4	8.321µ	7.566µ	~	0.400
_BufferPoolAllocation/64KB-4	16.72µ	15.94µ	~	0.700
_BufferPoolAllocation/128KB-4	34.18µ	24.40µ	~	0.100
_BufferPoolAllocation/512KB-4	109.5µ	106.1µ	~	1.000
_BufferPoolConcurrent/32KB-4	17.85µ	18.35µ	~	0.700
_BufferPoolConcurrent/64KB-4	29.32µ	28.52µ	~	0.400
_BufferPoolConcurrent/512KB-4	139.8µ	146.3µ	~	0.100
_SubtreeDeserializationWithBufferSizes/16KB-4	609.5µ	639.0µ	~	0.100
_SubtreeDeserializationWithBufferSizes/32KB-4	610.1µ	637.9µ	~	0.100
_SubtreeDeserializationWithBufferSizes/64KB-4	598.0µ	614.6µ	~	0.100
_SubtreeDeserializationWithBufferSizes/128KB-4	592.9µ	599.2µ	~	0.200
_SubtreeDeserializationWithBufferSizes/512KB-4	589.1µ	597.4µ	~	0.400
_SubtreeDataDeserializationWithBufferSizes/16KB-4	36.23m	36.22m	~	0.700
_SubtreeDataDeserializationWithBufferSizes/32KB-4	35.65m	35.94m	~	0.400
_SubtreeDataDeserializationWithBufferSizes/64KB-4	35.83m	35.58m	~	0.400
_SubtreeDataDeserializationWithBufferSizes/128KB-4	35.92m	35.36m	~	0.100
_SubtreeDataDeserializationWithBufferSizes/512KB-4	35.82m	35.28m	~	0.100
_PooledVsNonPooled/Pooled-4	831.8n	829.8n	~	0.100
_PooledVsNonPooled/NonPooled-4	7.648µ	7.220µ	~	0.100
_MemoryFootprint/Current_512KB_32concurrent-4	6.784µ	6.902µ	~	0.700
_MemoryFootprint/Proposed_32KB_32concurrent-4	9.293µ	9.306µ	~	0.700
_MemoryFootprint/Alternative_64KB_32concurrent-4	9.101µ	9.122µ	~	0.700
StoreBlock_Sequential/BelowCSVHeight-4	342.4µ	333.5µ	~	0.100
StoreBlock_Sequential/AboveCSVHeight-4	336.3µ	333.1µ	~	0.100
GetUtxoHashes-4	275.5n	276.9n	~	1.000
GetUtxoHashes_ManyOutputs-4	45.02µ	45.08µ	~	0.700
_NewMetaDataFromBytes-4	216.2n	213.0n	~	0.200
_Bytes-4	394.8n	392.0n	~	0.700
_MetaBytes-4	339.5n	335.6n	~	0.100

Threshold: >10% with p < 0.05 | Generated: 2026-05-20 13:45 UTC

…ers (#912) Cherry-pick of PR #912 onto this branch. This carries the txmeta hot-path work that was developed on top of #828 and shipped as a separate PR against main; landing it here keeps #828's branch in sync so the merged form is internally consistent and a no-op vs main once #912 is merged upstream. End-to-end optimisation of the txmeta path between the validator (producer) and the subtree-validation receiver. Bench delivers 11.1M tx/s through the full producer→receiver pipeline at batch=1000 with 32 partitions on M3 Max against a real production-config Native cache, up from 749K-1.14M on the shard-fanout baseline. What's in this change: 1. Wire format v2 (services/validator/Validator.go, services/subtreevalidation/txmetaHandler.go). Backwards-compatible alongside v1; receiver auto-detects via magic byte 0xFF. [1 byte] magic = 0xFF [1 byte] version = 0x02 [2 bytes] reserved [4 bytes] entry count per entry: [8 bytes] xxhash(tx hash) [32 bytes] tx hash [1 byte] action (0=ADD, 1=DELETE) [4 bytes] content length [N bytes] content Producer groups items by partition using bucketIdx = xxhash(hash) % BucketsCount bucketsPerPartition = BucketsCount / NumPartitions partition = bucketIdx / bucketsPerPartition so each Kafka partition's records map to a disjoint contiguous range of receiver cache buckets. Concurrent per-partition receiver writers touch disjoint buckets and run lock-free relative to each other. Producer setting validator_txmeta_wireFormat (v1|v2, default v1) and validator_txmeta_numPartitions (default 32, must divide BucketsCount). 2. Receiver rewrite (txmetaHandler.go). Drops the 256-way shard fan-out + worker pool + caught-up latch. Parses v1 and v2 inline, dispatches all ADDs in one SetCacheMultiSequentialWithHashes call (or SetCacheMultiSequential on v1) per Kafka message, DELETEs inline. Parallelism comes from the per-partition Kafka consumer goroutines, not from internal fan-out. 3. Cache: SetMultiSequential family (stores/txmetacache/ improved_cache.go + txmetacache.go). Partition-aware twin of SetMulti — same per-bucket grouping but no errgroup fan-out. The With-Hashes variant takes caller-supplied xxhash values so the receiver passes the on-wire v2 hash straight through. 4. Cache: inner-shard collapse (improved_cache.go). bucketNative and bucketTrimmed maps were constructed with NewNativeSplitLockFreeMapUint64(cap, BucketsCount), meaning every bucket map had 8192 inner sub-shards. cleanLockedMap rebuilt the inner-sharded map on every chunk wrap (24% CPU, 64% of bench allocations, 80% of in-use memory on the scaling-2 prod pod). Outer ImprovedCache already shards via h%BucketsCount and b.mu serialises writes — inner shard count is now 1. deleteFromNativeSplitMapShard / deleteFromSwissSplitMapShard updated in lockstep to route via shard 0. Eviction unchanged: cache's gen+idx rule operates on map values regardless of inner shard count. bucketUnallocated and bucketPreallocated use plain Go maps; unaffected. 5. TxMetaCacheBucketType setting (settings/ subtreevalidation_settings.go, services/subtreevalidation/ Server.go). Bucket type was hard-coded to Unallocated; now selectable via subtreevalidation_txMetaCacheBucketType (unallocated|preallocated|trimmed|native, default unallocated). Unknown values fall back to Unallocated with a Warn log. 6. Pooled producer + receiver buffers. txmetaParseBuffers (receiver) pools keys/values/hashes/deletes slice headers plus single-arena keysBuf and contentBuf. txmetaPartitionsScratch (validator) pools the partitions outer slice + per-partition inner slices. 7. Kafka producer manual partitioning (util/kafka/ kafka_producer_async.go). Message.Partition field + KafkaProducerConfig.ManualPartitioning + WithManualPartitioning option. daemon/daemon_kafka.go passes it for the txmeta producer when wireFormat=v2. 8. In-memory broker test surface. InMemoryBroker.HasConsumer, DropTopic, TruncateTopic for e2e tests + long bench runs. 9. Tests. v2 serialization byte layout, partition routing, v2-format receiver parse, mixed ADD/DELETE, bucket-type resolver, end-to-end pipeline via in-memory broker (txmeta_e2e_test.go). Bench numbers (M3 Max, 32 partitions, 256-byte payload, real production-config Native cache): BenchmarkTxmetaProducerReceiver_V2_Batch100 3.96M tx/s 33 allocs/op BenchmarkTxmetaProducerReceiver_V2_Batch1000 11.10M tx/s 70 allocs/op Bench is scheduler-bound after this change (~60% pthread_cond_wait, ~20% pthread_cond_signal). Cache writes, parse, and serialize allocations no longer appear in the top consumers.

The CI test job builds with the testtxmetacache tag where BucketsCount=8. Test_sendTxMetaBatchV2_PartitionRouting hard-coded numPartitions=32 and TestTxmetaE2E_V2_RoundTrip drove publishV2/serializeV2Partitioned with numPartitions=32, both of which then computed BucketsCount/numPartitions without the clamp the production sendTxMetaBatchV2 already has, panicking with integer divide by zero. - publishV2 and serializeV2Partitioned: mirror the production clamp (numPartitions>=1, bucketsPerPartition>=1, p capped at numPartitions-1) so the test math matches the routing code under both BucketsCount=8192 and BucketsCount=8 builds. - Test_sendTxMetaBatchV2_PartitionRouting: cap numPartitions to BucketsCount. With more partitions than buckets the production code clamps bucketsPerPartition to 1 and the test's "find two hashes in different partitions" loop would not terminate meaningfully. Also drop a stray blank line at the end of the Server struct so gci stops flagging Server.go.

CI runs with the testtxmetacache build tag which sets BucketsCount=8. The v2 partition-routing tests + bench helpers used numPartitions=32 hard-coded, so bucketsPerPartition = 8/32 = 0, then bucket/0 panics. - services/validator/Validator_test.go: Test_sendTxMetaBatchV2_PartitionRouting caps numPartitions to BucketsCount and skips when BucketsCount < 2. - services/subtreevalidation/txmeta_e2e_test.go: publishV2 and serializeV2Partitioned now apply the same cap that validator.sendTxMetaBatchV2 already had. Under testtxmetacache they collapse to a few partitions; under the production build they remain 32.

sonarqubecloud · 2026-05-20T13:43:23Z

Quality Gate failed

Failed conditions
73.7% Coverage on New Code (required ≥ 80%)
4.0% Duplication on New Code (required ≤ 3%)
C Reliability Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud

Catch issues before they fail your Quality Gate with our IDE extension SonarQube for IDE

PR bsv-blockchain#912 (perf(txmeta): v2 wire format + partition-aware receiver) landed on main while this PR was open and added new uses of appendHeightToValue plus two new SetCacheMultiSequential* methods that call ImprovedCache's new SetMultiSequential / SetMultiSequentialWithHashes. My earlier "dead code removal" was wrong by the time it landed — height encoding is part of the active contract, not legacy. Changes: - Restore appendHeightToValue method, hitOldTx metric (declaration + atomic counter + Prometheus gauge), and noOfBlocksToKeepInTxMetaCache field + init. Functionally the height bytes are still ignored on read in the byte backend, but they're now load-bearing for the v2 receiver in bsv-blockchain#912 and the metric is still wired up. - Re-introduce the height suffix on SetCacheFromBytes and SetCacheMulti. - Add SetMultiSequential and SetMultiSequentialWithHashes to the cacheBackend interface; improvedCacheBackend forwards to the underlying *ImprovedCache; PointerCache shares its implementation with SetMultiFromBytes (the per-bucket goroutine fan-out doesn't exist there, so the partition-aware variant is identical for free). - Fix the empty-value SetCacheMulti subtest assertion: stored entry is again 4 bytes (the height suffix), not 0.

The v2 magic-byte detection was unsafe: v1 messages with entry counts 255, 511, 767, ... begin with byte 0xFF (the LE-uint32 low byte), so the naive `data[0] == 0xFF` check misclassified them. Count 767 in particular aliases the entire 4-byte v2 header `0xFF 0x02 0x00 0x00` exactly. Both legacy/netsync (this PR) and subtreevalidation (PR bsv-blockchain#912) had the bug, silently dropping the netsync announce or garbling the txmeta cache write. Detection now requires the full 4-byte header signature plus a plausibility check on the resulting v2 entry count (entries * minEntrySize must fit in the remaining buffer); any failure falls through to v1 parsing instead of dropping the message. Move the wire-format constants (action bytes, magic, version, header length, min entry size) into a new stores/txmetacache/wire.go so the producer (validator) and all consumers (subtreevalidation, legacy/netsync) reference one source of truth. Drops three local copies and prevents future drift. Regression tests cover the count=255 and count=767 collisions in both the netsync consumer and the subtreevalidation handler, asserting that the v1 dispatch path runs and the v2 path does not. Addresses Copilot review feedback on PR bsv-blockchain#954.

icellan requested review from Copilot, gokutheengineer and ordishs May 20, 2026 12:37

Copilot AI reviewed May 20, 2026

icellan added 2 commits May 20, 2026 15:20

icellan self-assigned this May 20, 2026

gokutheengineer approved these changes May 20, 2026

View reviewed changes

ordishs approved these changes May 20, 2026

View reviewed changes

icellan merged commit f3068fd into main May 20, 2026
30 checks passed

icellan deleted the feat/txmeta-v2-pool-perf branch May 20, 2026 16:50

This was referenced May 26, 2026

perf(txmetacache): pointer-backed cache backend + interface dispatch #915

Merged

perf(subtreevalidation): batch txmeta dispatch per shard, call SetCacheMulti #906

Closed

fix(legacy/netsync): auto-detect v2 txmeta wire format #954

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(txmeta): v2 wire format + partition-aware receiver + pooled buffers#912

perf(txmeta): v2 wire format + partition-aware receiver + pooled buffers#912
icellan merged 3 commits into
mainfrom
feat/txmeta-v2-pool-perf

icellan commented May 20, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

github-actions Bot commented May 20, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 20, 2026 •

edited

Loading

Uh oh!

sonarqubecloud Bot commented May 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

icellan commented May 20, 2026

Summary

What's in this PR

Wire format v2 + partition-aware producer

Receiver rewrite

Cache: new sequential paths + inner-shard collapse

TxMetaCacheBucketType setting

Pooled buffers

Kafka producer manual partitioning

In-memory broker test surface

Tests

Bench numbers

Production rollout

Test plan

Known follow-ups (separate PR)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmark Comparison Report

Summary

Uh oh!

sonarqubecloud Bot commented May 20, 2026

Quality Gate failed

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

github-actions Bot commented May 20, 2026 •

edited

Loading

github-actions Bot commented May 20, 2026 •

edited

Loading