perf(txmeta): v2 wire format + partition-aware receiver + pooled buffers #912

Merged: icellan merged 3 commits into main from feat/txmeta-v2-pool-perf on May 20, 2026

@icellan icellan commented May 20, 2026

Summary

End-to-end optimisation of the txmeta path between the validator (producer) and the subtree-validation receiver. The target is 10M tx/s per pod on the txmeta consume path.

The bench delivers 11.1M tx/s through the full producer→receiver pipeline at batch=1000 with 32 partitions on an M3 Max against a real production-config Native cache. The shard-fanout baseline was 749K tx/s at batch=100 and 1.14M tx/s at batch=1000.

What's in this PR

Wire format v2 + partition-aware producer

  • New v2 Kafka payload (magic 0xFF) carrying the producer-computed xxhash per entry so the receiver skips its own xxhash on the hot path.
  • Validator routes each tx to partition = (xxhash(hash) % BucketsCount) / (BucketsCount / NumPartitions). Each Kafka partition now maps to a disjoint contiguous range of receiver cache buckets — receivers writing concurrently across partitions take no cross-partition cache-bucket locks.
  • Receiver auto-detects via the magic byte; v1 messages continue to work.
  • Producer settings: validator_txmeta_wireFormat (v1|v2, default v1) and validator_txmeta_numPartitions (default 32, must divide BucketsCount).
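
The routing rule above can be sketched as a small pure function. This is an illustration, not the code from Validator.go: the name partitionFor is hypothetical, the hash is passed in as a precomputed uint64 so the sketch needs no xxhash dependency, and bucketsCount is assumed to be positive. The three clamps follow the ones described in the follow-up test-fix commits in this thread (numPartitions>=1, bucketsPerPartition>=1, p capped at numPartitions-1).

```go
package main

import "fmt"

// partitionFor maps a precomputed 64-bit hash of the tx hash to a Kafka
// partition so that each partition owns a contiguous range of cache buckets.
// Hypothetical helper; bucketsCount is assumed to be > 0.
func partitionFor(h uint64, bucketsCount, numPartitions int) int {
	if numPartitions < 1 {
		numPartitions = 1
	}
	bucketsPerPartition := bucketsCount / numPartitions
	if bucketsPerPartition < 1 {
		// More partitions than buckets (e.g. BucketsCount=8 in test builds):
		// without this clamp the division below would panic.
		bucketsPerPartition = 1
	}
	bucketIdx := int(h % uint64(bucketsCount))
	p := bucketIdx / bucketsPerPartition
	if p > numPartitions-1 {
		p = numPartitions - 1
	}
	return p
}

func main() {
	// 8192 buckets over 32 partitions gives 256 buckets per partition.
	for _, h := range []uint64{0, 255, 256, 8191, 8192} {
		fmt.Printf("hash=%d -> partition %d\n", h, partitionFor(h, 8192, 32))
	}
}
```

With 8192 buckets and 32 partitions, bucket indexes 0 to 255 land on partition 0, 256 to 511 on partition 1, and so on, which is what lets per-partition writers touch disjoint buckets.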

Receiver rewrite

  • Drops the 256-way shard fan-out + worker pool + caught-up latch in txmetaHandler.go.
  • Parses v1 and v2 inline; dispatches all ADDs in one SetCacheMultiSequentialWithHashes call per Kafka message; DELETEs inline.
  • Parallelism comes from the per-partition Kafka consumer goroutines (one per partition per fetch from util/kafka/kafka_consumer.go), not from internal fan-out.

Cache: new sequential paths + inner-shard collapse

  • SetMultiSequential / SetMultiSequentialWithHashes on ImprovedCache — partition-aware twins of SetMulti. Same per-bucket grouping but no errgroup fan-out (the caller already has parallelism via the Kafka per-partition consumer goroutines).
  • bucketNative and bucketTrimmed per-bucket maps were previously constructed with NewNativeSplitLockFreeMapUint64(cap, BucketsCount) → 8192 outer × 8192 inner sub-maps ≈ 67M sub-maps per cache instance. cleanLockedMap rebuilt that on every chunk wrap. Inner shard count is now 1. deleteFromNativeSplitMapShard / deleteFromSwissSplitMapShard updated in lockstep.
  • Eviction behaviour unchanged: the cache's gen+idx-based rule operates on map values regardless of inner shard count.
  • bucketUnallocated and bucketPreallocated use plain Go maps without inner sharding — unaffected. Production today runs bucketUnallocated, so this win lands only when the bucket type is switched (see next).

TxMetaCacheBucketType setting

  • New subtreevalidation_txMetaCacheBucketType setting (unallocated|preallocated|trimmed|native, default unallocated = current prod behaviour).
  • Unknown values fall back to unallocated with a Warn log so a typo never silently changes the deployed allocator.
  • subtreevalidation_txMetaCacheTrimRatio also exposed as a typed setting (currently only affects Preallocated).
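
A resolver with the fallback behaviour described above might look roughly like this. It is a sketch under assumptions: the function name resolveBucketType, the bucketType enum and its ordering, the injected warn callback, and the trimming and lower-casing of the input are all illustrative and are not taken from Server.go.

```go
package main

import (
	"fmt"
	"log"
	"strings"
)

// bucketType is a stand-in enum for this sketch; the real cache package
// defines its own type and values.
type bucketType int

const (
	bucketUnallocated bucketType = iota
	bucketPreallocated
	bucketTrimmed
	bucketNative
)

// resolveBucketType maps a setting string to a bucket type. Unknown values
// fall back to bucketUnallocated and emit a warning, so a typo does not
// silently select a different allocator. Normalising case and whitespace is
// an assumption of this sketch.
func resolveBucketType(s string, warn func(format string, args ...any)) bucketType {
	switch strings.ToLower(strings.TrimSpace(s)) {
	case "unallocated":
		return bucketUnallocated
	case "preallocated":
		return bucketPreallocated
	case "trimmed":
		return bucketTrimmed
	case "native":
		return bucketNative
	default:
		warn("unknown txMetaCacheBucketType %q, falling back to unallocated", s)
		return bucketUnallocated
	}
}

func main() {
	fmt.Println(resolveBucketType("native", log.Printf) == bucketNative)
	fmt.Println(resolveBucketType("nativ", log.Printf) == bucketUnallocated)
}
```

Passing the logger in as a function keeps the resolver easy to unit-test: a test can count warnings instead of scraping log output.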

Pooled buffers

  • txmetaParseBuffers (receiver) pools keys/values/hashes/deletes slice headers plus single-arena keysBuf and contentBuf — per-entry hash/content allocations collapse to one append into a pre-sized arena.
  • txmetaPartitionsScratch (validator) pools the partitions outer slice + per-partition inner slices for sendTxMetaBatchV2.
  • Byte buffers published to franz-go are NOT pooled: no Publish callback hook yet for safe return.
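
The single-arena idea can be illustrated with a reduced version of such a pooled struct. This is a minimal sketch, not the txmetaParseBuffers type from the PR: the type name parseBuffers, its fields, the addKey and reset helpers, and the initial capacities are assumptions chosen for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// parseBuffers holds per-message scratch state. Keys are copied into one
// shared arena (keysBuf) and keys[i] is a sub-slice of that arena, so adding
// a key costs an append rather than a fresh allocation.
type parseBuffers struct {
	keys    [][]byte
	hashes  []uint64
	keysBuf []byte
}

var parseBufPool = sync.Pool{
	New: func() any {
		return &parseBuffers{
			keys:    make([][]byte, 0, 1024),
			hashes:  make([]uint64, 0, 1024),
			keysBuf: make([]byte, 0, 1024*32), // room for 1024 32-byte hashes
		}
	},
}

// reset truncates all slices while keeping their backing arrays for reuse.
func (b *parseBuffers) reset() {
	b.keys = b.keys[:0]
	b.hashes = b.hashes[:0]
	b.keysBuf = b.keysBuf[:0]
}

// addKey copies key into the arena and records the sub-slice plus its hash.
// If the arena outgrows its capacity, append reallocates; earlier sub-slices
// stay valid because they still reference the old backing array.
func (b *parseBuffers) addKey(key []byte, h uint64) {
	start := len(b.keysBuf)
	b.keysBuf = append(b.keysBuf, key...)
	end := len(b.keysBuf)
	b.keys = append(b.keys, b.keysBuf[start:end:end])
	b.hashes = append(b.hashes, h)
}

func main() {
	b := parseBufPool.Get().(*parseBuffers)
	b.reset()
	defer parseBufPool.Put(b)

	b.addKey([]byte("tx-hash-a"), 1)
	b.addKey([]byte("tx-hash-b"), 2)
	fmt.Println(len(b.keys), len(b.keysBuf), string(b.keys[1]))
}
```

The caveat with this pattern is ownership: anything that retains keys[i] after the struct goes back to the pool will see it overwritten by the next message, which is the same reason the PR leaves the byte buffers handed to franz-go unpooled.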

Kafka producer manual partitioning

  • New Message.Partition field + KafkaProducerConfig.ManualPartitioning + WithManualPartitioning option. When enabled, registers kgo.ManualPartitioner and flushBuffered honours record.Partition. Off by default; existing StickyKeyPartitioner behaviour preserved.
  • daemon/daemon_kafka.go wires it for the txmeta producer when wireFormat=v2.

In-memory broker test surface

  • Adds InMemoryBroker.HasConsumer and DropTopic so e2e tests can wait for consumer-group registration and release retained-messages state at teardown. TruncateTopic also added for long bench runs.

Tests

  • Unit coverage for v2 serialization byte layout, partition routing rule, v2-format receiver parse, mixed ADD/DELETE handling, bucket-type resolver, end-to-end pipeline via the in-memory broker (txmeta_e2e_test.go).

Bench numbers

M3 Max, 32 partitions, 256-byte payload, real production-config Native cache (sync.Once-shared across bench invocations):

Benchmark                                      tx/s     allocs/op   bytes/op
BenchmarkTxmetaProducerReceiver_V2_Batch100    3.96M    33          2.1 KB
BenchmarkTxmetaProducerReceiver_V2_Batch1000   11.10M   70          4.4 KB

For context, baseline at start of optimisation work was 749K (batch=100) / 1.14M (batch=1000) tx/s with ~2400 allocs/op.

The bench is now scheduler-bound (~60% pthread_cond_wait, ~20% pthread_cond_signal). Cache writes, parse, and serialize allocations no longer appear in the top consumers.

Production rollout

Default behaviour is unchanged. Existing deploys keep bucketUnallocated, wireFormat=v1, no manual partitioning. New settings are opt-in.

To unlock the wins in prod (recommended sequence, one knob per canary):

  1. Receiver supports v2 already (auto-detect). Roll the new build to subtree-validation pods first.
  2. Switch validator emit to v2: set validator_txmeta_wireFormat=v2 and validator_txmeta_numPartitions to match the actual txmeta topic partition count (must divide BucketsCount=8192). Confirm receiver lag stays flat.
  3. Switch cache bucket type: set subtreevalidation_txMetaCacheBucketType=native on one canary subtree-validation pod with production TxMetaCacheMaxMB. Watch RSS, cleanLockedMap allocation rate (heap pprof), subtreevalidation_set_tx_meta_cache_kafka histogram. Roll out once a chunk-rotation cycle is complete with healthy memory.
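
Before step 2, the "must divide BucketsCount" constraint can be checked mechanically. The function below is a hypothetical pre-deploy check written for this example; it is not part of the PR, and the PR text does not say whether the production code rejects or clamps a non-dividing value.

```go
package main

import (
	"errors"
	"fmt"
)

// validateNumPartitions reports whether numPartitions splits bucketsCount
// into equal contiguous ranges. Hypothetical helper for illustration.
func validateNumPartitions(numPartitions, bucketsCount int) error {
	switch {
	case bucketsCount < 1:
		return errors.New("bucketsCount must be positive")
	case numPartitions < 1:
		return errors.New("numPartitions must be positive")
	case bucketsCount%numPartitions != 0:
		return fmt.Errorf("numPartitions=%d does not divide bucketsCount=%d",
			numPartitions, bucketsCount)
	}
	return nil
}

func main() {
	for _, n := range []int{32, 48, 64} {
		if err := validateNumPartitions(n, 8192); err != nil {
			fmt.Printf("numPartitions=%d: rejected: %v\n", n, err)
			continue
		}
		fmt.Printf("numPartitions=%d: ok, %d buckets per partition\n", n, 8192/n)
	}
}
```

Because 8192 is a power of two, only power-of-two partition counts up to 8192 pass; a 48-partition topic, for example, would be rejected.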

Settings honoured as today: TxMetaCacheMaxMB, txMetaCacheTrimRatio (still only affects Preallocated), TxMetaCacheEnabled.

Test plan

  • go build ./... clean
  • go vet clean across touched packages
  • Unit tests pass: handler v1/v2 parse, mixed actions, bucket type resolver, validator serializer, partition routing
  • E2E tests pass (in-memory broker + real Native cache, both wire formats)
  • Cache unit tests pass under testtxmetacache build tag
  • Canary one subtree-validation pod with subtreevalidation_txMetaCacheBucketType=native at production cache size
  • Canary one validator pod with validator_txmeta_wireFormat=v2
  • Verify subtreevalidation_set_tx_meta_cache_kafka histogram + RSS in the canary

Known follow-ups (separate PR)

  • bucketUnallocated.cleanLockedMap still allocates a fresh map[uint64]uint64 on every chunk wrap (held 17 GB in-use on scaling-2). Worth a sync.Pool+clear(m) pass for the prod default bucket type.
  • Pool the v2 byte buffer published to franz-go via a producer callback hook (current code allocates per partition per batch).
  • Per-fetch goroutine spawn in kafka_consumer.go's EachPartition could be replaced with persistent partition workers fed by channels. Bench is scheduler-bound; production isn't (fetch rate is ~30/sec vs bench ~13k iter/sec), so this is a "bank headroom" change.
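
The sync.Pool+clear(m) idea in the first follow-up could take roughly this shape. It is a sketch only: getCleanMap, putMap, and the initial capacity are made-up names and values, and the built-in clear requires Go 1.21 or later.

```go
package main

import (
	"fmt"
	"sync"
)

// mapPool recycles map[uint64]uint64 instances so a chunk wrap can reuse an
// emptied map instead of allocating a fresh one.
var mapPool = sync.Pool{
	New: func() any { return make(map[uint64]uint64, 1024) },
}

// getCleanMap returns an empty map, either recycled or newly allocated.
func getCleanMap() map[uint64]uint64 {
	return mapPool.Get().(map[uint64]uint64)
}

// putMap empties m and returns it to the pool. clear removes all entries but
// keeps the map's internal storage, which is what makes reuse cheap.
func putMap(m map[uint64]uint64) {
	clear(m)
	mapPool.Put(m)
}

func main() {
	m := getCleanMap()
	m[42] = 7
	putMap(m)

	m2 := getCleanMap()
	fmt.Println(len(m2)) // always empty, whether recycled or new
}
```

One trade-off to weigh: a cleared Go map keeps the storage it grew to, so pooling trades allocation rate for a higher steady-state footprint per pooled map.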

End-to-end optimisation of the txmeta path between the validator (producer)
and the subtree-validation receiver. Bench delivers 11.1M tx/s through the
full producer→receiver pipeline at batch=1000 with 32 partitions on M3 Max
against a real production-config Native cache, up from 749K-1.14M on the
shard-fanout baseline.

What's in this change:

1. Wire format v2 (services/validator/Validator.go,
   services/subtreevalidation/txmetaHandler.go). Backwards-compatible
   alongside v1; receiver auto-detects via magic byte 0xFF.

       [1 byte]  magic = 0xFF
       [1 byte]  version = 0x02
       [2 bytes] reserved
       [4 bytes] entry count
       per entry:
         [8 bytes]  xxhash(tx hash)
         [32 bytes] tx hash
         [1 byte]   action (0=ADD, 1=DELETE)
         [4 bytes]  content length
         [N bytes]  content

   Producer groups items by partition using
       bucketIdx           = xxhash(hash) % BucketsCount
       bucketsPerPartition = BucketsCount / NumPartitions
       partition           = bucketIdx / bucketsPerPartition

   so each Kafka partition's records map to a disjoint contiguous range
   of receiver cache buckets. Concurrent per-partition receiver writers
   touch disjoint buckets and run lock-free relative to each other.

   Producer setting validator_txmeta_wireFormat (v1|v2, default v1) and
   validator_txmeta_numPartitions (default 32, must divide BucketsCount).
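
   A serializer for the layout above can be sketched as follows. The
   byte order of the multi-byte fields is not stated in this message;
   little-endian is an assumption here, and the names entry and
   encodeV2 are illustrative rather than taken from Validator.go.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	magicV2      = 0xFF
	versionV2    = 0x02
	actionAdd    = 0
	actionDelete = 1
)

// entry is one record of a v2 payload (hypothetical type for this sketch).
type entry struct {
	Hash64  uint64   // producer-computed xxhash of TxHash
	TxHash  [32]byte // transaction hash
	Action  byte     // actionAdd or actionDelete
	Content []byte   // serialized metadata; may be empty for deletes
}

// encodeV2 writes the 8-byte header followed by each entry. Little-endian
// integers are an assumption of this sketch.
func encodeV2(entries []entry) []byte {
	size := 8
	for _, e := range entries {
		size += 8 + 32 + 1 + 4 + len(e.Content)
	}
	buf := make([]byte, 0, size) // pre-sized, so appends do not reallocate

	buf = append(buf, magicV2, versionV2, 0, 0) // magic, version, 2 reserved
	buf = binary.LittleEndian.AppendUint32(buf, uint32(len(entries)))
	for _, e := range entries {
		buf = binary.LittleEndian.AppendUint64(buf, e.Hash64)
		buf = append(buf, e.TxHash[:]...)
		buf = append(buf, e.Action)
		buf = binary.LittleEndian.AppendUint32(buf, uint32(len(e.Content)))
		buf = append(buf, e.Content...)
	}
	return buf
}

func main() {
	msg := encodeV2([]entry{
		{Hash64: 1, Action: actionAdd, Content: []byte("meta")},
		{Hash64: 2, Action: actionDelete},
	})
	fmt.Printf("%d bytes, header % x\n", len(msg), msg[:8])
}
```

   Each entry costs 45 bytes of framing plus its content, so the two
   entries in main produce 8 + (45+4) + 45 = 102 bytes.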

2. Receiver rewrite (txmetaHandler.go). Drops the 256-way shard
   fan-out + worker pool + caught-up latch. Parses v1 and v2 inline,
   dispatches all ADDs in one SetCacheMultiSequentialWithHashes call
   (or SetCacheMultiSequential on v1) per Kafka message, DELETEs
   inline. Parallelism comes from the per-partition Kafka consumer
   goroutines (one per partition per fetch from util/kafka/
   kafka_consumer.go), not from internal fan-out.
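
   The v2 side of the inline parse might look like the sketch below.
   It assumes little-endian integers (the byte order is not stated
   here), and the names isV2, parseV2, and parsed are illustrative.
   The real handler also parses v1 and writes into pooled buffers;
   both are omitted.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

var (
	errNotV2     = errors.New("txmeta: not a v2 message")
	errTruncated = errors.New("txmeta v2: truncated message")
	errAction    = errors.New("txmeta v2: unknown action")
)

// parsed collects ADDs as parallel slices (ready for one batched cache call)
// and DELETE keys separately. Hypothetical type for this sketch.
type parsed struct {
	hashes  []uint64
	keys    [][32]byte
	values  [][]byte
	deletes [][32]byte
}

// isV2 checks the magic and version bytes of the 8-byte header.
func isV2(msg []byte) bool {
	return len(msg) >= 8 && msg[0] == 0xFF && msg[1] == 0x02
}

// parseV2 walks the entries, bounds-checking before every read.
func parseV2(msg []byte) (*parsed, error) {
	if !isV2(msg) {
		return nil, errNotV2
	}
	count := binary.LittleEndian.Uint32(msg[4:8])
	out := &parsed{}
	off := 8
	for i := uint32(0); i < count; i++ {
		if len(msg)-off < 45 { // 8 hash + 32 key + 1 action + 4 length
			return nil, errTruncated
		}
		h := binary.LittleEndian.Uint64(msg[off:])
		var key [32]byte
		copy(key[:], msg[off+8:off+40])
		action := msg[off+40]
		n := binary.LittleEndian.Uint32(msg[off+41:])
		off += 45
		if uint64(len(msg)-off) < uint64(n) {
			return nil, errTruncated
		}
		content := msg[off : off+int(n)] // aliases msg, no copy
		off += int(n)
		switch action {
		case 0: // ADD
			out.hashes = append(out.hashes, h)
			out.keys = append(out.keys, key)
			out.values = append(out.values, content)
		case 1: // DELETE
			out.deletes = append(out.deletes, key)
		default:
			return nil, errAction
		}
	}
	return out, nil
}

func main() {
	msg := []byte{0xFF, 0x02, 0, 0, 1, 0, 0, 0} // header, one entry
	msg = append(msg, make([]byte, 8+32)...)    // zero hash and key
	msg = append(msg, 0, 2, 0, 0, 0, 'h', 'i')  // ADD, length 2, "hi"
	p, err := parseV2(msg)
	fmt.Println(err, len(p.values), string(p.values[0]))
}
```

   Because the hash arrives on the wire, the parse loop does no
   hashing of its own; the collected hashes can be handed straight to
   a batched set call.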

3. Cache: SetMultiSequential family (stores/txmetacache/
   improved_cache.go + txmetacache.go). Partition-aware twin of
   SetMulti — same per-bucket grouping but no errgroup fan-out. The
   With-Hashes variant takes caller-supplied xxhash values so the
   receiver passes the on-wire v2 hash straight through.

4. Cache: inner-shard collapse (improved_cache.go). bucketNative
   and bucketTrimmed maps were constructed with
   NewNativeSplitLockFreeMapUint64(cap, BucketsCount), meaning every
   bucket map had 8192 inner sub-shards. 8192 outer * 8192 inner ~
   67M sub-maps per cache instance. cleanLockedMap rebuilt the
   inner-sharded map on every chunk wrap; that pattern dominated CPU
   (24% in NewNativeSplitLockFreeMapUint64) and heap (34.5 GB / 64%
   of bench allocations; 80% of in-use memory on the scaling-2 prod
   pod).

   Outer ImprovedCache already shards across BucketsCount outer
   buckets via h%BucketsCount, and b.mu serialises writes to one
   bucket. Inner shard count is now 1. deleteFromNativeSplitMapShard
   and deleteFromSwissSplitMapShard updated in lockstep — they
   previously routed via h%BucketsCount, which no longer matches the
   map's internal hash%nrOfBuckets dispatch with nrOfBuckets=1.

   Eviction behaviour unchanged: the cache's gen+idx-based rule
   operates on map values regardless of inner shard count; IterAll
   visits every entry the same way.

5. TxMetaCacheBucketType setting (settings/
   subtreevalidation_settings.go, services/subtreevalidation/
   Server.go). Bucket type was hard-coded to Unallocated; now
   selectable via subtreevalidation_txMetaCacheBucketType
   (unallocated|preallocated|trimmed|native, default unallocated).
   Unknown values fall back to Unallocated with a Warn log so a typo
   never silently changes the allocator. Production default is
   unchanged; canary one pod on "native" with production
   TxMetaCacheMaxMB before rolling out.

   Also exposes subtreevalidation_txMetaCacheTrimRatio as a typed
   setting (currently only affects Preallocated; gocore key was
   already read inside the cache constructor).

6. Pooled producer + receiver buffers (Validator.go,
   txmetaHandler.go). txmetaParseBuffers (receiver) pools
   keys/values/hashes/deletes slice headers plus single-arena
   keysBuf and contentBuf so per-entry hash/content allocations
   collapse to one append into a pre-sized buffer.
   txmetaPartitionsScratch (validator) pools the partitions outer
   slice + per-partition inner slices for sendTxMetaBatchV2. The
   byte buffer published to franz-go is NOT pooled: no Publish
   callback hook yet for safe return.

7. Kafka producer manual partitioning (util/kafka/
   kafka_producer_async.go). New Message.Partition field +
   KafkaProducerConfig.ManualPartitioning. When true, registers
   kgo.ManualPartitioner and flushBuffered honours
   record.Partition. Off by default — existing
   StickyKeyPartitioner behaviour preserved. New
   WithManualPartitioning ProducerOption; daemon/daemon_kafka.go
   passes it for the txmeta producer when wireFormat=v2.

8. In-memory broker test surface (util/kafka/in_memory_kafka/
   in_memory_kafka.go). Adds InMemoryBroker.HasConsumer and
   DropTopic so e2e tests can wait for consumer-group registration
   (the broker drops messages produced to a topic with no
   consumers) and release retained-messages state at teardown.
   TruncateTopic also added for long bench runs that would
   otherwise pin gigabytes in the broker's per-topic message slice.

9. Tests (Validator_test.go, txmetaHandler_test.go,
   Server_coverage_test.go, txmeta_e2e_test.go). Unit coverage for
   v2 serialization (byte layout), partition routing rule,
   v2-format receiver parse, mixed ADD/DELETE handling, bucket
   type resolver, end-to-end pipeline via the in-memory broker.

Bench numbers (M3 Max, 32 partitions, 256-byte payload, real
production-config Native cache, sync.Once-shared across bench
invocations):

  BenchmarkTxmetaProducerReceiver_V2_Batch100   3.96M tx/s   33 allocs/op
  BenchmarkTxmetaProducerReceiver_V2_Batch1000 11.10M tx/s   70 allocs/op

Bench is now scheduler-bound (~60% pthread_cond_wait,
~20% pthread_cond_signal). Cache writes, parse, and serialize
allocations no longer appear in the top consumers.

Operational notes for production rollout:

- Default behaviour unchanged. Existing deploys keep
  bucketUnallocated, wireFormat=v1, no manual partitioning. New
  settings are opt-in.
- To benefit from #1+#2 in prod: set validator_txmeta_wireFormat=v2
  and validator_txmeta_numPartitions to match the actual txmeta
  topic partition count.
- To benefit from #4 in prod: set
  subtreevalidation_txMetaCacheBucketType=native. Canary first;
  the deployed pod's bucketUnallocated path is unaffected.
- TxMetaCacheMaxMB=73728, txMetaCacheTrimRatio=5 settings are
  honoured as today.

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

github-actions Bot commented May 20, 2026

🤖 Claude Code Review

Status: Complete

This is a well-engineered performance optimization PR targeting 10M+ tx/s throughput for the txmeta cache pipeline. The changes are substantial but carefully structured with strong test coverage and backward compatibility.

Architecture Changes:

  • Introduces v2 wire format with producer-computed xxhash values to eliminate receiver-side rehashing
  • Removes 256-way shard fan-out worker pool in favor of Kafka's native per-partition parallelism
  • Collapses cache inner-shard count from 8192→1 per bucket (eliminates ~67M sub-maps)
  • Adds partition-aware routing so each Kafka partition maps to disjoint cache bucket ranges

Code Quality:

  • Comprehensive test coverage including unit tests, partition routing tests, and end-to-end tests
  • Proper handling of the testtxmetacache build tag edge cases (BucketsCount=8 vs 8192)
  • Backward compatible: v1 format still supported, new features are opt-in via settings
  • Extensive documentation in commit messages and code comments
  • Proper use of sync.Pool for buffer reuse

Safety & Rollout:

  • Default behavior unchanged (v1 format, unallocated bucket type)
  • Settings validation with fallback for unknown bucket types
  • Canary rollout plan documented in PR description
  • No breaking changes to existing deployments

Performance Results:

  • Benchmark shows 11.1M tx/s at batch=1000 (up from 749K-1.14M baseline)
  • Allocations reduced from ~2400 to 70 allocs/op
  • Memory efficiency improved significantly

No critical issues found. The implementation follows project conventions in CLAUDE.md and includes proper error handling, input validation, and defensive programming practices.

github-actions Bot commented May 20, 2026

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-912 (3a0be2c)

Summary

  • Regressions: 0
  • Improvements: 0
  • Unchanged: 144
  • Significance level: p < 0.05

All benchmark results (sec/op)

Benchmark Baseline Current Change p-value
_NewBlockFromBytes-4 1.835µ 1.831µ ~ 0.700
SplitSyncedParentMap_SetIfNotExists/256_buckets-4 59.44n 59.33n ~ 1.000
SplitSyncedParentMap_SetIfNotExists/16_buckets-4 59.33n 59.14n ~ 0.200
SplitSyncedParentMap_SetIfNotExists/1_bucket-4 59.40n 59.18n ~ 0.100
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets... 40.40n 37.02n ~ 0.100
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_... 69.83n 64.76n ~ 0.200
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa... 190.5n 177.0n ~ 0.100
MiningCandidate_Stringify_Short-4 261.5n 262.2n ~ 1.000
MiningCandidate_Stringify_Long-4 1.852µ 1.791µ ~ 0.100
MiningSolution_Stringify-4 936.3n 921.0n ~ 0.400
BlockInfo_MarshalJSON-4 1.820µ 1.776µ ~ 0.200
NewFromBytes-4 141.9n 126.4n ~ 0.100
AddTxBatchColumnar_Validation-4 2.472µ 2.481µ ~ 0.100
OffsetValidationLoop-4 635.1n 637.7n ~ 1.000
Mine_EasyDifficulty-4 60.49µ 60.73µ ~ 0.400
Mine_WithAddress-4 7.423µ 6.842µ ~ 0.400
BlockAssembler_AddTx-4 0.02640n 0.02611n ~ 1.000
AddNode-4 10.59 10.59 ~ 1.000
AddNodeWithMap-4 11.80 10.39 ~ 0.200
DirectSubtreeAdd/4_per_subtree-4 57.96n 58.92n ~ 0.400
DirectSubtreeAdd/64_per_subtree-4 29.29n 29.35n ~ 1.000
DirectSubtreeAdd/256_per_subtree-4 28.36n 28.08n ~ 0.300
DirectSubtreeAdd/1024_per_subtree-4 26.62n 26.80n ~ 0.200
DirectSubtreeAdd/2048_per_subtree-4 26.22n 26.32n ~ 0.100
SubtreeProcessorAdd/4_per_subtree-4 295.1n 296.8n ~ 0.700
SubtreeProcessorAdd/64_per_subtree-4 291.4n 289.4n ~ 1.000
SubtreeProcessorAdd/256_per_subtree-4 293.7n 289.9n ~ 0.200
SubtreeProcessorAdd/1024_per_subtree-4 288.3n 277.7n ~ 0.200
SubtreeProcessorAdd/2048_per_subtree-4 284.0n 277.9n ~ 0.100
SubtreeProcessorRotate/4_per_subtree-4 283.3n 284.5n ~ 1.000
SubtreeProcessorRotate/64_per_subtree-4 277.8n 281.6n ~ 0.700
SubtreeProcessorRotate/256_per_subtree-4 286.7n 281.7n ~ 0.100
SubtreeProcessorRotate/1024_per_subtree-4 287.1n 280.5n ~ 0.100
SubtreeNodeAddOnly/4_per_subtree-4 55.37n 55.60n ~ 0.400
SubtreeNodeAddOnly/64_per_subtree-4 36.26n 36.27n ~ 1.000
SubtreeNodeAddOnly/256_per_subtree-4 35.35n 35.20n ~ 0.100
SubtreeNodeAddOnly/1024_per_subtree-4 34.85n 34.65n ~ 0.400
SubtreeCreationOnly/4_per_subtree-4 112.4n 113.5n ~ 0.700
SubtreeCreationOnly/64_per_subtree-4 360.2n 362.8n ~ 0.700
SubtreeCreationOnly/256_per_subtree-4 1.277µ 1.272µ ~ 0.500
SubtreeCreationOnly/1024_per_subtree-4 3.947µ 3.875µ ~ 0.200
SubtreeCreationOnly/2048_per_subtree-4 7.208µ 7.162µ ~ 0.400
SubtreeProcessorOverheadBreakdown/64_per_subtree-4 288.2n 283.6n ~ 0.100
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4 288.5n 282.0n ~ 0.100
ParallelGetAndSetIfNotExists/1k_nodes-4 2.023m 2.031m ~ 0.700
ParallelGetAndSetIfNotExists/10k_nodes-4 5.271m 5.260m ~ 1.000
ParallelGetAndSetIfNotExists/50k_nodes-4 7.662m 7.233m ~ 0.100
ParallelGetAndSetIfNotExists/100k_nodes-4 10.21m 10.01m ~ 0.200
SequentialGetAndSetIfNotExists/1k_nodes-4 1.781m 1.799m ~ 0.100
SequentialGetAndSetIfNotExists/10k_nodes-4 4.560m 4.477m ~ 0.700
SequentialGetAndSetIfNotExists/50k_nodes-4 13.89m 13.76m ~ 0.700
SequentialGetAndSetIfNotExists/100k_nodes-4 25.73m 25.70m ~ 0.700
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4 2.063m 2.063m ~ 1.000
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4 8.489m 8.454m ~ 0.700
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4 13.76m 13.43m ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4 1.807m 1.799m ~ 0.700
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4 9.229m 8.297m ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4 47.77m 44.67m ~ 0.100
DiskTxMap_SetIfNotExists-4 3.671µ 3.620µ ~ 1.000
DiskTxMap_SetIfNotExists_Parallel-4 3.263µ 3.420µ ~ 0.100
DiskTxMap_ExistenceOnly-4 311.8n 307.0n ~ 0.700
Queue-4 183.5n 184.4n ~ 0.200
AtomicPointer-4 3.655n 3.246n ~ 0.100
ReorgOptimizations/DedupFilterPipeline/Old/10K-4 800.1µ 813.2µ ~ 0.100
ReorgOptimizations/DedupFilterPipeline/New/10K-4 765.3µ 766.4µ ~ 1.000
ReorgOptimizations/AllMarkFalse/Old/10K-4 114.2µ 102.1µ ~ 0.100
ReorgOptimizations/AllMarkFalse/New/10K-4 64.51µ 64.31µ ~ 1.000
ReorgOptimizations/HashSlicePool/Old/10K-4 51.19µ 51.27µ ~ 1.000
ReorgOptimizations/HashSlicePool/New/10K-4 11.17µ 11.10µ ~ 0.200
ReorgOptimizations/NodeFlags/Old/10K-4 4.320µ 4.379µ ~ 0.100
ReorgOptimizations/NodeFlags/New/10K-4 1.452µ 1.506µ ~ 0.100
ReorgOptimizations/DedupFilterPipeline/Old/100K-4 9.162m 9.194m ~ 0.700
ReorgOptimizations/DedupFilterPipeline/New/100K-4 9.537m 9.684m ~ 0.400
ReorgOptimizations/AllMarkFalse/Old/100K-4 1.061m 1.067m ~ 0.700
ReorgOptimizations/AllMarkFalse/New/100K-4 708.7µ 705.9µ ~ 0.400
ReorgOptimizations/HashSlicePool/Old/100K-4 516.8µ 489.6µ ~ 0.100
ReorgOptimizations/HashSlicePool/New/100K-4 201.9µ 205.2µ ~ 0.700
ReorgOptimizations/NodeFlags/Old/100K-4 45.60µ 46.87µ ~ 0.100
ReorgOptimizations/NodeFlags/New/100K-4 16.32µ 16.32µ ~ 1.000
TxMapSetIfNotExists-4 49.43n 49.04n ~ 0.400
TxMapSetIfNotExistsDuplicate-4 41.57n 41.47n ~ 0.600
ChannelSendReceive-4 606.4n 596.1n ~ 0.700
CalcBlockWork-4 502.2n 500.8n ~ 1.000
CalculateWork-4 739.9n 730.8n ~ 1.000
BuildBlockLocatorString_Helpers/Size_10-4 1.312µ 1.425µ ~ 0.100
BuildBlockLocatorString_Helpers/Size_100-4 12.51µ 12.33µ ~ 0.100
BuildBlockLocatorString_Helpers/Size_1000-4 157.0µ 124.9µ ~ 0.100
CatchupWithHeaderCache-4 103.9m 103.9m ~ 0.400
_prepareTxsPerLevel-4 413.5m 416.3m ~ 1.000
_prepareTxsPerLevelOrdered-4 3.723m 4.144m ~ 0.200
_prepareTxsPerLevel_Comparison/Original-4 421.5m 425.4m ~ 1.000
_prepareTxsPerLevel_Comparison/Optimized-4 3.807m 4.258m ~ 0.100
SubtreeSizes/10k_tx_4_per_subtree-4 1.380m 1.352m ~ 1.000
SubtreeSizes/10k_tx_16_per_subtree-4 321.4µ 325.1µ ~ 0.200
SubtreeSizes/10k_tx_64_per_subtree-4 76.82µ 78.46µ ~ 0.100
SubtreeSizes/10k_tx_256_per_subtree-4 19.41µ 19.37µ ~ 1.000
SubtreeSizes/10k_tx_512_per_subtree-4 9.547µ 9.659µ ~ 0.200
SubtreeSizes/10k_tx_1024_per_subtree-4 4.746µ 4.806µ ~ 0.400
SubtreeSizes/10k_tx_2k_per_subtree-4 2.378µ 2.364µ ~ 1.000
BlockSizeScaling/10k_tx_64_per_subtree-4 75.38µ 76.49µ ~ 0.200
BlockSizeScaling/10k_tx_256_per_subtree-4 19.09µ 19.08µ ~ 1.000
BlockSizeScaling/10k_tx_1024_per_subtree-4 4.764µ 4.776µ ~ 1.000
BlockSizeScaling/50k_tx_64_per_subtree-4 399.5µ 401.4µ ~ 1.000
BlockSizeScaling/50k_tx_256_per_subtree-4 95.44µ 96.30µ ~ 0.700
BlockSizeScaling/50k_tx_1024_per_subtree-4 23.53µ 23.84µ ~ 0.400
SubtreeAllocations/small_subtrees_exists_check-4 159.1µ 162.6µ ~ 0.100
SubtreeAllocations/small_subtrees_data_fetch-4 166.2µ 167.9µ ~ 0.400
SubtreeAllocations/small_subtrees_full_validation-4 324.7µ 330.9µ ~ 0.100
SubtreeAllocations/medium_subtrees_exists_check-4 9.304µ 9.564µ ~ 0.100
SubtreeAllocations/medium_subtrees_data_fetch-4 9.527µ 9.942µ ~ 0.100
SubtreeAllocations/medium_subtrees_full_validation-4 19.06µ 19.32µ ~ 0.100
SubtreeAllocations/large_subtrees_exists_check-4 2.229µ 2.258µ ~ 0.100
SubtreeAllocations/large_subtrees_data_fetch-4 2.325µ 2.407µ ~ 0.100
SubtreeAllocations/large_subtrees_full_validation-4 4.769µ 4.864µ ~ 0.100
_BufferPoolAllocation/16KB-4 4.648µ 3.915µ ~ 0.700
_BufferPoolAllocation/32KB-4 8.321µ 7.566µ ~ 0.400
_BufferPoolAllocation/64KB-4 16.72µ 15.94µ ~ 0.700
_BufferPoolAllocation/128KB-4 34.18µ 24.40µ ~ 0.100
_BufferPoolAllocation/512KB-4 109.5µ 106.1µ ~ 1.000
_BufferPoolConcurrent/32KB-4 17.85µ 18.35µ ~ 0.700
_BufferPoolConcurrent/64KB-4 29.32µ 28.52µ ~ 0.400
_BufferPoolConcurrent/512KB-4 139.8µ 146.3µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/16KB-4 609.5µ 639.0µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/32KB-4 610.1µ 637.9µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/64KB-4 598.0µ 614.6µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/128KB-4 592.9µ 599.2µ ~ 0.200
_SubtreeDeserializationWithBufferSizes/512KB-4 589.1µ 597.4µ ~ 0.400
_SubtreeDataDeserializationWithBufferSizes/16KB-4 36.23m 36.22m ~ 0.700
_SubtreeDataDeserializationWithBufferSizes/32KB-4 35.65m 35.94m ~ 0.400
_SubtreeDataDeserializationWithBufferSizes/64KB-4 35.83m 35.58m ~ 0.400
_SubtreeDataDeserializationWithBufferSizes/128KB-4 35.92m 35.36m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/512KB-4 35.82m 35.28m ~ 0.100
_PooledVsNonPooled/Pooled-4 831.8n 829.8n ~ 0.100
_PooledVsNonPooled/NonPooled-4 7.648µ 7.220µ ~ 0.100
_MemoryFootprint/Current_512KB_32concurrent-4 6.784µ 6.902µ ~ 0.700
_MemoryFootprint/Proposed_32KB_32concurrent-4 9.293µ 9.306µ ~ 0.700
_MemoryFootprint/Alternative_64KB_32concurrent-4 9.101µ 9.122µ ~ 0.700
StoreBlock_Sequential/BelowCSVHeight-4 342.4µ 333.5µ ~ 0.100
StoreBlock_Sequential/AboveCSVHeight-4 336.3µ 333.1µ ~ 0.100
GetUtxoHashes-4 275.5n 276.9n ~ 1.000
GetUtxoHashes_ManyOutputs-4 45.02µ 45.08µ ~ 0.700
_NewMetaDataFromBytes-4 216.2n 213.0n ~ 0.200
_Bytes-4 394.8n 392.0n ~ 0.700
_MetaBytes-4 339.5n 335.6n ~ 0.100

Threshold: >10% with p < 0.05 | Generated: 2026-05-20 13:45 UTC

icellan added a commit that referenced this pull request May 20, 2026
…ers (#912)

Cherry-pick of PR #912 onto this branch. This carries the txmeta hot-path
work that was developed on top of #828 and shipped as a separate PR against
main; landing it here keeps #828's branch in sync so the merged form is
internally consistent and a no-op vs main once #912 is merged upstream.

End-to-end optimisation of the txmeta path between the validator (producer)
and the subtree-validation receiver. Bench delivers 11.1M tx/s through the
full producer→receiver pipeline at batch=1000 with 32 partitions on M3 Max
against a real production-config Native cache, up from 749K-1.14M on the
shard-fanout baseline.

What's in this change:

1. Wire format v2 (services/validator/Validator.go,
   services/subtreevalidation/txmetaHandler.go). Backwards-compatible
   alongside v1; receiver auto-detects via magic byte 0xFF.

       [1 byte]  magic = 0xFF
       [1 byte]  version = 0x02
       [2 bytes] reserved
       [4 bytes] entry count
       per entry:
         [8 bytes]  xxhash(tx hash)
         [32 bytes] tx hash
         [1 byte]   action (0=ADD, 1=DELETE)
         [4 bytes]  content length
         [N bytes]  content

   Producer groups items by partition using
       bucketIdx           = xxhash(hash) % BucketsCount
       bucketsPerPartition = BucketsCount / NumPartitions
       partition           = bucketIdx / bucketsPerPartition

   so each Kafka partition's records map to a disjoint contiguous range
   of receiver cache buckets. Concurrent per-partition receiver writers
   touch disjoint buckets and run lock-free relative to each other.

   Producer setting validator_txmeta_wireFormat (v1|v2, default v1) and
   validator_txmeta_numPartitions (default 32, must divide BucketsCount).

2. Receiver rewrite (txmetaHandler.go). Drops the 256-way shard
   fan-out + worker pool + caught-up latch. Parses v1 and v2 inline,
   dispatches all ADDs in one SetCacheMultiSequentialWithHashes call
   (or SetCacheMultiSequential on v1) per Kafka message, DELETEs
   inline. Parallelism comes from the per-partition Kafka consumer
   goroutines, not from internal fan-out.

3. Cache: SetMultiSequential family (stores/txmetacache/
   improved_cache.go + txmetacache.go). Partition-aware twin of
   SetMulti — same per-bucket grouping but no errgroup fan-out. The
   With-Hashes variant takes caller-supplied xxhash values so the
   receiver passes the on-wire v2 hash straight through.

4. Cache: inner-shard collapse (improved_cache.go). bucketNative
   and bucketTrimmed maps were constructed with
   NewNativeSplitLockFreeMapUint64(cap, BucketsCount), meaning every
   bucket map had 8192 inner sub-shards. cleanLockedMap rebuilt the
   inner-sharded map on every chunk wrap (24% CPU, 64% of bench
   allocations, 80% of in-use memory on the scaling-2 prod pod).
   Outer ImprovedCache already shards via h%BucketsCount and b.mu
   serialises writes — inner shard count is now 1.
   deleteFromNativeSplitMapShard / deleteFromSwissSplitMapShard
   updated in lockstep to route via shard 0.

   Eviction unchanged: cache's gen+idx rule operates on map values
   regardless of inner shard count. bucketUnallocated and
   bucketPreallocated use plain Go maps; unaffected.

5. TxMetaCacheBucketType setting (settings/
   subtreevalidation_settings.go, services/subtreevalidation/
   Server.go). Bucket type was hard-coded to Unallocated; now
   selectable via subtreevalidation_txMetaCacheBucketType
   (unallocated|preallocated|trimmed|native, default unallocated).
   Unknown values fall back to Unallocated with a Warn log.

6. Pooled producer + receiver buffers. txmetaParseBuffers (receiver)
   pools keys/values/hashes/deletes slice headers plus single-arena
   keysBuf and contentBuf. txmetaPartitionsScratch (validator) pools
   the partitions outer slice + per-partition inner slices.

7. Kafka producer manual partitioning (util/kafka/
   kafka_producer_async.go). Message.Partition field +
   KafkaProducerConfig.ManualPartitioning + WithManualPartitioning
   option. daemon/daemon_kafka.go passes it for the txmeta producer
   when wireFormat=v2.

8. In-memory broker test surface. InMemoryBroker.HasConsumer,
   DropTopic, TruncateTopic for e2e tests + long bench runs.

9. Tests. v2 serialization byte layout, partition routing,
   v2-format receiver parse, mixed ADD/DELETE, bucket-type
   resolver, end-to-end pipeline via in-memory broker
   (txmeta_e2e_test.go).

Bench numbers (M3 Max, 32 partitions, 256-byte payload, real
production-config Native cache):

  BenchmarkTxmetaProducerReceiver_V2_Batch100   3.96M tx/s   33 allocs/op
  BenchmarkTxmetaProducerReceiver_V2_Batch1000 11.10M tx/s   70 allocs/op

Bench is scheduler-bound after this change (~60% pthread_cond_wait,
~20% pthread_cond_signal). Cache writes, parse, and serialize
allocations no longer appear in the top consumers.

icellan added 2 commits May 20, 2026 15:20

The CI test job builds with the testtxmetacache tag where BucketsCount=8.
Test_sendTxMetaBatchV2_PartitionRouting hard-coded numPartitions=32 and
TestTxmetaE2E_V2_RoundTrip drove publishV2/serializeV2Partitioned with
numPartitions=32, both of which then computed BucketsCount/numPartitions
without the clamp the production sendTxMetaBatchV2 already has, panicking
with integer divide by zero.

- publishV2 and serializeV2Partitioned: mirror the production clamp
  (numPartitions>=1, bucketsPerPartition>=1, p capped at numPartitions-1)
  so the test math matches the routing code under both BucketsCount=8192
  and BucketsCount=8 builds.
- Test_sendTxMetaBatchV2_PartitionRouting: cap numPartitions to
  BucketsCount. With more partitions than buckets the production code
  clamps bucketsPerPartition to 1 and the test's "find two hashes in
  different partitions" loop would not terminate meaningfully.

Also drop a stray blank line at the end of the Server struct so gci stops
flagging Server.go.

CI runs with the testtxmetacache build tag, which sets BucketsCount=8.
The v2 partition-routing tests + bench helpers used numPartitions=32
hard-coded, so bucketsPerPartition = 8/32 = 0, then bucket/0 panics.

- services/validator/Validator_test.go: Test_sendTxMetaBatchV2_PartitionRouting
  caps numPartitions to BucketsCount and skips when BucketsCount < 2.
- services/subtreevalidation/txmeta_e2e_test.go: publishV2 and
  serializeV2Partitioned now apply the same cap that
  validator.sendTxMetaBatchV2 already had. Under testtxmetacache they
  collapse to a few partitions; under the production build they remain 32.

@sonarqubecloud

Quality Gate failed

Failed conditions
73.7% Coverage on New Code (required ≥ 80%)
4.0% Duplication on New Code (required ≤ 3%)
C Reliability Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud

@icellan icellan self-assigned this May 20, 2026
@icellan icellan merged commit f3068fd into main May 20, 2026
30 checks passed
@icellan icellan deleted the feat/txmeta-v2-pool-perf branch May 20, 2026 16:50
icellan added a commit to icellan/teranode that referenced this pull request May 21, 2026
PR bsv-blockchain#912 (perf(txmeta): v2 wire format + partition-aware receiver) landed on
main while this PR was open and added new uses of appendHeightToValue plus
two new SetCacheMultiSequential* methods that call ImprovedCache's new
SetMultiSequential / SetMultiSequentialWithHashes. My earlier "dead code
removal" was wrong by the time it landed — height encoding is part of the
active contract, not legacy.

Changes:
- Restore appendHeightToValue method, hitOldTx metric (declaration + atomic
  counter + Prometheus gauge), and noOfBlocksToKeepInTxMetaCache field +
  init. Functionally the height bytes are still ignored on read in the byte
  backend, but they're now load-bearing for the v2 receiver in bsv-blockchain#912 and the
  metric is still wired up.
- Re-introduce the height suffix on SetCacheFromBytes and SetCacheMulti.
- Add SetMultiSequential and SetMultiSequentialWithHashes to the
  cacheBackend interface; improvedCacheBackend forwards to the underlying
  *ImprovedCache; PointerCache shares its implementation with
  SetMultiFromBytes (the per-bucket goroutine fan-out doesn't exist there,
  so the partition-aware variant is identical for free).
- Fix the empty-value SetCacheMulti subtest assertion: stored entry is
  again 4 bytes (the height suffix), not 0.
icellan added a commit to icellan/teranode that referenced this pull request May 26, 2026
icellan added a commit to icellan/teranode that referenced this pull request May 27, 2026
The v2 magic-byte detection was unsafe: v1 messages with entry counts
255, 511, 767, ... begin with byte 0xFF (the LE-uint32 low byte), so the
naive `data[0] == 0xFF` check misclassified them. Count 767 in
particular aliases the entire 4-byte v2 header `0xFF 0x02 0x00 0x00`
exactly. Both legacy/netsync (this PR) and subtreevalidation (PR bsv-blockchain#912)
had the bug, silently dropping the netsync announce or garbling the
txmeta cache write.

Detection now requires the full 4-byte header signature plus a
plausibility check on the resulting v2 entry count
(entries * minEntrySize must fit in the remaining buffer); any failure
falls through to v1 parsing instead of dropping the message.

Move the wire-format constants (action bytes, magic, version, header
length, min entry size) into a new stores/txmetacache/wire.go so the
producer (validator) and all consumers (subtreevalidation,
legacy/netsync) reference one source of truth. Drops three local copies
and prevents future drift.

Regression tests cover the count=255 and count=767 collisions in both
the netsync consumer and the subtreevalidation handler, asserting that
the v1 dispatch path runs and the v2 path does not.

Addresses Copilot review feedback on PR bsv-blockchain#954.
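
The stricter detection can be sketched roughly as below. The 4-byte signature `0xFF 0x02 0x00 0x00` and the `entries * minEntrySize` plausibility rule come from the commit message; everything else is an assumption made for illustration, including the function name `looksLikeV2`, the position and width of the v2 entry count (taken here to be a little-endian uint32 right after the signature), and the `minEntrySize` value.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	signatureLen = 4  // 0xFF 0x02 0x00 0x00, per the commit message
	countLen     = 4  // assumption: LE uint32 entry count follows the signature
	minEntrySize = 41 // hypothetical value, for illustration only
)

// looksLikeV2 reports whether data should be parsed as v2. Any failed check
// returns false so the caller falls through to v1 parsing instead of
// dropping the message.
func looksLikeV2(data []byte) bool {
	if len(data) < signatureLen+countLen {
		return false
	}
	if data[0] != 0xFF || data[1] != 0x02 || data[2] != 0x00 || data[3] != 0x00 {
		return false
	}
	entries := uint64(binary.LittleEndian.Uint32(data[signatureLen:]))
	remaining := uint64(len(data) - signatureLen - countLen)
	return entries*minEntrySize <= remaining
}

// v1WithCount builds a fake v1 message: LE uint32 count plus filler bytes.
func v1WithCount(count uint32) []byte {
	buf := binary.LittleEndian.AppendUint32(nil, count)
	for i := 0; i < 16; i++ {
		buf = append(buf, 0xAB)
	}
	return buf
}

func main() {
	fmt.Println(looksLikeV2(v1WithCount(255))) // first byte 0xFF, signature mismatch
	fmt.Println(looksLikeV2(v1WithCount(767))) // aliases the signature, count implausible

	v2 := []byte{0xFF, 0x02, 0x00, 0x00, 1, 0, 0, 0}
	v2 = append(v2, make([]byte, minEntrySize)...)
	fmt.Println(looksLikeV2(v2))
}
```

A v1 message with count 767 passes the signature comparison, so only the plausibility check separates it from a real v2 message. In this sketch the filler bytes decode to an entry count far larger than the buffer can hold, so the message is classified as v1.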