fix(kafka): unwind dual flush_frequency linger; cut linger to 10ms on high-fanout topics by freemans13 · Pull Request #840 · bsv-blockchain/teranode

freemans13 · 2026-05-11T11:21:39Z

Summary

Two complementary fixes for the txmeta producer-side latency regression that landed when the kafka client was switched from Sarama to franz-go (#611). Together they take p99 publish→consume latency from ~7 s to ~22 ms on the regression test. Already deployed to dev-scale-1/2 and confirmed working at production load — see "Post-deploy validation" below.

Commit 1: `cut flush_frequency to 10ms on high-fanout topics`

The Sarama → franz-go switch silently re-wired the URL query parameter flush_frequency from "max time between flushes" to franz-go's per-partition kgo.ProducerLinger. On the high-fanout txmeta topic (256 partitions, ~5 batched msgs/s/partition at peak), each partition rarely fills 1 MiB before the 1 s linger, so every record paid up to ~1 s of producer-side delay. The subtree-validator's local cache lagged the validator by 1–2 s and every subtree triggered ProcessTxMetaUsingCache → ThresholdExceededError → 1 s RetrySleep → retry.

Drops flush_frequency from 1s to 10ms on the three high-fanout topics: kafka_txmetaConfig, kafka_validatortxsConfig.operator, kafka_legacyInvConfig. Low-volume topics (invalidBlocks, rejectedTx, unitTest) keep their existing 1s.
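
The arithmetic behind the per-partition delay can be sketched in Go. This is an illustration only: `effectiveWait` is a hypothetical helper, the 1 KiB record size is an assumed figure (the PR does not state one), and the model ignores any other flush trigger franz-go may apply. It estimates how long the first record in a batch waits, taken as the smaller of the linger and the time needed to fill the batch.

```go
package main

import (
	"fmt"
	"time"
)

// effectiveWait estimates the worst-case producer-side delay on one partition:
// a batch is sent when it fills or when the linger expires, whichever is first.
// Simplified model for illustration; not franz-go's actual algorithm.
func effectiveWait(recordsPerSec, recordBytes, batchMaxBytes float64, linger time.Duration) time.Duration {
	if recordsPerSec <= 0 || recordBytes <= 0 {
		return linger // nothing arrives to fill the batch, so only the linger applies
	}
	fillSeconds := batchMaxBytes / (recordsPerSec * recordBytes)
	if fillSeconds >= linger.Seconds() {
		return linger // the linger expires before the batch fills
	}
	return time.Duration(fillSeconds * float64(time.Second))
}

func main() {
	const batchMax = 1 << 20 // 1 MiB batch limit, as quoted in the PR
	const recordSize = 1024  // ASSUMPTION: 1 KiB per batched message
	for _, linger := range []time.Duration{time.Second, 10 * time.Millisecond} {
		// 5 records/s/partition is the peak rate quoted in the PR.
		fmt.Println("linger", linger, "-> wait", effectiveWait(5, recordSize, batchMax, linger))
	}
}
```

Under these assumptions the batch would take minutes to fill, so the linger alone sets the delay. That is why lowering it translates almost directly into lower publish latency on a sparse, high-partition topic.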

Commit 2: `decouple outer batcher linger from flush_frequency`

A subtler footgun in the same area: KafkaProducerConfig.FlushFrequency was driving two lingers at once — franz-go's per-partition ProducerLinger (the user-facing knob), and the outer async-batcher's straggler-flush timer (an internal implementation detail). Setting flush_frequency=1s stacked two 1-second lingers on the same publish path.

Introduces a new URL query param outer_batcher_linger (field: OuterBatcherLinger, default 10ms) controlling only the outer batcher. flush_frequency now controls only kgo.ProducerLinger, which is what an operator looking at the URL expects.
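
A minimal sketch of the decoupled parsing, using only the Go standard library. The type `lingerConfig`, the function `parseLingers`, and the example URL are hypothetical and are not teranode's actual code; only the two query-parameter names and the 10ms default come from this PR. An absent `flush_frequency` is left at zero here because the PR does not state its default.

```go
package main

import (
	"fmt"
	"net/url"
	"time"
)

// lingerConfig holds the two independent lingers described in the PR.
// Hypothetical type for illustration.
type lingerConfig struct {
	ProducerLinger     time.Duration // from flush_frequency; would feed kgo.ProducerLinger
	OuterBatcherLinger time.Duration // from outer_batcher_linger; outer batcher only
}

// Default stated in the PR description.
const defaultOuterBatcherLinger = 10 * time.Millisecond

// parseLingers reads both parameters independently, so that setting
// flush_frequency never changes the outer batcher's linger.
func parseLingers(rawURL string) (lingerConfig, error) {
	cfg := lingerConfig{OuterBatcherLinger: defaultOuterBatcherLinger}
	u, err := url.Parse(rawURL)
	if err != nil {
		return cfg, err
	}
	q := u.Query()
	if v := q.Get("flush_frequency"); v != "" {
		if cfg.ProducerLinger, err = time.ParseDuration(v); err != nil {
			return cfg, fmt.Errorf("flush_frequency: %w", err)
		}
	}
	if v := q.Get("outer_batcher_linger"); v != "" {
		if cfg.OuterBatcherLinger, err = time.ParseDuration(v); err != nil {
			return cfg, fmt.Errorf("outer_batcher_linger: %w", err)
		}
	}
	return cfg, nil
}

func main() {
	// Hypothetical URL; host, port and path are placeholders.
	cfg, err := parseLingers("kafka://localhost:9092/txmeta?flush_frequency=1s")
	if err != nil {
		panic(err)
	}
	fmt.Println("producer linger:", cfg.ProducerLinger)
	fmt.Println("outer batcher linger:", cfg.OuterBatcherLinger)
}
```

The point of the design is that the two values are parsed from separate keys into separate fields, so an operator who sets `flush_frequency=1s` changes one linger rather than two.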

Pre-deploy evidence

Production (Prometheus, dev-scale-1/2, Friday May 8 2026, 18:00–21:00 UTC at 1.28 M TPS peak):

- Only txmeta-dev-scale-1-scale-1 has producer-buffer backlog (mean ≈ 72 k msgs across 20 propagation pods, peak 186 k). Every other topic stays at 0.
- teranode_kafka_producer_produce_request_latency_seconds p99 ≈ 642 ms, p50 ≈ 87 ms.
- validate_subtree_retry rate ≈ 1–2 / s, matching the ~1.2 subtree/s rate, so essentially every subtree retries.
- validate_subtree_duration p99 mean = 16 s, max 127 s.
- bless_missing_transaction_count rate = 0: retries always eventually succeed; the cache does fill, it just lags.

TestLingerLatencyRegression (OrbStack-backed Redpanda, 32-partition topic, 200 records 25 ms apart):

| Code state | p50 | p99 |
| --- | --- | --- |
| Before either fix (stacked outer + franz-go linger, `flush_frequency=1s`) | 4.49 s | 6.95 s |
| After commit 2 only (single franz-go linger, `flush_frequency=1s`) | 513 ms | 1.01 s |
| After commit 1 (`flush_frequency=10ms` in settings.conf) | 21 ms | 22 ms |
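
For readers who want to reproduce p50/p99 figures like these from raw publish-to-consume latencies, here is a minimal nearest-rank percentile sketch. It is not the code used in TestLingerLatencyRegression, whose implementation is not shown in this PR; `percentile` and `syntheticSamples` are hypothetical helpers, and the samples are synthetic.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// percentile returns the nearest-rank p-th percentile (p in 1..100) of samples.
// It sorts a copy, so the caller's slice is left untouched.
func percentile(samples []time.Duration, p int) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := (p*len(sorted) + 99) / 100 // integer ceil(p/100 * n)
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// syntheticSamples returns n made-up latencies: 1ms, 2ms, ..., n ms.
func syntheticSamples(n int) []time.Duration {
	out := make([]time.Duration, n)
	for i := range out {
		out[i] = time.Duration(i+1) * time.Millisecond
	}
	return out
}

func main() {
	samples := syntheticSamples(200) // the regression test also sends 200 records
	fmt.Println("p50:", percentile(samples, 50))
	fmt.Println("p99:", percentile(samples, 99))
}
```

With only 200 records, p99 is determined by the two or three slowest samples, so single-run p99 values should be read as indicative rather than precise.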

Post-deploy validation

The matching configmap patch (flush_frequency=1s → flush_frequency=5ms on txmeta and legacyInv) was applied to dev-scale-1/2 at 2026-05-11 11:27 UTC. After ~22 min of sustained ~1.30 M TPS:

| Metric | Pre-fix (Fri peak) | Post-fix (Mon under load) | Change |
| --- | --- | --- | --- |
| Consumer rate variance | 200 k – 2.2 M/s (260% range, visible "gaps") | 1.28 – 1.32 M/s (3% range, smooth) | gaps gone |
| Producer buffered (txmeta) | mean 72 k, peak 186 k | max 2 | ~100 000× lower |
| Producer e2e latency p99 | mean 408 ms, max 1.6 s | 63 ms (flat) | ~26× lower |
| Broker write latency p99 | mean 642 ms, max 1.8 s | 63 ms (flat) | ~28× lower |
| Subtree-validator goroutines | mean 28–36 k, peaks 185 k–693 k | stable 3.8–4.0 k | ~175× lower |
| /metrics scrape duration | mean ~140 ms, max 10 s (timing out) | 5–11 ms | endpoint healthy |
| `validate_subtree_retry` rate | mean 0.9/s, peak 2.3/s (≈ 2 attempts/subtree) | mean 0.94/s = floor of 1/subtree | retries gone |
| `validate_subtree_duration` p99 | mean 16 s, max 127 s | 1.9 s and trending | ~8× faster |

The "Tx Meta read from Kafka /second" Grafana panel is now flat at ~1.3 M/s on both pods — no near-zero dips, no scrape-induced "gaps". That panel's behaviour was the originating symptom.

One thing flagged for monitoring, not a regression: bless_missing_transaction_count is now firing at very low rates (mean 0.23/s on scale-1, 0.85/s on scale-2 with one 20.76/s burst) where it was zero before. Pre-fix that path never fired because the ThresholdExceededError → 1 s retry short-circuited every cache miss. Post-fix, the retry doesn't trigger, so genuine cache misses fall through to the legitimate "fetch from UTXO store" path. The miss rate is microscopic (≈0.00007% of txs), so this is fine — but if it grows it's the right alarm signal to surface, because it'll mean the cache is undersized rather than being masked by the retry loop.
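
The miss-rate percentage can be checked with a short calculation. This is a sketch under two assumptions that the PR does not spell out: each `bless_missing_transaction_count` increment corresponds to one transaction, and the denominator is the ~1.3 M TPS throughput quoted above. `missPercent` is a hypothetical helper.

```go
package main

import "fmt"

// missPercent returns cache misses as a percentage of processed transactions.
func missPercent(missesPerSec, txPerSec float64) float64 {
	if txPerSec <= 0 {
		return 0
	}
	return missesPerSec / txPerSec * 100
}

func main() {
	const tps = 1.3e6 // ASSUMPTION: sustained throughput used as the denominator
	fmt.Printf("scale-1: %.6f%%\n", missPercent(0.23, tps))
	fmt.Printf("scale-2: %.6f%%\n", missPercent(0.85, tps))
}
```

The same function could back an alert threshold if the rate grows, although the right threshold is a judgement call the PR leaves open.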

Follow-ups (intentionally out of scope here)

dev-scale-1/2 configmap update. ✅ Already applied (with flush_frequency=5ms rather than the 10ms in this PR's defaults — both work). scale-1-shared-config.kafka_txmetaConfig in teranode-argocd-deployments was patched; the matching PR there should be linked.
256 partitions for one Redpanda broker is over-provisioned for the actual record rate; consider dropping to 32–64. Not required for this fix, but a contributing factor to the slow broker write p99 (now moot under low-linger config).
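
To make the partition-count trade-off concrete, the sketch below spreads a fixed total record rate over different partition counts. The total of 1280 batched msgs/s is derived from the figures above (~5 msgs/s/partition × 256 partitions); even distribution across partitions is an assumption, and `perPartitionRate` is a hypothetical helper.

```go
package main

import "fmt"

// perPartitionRate spreads a total record rate evenly over the partitions.
// Real key distributions are rarely perfectly even; this is an idealisation.
func perPartitionRate(totalPerSec float64, partitions int) float64 {
	if partitions <= 0 {
		return 0
	}
	return totalPerSec / float64(partitions)
}

func main() {
	const total = 5.0 * 256 // ~5 batched msgs/s/partition at 256 partitions
	for _, n := range []int{256, 64, 32} {
		fmt.Printf("%3d partitions -> %.0f msgs/s/partition\n", n, perPartitionRate(total, n))
	}
}
```

Fewer partitions means each one sees more records per second, so batches fill sooner and depend less on the linger timer.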

Test plan

- go vet ./util/kafka/ clean.
- go build ./util/kafka/ clean.
- go test -short -count=1 ./util/kafka/ passes (all unit tests, including new TestNewKafkaAsyncProducerFromURLOuterBatcherLinger cases).
- go test -v -run TestLingerLatencyRegression -timeout 5m ./util/kafka/ passes locally with the numbers above.
- Deployed to dev-scale-1/2; metrics confirm the regression is resolved (see "Post-deploy validation").
- Reviewer to confirm no settings_local.conf override for terabuild / mainnet / testnet / teratestnet relies on the old semantic.

🤖 Generated with Claude Code

The franz-go switch rewired the URL query param `flush_frequency` from Sarama's "max time between flushes" to franz-go's `kgo.ProducerLinger`, which is a PER-PARTITION linger. On the dev-scale-1/2 txmeta topic (256 partitions, ~5 batched msgs/s/partition at 1.2M TPS peak) each partition rarely fills 1MiB before the 1s linger, so every record paid ~1s of producer-side delay. The subtree-validator's local cache lagged the validator by 1-2s and every subtree hit ProcessTxMetaUsingCache's ThresholdExceededError -> 1s RetrySleep -> retry, visible as the "validate_subtree_retry" rate matching the subtree rate and as gaps in the "Tx Meta read from Kafka /second" Grafana panel.

- settings.conf: txmeta, validatortxs.operator, legacyInv get flush_frequency=10ms (was 1s). Low-volume topics (invalidBlocks, rejectedTx, unitTest) keep 1s; their per-partition rate is low enough that latency doesn't matter.
- util/kafka/kafka_producer_async.go: documentation block at the franz-go option site explaining the Sarama->franz-go semantic shift for each `flush_*` URL param, so the next operator doesn't re-introduce this.
- util/kafka/linger_latency_regression_test.go: regression test spinning up Redpanda via testcontainers, demonstrating that flush_frequency=1s produces p50 latency 200x larger than flush_frequency=10ms on a 32-partition topic with sparse feed.

NOTE: dev-scale-1/2 configmaps override flush_frequency=1s explicitly in scale-1-shared-config.kafka_txmetaConfig; those need a matching update in the teranode-argocd-deployments repo for the fix to land in those clusters.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-11T11:22:51Z

🤖 Claude Code Review

Status: Complete

Summary

This PR addresses a critical Kafka producer latency regression introduced during the Sarama → franz-go migration. The changes successfully decouple two stacked lingers that caused 1-7s publish→consume delays at production scale.

Key Changes Validated

Core Kafka Fix (util/kafka/):

✅ Introduces OuterBatcherLinger field, decoupled from FlushFrequency
✅ Correctly defaults to 10ms when not specified
✅ Comprehensive unit test coverage verifying decoupling
✅ New regression test (TestLingerLatencyRegression) with clear hypothesis

Configuration Updates (settings.conf):

✅ Reduces flush_frequency from 1s → 10ms on high-fanout topics (txmeta, validatortxs, legacyInv)
✅ Low-volume topics correctly retain 1s linger
✅ Changes match production deployment that resolved the issue

Documentation (docs/topics/services/legacy.md):

✅ Accurately describes checkpoint-based validation optimization
✅ Clear distinction between checkpointed vs. non-checkpointed block handling

Concerns

[Minor] Documentation Scope:
The openapi/CHANGES.md file describes itself as "This PR" but references a different branch name (gokhan/swagger-rpc) than the actual PR branch. The OpenRPC spec addition appears unrelated to the Kafka linger fix — it should either be in a separate PR or the description should clarify why it is bundled here.

Production Validation

The PR description includes strong production evidence from dev-scale-1/2:

Producer latency p99: 642ms → 63ms (~10× improvement)
Subtree validation retries: eliminated
Buffered message backlog: 72k → 2 (~36,000× reduction)

The fix has already been deployed and validated at 1.3M TPS for 22+ minutes with no regressions.

Recommendation

Approve — The core Kafka changes are correct, well-tested, and production-proven. The documentation accuracy is good. Consider splitting the OpenRPC changes into a separate PR for clearer change tracking.

Splits the single FlushFrequency knob that previously drove both franz-go's per-partition ProducerLinger AND the outer async-batcher's straggler-flush timer. A new URL query param `outer_batcher_linger` (field: OuterBatcherLinger, default 10ms) controls only the outer batcher; `flush_frequency` now controls only kgo.ProducerLinger, which is what an operator looking at the URL expects.

Without this fix, setting flush_frequency=1s (which on the dev-scale clusters was the intent of "match Sarama's 1s Flush.Frequency") stacked two lingers on the same publish path. The regression test (sparse feed, 32 partitions) goes from p50=4.49s/p99=6.95s with the stacked behaviour to p50=513ms/p99=1.01s with the franz-go linger alone (and to ~22ms p99 once flush_frequency is also lowered). The settings.conf change in the first commit on this branch handles the second of those steps; this change handles the first.

Adds unit-test coverage that:

- the new URL param parses and applies (250ms test value),
- flush_frequency=1s no longer influences OuterBatcherLinger.

Updates the integration test commentary to reflect that the outer batcher's linger no longer stacks with franz-go's.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
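
To illustrate what a "straggler-flush timer" in an outer batcher does, here is a minimal, generic sketch. It is not teranode's batcher, whose code is not shown in this PR; `runBatcher` and its parameters are hypothetical. A batch is flushed when it reaches `maxSize`, when `linger` has elapsed since its first item, or when the input closes.

```go
package main

import (
	"fmt"
	"time"
)

// runBatcher groups items from in and hands each batch to flush.
// The linger timer starts with the first item of a batch, so a lone
// "straggler" waits at most linger before it is sent on.
func runBatcher(in <-chan string, maxSize int, linger time.Duration, flush func([]string)) {
	var buf []string
	var deadline <-chan time.Time // nil while the buffer is empty, so its case blocks forever

	emit := func() {
		if len(buf) > 0 {
			flush(buf)
			buf = nil // start a fresh slice so flushed batches are never overwritten
		}
		deadline = nil
	}

	for {
		select {
		case item, ok := <-in:
			if !ok {
				emit() // input closed: flush whatever is left and stop
				return
			}
			if len(buf) == 0 {
				deadline = time.After(linger) // arm the straggler timer
			}
			buf = append(buf, item)
			if len(buf) >= maxSize {
				emit() // size trigger
			}
		case <-deadline:
			emit() // linger trigger
		}
	}
}

func main() {
	in := make(chan string, 3)
	in <- "a"
	in <- "b"
	in <- "c"
	close(in)
	runBatcher(in, 2, 10*time.Millisecond, func(b []string) {
		fmt.Println("flush:", b)
	})
}
```

If the `flush` callback hands records to a producer that applies its own linger, the two waits add up on the same publish path, which is the stacking this commit removes by giving the outer timer its own short default.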

sonarqubecloud · 2026-05-11T11:34:12Z

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

github-actions · 2026-05-11T11:37:34Z

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-840 (531d239)

Summary

Regressions: 0
Improvements: 0
Unchanged: 142
Significance level: p < 0.05

All benchmark results (sec/op)

Benchmark	Baseline	Current	Change	p-value
_NewBlockFromBytes-4	1.610µ	1.893µ	~	0.400
SplitSyncedParentMap_SetIfNotExists/256_buckets-4	71.23n	71.23n	~	1.000
SplitSyncedParentMap_SetIfNotExists/16_buckets-4	71.22n	71.27n	~	1.000
SplitSyncedParentMap_SetIfNotExists/1_bucket-4	71.28n	71.28n	~	0.800
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets...	38.67n	38.58n	~	1.000
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_...	57.74n	58.75n	~	0.400
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa...	152.2n	185.9n	~	0.100
MiningCandidate_Stringify_Short-4	225.4n	220.9n	~	0.100
MiningCandidate_Stringify_Long-4	1.664µ	1.663µ	~	1.000
MiningSolution_Stringify-4	847.7n	863.4n	~	0.200
BlockInfo_MarshalJSON-4	1.754µ	1.807µ	~	0.100
NewFromBytes-4	141.5n	127.6n	~	0.100
Mine_EasyDifficulty-4	60.61µ	60.58µ	~	1.000
Mine_WithAddress-4	6.730µ	6.666µ	~	0.100
DirectSubtreeAdd/4_per_subtree-4	58.32n	61.30n	~	0.100
DirectSubtreeAdd/64_per_subtree-4	28.27n	31.61n	~	0.100
DirectSubtreeAdd/256_per_subtree-4	27.27n	30.84n	~	0.100
DirectSubtreeAdd/1024_per_subtree-4	26.20n	29.29n	~	0.100
DirectSubtreeAdd/2048_per_subtree-4	25.99n	28.89n	~	0.100
SubtreeProcessorAdd/4_per_subtree-4	279.7n	283.1n	~	0.700
SubtreeProcessorAdd/64_per_subtree-4	275.6n	277.2n	~	0.100
SubtreeProcessorAdd/256_per_subtree-4	279.0n	280.4n	~	0.400
SubtreeProcessorAdd/1024_per_subtree-4	268.6n	272.1n	~	0.100
SubtreeProcessorAdd/2048_per_subtree-4	267.0n	274.0n	~	0.100
SubtreeProcessorRotate/4_per_subtree-4	273.9n	276.6n	~	0.100
SubtreeProcessorRotate/64_per_subtree-4	273.7n	277.4n	~	0.100
SubtreeProcessorRotate/256_per_subtree-4	273.2n	276.0n	~	0.100
SubtreeProcessorRotate/1024_per_subtree-4	273.0n	280.3n	~	0.100
SubtreeNodeAddOnly/4_per_subtree-4	54.35n	55.56n	~	0.100
SubtreeNodeAddOnly/64_per_subtree-4	34.30n	34.53n	~	0.300
SubtreeNodeAddOnly/256_per_subtree-4	33.36n	33.43n	~	0.700
SubtreeNodeAddOnly/1024_per_subtree-4	32.70n	32.78n	~	0.400
SubtreeCreationOnly/4_per_subtree-4	114.6n	113.7n	~	0.700
SubtreeCreationOnly/64_per_subtree-4	401.6n	403.6n	~	1.000
SubtreeCreationOnly/256_per_subtree-4	1.338µ	1.483µ	~	0.100
SubtreeCreationOnly/1024_per_subtree-4	4.349µ	4.431µ	~	0.200
SubtreeCreationOnly/2048_per_subtree-4	8.011µ	8.402µ	~	0.100
SubtreeProcessorOverheadBreakdown/64_per_subtree-4	268.2n	270.6n	~	0.400
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4	269.3n	270.4n	~	0.700
ParallelGetAndSetIfNotExists/1k_nodes-4	804.2µ	584.6µ	~	0.100
ParallelGetAndSetIfNotExists/10k_nodes-4	1.577m	1.336m	~	0.100
ParallelGetAndSetIfNotExists/50k_nodes-4	6.732m	6.747m	~	0.700
ParallelGetAndSetIfNotExists/100k_nodes-4	13.63m	13.64m	~	1.000
SequentialGetAndSetIfNotExists/1k_nodes-4	653.4µ	665.1µ	~	0.100
SequentialGetAndSetIfNotExists/10k_nodes-4	2.783m	2.777m	~	1.000
SequentialGetAndSetIfNotExists/50k_nodes-4	10.38m	10.46m	~	0.100
SequentialGetAndSetIfNotExists/100k_nodes-4	19.90m	19.85m	~	1.000
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4	637.7µ	630.7µ	~	0.700
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4	4.263m	4.167m	~	0.100
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4	16.74m	16.66m	~	1.000
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4	701.2µ	704.7µ	~	1.000
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4	5.912m	5.798m	~	0.400
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4	37.44m	38.06m	~	0.100
DiskTxMap_SetIfNotExists-4	3.735µ	3.970µ	~	1.000
DiskTxMap_SetIfNotExists_Parallel-4	3.606µ	3.562µ	~	0.700
DiskTxMap_ExistenceOnly-4	336.9n	312.8n	~	0.200
Queue-4	186.2n	185.8n	~	0.700
AtomicPointer-4	3.670n	3.279n	~	0.100
ReorgOptimizations/DedupFilterPipeline/Old/10K-4	817.8µ	833.1µ	~	0.200
ReorgOptimizations/DedupFilterPipeline/New/10K-4	776.0µ	771.5µ	~	0.400
ReorgOptimizations/AllMarkFalse/Old/10K-4	122.8µ	115.0µ	~	0.700
ReorgOptimizations/AllMarkFalse/New/10K-4	64.46µ	64.86µ	~	0.700
ReorgOptimizations/HashSlicePool/Old/10K-4	56.75µ	61.53µ	~	0.100
ReorgOptimizations/HashSlicePool/New/10K-4	10.94µ	11.03µ	~	1.000
ReorgOptimizations/NodeFlags/Old/10K-4	4.469µ	4.466µ	~	1.000
ReorgOptimizations/NodeFlags/New/10K-4	1.572µ	1.572µ	~	1.000
ReorgOptimizations/DedupFilterPipeline/Old/100K-4	9.338m	9.431m	~	0.400
ReorgOptimizations/DedupFilterPipeline/New/100K-4	10.106m	9.975m	~	1.000
ReorgOptimizations/AllMarkFalse/Old/100K-4	1.116m	1.175m	~	0.100
ReorgOptimizations/AllMarkFalse/New/100K-4	702.7µ	705.8µ	~	0.400
ReorgOptimizations/HashSlicePool/Old/100K-4	461.5µ	578.5µ	~	0.100
ReorgOptimizations/HashSlicePool/New/100K-4	205.7µ	201.4µ	~	0.400
ReorgOptimizations/NodeFlags/Old/100K-4	46.28µ	47.69µ	~	0.100
ReorgOptimizations/NodeFlags/New/100K-4	16.58µ	16.10µ	~	0.400
TxMapSetIfNotExists-4	46.40n	46.41n	~	0.500
TxMapSetIfNotExistsDuplicate-4	38.56n	38.73n	~	0.100
ChannelSendReceive-4	606.7n	614.9n	~	0.100
BlockAssembler_AddTx-4	0.03179n	0.03072n	~	1.000
AddNode-4	11.70	12.07	~	0.700
AddNodeWithMap-4	12.28	12.43	~	1.000
CalcBlockWork-4	504.2n	470.1n	~	0.100
CalculateWork-4	666.7n	633.9n	~	0.700
BuildBlockLocatorString_Helpers/Size_10-4	1.319µ	1.331µ	~	0.700
BuildBlockLocatorString_Helpers/Size_100-4	12.66µ	15.33µ	~	0.100
BuildBlockLocatorString_Helpers/Size_1000-4	157.9µ	124.8µ	~	0.100
CatchupWithHeaderCache-4	104.4m	104.3m	~	0.700
_BufferPoolAllocation/16KB-4	4.961µ	3.631µ	~	0.400
_BufferPoolAllocation/32KB-4	8.632µ	7.879µ	~	0.700
_BufferPoolAllocation/64KB-4	17.72µ	15.67µ	~	0.700
_BufferPoolAllocation/128KB-4	32.74µ	32.43µ	~	0.400
_BufferPoolAllocation/512KB-4	116.5µ	128.1µ	~	0.100
_BufferPoolConcurrent/32KB-4	18.62µ	20.40µ	~	0.100
_BufferPoolConcurrent/64KB-4	29.29µ	32.56µ	~	0.100
_BufferPoolConcurrent/512KB-4	146.2µ	159.4µ	~	0.100
_SubtreeDeserializationWithBufferSizes/16KB-4	635.5µ	680.5µ	~	0.100
_SubtreeDeserializationWithBufferSizes/32KB-4	664.6µ	669.9µ	~	1.000
_SubtreeDeserializationWithBufferSizes/64KB-4	656.0µ	665.4µ	~	0.400
_SubtreeDeserializationWithBufferSizes/128KB-4	679.0µ	668.5µ	~	0.700
_SubtreeDeserializationWithBufferSizes/512KB-4	669.7µ	686.7µ	~	0.700
_SubtreeDataDeserializationWithBufferSizes/16KB-4	36.27m	36.66m	~	0.400
_SubtreeDataDeserializationWithBufferSizes/32KB-4	36.17m	36.50m	~	0.100
_SubtreeDataDeserializationWithBufferSizes/64KB-4	36.11m	36.64m	~	0.100
_SubtreeDataDeserializationWithBufferSizes/128KB-4	35.82m	36.44m	~	0.200
_SubtreeDataDeserializationWithBufferSizes/512KB-4	35.82m	36.53m	~	0.100
_PooledVsNonPooled/Pooled-4	740.8n	743.6n	~	0.100
_PooledVsNonPooled/NonPooled-4	6.826µ	7.690µ	~	0.200
_MemoryFootprint/Current_512KB_32concurrent-4	7.210µ	7.775µ	~	0.100
_MemoryFootprint/Proposed_32KB_32concurrent-4	9.603µ	11.844µ	~	0.100
_MemoryFootprint/Alternative_64KB_32concurrent-4	9.345µ	11.048µ	~	0.100
SubtreeSizes/10k_tx_4_per_subtree-4	1.403m	1.359m	~	0.700
SubtreeSizes/10k_tx_16_per_subtree-4	332.1µ	317.2µ	~	0.200
SubtreeSizes/10k_tx_64_per_subtree-4	78.76µ	76.11µ	~	0.100
SubtreeSizes/10k_tx_256_per_subtree-4	19.74µ	19.22µ	~	0.100
SubtreeSizes/10k_tx_512_per_subtree-4	9.855µ	9.446µ	~	0.100
SubtreeSizes/10k_tx_1024_per_subtree-4	4.902µ	4.634µ	~	0.100
SubtreeSizes/10k_tx_2k_per_subtree-4	2.511µ	2.300µ	~	0.100
BlockSizeScaling/10k_tx_64_per_subtree-4	80.58µ	73.12µ	~	0.100
BlockSizeScaling/10k_tx_256_per_subtree-4	20.22µ	18.66µ	~	0.100
BlockSizeScaling/10k_tx_1024_per_subtree-4	5.108µ	4.615µ	~	0.100
BlockSizeScaling/50k_tx_64_per_subtree-4	433.1µ	396.0µ	~	0.100
BlockSizeScaling/50k_tx_256_per_subtree-4	107.03µ	93.84µ	~	0.100
BlockSizeScaling/50k_tx_1024_per_subtree-4	26.44µ	24.00µ	~	0.100
SubtreeAllocations/small_subtrees_exists_check-4	180.3µ	164.9µ	~	0.100
SubtreeAllocations/small_subtrees_data_fetch-4	179.7µ	170.0µ	~	0.100
SubtreeAllocations/small_subtrees_full_validation-4	361.9µ	332.5µ	~	0.100
SubtreeAllocations/medium_subtrees_exists_check-4	10.360µ	9.671µ	~	0.100
SubtreeAllocations/medium_subtrees_data_fetch-4	10.72µ	10.27µ	~	0.100
SubtreeAllocations/medium_subtrees_full_validation-4	20.82µ	19.52µ	~	0.100
SubtreeAllocations/large_subtrees_exists_check-4	2.455µ	2.333µ	~	0.100
SubtreeAllocations/large_subtrees_data_fetch-4	2.536µ	2.432µ	~	0.100
SubtreeAllocations/large_subtrees_full_validation-4	5.186µ	4.843µ	~	0.100
_prepareTxsPerLevel-4	394.6m	396.0m	~	0.400
_prepareTxsPerLevelOrdered-4	3.950m	4.015m	~	0.700
_prepareTxsPerLevel_Comparison/Original-4	406.4m	404.1m	~	0.700
_prepareTxsPerLevel_Comparison/Optimized-4	3.606m	3.502m	~	0.100
StoreBlock_Sequential/BelowCSVHeight-4	303.0µ	315.8µ	~	0.100
StoreBlock_Sequential/AboveCSVHeight-4	312.7µ	313.8µ	~	0.700
GetUtxoHashes-4	271.5n	274.2n	~	1.000
GetUtxoHashes_ManyOutputs-4	45.81µ	46.28µ	~	0.700
_NewMetaDataFromBytes-4	231.1n	230.5n	~	0.700
_Bytes-4	616.8n	608.7n	~	0.400
_MetaBytes-4	569.5n	558.6n	~	0.700

Threshold: >10% with p < 0.05 | Generated: 2026-05-14 08:57 UTC
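
The report's classification rule (a change counts only if it exceeds 10% and has p < 0.05) can be sketched as below. This is an interpretation of the stated threshold rather than the CI tool's actual code; `classify` is a hypothetical helper, and it assumes that a higher sec/op value means slower.

```go
package main

import "fmt"

// classify applies the report's stated rule: a result is flagged only when
// the relative change exceeds 10% AND the p-value is below 0.05.
func classify(baseline, current, pValue float64) string {
	const threshold = 0.10
	const alpha = 0.05
	if baseline <= 0 || pValue >= alpha {
		return "unchanged" // not statistically significant, whatever the delta
	}
	change := (current - baseline) / baseline
	switch {
	case change > threshold:
		return "regression" // more time per op
	case change < -threshold:
		return "improvement" // less time per op
	default:
		return "unchanged"
	}
}

func main() {
	// ParallelGetAndSetIfNotExists/1k_nodes-4 from the table: 804.2µ -> 584.6µ, p = 0.100.
	fmt.Println(classify(804.2, 584.6, 0.100))
}
```

This explains why rows with large raw deltas, such as the one used in `main`, still count as unchanged: their p-values do not clear the significance level.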

freemans13 · 2026-05-19T12:50:45Z

Closing in favour of #894, which is the same two-commit kafka fix but rebased onto main instead of feat/teranode-native-ops.

#894 contains only the focused 4-file change (settings.conf + 3 files in util/kafka/), with the same plain-English description, production validation, and benchmark numbers. The native-ops branch carried a lot of unrelated diff (149 files) that was making this PR hard to review as a standalone kafka fix.

Same production validation applies — already deployed and confirmed on dev-scale-1/2 since 2026-05-11.

freemans13 and others added 4 commits May 11, 2026 09:00

fix(subtreevalidation): resolve cross-subtree tx dependencies in bloc…

fff4425

…k order (bsv-blockchain#717) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

fix(tests): de-flake go test races on main (bsv-blockchain#837)

ae75cbc

fix(validator): synchronise mtpStore with sync.RWMutex (bsv-blockchai…

b55e1f1

…n#838)

freemans13 changed the title ~~fix(kafka): cut flush_frequency to 10ms on high-fanout topics~~ fix(kafka): unwind dual flush_frequency linger; cut linger to 10ms on high-fanout topics May 11, 2026

oskarszoon and others added 21 commits May 11, 2026 17:50

fix(validator): bump gobdk to hot-fix with per-CheckSig signature cac…

b6405b0

…he (bsv-blockchain#842)

test(blockassembly/subtreeprocessor): add clock seam for deterministi…

47705bd

…c tests (bsv-blockchain#841)

fix(legacy): suppress tx relay while FSM != RUNNING (bsv-blockchain#843)

b040705

fix(blockchain): reject FSM RUN below highest network checkpoint (bsv…

f8903b7

…-blockchain#844)

fix(model): reject duplicate subtree roots in top-level merkle tree (b…

7fb7d89

…sv-blockchain#761)

fix(utxo): propagate panic recovery as error in Spend (bsv-blockchain…

300574a

…#762)

fix(validator): require BlockIDs+!Conflicting+!Locked before ErrTxNot…

bf96b7c

…Found shortcut (bsv-blockchain#770)

feat: rpcserver swagger openrpc (bsv-blockchain#763)

6701866

Co-authored-by: gokhan-sagirlar <gokhan.sagirlar@coinbase.com>

fix(blockassembly/subtreeprocessor): zero-guard validFromMillis in de…

ed8ce03

…queueDuringBlockMovement (bsv-blockchain#846)

fix(blockvalidation): surface metadata fetch error in ValidateBlock i…

ecff37f

…nstead of assuming valid (bsv-blockchain#778)

fix(legacy): gate quickValidation on checkpoints, not FSM state (bsv-…

a1a8414

…blockchain#847)

chore(deps): bump go-chaincfg to v1.5.8 (bsv-blockchain#849)

fab9b5a

fix(blockassembly/subtreeprocessor): stop losing the boundary batch i…

5ed3da0

…n Reset drain (bsv-blockchain#851)

Test/p2p handler coverage (bsv-blockchain#822)

16c7e47

docs(validator): clarify ValidateTransactionBatch.Valid is always true (

caeacc1

bsv-blockchain#769)

fix(utxo): roll back ProcessConflicting on step-3+ failure (bsv-block…

d6eea7d

…chain#765)

fix(utxo/aerospike): close 4 silent-failure paths in sendStoreBatch (b…

b5d9e81

…sv-blockchain#853)

fix(legacy): merge blockID into pre-existing tx in createUtxos (bsv-b…

efaeedf

…lockchain#854)

fix(p2p): use per-connection context in HandleWebSocket (bsv-blockcha…

38b1a35

…in#774)

fix(blockchain): skip FSM RUN gate when transitioning from IDLE (bsv-…

adf9e66

…blockchain#855)

fix(ci): prevent zombie sticky comments on CCR timeout or race (bsv-b…

c852197

…lockchain#860)

oskarszoon and others added 3 commits May 14, 2026 09:41

fix(blockassembly): inverse ProcessConflicting in moveBackBlock — pic…

215d616

…k counter by first-seen CreatedAt (bsv-blockchain#845)

test(blockassembly/subtreeprocessor): rapid property tests for validF…

4cdd6c4

…romMillis admit logic (bsv-blockchain#848)

Merge remote-tracking branch 'upstream/main' into fix/kafka-flush-fre…

db6549c

…quency-linger-regression # Conflicts: # go.mod # stores/utxo/aerospike/spend.go

oskarszoon approved these changes May 14, 2026

View reviewed changes

ordishs and others added 8 commits May 14, 2026 10:39

fix(legacy/netsync): bounds-check parent output index in ExtendTransa…

974782b

…ction (bsv-blockchain#768)

test(blockassembly/subtreeprocessor): reproduce bsv-blockchain#852 mo…

6a369d9

…veForwardBlock drain loss (bsv-blockchain#856)

fix(subtreevalidation): fix retry counter inflation and extract retry…

20ef450

…-wait logic (bsv-blockchain#859)

fix(subtreevalidation): replace unbounded goroutine per Kafka message…

6eff0ab

… with bounded shard worker pool (bsv-blockchain#858)

fix(txmetacache): restore delete correctness for native and trimmed (b…

f145df7

…sv-blockchain#836)

tx metacache performance improvements (bsv-blockchain#820)

3c30bc9

fix(blockvalidation): don't short-circuit catchup on forked peer's ze…

c4fa275

…ro headers (bsv-blockchain#718) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(pruner): skip parent updates for already-pruned parents (bsv-blo…

12b6408

…ckchain#628) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

freemans13 self-assigned this May 14, 2026

shruggr and others added 11 commits May 15, 2026 16:35

fix(propagation): correct HTTP status mapping for missing-parent and …

4110a49

…duplicate-tx (bsv-blockchain#870)

Ensure atomic file store publication (bsv-blockchain#826)

e2c10f2

Co-authored-by: Siggi <siggi.oskarsson@bsvassociation.org>

Stream asset subtree responses to reduce memory use (bsv-blockchain#827)

7eb1aa6

fix(blockchain): root-cause fix for block-assembly stall after silent…

3090446

… subscription drift (refs bsv-blockchain#872) (bsv-blockchain#878)

perf(legacy): drop cumulative-stats replay; right-size getdata InvList (

9799a1b

bsv-blockchain#876)

MvP-4597 Upgrade bdk 1.2.4 (bsv-blockchain#839)

b61c1f8

perf(propagation): process /txs batch concurrently with ordered errors (

cfcb9ca

bsv-blockchain#879)

fix(asset): address review feedback from bsv-blockchain#827 (streamin…

5d3286d

…g follow-ups) (bsv-blockchain#880)

fix(setMinedMulti): enforce coverage invariant across store, model, c…

fe3e73a

…ache, and block validation (bsv-blockchain#850) Co-authored-by: Simon Ordish <71426+ordishs@users.noreply.github.com>

perf(blockassembly): recycle tx maps + bucket-sharded inserts to cut …

8fea71e

…moveForwardBlock latency (bsv-blockchain#877)

Merge remote-tracking branch 'upstream/main' into fix/kafka-flush-fre…

40b0fd3

…quency-linger-regression

freemans13 mentioned this pull request May 19, 2026

fix(kafka): stop producer waiting up to 1s per message on busy topics (p99 7s → 22ms) #894

Merged

7 tasks

freemans13 closed this May 19, 2026

freemans13 wants to merge 48 commits into bsv-blockchain:feat/teranode-native-ops from freemans13:fix/kafka-flush-frequency-linger-regression

8 participants
