fix(kafka): unwind dual flush_frequency linger; cut linger to 10ms on high-fanout topics#840
Conversation
…k order (bsv-blockchain#717) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The franz-go switch rewired the URL query param `flush_frequency` from
Sarama's "max time between flushes" to franz-go's `kgo.ProducerLinger`,
which is a PER-PARTITION linger. On the dev-scale-1/2 txmeta topic
(256 partitions, ~5 batched msgs/s/partition at 1.2M TPS peak) each
partition rarely fills 1MiB before the 1s linger, so every record paid
~1s of producer-side delay. The subtree-validator's local cache lagged
the validator by 1-2s and every subtree hit ProcessTxMetaUsingCache's
ThresholdExceededError -> 1s RetrySleep -> retry — visible as the
"validate_subtree_retry" rate matching the subtree rate and as gaps in
the "Tx Meta read from Kafka /second" Grafana panel.
- settings.conf: txmeta, validatortxs.operator, legacyInv get
flush_frequency=10ms (was 1s). Low-volume topics (invalidBlocks,
rejectedTx, unitTest) keep 1s; their per-partition rate is low
enough that latency doesn't matter.
- util/kafka/kafka_producer_async.go: documentation block at the
franz-go option site explaining the Sarama->franz-go semantic
shift for each `flush_*` URL param, so the next operator doesn't
re-introduce this.
- util/kafka/linger_latency_regression_test.go: regression test
spinning up Redpanda via testcontainers, demonstrating that
flush_frequency=1s produces p50 latency 200x larger than
flush_frequency=10ms on a 32-partition topic with sparse feed.
NOTE: dev-scale-1/2 configmaps override flush_frequency=1s explicitly
in scale-1-shared-config.kafka_txmetaConfig — those need a matching
update in the teranode-argocd-deployments repo for the fix to land
in those clusters.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
🤖 Claude Code Review Status: Complete SummaryThis PR addresses a critical Kafka producer latency regression introduced during the Sarama → franz-go migration. The changes successfully decouple two stacked lingers that caused 1-7s publish→consume delays at production scale. Key Changes ValidatedCore Kafka Fix (util/kafka/):
Configuration Updates (settings.conf):
Documentation (docs/topics/services/legacy.md):
Concerns[Minor] Documentation Scope: Production ValidationThe PR description includes strong production evidence from dev-scale-1/2:
The fix has already been deployed and validated at 1.3M TPS for 22+ minutes with no regressions. RecommendationApprove — The core Kafka changes are correct, well-tested, and production-proven. The documentation accuracy is good. Consider splitting the OpenRPC changes into a separate PR for clearer change tracking. |
Splits the single FlushFrequency knob that previously drove both franz-go's per-partition ProducerLinger AND the outer async-batcher's straggler-flush timer. A new URL query param `outer_batcher_linger` (field: OuterBatcherLinger, default 10ms) controls only the outer batcher; `flush_frequency` now controls only kgo.ProducerLinger, which is what an operator looking at the URL expects. Without this fix, setting flush_frequency=1s — which on the dev-scale clusters was the intent of "match Sarama's 1s Flush.Frequency" — stacked two lingers on the same publish path. The regression test (sparse feed, 32 partitions) goes from p50=4.49s/p99=6.95s with the stacked behaviour to p50=513ms/p99=1.01s with the franz-go linger alone (and to ~22ms p99 once flush_frequency is also lowered). The settings.conf change in the first commit on this branch handles the second of those steps; this change handles the first. Adds unit-test coverage that: - the new URL param parses and applies (250ms test value), - flush_frequency=1s no longer influences OuterBatcherLinger. Updates the integration test commentary to reflect that the outer batcher's linger no longer stacks with franz-go's. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-05-14 08:57 UTC |
Co-authored-by: gokhan-sagirlar <gokhan.sagirlar@coinbase.com>
…queueDuringBlockMovement (bsv-blockchain#846)
…nstead of assuming valid (bsv-blockchain#778)
…k counter by first-seen CreatedAt (bsv-blockchain#845)
…romMillis admit logic (bsv-blockchain#848)
…quency-linger-regression # Conflicts: # go.mod # stores/utxo/aerospike/spend.go
…veForwardBlock drain loss (bsv-blockchain#856)
… with bounded shard worker pool (bsv-blockchain#858)
…ro headers (bsv-blockchain#718) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ckchain#628) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Siggi <siggi.oskarsson@bsvassociation.org>
… subscription drift (refs bsv-blockchain#872) (bsv-blockchain#878)
…ache, and block validation (bsv-blockchain#850) Co-authored-by: Simon Ordish <71426+ordishs@users.noreply.github.com>
…moveForwardBlock latency (bsv-blockchain#877)
…quency-linger-regression
|
Closing in favour of #894, which is the same two-commit kafka fix but rebased onto #894 contains only the focused 4-file change (settings.conf + 3 files in util/kafka/), with the same plain-English description, production validation, and benchmark numbers. The native-ops branch carried a lot of unrelated diff (149 files) that was making this PR hard to review as a standalone kafka fix. Same production validation applies — already deployed and confirmed on dev-scale-1/2 since 2026-05-11. |



Summary
Two complementary fixes for the txmeta producer-side latency regression that landed when the kafka client was switched from Sarama to franz-go (#611). Together they take p99 publish→consume latency from ~7 s to ~22 ms on the regression test. Already deployed to dev-scale-1/2 and confirmed working at production load — see "Post-deploy validation" below.
Commit 1:
cut flush_frequency to 10ms on high-fanout topicsThe Sarama → franz-go switch silently re-wired the URL query parameter
flush_frequencyfrom "max time between flushes" to franz-go's per-partitionkgo.ProducerLinger. On the high-fanout txmeta topic (256 partitions, ~5 batched msgs/s/partition at peak), each partition rarely fills 1 MiB before the 1 s linger, so every record paid up to ~1 s of producer-side delay. The subtree-validator's local cache lagged the validator by 1–2 s and every subtree triggeredProcessTxMetaUsingCache→ThresholdExceededError→ 1 sRetrySleep→ retry.Drops
flush_frequencyfrom1sto10mson the three high-fanout topics:kafka_txmetaConfig,kafka_validatortxsConfig.operator,kafka_legacyInvConfig. Low-volume topics (invalidBlocks,rejectedTx,unitTest) keep their existing1s.Commit 2:
decouple outer batcher linger from flush_frequencyA subtler footgun in the same area:
KafkaProducerConfig.FlushFrequencywas driving two lingers at once — franz-go's per-partitionProducerLinger(the user-facing knob), and the outer async-batcher's straggler-flush timer (an internal implementation detail). Settingflush_frequency=1sstacked two 1-second lingers on the same publish path.Introduces a new URL query param
outer_batcher_linger(field:OuterBatcherLinger, default10ms) controlling only the outer batcher.flush_frequencynow controls onlykgo.ProducerLinger, which is what an operator looking at the URL expects.Pre-deploy evidence
Production (Prometheus, dev-scale-1/2, Friday May 8 2026, 18:00–21:00 UTC at 1.28 M TPS peak):
txmeta-dev-scale-1-scale-1has producer-buffer backlog (mean ≈ 72 k msgs across 20 propagation pods, peak 186 k). Every other topic stays at 0.teranode_kafka_producer_produce_request_latency_secondsp99 ≈ 642 ms, p50 ≈ 87 ms.validate_subtree_retryrate ≈ 1–2 / s, matching the ~1.2 subtree/s rate — basically every subtree retries.validate_subtree_durationp99 mean = 16 s, max 127 s.bless_missing_transaction_countrate = 0 — retries always eventually succeed; the cache does fill, it just lags.TestLingerLatencyRegression(OrbStack-backed Redpanda, 32-partition topic, 200 records 25 ms apart):flush_frequency=1sp50flush_frequency=1sp99flush_frequency=10msin settings.conf)Post-deploy validation
The matching configmap patch (
flush_frequency=1s→flush_frequency=5mson txmeta and legacyInv) was applied to dev-scale-1/2 at 2026-05-11 11:27 UTC. After ~22 min of sustained ~1.30 M TPS:validate_subtree_retryratevalidate_subtree_durationp99The "Tx Meta read from Kafka /second" Grafana panel is now flat at ~1.3 M/s on both pods — no near-zero dips, no scrape-induced "gaps". That panel's behaviour was the originating symptom.
One thing flagged for monitoring, not a regression:
bless_missing_transaction_countis now firing at very low rates (mean 0.23/s on scale-1, 0.85/s on scale-2 with one 20.76/s burst) where it was zero before. Pre-fix that path never fired because theThresholdExceededError → 1 s retryshort-circuited every cache miss. Post-fix, the retry doesn't trigger, so genuine cache misses fall through to the legitimate "fetch from UTXO store" path. The miss rate is microscopic (≈0.00007% of txs), so this is fine — but if it grows it's the right alarm signal to surface, because it'll mean the cache is undersized rather than being masked by the retry loop.Follow-ups (intentionally out of scope here)
flush_frequency=5msrather than the10msin this PR's defaults — both work).scale-1-shared-config.kafka_txmetaConfiginteranode-argocd-deploymentswas patched; the matching PR there should be linked.Test plan
go vet ./util/kafka/clean.go build ./util/kafka/clean.go test -short -count=1 ./util/kafka/passes (all unit tests, including newTestNewKafkaAsyncProducerFromURLOuterBatcherLingercases).go test -v -run TestLingerLatencyRegression -timeout 5m ./util/kafka/passes locally with the numbers above.settings_local.confoverride for terabuild / mainnet / testnet / teratestnet relies on the old semantic.🤖 Generated with Claude Code