feat: optimize handling of policy-rejected txs from other miners#799
Conversation
|
🤖 Claude Code Review Status: Complete Current Review: The design is sound: policy-rejected txs are cached only as an optimization, are never blessed directly (they still go through full One new issue:
Minor (pre-existing open thread): the new History:
|
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-06-10 10:47 UTC |
3f62753 to
e6db8b8
Compare
|
Inline review comment for services/subtreevalidation/policy_rejected_tx.go:100 [Major] Missing hash verification - The handler trusts the hash from the Kafka message without verifying it matches the actual transaction. See the main review comment for details and code example. |
acded8e to
f76186e
Compare
c893f0f to
725c310
Compare
725c310 to
6875b4e
Compare
oskarszoon
left a comment
There was a problem hiding this comment.
Direction is right — txPolicyRejectedCache short-circuits redundant policy work from peer-validated rejections. Three things worth resolving before merge:
- Integration test checkboxes in PR body remain unchecked despite the 5 fixup commits. Worth confirming they pass before re-requesting review.
- Cache uses random eviction. Under an adversarial peer sending high volumes of distinct policy-rejected tx hashes, the cache thrashes and provides no hit-rate guarantee — turning the optimization into pure overhead at exactly the load where it's supposed to help. Consider size-bounded FIFO or LRU, or document the eviction strategy + assumed attack model explicitly.
policyRejectedTxMessageHandlervalidates the hash before insert (good) but has no rate-limit on the Kafka consumer. A flood ofKAFKA_TX_POLICY_REJECTEDmessages can spike GC pressure and starve the cache anyway. Worth a sentence in the longdesc on expected message volume + a note if backpressure handling lives elsewhere.
b711dae to
25a8912
Compare
25a8912 to
f23aae1
Compare
0dbaefd to
75958f0
Compare
oskarszoon
left a comment
There was a problem hiding this comment.
Re-review against 75958f0de. The prior Critical (320 MB → broker) is genuinely fixed, consensus/validation equivalence checks out (cached txs are re-blessed from scratch — the cache can't launder an invalid tx or substitute a conflicting one), and cache-poisoning is soundly defended. Two load/DoS issues block.
P1 — the "64 MB" cache bound is entry-count, not bytes (policy_rejected_tx.go:34-47). maxSize = maxBytes/500 evicts on entry count, but each entry holds a *bt.Tx up to 1 MB. At default 64 MB that's ~131k entries → ~187–256 MB under normal traffic, and ~128 GB under an adversarial flood of distinct ~1 MB policy-rejected txs → pod OOM. The setting longdesc says "maximum memory in MB"; it isn't. Track cumulative tx.Size() and evict by bytes, or rename to …MaxEntries and fix the "max %d MB" log line. (This is the real form of the earlier eviction concern — wrong dimension, not wrong order; eviction order is fine.)
P1 — producer publish is unsafe under load (Validator.go:~503-544). (a) Publish(...) is a blocking channel send on the validator hot path — under Kafka backpressure it stalls validateTransaction; make it non-blocking drop-on-full (cache is best-effort, a drop just falls back to the HTTP fetch path). (b) No catchup/FSM gate — it floods Kafka on every policy rejection during CATCHINGBLOCKS/LEGACYSYNCING, where the sibling rejected-tx producer deliberately stays quiet; gate it behind the same FSM check.
Minor: the consumer's "single goroutine / natural backpressure" comment is inaccurate (goroutine-per-partition-per-fetch); integration-test checkboxes still unchecked (lookup-before-fetch has no test).
| defer c.mu.RUnlock() | ||
|
|
||
| e, ok := c.entries[hash] | ||
| return e.tx, ok |
There was a problem hiding this comment.
[Major] Cache returns a shared *bt.Tx that is later mutated in place — potential data race.
Get (and lookupPolicyRejectedTxs) hand out the cached *bt.Tx pointer directly. A resolved tx is then passed to blessMissingTransaction → ValidateWithOptions → extendTransaction, which mutates the transaction in place: PreviousOutputsDecorate writes tx.Inputs[i].PreviousTxScript/PreviousTxSatoshis and then tx.SetExtended(true) (Validator.go:1517-1531).
If two subtree validations resolve the same policy-rejected tx hash from the cache concurrently (e.g. the same not-yet-mined zero-fee tx appears in subtrees from two peers validated in parallel), both goroutines read !tx.IsExtended() and call PreviousOutputsDecorate on the same object → concurrent writes to the shared bt.Tx inputs. go test -race (mandated by AGENTS.md) would flag this if the path is exercised. The HTTP-fetch fallback does not have this issue because each fetch parses fresh bytes into a new *bt.Tx.
Suggestion: return an independent copy on lookup (tx.Clone()), or store the raw bytes in the cache and parse a fresh *bt.Tx per Get. Cloning on a hit is still far cheaper than the HTTP roundtrip this cache is replacing.
|
oskarszoon
left a comment
There was a problem hiding this comment.
Re-review against aa77c5fb0. Both P1s from the prior pass are fixed and verified against current code (Go mechanics + load + offensive re-attack):
- B1 (cache OOM) — cache is now a true cumulative-byte bound (
policy_rejected_tx.go:curBytes/maxBytes, evict bytx.Size()). The ~128 GB distinct-1MB-flood vector is gone;TestTxPolicyRejectedCache_BoundedByBytespins the invariant and a concurrent-flood probe held under-race. - B2 (publish unsafe) — publish is non-blocking
TryPublish(drop-on-full, lossless → HTTP fetch) and FSM-gated duringCATCHINGBLOCKS/LEGACYSYNCINGvia a cached atomic read. No hot-path stall, no catchup flood.
Not blocking, but worth addressing before merge:
- Memory-sizing honesty.
curBytescounts serializedtx.Size(), not the retained*bt.Tx+map heap — measured ~2.75× for large txs, ~6–7× for small. At the 64 MB default that's ~440 MB actual RSS under typical traffic. The longdesc still says "maximum memory in MB"; tighten the default or state the multiplier so operators don't undersize pod limits. Validator.go:567serializes + marshals (~2 MB) before theTryPublishdrop check. Under sustained backpressure that's dead allocation per rejected tx. Defer the marshal until after confirming buffer space.- Producer/consumer size caps mismatch:
KafkaMaxMessageBytes= 1,048,576 vsmaxCachedTxBytes= 1,000,000. A tx in the 1,000,001–1,048,576 B band passes the producer guard, publishes, then gets silently dropped consumer-side — wasted publish + guaranteed cache miss. Derive one from the other. - The B2 fixes aren't unit-tested —
TestPublishPolicyRejectedTxnilsblockchainClient(skips the FSM gate) and the mockTryPublishalways returns true (skips drop-on-full). Cache-poisoning hash-mismatch (PF6) still has no direct test; the defense itself is present and correct.
The merge applied cleanly textually but two signatures gained parameters independently, so callers no longer matched: - util.SafeSetLimit gained a leading logger arg (upstream); update the postgres BatchPreviousOutputsDecorate caller to pass s.logger. - validator.New (policyRejectedTx producer, upstream bsv-blockchain#799) and subtreevalidation.New (p2pClient, this branch) each gained a param; the merge auto-combined both, leaving upstream's new legacy_* integration tests one arg short. Pass nil for the missing client, matching the existing tests. Caught by golangci-lint/go vet (test files; go build alone does not compile them). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-legacysyncing Eleven upstream commits, several touching legacy code this branch also changed. Resolutions: - netsync/manager.go: upstream bsv-blockchain#1067 deleted the catchingBlocks early-return in handleBlockMsg (the swallowed orphan tip is the legacy batch-continuation signal; swallowing it stalled sync). Took upstream's always-request behaviour and updated the surrounding comments to post-LEGACYSYNCING wording. - subtreevalidation/Server.go: kept upstream bsv-blockchain#1065's richer rationale for disabling block-assembly additions, on this branch's CATCHINGBLOCKS-only condition. - Re-applied the LEGACYSYNCING removal to symbols upstream reintroduced: validator publishPolicyRejectedTx gate (bsv-blockchain#799) now checks CATCHINGBLOCKS only; the new legacy gate-pinning tests (bsv-blockchain#1065) drive CATCHINGBLOCKS instead of the removed state; comment references updated in adaptivefetch, Validator_test, and the Server.go incident note.




Summary
KAFKA_TX_POLICY_REJECTEDtopicblessMissingTransactionfor full validationNew configuration
KAFKA_TX_POLICY_REJECTEDtx-policy-rejectedkafka_txPolicyRejectedConfig""subtreevalidation_txPolicyRejectedCacheEnabledtruesubtreevalidation_txPolicyRejectedCacheMaxMB64Test plan