feat: adaptive subtree fetch strategy for CheckBlockSubtrees#646
feat: adaptive subtree fetch strategy for CheckBlockSubtrees#646freemans13 wants to merge 16 commits into
Conversation
Replace the broken localTxsAvailable heuristic with an adaptive pre-check + sample strategy. The old code compared block assembly's total pool count against the block's transaction count — a false positive that skipped downloading subtree data even when specific transactions (e.g. coinbase splitting txs) were missing locally. New approach: - Block assembly pre-check fast-paths obvious cases (empty BA = bulk download) - Sample ~100 tx hashes from TxMetaCache to verify local availability - Phase 2 revalidation uses convergence loop instead of fail-on-first This fixes frequent invalid block errors on dev-scale-2 caused by cross-subtree dependencies on transactions only present on dev-scale-1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
🤖 Claude Code Review Status: Complete Current Review: Critical Issue:
Minor Issues:
Summary: History:
|
BSV peers take 7+ minutes to prepare block data for blocks deep in the chain, but the stall handler disconnects after 5 minutes. This causes an infinite sync loop where the node never receives block data. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| if hash.Equal(subtreepkg.CoinbasePlaceholderHashValue) { | ||
| continue | ||
| } | ||
| found, _ := cache.GetMetaCached(ctx, hash, &meta.Data{}) |
There was a problem hiding this comment.
[Minor - Acceptable] GetMetaCached returns (data, found bool) without error parameter in current implementation. The sampling logic treats cache misses and errors uniformly as found=false, which is appropriate for a best-effort heuristic. No change needed.
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-06-02 09:10 UTC |
There was a problem hiding this comment.
Pull request overview
This PR updates CheckBlockSubtrees to make subtree fetching/validation more robust by replacing the previous block-assembly count heuristic with an adaptive pre-check + TxMetaCache sampling strategy, and by adding a convergence-style sequential retry loop for phase-2 revalidation to better handle cross-subtree dependencies.
Changes:
- Add configurable sampling parameters (
LocalAvailabilitySampleSize,LocalAvailabilityThreshold) to decide whether to bulk-downloadsubtree_dataor rely on local availability + fallback fetching. - Introduce
sampleLocalAvailability(+ unit tests) to estimate local TxMetaCache coverage using evenly-spaced hash sampling. - Change phase-2 sequential revalidation to retry failed subtrees until progress stops (convergence loop).
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| settings/subtreevalidation_settings.go | Adds new settings for sampling size/threshold used by the adaptive subtree fetch decision. |
| services/subtreevalidation/sample_local_availability.go | Implements TxMetaCache hit-rate sampling for subtree tx-hash availability. |
| services/subtreevalidation/sample_local_availability_test.go | Unit tests covering key sampling behaviors (empty/all/none/coinbase/sample limits/no cache). |
| services/subtreevalidation/check_block_subtrees.go | Replaces old heuristic with pre-check + sampling-based bulk-download decision; adds convergence loop for phase-2 revalidation; adjusts subtree_data fetch/extraction flow based on bulkDownload. |
| services/legacy/peer/peer.go | Increases block stall response timeout from 5m to 15m. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| LocalAvailabilitySampleSize int `key:"subtreevalidation_local_availability_sample_size" desc:"Number of tx hashes to sample for local availability check" default:"100" category:"SubtreeValidation" usage:"Sample size for deciding bulk vs skip download" type:"int" longdesc:"### Purpose\nControls how many transaction hashes are sampled from a subtree to estimate local availability during block validation.\n\n### How It Works\nWhen a block arrives and block assembly has data, this many evenly-spaced transaction hashes are checked against the TxMetaCache. If the hit rate exceeds LocalAvailabilityThreshold, the bulk subtree_data download is skipped.\n\n### Recommendations\n- **100** (default) - Fast (~microseconds) and statistically reliable\n- Increase for very large subtrees where sampling accuracy matters more"` | ||
| LocalAvailabilityThreshold float64 `key:"subtreevalidation_local_availability_threshold" desc:"Hit rate above which bulk subtree_data download is skipped" default:"0.9" category:"SubtreeValidation" usage:"Threshold for local availability sampling" type:"float64" longdesc:"### Purpose\nSets the minimum TxMetaCache hit rate from sampling that triggers skipping the bulk subtree_data download.\n\n### How It Works\nAfter sampling transaction hashes from a subtree, if the hit rate is above this threshold, the system assumes most transactions are available locally and skips the expensive subtree_data download. ValidateSubtreeInternal handles the few missing transactions via its 3-stage fallback.\n\n### Recommendations\n- **0.9** (default) - Skip download when 90%+ of sampled txs are cached locally\n- Lower values skip downloads more aggressively (risk more fallback work)\n- Higher values trigger bulk downloads more often (safer but more bandwidth)"` |
Copilot
AI
Mar 30, 2026
•
There was a problem hiding this comment.
[Resolved] Settings are now properly wired in settings.go:579-580 with getInt and getFloat64.
There was a problem hiding this comment.
Fixed — added getInt and getFloat64 wiring in settings.go for both LocalAvailabilitySampleSize (default 100) and LocalAvailabilityThreshold (default 0.9). Also added clamping at the call site as a safety net.
| if sampleSize > len(nodes) { | ||
| sampleSize = len(nodes) | ||
| } | ||
|
|
||
| // Evenly spaced sampling across the subtree for deterministic, representative coverage | ||
| step := len(nodes) / sampleSize | ||
| if step == 0 { | ||
| step = 1 |
Copilot
AI
Mar 30, 2026
•
There was a problem hiding this comment.
[Resolved] Guard added at line 27-29 of sample_local_availability.go clamping sampleSize to 100 when <= 0.
There was a problem hiding this comment.
Fixed — added a guard that clamps sampleSize to 100 when <= 0, preventing the divide-by-zero.
| hitRate := u.sampleLocalAvailability(ctx, sampleSubtree, u.settings.SubtreeValidation.LocalAvailabilitySampleSize) | ||
| if hitRate < u.settings.SubtreeValidation.LocalAvailabilityThreshold { | ||
| bulkDownload = true | ||
| u.logger.Infof("[CheckBlockSubtrees] Sample check: %.1f%% hit rate (threshold %.1f%%) — switching to bulk download", | ||
| hitRate*100, u.settings.SubtreeValidation.LocalAvailabilityThreshold*100) |
Copilot
AI
Mar 30, 2026
•
There was a problem hiding this comment.
[Resolved] Settings properly wired and clamping logic added at check_block_subtrees.go:206-213.
There was a problem hiding this comment.
Fixed — added clamping with safe defaults (sampleSize=100, threshold=0.9) before use, and also wired the settings in settings.go with getInt/getFloat64.
| sampleSubtree, _ = subtreepkg.NewSubtreeFromReader(subtreeReader) | ||
| subtreeReader.Close() |
Copilot
AI
Mar 30, 2026
•
There was a problem hiding this comment.
[Resolved] Errors are now handled gracefully at check_block_subtrees.go:158-163, falling back to nil/bulk download on failure.
There was a problem hiding this comment.
Fixed — now handling the NewSubtreeFromReader error (falls back to nil/bulk download on failure) and using defer for proper Close with error logging.
| sampleSubtree, _ = subtreepkg.NewIncompleteTreeByLeafCount(len(subtreeNodeBytes) / chainhash.HashSize) | ||
| if sampleSubtree != nil { | ||
| var nodeHash chainhash.Hash | ||
| for i := 0; i < len(subtreeNodeBytes)/chainhash.HashSize; i++ { | ||
| copy(nodeHash[:], subtreeNodeBytes[i*chainhash.HashSize:(i+1)*chainhash.HashSize]) | ||
| if nodeHash.Equal(subtreepkg.CoinbasePlaceholderHashValue) { | ||
| _ = sampleSubtree.AddCoinbaseNode() | ||
| } else { | ||
| _ = sampleSubtree.AddNode(nodeHash, 0, 0) | ||
| } |
Copilot
AI
Mar 30, 2026
•
There was a problem hiding this comment.
[Resolved] Validation added at check_block_subtrees.go:171-173, with proper error handling for AddCoinbaseNode and AddNode at lines 186-197.
There was a problem hiding this comment.
Fixed — added validation for response size (must be multiple of hash size), error handling for AddCoinbaseNode/AddNode calls (sets sampleSubtree to nil on failure), and error handling for NewIncompleteTreeByLeafCount.
| if len(remainingSubtrees) > 0 && len(remainingSubtrees) >= previousCount { | ||
| // No progress — return error for the first still-failing subtree | ||
| return nil, errors.WrapGRPC(errors.NewProcessingError( | ||
| "[CheckBlockSubtreesRequest] Failed to validate %d subtrees after %d convergence attempts, first: %s", | ||
| len(remainingSubtrees), attempt+1, remainingSubtrees[0].String(), lastErr, | ||
| )) |
Copilot
AI
Mar 30, 2026
•
There was a problem hiding this comment.
[Resolved] The error tracking was fixed - no longer applicable to current code structure.
There was a problem hiding this comment.
Fixed — now tracking errors per subtree in a map so the wrapped error corresponds to the reported subtree hash.
| // stallResponseTimeoutBlocks is the maximum amount of time a peer will | ||
| // wait for a block message to be received after sending a getdata | ||
| // message. | ||
| stallResponseTimeoutBlocks = 5 * time.Minute | ||
| stallResponseTimeoutBlocks = 15 * time.Minute |
Copilot
AI
Mar 30, 2026
•
There was a problem hiding this comment.
[Acknowledged] Timeout change is intentional per commit a2387a5 to prevent premature peer disconnects during large block validation.
There was a problem hiding this comment.
The timeout change from 5m to 15m was an intentional separate fix (commit a2387a5) to prevent premature disconnections from slower peers during large block downloads. It's on this branch because it was discovered during testing. Happy to move it to a separate PR if preferred.
|
- Wire LocalAvailabilitySampleSize and LocalAvailabilityThreshold in settings.go - Add safe default clamping for sample size and threshold at call site - Guard against sampleSize <= 0 in sampleLocalAvailability - Validate peer response size (multiple of hash size) and handle AddNode errors - Handle NewSubtreeFromReader errors with proper defer Close - Track per-subtree errors so error message matches reported subtree hash Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| leafCount := len(subtreeNodeBytes) / chainhash.HashSize | ||
| var treeErr error | ||
| sampleSubtree, treeErr = subtreepkg.NewIncompleteTreeByLeafCount(leafCount) | ||
| if treeErr != nil { | ||
| u.logger.Infof("[CheckBlockSubtrees] Failed to create sample subtree with %d leaves: %v", leafCount, treeErr) |
Copilot
AI
Mar 30, 2026
•
There was a problem hiding this comment.
[Duplicate] See updated assessment in comment 3011363207.
There was a problem hiding this comment.
[Critical] Security Issue: Missing subtree hash verification
After constructing sampleSubtree from peer-supplied bytes (lines 176-211), the code must verify the computed root hash matches firstHash before using it for sampling decisions.
Risk: A malicious peer can send arbitrary subtree data that does not match the expected hash, causing incorrect bulk download decisions that waste bandwidth or skip necessary downloads.
Fix: Add validation after line 209 to verify sampleSubtree.RootHash().Equal(firstHash), setting sampleSubtree to nil if the hashes do not match (falling back to bulk download).
This mirrors the hash verification already present at line 417 for the main subtree fetch path.
| if hash.Equal(subtreepkg.CoinbasePlaceholderHashValue) { | ||
| continue | ||
| } | ||
| found, _ := cache.GetMetaCached(ctx, hash, &meta.Data{}) |
Copilot
AI
Mar 30, 2026
•
There was a problem hiding this comment.
[Minor - Acceptable] GetMetaCached returns (data, found bool) - the error parameter was removed in the current implementation. Errors manifest as found=false, which is the correct behavior for sampling (treat errors as misses). Current code is correct.
Re-add blockAssemblyClient to subtreevalidation Server after PR bsv-blockchain#647 revert removed it. Wire it up in daemon and fix test callers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Block validation's sequential revalidation step (end of CheckBlockSubtrees)
did a single pass through the list of subtrees and returned on the first
validation failure. This breaks for blocks with many cross-subtree
transaction dependencies: if subtree B spends an output created by a tx
in subtree A, and B is attempted before A finishes populating the
internal tx cache, B fails with TX_MISSING_PARENT even though the parent
becomes available as soon as A's revalidation completes.
Symptom observed on teratestnet: block 15631 (9,216 txs, 1,305 with
cross-subtree parents) consistently failed with:
[processTransactionsInLevels] Completed processing with 1305 errors,
1305 transactions added to orphanage
The orphanage kept the children but block validation is all-or-nothing —
they could never be retried, so teratestnet stalled indefinitely at 15,630.
Fix: replace the single-pass revalidation with a convergence loop that
retries still-failing subtrees until either every subtree succeeds or
an iteration makes no progress (meaning the failures are real, not
ordering artefacts). Maximum iterations is bounded by the number of
subtrees since each iteration must make at least one unit of progress
to continue.
This is the "Phase 2 convergence loop" from PR bsv-blockchain#646, which I extracted
into this standalone PR as recommended in the spec for PR bsv-blockchain#715.
Testing:
- make lint: 0 issues
- go test -tags testtxmetacache ./services/subtreevalidation/... -run TestCheckBlockSubtrees: 22 passed
- Integration test: deploy to teratestnet and verify sync past 15631
🤖 Generated with [Claude Code](https://claude.com/claude-code)
oskarszoon
left a comment
There was a problem hiding this comment.
Has a merge conflict with main — please rebase before review.
…fetch # Conflicts: # settings/settings.go
# Conflicts: # services/subtreevalidation/check_block_subtrees.go # test/utils/aerospike/aerospike.go
# Conflicts: # services/subtreevalidation/check_block_subtrees.go
|
|
Superseded by #745. Both PRs rewrite the same decision point in
#745 is the more general approach and sits in the maintained lineage (replaces #539 and #715, deliberately avoids the wall-clock/FSM gating that got #598 reverted via #647, and ships a regression guard against reintroducing it). Keeping both would mean two competing adaptive-fetch mechanisms in the same code path, so closing this one in favour of #745. Note for the #745 deploy-verify: this PR was the fix for blocks from dev-scale-1 being marked invalid on dev-scale-2 — the old count-based |


Summary
localTxsAvailableheuristic (count-based) with adaptive pre-check + sample strategy that checks actual TxMetaCache contentProblem
On dev-scale-2, blocks from dev-scale-1 were frequently marked invalid. The root cause:
localTxsAvailablecompared block assembly's total pool count against the block's transaction count. With TX blasters feeding dev-scale-2 continuously, the pool count was always high — but specific transactions (coinbase splitting txs created only on dev-scale-1) were missing. This causedprocessTransactionsInLevels()to be skipped entirely, and Phase 2 revalidation failed on the first subtree error without giving parent subtrees a chance to resolve.Test plan
sampleLocalAvailabilityunit tests (6 cases: empty subtree, all cached, none cached, coinbase skip, sample size limits, no cache impl)TestCheckBlockSubtreestests passsubtreevalidationtest suite passes (34.7s)make lintpasses (0 issues)make testpasses (8183 tests, no failures in changed code)make smoketest— no failures in changed code🤖 Generated with Claude Code