fix(subtreevalidation): resolve cross-subtree tx dependencies in block order#717
Conversation
|
🤖 Claude Code Review Status: Complete Summary: No issues found. This PR implements a well-reasoned fix for cross-subtree transaction dependency ordering during block validation catch-up. Key changes verified:
Test coverage:
Documentation: PR description and inline comments clearly explain the three-phase validation approach, performance characteristics across different dependency densities, and relationship to real-world failure (teratestnet block 15,631). History:
|
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-05-06 13:01 UTC |
…k order Block validation's parallel subtree pass raced on cross-subtree parent dependencies. When subtree B spent an output from subtree A and both were validated concurrently, B could observe the cache before A had populated it and fail with TX_MISSING_PARENT. The subsequent sequential pass walked failures in goroutine-completion order, so dependent subtrees still appeared before their parents — producing the same failure on retry. Two small changes fix this: 1. Collect parallel-validation failures positionally into a bool slice indexed by block-subtree order. The sequential pass then iterates missingSubtrees and only re-runs entries whose slot is marked failed. Because each subtree's predecessors are validated before it, a single pass suffices — no retry loop, no convergence tracking. 2. processTransactionsInLevels defers the failure when every error is a missing-parent error, so later batches get a chance to run and surface the parent. Non-recoverable errors (invalid txs, processing errors) still fail fatally. Observed on teratestnet at block 15,631 (9,216 txs, 1,305 with cross-subtree parents) where the previous behaviour stalled sync indefinitely. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3d6d7d2 to
35da5ec
Compare
There was a problem hiding this comment.
Pull request overview
Fixes sync/catch-up failures caused by cross-subtree transaction dependencies by ensuring failed subtree revalidation runs in block order (so parents are validated before dependents).
Changes:
- Collects parallel subtree-validation failures positionally (by subtree index) instead of goroutine-completion order.
- Re-validates failed subtrees in a single ordered pass over
missingSubtrees. - Updates
processTransactionsInLevelsto treat “all errors are missing-parent” as non-fatal (defer to ordered revalidation), and adjusts unit tests accordingly.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| services/subtreevalidation/check_block_subtrees.go | Switches failure tracking to positional indexing and revalidates failures in block order; defers batch failure when errors are exclusively missing-parent. |
| services/subtreevalidation/check_block_subtrees_test.go | Updates processTransactionsInLevels tests to reflect missing-parent errors being deferred (non-fatal). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Parallel validation failures are collected positionally so the sequential | ||
| // revalidation pass below walks them in block-subtree order. Cross-subtree | ||
| // parent dependencies within a block only resolve left-to-right; arbitrary | ||
| // goroutine-completion order would leave children ahead of their parents. | ||
| failedParallel := make([]bool, len(missingSubtrees)) |
There was a problem hiding this comment.
The new phase-2 positional failure tracking + phase-3 ordered single-pass revalidation is not covered by a test in this PR. Given this is a consensus-adjacent validation path and the ordering behavior is the core fix, please add a test that fails in parallel due to a cross-subtree parent dependency and then succeeds when revalidated in missingSubtrees order (and would have failed if revalidated in goroutine-completion order). This helps prevent regressions in the phase-2/phase-3 interaction.
There was a problem hiding this comment.
Added unit test coverage in commit bab48f3: TestValidateMissingSubtreesWithOrderedRetry with four subtests covering the phase-2/phase-3 contract. The two ordering-critical subtests (CrossSubtreeDependencies_RevalidateInBlockOrder and RevalidationOrderWouldFailIfNotBlockOrder) model chain dependencies where phase 2 fails for everything except subtree 0, then assert phase 3 resolves the failures in strict missingSubtrees order — any other walk order would reproduce the original TxMissingParent bug. The PersistentFailureInPhase3_IsReturned subtest asserts real failures still surface. The phase-2/phase-3 logic was extracted into a helper validateMissingSubtreesWithOrderedRetry taking a validate closure so these invariants can be tested without full subtree/peer/store infrastructure.
Extracts the phase-2 parallel + phase-3 ordered-sequential revalidation into a testable helper (validateMissingSubtreesWithOrderedRetry) that takes a validate function, so the ordering contract can be unit tested without full subtree/peer/store infrastructure. Adds TestValidateMissingSubtreesWithOrderedRetry with four subtests: - AllParallelSucceed_NoRevalidation — phase 3 is a no-op when phase 2 succeeds for every subtree. - CrossSubtreeDependencies_RevalidateInBlockOrder — chain-dependency case where phase 2 only succeeds for subtree 0, and phase 3 must resolve subtrees 1..n in block order. - PersistentFailureInPhase3_IsReturned — a non-recoverable failure in phase 3 surfaces to the caller. - RevalidationOrderWouldFailIfNotBlockOrder — every subtree fails in phase 2; phase 3's ordered walk is the only path to success. Any non-block-order walk would reproduce the original bug. Addresses Copilot review comment on PR bsv-blockchain#717. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
[Critical] Redundant error check at line 868-869 In check_block_subtrees.go:868-869, there is a redundant errors.Is(err, errors.ErrTxInvalid) check that is already nested inside a conditional (line 864) that guarantees this condition is true. This means the code will always return err for any invalid transaction, preventing the function from reaching line 875 and completing all levels. Impact: This defeats the purpose of the missing-parent deferral logic added in lines 910-922. When a truly invalid transaction is encountered, the function returns early instead of continuing to process remaining levels and deferring to phase-3 revalidation. Suggested fix: Remove the redundant nested check on line 868. The outer condition already ensures we are handling invalid (non-policy) transactions. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The outer branch at line 864 already guarantees ErrTxInvalid && !ErrTxPolicy, so the nested `if errors.Is(err, errors.ErrTxInvalid)` was always true and the return unconditional. Inline the return and note that phase 3 can't resolve these, so the fail-fast is intentional (unlike missing-parent errors, which take the ErrTxMissingParent branch and are deferred to phase 3). Pre-existing dead check, flagged during PR bsv-blockchain#717 review. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Fixed in a0b2534 — inlined the return. One correction on the impact analysis: this branch doesn't defeat the missing-parent deferral. Missing-parent errors take the Also noting for context: the redundant check was pre-existing (introduced 2025-08-28, not by this PR), but worth cleaning up while the file is open. |
|



Summary
CheckBlockSubtreesvalidates a block's subtrees in three phases:processTransactionsInLevelsprocesses each batch of subtrees' transactions in dependency-level order, populating the UTXO store and validator cache.CheckBlockSubtreesConcurrency). Failures are collected for phase 3.Phase 2's parallelism is justified at scale (e.g. on dev-scale-1, ~600 subtrees × ~1M txs per block) because each subtree's validation is largely independent I/O + cache lookups. But concurrency races on cross-subtree parent dependencies: when subtree B spends an output created by a tx in subtree A and they run at the same time, B may observe the cache before A has populated it and fail with
TX_MISSING_PARENT.When this code path runs
This is the catch-up / sync path, not steady-state. A well-connected teranode processes each subtree via the subtree-notify flow and builds its local copy ahead of the block announcement; the
missingSubtreeslist at the top ofCheckBlockSubtreesis empty and the function returns immediately. Only when a teranode falls behind (or missed a subtree notification) does it arrive here with subtrees that need to be fetched from a peer and validated — meaning this code primarily affects sync throughput, not block-time latency.The bug
Phase 2 collected failures into a shared slice via a mutex-guarded append from inside the per-subtree goroutine:
Goroutines run concurrently and append as they finish failing, so the order of
revalidateSubtreesreflects goroutine-completion order, which is non-deterministic. Phase 3'sfor _, subtreeHash := range revalidateSubtreesdoes iterate sequentially, but it iterates the wrong list — a child subtree can sit ahead of its parent and fail re-validation for the sameTX_MISSING_PARENTreason it failed in phase 2.On top of that, phase 1 failed the whole block with a processing error as soon as any missing-parent tx appeared, so phase 3 often never ran at all.
Observed on teratestnet: block 15,631 (9,216 txs, 1,305 with cross-subtree parents) consistently failed and stalled sync at height 15,630.
Fix
Two small changes:
1. Phase 2 collects failures positionally (block order)
Instead of a mutex-guarded slice of failures in arrival order, phase 2 writes into a
failedParallel []boolindexed by the subtree's position inmissingSubtrees(the canonical block order).2. Phase 3 becomes a single ordered pass
Iterate
missingSubtreesand re-validate any entry wherefailedParallel[i]is set. Because block subtrees are in a dependency-respecting order and each earlier subtree is re-validated first, one pass suffices — no convergence loop, no retry limits, no error bookkeeping.3. Phase 1 defers when all errors are missing-parent
processTransactionsInLevelsstill counts missing-parent errors separately. When every error is a missing-parent error, it returnsnilinstead of failing — letting later batches run and phase 3 resolve the dependencies. Non-recoverable errors (invalid txs, processing errors) still fail fatally.Why this is simpler than a convergence loop
A convergence loop (retry until every subtree succeeds or no progress is made) also works, but its complexity exists only to compensate for unordered failure collection. Once phase 3 walks failures in block order, each subtree's predecessors have always been validated before it, so one pass is always sufficient. If a subtree fails in that pass, the failure is real — a retry wouldn't help, and surfacing it immediately is the correct behavior.
Net change: ~26 source lines added, half of which are explanatory comments.
Performance note: dense vs sparse dependency chains
Phase 2's parallelism value scales inversely with cross-subtree dependency density:
The fix is correct in both cases: phase 3's ordered single-pass converges in one sweep regardless of density, so worst-case wall time is bounded by the sequential cost of cross-subtree dep resolution (which is inescapable — chains this tight must serialize). The convergence loop was O(n) passes in the worst case; this is O(1).
Whether phase 2 is worth keeping for dense-chain sync is a separate perf question (a future tunable or density heuristic could skip it). For now it's cheap to leave in because failed attempts fail fast on cache misses, and removing it would regress sparse sync.
Test plan
go test -tags testtxmetacache ./services/subtreevalidation/...— 281 / 281 passgo build ./...cleanThe change is additive on the happy path: blocks where every subtree succeeds in phase 2 skip phase 3 entirely, byte-for-byte identical to before.
Relationship to #715 / #646
This supersedes the earlier convergence-loop version of this PR (and the unrelated gate-logic work in #646, handled by #715).
🤖 Generated with Claude Code