Skip to content

fix(subtreevalidation): fix retry counter inflation and extract retry-wait logic#859

Merged
gokutheengineer merged 3 commits into
bsv-blockchain:mainfrom
gokutheengineer:gokhan/fix-validatesubtreeinternal
May 14, 2026
Merged

fix(subtreevalidation): fix retry counter inflation and extract retry-wait logic#859
gokutheengineer merged 3 commits into
bsv-blockchain:mainfrom
gokutheengineer:gokhan/fix-validatesubtreeinternal

Conversation

@gokutheengineer

@gokutheengineer gokutheengineer commented May 13, 2026

Copy link
Copy Markdown
Collaborator

Summary

Two correctness fixes to ValidateSubtreeInternal: a metric counter that was over-counting on every validation, and a retry-wait select that silently swallowed context cancellation instead of propagating it.

Changes

Retry counter (SubtreeValidation.go)

  • prometheusSubtreeValidationValidateSubtreeRetry was incremented unconditionally on attempt=1, inflating the metric by 1 on every successful first-attempt validation
  • Counter now only increments when attempt > 1 — i.e. on actual retries

Retry-wait logic

  • Extracted the three-case select into waitForRetryOrPriority(ctx, subtreeHash, duration)
  • The old select used break inside a case, which exits the select but not the retry loop — context cancellation was silently ignored
  • The new helper correctly returns a ContextCanceledError on ctx.Done(), propagating cancellation all the way up
  • Priority early-exit (isPrioritySubtreeCheckActive) is preserved via a 10 ms polling ticker

Tests

  • TestValidateSubtreeInternal_RetryBackoffAndMetrics — asserts:
    • With maxRetries=2, the counter increments by exactly 2 (not 3)
    • Total elapsed time is ≥ 2 × retrySleepDuration, confirming the backoff is actually applied between retries

@gokutheengineer gokutheengineer changed the title fix: validate subtree internal fix(subtreevalidation): fix retry counter inflation and extract retry-wait logic May 13, 2026
@github-actions

github-actions Bot commented May 13, 2026

Copy link
Copy Markdown
Contributor

🤖 Claude Code Review

Status: Complete


Current Review:

The PR correctly fixes two bugs:

  1. Retry counter inflation - Now only increments on actual retries (attempt > 1) instead of every attempt
  2. Context cancellation silently ignored - The old select used break which only exited the select statement, not the retry loop. The new waitForRetryOrPriority helper correctly propagates cancellation via return

Minor suggestion: Test coverage validates the retry counter and backoff timing, but could be strengthened with:

  • A test for context cancellation during retry wait (the main bug this PR fixes)
  • A test for priority early-exit behavior

The implementation itself is sound - the suggestion is optional test enhancement only.

Comment thread services/subtreevalidation/SubtreeValidation.go Outdated
@gokutheengineer gokutheengineer requested review from liam and ordishs May 13, 2026 12:58
@github-actions

github-actions Bot commented May 13, 2026

Copy link
Copy Markdown
Contributor

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-859 (f287db5)

Summary

  • Regressions: 0
  • Improvements: 0
  • Unchanged: 142
  • Significance level: p < 0.05
All benchmark results (sec/op)
Benchmark Baseline Current Change p-value
_NewBlockFromBytes-4 1.578µ 1.775µ ~ 0.200
SplitSyncedParentMap_SetIfNotExists/256_buckets-4 71.31n 71.16n ~ 0.400
SplitSyncedParentMap_SetIfNotExists/16_buckets-4 71.00n 71.22n ~ 0.200
SplitSyncedParentMap_SetIfNotExists/1_bucket-4 71.34n 71.07n ~ 0.700
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets... 37.21n 33.59n ~ 0.100
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_... 57.97n 57.35n ~ 0.700
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa... 155.5n 134.2n ~ 0.200
MiningCandidate_Stringify_Short-4 219.2n 220.1n ~ 1.000
MiningCandidate_Stringify_Long-4 1.669µ 1.624µ ~ 0.200
MiningSolution_Stringify-4 863.2n 850.6n ~ 0.100
BlockInfo_MarshalJSON-4 1.761µ 1.777µ ~ 0.100
NewFromBytes-4 128.1n 129.1n ~ 0.400
Mine_EasyDifficulty-4 60.66µ 60.48µ ~ 0.100
Mine_WithAddress-4 6.851µ 6.806µ ~ 0.700
BlockAssembler_AddTx-4 0.02649n 0.02843n ~ 0.700
AddNode-4 10.54 11.22 ~ 0.200
AddNodeWithMap-4 10.92 10.74 ~ 0.700
DirectSubtreeAdd/4_per_subtree-4 59.91n 61.49n ~ 0.400
DirectSubtreeAdd/64_per_subtree-4 31.33n 30.89n ~ 0.600
DirectSubtreeAdd/256_per_subtree-4 30.47n 29.48n ~ 0.100
DirectSubtreeAdd/1024_per_subtree-4 29.07n 28.17n ~ 0.100
DirectSubtreeAdd/2048_per_subtree-4 28.52n 27.74n ~ 0.100
SubtreeProcessorAdd/4_per_subtree-4 275.9n 276.9n ~ 0.700
SubtreeProcessorAdd/64_per_subtree-4 271.3n 271.5n ~ 1.000
SubtreeProcessorAdd/256_per_subtree-4 271.5n 275.6n ~ 0.200
SubtreeProcessorAdd/1024_per_subtree-4 263.7n 266.9n ~ 0.100
SubtreeProcessorAdd/2048_per_subtree-4 266.2n 268.4n ~ 0.300
SubtreeProcessorRotate/4_per_subtree-4 270.8n 272.1n ~ 0.400
SubtreeProcessorRotate/64_per_subtree-4 267.6n 269.3n ~ 0.100
SubtreeProcessorRotate/256_per_subtree-4 268.4n 271.6n ~ 0.100
SubtreeProcessorRotate/1024_per_subtree-4 268.8n 268.9n ~ 0.800
SubtreeNodeAddOnly/4_per_subtree-4 53.84n 54.04n ~ 0.700
SubtreeNodeAddOnly/64_per_subtree-4 34.32n 34.25n ~ 1.000
SubtreeNodeAddOnly/256_per_subtree-4 33.31n 33.34n ~ 1.000
SubtreeNodeAddOnly/1024_per_subtree-4 32.64n 32.65n ~ 1.000
SubtreeCreationOnly/4_per_subtree-4 111.5n 113.6n ~ 0.200
SubtreeCreationOnly/64_per_subtree-4 391.9n 392.8n ~ 1.000
SubtreeCreationOnly/256_per_subtree-4 1.318µ 1.319µ ~ 0.600
SubtreeCreationOnly/1024_per_subtree-4 4.396µ 4.414µ ~ 0.400
SubtreeCreationOnly/2048_per_subtree-4 8.082µ 7.896µ ~ 0.100
SubtreeProcessorOverheadBreakdown/64_per_subtree-4 270.9n 267.0n ~ 0.200
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4 268.4n 266.7n ~ 0.200
ParallelGetAndSetIfNotExists/1k_nodes-4 806.3µ 578.7µ ~ 0.100
ParallelGetAndSetIfNotExists/10k_nodes-4 1.578m 1.309m ~ 0.100
ParallelGetAndSetIfNotExists/50k_nodes-4 6.646m 6.537m ~ 0.100
ParallelGetAndSetIfNotExists/100k_nodes-4 13.17m 13.19m ~ 1.000
SequentialGetAndSetIfNotExists/1k_nodes-4 653.6µ 653.9µ ~ 1.000
SequentialGetAndSetIfNotExists/10k_nodes-4 2.737m 2.862m ~ 0.100
SequentialGetAndSetIfNotExists/50k_nodes-4 10.26m 10.35m ~ 0.100
SequentialGetAndSetIfNotExists/100k_nodes-4 19.85m 19.76m ~ 0.200
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4 632.6µ 631.8µ ~ 0.700
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4 4.168m 4.187m ~ 0.700
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4 16.42m 16.25m ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4 703.5µ 696.4µ ~ 0.400
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4 5.699m 5.710m ~ 0.700
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4 37.55m 37.18m ~ 0.100
DiskTxMap_SetIfNotExists-4 4.089µ 4.270µ ~ 0.100
DiskTxMap_SetIfNotExists_Parallel-4 3.901µ 4.164µ ~ 0.400
DiskTxMap_ExistenceOnly-4 333.9n 385.8n ~ 0.100
Queue-4 203.9n 216.4n ~ 0.100
AtomicPointer-4 8.133n 8.146n ~ 0.400
ReorgOptimizations/DedupFilterPipeline/Old/10K-4 784.0µ 809.4µ ~ 0.200
ReorgOptimizations/DedupFilterPipeline/New/10K-4 750.5µ 826.9µ ~ 0.100
ReorgOptimizations/AllMarkFalse/Old/10K-4 113.2µ 133.6µ ~ 0.100
ReorgOptimizations/AllMarkFalse/New/10K-4 58.32µ 58.63µ ~ 0.700
ReorgOptimizations/HashSlicePool/Old/10K-4 65.58µ 63.70µ ~ 0.700
ReorgOptimizations/HashSlicePool/New/10K-4 11.79µ 11.80µ ~ 0.200
ReorgOptimizations/NodeFlags/Old/10K-4 5.348µ 5.929µ ~ 0.100
ReorgOptimizations/NodeFlags/New/10K-4 1.847µ 1.965µ ~ 0.100
ReorgOptimizations/DedupFilterPipeline/Old/100K-4 12.19m 12.83m ~ 0.700
ReorgOptimizations/DedupFilterPipeline/New/100K-4 10.86m 13.11m ~ 0.100
ReorgOptimizations/AllMarkFalse/Old/100K-4 1.178m 1.275m ~ 0.100
ReorgOptimizations/AllMarkFalse/New/100K-4 725.6µ 725.3µ ~ 1.000
ReorgOptimizations/HashSlicePool/Old/100K-4 602.7µ 589.2µ ~ 0.200
ReorgOptimizations/HashSlicePool/New/100K-4 311.2µ 325.9µ ~ 0.100
ReorgOptimizations/NodeFlags/Old/100K-4 56.17µ 55.21µ ~ 0.400
ReorgOptimizations/NodeFlags/New/100K-4 18.57µ 18.47µ ~ 1.000
TxMapSetIfNotExists-4 50.45n 50.48n ~ 0.400
TxMapSetIfNotExistsDuplicate-4 43.46n 43.48n ~ 1.000
ChannelSendReceive-4 675.7n 678.0n ~ 1.000
CalcBlockWork-4 494.1n 506.0n ~ 0.400
CalculateWork-4 682.8n 689.4n ~ 0.200
BuildBlockLocatorString_Helpers/Size_10-4 1.319µ 1.354µ ~ 0.100
BuildBlockLocatorString_Helpers/Size_100-4 12.71µ 12.87µ ~ 0.100
BuildBlockLocatorString_Helpers/Size_1000-4 125.9µ 127.4µ ~ 0.700
CatchupWithHeaderCache-4 104.2m 104.1m ~ 0.400
_prepareTxsPerLevel-4 408.8m 400.2m ~ 0.200
_prepareTxsPerLevelOrdered-4 4.014m 3.606m ~ 0.700
_prepareTxsPerLevel_Comparison/Original-4 403.2m 396.6m ~ 0.700
_prepareTxsPerLevel_Comparison/Optimized-4 3.579m 3.488m ~ 0.100
SubtreeSizes/10k_tx_4_per_subtree-4 1.005m 1.016m ~ 1.000
SubtreeSizes/10k_tx_16_per_subtree-4 236.7µ 236.2µ ~ 1.000
SubtreeSizes/10k_tx_64_per_subtree-4 56.19µ 57.63µ ~ 0.100
SubtreeSizes/10k_tx_256_per_subtree-4 14.09µ 14.11µ ~ 0.600
SubtreeSizes/10k_tx_512_per_subtree-4 6.892µ 6.914µ ~ 0.700
SubtreeSizes/10k_tx_1024_per_subtree-4 3.455µ 3.574µ ~ 0.100
SubtreeSizes/10k_tx_2k_per_subtree-4 1.731µ 1.726µ ~ 1.000
BlockSizeScaling/10k_tx_64_per_subtree-4 54.81µ 55.40µ ~ 0.200
BlockSizeScaling/10k_tx_256_per_subtree-4 13.91µ 13.86µ ~ 0.700
BlockSizeScaling/10k_tx_1024_per_subtree-4 3.444µ 3.436µ ~ 0.700
BlockSizeScaling/50k_tx_64_per_subtree-4 294.1µ 290.9µ ~ 0.100
BlockSizeScaling/50k_tx_256_per_subtree-4 69.67µ 70.44µ ~ 0.400
BlockSizeScaling/50k_tx_1024_per_subtree-4 17.12µ 17.40µ ~ 0.100
SubtreeAllocations/small_subtrees_exists_check-4 116.5µ 120.0µ ~ 0.100
SubtreeAllocations/small_subtrees_data_fetch-4 123.2µ 135.1µ ~ 0.100
SubtreeAllocations/small_subtrees_full_validation-4 235.3µ 245.4µ ~ 0.100
SubtreeAllocations/medium_subtrees_exists_check-4 7.101µ 7.079µ ~ 1.000
SubtreeAllocations/medium_subtrees_data_fetch-4 7.429µ 8.158µ ~ 0.200
SubtreeAllocations/medium_subtrees_full_validation-4 13.78µ 14.13µ ~ 0.100
SubtreeAllocations/large_subtrees_exists_check-4 1.651µ 1.687µ ~ 0.200
SubtreeAllocations/large_subtrees_data_fetch-4 1.759µ 1.969µ ~ 0.200
SubtreeAllocations/large_subtrees_full_validation-4 3.481µ 3.475µ ~ 1.000
_BufferPoolAllocation/16KB-4 3.448µ 3.356µ ~ 0.700
_BufferPoolAllocation/32KB-4 7.313µ 9.241µ ~ 0.100
_BufferPoolAllocation/64KB-4 19.33µ 15.14µ ~ 0.100
_BufferPoolAllocation/128KB-4 30.71µ 31.47µ ~ 0.100
_BufferPoolAllocation/512KB-4 118.1µ 137.2µ ~ 0.100
_BufferPoolConcurrent/32KB-4 18.66µ 17.21µ ~ 0.100
_BufferPoolConcurrent/64KB-4 27.83µ 27.60µ ~ 0.400
_BufferPoolConcurrent/512KB-4 171.8µ 170.3µ ~ 1.000
_SubtreeDeserializationWithBufferSizes/16KB-4 722.3µ 725.2µ ~ 1.000
_SubtreeDeserializationWithBufferSizes/32KB-4 732.0µ 701.0µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/64KB-4 724.9µ 699.2µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/128KB-4 725.9µ 709.8µ ~ 0.400
_SubtreeDeserializationWithBufferSizes/512KB-4 720.2µ 713.5µ ~ 0.700
_SubtreeDataDeserializationWithBufferSizes/16KB-4 38.57m 37.87m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/32KB-4 38.38m 37.99m ~ 0.200
_SubtreeDataDeserializationWithBufferSizes/64KB-4 38.60m 38.03m ~ 0.200
_SubtreeDataDeserializationWithBufferSizes/128KB-4 38.23m 37.68m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/512KB-4 38.62m 37.64m ~ 0.100
_PooledVsNonPooled/Pooled-4 825.5n 824.6n ~ 0.500
_PooledVsNonPooled/NonPooled-4 6.939µ 7.313µ ~ 0.100
_MemoryFootprint/Current_512KB_32concurrent-4 8.659µ 8.198µ ~ 0.100
_MemoryFootprint/Proposed_32KB_32concurrent-4 11.56µ 11.86µ ~ 0.100
_MemoryFootprint/Alternative_64KB_32concurrent-4 11.29µ 11.07µ ~ 0.100
StoreBlock_Sequential/BelowCSVHeight-4 324.1µ 326.3µ ~ 1.000
StoreBlock_Sequential/AboveCSVHeight-4 331.2µ 326.1µ ~ 0.400
GetUtxoHashes-4 258.5n 257.6n ~ 0.200
GetUtxoHashes_ManyOutputs-4 42.86µ 42.89µ ~ 1.000
_NewMetaDataFromBytes-4 237.4n 237.8n ~ 0.700
_Bytes-4 620.0n 617.5n ~ 0.700
_MetaBytes-4 562.5n 562.0n ~ 0.500

Threshold: >10% with p < 0.05 | Generated: 2026-05-13 13:08 UTC

@sonarqubecloud

Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
73.3% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

@ordishs ordishs left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both fixes are correct and the bug is real.

The break-inside-select footgun was silently swallowing ctx.Done() — classic Go gotcha, canonical fix (extract to a helper that returns). Nice.

Worth noting that this PR also fixes a second, undocumented bug: the original three-way time.After race made the 10 ms case win almost every iteration, so BlockValidation.RetrySleep (default 1s) was effectively being ignored — retries waited ~10 ms instead of the configured duration. The new helper honours the setting properly. Good catch, but worth flagging in the PR body since this changes production retry latency: with ValidationMaxRetries=5, the worst-case retry window grows from ~50 ms to ~5 s. Probably the intended behaviour (the setting docs say "Sleep duration between validation retries"), but ops should know to expect slower failure paths under "missing txmeta" storms. Worth watching prometheusSubtreeValidationValidateSubtreeRetry post-deploy.

Suggestions (non-blocking):

  1. Add a test for the actual headline fix — ctx cancellation propagation. The new test verifies the metric and backoff timing, but not that ctx.Done() returns ContextCanceledError instead of being swallowed. A direct unit test on waitForRetryOrPriority would also cheaply cover the priority early-exit path:

    func TestWaitForRetryOrPriority_ContextCancellation(t *testing.T) {
        // ctx canceled mid-wait → returns ContextCanceledError, doesn't block for full retrySleep
    }
  2. Error wrapping: errors.NewContextCanceledError("...: %v", subtreeHash, ctx.Err()) formats ctx.Err() into the message string, so errors.Is(err, context.Canceled) won't find it via the wrapped chain — only via the custom error code. If anything upstream relies on errors.Is(err, context.Canceled), it'll miss. Minor — depends on how callers detect cancellation.

  3. Minor behaviour drift: if retrySleepDuration <= 0 || isPrioritySubtreeCheckActive(...) now short-circuits with zero wait when priority is already active on entry; the old code waited up to 10 ms first. Almost certainly an improvement, just uncovered by tests.

Strengths to call out:

  • Counter delta asserted with Equal(2) not >= — future inflation regressions will fail loudly.
  • time.NewTimer/time.NewTicker with defer Stop() plugs the old time.After leak in the retry hot path.

@gokutheengineer gokutheengineer merged commit 20ef450 into bsv-blockchain:main May 14, 2026
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants