fix(subtreevalidation): fix retry counter inflation and extract retry-wait logic#859
Conversation
|
🤖 Claude Code Review Status: Complete Current Review: The PR correctly fixes two bugs:
Minor suggestion: Test coverage validates the retry counter and backoff timing, but could be strengthened with:
The implementation itself is sound - the suggestion is optional test enhancement only. |
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-05-13 13:08 UTC |
|
ordishs
left a comment
There was a problem hiding this comment.
Both fixes are correct and the bug is real.
The break-inside-select footgun was silently swallowing ctx.Done() — classic Go gotcha, canonical fix (extract to a helper that returns). Nice.
Worth noting that this PR also fixes a second, undocumented bug: the original three-way time.After race made the 10 ms case win almost every iteration, so BlockValidation.RetrySleep (default 1s) was effectively being ignored — retries waited ~10 ms instead of the configured duration. The new helper honours the setting properly. Good catch, but worth flagging in the PR body since this changes production retry latency: with ValidationMaxRetries=5, the worst-case retry window grows from ~50 ms to ~5 s. Probably the intended behaviour (the setting docs say "Sleep duration between validation retries"), but ops should know to expect slower failure paths under "missing txmeta" storms. Worth watching prometheusSubtreeValidationValidateSubtreeRetry post-deploy.
Suggestions (non-blocking):
-
Add a test for the actual headline fix — ctx cancellation propagation. The new test verifies the metric and backoff timing, but not that
ctx.Done()returnsContextCanceledErrorinstead of being swallowed. A direct unit test onwaitForRetryOrPrioritywould also cheaply cover the priority early-exit path:func TestWaitForRetryOrPriority_ContextCancellation(t *testing.T) { // ctx canceled mid-wait → returns ContextCanceledError, doesn't block for full retrySleep }
-
Error wrapping:
errors.NewContextCanceledError("...: %v", subtreeHash, ctx.Err())formatsctx.Err()into the message string, soerrors.Is(err, context.Canceled)won't find it via the wrapped chain — only via the custom error code. If anything upstream relies onerrors.Is(err, context.Canceled), it'll miss. Minor — depends on how callers detect cancellation. -
Minor behaviour drift:
if retrySleepDuration <= 0 || isPrioritySubtreeCheckActive(...)now short-circuits with zero wait when priority is already active on entry; the old code waited up to 10 ms first. Almost certainly an improvement, just uncovered by tests.
Strengths to call out:
- Counter delta asserted with
Equal(2)not>=— future inflation regressions will fail loudly. time.NewTimer/time.NewTickerwithdefer Stop()plugs the oldtime.Afterleak in the retry hot path.


Summary
Two correctness fixes to
ValidateSubtreeInternal: a metric counter that was over-counting on every validation, and a retry-waitselectthat silently swallowed context cancellation instead of propagating it.Changes
Retry counter (
SubtreeValidation.go)prometheusSubtreeValidationValidateSubtreeRetrywas incremented unconditionally onattempt=1, inflating the metric by 1 on every successful first-attempt validationattempt > 1— i.e. on actual retriesRetry-wait logic
selectintowaitForRetryOrPriority(ctx, subtreeHash, duration)selectusedbreakinside acase, which exits the select but not the retry loop — context cancellation was silently ignoredContextCanceledErroronctx.Done(), propagating cancellation all the way upisPrioritySubtreeCheckActive) is preserved via a 10 ms polling tickerTests
TestValidateSubtreeInternal_RetryBackoffAndMetrics— asserts:maxRetries=2, the counter increments by exactly 2 (not 3)