Skip to content

fix flaky TestPrivateTxSubmissionRetry timing races#2089

Merged
lucca30 merged 1 commit intodevelopfrom
lmartins/fix-private-submission-tests
Mar 4, 2026
Merged

fix flaky TestPrivateTxSubmissionRetry timing races#2089
lucca30 merged 1 commit intodevelopfrom
lmartins/fix-private-submission-tests

Conversation

@lucca30
Copy link
Copy Markdown
Contributor

@lucca30 lucca30 commented Feb 26, 2026

Description

Problem

Two subtests in TestPrivateTxSubmissionRetry failed intermittently on CI:

--- FAIL: TestPrivateTxSubmissionRetry/retry_succeeds_after_N_attempts (4.10s)
    multiclient_test.go:925: expected server 0 to be called 3 times
                             expected: 3, actual: 2

--- FAIL: TestPrivateTxSubmissionRetry/retry_stops_when_tx_found_in_local_database (2.10s)
    multiclient_test.go:968: expected server 0 to be called only once, no retries after tx found
                             expected: 1, actual: 2

The tests passed consistently on a developer laptop but failed on CI runners, making the root cause non-obvious.

Root Cause

There were two interrelated timing races, both stemming from the same pattern: the tests used time.Sleep(N × retryInterval + 100ms) to wait for background retry goroutines to complete, then immediately asserted on call counts.

Race 1 — insufficient buffer (retry_succeeds_after_N_attempts):

retryPrivateTxSubmission sleeps privateTxRetryInterval (2s) before each attempt. After its second sleep the goroutine wakes at t ≈ 4s and spawns concurrent HTTP call goroutines. The test asserts at t = 4.1s — only 100ms later. On a loaded CI runner, those HTTP calls took longer than 100ms to complete, so the assertion ran before the third call was recorded:

t=0s    initial submission → counts[0]=1, retry goroutine spawned
t=2s    retry 1 → counts[0]=2, goroutine starts second 2s sleep
t=4s    goroutine wakes, spawns HTTP goroutines for retry 2
t=4.1s  test assertion → sees counts[0]=2  ← RACE: retry 2 not done yet
t=4s+ε  retry 2 goroutines complete → counts[0]=3   (too late)

Race 2 — cross-subtest goroutine leakage (retry_stops_when_tx_found_in_local_database):

All subtests share the same rpcServers array. Subtest 1's retry goroutine outlived subtest 1: when the test assertion at t=4.1s raced subtest 1's second retry (still in progress), the test ended and subtest 2 immediately installed new handlers on the shared servers. Subtest 1's still-running HTTP goroutines then landed on subtest 2's server 0 handler, spuriously incrementing subtest 2's callCounts[0] from 1 to 2.

Reproducing Consistently

The failure only appeared on loaded machines because the 100ms buffer was tight. To reproduce it locally we temporarily injected a 150ms sleep inside the retry goroutine's HTTP call to simulate CI scheduler overhead:

go func(client *rpc.Client, idx int) {
    defer retryWg.Done()
    time.Sleep(150 * time.Millisecond) // simulate CI latency
    err := client.CallContext(...)

With this, both subtests failed 3/3 runs with the exact same error messages as CI. After removing the sleep and applying the fix, both subtests passed 3/3 under the same simulated conditions.

Fix

multiclient.go — two changes:

  1. Added retryInterval time.Duration field to multiClient (zero value falls back to the 2s production constant). This lets tests set a short interval (10ms) so the retry goroutine completes near-instantly, eliminating the slow sleeping from the test timeline entirely.

  2. submitPrivateTx now returns (error, <-chan struct{}). When a retry goroutine is started, the channel is closed when the goroutine finishes. This gives callers a deterministic signal instead of forcing them to guess how long to sleep.

service.go — updated the single production call site to discard the done channel (err, _ = ...), as the service has no use for it.

multiclient_test.go — all retry subtests updated to:

  • Set mc.retryInterval = 10 * time.Millisecond after construction
  • Replace every time.Sleep(N × retryInterval + 100ms) with select { case <-done: case <-time.After(5s): t.Fatal(...) }

This eliminates both races: <-done guarantees the goroutine has fully completed before any assertion runs, and since subtest N's goroutine is provably done before subtest N+1 installs new handlers, there is no cross-subtest leakage.

Result

  • TestPrivateTxSubmissionRetry was previously taking ~55s per run (dominated by 2s sleep intervals) and failing intermittently. It now runs in ~2.3s total and passes 10/10 with -race.
  • No behavioural change to production retry logic — only the retry interval is now injectable for testing.

Changes

  • Bugfix (non-breaking change that solves an issue)
  • Hotfix (change that solves an urgent issue, and requires immediate attention)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (change that is not backwards-compatible and/or changes current functionality)
  • Changes only for a subset of nodes

@lucca30 lucca30 requested review from a team and manav2401 February 26, 2026 04:00
@sonarqubecloud
Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
28.6% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

@codecov
Copy link
Copy Markdown

codecov bot commented Feb 26, 2026

Codecov Report

❌ Patch coverage is 80.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 50.60%. Comparing base (22fc1cb) to head (301a7ab).
⚠️ Report is 15 commits behind head on develop.

Files with missing lines Patch % Lines
eth/relay/multiclient.go 78.57% 2 Missing and 1 partial ⚠️

❌ Your patch status has failed because the patch coverage (80.00%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #2089      +/-   ##
===========================================
- Coverage    50.61%   50.60%   -0.01%     
===========================================
  Files          875      875              
  Lines       151813   151827      +14     
===========================================
+ Hits         76833    76838       +5     
- Misses       69902    69908       +6     
- Partials      5078     5081       +3     
Files with missing lines Coverage Δ
eth/relay/service.go 95.62% <100.00%> (ø)
eth/relay/multiclient.go 91.89% <78.57%> (-0.94%) ⬇️

... and 17 files with indirect coverage changes

Files with missing lines Coverage Δ
eth/relay/service.go 95.62% <100.00%> (ø)
eth/relay/multiclient.go 91.89% <78.57%> (-0.94%) ⬇️

... and 17 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@lucca30 lucca30 requested a review from a team March 2, 2026 20:29
@lucca30 lucca30 merged commit d4b1cef into develop Mar 4, 2026
19 of 21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants