fix flaky TestPrivateTxSubmissionRetry timing races by lucca30 · Pull Request #2089 · 0xPolygon/bor

lucca30 · 2026-02-26T04:00:29Z

Description

Problem

Two subtests in TestPrivateTxSubmissionRetry failed intermittently on CI:

--- FAIL: TestPrivateTxSubmissionRetry/retry_succeeds_after_N_attempts (4.10s)
    multiclient_test.go:925: expected server 0 to be called 3 times
                             expected: 3, actual: 2

--- FAIL: TestPrivateTxSubmissionRetry/retry_stops_when_tx_found_in_local_database (2.10s)
    multiclient_test.go:968: expected server 0 to be called only once, no retries after tx found
                             expected: 1, actual: 2

The tests passed consistently on a developer laptop but failed on CI runners, making the root cause non-obvious.

Root Cause

There were two interrelated timing races, both stemming from the same pattern: the tests used time.Sleep(N × retryInterval + 100ms) to wait for background retry goroutines to complete, then immediately asserted on call counts.

Race 1 — insufficient buffer (retry_succeeds_after_N_attempts):

retryPrivateTxSubmission sleeps privateTxRetryInterval (2s) before each attempt. After its second sleep the goroutine wakes at t ≈ 4s and spawns concurrent HTTP call goroutines. The test asserts at t = 4.1s — only 100ms later. On a loaded CI runner, those HTTP calls took longer than 100ms to complete, so the assertion ran before the third call was recorded:

t=0s    initial submission → counts[0]=1, retry goroutine spawned
t=2s    retry 1 → counts[0]=2, goroutine starts second 2s sleep
t=4s    goroutine wakes, spawns HTTP goroutines for retry 2
t=4.1s  test assertion → sees counts[0]=2  ← RACE: retry 2 not done yet
t=4s+ε  retry 2 goroutines complete → counts[0]=3   (too late)

Race 2 — cross-subtest goroutine leakage (retry_stops_when_tx_found_in_local_database):

All subtests share the same rpcServers array. Subtest 1's retry goroutine outlived subtest 1: when the test assertion at t=4.1s raced subtest 1's second retry (still in progress), the test ended and subtest 2 immediately installed new handlers on the shared servers. Subtest 1's still-running HTTP goroutines then landed on subtest 2's server 0 handler, spuriously incrementing subtest 2's callCounts[0] from 1 to 2.

Reproducing Consistently

The failure only appeared on loaded machines because the 100ms buffer was tight. To reproduce it locally we temporarily injected a 150ms sleep inside the retry goroutine's HTTP call to simulate CI scheduler overhead:

go func(client *rpc.Client, idx int) {
    defer retryWg.Done()
    time.Sleep(150 * time.Millisecond) // simulate CI latency
    err := client.CallContext(...)

With this, both subtests failed 3/3 runs with the exact same error messages as CI. After removing the sleep and applying the fix, both subtests passed 3/3 under the same simulated conditions.

Fix

multiclient.go — two changes:

Added retryInterval time.Duration field to multiClient (zero value falls back to the 2s production constant). This lets tests set a short interval (10ms) so the retry goroutine completes near-instantly, eliminating the slow sleeping from the test timeline entirely.
submitPrivateTx now returns (error, <-chan struct{}). When a retry goroutine is started, the channel is closed when the goroutine finishes. This gives callers a deterministic signal instead of forcing them to guess how long to sleep.

service.go — updated the single production call site to discard the done channel (err, _ = ...), as the service has no use for it.

multiclient_test.go — all retry subtests updated to:

Set mc.retryInterval = 10 * time.Millisecond after construction
Replace every time.Sleep(N × retryInterval + 100ms) with select { case <-done: case <-time.After(5s): t.Fatal(...) }

This eliminates both races: <-done guarantees the goroutine has fully completed before any assertion runs, and since subtest N's goroutine is provably done before subtest N+1 installs new handlers, there is no cross-subtest leakage.

Result

TestPrivateTxSubmissionRetry was previously taking ~55s per run (dominated by 2s sleep intervals) and failing intermittently. It now runs in ~2.3s total and passes 10/10 with -race.
No behavioural change to production retry logic — only the retry interval is now injectable for testing.

Changes

Bugfix (non-breaking change that solves an issue)
Hotfix (change that solves an urgent issue, and requires immediate attention)
New feature (non-breaking change that adds functionality)
Breaking change (change that is not backwards-compatible and/or changes current functionality)
Changes only for a subset of nodes

sonarqubecloud · 2026-02-26T04:01:51Z

Quality Gate failed

Failed conditions
28.6% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

codecov · 2026-02-26T04:27:25Z

Codecov Report

❌ Patch coverage is 80.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 50.60%. Comparing base (22fc1cb) to head (301a7ab).
⚠️ Report is 15 commits behind head on develop.

Files with missing lines	Patch %	Lines
eth/relay/multiclient.go	78.57%	2 Missing and 1 partial ⚠️

❌ Your patch status has failed because the patch coverage (80.00%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #2089      +/-   ##
===========================================
- Coverage    50.61%   50.60%   -0.01%     
===========================================
  Files          875      875              
  Lines       151813   151827      +14     
===========================================
+ Hits         76833    76838       +5     
- Misses       69902    69908       +6     
- Partials      5078     5081       +3

Files with missing lines	Coverage Δ
eth/relay/service.go	`95.62% <100.00%> (ø)`
eth/relay/multiclient.go	`91.89% <78.57%> (-0.94%)`	⬇️

... and 17 files with indirect coverage changes

Files with missing lines	Coverage Δ
eth/relay/service.go	`95.62% <100.00%> (ø)`
eth/relay/multiclient.go	`91.89% <78.57%> (-0.94%)`	⬇️

... and 17 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

fix consistency on tests

301a7ab

lucca30 requested review from a team and manav2401 February 26, 2026 04:00

manav2401 approved these changes Mar 2, 2026

View reviewed changes

lucca30 requested a review from a team March 2, 2026 20:29

pratikspatil024 approved these changes Mar 4, 2026

View reviewed changes

lucca30 merged commit d4b1cef into develop Mar 4, 2026
19 of 21 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix flaky TestPrivateTxSubmissionRetry timing races#2089

fix flaky TestPrivateTxSubmissionRetry timing races#2089
lucca30 merged 1 commit intodevelopfrom
lmartins/fix-private-submission-tests

lucca30 commented Feb 26, 2026 •

edited

Loading

Uh oh!

sonarqubecloud bot commented Feb 26, 2026

Uh oh!

codecov bot commented Feb 26, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

lucca30 commented Feb 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Problem

Root Cause

Reproducing Consistently

Fix

Result

Changes

Uh oh!

sonarqubecloud bot commented Feb 26, 2026

Quality Gate failed

Uh oh!

codecov bot commented Feb 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

lucca30 commented Feb 26, 2026 •

edited

Loading

codecov bot commented Feb 26, 2026 •

edited

Loading