fix flaky TestPrivateTxSubmissionRetry timing races#2089
Merged
Conversation
|
Codecov Report❌ Patch coverage is
❌ Your patch status has failed because the patch coverage (80.00%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## develop #2089 +/- ##
===========================================
- Coverage 50.61% 50.60% -0.01%
===========================================
Files 875 875
Lines 151813 151827 +14
===========================================
+ Hits 76833 76838 +5
- Misses 69902 69908 +6
- Partials 5078 5081 +3
... and 17 files with indirect coverage changes
🚀 New features to boost your workflow:
|
manav2401
approved these changes
Mar 2, 2026
pratikspatil024
approved these changes
Mar 4, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.


Description
Problem
Two subtests in
TestPrivateTxSubmissionRetryfailed intermittently on CI:The tests passed consistently on a developer laptop but failed on CI runners, making the root cause non-obvious.
Root Cause
There were two interrelated timing races, both stemming from the same pattern: the tests used
time.Sleep(N × retryInterval + 100ms)to wait for background retry goroutines to complete, then immediately asserted on call counts.Race 1 — insufficient buffer (
retry_succeeds_after_N_attempts):retryPrivateTxSubmissionsleepsprivateTxRetryInterval(2s) before each attempt. After its second sleep the goroutine wakes att ≈ 4sand spawns concurrent HTTP call goroutines. The test asserts att = 4.1s— only 100ms later. On a loaded CI runner, those HTTP calls took longer than 100ms to complete, so the assertion ran before the third call was recorded:Race 2 — cross-subtest goroutine leakage (
retry_stops_when_tx_found_in_local_database):All subtests share the same
rpcServersarray. Subtest 1's retry goroutine outlived subtest 1: when the test assertion att=4.1sraced subtest 1's second retry (still in progress), the test ended and subtest 2 immediately installed new handlers on the shared servers. Subtest 1's still-running HTTP goroutines then landed on subtest 2's server 0 handler, spuriously incrementing subtest 2'scallCounts[0]from 1 to 2.Reproducing Consistently
The failure only appeared on loaded machines because the 100ms buffer was tight. To reproduce it locally we temporarily injected a 150ms sleep inside the retry goroutine's HTTP call to simulate CI scheduler overhead:
With this, both subtests failed 3/3 runs with the exact same error messages as CI. After removing the sleep and applying the fix, both subtests passed 3/3 under the same simulated conditions.
Fix
multiclient.go— two changes:Added
retryInterval time.Durationfield tomultiClient(zero value falls back to the 2s production constant). This lets tests set a short interval (10ms) so the retry goroutine completes near-instantly, eliminating the slow sleeping from the test timeline entirely.submitPrivateTxnow returns(error, <-chan struct{}). When a retry goroutine is started, the channel is closed when the goroutine finishes. This gives callers a deterministic signal instead of forcing them to guess how long to sleep.service.go— updated the single production call site to discard the done channel (err, _ = ...), as the service has no use for it.multiclient_test.go— all retry subtests updated to:mc.retryInterval = 10 * time.Millisecondafter constructiontime.Sleep(N × retryInterval + 100ms)withselect { case <-done: case <-time.After(5s): t.Fatal(...) }This eliminates both races:
<-doneguarantees the goroutine has fully completed before any assertion runs, and since subtest N's goroutine is provably done before subtest N+1 installs new handlers, there is no cross-subtest leakage.Result
TestPrivateTxSubmissionRetrywas previously taking ~55s per run (dominated by 2s sleep intervals) and failing intermittently. It now runs in ~2.3s total and passes 10/10 with-race.Changes