Fix flaky E2E tests caused by port allocation race condition #4078
Closed
aponcedeleonch wants to merge 1 commit into main
Force-pushed from 1ec841f to 25594bf
rdimitrov previously approved these changes on Mar 10, 2026
JAORMX reviewed on Mar 10, 2026
    // portMu protects recentlyAllocated from concurrent access
    portMu sync.Mutex
    // recentlyAllocated tracks ports returned by FindAvailable/FindOrUsePort
    // to prevent TOCTOU races when multiple goroutines allocate ports concurrently.
Codecov Report
@@            Coverage Diff             @@
##             main    #4078      +/-   ##
==========================================
+ Coverage   68.52%   68.53%   +0.01%
==========================================
  Files         447      447
  Lines       45679    45715      +36
==========================================
+ Hits        31300    31332      +32
  Misses      11963    11963
- Partials     2416     2420       +4
Force-pushed from 7483fc3 to 6ddeb3a
Port allocation in pkg/networking/port.go had a TOCTOU race: IsAvailable() opens then closes listeners, creating a window where concurrent goroutines get the same port. This caused HTTP 500s during bulk workload creation in E2E tests, leading to cascade suite timeouts.

Add process-level mutex and recently-allocated port tracking to prevent concurrent goroutines from receiving the same port. Wrap port errors with ErrPortUnavailable sentinel and map to HTTP 503 in the API layer. Add retry helper and NodeTimeout decorators to bulk E2E tests for resilience.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
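The mechanism described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `portTracker`, `newPortTracker`, `findAvailable`, and the injectable `probe` field are hypothetical names, and the port range used in `main` is arbitrary. Only the ideas (a mutex around allocation, plus a map of recently handed-out ports with a TTL) come from the PR.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"sync"
	"time"
)

// portTracker is a hypothetical type for this sketch; the PR keeps similar
// state (portMu, recentlyAllocated) in pkg/networking/port.go.
type portTracker struct {
	mu       sync.Mutex
	reserved map[int]time.Time   // port -> time it was handed out
	ttl      time.Duration       // how long a handed-out port stays reserved
	probe    func(port int) bool // availability check, injectable for tests
}

func newPortTracker() *portTracker {
	return &portTracker{
		reserved: make(map[int]time.Time),
		ttl:      30 * time.Second, // TTL value taken from the PR description
		probe:    isAvailable,
	}
}

// isAvailable binds a listener and closes it again. Once Close returns, the
// port is free for anyone, so two callers probing back to back can both see
// the same port as available. That gap is the TOCTOU window.
func isAvailable(port int) bool {
	l, err := net.Listen("tcp", fmt.Sprintf("127.0.0.1:%d", port))
	if err != nil {
		return false
	}
	_ = l.Close()
	return true
}

// findAvailable returns the first port in [lo, hi] that passes the probe and
// was not handed out within the TTL. The mutex serializes check-and-record.
func (t *portTracker) findAvailable(lo, hi int) (int, error) {
	t.mu.Lock()
	defer t.mu.Unlock()

	now := time.Now()
	// Drop expired reservations before searching.
	for p, at := range t.reserved {
		if now.Sub(at) >= t.ttl {
			delete(t.reserved, p)
		}
	}
	for p := lo; p <= hi; p++ {
		if _, recent := t.reserved[p]; recent {
			continue
		}
		if t.probe(p) {
			t.reserved[p] = now
			return p, nil
		}
	}
	return 0, errors.New("no available port in range")
}

func main() {
	t := newPortTracker()
	var wg sync.WaitGroup
	ports := make(chan int, 8)
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if p, err := t.findAvailable(40000, 40100); err == nil {
				ports <- p
			}
		}()
	}
	wg.Wait()
	close(ports)

	seen := map[int]bool{}
	for p := range ports {
		if seen[p] {
			fmt.Println("duplicate port:", p)
		}
		seen[p] = true
	}
	fmt.Println("distinct ports handed out:", len(seen))
}
```

A tracker like this only guards goroutines inside one process; another process can still grab a port between the probe and the eventual bind.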
Force-pushed from 6ddeb3a to 871d78f
aponcedeleonch (Member, Author) commented:
I thought port allocation was the cause of the flaky tests, but it turns out it wasn't, or at least the E2E tests now seem more stable without this change merged. Closing this PR for now.
Summary
- pkg/networking/port.go has a TOCTOU race: IsAvailable() opens and closes listeners, creating a window in which concurrent goroutines can allocate the same port
Type of change
Test plan
- task test
- task lint-fix
Changes
- pkg/networking/port.go: Add sync.Mutex + recentlyAllocated map with 30s TTL to prevent TOCTOU races in FindAvailable() and FindOrUsePort()
- pkg/runner/config.go: Add ErrPortUnavailable sentinel error; wrap all port allocation failures with it
- pkg/api/v1/workload_service.go: Map ErrPortUnavailable to HTTP 503; wrap SaveState/RunWorkloadDetached errors with explicit httperr codes
- test/e2e/api_workloads_test.go: Add createWorkloadWithRetry helper (3 attempts, 2s backoff on 5xx)
- test/e2e/api_workload_lifecycle_test.go: Use createWorkloadWithRetry in bulk stop/restart/delete tests; add NodeTimeout(5m) to prevent cascade failures
Does this introduce a user-facing change?
Port allocation errors during workload creation now return HTTP 503 (Service Unavailable) instead of a generic 500 (Internal Server Error), giving clients a clearer signal to retry.
Special notes for reviewers
- recentlyAllocated map uses a 30s TTL with lazy purging: entries are cleaned up on the next FindAvailable()/FindOrUsePort() call, not via a background goroutine
- FindOrUsePort inlines the fallback port search rather than calling FindAvailable(), to avoid double-locking the mutex
Generated with Claude Code
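The double-locking concern in the reviewer notes comes from sync.Mutex not being reentrant: a method that already holds the lock cannot call another method that takes the same lock. The PR says it inlines the fallback search; the sketch below shows a closely related alternative, a shared helper that assumes the lock is already held. The `allocator` type, the extra range parameters, and the in-memory `taken` set (used here in place of real OS probing) are all assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// allocator is a hypothetical type; taken stands in for real port probing.
type allocator struct {
	mu    sync.Mutex
	taken map[int]bool
}

func newAllocator() *allocator {
	return &allocator{taken: make(map[int]bool)}
}

// FindAvailable takes the lock and delegates to the lock-free helper.
func (a *allocator) FindAvailable(lo, hi int) (int, error) {
	a.mu.Lock()
	defer a.mu.Unlock()
	return a.findLocked(lo, hi)
}

// FindOrUsePort returns the preferred port if it is free, otherwise falls
// back to a range search while still holding the same lock.
func (a *allocator) FindOrUsePort(preferred, lo, hi int) (int, error) {
	a.mu.Lock()
	defer a.mu.Unlock()
	if !a.taken[preferred] {
		a.taken[preferred] = true
		return preferred, nil
	}
	// Calling a.FindAvailable here would block forever, because it would
	// try to lock a.mu a second time from the same goroutine.
	return a.findLocked(lo, hi)
}

// findLocked must only be called with a.mu held.
func (a *allocator) findLocked(lo, hi int) (int, error) {
	for p := lo; p <= hi; p++ {
		if !a.taken[p] {
			a.taken[p] = true
			return p, nil
		}
	}
	return 0, errors.New("no free port in range")
}

func main() {
	a := newAllocator()
	first, _ := a.FindOrUsePort(8080, 9000, 9010)
	second, _ := a.FindOrUsePort(8080, 9000, 9010)
	fmt.Println(first, second) // prints 8080 9000
}
```

Inlining, as the PR does, and factoring out a lock-assuming helper both keep the check and the reservation inside a single critical section; the helper avoids duplicating the search loop.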