
Fix flaky E2E tests caused by port allocation race condition #4078

Closed

aponcedeleonch wants to merge 1 commit into main from fix/flaky-e2e-port-allocation-race

Conversation

@aponcedeleonch (Member)

Summary

  • E2E Tests Core jobs are flaky on main, failing across consecutive CI runs with HTTP 500 on workload creation during bulk operations, followed by suite timeout cascades
  • Root cause: port allocation in pkg/networking/port.go has a TOCTOU race — IsAvailable() opens/closes listeners, creating a window where concurrent goroutines allocate the same port
  • Fix adds process-level mutex and recently-allocated port tracking to prevent concurrent goroutines from receiving the same port
  • Port allocation errors now surface as HTTP 503 (Service Unavailable) instead of generic 500, and bulk E2E tests use retry + NodeTimeout for resilience

Type of change

  • Bug fix

Test plan

  • Unit tests (task test)
  • Linting (task lint-fix)

Changes

| File | Change |
| --- | --- |
| `pkg/networking/port.go` | Add `sync.Mutex` + `recentlyAllocated` map with 30s TTL to prevent TOCTOU races in `FindAvailable()` and `FindOrUsePort()` |
| `pkg/runner/config.go` | Add `ErrPortUnavailable` sentinel error; wrap all port allocation failures with it |
| `pkg/api/v1/workload_service.go` | Map `ErrPortUnavailable` to HTTP 503; wrap `SaveState`/`RunWorkloadDetached` errors with explicit httperr codes |
| `test/e2e/api_workloads_test.go` | Log the response body on unexpected status; add `createWorkloadWithRetry` helper (3 attempts, 2s backoff on 5xx) |
| `test/e2e/api_workload_lifecycle_test.go` | Use `createWorkloadWithRetry` in bulk stop/restart/delete tests; add `NodeTimeout(5m)` to prevent cascade failures |
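As a rough illustration of the locking scheme described above (the actual `pkg/networking/port.go` is not shown on this page, so the helper names besides `recentlyAllocated`, the port range, and the purge shape are assumptions), a mutex serializing the check-then-reserve step plus a TTL map closes the TOCTOU window:

```go
// Hypothetical sketch only; not the PR's actual implementation.
package main

import (
	"fmt"
	"net"
	"sync"
	"time"
)

var (
	portMu            sync.Mutex
	recentlyAllocated = map[int]time.Time{}
)

const allocTTL = 30 * time.Second

// isAvailable binds and immediately releases the port. Alone, this is
// the TOCTOU check: after the listener closes, another goroutine could
// grab the same port before the caller uses it.
func isAvailable(port int) bool {
	ln, err := net.Listen("tcp", fmt.Sprintf("127.0.0.1:%d", port))
	if err != nil {
		return false
	}
	ln.Close()
	return true
}

// findAvailable holds portMu across the check and the reservation, and
// skips ports handed out within the TTL, so concurrent callers cannot
// receive the same port.
func findAvailable(lo, hi int) (int, error) {
	portMu.Lock()
	defer portMu.Unlock()
	now := time.Now()
	// Lazy purge: expired entries are dropped on the next call,
	// not by a background goroutine.
	for p, t := range recentlyAllocated {
		if now.Sub(t) > allocTTL {
			delete(recentlyAllocated, p)
		}
	}
	for p := lo; p <= hi; p++ {
		if _, recent := recentlyAllocated[p]; recent {
			continue
		}
		if isAvailable(p) {
			recentlyAllocated[p] = now
			return p, nil
		}
	}
	return 0, fmt.Errorf("no available port in %d-%d", lo, hi)
}

func main() {
	a, _ := findAvailable(42000, 42010)
	b, _ := findAvailable(42000, 42010)
	fmt.Println(a, b, a != b)
}
```

Because the map lookup, the availability probe, and the map insert must happen atomically, a plain mutex around the whole sequence is the natural shape here.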

Does this introduce a user-facing change?

Port allocation errors during workload creation now return HTTP 503 (Service Unavailable) instead of a generic 500 (Internal Server Error), giving clients a clearer signal to retry.

Special notes for reviewers

  • The recentlyAllocated map uses a 30s TTL with lazy purging — entries are cleaned up on the next FindAvailable()/FindOrUsePort() call, not via a background goroutine
  • FindOrUsePort inlines the fallback port search rather than calling FindAvailable() to avoid double-locking the mutex
  • The retry helper is intentionally only used in bulk operation tests where the race was observed; single-workload tests keep strict assertions

Generated with Claude Code

@aponcedeleonch force-pushed the fix/flaky-e2e-port-allocation-race branch from 1ec841f to 25594bf on March 10, 2026 at 17:35
@github-actions github-actions bot added the size/S Small PR: 100-299 lines changed label Mar 10, 2026
rdimitrov previously approved these changes Mar 10, 2026
```go
// portMu protects recentlyAllocated from concurrent access
portMu sync.Mutex
// recentlyAllocated tracks ports returned by FindAvailable/FindOrUsePort
// to prevent TOCTOU races when multiple goroutines allocate ports concurrently.
```
Collaborator
Why not use `sync.Map`?

@codecov

codecov bot commented Mar 10, 2026

Codecov Report

❌ Patch coverage is 41.30435% with 27 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.53%. Comparing base (de0eb9c) to head (871d78f).
⚠️ Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pkg/api/v1/workload_service.go | 0.00% | 12 Missing and 1 partial ⚠️ |
| pkg/networking/port.go | 60.00% | 8 Missing and 4 partials ⚠️ |
| pkg/runner/config.go | 33.33% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4078      +/-   ##
==========================================
+ Coverage   68.52%   68.53%   +0.01%     
==========================================
  Files         447      447              
  Lines       45679    45715      +36     
==========================================
+ Hits        31300    31332      +32     
  Misses      11963    11963              
- Partials     2416     2420       +4     

@aponcedeleonch force-pushed the fix/flaky-e2e-port-allocation-race branch from 7483fc3 to 6ddeb3a on March 11, 2026 at 08:48
Port allocation in pkg/networking/port.go had a TOCTOU race: IsAvailable()
opens then closes listeners, creating a window where concurrent goroutines
get the same port. This caused HTTP 500s during bulk workload creation in
E2E tests, leading to cascade suite timeouts.

Add process-level mutex and recently-allocated port tracking to prevent
concurrent goroutines from receiving the same port. Wrap port errors with
ErrPortUnavailable sentinel and map to HTTP 503 in the API layer. Add
retry helper and NodeTimeout decorators to bulk E2E tests for resilience.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@aponcedeleonch force-pushed the fix/flaky-e2e-port-allocation-race branch from 6ddeb3a to 871d78f on March 11, 2026 at 08:49
@aponcedeleonch (Member, Author)

I thought port allocation was the cause of the flaky tests, but it turns out it wasn't; at least the E2E tests now seem more stable without this change merged. Closing this PR for now.


Labels

size/S Small PR: 100-299 lines changed

3 participants