Skip to content

Fix flaky TestServer_HealthMonitoring_Enabled and related tests#3990

Merged
JAORMX merged 1 commit intomainfrom
issue-3985
Mar 4, 2026
Merged

Fix flaky TestServer_HealthMonitoring_Enabled and related tests#3990
JAORMX merged 1 commit intomainfrom
issue-3985

Conversation

@JAORMX
Copy link
Copy Markdown
Collaborator

@JAORMX JAORMX commented Mar 4, 2026

Summary

  • Replace time.Sleep-based synchronization with require.Eventually polling in three flaky health monitoring server tests
  • Increase Timeout (10ms → 5s) and DegradedThreshold (5ms → 2s) so mock calls don't falsely trigger degraded/timeout status under CI load or race detector
  • Replace trailing time.Sleep shutdown waits with proper select on errCh

Root Cause

The tests used time.Sleep(200ms) to wait for health checks and had a DegradedThreshold of 5ms. Under CI load or with -race, mock calls could exceed 5ms, causing backends to be marked "degraded" instead of "healthy". Additionally, calling WaitForInitialHealthChecks() from the test goroutine races with Start() (which runs in a separate goroutine) because the server's Ready() channel fires before healthMonitor.Start() calls WaitGroup.Add().

Test plan

  • go test -race -run TestServer_HealthMonitoring_Enabled -count=20 — 20/20 pass, 0 races
  • go test -race -run TestServer_HandleBackendHealth_Enabled -count=20 — 20/20 pass, 0 races
  • go test -race -run TestServer_Stop_StopsHealthMonitor -count=20 — 20/20 pass, 0 races
  • go test -race -run TestServer_Health -count=5 — full suite passes

Fixes #3985

🤖 Generated with Claude Code

Replace time.Sleep-based synchronization with require.Eventually polling
in three health monitoring server tests. The tests were flaky because:

- DegradedThreshold (5ms) and Timeout (10ms) were too tight for CI/race
  detector, causing mock calls to exceed thresholds
- time.Sleep waits are inherently racy and non-deterministic

Changes:
- Use require.Eventually to poll for expected health status instead of
  sleeping a fixed duration
- Increase Timeout from 10ms to 5s and DegradedThreshold from 5ms to 2s
  (mock calls never take that long, but prevents false degraded status)
- Replace trailing time.Sleep shutdown waits with select on errCh
- Add proper server startup error handling in HandleBackendHealth test

Verified with -race -count=20 on all three tests (60/60 pass, 0 races).

Fixes #3985

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the size/XS Extra small PR: < 100 lines changed label Mar 4, 2026
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.58%. Comparing base (c46ae47) to head (af90801).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3990      +/-   ##
==========================================
+ Coverage   68.56%   68.58%   +0.02%     
==========================================
  Files         437      437              
  Lines       44662    44662              
==========================================
+ Hits        30621    30633      +12     
+ Misses      11657    11647      -10     
+ Partials     2384     2382       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@JAORMX JAORMX merged commit a38033f into main Mar 4, 2026
36 checks passed
@JAORMX JAORMX deleted the issue-3985 branch March 4, 2026 09:54
reyortiz3 pushed a commit that referenced this pull request Mar 4, 2026
Replace time.Sleep-based synchronization with require.Eventually polling
in three health monitoring server tests. The tests were flaky because:

- DegradedThreshold (5ms) and Timeout (10ms) were too tight for CI/race
  detector, causing mock calls to exceed thresholds
- time.Sleep waits are inherently racy and non-deterministic

Changes:
- Use require.Eventually to poll for expected health status instead of
  sleeping a fixed duration
- Increase Timeout from 10ms to 5s and DegradedThreshold from 5ms to 2s
  (mock calls never take that long, but prevents false degraded status)
- Replace trailing time.Sleep shutdown waits with select on errCh
- Add proper server startup error handling in HandleBackendHealth test

Verified with -race -count=20 on all three tests (60/60 pass, 0 races).

Fixes #3985

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XS Extra small PR: < 100 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test: TestServer_HealthMonitoring_Enabled

2 participants