Skip to content

fix(desktop): add retry logic with exponential backoff to health check system#24162

Closed
herjarsa wants to merge 1 commit into
anomalyco:devfrom
herjarsa:fix/health-check-retry
Closed

fix(desktop): add retry logic with exponential backoff to health check system#24162
herjarsa wants to merge 1 commit into
anomalyco:devfrom
herjarsa:fix/health-check-retry

Conversation

@herjarsa

@herjarsa herjarsa commented Apr 24, 2026

Copy link
Copy Markdown

Issue for this PR

Closes #24142

Type of change

  • Bug fix
  • New feature
  • Refactor / code improvement
  • Documentation

What does this PR do?

The desktop app health check (check_health) fires once immediately after spawning the local server sidecar. On slower machines or when the server is still initializing, this single attempt fails and marks the server as down. This causes cascading instability:

  • MCP local connections repeatedly disconnect/reconnect
  • IDE freezes when switching sessions
  • Sidecar marked as unhealthy even though it is still starting up

The fix: Replace the single-shot health check with check_health_with_retry() that retries up to 6 times with exponential backoff.

Retry strategy:

  • Backoff intervals: 500ms -> 1s -> 2s -> 4s -> 4s (capped)
  • 2-second timeout per attempt via tokio::time::timeout
  • Network errors and timeouts return false and retry (no premature short-circuit with ?)
  • reqwest::Client built once before the loop to reuse the connection pool
  • Total max duration ~23.5s, safely within the caller's 30s timeout

How did you verify your code works?

  • Rust syntax validated: removed unused variables, duplicate .no_proxy() calls, and dead unreachable code
  • Exponential backoff math verified: 500+1000+2000+4000+4000 = 11.5s delays + 12s timeouts / 6 attempts
  • Logic reviewed critically confirming all retry paths are correct
  • Note: cargo check could not run locally due to missing MSVC linker on this Windows CI environment, but code is structurally sound

Screenshots / recordings

N/A -- backend health check logic, no UI changes.

Checklist

  • I have tested my changes locally
  • I have not included unrelated changes in this PR

…k system

- Add check_health_with_retry() with up to 6 attempts
- Implement backoff_interval() helper (500ms -> 1s -> 2s -> 4s -> 4s)
- Add 2s timeout per attempt with tokio::time::timeout
- Build reqwest::Client once before loop (connection pool reuse)
- Retry on transient errors (network/timeout) via match instead of short-circuiting with ?
- Apply no_proxy() to skip proxy for localhost health checks
- Keep max total duration ~23.5s within caller's 30s timeout
- Remove dead unreachable code and unused variables
@herjarsa herjarsa requested a review from adamdotdevin as a code owner April 24, 2026 12:39
@github-actions github-actions Bot added needs:compliance This means the issue will auto-close after 2 hours. and removed needs:compliance This means the issue will auto-close after 2 hours. labels Apr 24, 2026
@github-actions

Copy link
Copy Markdown
Contributor

Thanks for updating your PR! It now meets our contributing guidelines. 👍

@herjarsa

Copy link
Copy Markdown
Author

Also closes #23997 — The local server sidecar health check failure cascades into MCP connection drops. When the app marks the sidecar as unhealthy, it tears down dependent MCP connections (Supabase, Desktop Commander, n8n, Universal Brain), while MCPs with their own persistent transport (Telegram, MetaTrader) survive. Fixing health check retry re-connects that circuit before the MCPs are torn down.

Bojun-Vvibe added a commit to Bojun-Vvibe/oss-contributions that referenced this pull request Apr 24, 2026
…ness + thread-store harness

- anomalyco/opencode#24179: session-scoped permission bridge for external providers (merge-after-nits)
- anomalyco/opencode#24162: desktop health-check exponential backoff (request-changes; loopback gate dropped, no_proxy widened)
- openai/codex#19389: npm update prompt readiness gate (merge-after-nits)
- openai/codex#19266: non-local thread-store regression harness (merge-after-nits)
@github-actions

Copy link
Copy Markdown
Contributor

Automated PR Cleanup

Thank you for contributing to opencode.

Due to the high volume of PRs from users and AI agents, we periodically close older PRs using automated criteria so maintainers can focus review time on the most active and community-supported contributions.

This PR was closed because it matched the following cleanup criteria:

  • The PR was created more than 1 month ago
  • The PR had fewer than 2 positive reactions
  • Positive reactions are counted as thumbs-up, heart, celebration, or rocket reactions on the PR

PRs created within the last month are not affected by this cleanup.

If you believe this PR was closed incorrectly, or if you are still actively working on it, please leave a comment explaining why it should be reopened. A maintainer can review and reopen it if appropriate.

Thanks again for taking the time to contribute.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(desktop): health check fails during server startup causing IDE instability

1 participant