fix(desktop): add retry logic with exponential backoff to health check system #24162
herjarsa wants to merge 1 commit into
…k system
- Add check_health_with_retry() with up to 6 attempts
- Implement backoff_interval() helper (500ms -> 1s -> 2s -> 4s -> 4s)
- Add 2s timeout per attempt with tokio::time::timeout
- Build reqwest::Client once before loop (connection pool reuse)
- Retry on transient errors (network/timeout) via match instead of short-circuiting with ?
- Apply no_proxy() to skip proxy for localhost health checks
- Keep max total duration ~23.5s within caller's 30s timeout
- Remove dead unreachable code and unused variables
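The backoff schedule and the ~23.5s budget in this commit message can be checked with a short sketch. The name backoff_interval comes from the commit message, but its signature (a zero-based attempt index), the constants MAX_ATTEMPTS and ATTEMPT_TIMEOUT, and the helper worst_case_total are assumptions made for illustration, not the PR's actual code:

```rust
use std::time::Duration;

// Assumed constants, taken from the numbers quoted in the commit message.
const MAX_ATTEMPTS: u32 = 6;
const ATTEMPT_TIMEOUT: Duration = Duration::from_secs(2);

/// Hypothetical reconstruction of the helper named in the commit message.
/// `attempt` is the zero-based index of the attempt that just failed.
fn backoff_interval(attempt: u32) -> Duration {
    // 500ms doubled per failed attempt, capped at 4s:
    // 500ms, 1s, 2s, 4s, 4s, ...
    Duration::from_millis(500u64 << attempt.min(3))
}

/// Worst case: every attempt runs into its timeout, and the loop sleeps
/// between consecutive attempts (but not after the last one).
fn worst_case_total() -> Duration {
    let sleeps: Duration = (0..MAX_ATTEMPTS - 1).map(backoff_interval).sum();
    ATTEMPT_TIMEOUT * MAX_ATTEMPTS + sleeps
}
```

Under these assumptions the worst case is 6 × 2s of timeouts plus 0.5 + 1 + 2 + 4 + 4 = 11.5s of sleeps, 23.5s in total, which leaves about 6.5s of headroom under the caller's 30s timeout.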
Thanks for updating your PR! It now meets our contributing guidelines. 👍
Also closes #23997. The local server sidecar health check failure cascades into MCP connection drops. When the app marks the sidecar as unhealthy, it tears down dependent MCP connections (Supabase, Desktop Commander, n8n, Universal Brain), while MCPs with their own persistent transport (Telegram, MetaTrader) survive. Retrying the health check gives the sidecar a chance to report healthy before the MCPs are torn down.
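To illustrate why retrying keeps a slow-starting sidecar from being marked unhealthy, here is a minimal synchronous sketch of the retry loop. It deliberately swaps the PR's tokio and reqwest calls for a caller-supplied probe closure and an injectable sleep function, so it runs with the standard library alone. ProbeError, both closure parameters, and the function signature are hypothetical and not taken from the PR:

```rust
use std::time::Duration;

/// Hypothetical error type standing in for the transient failures
/// (network error, per-attempt timeout) mentioned in the PR.
#[derive(Debug)]
enum ProbeError {
    Timeout,
    Network,
}

/// Assumed schedule: 500ms, 1s, 2s, then capped at 4s.
fn backoff_interval(attempt: u32) -> Duration {
    Duration::from_millis(500u64 << attempt.min(3))
}

/// Sketch of a retrying health check. `probe` performs one attempt and
/// `sleep` waits between attempts; both are injected so the loop can be
/// exercised without a real server or real delays.
fn check_health_with_retry<P, S>(max_attempts: u32, mut probe: P, mut sleep: S) -> bool
where
    P: FnMut() -> Result<bool, ProbeError>,
    S: FnMut(Duration),
{
    for attempt in 0..max_attempts {
        // Match on the result instead of using `?`, so that a transient
        // error leads to another attempt rather than an early return.
        match probe() {
            Ok(true) => return true,
            Ok(false) => {}
            Err(ProbeError::Timeout) | Err(ProbeError::Network) => {}
        }
        // Sleep only if another attempt will follow.
        if attempt + 1 < max_attempts {
            sleep(backoff_interval(attempt));
        }
    }
    false
}
```

With a probe that fails twice and then reports healthy, this loop returns true on the third attempt after sleeping 500ms and 1s, so under this sketch the sidecar would never be marked as down and dependent connections would not need to be torn down.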
…ness + thread-store harness
- anomalyco/opencode#24179: session-scoped permission bridge for external providers (merge-after-nits)
- anomalyco/opencode#24162: desktop health-check exponential backoff (request-changes; loopback gate dropped, no_proxy widened)
- openai/codex#19389: npm update prompt readiness gate (merge-after-nits)
- openai/codex#19266: non-local thread-store regression harness (merge-after-nits)
Automated PR Cleanup
Thank you for contributing to opencode. Due to the high volume of PRs from users and AI agents, we periodically close older PRs using automated criteria so maintainers can focus review time on the most active and community-supported contributions. This PR was closed because it matched the following cleanup criteria:
PRs created within the last month are not affected by this cleanup. If you believe this PR was closed incorrectly, or if you are still actively working on it, please leave a comment explaining why it should be reopened. A maintainer can review and reopen it if appropriate. Thanks again for taking the time to contribute.
Issue for this PR
Closes #24142
Type of change
What does this PR do?
The desktop app health check (check_health) fires once immediately after spawning the local server sidecar. On slower machines, or when the server is still initializing, this single attempt fails and marks the server as down. This causes cascading instability.
The fix: replace the single-shot health check with check_health_with_retry(), which retries up to 6 times with exponential backoff.
Retry strategy:
- Each attempt is wrapped in a 2s timeout with tokio::time::timeout
- Transient errors (network/timeout) produce false and retry (no premature short-circuit with ?)
- reqwest::Client is built once before the loop to reuse the connection pool
How did you verify your code works?
- … .no_proxy() calls, and dead unreachable code
- cargo check could not run locally due to a missing MSVC linker in this Windows CI environment, but the code is structurally sound
Screenshots / recordings
N/A -- backend health check logic, no UI changes.
Checklist