
fix(mcp): bump initial connect retries 3→6 for slow upstream warmups #11646

Open
handsdiff wants to merge 1 commit into NousResearch:main from handsdiff:fix/mcp-initial-connect-retries

Conversation

@handsdiff

Contributor

Summary

On environments where the MCP server's upstream target is still warming up at agent boot time (e.g. cloud VMs where a reverse-proxy route takes up to ~30s to become ready), the initial MCP connection attempt can receive a transient error (e.g. a 405 from a placeholder route). With the current _MAX_INITIAL_CONNECT_RETRIES = 3 and exponential backoff starting at 1s, the cumulative retry window is 1+2+4 = 7s, which is usually too short to cover the warmup window. The MCP task gives up and sets _error, and because there is no lazy reconnect, the server stays down for the entire session. Recovery requires manually restarting the agent after the warmup window has passed.

This PR bumps _MAX_INITIAL_CONNECT_RETRIES to 6. The backoff sequence becomes 1+2+4+8+16+32 = 63s cumulative, which comfortably covers ~30s warmups. The largest single delay (32s) stays below the existing _MAX_BACKOFF_SECONDS = 60 cap, so the cap does not shorten the sequence. Genuine (permanent) connect failures now take longer to surface (roughly 63s of backoff instead of 7s), but only on startup; steady-state behavior is unchanged.
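For reviewers who want to check the arithmetic, here is a minimal sketch of the backoff schedule. It assumes the delay doubles after each failed attempt, starts at 1s, and is clamped with a simple `min(delay, cap)`. The function name `backoff_schedule` and its parameters are hypothetical and are not taken from the hermes-agent code.

```python
def backoff_schedule(max_retries, initial=1.0, cap=60.0):
    """Return the list of sleep durations for `max_retries` retries.

    Hypothetical helper: assumes delay doubles each retry and is
    clamped to `cap`, mirroring the numbers quoted in this PR.
    """
    return [min(initial * 2 ** i, cap) for i in range(max_retries)]


old = backoff_schedule(3)   # [1.0, 2.0, 4.0]
new = backoff_schedule(6)   # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]

# Cumulative retry windows: 7s before this change, 63s after.
old_window = sum(old)
new_window = sum(new)
```

Under these assumptions the cap would first take effect on a seventh retry, where the uncapped delay would be 64s.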

Two tests (in test_mcp_tool.py and test_mcp_stability.py) relied on the 7s fast-fail path to stay under their 30s per-test timeout. They're patched to stub asyncio.sleep so the retry loop completes instantly.
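The stubbing approach can be illustrated with a self-contained sketch. The retry loop below is a hypothetical stand-in, not the actual hermes-agent implementation; `connect_with_retries`, `always_fails`, and the constants are invented for illustration. Only `unittest.mock.AsyncMock` and `patch.object` are real standard-library APIs.

```python
import asyncio
from unittest.mock import AsyncMock, patch

MAX_RETRIES = 6       # stand-in for _MAX_INITIAL_CONNECT_RETRIES
MAX_BACKOFF = 60.0    # stand-in for _MAX_BACKOFF_SECONDS


async def connect_with_retries(connect):
    """Hypothetical retry loop: one initial attempt plus MAX_RETRIES retries."""
    delay = 1.0
    for attempt in range(MAX_RETRIES + 1):
        try:
            return await connect()
        except ConnectionError:
            if attempt == MAX_RETRIES:
                raise  # retries exhausted, surface the error
            await asyncio.sleep(delay)
            delay = min(delay * 2, MAX_BACKOFF)


async def always_fails():
    raise ConnectionError("upstream not ready")


def run_with_stubbed_sleep():
    """Run the loop with asyncio.sleep replaced by an AsyncMock.

    Returns the delays the loop asked to sleep for, without waiting.
    """
    with patch.object(asyncio, "sleep", new=AsyncMock()) as fake_sleep:
        try:
            asyncio.run(connect_with_retries(always_fails))
        except ConnectionError:
            pass
    return [call.args[0] for call in fake_sleep.await_args_list]
```

Because the mock returns immediately, the loop requests all six delays (63s in total) without actually waiting, which keeps such a test well under a 30s timeout.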

Test plan

  • pytest tests/tools/test_mcp_tool.py tests/tools/test_mcp_stability.py -q
  • Manual: start the agent against an MCP server whose upstream is slow to come up; confirm it connects within the ~63s retry window instead of erroring out

🤖 Generated with Claude Code

@handsdiff force-pushed the fix/mcp-initial-connect-retries branch from 1fa3954 to dee67e2 April 18, 2026 16:12
@handsdiff force-pushed the fix/mcp-initial-connect-retries branch from dee67e2 to 2458db3 April 18, 2026 17:28
@handsdiff force-pushed the fix/mcp-initial-connect-retries branch from 2458db3 to 9913a60 April 19, 2026 22:11
@handsdiff force-pushed the fix/mcp-initial-connect-retries branch from 9913a60 to 5723a1e April 21, 2026 20:25
@alt-glitch added the type/bug, comp/tools, and tool/mcp labels Apr 21, 2026
@handsdiff force-pushed the fix/mcp-initial-connect-retries branch from 5723a1e to f3c4eed April 22, 2026 15:52
On freshly-booted exe.dev VMs, internal integration hostnames (e.g.
db-{vm}.int.exe.xyz) take up to 30s after the VM is up before the
reverse-proxy routing is warm. The MCP server's first connection attempt
reached this endpoint during the warmup window and got a transient 405,
which counts as a connection failure. With 3 retries and exponential
backoff starting at 1s, the cumulative wait was 1+2+4=7s — almost always
too short. The task gave up, set _error, and the MCP server stayed down
for the entire gateway session. No lazy reconnect, so the only recovery
was a second manual `systemctl restart hermes.service` 30s later.

Hit this five times across trapezius and slate-vela today during Hub
loop debugging; every restart needed a follow-up restart to get the db
MCP back.

Bump _MAX_INITIAL_CONNECT_RETRIES to 6 (backoffs 1+2+4+8+16+32 = 63s
cumulative, capped by the existing _MAX_BACKOFF_SECONDS=60). That
covers the warmup window with margin. Gateway startup-ready time
regresses on genuine failures, but the win on transient failures — no
manual second-restart dance — is worth it.

Two existing tests relied on the fast 7s path to stay under the 30s
per-test timeout. Added asyncio.sleep patches so the retry loop
finishes instantly in those tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
handsdiff added a commit to handsdiff/hermes-agent that referenced this pull request Apr 24, 2026
…usResearch#11647)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@handsdiff force-pushed the fix/mcp-initial-connect-retries branch from f3c4eed to 034d72a April 24, 2026 03:26

Labels

  • comp/tools: Tool registry, model_tools, toolsets
  • tool/mcp: MCP client and OAuth
  • type/bug: Something isn't working

2 participants