fix(mcp): bump initial connect retries 3→6 for slow upstream warmups by handsdiff · Pull Request #11646 · NousResearch/hermes-agent

handsdiff · 2026-04-17T14:25:54Z

Summary

On environments where the MCP server's upstream target is still warming up at agent boot time (e.g. cloud VMs where a reverse-proxy route takes up to ~30s to become ready), the initial MCP connection attempt can receive a transient error (e.g. a 405 from a placeholder route). With the current _MAX_INITIAL_CONNECT_RETRIES = 3 and exponential backoff starting at 1s, the cumulative retry window is 1+2+4 = 7s, which usually ends before the warmup does. The MCP task then gives up and sets _error, and because there is no lazy reconnect the server stays down for the entire session. Recovery requires manually restarting the agent after the warmup window has passed.

Bumps _MAX_INITIAL_CONNECT_RETRIES to 6. The backoff sequence becomes 1+2+4+8+16+32 = 63s cumulative, which comfortably covers ~30s warmups; every individual delay stays below the existing _MAX_BACKOFF_SECONDS = 60 cap. Genuine (permanent) connect failures take longer to surface (up to about 63s of backoff instead of 7s), but only on startup; steady-state behavior is unchanged.
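The retry-window arithmetic can be checked with a short sketch. It assumes the delay doubles from 1s on each retry and is clamped to the cap; `backoff_schedule` is a hypothetical helper written for illustration, not a function from the hermes-agent codebase.

```python
from __future__ import annotations


def backoff_schedule(max_retries: int, initial: float = 1.0, cap: float = 60.0) -> list[float]:
    """Return the wait before each retry: doubling from `initial`, clamped to `cap`.

    Assumption: this mirrors the doubling-with-cap behaviour described in the PR;
    the real retry loop may differ in detail.
    """
    return [min(initial * 2 ** attempt, cap) for attempt in range(max_retries)]


old = backoff_schedule(3)  # previous _MAX_INITIAL_CONNECT_RETRIES
new = backoff_schedule(6)  # value proposed in this PR

print(old, sum(old))  # [1.0, 2.0, 4.0] 7.0
print(new, sum(new))  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0] 63.0
```

Under these assumptions the cap only starts to matter from the seventh retry onward, where the uncapped delay would be 64s.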

Two tests (test_mcp_tool.py, test_mcp_stability.py) relied on the 7s fast-fail path to stay under their 30s per-test timeout. They're patched to stub asyncio.sleep so the retry loop completes instantly.
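A minimal sketch of the stubbing approach, for readers unfamiliar with it. `connect_with_retries` and the constants below are hypothetical stand-ins for the real MCP connect loop, which is not shown in this PR description; only `asyncio.sleep`, `unittest.mock.patch`, and `unittest.mock.AsyncMock` are real APIs.

```python
import asyncio
from unittest.mock import AsyncMock, patch

MAX_RETRIES = 6           # assumed to mirror the new _MAX_INITIAL_CONNECT_RETRIES
MAX_BACKOFF_SECONDS = 60  # assumed to mirror the existing cap


async def connect_with_retries(connect) -> None:
    """Hypothetical retry loop: one initial attempt plus MAX_RETRIES retries."""
    backoff = 1.0
    for attempt in range(MAX_RETRIES + 1):
        try:
            await connect()
            return
        except ConnectionError:
            if attempt == MAX_RETRIES:
                raise  # out of retries, surface the failure
            await asyncio.sleep(backoff)  # looked up on the module, so patchable
            backoff = min(backoff * 2, MAX_BACKOFF_SECONDS)


def run_failing_connect_with_stubbed_sleep() -> list:
    """Run the loop against an always-failing connect and return the requested delays."""
    failing_connect = AsyncMock(side_effect=ConnectionError("upstream not ready"))
    fake_sleep = AsyncMock()  # awaiting it returns immediately
    with patch("asyncio.sleep", new=fake_sleep):
        try:
            asyncio.run(connect_with_retries(failing_connect))
        except ConnectionError:
            pass  # expected: every attempt fails
    return [call.args[0] for call in fake_sleep.call_args_list]


delays = run_failing_connect_with_stubbed_sleep()
print(delays, sum(delays))
```

With the sleep stubbed, the loop still requests the full 63s of backoff but the test spends no wall-clock time waiting, so it stays well inside a 30s per-test timeout.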

Test plan

pytest tests/tools/test_mcp_tool.py tests/tools/test_mcp_stability.py -q
Manual: start agent against an MCP server whose upstream is slow to come up; confirm it reconnects within 60s instead of erroring out

🤖 Generated with Claude Code

On freshly-booted exe.dev VMs, internal integration hostnames (e.g. db-{vm}.int.exe.xyz) take up to 30s after the VM is up before the reverse-proxy routing is warm. The MCP server's first connection attempt reached this endpoint during the warmup window and got a transient 405, which counts as a connection failure.

With 3 retries and exponential backoff starting at 1s, the cumulative wait was 1+2+4=7s, almost always too short. The task gave up, set _error, and the MCP server stayed down for the entire gateway session. No lazy reconnect, so the only recovery was a second manual `systemctl restart hermes.service` 30s later.

Hit this five times across trapezius and slate-vela today during Hub loop debugging; every restart needed a follow-up restart to get the db MCP back.

Bump _MAX_INITIAL_CONNECT_RETRIES to 6 (backoffs 1+2+4+8+16+32 = 63s cumulative, capped by the existing _MAX_BACKOFF_SECONDS=60). That covers the warmup window with margin. Gateway startup-ready time regresses on genuine failures, but the win on transient failures (no manual second-restart dance) is worth it.

Two existing tests relied on the fast 7s path to stay under the 30s per-test timeout. Added asyncio.sleep patches so the retry loop finishes instantly in those tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

handsdiff force-pushed the fix/mcp-initial-connect-retries branch from 1fa3954 to dee67e2 Compare April 18, 2026 16:12

handsdiff force-pushed the fix/mcp-initial-connect-retries branch from dee67e2 to 2458db3 Compare April 18, 2026 17:28

handsdiff force-pushed the fix/mcp-initial-connect-retries branch from 2458db3 to 9913a60 Compare April 19, 2026 22:11

handsdiff force-pushed the fix/mcp-initial-connect-retries branch from 9913a60 to 5723a1e Compare April 21, 2026 20:25

alt-glitch added the type/bug, comp/tools, and tool/mcp labels Apr 21, 2026

handsdiff force-pushed the fix/mcp-initial-connect-retries branch from 5723a1e to f3c4eed Compare April 22, 2026 15:52

handsdiff added a commit to handsdiff/hermes-agent that referenced this pull request Apr 24, 2026

docs: fill in PR numbers for MCP upstream PRs (NousResearch#11646, No…

553cd2b

…usResearch#11647) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

handsdiff force-pushed the fix/mcp-initial-connect-retries branch from f3c4eed to 034d72a Compare April 24, 2026 03:26

alt-glitch mentioned this pull request May 4, 2026

MCP client doesn't retry connecting to HTTP servers after startup window closes #19559

Open

handsdiff wants to merge 1 commit into NousResearch:main from handsdiff:fix/mcp-initial-connect-retries
