fix(mcp): bump initial connect retries 3→6 for slow upstream warmups #11646
Open
handsdiff wants to merge 1 commit into
On freshly-booted exe.dev VMs, internal integration hostnames (e.g.
db-{vm}.int.exe.xyz) take up to 30s after the VM is up before the
reverse-proxy routing is warm. The MCP server's first connection attempt
reached this endpoint during the warmup window and got a transient 405,
which counts as a connection failure. With 3 retries and exponential
backoff starting at 1s, the cumulative wait was 1+2+4=7s — almost always
too short. The task gave up, set _error, and the MCP server stayed down
for the entire gateway session. No lazy reconnect, so the only recovery
was a second manual `systemctl restart hermes.service` 30s later.
Hit this five times across trapezius and slate-vela today during Hub
loop debugging; every restart needed a follow-up restart to get the db
MCP back.
Bump _MAX_INITIAL_CONNECT_RETRIES to 6 (backoffs 1+2+4+8+16+32 = 63s
cumulative; every delay stays under the existing
_MAX_BACKOFF_SECONDS=60 cap). That covers the ~30s warmup window with
margin. Gateway startup-ready time regresses on genuine failures, but
avoiding the manual second-restart dance on transient failures is
worth it.
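The retry-window arithmetic above can be checked with a small sketch. The helper name `backoff_schedule` and its parameters are hypothetical and only mirror the constants named in this commit message (1s initial delay, doubling, 60s cap); the actual loop in the MCP tool may be structured differently.

```python
# Hypothetical helper: reproduces the backoff arithmetic described above.
# It is not the project's implementation, only a model of
# "exponential backoff starting at 1s, capped at _MAX_BACKOFF_SECONDS".

def backoff_schedule(max_retries: int, initial: float = 1.0, cap: float = 60.0) -> list[float]:
    """Return the delay (in seconds) slept after each failed attempt."""
    return [min(initial * 2 ** attempt, cap) for attempt in range(max_retries)]


old = backoff_schedule(3)  # 1 + 2 + 4
new = backoff_schedule(6)  # 1 + 2 + 4 + 8 + 16 + 32

print(old, sum(old))
print(new, sum(new))
```

Under these assumptions the cap never engages for 6 retries, since the largest delay is 32s; it would only start to matter from a seventh retry onward.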
Two existing tests relied on the fast 7s path to stay under the 30s
per-test timeout. Added asyncio.sleep patches so the retry loop
finishes instantly in those tests.
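A minimal sketch of the test technique, assuming a retry loop that awaits `asyncio.sleep` between attempts. `connect_with_retries` and `run_failing_connect` are hypothetical stand-ins, not functions from this repository; only `asyncio.sleep`, `unittest.mock.patch`, and `AsyncMock` are real APIs.

```python
import asyncio
from unittest.mock import AsyncMock, patch


async def connect_with_retries(connect, max_retries=6, cap=60.0):
    """Hypothetical retry loop: doubles the delay after each failure."""
    backoff = 1.0
    last_error = None
    for _ in range(max_retries):
        try:
            return await connect()
        except ConnectionError as exc:
            last_error = exc
            # Looked up on the asyncio module at call time, so a patch
            # of "asyncio.sleep" takes effect here.
            await asyncio.sleep(backoff)
            backoff = min(backoff * 2, cap)
    raise last_error


def run_failing_connect():
    """Drive the loop against an always-failing connect with sleep stubbed out."""
    connect = AsyncMock(side_effect=ConnectionError("upstream warming up"))
    with patch("asyncio.sleep", new=AsyncMock()) as fake_sleep:
        try:
            asyncio.run(connect_with_retries(connect))
        except ConnectionError:
            pass
    # Requested delays are recorded even though no real time passed.
    delays = [call.args[0] for call in fake_sleep.await_args_list]
    return connect.await_count, delays
```

Stubbing the sleep keeps the retry count and requested delays observable while removing the wall-clock wait, which is what lets a test stay under a 30s timeout.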
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
handsdiff added a commit to handsdiff/hermes-agent that referenced this pull request Apr 24, 2026
…usResearch#11647) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
On environments where the MCP server's upstream target is still warming up at agent boot time (e.g. cloud VMs where a reverse-proxy route takes up to ~30s to become ready), the initial MCP connection attempt can receive a transient error (e.g. a 405 from a placeholder route). With the current `_MAX_INITIAL_CONNECT_RETRIES = 3` and exponential backoff starting at 1s, the cumulative retry window is `1+2+4 = 7s`, short enough that the warmup window usually isn't covered. The MCP task gives up, sets `_error`, and the server stays down for the entire session since there's no lazy reconnect. Recovery requires manually restarting the agent after the window passes.

Bumps `_MAX_INITIAL_CONNECT_RETRIES` to 6. With the existing `_MAX_BACKOFF_SECONDS = 60` cap, the retry sequence is `1+2+4+8+16+32 = 63s` cumulative, which comfortably covers ~30s warmups. Genuine (permanent) connect failures take longer to surface, but only on startup; steady-state behavior is unchanged.

Two tests (`test_mcp_tool.py`, `test_mcp_stability.py`) relied on the 7s fast-fail path to stay under their 30s per-test timeout. They're patched to stub `asyncio.sleep` so the retry loop completes instantly.

Test plan
`pytest tests/tools/test_mcp_tool.py tests/tools/test_mcp_stability.py -q`

🤖 Generated with Claude Code