Description
Claude Code (v2.1.122, May 2026) added automatic retry logic for MCP servers that hit a transient error during startup: instead of immediately marking the server as disconnected, it retries the connection up to 3 times before giving up.
Current Zeph Behavior
In zeph-mcp/src/lifecycle.rs, MCP server startup is attempted once. If the initial connection fails (e.g., the server process takes slightly too long to start, a transient socket error occurs, or the connect attempt races the server process), the server is marked unavailable for the rest of the session. The user sees a startup warning, and the server's tools are then silently absent.
This is a UX pain point when:
- The MCP server is a slow-starting subprocess (e.g., a Python server that needs to warm up)
- There is a transient OS-level socket error at startup
- The server is a remote HTTP MCP endpoint under temporary load
Proposed Fix
In McpLifecycle::connect_server(), wrap the initial connect attempt in a retry loop:
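    use std::time::Duration;

    const MAX_RETRIES: u32 = 3;
    const RETRY_DELAY: Duration = Duration::from_millis(500);

    // One initial attempt, then up to MAX_RETRIES retries.
    let mut retry = 0;
    loop {
        match self.try_connect(entry).await {
            Ok(client) => return Ok(client),
            Err(e) if retry < MAX_RETRIES => {
                retry += 1;
                tracing::warn!(attempt = retry, "MCP server {} failed to connect, retrying: {e}", entry.name);
                // Exponential backoff: the delay doubles on each retry (500ms, 1s, 2s).
                tokio::time::sleep(RETRY_DELAY * 2u32.pow(retry - 1)).await;
            }
            // Retries exhausted: return the last error so the caller handles it as today.
            Err(e) => return Err(e),
        }
    }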
Use exponential backoff (500ms, 1s, and 2s before the first, second, and third retry) so that a slow-starting server is not hit with rapid reconnect attempts.
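The delay schedule can be checked in isolation from the async runtime. The following is a minimal sketch, assuming one initial attempt followed by up to `max_retries` retries. `backoff_delay` and `retry_with_backoff` are hypothetical helper names, not existing Zeph APIs, and the sleep is injected as a closure so the logic can be exercised without tokio or real waiting.

```rust
use std::time::Duration;

/// Delay before retry number `retry` (1-based): base, 2*base, 4*base, ...
/// Hypothetical helper; saturates instead of overflowing for large `retry`.
fn backoff_delay(base: Duration, retry: u32) -> Duration {
    let factor = 2u32.saturating_pow(retry.saturating_sub(1));
    base.saturating_mul(factor)
}

/// Runs `op` once, then retries up to `max_retries` times on failure.
/// `sleep` is injected so callers can pass a real sleep or a test recorder.
fn retry_with_backoff<T, E>(
    max_retries: u32,
    base: Duration,
    mut op: impl FnMut() -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut retry = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(_) if retry < max_retries => {
                retry += 1;
                sleep(backoff_delay(base, retry));
            }
            // Retries exhausted: surface the last error unchanged.
            Err(e) => return Err(e),
        }
    }
}
```

With `max_retries = 3` and a 500ms base, an operation that always fails is called four times in total and sleeps three times. In the real lifecycle code the closure-based sleep would presumably be replaced by `tokio::time::sleep(...).await`.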
Acceptance Criteria
- MCP server connection is retried up to 3 times with exponential backoff (500ms, 1s, 2s)
- Each retry emits a WARN log with the attempt count
- A TUI spinner message shows "Reconnecting to MCP server (attempt N/3)…"
- If all retries fail, behavior is unchanged from current (WARN + server unavailable)
- Config option mcp.max_connect_retries (default: 3) to allow project-specific tuning
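The configuration and spinner criteria above could be sketched as follows. This is an illustration under stated assumptions, not Zeph's actual config schema: `McpConnectConfig` and `spinner_message` are hypothetical names, and treating a value of 0 as "no retries" is a design assumption rather than something the issue specifies.

```rust
/// Hypothetical MCP connection settings. The field mirrors the proposed
/// `mcp.max_connect_retries` option; the struct itself is an assumption.
#[derive(Debug, Clone, PartialEq)]
struct McpConnectConfig {
    /// Retries after the initial attempt. Assumed: 0 means a single attempt.
    max_connect_retries: u32,
}

impl Default for McpConnectConfig {
    fn default() -> Self {
        // Default of 3 taken from the acceptance criteria.
        Self { max_connect_retries: 3 }
    }
}

/// Builds the spinner text for retry number `attempt` (1-based).
fn spinner_message(attempt: u32, config: &McpConnectConfig) -> String {
    format!(
        "Reconnecting to MCP server (attempt {attempt}/{})…",
        config.max_connect_retries
    )
}
```

Deriving the spinner denominator from the config value keeps the "N/3" text accurate if a project overrides the default.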