research(mcp): MCP server startup auto-retry — retry transient connection errors up to 3 times before marking disconnected #3568

@bug-ops

Description

Claude Code (v2.1.122, May 2026) added automatic retry logic for MCP servers that hit a transient error during startup: instead of immediately marking the server as disconnected, it retries the connection up to 3 times before giving up.

Current Zeph Behavior

In zeph-mcp/src/lifecycle.rs, MCP server startup is attempted exactly once. If the initial connection fails (e.g., the server process takes slightly too long to start, a transient socket error occurs, or the connect races with the server process coming up), the server is marked unavailable for the whole session. The user sees a single startup warning, and the server's tools are then missing with no further indication.

This is a UX pain point when:

  • The MCP server is a slow-starting subprocess (e.g., a Python server that needs to warm up)
  • There is a transient OS-level socket error at startup
  • The server is a remote HTTP MCP endpoint under temporary load
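The cases above are all transient, while a failure such as a missing server binary is not worth retrying. A minimal sketch of that distinction follows, assuming the connect error can be mapped to a std::io::Error. The issue does not show Zeph's real error type, and is_transient is a hypothetical helper, not an existing function in zeph-mcp.

```rust
use std::io::{Error, ErrorKind};

/// Hypothetical helper: returns true for error kinds that commonly clear up
/// on their own (server still starting, socket hiccup, endpoint under load).
fn is_transient(err: &Error) -> bool {
    matches!(
        err.kind(),
        ErrorKind::ConnectionRefused   // server not listening yet
            | ErrorKind::ConnectionReset
            | ErrorKind::ConnectionAborted
            | ErrorKind::TimedOut      // slow start or temporary load
            | ErrorKind::Interrupted
            | ErrorKind::WouldBlock
    )
}
```

Errors such as ErrorKind::NotFound or ErrorKind::PermissionDenied fall through to false here, so a misconfigured server would fail fast instead of consuming the retry budget.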

Proposed Fix

In McpLifecycle::connect_server(), wrap the initial connect attempt in a retry loop:

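use std::time::Duration;

const MAX_RETRIES: u32 = 3;
const RETRY_DELAY: Duration = Duration::from_millis(500);

// One initial attempt, then up to MAX_RETRIES retries.
let mut attempt = 0;
loop {
    match self.try_connect(entry).await {
        Ok(client) => return Ok(client),
        Err(e) if attempt < MAX_RETRIES => {
            attempt += 1;
            // Exponential backoff: the delay doubles on each retry (500ms, 1s, 2s).
            let delay = RETRY_DELAY * 2u32.pow(attempt - 1);
            tracing::warn!(attempt, "MCP server {} failed to connect, retrying in {delay:?}: {e}", entry.name);
            tokio::time::sleep(delay).await;
        }
        // Retries exhausted: surface the last error to the caller.
        Err(e) => return Err(e),
    }
}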

Use exponential backoff (500ms, 1s, 2s between attempts) so that a slow-starting server gets progressively more time to come up instead of being hit with immediate reconnect attempts.
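The backoff schedule can be checked in isolation. The sketch below is a synchronous, standard-library-only illustration rather than Zeph's code: backoff_delay and retry_with_backoff are hypothetical names, and the injected sleep closure stands in for tokio::time::sleep so the schedule can be tested without actually waiting.

```rust
use std::time::Duration;

/// Delay before retry number `retry` (1-based): base, 2*base, 4*base, ...
/// Saturating arithmetic avoids overflow for large retry counts.
fn backoff_delay(base: Duration, retry: u32) -> Duration {
    let factor = 2u32.saturating_pow(retry.saturating_sub(1));
    base.saturating_mul(factor)
}

/// Runs `op` once, then up to `max_retries` more times while it keeps failing.
/// `sleep` is injected so callers (and tests) control how waiting happens.
fn retry_with_backoff<T, E>(
    max_retries: u32,
    base: Duration,
    mut op: impl FnMut() -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut retry = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(_) if retry < max_retries => {
                retry += 1;
                sleep(backoff_delay(base, retry));
            }
            // Retries exhausted: return the last error unchanged.
            Err(e) => return Err(e),
        }
    }
}
```

With max_retries = 3 and a 500ms base, an operation that always fails is called four times in total and sleeps for 500ms, 1s, and 2s in between, which matches the schedule described above.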

Acceptance Criteria

  • MCP server connection is retried up to 3 times with exponential backoff (500ms, 1s, 2s)
  • Each retry emits a WARN log with attempt count
  • A TUI spinner message shows "Reconnecting to MCP server (attempt N/3)…"
  • If all retries fail, behavior is unchanged from current (WARN + server unavailable)
  • Config option mcp.max_connect_retries (default: 3) to allow project-specific tuning
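The config and TUI criteria above could be wired together roughly as follows. This is a sketch under assumptions: only the option name mcp.max_connect_retries, its default of 3, and the spinner text come from this issue, while the McpConfig struct and the reconnect_message function are hypothetical and may not match how Zeph's config or TUI layers are actually structured.

```rust
/// Hypothetical stand-in for the `mcp` config section.
#[derive(Debug, Clone, PartialEq)]
struct McpConfig {
    /// Maps to `mcp.max_connect_retries`.
    max_connect_retries: u32,
}

impl Default for McpConfig {
    fn default() -> Self {
        // Default of 3 retries, as proposed in the acceptance criteria.
        Self { max_connect_retries: 3 }
    }
}

/// Builds the spinner text for retry number `attempt` (1-based).
fn reconnect_message(attempt: u32, config: &McpConfig) -> String {
    format!(
        "Reconnecting to MCP server (attempt {attempt}/{})…",
        config.max_connect_retries
    )
}
```

Deriving the "N/3" denominator from the config value keeps the spinner text correct when a project overrides the retry count.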

Metadata

Labels

P3 (Research, medium-high complexity), research (Research-driven improvement)
