research(mcp): MCP server startup auto-retry — retry transient connection errors up to 3 times before marking disconnected #3568

## Description

Claude Code (v2.1.122, May 2026) added automatic retry logic for MCP servers that hit a transient error during startup: instead of immediately marking the server as disconnected, it retries the connection up to 3 times before giving up.

## Current Zeph Behavior

In `zeph-mcp/src/lifecycle.rs`, MCP server startup is attempted exactly once. If that initial connection fails (e.g., the server process takes slightly too long to start, a transient socket error occurs, or the connect races with process startup), the server is marked unavailable for the rest of the session. The user sees a single startup warning, and the server's tools are then missing with no further notice.

This is a UX pain point when:
- The MCP server is a slow-starting subprocess (e.g., a Python server that needs to warm up)
- There is a transient OS-level socket error at startup
- The server is a remote HTTP MCP endpoint under temporary load

## Proposed Fix

In `McpLifecycle::connect_server()`, wrap the initial connect attempt in a retry loop:

```rust
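use std::time::Duration;

// Retries allowed after the initial attempt (4 attempts in total).
const MAX_RETRIES: u32 = 3;
// Delay before the first retry; doubled for each subsequent retry.
const RETRY_DELAY: Duration = Duration::from_millis(500);

let mut attempt = 0;
loop {
    match self.try_connect(entry).await {
        Ok(client) => return Ok(client),
        Err(e) if attempt < MAX_RETRIES => {
            attempt += 1;
            // Exponential backoff: 500ms, 1s, 2s.
            let delay = RETRY_DELAY * 2u32.pow(attempt - 1);
            tracing::warn!(attempt, "MCP server {} failed to connect, retrying in {delay:?}: {e}", entry.name);
            tokio::time::sleep(delay).await;
        }
        // Retries exhausted: surface the last error to the caller.
        Err(e) => return Err(e),
    }
}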
```

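Use exponential backoff (500ms, 1s, 2s), doubling the delay after each failed attempt, so a slow-starting server gets progressively more time to come up instead of being hit with rapid reconnects. Three delays imply three retries after the initial attempt, i.e. up to four connection attempts in total.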
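
The 500ms, 1s, 2s schedule can be checked in isolation from the async lifecycle code. The following is a minimal std-only sketch, not the Zeph implementation: `backoff_delay` and `retry_with_backoff` are hypothetical helper names, and the sleep function is passed in as a closure so the schedule can be asserted without actually waiting.

```rust
use std::time::Duration;

/// Base delay before the first retry (value taken from the proposal).
const RETRY_DELAY: Duration = Duration::from_millis(500);

/// Delay before retry number `retry` (1-based): 500ms, 1s, 2s, ...
/// Saturates instead of overflowing for very large retry counts.
fn backoff_delay(retry: u32) -> Duration {
    let factor = 2u32.saturating_pow(retry.saturating_sub(1));
    RETRY_DELAY.saturating_mul(factor)
}

/// Hypothetical helper: runs `op` once, then up to `max_retries` more
/// times, calling `sleep` with the backoff delay between attempts.
/// Returns the last error if every attempt fails.
fn retry_with_backoff<T, E>(
    max_retries: u32,
    mut op: impl FnMut() -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut retry = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(_) if retry < max_retries => {
                retry += 1;
                sleep(backoff_delay(retry));
            }
            // Retries exhausted: hand the last error back unchanged.
            Err(e) => return Err(e),
        }
    }
}
```

In the real async path the `sleep` closure would presumably be replaced by `tokio::time::sleep(...).await`. Injecting it here is only a testing convenience.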

## Acceptance Criteria

- After a failed initial attempt, the MCP server connection is retried up to 3 times with exponential backoff (500ms, 1s, 2s before retries 1, 2, and 3)
- Each retry emits a `WARN` log with attempt count
- A TUI spinner message shows "Reconnecting to MCP server <name> (attempt N/3)…"
- If all retries fail, behavior is unchanged from current (WARN + server unavailable)
- Config option `mcp.max_connect_retries` (default: 3) to allow project-specific tuning
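
As a rough illustration of how the config default and the spinner text in the criteria above might fit together, here is a std-only sketch. `McpSettings` and `reconnect_message` are hypothetical names, not existing Zeph types, and how the value is actually loaded from the project config is left out.

```rust
/// Hypothetical `mcp` settings; only the field named in the criteria is shown.
#[derive(Debug, Clone, PartialEq, Eq)]
struct McpSettings {
    /// Number of retries after the initial connect attempt.
    max_connect_retries: u32,
}

impl Default for McpSettings {
    fn default() -> Self {
        // Default of 3 comes from the acceptance criteria.
        Self { max_connect_retries: 3 }
    }
}

/// Builds the spinner text for retry `retry` (1-based) of server `name`.
fn reconnect_message(name: &str, retry: u32, settings: &McpSettings) -> String {
    format!(
        "Reconnecting to MCP server {name} (attempt {retry}/{})…",
        settings.max_connect_retries
    )
}
```

Deriving the "N/3" denominator from the setting rather than hard-coding it keeps the message accurate when a project overrides `mcp.max_connect_retries`.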

## References

- Claude Code v2.1.122 changelog
- Related: #3315 (parallel MCP startup), zeph-mcp lifecycle.rs
