feat(mcp): MCP server startup auto-retry with exponential backoff #3578

Merged: bug-ops merged 1 commit into main from mcp-server-auto-retry on May 4, 2026

bug-ops (Owner) commented May 4, 2026

Summary

Changes

  • McpError::ManagerShuttingDown { server_id } — new terminal lifecycle variant, maps to None in code(), never retried; replaces string-literal guard in manager
  • retry_loop — generic CancellationToken-driven helper with closed-form backoff min(500 * 2^(n-1), 8000) ms; unit-testable via tokio::time::pause()
  • connect_with_retry — thin wrapper binding connect_entry into retry_loop
  • is_retryable_connect_error — exhaustive match on all McpError variants (compile error on new unclassified variants)
  • McpConfig::max_connect_attempts: u8 — validated at deserialization (rejects 0 and >10); migration step 40 renames max_connect_retries in existing configs
  • TUI spinner messages via existing status_tx channel (no TUI code change)
  • Integration test enforces max_connect_attempts presence across all three default.toml copies
  • add_server retry explicitly out of scope; tracked as follow-up (see security finding F1)
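
The interplay between `retry_loop` and `is_retryable_connect_error` described above can be sketched roughly as follows. This is a synchronous, std-only approximation, not the PR's code: the real helper is async and driven by a `CancellationToken`, whereas here an `AtomicBool` stands in for the token and is checked only before each attempt. The `McpError` enum is a hypothetical stand-in; `InvalidConfig` and the payload shapes are assumptions made for illustration, and the injected `sleep` closure is an assumed test seam.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::Duration;

/// Hypothetical stand-in for the crate's error type. Only the variants named
/// in the PR are modelled; `InvalidConfig` is an assumed non-retryable case.
#[allow(dead_code)] // not every variant is constructed in this sketch
#[derive(Debug, PartialEq)]
enum McpError {
    Connection(String),
    Timeout,
    InvalidConfig(String),
    ManagerShuttingDown { server_id: String },
}

/// No `_` arm: adding a variant is a compile error until it is classified.
fn is_retryable_connect_error(err: &McpError) -> bool {
    match err {
        McpError::Connection(_) | McpError::Timeout => true,
        McpError::InvalidConfig(_) | McpError::ManagerShuttingDown { .. } => false,
    }
}

/// min(500 * 2^(n-1), 8000) ms for the n-th wait (1-based).
fn backoff(n: u32) -> Duration {
    let exp = n.saturating_sub(1).min(16); // clamp so the shift cannot overflow
    Duration::from_millis((500u64 << exp).min(8_000))
}

/// Runs `op` up to `max_attempts` times. `sleep` is injected so that tests
/// can record the delays instead of actually waiting.
fn retry_loop<T>(
    server_id: &str,
    max_attempts: u32,
    cancelled: &AtomicBool,
    mut op: impl FnMut(u32) -> Result<T, McpError>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, McpError> {
    let mut attempt = 1;
    loop {
        // Shutdown is terminal and is never retried.
        if cancelled.load(Ordering::SeqCst) {
            return Err(McpError::ManagerShuttingDown {
                server_id: server_id.to_string(),
            });
        }
        match op(attempt) {
            Ok(value) => return Ok(value),
            Err(e) if attempt < max_attempts && is_retryable_connect_error(&e) => {
                sleep(backoff(attempt));
                attempt += 1;
            }
            // Non-retryable error, or the attempt budget is exhausted.
            Err(e) => return Err(e),
        }
    }
}

fn main() {
    let cancelled = AtomicBool::new(false);
    let mut waits = Vec::new();
    // Two timeouts, then success on the third attempt.
    let result = retry_loop(
        "demo",
        3,
        &cancelled,
        |attempt| if attempt < 3 { Err(McpError::Timeout) } else { Ok(attempt) },
        |d| waits.push(d),
    );
    println!("{result:?} after waits {waits:?}");

    // A non-retryable error is returned without any further attempt.
    let fatal: Result<(), McpError> = retry_loop(
        "demo",
        3,
        &cancelled,
        |_| Err(McpError::InvalidConfig("bad command".into())),
        |_| {},
    );
    println!("{fatal:?}");
}
```

Injecting the sleep keeps the sketch deterministic under test, which plays a similar role to `tokio::time::pause()` in the async version.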

Test plan

  • cargo nextest run -p zeph-mcp --lib --bins — 440 tests pass (7 new retry-specific unit tests)
  • cargo nextest run -p zeph-config --lib --bins — 342 tests pass (8 new config/migration tests + default.toml consistency integration test)
  • cargo +nightly fmt --check — clean
  • cargo clippy --workspace -- -D warnings — clean
  • Live test: configure a slow-starting MCP server subprocess, verify WARN logs with attempt count and TUI spinner appear before eventual connection or failure
  • Follow-up issue to be filed: HTTP 401/403 responses from remote MCP endpoints are wrapped into McpError::Connection and are therefore classified as retryable; recommend adding a structured HTTP status to the variant before extending retry to add_server

Closes #3568

github-actions bot added labels on May 4, 2026: documentation (Improvements or additions to documentation), rust (Rust code changes), core (zeph-core crate), dependencies (Dependency updates), config (Configuration file changes), enhancement (New feature or request), size/XL (Extra large PR, 500+ lines)

Retry transient MCP server connection errors (Connection, Timeout) up
to `mcp.max_connect_attempts` times (default 3, range 1–10) with
exponential backoff (500 ms, 1 s, 2 s, … capped at 8 s).
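
As a worked example of the schedule above, the closed form min(500 * 2^(n-1), 8000) ms can be written as a small pure function. This is a minimal sketch: the name `backoff_delay` is hypothetical, and it assumes `n` is the 1-based index of the wait rather than of the attempt.

```rust
use std::time::Duration;

/// Delay before the n-th wait (1-based): min(500 * 2^(n-1), 8000) ms.
fn backoff_delay(n: u32) -> Duration {
    const BASE_MS: u64 = 500;
    const CAP_MS: u64 = 8_000;
    // Clamp the exponent so the shift cannot overflow for large `n`.
    let exp = n.saturating_sub(1).min(16);
    Duration::from_millis((BASE_MS << exp).min(CAP_MS))
}

fn main() {
    for n in 1..=6 {
        // Prints 500, 1000, 2000, 4000, 8000, 8000 (in ms).
        println!("wait {n}: {} ms", backoff_delay(n).as_millis());
    }
}
```

With the default of 3 attempts there are at most two waits, so under this indexing the cap only comes into play when `max_connect_attempts` is raised to 6 or more.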

Key changes:
- New `McpError::ManagerShuttingDown` variant; exhaustive
  `is_retryable_connect_error` ensures all future variants are
  deliberately classified
- Generic `retry_loop` helper driven by a `CancellationToken`; uses a
  `biased` `tokio::select!` so shutdown interrupts any backoff sleep immediately
- `connect_with_retry` wraps `connect_entry` using `retry_loop`
- `McpConfig::max_connect_attempts` with deserialize-time validation
  (rejects 0 and >10 with a clear error); migration step 40 renames
  the old `max_connect_retries` key in existing configs
- TUI spinner messages emitted via existing `status_tx` channel:
  "Connecting to MCP server {id}..." / "Reconnecting … (attempt N/M)..."
- Integration test enforces `max_connect_attempts` presence across all
  three `default.toml` copies
- `add_server` retry is explicitly out of scope (tracked as follow-up)
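
The range rule for `max_connect_attempts` listed above (reject 0 and anything above 10) might look roughly like this. It is a std-only sketch: the function name and error wording are hypothetical, and the hook that runs this check during deserialization is not shown.

```rust
/// Hypothetical range check mirroring the rule stated in the commit
/// message: accepted values are 1 through 10 inclusive.
fn validate_max_connect_attempts(raw: u8) -> Result<u8, String> {
    if (1..=10).contains(&raw) {
        Ok(raw)
    } else {
        Err(format!(
            "mcp.max_connect_attempts must be between 1 and 10, got {raw}"
        ))
    }
}

fn main() {
    for raw in [0u8, 1, 3, 10, 11] {
        match validate_max_connect_attempts(raw) {
            Ok(n) => println!("{raw}: accepted ({n} attempts)"),
            Err(e) => println!("{raw}: rejected ({e})"),
        }
    }
}
```

Rejecting bad values at load time means the retry loop itself can assume a budget of at least one attempt.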

Closes #3568

bug-ops force-pushed the mcp-server-auto-retry branch from 164bacb to 4685ada on May 4, 2026 13:01
bug-ops enabled auto-merge (squash) on May 4, 2026 13:02
bug-ops merged commit b49cf0f into main on May 4, 2026
32 checks passed
bug-ops deleted the mcp-server-auto-retry branch on May 4, 2026 13:11

Development

Successfully merging this pull request may close these issues.

research(mcp): MCP server startup auto-retry — retry transient connection errors up to 3 times before marking disconnected
