
perf(tui): stop slow/dead MCP servers from freezing TUI startup #35245

Closed
kshitijk4poor wants to merge 1 commit into NousResearch:main from kshitijk4poor:perf/tui-startup-lazy-banner

kshitijk4poor (Collaborator) commented May 30, 2026

Problem

The TUI's 'summoning hermes…' phase blocks on gateway.ready, and gateway.ready used to wait for MCP tool discovery, which ran inline in tui_gateway/entry.py:main(). Any configured-but-unreachable MCP server burned its full connect-retry backoff (1 + 2 + 4 s ≈ 7 s) before the composer appeared.

Repro: put a down stdio/http server in mcp_servers (here: twozero_td on a dead port), launch hermes --tui. Measured spawn → gateway.ready:

~7,500 ms   (before)
~115 ms     (after)   ← ~65× faster
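
The two figures above can be sanity-checked with a few lines of arithmetic. This is only a sketch: the doubling schedule with three sleeps is inferred from the "1 + 2 + 4s" in the description, and the helper names `total_backoff` and `speedup` are hypothetical, not part of the codebase.

```python
def total_backoff(initial_s: float = 1.0, factor: float = 2.0, sleeps: int = 3) -> float:
    """Sum the sleeps of an exponential backoff schedule (assumed: 1s, 2s, 4s)."""
    return sum(initial_s * factor**i for i in range(sleeps))


def speedup(before_ms: float, after_ms: float) -> float:
    """Ratio of the 'before' latency to the 'after' latency."""
    return before_ms / after_ms


backoff_s = total_backoff()      # 1 + 2 + 4
ratio = speedup(7500, 115)       # measured spawn -> gateway.ready, in ms
print(f"backoff: {backoff_s:.0f}s, speedup: ~{ratio:.0f}x")
```

The backoff sum accounts for about 7 s of the ~7.5 s measured before the change; the remainder is ordinary startup work.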

Fix

1. Background MCP discovery (the 7s). Move discover_mcp_tools() into a daemon thread so gateway.ready fires immediately. Discovery is already idempotent and _lock-guarded, and it registers tools into the shared registry as servers connect; a down server keeps retrying in the background instead of stalling the shell. The MCP-SDK cold-start import guard (skip the ~200ms SDK import when no mcp_servers are configured) is preserved, so the common no-MCP case pays nothing.
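
A minimal sketch of the pattern described above, not the actual entry.py code. The registry shape, `register_tool`, `start_background_discovery`, and `fake_discover` are all hypothetical stand-ins; the real discover_mcp_tools() signature is not shown in this PR description.

```python
import threading
from typing import Callable, Optional

_registry_lock = threading.Lock()
tool_registry: dict = {}  # hypothetical shape: tool name -> server name


def register_tool(name: str, server: str) -> None:
    """Lock-guarded registration, so repeated or concurrent discovery is safe."""
    with _registry_lock:
        tool_registry[name] = server


def start_background_discovery(
    config: dict, discover: Callable[[dict], None]
) -> Optional[threading.Thread]:
    """Start discovery on a daemon thread; return None when nothing is configured."""
    servers = config.get("mcp_servers") or {}
    if not servers:
        # Guard: with no servers configured, no thread is created and
        # (in the real code) the MCP SDK import would be skipped too.
        return None
    thread = threading.Thread(
        target=discover, args=(servers,), name="mcp-discovery", daemon=True
    )
    thread.start()
    return thread


def fake_discover(servers: dict) -> None:
    """Stand-in for real discovery: registers each server's listed tools."""
    for server, tools in servers.items():
        for tool in tools:
            register_tool(tool, server)


no_mcp = start_background_discovery({}, fake_discover)
handle = start_background_discovery(
    {"mcp_servers": {"demo": ["search", "fetch"]}}, fake_discover
)
handle.join(timeout=1.0)
print(sorted(tool_registry))
```

Because the thread is a daemon, a server that never connects cannot keep the process alive at exit.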

Backgrounding discovery means it no longer completes before the first agent build. Since AIAgent snapshots its tool list once at build (agent/agent_init.py) and never re-reads the registry, two follow-ons keep that snapshot correct:

1a. Bounded wait before the first agent build. entry.py publishes the discovery thread handle; _make_agent calls wait_for_mcp_discovery(timeout=0.75) before constructing the agent. This lets already-spawning fast servers land in the snapshot while a slow/dead server is never waited on past the bound — so a dead server cannot re-introduce the startup hang. The wait is only ever paid on the first prompt during a still-in-flight discovery; it's a no-op once discovery has finished, and a no-op (≈1µs) for users with no mcp_servers (no thread is ever created). gateway.ready itself is untouched — the full ~65× speedup stands.
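
The bounded wait can be sketched with a plain `Thread.join(timeout=...)`. This is an illustration under assumptions: the module-level handle, `publish_discovery_thread`, and the boolean return value are hypothetical; only the name wait_for_mcp_discovery and the 0.75 s default come from the description above.

```python
import threading
import time
from typing import Optional

_discovery_thread: Optional[threading.Thread] = None  # hypothetical published handle


def publish_discovery_thread(thread: Optional[threading.Thread]) -> None:
    """Record the (already started) discovery thread so later callers can wait on it."""
    global _discovery_thread
    _discovery_thread = thread


def wait_for_mcp_discovery(timeout: float = 0.75) -> bool:
    """Wait at most `timeout` seconds; return True if discovery is done (assumed return type)."""
    thread = _discovery_thread
    if thread is None:
        return True  # no MCP servers configured: nothing to wait for
    thread.join(timeout=timeout)  # returns immediately if the thread already finished
    return not thread.is_alive()


# Simulate a hung discovery: the thread blocks until `stop` is set.
stop = threading.Event()
hung = threading.Thread(target=stop.wait, daemon=True)
hung.start()
publish_discovery_thread(hung)

start = time.monotonic()
finished = wait_for_mcp_discovery(timeout=0.1)
elapsed = time.monotonic() - start
stop.set()  # let the simulated discovery finish
print(f"finished={finished}, waited about {elapsed:.1f}s")
```

The join never blocks past the timeout, which is what keeps a dead server from reintroducing the hang on the first prompt.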

1b. /reload-mcp now actually refreshes the cached agent. The TUI /reload-mcp handler previously guarded on hasattr(agent, "refresh_tools"), but no such method exists — so it was dead code and the only recovery for a late-connecting server was /new (which discards history). Replaced it with a real agent.tools / valid_tool_names rebuild via get_tool_definitions(...), mirroring gateway/run.py::_execute_mcp_reload. A server that connects later in the session is now picked up by /reload-mcp without losing conversation history. (The user has already consented to the prompt-cache invalidation via the existing confirm gate.)
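
The rebuild can be pictured as replacing two attributes on the cached agent while leaving its history alone. The sketch below is hypothetical throughout: `CachedAgent`, `lookup_tool_definitions` (a stand-in for get_tool_definitions, whose real arguments are not shown here), and the OpenAI-style definition shape are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CachedAgent:
    """Hypothetical stand-in for the cached agent and its tool snapshot."""
    tools: list = field(default_factory=list)
    valid_tool_names: set = field(default_factory=set)
    history: list = field(default_factory=list)


def lookup_tool_definitions(registry: dict) -> list:
    """Stand-in for get_tool_definitions(...): build definitions from a registry."""
    return [
        {"type": "function", "function": {"name": name, "description": desc}}
        for name, desc in sorted(registry.items())
    ]


def reload_agent_tools(agent: CachedAgent, registry: dict) -> None:
    """Refresh the agent's tool snapshot in place; conversation history is untouched."""
    definitions = lookup_tool_definitions(registry)
    agent.tools = definitions
    agent.valid_tool_names = {d["function"]["name"] for d in definitions}


registry = {"read_file": "Read a file"}
agent = CachedAgent(history=["user: hello"])
reload_agent_tools(agent, registry)

# A server connects later in the session and registers a new tool.
registry["mcp_search"] = "Search via a late-connecting MCP server"
reload_agent_tools(agent, registry)
print(sorted(agent.valid_tool_names), agent.history)
```

Rebuilding in place is what lets a late-connecting server show up without the history loss that /new would cause.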

The background-discovery failure path also now logs (logger.warning(..., exc_info=True)) instead of swallowing silently — a detached-thread exception was previously invisible.
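
As a rough illustration of that logging change, the thread target can wrap discovery in a try/except that logs with the traceback attached. The logger name, `run_discovery_logged`, `broken_discover`, and the capturing handler are hypothetical; only the `logger.warning(..., exc_info=True)` call mirrors the description.

```python
import logging
import threading

logger = logging.getLogger("mcp_discovery_sketch")  # hypothetical logger name
logger.propagate = False  # keep the demo output quiet


class ListHandler(logging.Handler):
    """Collects log records in memory so the demo can inspect them."""

    def __init__(self) -> None:
        super().__init__()
        self.records = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


def run_discovery_logged(discover) -> None:
    """Thread target: a failure is logged with its traceback instead of being swallowed."""
    try:
        discover()
    except Exception:
        logger.warning("Background MCP discovery failed", exc_info=True)


def broken_discover() -> None:
    raise ConnectionError("server unreachable")


handler = ListHandler()
logger.addHandler(handler)

worker = threading.Thread(target=run_discovery_logged, args=(broken_discover,), daemon=True)
worker.start()
worker.join()
print(handler.records[0].getMessage())
```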

2. Lazy banner imports (~45ms bonus). tui_gateway.server imports hermes_cli.banner purely to reach the lightweight prefetch_update_check helper, but banner eagerly imported rich.console + prompt_toolkit at module level — ~45ms of wasted imports on the critical path. Made them lazy (imported inside cprint / build_welcome_banner); Console annotation moved under TYPE_CHECKING. import tui_gateway.server: ~115ms → ~69ms.
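
The lazy-import pattern looks roughly like the sketch below. To keep it runnable without third-party packages, the stdlib `fractions` module stands in for rich.console and prompt_toolkit, and `cheap_helper` and `render_ratio` are hypothetical names, not functions from hermes_cli.banner.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by type checkers; this import never runs at runtime.
    from fractions import Fraction


def cheap_helper() -> str:
    """Lightweight helper: importing this module to reach it triggers no heavy import."""
    return "ok"


def render_ratio(text: str) -> "Fraction":
    """Heavier path: the import cost is paid on first call, not at module import."""
    from fractions import Fraction  # deferred import

    return Fraction(text)


print(cheap_helper(), render_ratio("3/6"))
```

The string annotation `"Fraction"` is what allows the return type to be declared while the name exists only under TYPE_CHECKING.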

Verification

  • spawn → gateway.ready: ~7,500ms → ~115ms (with a dead MCP server configured) — unchanged by the bounded-wait/reload follow-ons, which live on the first-prompt build path, not the gateway.ready path
  • First-prompt agent build: no-op (~1µs) for no-MCP users; waits only until fast servers land for reachable MCP; bounded at 0.75s for a slow/dead server
  • import tui_gateway.server: ~115ms → ~69ms
  • Importing hermes_cli.banner no longer pulls rich/prompt_toolkit; cprint + build_welcome_banner still render correctly (lazy imports resolve at call time)
  • wait_for_mcp_discovery verified: no-op with no thread, joins a fast/finished thread immediately, bounded (≈0.3s with a 0.3s timeout, not forever) on a hung thread — new tests in tests/tui_gateway/test_wait_for_mcp_discovery.py
  • /reload-mcp rebuilds the cached agent's tool snapshot (no /new required for late-connecting servers)
  • Tests: 330 pass — tests/tui_gateway/ (incl. the 4 new wait_for_mcp_discovery tests + existing test_make_agent_provider, test_entry_sys_path), tests/tools/test_mcp_tool.py + test_mcp_dynamic_discovery.py, tests/hermes_cli/test_banner.py, test_update_check.py, test_cmd_update.py. Ruff clean; no new ty diagnostics.

Net diff: 4 files, +192/-28.

alt-glitch added labels on May 30, 2026: type/perf (Performance improvement or optimization), comp/tui (Terminal UI: ui-tui/ + tui_gateway/), tool/mcp (MCP client and OAuth), P2 (Medium: degraded but workaround exists)

The 'summoning hermes…' phase blocked on gateway.ready, which ran MCP
tool discovery inline. Any configured-but-unreachable MCP server burned
its full connect-retry backoff (1+2+4s ≈ 7s) before the composer
appeared — startup went from instant to ~7.5s of dead air for anyone
with a down stdio/http server in mcp_servers.

Move discovery into a background daemon thread so gateway.ready fires
immediately; tools register into the shared registry as servers connect,
and the agent isn't built until the first prompt. Measured spawn→ready:
~7500ms → ~115ms (dead twozero_td server in config).

Also drop rich.console + prompt_toolkit off banner.py's import path
(lazy-imported inside cprint/build_welcome_banner). tui_gateway.server
imports banner only to reach the lightweight prefetch_update_check
helper; the eager rich/pt imports added ~45ms before gateway.ready for
no benefit. tui_gateway.server import: ~115ms → ~69ms.

teknium1 (Contributor) commented

Merged via PR #35273. Your commit was cherry-picked onto current main with your authorship preserved in git log (rebase-merge). Thanks for the clean fix and the bounded-wait design — it correctly keeps the gateway.ready speedup while making sure fast MCP servers still land in the tool snapshot.
