perf(tui): stop slow/dead MCP servers from freezing TUI startup #35245
Closed
kshitijk4poor wants to merge 1 commit into
The 'summoning hermes…' phase blocked on gateway.ready, which ran MCP tool discovery inline. Any configured-but-unreachable MCP server burned its full connect-retry backoff (1+2+4s ≈ 7s) before the composer appeared: startup went from instant to ~7.5s of dead air for anyone with a down stdio/http server in mcp_servers.

Move discovery into a background daemon thread so gateway.ready fires immediately; tools register into the shared registry as servers connect, and the agent isn't built until the first prompt. Measured spawn→ready: ~7500ms → ~115ms (dead twozero_td server in config).

Also drop rich.console + prompt_toolkit off banner.py's import path (lazy-imported inside cprint/build_welcome_banner). tui_gateway.server imports banner only to reach the lightweight prefetch_update_check helper; the eager rich/prompt_toolkit imports added ~45ms before gateway.ready for no benefit. tui_gateway.server import: ~115ms → ~69ms.
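The lazy-import change described above can be illustrated with a minimal sketch. It is not the project's banner.py: the standard-library textwrap module stands in for rich.console / prompt_toolkit so the snippet runs anywhere, and the function bodies are hypothetical. Only the names prefetch_update_check and build_welcome_banner are taken from the description.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated by type checkers only, so it adds no import cost at runtime.
    # Stand-in for the Console annotation mentioned in the description.
    from textwrap import TextWrapper


def prefetch_update_check() -> str:
    """Lightweight helper: importing this module for it pulls in nothing heavy."""
    return "update check scheduled"


def build_welcome_banner(text: str, wrapper: "TextWrapper | None" = None) -> str:
    """Render a banner; the 'heavy' dependency is imported on first call."""
    if wrapper is None:
        # Lazy import: resolved at call time, not at module import time.
        from textwrap import TextWrapper

        wrapper = TextWrapper(width=20)
    return "\n".join(wrapper.wrap(text))
```

A caller that only needs prefetch_update_check never pays for the deferred import; the cost moves to the first build_welcome_banner call. The annotation is written as a string so that it is never evaluated at runtime.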
ca3a51c to dd5bbdb
Contributor
Merged via PR #35273. Your commit was cherry-picked onto current main with your authorship preserved in git log (rebase-merge). Thanks for the clean fix and the bounded-wait design: it correctly keeps the gateway.ready speedup while making sure fast MCP servers still land in the tool snapshot.
Problem

The TUI's summoning hermes… phase blocks on gateway.ready, which ran MCP tool discovery inline in tui_gateway/entry.py:main(). Any configured-but-unreachable MCP server burns its full connect-retry backoff (1 + 2 + 4s ≈ 7s) before the composer appears.

Repro: put a down stdio/http server in mcp_servers (here: twozero_td on a dead port), launch hermes --tui. Measured spawn → gateway.ready: ~7,500ms before this change, ~115ms after.

Fix
1. Background MCP discovery (the 7s). Move discover_mcp_tools() into a daemon thread so gateway.ready fires immediately. Discovery is already idempotent and _lock-guarded and registers tools into the shared registry as servers connect; a down server keeps retrying in the background instead of stalling the shell. The MCP-SDK cold-start import guard (skip the ~200ms SDK import when no mcp_servers are configured) is preserved, so the no-MCP common case pays nothing.

Backgrounding discovery means it no longer completes before the first agent build. Since AIAgent snapshots its tool list once at build (agent/agent_init.py) and never re-reads the registry, two follow-ons keep that snapshot correct:

1a. Bounded wait before the first agent build. entry.py publishes the discovery thread handle; _make_agent calls wait_for_mcp_discovery(timeout=0.75) before constructing the agent. This lets already-spawning fast servers land in the snapshot while a slow/dead server is never waited on past the bound, so a dead server cannot re-introduce the startup hang. The wait is only ever paid on the first prompt during a still-in-flight discovery; it's a no-op once discovery has finished, and a no-op (≈1µs) for users with no mcp_servers (no thread is ever created). gateway.ready itself is untouched, so the full ~65× speedup stands.

1b. /reload-mcp now actually refreshes the cached agent. The TUI /reload-mcp handler previously guarded on hasattr(agent, "refresh_tools"), but no such method exists, so it was dead code and the only recovery for a late-connecting server was /new (which discards history). Replaced it with a real agent.tools / valid_tool_names rebuild via get_tool_definitions(...), mirroring gateway/run.py::_execute_mcp_reload. A server that connects later in the session is now picked up by /reload-mcp without losing conversation history. (The user has already consented to the prompt-cache invalidation via the existing confirm gate.)

The background-discovery failure path also now logs (logger.warning(..., exc_info=True)) instead of swallowing silently; a detached-thread exception was previously invisible.

2. Lazy banner imports (~45ms bonus). tui_gateway.server imports hermes_cli.banner purely to reach the lightweight prefetch_update_check helper, but banner eagerly imported rich.console + prompt_toolkit at module level: ~45ms of wasted imports on the critical path. Made them lazy (imported inside cprint / build_welcome_banner); Console annotation moved under TYPE_CHECKING. import tui_gateway.server: ~115ms → ~69ms.

Verification
- gateway.ready: ~7,500ms → ~115ms (with a dead MCP server configured); unchanged by the bounded-wait/reload follow-ons, which live on the first-prompt build path, not the gateway.ready path
- import tui_gateway.server: ~115ms → ~69ms
- hermes_cli.banner no longer pulls rich / prompt_toolkit; cprint + build_welcome_banner still render correctly (lazy imports resolve at call time)
- wait_for_mcp_discovery verified: no-op with no thread, joins a fast/finished thread immediately, bounded (≈0.3s with a 0.3s timeout, not forever) on a hung thread; new tests in tests/tui_gateway/test_wait_for_mcp_discovery.py
- /reload-mcp rebuilds the cached agent's tool snapshot (no /new required for late-connecting servers)
- tests/tui_gateway/ (incl. the 4 new wait_for_mcp_discovery tests + existing test_make_agent_provider, test_entry_sys_path), tests/tools/test_mcp_tool.py + test_mcp_dynamic_discovery.py, tests/hermes_cli/test_banner.py, test_update_check.py, test_cmd_update.py. Ruff clean; no new ty diagnostics.

Net diff: 4 files, +192/-28.
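The background-discovery and bounded-wait behaviour described in this PR can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in entry.py: the module-level handle _discovery_thread, the helper start_background_discovery, and the boolean return value are hypothetical. Only the name wait_for_mcp_discovery and its timeout=0.75 default come from the description.

```python
import threading
import time
from typing import Callable, Optional

# Hypothetical published handle; stays None when no MCP servers are configured,
# in which case no thread is ever created.
_discovery_thread: Optional[threading.Thread] = None


def start_background_discovery(discover: Callable[[], object]) -> threading.Thread:
    """Run discovery in a daemon thread so startup does not block on it."""
    global _discovery_thread
    thread = threading.Thread(target=discover, name="mcp-discovery", daemon=True)
    thread.start()
    _discovery_thread = thread
    return thread


def wait_for_mcp_discovery(timeout: float = 0.75) -> bool:
    """Wait at most `timeout` seconds for discovery; return True if it finished.

    The return value is an assumption of this sketch, added for easy checking.
    """
    thread = _discovery_thread
    if thread is None:
        return True  # nothing was started, so there is nothing to wait for
    thread.join(timeout)  # returns at once if the thread has already finished
    return not thread.is_alive()
```

A fast server finishes inside the bound and lands in the first tool snapshot; a hung one costs at most the timeout:

```python
registry = []
start_background_discovery(lambda: registry.append("fast_tool"))
wait_for_mcp_discovery()          # joins the finished thread immediately

hung = threading.Event()
start_background_discovery(lambda: hung.wait(5))
wait_for_mcp_discovery(0.3)       # gives up after roughly 0.3s
hung.set()
```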