Summary
model_tools.py runs discover_mcp_tools() as a module-level side effect (line 143). The gateway lazy-imports run_agent (which imports model_tools) the first time a user message reaches _handle_message_with_agent — meaning the very first message after gateway start triggers MCP discovery inside the asyncio event loop thread. Since _run_on_mcp_loop uses a blocking future.result(timeout=120) rather than await, this freezes the Discord/Telegram/etc. WebSocket heartbeat for up to 120 seconds whenever any configured MCP server is unreachable. After ~50s Discord force-closes the shard.
This is distinct from #10138 (which is about a nested-call deadlock inside register_mcp_servers). Even if #10138 is fixed, a slow/unreachable MCP server will still freeze the loop because the discovery is invoked synchronously from an async context.
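The freeze can be reproduced in isolation. The following is a minimal sketch, not Hermes code: `slow_discovery`, `heartbeat`, `handler_blocking`, and `measure_max_gap` are hypothetical names, and the delays are shortened to fractions of a second. It only assumes the pattern visible in the traceback, namely a coroutine that schedules work on a second loop and then waits on it with a blocking `future.result(...)`.

```python
import asyncio
import threading
import time


def start_background_loop() -> asyncio.AbstractEventLoop:
    """Start a second event loop in a daemon thread (stand-in for the MCP loop)."""
    loop = asyncio.new_event_loop()
    threading.Thread(target=loop.run_forever, daemon=True).start()
    return loop


async def slow_discovery(delay: float) -> str:
    # Stand-in for a connection attempt to an unreachable MCP server.
    await asyncio.sleep(delay)
    return "done"


async def heartbeat(gaps: list, interval: float) -> None:
    # Stand-in for the platform's WebSocket heartbeat task.
    last = time.monotonic()
    while True:
        await asyncio.sleep(interval)
        now = time.monotonic()
        gaps.append(now - last)  # time since the previous beat
        last = now


async def handler_blocking(bg_loop: asyncio.AbstractEventLoop, delay: float) -> str:
    # Same shape as the traceback: a synchronous wait inside a coroutine.
    future = asyncio.run_coroutine_threadsafe(slow_discovery(delay), bg_loop)
    return future.result(timeout=5)  # blocks the calling event loop thread


async def measure_max_gap(delay: float = 0.5, interval: float = 0.05) -> float:
    """Return the longest observed gap between two heartbeats."""
    bg_loop = start_background_loop()
    gaps: list = []
    hb = asyncio.create_task(heartbeat(gaps, interval))
    await asyncio.sleep(interval * 3)       # a few normal beats
    await handler_blocking(bg_loop, delay)  # the loop is frozen here
    await asyncio.sleep(interval * 3)       # beats resume
    hb.cancel()
    bg_loop.call_soon_threadsafe(bg_loop.stop)
    return max(gaps)
```

Running `asyncio.run(measure_max_gap())` should report a longest gap close to the blocking delay rather than the heartbeat interval, which is the same effect the Discord heartbeat watchdog is warning about, just at a smaller scale.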
Reproduction
- Add an unreachable MCP server URL to config.yaml:
    mcp_servers:
      unreachable:
        url: http://10.99.99.99:9999/mcp
- Start the gateway. Discovery succeeds at startup (logs "MCP: registered N tool(s) from M server(s) (1 failed)" after a short retry window).
- Send the first Discord/Telegram message after gateway start.
- Within ~10s, the platform logs "Shard ID None heartbeat blocked for more than 10 seconds." Heartbeat-block warnings escalate every 10s. The first message hangs for ~120s, after which either the response arrives or the shard reconnects.
A subsequent message in the same gateway process is fine — model_tools is now imported and the side-effect doesn't re-run.
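The "only the first message" behaviour follows from Python's import caching: a module body runs once, when the module is first imported, and later `import` statements reuse the entry in `sys.modules`. A small sketch of that mechanism, using a generated throwaway module (`fake_model_tools` and `handle_message` are hypothetical names, and the log file stands in for the module-level `discover_mcp_tools()` call):

```python
import importlib
import sys
import tempfile
from pathlib import Path


def handle_message() -> None:
    # Lazy import, as in the gateway's message path. The module body runs
    # only if the module is not already present in sys.modules.
    import fake_model_tools  # noqa: F401  (hypothetical module name)


def count_side_effects(messages: int = 3) -> int:
    """Handle several 'messages' and count how often the module body ran."""
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "discovery.log"
        # The generated module appends one line to the log at import time,
        # standing in for a module-level side effect.
        (Path(tmp) / "fake_model_tools.py").write_text(
            f"with open({str(log)!r}, 'a') as fh:\n"
            "    fh.write('discover\\n')\n"
        )
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            for _ in range(messages):
                handle_message()
            return len(log.read_text().splitlines())
        finally:
            # Clean up so the demo can be repeated in the same process.
            sys.path.remove(tmp)
            sys.modules.pop("fake_model_tools", None)
```

`count_side_effects()` returns 1 regardless of how many messages are handled, which matches the observation that only the first message after a gateway start pays the discovery cost.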
Stack trace (Hermes 0.11.0 / v2026.4.23, Python 3.11.15)
2026-04-28 05:54:59 WARNING discord.gateway: Shard ID None heartbeat blocked for more than 40 seconds.
Loop thread traceback (most recent call last):
  ...
  File "gateway/platforms/base.py", line 2072, in _process_message_background
    response = await self._message_handler(event)
  File "gateway/run.py", line 3871, in _handle_message
    return await self._handle_message_with_agent(...)
  File "gateway/run.py", line 4516, in _handle_message_with_agent
    agent_result = await self._run_agent(...)
  File "gateway/run.py", line 9334, in _run_agent
    from run_agent import AIAgent  # lazy import
  File "run_agent.py", line 67, in <module>
    from model_tools import (...)  # transitive
  File "model_tools.py", line 143, in <module>
    discover_mcp_tools()  # module-level side effect
  File "tools/mcp_tool.py", line 2455, in discover_mcp_tools
    tool_names = register_mcp_servers(servers)
  File "tools/mcp_tool.py", line 2408, in register_mcp_servers
    _run_on_mcp_loop(_discover_all(), timeout=120)
  File "tools/mcp_tool.py", line 1577, in _run_on_mcp_loop
    return future.result(timeout=wait_timeout)  # BLOCKS asyncio loop
  File ".../concurrent/futures/_base.py", line 451, in result
    self._condition.wait(timeout)
Why it manifests now
In a clean dev session, MCP discovery has already happened at gateway startup, so the lazy import on first message is cheap. The bug surfaces when:
- An MCP server is configured but unreachable (network timeout, dead host, wrong port, etc.) — startup discovery records "(1 failed)" but doesn't blacklist it, and
- The lazy import path re-invokes discover_mcp_tools, which retries the failed server with the full 120s budget.
I'd guess most users haven't hit this because their MCP servers are local/reachable.
Suggested fixes
Either of these resolves the symptom; ideally both:
- Remove the module-level call. model_tools.py:143 calling discover_mcp_tools() at import is a side effect that's unsafe from any async context. Discovery already runs at gateway startup; a second invocation from within a message handler shouldn't be needed. If a re-discovery hook is wanted, expose it as an explicit function and call it from a non-async lifecycle event.
- Make _run_on_mcp_loop async-aware. When called from an event loop, schedule the coroutine and await the future via asyncio.wrap_future rather than future.result(timeout=...). Today's blocking-wait pattern silently freezes whatever loop happens to be running.
Workaround
Remove the slow/unreachable server from mcp_servers in config.yaml. Discovery completes in ~2s and the import-time call returns fast enough not to trip the heartbeat watchdog. This is what we did locally.
Environment
- Hermes Agent v0.11.0 (v2026.4.23)
- Python 3.11.15 on Linux (Debian/LXC)
- Gateway: hermes-gateway systemd user service
- Platform: Discord (discord.py); the same blocking-wait pattern would affect any platform whose handler runs in the asyncio loop