Lazy import of model_tools blocks asyncio event loop on first gateway message when an MCP server is slow/unreachable #16856

@GuidoE

Summary

model_tools.py runs discover_mcp_tools() as a module-level side effect (line 143). The gateway lazy-imports run_agent (which imports model_tools) the first time a user message reaches _handle_message_with_agent — meaning the very first message after gateway start triggers MCP discovery inside the asyncio event loop thread. Since _run_on_mcp_loop uses a blocking future.result(timeout=120) rather than await, this freezes the Discord/Telegram/etc. WebSocket heartbeat for up to 120 seconds whenever any configured MCP server is unreachable. After ~50s Discord force-closes the shard.

This is distinct from #10138 (which is about a nested-call deadlock inside register_mcp_servers). Even if #10138 is fixed, a slow/unreachable MCP server will still freeze the loop because the discovery is invoked synchronously from an async context.
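The stall mechanism can be reproduced outside Hermes. The following is a minimal sketch, not the project's code: `start_background_loop`, `slow_discovery`, `heartbeat`, `handler_blocking`, and `measure` are hypothetical stand-ins for the MCP loop, the discovery coroutine, the platform heartbeat, and the message handler. It measures how long a heartbeat task on the main loop is starved while a handler waits on `future.result(...)`.

```python
import asyncio
import threading
import time


def start_background_loop() -> asyncio.AbstractEventLoop:
    """Start a dedicated event loop in a daemon thread (stand-in for the MCP loop)."""
    loop = asyncio.new_event_loop()
    threading.Thread(target=loop.run_forever, daemon=True).start()
    return loop


async def slow_discovery(delay: float) -> str:
    """Stand-in for discovery against an unreachable server."""
    await asyncio.sleep(delay)
    return "done"


async def heartbeat(gaps: list, interval: float = 0.05) -> None:
    """Record how long each sleep really took; a large value means the loop was blocked."""
    while True:
        start = time.monotonic()
        await asyncio.sleep(interval)
        gaps.append(time.monotonic() - start)


async def handler_blocking(bg_loop: asyncio.AbstractEventLoop, delay: float) -> str:
    """Coroutine that waits synchronously, as in the traceback below."""
    future = asyncio.run_coroutine_threadsafe(slow_discovery(delay), bg_loop)
    # No await here: the calling loop's thread sleeps until the result arrives.
    return future.result(timeout=5)


async def measure(handler, delay: float = 0.5) -> float:
    """Run the handler next to a heartbeat and return the worst heartbeat gap."""
    bg_loop = start_background_loop()
    gaps: list = []
    hb = asyncio.create_task(heartbeat(gaps))
    await asyncio.sleep(0.1)  # let the heartbeat tick a few times first
    await handler(bg_loop, delay)
    await asyncio.sleep(0.1)  # let the delayed heartbeat tick get recorded
    hb.cancel()
    bg_loop.call_soon_threadsafe(bg_loop.stop)
    return max(gaps)


if __name__ == "__main__":
    worst = asyncio.run(measure(handler_blocking))
    print(f"worst heartbeat gap: {worst:.2f}s (interval was 0.05s)")
```

With a 0.5s stand-in delay, the worst gap should come out at roughly the full delay rather than the 0.05s interval. Scaled to a 120s timeout, that is the same starvation the Discord heartbeat warning reports.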

Reproduction

  1. Add an unreachable MCP server URL to config.yaml:
    mcp_servers:
      unreachable:
        url: http://10.99.99.99:9999/mcp
  2. Start the gateway. Startup discovery completes after a short retry window and logs: MCP: registered N tool(s) from M server(s) (1 failed).
  3. Send the first Discord/Telegram message after gateway start.
  4. Within ~10s, the platform logs "Shard ID None heartbeat blocked for more than 10 seconds." The warning repeats every 10s with an increasing duration. The first message hangs for ~120s until either the agent responds or the shard reconnects.

A subsequent message in the same gateway process is fine — model_tools is now imported and the side-effect doesn't re-run.
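The one-shot behaviour follows from Python's module cache: a module body runs only on the first import, and later imports are served from `sys.modules`. A small sketch, using a throwaway module named `demo_side_effect` (hypothetical, created in a temp directory purely for illustration):

```python
import contextlib
import importlib
import io
import sys
import tempfile
from pathlib import Path


def count_import_side_effects() -> int:
    """Import a throwaway module twice and count how often its body ran."""
    buf = io.StringIO()
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical module whose top level has a side effect,
        # analogous to the discover_mcp_tools() call in model_tools.py.
        Path(tmp, "demo_side_effect.py").write_text('print("discovery ran")\n')
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            with contextlib.redirect_stdout(buf):
                importlib.import_module("demo_side_effect")  # body executes
                importlib.import_module("demo_side_effect")  # served from sys.modules
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("demo_side_effect", None)
    return buf.getvalue().count("discovery ran")


if __name__ == "__main__":
    print(count_import_side_effects())
```

This is also why the stall is easy to miss in testing: only the first message after each gateway start pays the cost.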

Stack trace (Hermes 0.11.0 / v2026.4.23, Python 3.11.15)

2026-04-28 05:54:59 WARNING discord.gateway: Shard ID None heartbeat blocked for more than 40 seconds.
Loop thread traceback (most recent call last):
  ...
  File "gateway/platforms/base.py", line 2072, in _process_message_background
    response = await self._message_handler(event)
  File "gateway/run.py", line 3871, in _handle_message
    return await self._handle_message_with_agent(...)
  File "gateway/run.py", line 4516, in _handle_message_with_agent
    agent_result = await self._run_agent(...)
  File "gateway/run.py", line 9334, in _run_agent
    from run_agent import AIAgent              # lazy import
  File "run_agent.py", line 67, in <module>
    from model_tools import (...)              # transitive
  File "model_tools.py", line 143, in <module>
    discover_mcp_tools()                       # module-level side effect
  File "tools/mcp_tool.py", line 2455, in discover_mcp_tools
    tool_names = register_mcp_servers(servers)
  File "tools/mcp_tool.py", line 2408, in register_mcp_servers
    _run_on_mcp_loop(_discover_all(), timeout=120)
  File "tools/mcp_tool.py", line 1577, in _run_on_mcp_loop
    return future.result(timeout=wait_timeout)  # BLOCKS asyncio loop
  File ".../concurrent/futures/_base.py", line 451, in result
    self._condition.wait(timeout)

Why it manifests now

In a clean dev session, MCP discovery has already happened at gateway startup, so the lazy import on first message is cheap. The bug surfaces when:

  • An MCP server is configured but unreachable (network timeout, dead host, wrong port, etc.) — startup discovery records "(1 failed)" but doesn't blacklist it, and
  • The lazy import path re-invokes discover_mcp_tools which retries the failed server with the full 120s budget.

I'd guess most users haven't hit this because their MCP servers are local/reachable.

Suggested fixes

Either of these resolves the symptom; ideally both:

  1. Remove the module-level call. model_tools.py:143 calling discover_mcp_tools() at import is a side effect that's unsafe from any async context. Discovery already runs at gateway startup; a second invocation from within a message handler shouldn't be needed. If a re-discovery hook is wanted, expose it as an explicit function and call it from a non-async lifecycle event.

  2. Make _run_on_mcp_loop async-aware. When called from an event loop, schedule the coroutine and await the future via asyncio.wrap_future rather than future.result(timeout=...). Today's blocking-wait pattern silently freezes whatever loop happens to be running.
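Fix 2 could look roughly like the sketch below. It is an illustration under assumptions, not the actual `tools/mcp_tool.py` code: `_get_mcp_loop` and `run_on_mcp_loop_async` are hypothetical names, and the real helper may manage its loop differently. The idea is that a caller already running inside an event loop awaits the cross-thread future through `asyncio.wrap_future` instead of calling `future.result(timeout=...)`.

```python
import asyncio
import threading
from typing import Any, Coroutine, Optional

_mcp_loop: Optional[asyncio.AbstractEventLoop] = None
_mcp_loop_lock = threading.Lock()


def _get_mcp_loop() -> asyncio.AbstractEventLoop:
    """Lazily start a dedicated loop thread (hypothetical stand-in for the MCP loop)."""
    global _mcp_loop
    with _mcp_loop_lock:
        if _mcp_loop is None:
            _mcp_loop = asyncio.new_event_loop()
            threading.Thread(target=_mcp_loop.run_forever, daemon=True).start()
        return _mcp_loop


async def run_on_mcp_loop_async(coro: Coroutine[Any, Any, Any], timeout: float) -> Any:
    """Run `coro` on the MCP loop and await it without blocking the caller's loop."""
    future = asyncio.run_coroutine_threadsafe(coro, _get_mcp_loop())
    # wrap_future turns the concurrent.futures.Future into an awaitable bound to
    # the running loop, so heartbeats and other tasks keep running while we wait.
    # On timeout, wait_for cancels the wrapper, which cancels the remote task too.
    return await asyncio.wait_for(asyncio.wrap_future(future), timeout=timeout)
```

A coroutine such as the message handler would then call `await run_on_mcp_loop_async(_discover_all(), timeout=120)`. Note that module-level code cannot `await`, so this variant only helps call sites that are already coroutines. The import-time call at model_tools.py:143 would still need fix 1, or the import would have to happen off the loop thread.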

Workaround

Remove the slow/unreachable server from mcp_servers in config.yaml. Discovery then completes in ~2s, so the import-time call returns fast enough not to trip the heartbeat watchdog. This is what we did locally.
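To find which entries to remove, a quick TCP probe with a short timeout is usually enough. The helper below is a hypothetical sketch, not part of Hermes. It assumes `mcp_servers` has already been loaded into a dict shaped like the config.yaml snippet in the reproduction steps (name mapped to a dict with a `url` key), and it skips entries without a `url`. A successful TCP connect only shows that the port accepts connections, not that the MCP handshake works.

```python
import socket
from urllib.parse import urlsplit


def is_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    try:
        parts = urlsplit(url)
        if not parts.hostname:
            return False
        port = parts.port or (443 if parts.scheme == "https" else 80)
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return True
    except (OSError, ValueError):
        # OSError covers timeouts, refused connections, and DNS failures;
        # ValueError covers a malformed port in the URL.
        return False


def unreachable_servers(servers: dict, timeout: float = 2.0) -> list:
    """Names of URL-based servers that fail the TCP probe."""
    return [
        name
        for name, cfg in servers.items()
        if "url" in cfg and not is_reachable(cfg["url"], timeout)
    ]


if __name__ == "__main__":
    config = {"unreachable": {"url": "http://10.99.99.99:9999/mcp"}}
    print(unreachable_servers(config))
```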

Environment

  • Hermes Agent v0.11.0 (v2026.4.23)
  • Python 3.11.15 on Linux (Debian/LXC)
  • Gateway: hermes-gateway systemd user service
  • Platform: Discord (discord.py); the same blocking-wait pattern would affect any platform whose handler runs in the asyncio loop

Metadata

Labels

  • P1: High (major feature broken, no workaround)
  • comp/gateway: Gateway runner, session dispatch, delivery
  • comp/tools: Tool registry, model_tools, toolsets
  • tool/mcp: MCP client and OAuth
  • type/bug: Something isn't working
