MCP HTTP connections go stale after extended idle periods #17003

@grubmeshi

Summary

Long-lived MCP HTTP sessions go stale after extended idle periods (observed after ~12h) because _wait_for_lifecycle_event() blocks indefinitely without generating any keepalive traffic. The next tool call after the idle period fails, and the logged error message is empty.

Environment

  • Hermes version: v2026.4.23
  • MCP SDK version: >= 1.24.0 (new HTTP API)
  • Transport: HTTP/StreamableHTTP via streamable_http_client
  • Deployment: Kubernetes with MCP sidecar (supergateway wrapping stdio server)

Observed Behavior

  1. MCP server connects successfully, tools discovered at 20:51
  2. No MCP tool calls for ~12 hours
  3. First tool call at 09:33 fails with empty error:
    ERROR tools.mcp_tool: MCP tool canny/canny_get_post call failed: 
    
  4. Subsequent calls also fail until pod restart

Root Cause Analysis

In tools/mcp_tool.py, the _run_http() method:

async with httpx.AsyncClient(**client_kwargs) as http_client:
    async with streamable_http_client(url, http_client=http_client) as (...):
        async with ClientSession(read_stream, write_stream, ...) as session:
            await session.initialize()
            self.session = session
            await self._discover_tools()
            self._ready.set()
            reason = await self._wait_for_lifecycle_event()  # ← blocks forever

The _wait_for_lifecycle_event() method blocks indefinitely waiting for shutdown/reconnect signals. During this time:

  • No reads/writes occur on the httpx connection
  • The read=300.0 timeout only applies to active reads, not idle connections
  • TCP keepalives at the OS/LB level eventually time out (~2h default)
  • The socket becomes stale, but Hermes doesn't detect it

When the next tool call arrives, httpx attempts to use the dead socket and fails at the connection level (before any HTTP exchange), producing an empty error.
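The empty message in the log above is consistent with an exception whose string form is empty, which is common for timeout and connection-level errors. The sketch below assumes the log call formats the exception with %s (not confirmed against the Hermes source); describe_exception is a hypothetical helper, not an existing Hermes function.

```python
def describe_exception(exc: BaseException) -> str:
    """Return a log-friendly description that is never empty."""
    message = str(exc)
    name = type(exc).__name__
    # Many transport-level exceptions carry no message, so str(exc) == "".
    # Falling back to the class name keeps the log line informative.
    return f"{name}: {message}" if message else name


if __name__ == "__main__":
    # A bare TimeoutError has an empty string form, like the failure in the log.
    print(repr(str(TimeoutError())))
    print(describe_exception(TimeoutError()))
    print(describe_exception(ValueError("boom")))
```

Logging with %r, or with a helper like this, would at least expose the exception type when the message is empty.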

Proposed Fix

Add a periodic health check inside _wait_for_lifecycle_event() to exercise the connection:

async def _wait_for_lifecycle_event(self) -> str:
    """Block until shutdown or reconnect, probing the connection every KEEPALIVE_INTERVAL seconds."""
    KEEPALIVE_INTERVAL = 180  # 3 minutes
    
    shutdown_task = asyncio.create_task(self._shutdown_event.wait())
    reconnect_task = asyncio.create_task(self._reconnect_event.wait())
    
    try:
        while True:
            done, pending = await asyncio.wait(
                {shutdown_task, reconnect_task},
                timeout=KEEPALIVE_INTERVAL,
                return_when=asyncio.FIRST_COMPLETED,
            )
            
            if done:
                break
                
            # Keepalive: exercise the connection
            if self.session:
                try:
                    await asyncio.wait_for(
                        self.session.list_tools(),
                        timeout=30.0
                    )
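                except Exception as exc:
                    # %r keeps the exception type visible even when str(exc) is empty
                    logger.warning(
                        "MCP server '%s' keepalive failed, triggering reconnect: %r",
                        self.name, exc
                    )
                    # Leave via the shared exit path so the reconnect flag is
                    # cleared before returning; returning here would leave it
                    # set, and the next wait would likely fire immediately.
                    self._reconnect_event.set()
                    break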
    finally:
        for t in (shutdown_task, reconnect_task):
            if not t.done():
                t.cancel()
                try:
                    await t
                except (asyncio.CancelledError, Exception):
                    pass

    if self._shutdown_event.is_set():
        return "shutdown"
    self._reconnect_event.clear()
    return "reconnect"
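The loop pattern can be exercised in isolation, without an MCP server. The following is a minimal sketch, not the Hermes implementation: wait_with_keepalive, FlakyServer, and demo are hypothetical names, and the probe stands in for session.list_tools(). A failed probe leaves the loop through the same exit path as a real reconnect signal, so the reconnect flag is always cleared before the function returns.

```python
import asyncio


async def wait_with_keepalive(shutdown, reconnect, probe, interval):
    """Return "shutdown" or "reconnect"; call probe() every `interval` seconds while idle."""
    tasks = {
        asyncio.create_task(shutdown.wait()),
        asyncio.create_task(reconnect.wait()),
    }
    try:
        while True:
            done, _ = await asyncio.wait(
                tasks, timeout=interval, return_when=asyncio.FIRST_COMPLETED
            )
            if done:
                break  # a lifecycle event fired
            try:
                # Timeout expired with no event: exercise the connection.
                await asyncio.wait_for(probe(), timeout=interval)
            except Exception:
                # A failed probe is treated like a reconnect request.
                reconnect.set()
                break
    finally:
        for task in tasks:
            task.cancel()
        # Collect the cancelled tasks without re-raising their CancelledError.
        await asyncio.gather(*tasks, return_exceptions=True)

    if shutdown.is_set():
        return "shutdown"
    reconnect.clear()
    return "reconnect"


class FlakyServer:
    """Stand-in for an MCP session: the probe succeeds twice, then fails."""

    def __init__(self):
        self.calls = 0

    async def probe(self):
        self.calls += 1
        if self.calls > 2:
            raise ConnectionError("stale connection")


async def demo():
    shutdown, reconnect = asyncio.Event(), asyncio.Event()
    server = FlakyServer()
    reason = await wait_with_keepalive(shutdown, reconnect, server.probe, interval=0.01)
    return reason, server.calls, reconnect.is_set()


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this simulation the third probe fails, so the wait ends with "reconnect" after three probe calls and the reconnect flag is already cleared for the next connection attempt.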

Alternative: Config-driven keepalive

Add a keepalive_interval config option per MCP server:

mcp_servers:
  my_server:
    url: "http://localhost:3001/mcp"
    keepalive_interval: 180  # seconds, 0 to disable
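One way the option could feed into the wait loop is sketched below. This assumes the per-server config is available as a plain mapping and that a missing key falls back to the 180-second value from the proposed fix; keepalive_timeout and DEFAULT_KEEPALIVE_INTERVAL are hypothetical names, not existing Hermes identifiers.

```python
from typing import Any, Mapping, Optional

# Assumed default, taken from the KEEPALIVE_INTERVAL in the proposed fix.
DEFAULT_KEEPALIVE_INTERVAL = 180.0  # seconds


def keepalive_timeout(server_cfg: Mapping[str, Any]) -> Optional[float]:
    """Translate a server's config into a timeout for asyncio.wait().

    Returns None when keepalive is disabled (interval 0), which makes
    asyncio.wait() block until a lifecycle event fires, i.e. the
    behavior without any keepalive.
    """
    raw = server_cfg.get("keepalive_interval", DEFAULT_KEEPALIVE_INTERVAL)
    interval = float(raw)
    if interval < 0:
        raise ValueError(f"keepalive_interval must be >= 0, got {raw!r}")
    # 0.0 is falsy, so a zero interval maps to None (disabled).
    return interval or None


if __name__ == "__main__":
    print(keepalive_timeout({"url": "http://localhost:3001/mcp"}))
    print(keepalive_timeout({"keepalive_interval": 0}))
```

Passing the result as the timeout argument of asyncio.wait() means a disabled keepalive needs no separate code path.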

Workaround (Does NOT Work)

An earlier version of this issue suggested working around the problem with a cron job that periodically calls an MCP tool:

cron:
  mcp-keepalive:
    schedule: "*/3 * * * *"
    prompt: "Call <mcp_tool> to verify connection. Only respond if error."
    silent: true

Update: This workaround does not work. The _servers dict holding MCP sessions is module-level state within the gateway process. Cron jobs spawn separate hermes subprocesses with their own isolated _servers = {}. Each cron execution:

  1. Starts a new process with empty _servers
  2. Establishes a fresh MCP connection
  3. Calls the tool successfully
  4. Exits — discarding the connection

The gateway's long-lived _servers["canny"].session remains stale because cron jobs never touch the gateway's event loop or session state. The keepalive must happen inside _wait_for_lifecycle_event() within the gateway process itself.

Impact

  • Severity: Medium — MCP tools become unavailable after idle periods
  • Frequency: Affects any deployment with HTTP MCP servers and gaps > ~2h between tool calls
  • Recovery: Automatic reconnect logic exists but isn't triggered (no exception thrown until tool call)

Related

  • Circuit breaker logic in _bump_server_error() / _reset_server_error() handles repeated failures but doesn't prevent the initial stale connection
  • Reconnect logic in run() handles exceptions properly, just needs a trigger

Root cause analysis assisted by GitHub Copilot CLI


Labels: P2 (Medium: degraded but workaround exists), comp/tools (Tool registry, model_tools, toolsets), tool/mcp (MCP client and OAuth), type/bug (Something isn't working)
