Summary
Long-lived MCP HTTP sessions become stale after extended idle periods (observed ~12h) because _wait_for_lifecycle_event() blocks indefinitely without generating any keepalive traffic. The next tool call after the idle period fails silently with an empty error message.
Environment
- Hermes version: v2026.4.23
- MCP SDK version: >= 1.24.0 (new HTTP API)
- Transport: HTTP/StreamableHTTP via streamable_http_client
- Deployment: Kubernetes with MCP sidecar (supergateway wrapping stdio server)
Observed Behavior
- MCP server connects successfully, tools discovered at 20:51
- No MCP tool calls for ~12 hours
- First tool call at 09:33 fails with empty error:
ERROR tools.mcp_tool: MCP tool canny/canny_get_post call failed:
- Subsequent calls also fail until pod restart
Root Cause Analysis
In tools/mcp_tool.py, the _run_http() method:
async with httpx.AsyncClient(**client_kwargs) as http_client:
    async with streamable_http_client(url, http_client=http_client) as (...):
        async with ClientSession(read_stream, write_stream, ...) as session:
            await session.initialize()
            self.session = session
            await self._discover_tools()
            self._ready.set()
            reason = await self._wait_for_lifecycle_event()  # ← blocks forever
The _wait_for_lifecycle_event() method blocks indefinitely waiting for shutdown/reconnect signals. During this time:
- No reads/writes occur on the httpx connection
- The read=300.0 timeout only applies to active reads, not idle connections
- TCP keepalives at the OS/LB level eventually time out (~2h default)
- The socket becomes stale, but Hermes doesn't detect it
When the next tool call arrives, httpx attempts to use the dead socket and fails at the connection level (before any HTTP exchange), producing an empty error.
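One plausible explanation for the empty message (an assumption, not confirmed against the Hermes source) is that connection-level exceptions raised without arguments stringify to an empty string, so a log call that formats only the exception text prints nothing after the colon. A minimal sketch of a more diagnosable format, using a hypothetical helper named describe_exception:

```python
def describe_exception(exc: BaseException) -> str:
    """Return a non-empty description even when str(exc) is empty.

    Hypothetical helper for illustration; not part of Hermes or the MCP SDK.
    """
    message = str(exc)
    name = type(exc).__name__
    return f"{name}: {message}" if message else name


# An exception raised without arguments stringifies to "".
stale = ConnectionResetError()
print(repr(str(stale)))           # what a bare "%s" of the exception would log
print(describe_exception(stale))  # includes the exception type instead
```

Logging the exception type alongside the message would at least distinguish a reset socket from a timeout in the ERROR line shown under Observed Behavior.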
Proposed Fix
Add a periodic health check inside _wait_for_lifecycle_event() to exercise the connection:
async def _wait_for_lifecycle_event(self) -> str:
    """Block until shutdown, reconnect, or keepalive interval."""
    KEEPALIVE_INTERVAL = 180  # 3 minutes
    shutdown_task = asyncio.create_task(self._shutdown_event.wait())
    reconnect_task = asyncio.create_task(self._reconnect_event.wait())
    try:
        while True:
            done, pending = await asyncio.wait(
                {shutdown_task, reconnect_task},
                timeout=KEEPALIVE_INTERVAL,
                return_when=asyncio.FIRST_COMPLETED,
            )
            if done:
                break
            # Keepalive: exercise the connection
            if self.session:
                try:
                    await asyncio.wait_for(
                        self.session.list_tools(),
                        timeout=30.0
                    )
                except Exception as exc:
                    logger.warning(
                        "MCP server '%s' keepalive failed, triggering reconnect: %s",
                        self.name, exc
                    )
                    self._reconnect_event.set()
                    return "reconnect"
    finally:
        for t in (shutdown_task, reconnect_task):
            if not t.done():
                t.cancel()
                try:
                    await t
                except (asyncio.CancelledError, Exception):
                    pass
    if self._shutdown_event.is_set():
        return "shutdown"
    self._reconnect_event.clear()
    return "reconnect"
Alternative: Config-driven keepalive
Add a keepalive_interval config option per MCP server:
mcp_servers:
  my_server:
    url: "http://localhost:3001/mcp"
    keepalive_interval: 180  # seconds, 0 to disable
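A possible way to interpret that option, sketched under assumptions: the helper name resolve_keepalive_interval and the default constant are hypothetical, and the per-server config is assumed to arrive as a plain mapping. Returning None for 0 is convenient because asyncio.wait() treats timeout=None as "wait indefinitely", which preserves today's behavior when the keepalive is disabled.

```python
from typing import Any, Mapping, Optional

# Hypothetical default, matching the 180 s interval in the proposed fix.
DEFAULT_KEEPALIVE_INTERVAL = 180.0


def resolve_keepalive_interval(server_cfg: Mapping[str, Any]) -> Optional[float]:
    """Return the keepalive interval in seconds, or None when disabled (0)."""
    raw = server_cfg.get("keepalive_interval", DEFAULT_KEEPALIVE_INTERVAL)
    interval = float(raw)
    if interval < 0:
        raise ValueError("keepalive_interval must be >= 0")
    return interval if interval > 0 else None


example_cfg = {"url": "http://localhost:3001/mcp", "keepalive_interval": 180}
print(resolve_keepalive_interval(example_cfg))
print(resolve_keepalive_interval({"keepalive_interval": 0}))
```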
Workaround (Does NOT Work)
Until this is fixed, users can work around it with a cron job that periodically calls an MCP tool:
cron:
  mcp-keepalive:
    schedule: "*/3 * * * *"
    prompt: "Call <mcp_tool> to verify connection. Only respond if error."
    silent: true
Update: This workaround does not work. The _servers dict holding MCP sessions is module-level state within the gateway process. Cron jobs spawn separate hermes subprocesses with their own isolated _servers = {}. Each cron execution:
- Starts a new process with empty _servers
- Establishes a fresh MCP connection
- Calls the tool successfully
- Exits — discarding the connection
The gateway's long-lived _servers["canny"].session remains stale because cron jobs never touch the gateway's event loop or session state. The keepalive must happen inside _wait_for_lifecycle_event() within the gateway process itself.
Impact
- Severity: Medium — MCP tools become unavailable after idle periods
- Frequency: Affects any deployment with HTTP MCP servers and gaps > ~2h between tool calls
- Recovery: Automatic reconnect logic exists but isn't triggered (no exception thrown until tool call)
Related
- Circuit breaker logic in _bump_server_error() / _reset_server_error() handles repeated failures but doesn't prevent the initial stale connection
- Reconnect logic in run() handles exceptions properly, just needs a trigger
Root cause analysis assisted by GitHub Copilot CLI