# MCP HTTP connections go stale after extended idle periods (#17003)

## Summary

Long-lived MCP HTTP sessions become stale after extended idle periods (observed ~12h) because `_wait_for_lifecycle_event()` blocks indefinitely without generating any keepalive traffic. The next tool call after the idle period fails with an empty error message, and subsequent calls keep failing until the pod is restarted.

## Environment

- Hermes version: v2026.4.23
- MCP SDK version: >= 1.24.0 (new HTTP API)
- Transport: HTTP/StreamableHTTP via `streamable_http_client`
- Deployment: Kubernetes with MCP sidecar (supergateway wrapping stdio server)

## Observed Behavior

1. MCP server connects successfully, tools discovered at 20:51
2. No MCP tool calls for ~12 hours
3. First tool call at 09:33 fails with empty error:
   ```
   ERROR tools.mcp_tool: MCP tool canny/canny_get_post call failed: 
   ```
4. Subsequent calls also fail until pod restart

## Root Cause Analysis

In `tools/mcp_tool.py`, the `_run_http()` method:

```python
async with httpx.AsyncClient(**client_kwargs) as http_client:
    async with streamable_http_client(url, http_client=http_client) as (...):
        async with ClientSession(read_stream, write_stream, ...) as session:
            await session.initialize()
            self.session = session
            await self._discover_tools()
            self._ready.set()
            reason = await self._wait_for_lifecycle_event()  # ← blocks forever
```

The `_wait_for_lifecycle_event()` method blocks indefinitely waiting for shutdown/reconnect signals. During this time:

- No reads/writes occur on the httpx connection
- The `read=300.0` timeout only applies to active reads, not idle connections
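- Idle connections are eventually dropped at the OS, NAT, or load-balancer level. OS TCP keepalive probes start only after ~2h by default (Linux `tcp_keepalive_time` is 7200 s), and only if `SO_KEEPALIVE` is enabled on the socket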
- The socket becomes stale, but Hermes doesn't detect it

When the next tool call arrives, httpx attempts to use the dead socket and fails at the connection level (before any HTTP exchange). The logged error is blank, most likely because the raised exception has an empty `str()`.
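Connection-level exceptions are often raised without a message, so formatting them with `%s` or `{exc}` yields nothing after the colon. The logging call in `tools/mcp_tool.py` is not shown in this issue, so the following is only a minimal sketch of the effect; `describe_exception` and `ClosedResourceError` are hypothetical names used for illustration.

```python
def describe_exception(exc: BaseException) -> str:
    """Return a non-empty description even when str(exc) is empty."""
    message = str(exc)
    name = type(exc).__name__
    return f"{name}: {message}" if message else name


class ClosedResourceError(Exception):
    """Stand-in for a connection-level error raised without a message."""


try:
    raise ClosedResourceError()
except Exception as exc:
    # Formatting the exception directly loses all information.
    plain = f"call failed: {exc}"
    # Including the type name keeps the log line actionable.
    described = f"call failed: {describe_exception(exc)}"

print(repr(plain))      # 'call failed: '
print(repr(described))  # 'call failed: ClosedResourceError'
```

Logging the exception type (or using `%r`) would make the stale-connection failure visible in the log instead of blank.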

## Proposed Fix

Add a periodic health check inside `_wait_for_lifecycle_event()` to exercise the connection:

```python
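async def _wait_for_lifecycle_event(self) -> str:
    """Block until shutdown or reconnect, probing the connection while idle.

    Returns "shutdown" or "reconnect". A failed keepalive probe is
    reported as "reconnect".
    """
    KEEPALIVE_INTERVAL = 180  # seconds (3 minutes)
    KEEPALIVE_TIMEOUT = 30.0  # seconds allowed for a single probe

    shutdown_task = asyncio.create_task(self._shutdown_event.wait())
    reconnect_task = asyncio.create_task(self._reconnect_event.wait())

    try:
        while True:
            # Wake on shutdown/reconnect, or after KEEPALIVE_INTERVAL of idle time.
            done, _pending = await asyncio.wait(
                {shutdown_task, reconnect_task},
                timeout=KEEPALIVE_INTERVAL,
                return_when=asyncio.FIRST_COMPLETED,
            )
            if done:
                break

            # Keepalive: exercise the connection with a cheap request.
            if self.session:
                try:
                    await asyncio.wait_for(
                        self.session.list_tools(),
                        timeout=KEEPALIVE_TIMEOUT,
                    )
                except Exception as exc:
                    # %r keeps the exception type visible when str(exc) is empty.
                    logger.warning(
                        "MCP server '%s' keepalive failed, triggering reconnect: %r",
                        self.name, exc,
                    )
                    # Leave the loop without setting _reconnect_event: a set
                    # event would outlive this call and, if the event is reused,
                    # make the next connection's wait return immediately.
                    break
    finally:
        # Cancel whichever waiters are still pending so they do not leak.
        for t in (shutdown_task, reconnect_task):
            if not t.done():
                t.cancel()
                try:
                    await t
                except (asyncio.CancelledError, Exception):
                    pass

    if self._shutdown_event.is_set():
        return "shutdown"
    self._reconnect_event.clear()
    return "reconnect"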
```
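The loop pattern can be tried in isolation without an MCP server. The sketch below is a standalone reduction, not Hermes code: `FakeSession` and `wait_with_keepalive` are hypothetical names, only the shutdown event is modelled, and the interval is shortened so the script finishes quickly.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for an MCP session; fails after `ok_calls` probes."""

    def __init__(self, ok_calls: int) -> None:
        self.ok_calls = ok_calls
        self.calls = 0

    async def list_tools(self) -> list:
        self.calls += 1
        if self.calls > self.ok_calls:
            raise ConnectionError("stale socket")
        return []


async def wait_with_keepalive(session, shutdown: asyncio.Event,
                              interval: float, probe_timeout: float) -> str:
    """Return "shutdown" when the event fires, "reconnect" when a probe fails."""
    shutdown_task = asyncio.create_task(shutdown.wait())
    try:
        while True:
            # Times out every `interval` seconds unless shutdown fires first.
            done, _ = await asyncio.wait({shutdown_task}, timeout=interval)
            if done:
                return "shutdown"
            try:
                await asyncio.wait_for(session.list_tools(), timeout=probe_timeout)
            except Exception:
                return "reconnect"
    finally:
        # No-op if the task already finished.
        shutdown_task.cancel()


async def demo() -> tuple:
    session = FakeSession(ok_calls=2)
    reason = await wait_with_keepalive(session, asyncio.Event(), 0.01, 1.0)
    return reason, session.calls


result = asyncio.run(demo())
print(result)  # ('reconnect', 3)
```

Two probes succeed and the third raises, so the wait ends with `"reconnect"` instead of blocking forever. If the pinned MCP SDK exposes `ClientSession.send_ping()`, that may be a lighter probe than `list_tools()`; this has not been verified against the Hermes code.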

### Alternative: Config-driven keepalive

Add a `keepalive_interval` config option per MCP server:

```yaml
mcp_servers:
  my_server:
    url: "http://localhost:3001/mcp"
    keepalive_interval: 180  # seconds, 0 to disable
```
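One way the option might be read is sketched below. How Hermes actually loads per-server config is not shown in this issue, so `resolve_keepalive_interval`, `DEFAULT_KEEPALIVE_INTERVAL`, and the plain-dict config shape are assumptions for illustration only.

```python
from typing import Optional

# Assumed default, taken from the 180-second value in the proposed fix.
DEFAULT_KEEPALIVE_INTERVAL = 180.0


def resolve_keepalive_interval(server_cfg: dict) -> Optional[float]:
    """Return the keepalive interval in seconds, or None when disabled."""
    raw = server_cfg.get("keepalive_interval", DEFAULT_KEEPALIVE_INTERVAL)
    try:
        interval = float(raw)
    except (TypeError, ValueError):
        raise ValueError(
            f"keepalive_interval must be a number, got {raw!r}"
        ) from None
    if interval < 0:
        raise ValueError("keepalive_interval must be >= 0")
    # 0 means "disabled"; map it to None.
    return interval or None


print(resolve_keepalive_interval({"url": "http://localhost:3001/mcp"}))  # 180.0
print(resolve_keepalive_interval({"keepalive_interval": 0}))             # None
```

Returning `None` for the disabled case is convenient because `asyncio.wait(..., timeout=None)` waits indefinitely, which matches the current behavior.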

## Workaround (Does NOT Work)

~~Until this is fixed, users can work around it with a cron job that periodically calls an MCP tool:~~

```yaml
cron:
  mcp-keepalive:
    schedule: "*/3 * * * *"
    prompt: "Call <mcp_tool> to verify connection. Only respond if error."
    silent: true
```

**Update:** This workaround does **not** work. The `_servers` dict holding MCP sessions is module-level state within the gateway process. Cron jobs spawn **separate** `hermes` subprocesses with their own isolated `_servers = {}`. Each cron execution:

1. Starts a new process with empty `_servers`
2. Establishes a fresh MCP connection
3. Calls the tool successfully
4. Exits — discarding the connection

The gateway's long-lived `_servers["canny"].session` remains stale because cron jobs never touch the gateway's event loop or session state. **The keepalive must happen inside `_wait_for_lifecycle_event()` within the gateway process itself.**

## Impact

- **Severity**: Medium — MCP tools become unavailable after idle periods
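- **Frequency**: Likely affects any deployment with HTTP MCP servers and long gaps between tool calls (observed at ~12h; the ~2h threshold is inferred from TCP keepalive defaults)
- **Recovery**: None short of a pod restart. Automatic reconnect logic exists but is not triggered, apparently because the failure surfaces in the tool call rather than in the lifecycle wait that drives reconnects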

## Related

- Circuit breaker logic in `_bump_server_error()` / `_reset_server_error()` handles repeated failures but doesn't prevent the initial stale connection
- Reconnect logic in `run()` handles exceptions properly, just needs a trigger

---

*Root cause analysis assisted by GitHub Copilot CLI*
