Gateway leaks stdio-MCP subprocess children over time (orphan 'read stdin' blocked processes, unbounded RSS growth)

## Summary

`hermes_cli gateway run --replace` accumulates orphan stdio-MCP subprocess children over time. After roughly a day of normal use, `ps` showed **28 stdio MCP subprocesses**, all children of the current gateway PID, all blocked on `read` of stdin. Memory usage grows unbounded — one accumulated child reached ~300 MiB RSS.

The 2 launchd-managed services on the host were clean; only Hermes-spawned children leaked.

## Environment

- Host: macOS 15.6 arm64 (Apple Silicon Mac mini)
- Python: 3.11 (venv in `~/hermes-agent/venv`)
- `hermes_cli.main gateway run --replace`
- MCP server: a local stdio server (Rust binary), configured via `mcp_servers.<name>.command` + `args`
- Running gateway PID: stable (`--replace` is only called on upgrades)

## Observation

```
$ ps -eo pid,ppid,etime,rss,command | grep '[s]tdio-mcp-binary'
 4262 11745  09:53:24  2720 /path/to/stdio-mcp-binary serve
16169 11745 01-02:06:42 2448 /path/to/stdio-mcp-binary serve
18599 11745  08:00:41  2720 /path/to/stdio-mcp-binary serve
...  (26 more, same PPID 11745) ...
83682 11745  05:44    304544 /path/to/stdio-mcp-binary serve   # 300 MiB RSS
```

All 28 are children of the same gateway (PID 11745). `lsof` confirms each has:

- `fd 0` = PIPE (stdin from gateway)
- `fd 1` = PIPE (stdout to gateway)
- no listening TCP ports

These are genuine stdio-MCP sessions that were never torn down. The MCP binary itself behaves correctly: it exits cleanly on EOF on stdin. The children stay blocked because the gateway never closes its end of the stdin pipe.
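For reference, the exit-on-EOF behaviour can be reproduced in isolation. This is a minimal sketch, not taken from Hermes: it uses a Python one-liner as a stand-in for the MCP binary (an assumption), and `child_exits_on_stdin_eof` is a hypothetical helper name.

```python
import subprocess
import sys

# Stand-in for the stdio MCP binary (assumption): reads stdin until EOF, then exits.
CHILD_CODE = "import sys; sys.stdin.read()"


def child_exits_on_stdin_eof(timeout: float = 5.0) -> bool:
    """Return True if the child stays alive while stdin is open and exits 0 on EOF."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    try:
        # While the parent holds the write end open, the child cannot finish read().
        still_running = proc.poll() is None
        # Closing the parent's end of the pipe delivers EOF to the child.
        proc.stdin.close()
        returncode = proc.wait(timeout=timeout)
        return still_running and returncode == 0
    finally:
        # Safety net: never leave the stand-in child behind if wait() timed out.
        if proc.poll() is None:
            proc.kill()
            proc.wait()
        proc.stdout.close()
```

If the real binary behaves like this stand-in, any code path that drops a session without closing the pipe (or without terminating the child) would leave exactly the kind of blocked process shown in the `ps` output.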

## Suspected source

Looking at `tools/mcp_tool.py:_run_stdio`:

```python
async with stdio_client(server_params) as (read_stream, write_stream):
    new_pids = _snapshot_child_pids() - pids_before
    if new_pids:
        with _lock:
            _stdio_pids.update(new_pids)
    async with ClientSession(read_stream, write_stream, **sampling_kwargs) as session:
        await session.initialize()
        self.session = session
        await self._discover_tools()
        self._ready.set()
        await self._shutdown_event.wait()
# Context exited cleanly — subprocess was terminated by the SDK.
if new_pids:
    with _lock:
        _stdio_pids.difference_update(new_pids)
```

Cleanup relies on `self._shutdown_event.wait()` returning. If a superseded/stale `McpTool` instance is created (e.g. on config reload, reconnect, or because a delegate/sub-gateway instantiates its own tools) without the old instance's `_shutdown_event` being set, the old `async with stdio_client(...)` block never exits and the subprocess is never reaped.
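A minimal sketch of the invariant that seems to be missing, under stated assumptions: `StdioServer`, `ServerRegistry`, and `demo_reload` are hypothetical names, not Hermes code, and a Python one-liner stands in for the MCP binary. The point is only that whatever replaces an instance first sets the old instance's shutdown event and waits for its child to be reaped.

```python
import asyncio
import sys

# Stand-in for the stdio MCP binary (assumption): blocks on stdin, exits on EOF.
CHILD_CMD = [sys.executable, "-c", "import sys; sys.stdin.read()"]


class StdioServer:
    """Hypothetical owner of one stdio child; not the real McpTool."""

    def __init__(self):
        self._shutdown_event = asyncio.Event()
        self._proc = None
        self._task = None
        self.returncode = None

    async def start(self):
        self._proc = await asyncio.create_subprocess_exec(
            *CHILD_CMD, stdin=asyncio.subprocess.PIPE
        )
        # Background task that keeps the pipe open until shutdown is requested.
        self._task = asyncio.create_task(self._hold_open())

    async def _hold_open(self):
        try:
            await self._shutdown_event.wait()
        finally:
            # Closing our end of stdin delivers EOF; then reap the child.
            self._proc.stdin.close()
            self.returncode = await self._proc.wait()

    async def shutdown(self):
        self._shutdown_event.set()
        await self._task


class ServerRegistry:
    """Hypothetical registry: at most one live instance per server name."""

    def __init__(self):
        self._servers = {}

    async def replace(self, name):
        old = self._servers.pop(name, None)
        if old is not None:
            # The step suspected to be skipped on reload/reconnect.
            await old.shutdown()
        new = StdioServer()
        await new.start()
        self._servers[name] = new
        return new

    async def close(self):
        while self._servers:
            _, server = self._servers.popitem()
            await server.shutdown()


async def demo_reload(reloads=3):
    """Simulate repeated reloads; return each instance's child exit code."""
    registry = ServerRegistry()
    instances = [await registry.replace("example") for _ in range(reloads)]
    await registry.close()
    return [inst.returncode for inst in instances]
```

Running `asyncio.run(demo_reload())` leaves no children behind because every superseded instance is shut down before its replacement starts. Removing the `await old.shutdown()` line reproduces the accumulation pattern in this toy model.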

Additional places that spawn MCP-ish subprocesses and may bypass `_stdio_pids` tracking:

- `batch_runner.py:263` — `import subprocess as _sp`
- `tools/delegate_tool.py`
- `tools/process_registry.py`

## Impact

- Memory growth proportional to uptime + reconnect/reload count
- File-descriptor pressure in the gateway (at least 2 pipe fds per leaked child × 28 = 56 leaked fds for a single configured MCP server)
- Some children grow to hundreds of MiB RSS (looks like the MCP server's per-session caches accumulated before the session silently went idle)

## Reproduction hypothesis (not yet confirmed end-to-end)

1. Start `hermes_cli gateway run --replace` with at least one stdio MCP server configured
2. Reload Hermes config (or trigger whatever path re-instantiates `McpTool`)
3. `pgrep -P "$GATEWAY_PID" -f 'stdio-mcp-binary' | wc -l` grows without bound

## Suggested investigation

1. Audit all code paths that construct a new `McpTool` or call `_run_stdio` — confirm the previous instance's `_shutdown_event` is always set before the new one starts.
2. On Hermes shutdown / SIGTERM, verify every tracked PID in `_stdio_pids` actually receives SIGTERM and is `waitpid()`'d.
3. For orphan detection: a periodic sweep comparing `_stdio_pids` with the gateway's actual children (e.g. `pgrep -P $SELF`) would catch drift.
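The sweep in item 3 could look roughly like the sketch below. It is an illustration only: `list_child_pids` and `find_drift` are hypothetical helpers, the child listing uses the portable `ps -A -o pid= -o ppid=` form, and a real sweep would also need to filter by command so that unrelated children are not flagged.

```python
from __future__ import annotations

import os
import subprocess


def list_child_pids(parent_pid: int) -> set[int]:
    """Return PIDs whose parent is parent_pid, parsed from `ps` output."""
    proc = subprocess.Popen(
        ["ps", "-A", "-o", "pid=", "-o", "ppid="],
        stdout=subprocess.PIPE,
        text=True,
    )
    out, _ = proc.communicate()
    children = set()
    for line in out.splitlines():
        fields = line.split()
        if len(fields) == 2 and int(fields[1]) == parent_pid:
            children.add(int(fields[0]))
    # The `ps` helper is itself a short-lived child of the caller; ignore it.
    children.discard(proc.pid)
    return children


def find_drift(tracked: set[int], live: set[int]) -> tuple[set[int], set[int]]:
    """Compare tracked PIDs with live children.

    Returns (untracked, stale): children that exist but are not tracked,
    and tracked PIDs that no longer exist as children.
    """
    untracked = live - tracked
    stale = tracked - live
    return untracked, stale


def sweep(tracked: set[int]) -> tuple[set[int], set[int]]:
    """One sweep over the current process's children."""
    return find_drift(tracked, list_child_pids(os.getpid()))
```

A non-empty `untracked` set after a reload would point at a spawn path that bypasses `_stdio_pids`. A growing number of tracked PIDs that are still alive long after their session was superseded would point at the `_shutdown_event` path instead.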

Happy to provide more detailed PS / lsof dumps if useful — please let me know what telemetry would help most.

