
Gateway leaks stdio-MCP subprocess children over time (orphan 'read stdin' blocked processes, unbounded RSS growth) #11202

@lyr1cs

Summary

hermes_cli gateway run --replace accumulates orphan stdio-MCP subprocess children over time. After roughly a day of normal use, ps showed 28 stdio MCP subprocesses, all children of the current gateway PID and all blocked in a read on stdin. Memory usage grows with them: one accumulated child reached ~300 MiB RSS.

The 2 launchd-managed services on the host were clean; only Hermes-spawned children leaked.

Environment

  • Host: macOS 15.6 arm64 (Apple Silicon Mac mini)
  • Python: 3.11 (venv in ~/hermes-agent/venv)
  • hermes_cli.main gateway run --replace
  • MCP server: a local stdio server (Rust binary), configured via mcp_servers.<name>.command + args
  • Running gateway PID: stable (--replace is only called on upgrades)

Observation

$ ps -eo pid,ppid,etime,rss,command | grep '[s]tdio-mcp-binary'
 4262 11745  09:53:24  2720 /path/to/stdio-mcp-binary serve
16169 11745 01-02:06:42 2448 /path/to/stdio-mcp-binary serve
18599 11745  08:00:41  2720 /path/to/stdio-mcp-binary serve
...  (26 more, same PPID 11745) ...
83682 11745  05:44    304544 /path/to/stdio-mcp-binary serve   # 300 MiB RSS

All 28 are children of the same gateway (PID 11745). lsof confirms each has:

  • fd 0 = PIPE (stdin from gateway)
  • fd 1 = PIPE (stdout to gateway)
  • no listening TCP ports

→ They are genuine stdio-MCP sessions, just never torn down. The MCP binary behaves correctly: on stdin EOF it exits cleanly — the orphans are blocked because nothing ever closes the gateway side of the pipe.
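
The EOF behaviour described above can be reproduced in isolation. The following is a minimal sketch, not Hermes code: a small Python child stands in for the MCP binary (an assumption for illustration), blocks reading stdin, and exits only once the parent closes its end of the pipe. The helper names spawn_child and close_and_reap are hypothetical.

```python
import subprocess
import sys

# Stand-in for the MCP binary (assumption): block on stdin, exit 0 on EOF.
CHILD_CODE = "import sys; sys.stdin.read()"


def spawn_child() -> subprocess.Popen:
    """Start a child whose stdin and stdout are pipes owned by this process."""
    return subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )


def close_and_reap(child: subprocess.Popen, timeout: float = 10.0) -> int:
    """Close our write end of the child's stdin, then wait for it to exit."""
    child.stdin.close()  # the child now sees EOF on fd 0
    try:
        return child.wait(timeout=timeout)  # reaps the child, returns exit code
    finally:
        child.stdout.close()
```

Until close_and_reap is called, the child stays alive and blocked, which matches the state of the 28 leaked processes.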

Suspected source

Looking at tools/mcp_tool.py:_run_stdio:

async with stdio_client(server_params) as (read_stream, write_stream):
    new_pids = _snapshot_child_pids() - pids_before
    if new_pids:
        with _lock:
            _stdio_pids.update(new_pids)
    async with ClientSession(read_stream, write_stream, **sampling_kwargs) as session:
        await session.initialize()
        self.session = session
        await self._discover_tools()
        self._ready.set()
        await self._shutdown_event.wait()
# Context exited cleanly — subprocess was terminated by the SDK.
if new_pids:
    with _lock:
        _stdio_pids.difference_update(new_pids)

Cleanup relies on self._shutdown_event.wait() returning. If a superseded/stale McpTool instance is created (e.g. on config reload, reconnect, or because a delegate/sub-gateway instantiates its own tools) without the old instance's _shutdown_event being set, the old async with stdio_client(...) block never exits and the subprocess is never reaped.
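
One possible shape for a fix, sketched under assumptions: a registry that keeps at most one live session per server name and always signals the old instance's shutdown event before starting its successor. FakeStdioSession and SessionRegistry are hypothetical names, not Hermes classes, and the real stdio_client context is replaced by a plain wait so the sketch runs without the MCP SDK.

```python
import asyncio


class FakeStdioSession:
    """Hypothetical stand-in for an McpTool-like object owning one subprocess."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.closed = False
        self._shutdown_event = asyncio.Event()
        self._task = None

    async def _run(self) -> None:
        try:
            # In the real code, `async with stdio_client(...)` would wrap this wait.
            await self._shutdown_event.wait()
        finally:
            self.closed = True  # stands in for the SDK terminating the child

    def start(self) -> None:
        self._task = asyncio.create_task(self._run())

    async def shutdown(self) -> None:
        self._shutdown_event.set()
        if self._task is not None:
            await self._task  # wait until the context has actually exited


class SessionRegistry:
    """Keeps at most one live session per server name."""

    def __init__(self) -> None:
        self._live: dict[str, FakeStdioSession] = {}

    async def replace(self, name: str) -> FakeStdioSession:
        old = self._live.pop(name, None)
        if old is not None:
            await old.shutdown()  # tear down before spawning the successor
        new = FakeStdioSession(name)
        new.start()
        self._live[name] = new
        return new

    async def close_all(self) -> None:
        while self._live:
            _, session = self._live.popitem()
            await session.shutdown()


async def demo() -> tuple[bool, bool]:
    registry = SessionRegistry()
    first = await registry.replace("example")
    second = await registry.replace("example")  # supersedes `first`
    states = (first.closed, second.closed)
    await registry.close_all()
    return states
```

Running asyncio.run(demo()) reports the first session as closed and the second as still open at the moment of replacement. Whether Hermes has a single choke point where such a registry could live is exactly what the audit would need to establish.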

Additional places that spawn MCP-ish subprocesses and may bypass _stdio_pids tracking:

  • batch_runner.py:263 (import subprocess as _sp)
  • tools/delegate_tool.py
  • tools/process_registry.py

Impact

  • Memory growth proportional to uptime + reconnect/reload count
  • File-descriptor pressure in the gateway (2 pipes per orphan × 28 orphans = 56 leaked fds for this one MCP server)
  • Some children grow to hundreds of MiB RSS (looks like the MCP server's per-session caches accumulated before the session silently went idle)

Reproduction hypothesis (not yet confirmed end-to-end)

  1. Start hermes_cli gateway run --replace with at least one stdio MCP server configured
  2. Reload Hermes config (or trigger whatever path re-instantiates McpTool)
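  3. Count the gateway's children with pgrep -P "$GATEWAY_PID" -f 'stdio-mcp-binary' | wc -l; the count grows without bound. (ps --ppid is a Linux procps option and is not available in macOS ps.)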
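
To track the child count in step 3 over time, a small counter can parse ps output. This is a sketch under assumptions: count_children and count_live_children are hypothetical helper names, the ps columns are requested with separate -o flags so headers are suppressed on both macOS and Linux, and SAMPLE is made-up data in the shape of the listing from the Observation section.

```python
import subprocess


def count_children(ps_output: str, parent_pid: int, needle: str) -> int:
    """Count 'PID PPID COMMAND' rows with the given PPID whose command contains needle."""
    count = 0
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 3 or not fields[1].isdigit():
            continue  # skip blank or header lines
        if int(fields[1]) == parent_pid and needle in fields[2]:
            count += 1
    return count


def count_live_children(parent_pid: int, needle: str) -> int:
    """Run ps for all processes and count matching children of parent_pid."""
    out = subprocess.run(
        ["ps", "-A", "-o", "pid=", "-o", "ppid=", "-o", "command="],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return count_children(out, parent_pid, needle)


# Made-up sample in the same shape as the ps listing in this report.
SAMPLE = """\
 4262 11745 /path/to/stdio-mcp-binary serve
16169 11745 /path/to/stdio-mcp-binary serve
  901     1 /usr/libexec/some-daemon
83682 11745 /path/to/stdio-mcp-binary serve
"""
```

Calling count_live_children(gateway_pid, "stdio-mcp-binary") before and after each config reload would show whether the count steps up with reloads, which would confirm or refute the hypothesis.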

Suggested investigation

  1. Audit all code paths that construct a new McpTool or call _run_stdio — confirm the previous instance's _shutdown_event is always set before the new one starts.
  2. On Hermes shutdown / SIGTERM, verify every tracked PID in _stdio_pids actually receives SIGTERM and is waitpid()'d.
  3. For orphan detection: a periodic sweep comparing _stdio_pids with the gateway's actual children (pgrep -P $SELF on macOS, ps --ppid $SELF on Linux) would catch drift.
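
For item 2, a shutdown path could look roughly like the sketch below. It is an illustration under assumptions rather than a patch: terminate_and_reap and spawn_sleeper are hypothetical names, and the function assumes every PID passed in is a direct child of the calling process (as the entries in _stdio_pids are meant to be), since signalling arbitrary PIDs would be unsafe.

```python
import os
import signal
import subprocess
import sys
import time


def terminate_and_reap(pids: set[int], grace: float = 5.0) -> dict[int, str]:
    """SIGTERM each child, reap it with waitpid, SIGKILL stragglers after `grace` seconds."""
    results: dict[int, str] = {}
    pending: set[int] = set()
    for pid in pids:
        try:
            os.kill(pid, signal.SIGTERM)
            pending.add(pid)
        except ProcessLookupError:
            results[pid] = "already gone"
    deadline = time.monotonic() + grace
    killed: set[int] = set()
    while pending:
        for pid in list(pending):
            try:
                done, _status = os.waitpid(pid, os.WNOHANG)
            except ChildProcessError:
                results[pid] = "not our child"
                pending.discard(pid)
                continue
            if done == pid:  # child exited and has now been reaped
                results[pid] = "killed" if pid in killed else "reaped"
                pending.discard(pid)
        if not pending:
            break
        if time.monotonic() >= deadline:
            for pid in pending - killed:
                try:
                    os.kill(pid, signal.SIGKILL)  # escalate once per child
                except ProcessLookupError:
                    pass
                killed.add(pid)
        time.sleep(0.05)
    return results


def spawn_sleeper() -> subprocess.Popen:
    """Start a throwaway child that would otherwise sleep for a minute."""
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
```

A usage example: child = spawn_sleeper() followed by terminate_and_reap({child.pid}) returns a dict mapping that PID to "reaped". Polling with WNOHANG keeps the loop from blocking on a single stuck child while the others are waiting to be reaped.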

Happy to provide more detailed ps / lsof dumps if useful. Please let me know what telemetry would help most.

Labels

  • P1: High (major feature broken, no workaround)
  • comp/gateway: Gateway runner, session dispatch, delivery
  • tool/mcp: MCP client and OAuth
  • type/bug: Something isn't working
