Summary
hermes_cli gateway run --replace accumulates orphan stdio-MCP subprocess children over time. After roughly a day of normal use, ps showed 28 stdio MCP subprocesses, all children of the current gateway PID, all blocked on read of stdin. Memory usage grows unbounded — one accumulated child reached ~300 MiB RSS.
The 2 launchd-managed services on the host were clean; only Hermes-spawned children leaked.
Environment
- Host: macOS 15.6 arm64 (Apple Silicon Mac mini)
- Python: 3.11 (venv in ~/hermes-agent/venv)
- Gateway command: hermes_cli.main gateway run --replace
- MCP server: a local stdio server (Rust binary), configured via mcp_servers.<name>.command + args
- Running gateway PID: stable (--replace is only called on upgrades)
Observation
$ ps -eo pid,ppid,etime,rss,command | grep '[s]tdio-mcp-binary'
4262 11745 09:53:24 2720 /path/to/stdio-mcp-binary serve
16169 11745 01-02:06:42 2448 /path/to/stdio-mcp-binary serve
18599 11745 08:00:41 2720 /path/to/stdio-mcp-binary serve
... (24 more, same PPID 11745) ...
83682 11745 05:44 304544 /path/to/stdio-mcp-binary serve # 300 MiB RSS
All 28 are children of the same gateway (PID 11745). lsof confirms each has:
- fd 0 = PIPE (stdin from gateway)
- fd 1 = PIPE (stdout to gateway)
- no listening TCP ports
→ They are genuine stdio-MCP sessions, just never torn down. The MCP binary behaves correctly: on stdin EOF it exits cleanly — the orphans are blocked because nothing ever closes the gateway side of the pipe.
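For anyone wanting to track this over time, the ps listing above can be summarized with a few lines of Python. This is only a sketch: the function name summarize_children and the ChildSummary type are made up for illustration, the header row and the unrelated-daemon row in the sample are invented, and the three MCP rows are copied from the output above. ps reports RSS in KiB.

```python
from dataclasses import dataclass


@dataclass
class ChildSummary:
    count: int
    total_rss_kib: int
    max_rss_kib: int


def summarize_children(ps_output: str, parent_pid: int, needle: str) -> ChildSummary:
    """Summarize rows of `ps -eo pid,ppid,etime,rss,command` output.

    Only rows whose PPID equals parent_pid and whose command contains
    needle are counted.
    """
    rss_values = []
    for line in ps_output.splitlines():
        fields = line.split(None, 4)  # pid, ppid, etime, rss, command
        if len(fields) < 5 or not fields[0].isdigit():
            continue  # skip the header row and blank or truncated lines
        ppid, rss, command = int(fields[1]), int(fields[3]), fields[4]
        if ppid == parent_pid and needle in command:
            rss_values.append(rss)
    return ChildSummary(len(rss_values), sum(rss_values), max(rss_values, default=0))


# Sample input: three rows from the report plus one hypothetical unrelated row.
SAMPLE = """\
  PID  PPID     ELAPSED    RSS COMMAND
 4262 11745    09:53:24   2720 /path/to/stdio-mcp-binary serve
16169 11745 01-02:06:42   2448 /path/to/stdio-mcp-binary serve
83682 11745       05:44 304544 /path/to/stdio-mcp-binary serve
  501     1    10:00:00   9000 /usr/libexec/unrelated-daemon
"""

summary = summarize_children(SAMPLE, parent_pid=11745, needle="stdio-mcp-binary")
print(summary.count, summary.total_rss_kib, round(summary.max_rss_kib / 1024, 1))
```

To run it against a live host, the sample string could be replaced with the stdout of subprocess.run(["ps", "-eo", "pid,ppid,etime,rss,command"], capture_output=True, text=True).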
Suspected source
Looking at tools/mcp_tool.py:_run_stdio:
async with stdio_client(server_params) as (read_stream, write_stream):
    new_pids = _snapshot_child_pids() - pids_before
    if new_pids:
        with _lock:
            _stdio_pids.update(new_pids)
    async with ClientSession(read_stream, write_stream, **sampling_kwargs) as session:
        await session.initialize()
        self.session = session
        await self._discover_tools()
        self._ready.set()
        await self._shutdown_event.wait()
# Context exited cleanly: subprocess was terminated by the SDK.
if new_pids:
    with _lock:
        _stdio_pids.difference_update(new_pids)
Cleanup relies on self._shutdown_event.wait() returning. If a new McpTool instance supersedes an old one (e.g. on config reload, reconnect, or because a delegate/sub-gateway instantiates its own tools) and nothing sets the old instance's _shutdown_event, the old async with stdio_client(...) block never exits, so its subprocess is never terminated or reaped.
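To make the suspected failure mode concrete, here is a minimal sketch of the invariant I would expect: before a replacement starts, the old instance's shutdown event is set and its task is awaited. This is not Hermes code and does not use the MCP SDK. StdioWorker, WorkerRegistry and request_shutdown are hypothetical names, and the child is a Python one-liner that blocks on stdin and exits on EOF, standing in for the MCP binary.

```python
import asyncio
import sys
from typing import Optional


class StdioWorker:
    """Hypothetical stand-in for McpTool: owns one child on a stdin pipe."""

    def __init__(self) -> None:
        self._shutdown_event = asyncio.Event()
        self._ready = asyncio.Event()
        self.proc: Optional[asyncio.subprocess.Process] = None

    def request_shutdown(self) -> None:
        self._shutdown_event.set()

    async def run(self) -> None:
        # The child blocks reading stdin and exits on EOF.
        self.proc = await asyncio.create_subprocess_exec(
            sys.executable, "-c", "import sys; sys.stdin.read()",
            stdin=asyncio.subprocess.PIPE,
        )
        self._ready.set()
        try:
            await self._shutdown_event.wait()
        finally:
            # Closing our end of the pipe delivers EOF; then reap the child.
            self.proc.stdin.close()
            await self.proc.wait()


class WorkerRegistry:
    """Keeps at most one live worker per server name."""

    def __init__(self) -> None:
        self._workers: dict = {}  # name -> (worker, task)

    async def _stop(self, name: str) -> None:
        entry = self._workers.pop(name, None)
        if entry is not None:
            worker, task = entry
            worker.request_shutdown()  # signal the old session to exit
            await task                 # wait until its child has been reaped

    async def replace(self, name: str) -> StdioWorker:
        await self._stop(name)  # the old instance is gone before the new one starts
        worker = StdioWorker()
        task = asyncio.create_task(worker.run())
        await worker._ready.wait()
        self._workers[name] = (worker, task)
        return worker

    async def close_all(self) -> None:
        for name in list(self._workers):
            await self._stop(name)


async def demo() -> list:
    registry = WorkerRegistry()
    seen = [await registry.replace("example") for _ in range(3)]
    # Superseded workers have exited; the current one is still running (None).
    codes = [w.proc.returncode for w in seen]
    await registry.close_all()
    return codes


result = asyncio.run(demo())
print(result)
```

If replace() skipped the _stop() call, each call would leave one more child blocked on stdin, which matches the pattern in the ps output.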
Additional places that spawn MCP-ish subprocesses and may bypass _stdio_pids tracking:
- batch_runner.py:263 (import subprocess as _sp)
- tools/delegate_tool.py
- tools/process_registry.py
Impact
- Memory growth proportional to uptime + reconnect/reload count
- File-descriptor pressure (2 pipes per orphan × 28 = 56 leaked fds per MCP server)
- Some children grow to hundreds of MiB RSS (looks like the MCP server's per-session caches accumulated before the session silently went idle)
Reproduction hypothesis (not yet confirmed end-to-end)
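- Start hermes_cli gateway run --replace with at least one stdio MCP server configured
- Reload Hermes config (or trigger whatever path re-instantiates McpTool)
- pgrep -P $GATEWAY_PID -f 'stdio-mcp-binary' | wc -l grows without bound (macOS ps has no --ppid option, so pgrep -P is used to list the gateway's children; on Linux, ps --ppid $GATEWAY_PID works as well)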
Suggested investigation
- Audit all code paths that construct a new McpTool or call _run_stdio, and confirm the previous instance's _shutdown_event is always set before the new one starts.
- On Hermes shutdown / SIGTERM, verify every tracked PID in _stdio_pids actually receives SIGTERM and is waitpid()'d.
- For orphan detection: a periodic sweep comparing _stdio_pids with the gateway's actual children (pgrep -P $SELF on macOS, ps --ppid $SELF on Linux) would catch drift.
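The last two points can be sketched with the standard library alone. The helper names below (find_untracked, terminate_and_reap, spawn_sleeper) are hypothetical and not part of Hermes. The sketch assumes the tracked PIDs are direct children of the calling process, since waitpid() cannot reap anything else, and the demo children are throwaway Python sleepers rather than real MCP servers.

```python
import os
import signal
import sys
import time


def find_untracked(tracked: set, actual_children: set) -> set:
    """Return child PIDs that exist but are missing from the tracking set."""
    return actual_children - tracked


def terminate_and_reap(pids: set, grace_seconds: float = 5.0) -> dict:
    """SIGTERM each PID, then waitpid() it; escalate to SIGKILL after a grace period.

    Returns a mapping of pid -> raw wait status for every child that was reaped.
    """
    for pid in pids:
        try:
            os.kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            pass  # already gone
    statuses = {}
    pending = set(pids)
    deadline = time.monotonic() + grace_seconds
    while pending:
        for pid in list(pending):
            try:
                reaped, status = os.waitpid(pid, os.WNOHANG)
            except ChildProcessError:
                pending.discard(pid)  # not our child, or already reaped elsewhere
                continue
            if reaped == pid:
                statuses[pid] = status
                pending.discard(pid)
        if pending and time.monotonic() > deadline:
            for pid in pending:
                try:
                    os.kill(pid, signal.SIGKILL)
                except ProcessLookupError:
                    pass
            deadline = float("inf")  # after SIGKILL, just wait for the exits
        elif pending:
            time.sleep(0.05)
    return statuses


def spawn_sleeper() -> int:
    """Start a throwaway child (stand-in for an MCP server) and return its PID."""
    argv = [sys.executable, "-c", "import time; time.sleep(60)"]
    return os.posix_spawn(sys.executable, argv, dict(os.environ))


tracked = {spawn_sleeper(), spawn_sleeper()}
statuses = terminate_and_reap(tracked)
all_signaled = all(os.WIFSIGNALED(s) for s in statuses.values())
print(len(statuses), all_signaled)
```

In the gateway itself the polling loop would presumably need to be non-blocking (for example asyncio.sleep instead of time.sleep), and the actual_children set for find_untracked would come from listing the gateway's children.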
Happy to provide more detailed ps / lsof dumps if useful. Please let me know what telemetry would help most.