Skip to content

fix(mcp): reap stdio subprocesses in _run_stdio finally block#12430

Closed
lyr1cs wants to merge 1 commit into
NousResearch:mainfrom
lyr1cs:fix/stdio-subprocess-reap
Closed

fix(mcp): reap stdio subprocesses in _run_stdio finally block#12430
lyr1cs wants to merge 1 commit into
NousResearch:mainfrom
lyr1cs:fix/stdio-subprocess-reap

Conversation

@lyr1cs

@lyr1cs lyr1cs commented Apr 19, 2026

Copy link
Copy Markdown

Closes #11202.

Summary

_run_stdio in tools/mcp_tool.py relies on stdio_client(...)'s __aexit__ (anyio-backed) to reap the MCP subprocess. On exception paths (transient read/decode error, timeout, etc.) anyio's cancel-scope cleanup does not always close the subprocess's stdin pipe, leaving the child blocked on read indefinitely.

MCPServerTask.run()'s outer reconnect loop (mcp_tool.py:1113–1200) then catches the exception, sleeps, and re-enters _run_stdio — spawning a fresh subprocess. The previously tracked PID stays in _stdio_pids but nothing reaps it until full gateway shutdown. _kill_orphaned_mcp_children() only fires from _stop_mcp_loop(), not on per-server reconnect.

Net effect: on a stable gateway PID (no --replace, no reload), subprocess count grows roughly linearly with uptime × transient-error rate. Field report in #11202 shows 28 orphans in ~1 day, all blocked on read of stdin, one reaching 300 MiB RSS.

Change

  • Wrap the async with stdio_client(...) block in _run_stdio with try/finally.
  • In the finally, explicitly reap the tracked PID set: SIGTERM → up to 1.5 s grace → SIGKILL stragglers.
  • Add two small helpers alongside _kill_orphaned_mcp_children:
    • _pid_alive(pid)kill(pid, 0) probe
    • _reap_pids(pids, grace) — async SIGTERM/grace/SIGKILL

HTTP transport is untouched. _kill_orphaned_mcp_children still handles the full-gateway-shutdown path.

Why in-tool instead of upstream SDK

Fixing stdio_client's anyio cleanup belongs in the MCP Python SDK, but:

  1. Hermes already tracks PIDs for force-kill (_stdio_pids), so the information needed to reap is already on hand.
  2. This pattern gives Hermes a floor of correctness that doesn't depend on SDK version pinning.
  3. Even after an SDK fix ships, this stays useful as defense-in-depth.

Test plan

  • python -m pytest tests/tools/test_mcp_tool.py -x
  • Manual: induce a reconnect storm with a stdio MCP server that rejects the second initialize call, confirm child count stays flat via ps --ppid $GATEWAY_PID | grep stdio-mcp | wc -l.
  • Sanity: clean shutdown via SIGTERM to gateway still reaps children (should, since __aexit__ path now has belt-and-suspenders).

Defends against stdio_client's anyio cleanup failing to close the
subprocess stdin pipe on exception paths. Without this, MCPServerTask.run's
outer reconnect loop (tools/mcp_tool.py:1113-1200) spawns a fresh child on
every transient error while the previous child stays blocked on read(stdin)
indefinitely, producing the unbounded orphan accumulation tracked in NousResearch#11202.

- Wrap the stdio_client() context in try/finally
- Always invoke a new _reap_pids() helper on the tracked PID set:
  SIGTERM, wait up to 1.5s, SIGKILL stragglers
- _kill_orphaned_mcp_children() still handles the full-shutdown path

Refs: NousResearch#11202
@alt-glitch

Copy link
Copy Markdown
Collaborator

Related to #12978 (cron-side MCP subprocess reaping) and fixes #11202 (the leak report).

@teknium1

Copy link
Copy Markdown
Contributor

Superseded by #16275 (salvaged @Ito-69's #12978). Your PR first identified the root cause — thanks! The merged approach separates active vs orphaned PIDs via _orphan_stdio_pids so the cleanup sweep is parallel-safe for concurrent cron jobs and live user sessions sharing _stdio_pids. Credit acknowledged in the salvage PR body.

@teknium1 teknium1 closed this Apr 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

P1 High — major feature broken, no workaround tool/mcp MCP client and OAuth type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Gateway leaks stdio-MCP subprocess children over time (orphan 'read stdin' blocked processes, unbounded RSS growth)

3 participants