fix(mcp): reap stdio subprocesses via orphan set (salvage #12978, closes #11202) #16275

Merged: teknium1 merged 2 commits into main from hermes/hermes-5de18fdc on Apr 27, 2026

Conversation

@teknium1

Contributor

Reaps stdio-MCP subprocess children that escape stdio_client()'s anyio cleanup on exception / cancellation, so they stop accumulating as orphan session leaders blocked on read(stdin). Closes #11202.

Straight cherry-pick of #12978 by @Ito-69, clean against current main.

Why this PR over #12430

Both PRs wrap _run_stdio's stdio+session context in try/finally. The critical difference: #12430 clears the full _stdio_pids set on every exit, which races with concurrent cron jobs (after #13021 made tick() use a ThreadPoolExecutor) and with live user chats that share the same tracking set. #12978 instead uses an os.kill(pid, 0) liveness probe to move PIDs that are still alive on exit out of the active set and into a new _orphan_stdio_pids set, so cleanup sweeps never disturb in-flight sessions.
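For readers unfamiliar with the probe, here is a minimal sketch of an os.kill(pid, 0) liveness check. The helper name pid_is_alive is hypothetical and not taken from tools/mcp_tool.py; the sketch assumes a POSIX host.

```python
import os
import subprocess
import sys


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only).

    Hypothetical helper for illustration; not the function used in the PR.
    """
    try:
        # Signal 0 performs the existence and permission checks only;
        # nothing is delivered to the target process.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # ESRCH: no such process
    except PermissionError:
        return True  # EPERM: the process exists but belongs to another user
    return True


if __name__ == "__main__":
    # Our own PID is alive; a child that has exited and been waited on is not.
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    child.wait()
    print(pid_is_alive(os.getpid()), pid_is_alive(child.pid))
```

One caveat worth keeping in mind: a child that has exited but has not yet been waited on (a zombie) still passes this probe, so a caller that owns the subprocess handle should reap it before probing.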

Changes

  • tools/mcp_tool.py: _run_stdio wrapped in try/finally; still-alive PIDs migrate to _orphan_stdio_pids (new set). _kill_orphaned_mcp_children gains include_active: bool = False — default reaps only orphans; _stop_mcp_loop() passes include_active=True to preserve the full-shutdown "kill everything" semantics.
  • cron/scheduler.py: post-executor sweep in tick() after all futures have joined (safe point, no in-flight sessions).
  • tests/tools/test_mcp_stability.py: TestStdioPidTracking updated for the orphan-only default contract.

Validation

| | Before | After |
|---|---|---|
| test_mcp_stability.py | | 16/16 pass |
| Broader MCP + cron suite | | 426/426 pass |
| Parallel-safety E2E | _kill_orphaned_mcp_children() wiped active PIDs of concurrent sessions | default call only touches _orphan_stdio_pids; include_active=True kept for shutdown |
| Author-reported field data | 6 orphan mempalace.mcp_server processes / ~570 MB after 24h, 7 cron jobs | orphan count stays at 0 across the cron schedule |

Closes #11202. Credit to @lyr1cs (#12430) for first identifying the root cause; merged approach is #12978 for the parallel-safe design.

crearch and others added 2 commits April 26, 2026 18:12
MCP stdio servers are spawned via the SDK's stdio_client, which on
Linux uses start_new_session=True (setsid).  When a cron job is
cancelled mid-way (timeout, agent finish, exception), the subprocess
often escapes the SDK's teardown and survives as a session leader.
Because setsid() detaches the child from the gateway's process group
/ cgroup tree, systemd does not reap it on service restart either —
so every cron tick that touches an MCP tool leaks a dangling server
process.
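The detachment described above is easy to reproduce outside the gateway. The following sketch (POSIX only; describe_child is a hypothetical helper, and the child command is a stand-in for an MCP server blocked on stdin) shows that start_new_session=True makes the child its own session leader, outside the parent's process group:

```python
import os
import subprocess
import sys


def describe_child(start_new_session: bool) -> dict:
    """Spawn a child blocked on read(stdin) and report its session/group status."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import sys; sys.stdin.read()"],
        stdin=subprocess.PIPE,
        start_new_session=start_new_session,  # True: child calls setsid()
    )
    try:
        return {
            # A session leader's session ID equals its own PID.
            "session_leader": os.getsid(child.pid) == child.pid,
            # Group-wide signals sent to the parent's group only reach
            # children that still share that group.
            "same_group_as_parent": os.getpgid(child.pid) == os.getpgid(0),
        }
    finally:
        child.stdin.close()  # EOF on stdin lets the child exit
        child.wait()


if __name__ == "__main__":
    print("detached:", describe_child(True))
    print("attached:", describe_child(False))
```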

Fix:

* tools/mcp_tool.py — _run_stdio now wraps the whole stdio+session
  context in try/finally.  On any exit path (clean, exception,
  cancellation), PIDs still alive are moved from the active
  _stdio_pids set into a new _orphan_stdio_pids set.  Orphan
  detection is done via os.kill(pid, 0) — a cheap liveness probe
  that never signals the target.

* tools/mcp_tool.py — _kill_orphaned_mcp_children gains an
  include_active=False flag.  Default behaviour now only reaps the
  orphan set so concurrent sessions (other parallel cron jobs or
  live user chats) are never disrupted.  The existing shutdown path
  passes include_active=True to keep the previous "kill everything"
  semantics after the MCP loop is stopped.

* cron/scheduler.py — the cleanup hook is moved from run_job()'s
  finally (which would race with parallel siblings after #13021)
  into tick() after the ThreadPoolExecutor has joined every future.
  At that point there are no in-flight sessions from this tick, so
  sweeping the orphan set is always safe.

Net effect: zero regression for healthy sessions, and orphan MCP
servers no longer accumulate between gateway restarts.

Made-with: Cursor
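The scheduler change in the commit message above amounts to "join first, sweep once". A minimal sketch of that ordering follows; the tick() signature, the sweep_orphans callback, and max_workers are assumptions for illustration, not the actual cron/scheduler.py code.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def tick(jobs, run_job, sweep_orphans, max_workers: int = 4):
    """Run all due jobs in parallel, then sweep orphans once at a safe point.

    Hypothetical signature: run_job and sweep_orphans are injected callables.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_job, job) for job in jobs]
        # Block until every job has finished, whether it succeeded or raised.
        wait(futures)
    # No session started by this tick is still in flight here, so sweeping
    # cannot race with a sibling job the way a per-job finally would.
    sweep_orphans()
    # Surface per-job failures (None for jobs that completed normally).
    return [f.exception() for f in futures]
```

Sweeping inside each job's own finally would run while sibling jobs are still executing, which is the race the commit message attributes to the pre-move placement.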
@teknium1 teknium1 merged commit 8747775 into main Apr 27, 2026
11 of 12 checks passed
@teknium1 teknium1 deleted the hermes/hermes-5de18fdc branch April 27, 2026 01:21
@alt-glitch alt-glitch added labels type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), tool/mcp (MCP client and OAuth), and comp/cron (Cron scheduler and job management) on Apr 27, 2026


Development

Successfully merging this pull request may close these issues.

Gateway leaks stdio-MCP subprocess children over time (orphan 'read stdin' blocked processes, unbounded RSS growth)

3 participants