fix(cron): reap orphaned MCP stdio subprocesses after each tick by Ito-69 · Pull Request #12978 · NousResearch/hermes-agent

Ito-69 · 2026-04-20T10:55:41Z

What does this PR do?

Fixes a resource leak where MCP stdio subprocesses (e.g. mempalace.mcp_server, any stdio-configured MCP server) accumulate as orphan session leaders whenever the SDK's subprocess teardown fails — most commonly when a cron job is cancelled, times out, or raises mid-flight.

Root cause: stdio_client spawns the MCP server child via start_new_session=True (for killpg() support), which makes it a session leader (Ssl in ps, PID == PGID == SID). When the enclosing asyncio task is cancelled or the context exits with an exception, the SDK sometimes fails to reap the child before the async generator finalizes. The surviving subprocess escapes the gateway's process group / cgroup tree, so neither systemd (KillMode=mixed and KillMode=control-group on soft restarts) nor _stop_mcp_loop() at shutdown can reliably kill it. Every cron tick that touches an MCP tool leaks one more server, each holding ~25–100 MB.

Fix — v2 (parallel-safe, rebased on #13021):

Previous iterations of this PR called _kill_orphaned_mcp_children() from run_job()'s finally: block, clearing the full _stdio_pids set. That races with:

concurrent cron jobs after fix(cron): run due jobs in parallel to prevent serial tick starvation #13021 made tick() use a ThreadPoolExecutor (a job finishing first would kill the MCP servers of its still-running siblings);
live user chat sessions (Telegram/Slack) whose MCP calls share the same tracking set.

This revision separates active from orphaned PIDs at the source:

tools/mcp_tool.py — _run_stdio now wraps the stdio_client + ClientSession contexts in a try/finally. On any exit path (clean, exception, cancellation) PIDs are removed from _stdio_pids; if a PID is still alive (probed via os.kill(pid, 0)) it is moved into a new _orphan_stdio_pids set. Liveness is the signal, not the code path — so we catch cancellation-based leaks without any except CancelledError plumbing. Upstream now tracks each stdio child as pid -> server name and redirects server stderr to the shared log; that stays intact, integrated with the try/finally + orphan set.
tools/mcp_tool.py — _kill_orphaned_mcp_children(include_active: bool = False). The default now reaps only the orphan set, so concurrent sessions are never disturbed. The existing _stop_mcp_loop() call passes include_active=True to preserve the previous "kill everything" semantics after the loop has stopped. Reaping uses the same graceful path as upstream (SIGTERM, short wait, SIGKILL) when clearing worklists.
cron/scheduler.py — the cleanup sweep moved from run_job() into tick(), after the ThreadPoolExecutor has joined every future. At that point no in-flight sessions from this tick can exist, so the sweep is always safe regardless of cron.max_parallel_jobs.

Related Issue

Related: #11202

Type of Change

Changes Made

tools/mcp_tool.py
- New module-level _orphan_stdio_pids: set separated from active _stdio_pids (active set is a dict of pid → server name, per upstream).
- _run_stdio wrapped in try/finally; on exit any still-alive PID is migrated to the orphan set (liveness probed with signal 0). Keeps upstream stderr log routing (errlog=) in the same block.
- _kill_orphaned_mcp_children gains include_active: bool = False; shutdown call site updated to include_active=True.
cron/scheduler.py
- Cleanup hook removed from run_job()'s finally:.
- New cleanup hook at the end of tick(), after _results = [f.result() for f in _futures] but before the lock-release finally:, so it runs exactly once per tick and never during a sibling job.
tests/tools/test_mcp_stability.py — TestStdioPidTracking targets _orphan_stdio_pids for the default (orphan-only) kill path, matching the v2 contract.

How to Test

Reproduction (before fix):

Start gateway with at least one stdio MCP server configured (e.g. mempalace).
Enable a recurring cron job whose agent invokes an MCP tool.
ps -eo pid,lstart,cmd | grep mempalace.mcp_server before and after a few ticks — each tick that suffers a cancellation leaves a surviving session-leader.

After fix:

Same setup. Run HERMES_CRON_MAX_PARALLEL=unbounded (default).
Trigger several ticks with overlapping jobs.
ps shows at most the actively connected MCP servers; previous ticks' orphans are reaped in the next tick's post-executor sweep. The currently active sessions of concurrent siblings are never touched.

Live-run observation on a Fedora/Asahi gateway running 7 cron jobs: before the patch, 6 orphan mempalace.mcp_server processes had accumulated across 24h (~570 MB RSS). After rebasing to this revision and restarting the gateway, the orphan count stays at 0 across the cron schedule.

Checklist

My code follows the project's coding style
I have commented my code where complex logic isn't self-evident
I have added/updated tests (TestStdioPidTracking for orphan-only reaping; full suite not run in CI from this environment)
All existing tests pass locally (syntax + import smoke on patched modules; TestStdioPidTracking run for this change)

Ito-69 · 2026-04-21T06:14:02Z

Revision note (force-pushed `a57e799f` → `dcfe6cc2`)

Rebased on current main (includes #13021) and rewrote the fix so it is parallel-safe under the new ThreadPoolExecutor tick:

The old v1 hook in run_job()'s finally: would kill the entire _stdio_pids set — which in the new concurrent executor means a job finishing first could reap the MCP subprocesses owned by its still-running siblings. It would also stomp on _stdio_pids entries from live Telegram/Slack chat sessions sharing the same process.
v2 introduces a separate _orphan_stdio_pids set that is populated only when a subprocess is still alive after its session's stdio_client/ClientSession contexts have exited. _kill_orphaned_mcp_children() now defaults to reaping just that set; the shutdown call site keeps the old "kill everything" behaviour via include_active=True.
The sweep hook moved from run_job() to tick(), after [f.result() for f in _futures] — i.e. it runs exactly once per tick and only when no sibling job from that tick can still be in flight.

Verified on a 7-job Fedora/Asahi cron setup: orphan count went from 6 (~570 MB) across 24h to 0 after rebasing and restarting the gateway. No regressions observed in live Telegram/Slack MCP calls.

Happy to split the _orphan_stdio_pids change into its own commit if you'd prefer a two-commit PR, or to add a targeted test around the orphan-detection path.

alt-glitch · 2026-04-21T19:42:10Z

Related to #12430 (similar reap approach in _run_stdio finally) and #11202 (original leak report). This PR adds parallel-safe orphan tracking on top of #13021's ThreadPoolExecutor.

alt-glitch · 2026-04-21T19:43:12Z

Related to #12430 and #11202

Ito-69 · 2026-04-24T12:42:12Z

Update (rebased on `fe9d9a26` — force-push `fd9a3a8d`)

Resolved a rebase conflict in tools/mcp_tool.py with current main: kept upstream’s stderr log (errlog=), pid → server name dict tracking, and the SIGTERM → wait → SIGKILL reaper, while keeping v2 behaviour: try/finally in _run_stdio, _orphan_stdio_pids, include_active, and the post–ThreadPoolExecutor sweep in tick().
tests/tools/test_mcp_stability.py: TestStdioPidTracking now uses _orphan_stdio_pids for the default kill path (default reaping no longer iterates the full _stdio_pids dict).
Description: filled Fixes Gateway leaks stdio-MCP subprocess children over time (orphan 'read stdin' blocked processes, unbounded RSS growth) #11202; checklist reflects test updates.

Ito-69 · 2026-04-24T12:43:02Z

Amendment: Replaced Fixes #11202 with Related: #11202 in the description so the issue is not auto-closed on merge. Maintainers can close it manually when they are satisfied.

MCP stdio servers are spawned via the SDK's stdio_client, which on Linux uses start_new_session=True (setsid). When a cron job is cancelled mid-way (timeout, agent finish, exception), the subprocess often escapes the SDK's teardown and survives as a session leader. Because setsid() detaches the child from the gateway's process group / cgroup tree, systemd does not reap it on service restart either — so every cron tick that touches an MCP tool leaks a dangling server process. Fix: * tools/mcp_tool.py — _run_stdio now wraps the whole stdio+session context in try/finally. On any exit path (clean, exception, cancellation), PIDs still alive are moved from the active _stdio_pids set into a new _orphan_stdio_pids set. Orphan detection is done via os.kill(pid, 0) — a cheap liveness probe that never signals the target. * tools/mcp_tool.py — _kill_orphaned_mcp_children gains an include_active=False flag. Default behaviour now only reaps the orphan set so concurrent sessions (other parallel cron jobs or live user chats) are never disrupted. The existing shutdown path passes include_active=True to keep the previous "kill everything" semantics after the MCP loop is stopped. * cron/scheduler.py — the cleanup hook is moved from run_job()'s finally (which would race with parallel siblings after NousResearch#13021) into tick() after the ThreadPoolExecutor has joined every future. At that point there are no in-flight sessions from this tick, so sweeping the orphan set is always safe. Net effect: zero regression for healthy sessions, and orphan MCP servers no longer accumulate between gateway restarts. Made-with: Cursor

teknium1 · 2026-04-27T01:21:31Z

Merged via #16275 — your commit was cherry-picked onto current main with authorship preserved in git log. Thanks!

Ito-69 force-pushed the fix/cron-mcp-orphan-cleanup branch 2 times, most recently from a57e799 to dcfe6cc Compare April 21, 2026 06:12

Ito-69 changed the title ~~fix(cron): force-kill MCP stdio subprocesses after each job run~~ fix(cron): reap orphaned MCP stdio subprocesses after each tick Apr 21, 2026

Ito-69 force-pushed the fix/cron-mcp-orphan-cleanup branch 2 times, most recently from 78f8f3e to 026b7e6 Compare April 21, 2026 19:28

alt-glitch added type/bug Something isn't working comp/cron Cron scheduler and job management tool/mcp MCP client and OAuth labels Apr 21, 2026

Ito-69 force-pushed the fix/cron-mcp-orphan-cleanup branch 3 times, most recently from 42cf483 to 1e3ccce Compare April 23, 2026 10:13

This was referenced Apr 23, 2026

fix(mcp): reap stdio subprocesses in _run_stdio finally block #12430

Closed

[Bug] Gateway accumulates zombie processes from MCP servers and subprocess calls #15012

Closed

Ito-69 force-pushed the fix/cron-mcp-orphan-cleanup branch from 1e3ccce to fd9a3a8 Compare April 24, 2026 12:38

Ito-69 force-pushed the fix/cron-mcp-orphan-cleanup branch from fd9a3a8 to af76902 Compare April 24, 2026 13:00

Ito-69 force-pushed the fix/cron-mcp-orphan-cleanup branch from af76902 to 827d9e6 Compare April 26, 2026 07:14

teknium1 mentioned this pull request Apr 27, 2026

fix(mcp): reap stdio subprocesses via orphan set (salvage #12978, closes #11202) #16275

Merged

teknium1 closed this Apr 27, 2026

Ito-69 deleted the fix/cron-mcp-orphan-cleanup branch April 27, 2026 05:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(cron): reap orphaned MCP stdio subprocesses after each tick#12978

fix(cron): reap orphaned MCP stdio subprocesses after each tick#12978
Ito-69 wants to merge 1 commit into
NousResearch:mainfrom
Ito-69:fix/cron-mcp-orphan-cleanup

Ito-69 commented Apr 20, 2026 •

edited

Loading

Uh oh!

Ito-69 commented Apr 21, 2026

Uh oh!

alt-glitch commented Apr 21, 2026

Uh oh!

alt-glitch commented Apr 21, 2026

Uh oh!

Ito-69 commented Apr 24, 2026

Uh oh!

Ito-69 commented Apr 24, 2026

Uh oh!

teknium1 commented Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

Ito-69 commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Related Issue

Type of Change

Changes Made

How to Test

Checklist

Uh oh!

Ito-69 commented Apr 21, 2026

Revision note (force-pushed a57e799f → dcfe6cc2)

Uh oh!

alt-glitch commented Apr 21, 2026

Uh oh!

alt-glitch commented Apr 21, 2026

Uh oh!

Ito-69 commented Apr 24, 2026

Update (rebased on fe9d9a26 — force-push fd9a3a8d)

Uh oh!

Ito-69 commented Apr 24, 2026

Uh oh!

teknium1 commented Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Ito-69 commented Apr 20, 2026 •

edited

Loading

Revision note (force-pushed `a57e799f` → `dcfe6cc2`)

Update (rebased on `fe9d9a26` — force-push `fd9a3a8d`)