Skip to content

fix(cron): reap orphaned MCP stdio subprocesses after each tick#12978

Closed
Ito-69 wants to merge 1 commit into
NousResearch:mainfrom
Ito-69:fix/cron-mcp-orphan-cleanup
Closed

fix(cron): reap orphaned MCP stdio subprocesses after each tick#12978
Ito-69 wants to merge 1 commit into
NousResearch:mainfrom
Ito-69:fix/cron-mcp-orphan-cleanup

Conversation

@Ito-69

@Ito-69 Ito-69 commented Apr 20, 2026

Copy link
Copy Markdown

What does this PR do?

Fixes a resource leak where MCP stdio subprocesses (e.g. mempalace.mcp_server, any stdio-configured MCP server) accumulate as orphan session leaders whenever the SDK's subprocess teardown fails — most commonly when a cron job is cancelled, times out, or raises mid-flight.

Root cause: stdio_client spawns the MCP server child via start_new_session=True (for killpg() support), which makes it a session leader (Ssl in ps, PID == PGID == SID). When the enclosing asyncio task is cancelled or the context exits with an exception, the SDK sometimes fails to reap the child before the async generator finalizes. The surviving subprocess escapes the gateway's process group / cgroup tree, so neither systemd (KillMode=mixed and KillMode=control-group on soft restarts) nor _stop_mcp_loop() at shutdown can reliably kill it. Every cron tick that touches an MCP tool leaks one more server, each holding ~25–100 MB.

Fix — v2 (parallel-safe, rebased on #13021):

Previous iterations of this PR called _kill_orphaned_mcp_children() from run_job()'s finally: block, clearing the full _stdio_pids set. That races with:

This revision separates active from orphaned PIDs at the source:

  1. tools/mcp_tool.py_run_stdio now wraps the stdio_client + ClientSession contexts in a try/finally. On any exit path (clean, exception, cancellation) PIDs are removed from _stdio_pids; if a PID is still alive (probed via os.kill(pid, 0)) it is moved into a new _orphan_stdio_pids set. Liveness is the signal, not the code path — so we catch cancellation-based leaks without any except CancelledError plumbing. Upstream now tracks each stdio child as pid -> server name and redirects server stderr to the shared log; that stays intact, integrated with the try/finally + orphan set.
  2. tools/mcp_tool.py_kill_orphaned_mcp_children(include_active: bool = False). The default now reaps only the orphan set, so concurrent sessions are never disturbed. The existing _stop_mcp_loop() call passes include_active=True to preserve the previous "kill everything" semantics after the loop has stopped. Reaping uses the same graceful path as upstream (SIGTERM, short wait, SIGKILL) when clearing worklists.
  3. cron/scheduler.py — the cleanup sweep moved from run_job() into tick(), after the ThreadPoolExecutor has joined every future. At that point no in-flight sessions from this tick can exist, so the sweep is always safe regardless of cron.max_parallel_jobs.

Related Issue

Related: #11202

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests
  • ♻️ Refactor
  • 🎯 New skill

Changes Made

  • tools/mcp_tool.py
    • New module-level _orphan_stdio_pids: set separated from active _stdio_pids (active set is a dict of pid → server name, per upstream).
    • _run_stdio wrapped in try/finally; on exit any still-alive PID is migrated to the orphan set (liveness probed with signal 0). Keeps upstream stderr log routing (errlog=) in the same block.
    • _kill_orphaned_mcp_children gains include_active: bool = False; shutdown call site updated to include_active=True.
  • cron/scheduler.py
    • Cleanup hook removed from run_job()'s finally:.
    • New cleanup hook at the end of tick(), after _results = [f.result() for f in _futures] but before the lock-release finally:, so it runs exactly once per tick and never during a sibling job.
  • tests/tools/test_mcp_stability.pyTestStdioPidTracking targets _orphan_stdio_pids for the default (orphan-only) kill path, matching the v2 contract.

How to Test

Reproduction (before fix):

  1. Start gateway with at least one stdio MCP server configured (e.g. mempalace).
  2. Enable a recurring cron job whose agent invokes an MCP tool.
  3. ps -eo pid,lstart,cmd | grep mempalace.mcp_server before and after a few ticks — each tick that suffers a cancellation leaves a surviving session-leader.

After fix:

  1. Same setup. Run HERMES_CRON_MAX_PARALLEL=unbounded (default).
  2. Trigger several ticks with overlapping jobs.
  3. ps shows at most the actively connected MCP servers; previous ticks' orphans are reaped in the next tick's post-executor sweep. The currently active sessions of concurrent siblings are never touched.

Live-run observation on a Fedora/Asahi gateway running 7 cron jobs: before the patch, 6 orphan mempalace.mcp_server processes had accumulated across 24h (~570 MB RSS). After rebasing to this revision and restarting the gateway, the orphan count stays at 0 across the cron schedule.

Checklist

  • My code follows the project's coding style
  • I have commented my code where complex logic isn't self-evident
  • I have added/updated tests (TestStdioPidTracking for orphan-only reaping; full suite not run in CI from this environment)
  • All existing tests pass locally (syntax + import smoke on patched modules; TestStdioPidTracking run for this change)

@Ito-69 Ito-69 force-pushed the fix/cron-mcp-orphan-cleanup branch 2 times, most recently from a57e799 to dcfe6cc Compare April 21, 2026 06:12
@Ito-69 Ito-69 changed the title fix(cron): force-kill MCP stdio subprocesses after each job run fix(cron): reap orphaned MCP stdio subprocesses after each tick Apr 21, 2026
@Ito-69

Ito-69 commented Apr 21, 2026

Copy link
Copy Markdown
Author

Revision note (force-pushed a57e799fdcfe6cc2)

Rebased on current main (includes #13021) and rewrote the fix so it is parallel-safe under the new ThreadPoolExecutor tick:

  • The old v1 hook in run_job()'s finally: would kill the entire _stdio_pids set — which in the new concurrent executor means a job finishing first could reap the MCP subprocesses owned by its still-running siblings. It would also stomp on _stdio_pids entries from live Telegram/Slack chat sessions sharing the same process.
  • v2 introduces a separate _orphan_stdio_pids set that is populated only when a subprocess is still alive after its session's stdio_client/ClientSession contexts have exited. _kill_orphaned_mcp_children() now defaults to reaping just that set; the shutdown call site keeps the old "kill everything" behaviour via include_active=True.
  • The sweep hook moved from run_job() to tick(), after [f.result() for f in _futures] — i.e. it runs exactly once per tick and only when no sibling job from that tick can still be in flight.

Verified on a 7-job Fedora/Asahi cron setup: orphan count went from 6 (~570 MB) across 24h to 0 after rebasing and restarting the gateway. No regressions observed in live Telegram/Slack MCP calls.

Happy to split the _orphan_stdio_pids change into its own commit if you'd prefer a two-commit PR, or to add a targeted test around the orphan-detection path.

@Ito-69 Ito-69 force-pushed the fix/cron-mcp-orphan-cleanup branch 2 times, most recently from 78f8f3e to 026b7e6 Compare April 21, 2026 19:28
@alt-glitch alt-glitch added type/bug Something isn't working comp/cron Cron scheduler and job management tool/mcp MCP client and OAuth labels Apr 21, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Related to #12430 (similar reap approach in _run_stdio finally) and #11202 (original leak report). This PR adds parallel-safe orphan tracking on top of #13021's ThreadPoolExecutor.

@alt-glitch

Copy link
Copy Markdown
Collaborator

Related to #12430 and #11202

@Ito-69

Ito-69 commented Apr 24, 2026

Copy link
Copy Markdown
Author

Update (rebased on fe9d9a26 — force-push fd9a3a8d)

  • Resolved a rebase conflict in tools/mcp_tool.py with current main: kept upstream’s stderr log (errlog=), pid → server name dict tracking, and the SIGTERM → wait → SIGKILL reaper, while keeping v2 behaviour: try/finally in _run_stdio, _orphan_stdio_pids, include_active, and the post–ThreadPoolExecutor sweep in tick().
  • tests/tools/test_mcp_stability.py: TestStdioPidTracking now uses _orphan_stdio_pids for the default kill path (default reaping no longer iterates the full _stdio_pids dict).
  • Description: filled Fixes Gateway leaks stdio-MCP subprocess children over time (orphan 'read stdin' blocked processes, unbounded RSS growth) #11202; checklist reflects test updates.

@Ito-69

Ito-69 commented Apr 24, 2026

Copy link
Copy Markdown
Author

Amendment: Replaced Fixes #11202 with Related: #11202 in the description so the issue is not auto-closed on merge. Maintainers can close it manually when they are satisfied.

@Ito-69 Ito-69 force-pushed the fix/cron-mcp-orphan-cleanup branch from fd9a3a8 to af76902 Compare April 24, 2026 13:00
MCP stdio servers are spawned via the SDK's stdio_client, which on
Linux uses start_new_session=True (setsid).  When a cron job is
cancelled mid-way (timeout, agent finish, exception), the subprocess
often escapes the SDK's teardown and survives as a session leader.
Because setsid() detaches the child from the gateway's process group
/ cgroup tree, systemd does not reap it on service restart either —
so every cron tick that touches an MCP tool leaks a dangling server
process.

Fix:

* tools/mcp_tool.py — _run_stdio now wraps the whole stdio+session
  context in try/finally.  On any exit path (clean, exception,
  cancellation), PIDs still alive are moved from the active
  _stdio_pids set into a new _orphan_stdio_pids set.  Orphan
  detection is done via os.kill(pid, 0) — a cheap liveness probe
  that never signals the target.

* tools/mcp_tool.py — _kill_orphaned_mcp_children gains an
  include_active=False flag.  Default behaviour now only reaps the
  orphan set so concurrent sessions (other parallel cron jobs or
  live user chats) are never disrupted.  The existing shutdown path
  passes include_active=True to keep the previous "kill everything"
  semantics after the MCP loop is stopped.

* cron/scheduler.py — the cleanup hook is moved from run_job()'s
  finally (which would race with parallel siblings after NousResearch#13021)
  into tick() after the ThreadPoolExecutor has joined every future.
  At that point there are no in-flight sessions from this tick, so
  sweeping the orphan set is always safe.

Net effect: zero regression for healthy sessions, and orphan MCP
servers no longer accumulate between gateway restarts.

Made-with: Cursor
@teknium1

Copy link
Copy Markdown
Contributor

Merged via #16275 — your commit was cherry-picked onto current main with authorship preserved in git log. Thanks!

@teknium1 teknium1 closed this Apr 27, 2026
@Ito-69 Ito-69 deleted the fix/cron-mcp-orphan-cleanup branch April 27, 2026 05:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cron Cron scheduler and job management tool/mcp MCP client and OAuth type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants