fix(mcp): reap stdio subprocesses via orphan set (salvage #12978, closes #11202) #16275
Merged
MCP stdio servers are spawned via the SDK's stdio_client, which on Linux uses start_new_session=True (setsid). When a cron job is cancelled mid-way (timeout, agent finish, exception), the subprocess often escapes the SDK's teardown and survives as a session leader. Because setsid() detaches the child from the gateway's process group / cgroup tree, systemd does not reap it on service restart either, so every cron tick that touches an MCP tool leaks a dangling server process.

Fix:

* tools/mcp_tool.py: _run_stdio now wraps the whole stdio+session context in try/finally. On any exit path (clean, exception, cancellation), PIDs still alive are moved from the active _stdio_pids set into a new _orphan_stdio_pids set. Orphan detection is done via os.kill(pid, 0), a cheap liveness probe that never signals the target.
* tools/mcp_tool.py: _kill_orphaned_mcp_children gains an include_active=False flag. Default behaviour now only reaps the orphan set, so concurrent sessions (other parallel cron jobs or live user chats) are never disrupted. The existing shutdown path passes include_active=True to keep the previous "kill everything" semantics after the MCP loop is stopped.
* cron/scheduler.py: the cleanup hook is moved from run_job()'s finally (which would race with parallel siblings after #13021) into tick(), after the ThreadPoolExecutor has joined every future. At that point there are no in-flight sessions from this tick, so sweeping the orphan set is always safe.

Net effect: zero regression for healthy sessions, and orphan MCP servers no longer accumulate between gateway restarts.

Made-with: Cursor
This was referenced Apr 27, 2026
Reaps stdio-MCP subprocess children that escape stdio_client()'s anyio cleanup on exception / cancellation, so they stop accumulating as orphan session leaders blocked on read(stdin). Closes #11202.
Straight cherry-pick of #12978 by @Ito-69, clean against current main.
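For readers unfamiliar with the "session leader blocked on read(stdin)" state, the following small reproduction may help. It uses a plain Python child as a stand-in for an MCP server; the function names are hypothetical and nothing here comes from the MCP SDK. It is POSIX-only.

```python
import os
import subprocess
import sys


def spawn_blocked_reader() -> subprocess.Popen:
    """Start a child in its own session that blocks on read(stdin)."""
    return subprocess.Popen(
        [sys.executable, "-c", "import sys; sys.stdin.read()"],
        stdin=subprocess.PIPE,
        start_new_session=True,  # the child calls setsid() before exec
    )


def session_info(pid: int) -> dict:
    """Compare the child's session and process group with our own."""
    sid = os.getsid(pid)
    return {
        "is_session_leader": sid == pid,
        "shares_our_session": sid == os.getsid(0),
        "shares_our_process_group": os.getpgid(pid) == os.getpgrp(),
    }


if __name__ == "__main__":
    child = spawn_blocked_reader()
    try:
        print(session_info(child.pid))
    finally:
        child.stdin.close()  # EOF lets the blocked read() return
        child.wait(timeout=10)
```

The child leads its own session and process group, so signals aimed at the parent's process group do not reach it.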
Why this PR over #12430
Both PRs wrap _run_stdio's stdio+session context in try/finally. The critical difference: #12430 clears the full _stdio_pids set on every exit, which races with concurrent cron jobs (after #13021 made tick() use a ThreadPoolExecutor) and live user chats sharing the tracking dict. #12978 separates active vs orphaned PIDs via an os.kill(pid, 0) liveness probe into a new _orphan_stdio_pids set, so cleanup sweeps never disturb in-flight sessions.
Changes
* tools/mcp_tool.py: _run_stdio wrapped in try/finally; still-alive PIDs migrate to _orphan_stdio_pids (new set).
* tools/mcp_tool.py: _kill_orphaned_mcp_children gains include_active: bool = False. The default reaps only orphans; _stop_mcp_loop() passes include_active=True to preserve the full-shutdown "kill everything" semantics.
* cron/scheduler.py: post-executor sweep in tick() after all futures have joined (safe point, no in-flight sessions).
* tests/tools/test_mcp_stability.py: TestStdioPidTracking updated for the orphan-only default contract.
Validation
test_mcp_stability.py
_kill_orphaned_mcp_children() wiped active PIDs of concurrent sessions
_orphan_stdio_pids; include_active=True kept for shutdown
mempalace.mcp_server processes / ~570 MB after 24h, 7 cron jobs
Closes #11202. Credit to @lyr1cs (#12430) for first identifying the root cause; merged approach is #12978 for the parallel-safe design.