fix(mcp): reap stdio subprocesses in _run_stdio finally block#12430
Closed
lyr1cs wants to merge 1 commit into
Defends against stdio_client's anyio cleanup failing to close the subprocess stdin pipe on exception paths. Without this, MCPServerTask.run's outer reconnect loop (tools/mcp_tool.py:1113-1200) spawns a fresh child on every transient error while the previous child stays blocked on read(stdin) indefinitely, producing the unbounded orphan accumulation tracked in NousResearch#11202.

- Wrap the stdio_client() context in try/finally
- Always invoke a new _reap_pids() helper on the tracked PID set: SIGTERM, wait up to 1.5s, SIGKILL stragglers
- _kill_orphaned_mcp_children() still handles the full-shutdown path

Refs: NousResearch#11202
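For readers skimming the commit message, here is a minimal sketch of the try/finally shape it describes. It is not the code from this PR: `fake_stdio_client`, `fake_reap_pids`, `run_stdio_once`, and the PID 4242 are hypothetical stand-ins, so the example runs without the MCP SDK and without spawning anything.

```python
import asyncio
from contextlib import asynccontextmanager

reaped = []  # records which PIDs the finally block handed to the reaper


@asynccontextmanager
async def fake_stdio_client(pids):
    # Hypothetical stand-in for stdio_client(): "spawns" a child and tracks its PID.
    pids.add(4242)
    yield "read_stream", "write_stream"
    # Deliberately no cleanup, to mimic the exit path that leaves the child behind.


async def fake_reap_pids(pids, grace=1.5):
    # Hypothetical stand-in for _reap_pids(): records the PIDs and clears the set.
    reaped.extend(sorted(pids))
    pids.clear()


async def run_stdio_once(pids, fail):
    try:
        async with fake_stdio_client(pids):
            if fail:
                raise ConnectionError("transient read error")
    finally:
        # Runs on the normal path and on the exception path alike.
        await fake_reap_pids(pids)


async def demo():
    pids = set()
    try:
        await run_stdio_once(pids, fail=True)
    except ConnectionError:
        pass  # an outer reconnect loop would sleep and retry here
    return pids


leftover = asyncio.run(demo())
```

The point of the shape is that the reaper runs before the exception reaches the reconnect loop, so a retry never starts while the previous child is still tracked.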
Superseded by #16275 (salvaged @Ito-69's #12978). Your PR first identified the root cause — thanks! The merged approach separates active vs orphaned PIDs via |
Closes #11202.
Summary
`_run_stdio` in `tools/mcp_tool.py` relies on `stdio_client(...)`'s `__aexit__` (anyio-backed) to reap the MCP subprocess. On exception paths (transient read/decode error, timeout, etc.) anyio's cancel-scope cleanup does not always close the subprocess's stdin pipe, leaving the child blocked on `read` indefinitely.

`MCPServerTask.run()`'s outer reconnect loop (`mcp_tool.py:1113–1200`) then catches the exception, sleeps, and re-enters `_run_stdio`, spawning a fresh subprocess. The previously tracked PID stays in `_stdio_pids`, but nothing reaps it until full gateway shutdown. `_kill_orphaned_mcp_children()` only fires from `_stop_mcp_loop()`, not on per-server reconnect.

Net effect: on a stable gateway PID (no `--replace`, no reload), subprocess count grows roughly linearly with uptime × transient-error rate. The field report in #11202 shows 28 orphans in ~1 day, all blocked on `read` of stdin, one reaching 300 MiB RSS.

Change
- Wrap the `async with stdio_client(...)` block in `_run_stdio` with `try`/`finally`.
- In the `finally`, reap the tracked PIDs: `SIGTERM` → up to 1.5 s grace → `SIGKILL` stragglers.
- Add two helpers alongside `_kill_orphaned_mcp_children`:
  - `_pid_alive(pid)`: `kill(pid, 0)` probe
  - `_reap_pids(pids, grace)`: async SIGTERM/grace/SIGKILL

HTTP transport is untouched. `_kill_orphaned_mcp_children` still handles the full-gateway-shutdown path.

Why in-tool instead of upstream SDK
Fixing `stdio_client`'s anyio cleanup belongs in the MCP Python SDK, but:

- The tool tracks the subprocess PIDs itself (`_stdio_pids`), so the information needed to reap is already on hand.

Test plan
- `python -m pytest tests/tools/test_mcp_tool.py -x`
- `initialize` call, confirm child count stays flat via `ps --ppid $GATEWAY_PID | grep stdio-mcp | wc -l`.
- `SIGTERM` to gateway still reaps children (should, since `__aexit__` path now has belt-and-suspenders).