Skip to content

MCP child processes leak when McpConnectionManager is replaced #18881

@7etsuo

Description

@7etsuo

What issue are you seeing?

A long-running codex --dangerously-bypass-approvals-and-sandbox daemon leaks stdio MCP child processes over time. A single daemon with roughly 15 hours of uptime accumulated 492 orphaned MCP children, 123 of each of 4 configured stdio MCP servers, all direct children (PPID) of the daemon process. Each leaked child was busy-looping at roughly 35% CPU, and cumulative RSS of the leaked set was about 82 GB. The spawn/leak rate was approximately one full cycle per 7 minutes.

Configured stdio MCP servers in the affected daemon:

  • mcp-codebase-index
  • codebase-memory-mcp
  • github-mcp-server --toolsets actions,issues,pull_requests,repos stdio
  • ast-grep-server

Confirmed via ps:

  • 492 processes with PPID equal to the daemon PID
  • exactly 123 instances per configured server (123 × 4 = 492)
  • killing the daemon PID reaped all 492 children immediately, ruling out a non-codex parent

What steps can reproduce the bug?

I do not have a minimal isolated repro yet. Observed on a long-lived daemon with four configured stdio MCP servers. Based on code inspection the leak should reproduce by driving either of these paths repeatedly:

  1. Session MCP refresh, via refresh_mcp_servers_inner in codex-rs/core/src/session/mcp.rs:177.
  2. Accessible-connectors refresh, via compute_accessible_connectors in codex-rs/core/src/connectors.rs (reached through list_accessible_connectors_from_mcp_tools*).

Both paths construct a fresh McpConnectionManager (which spawns all configured MCP server processes), perform tool listing, then release the manager. Field leak rate (about one cycle per 7 minutes across 15 hours) matches "one refresh cycle per interval" rather than "one per daemon start".

A targeted integration test would configure a stub stdio MCP server that records its own PID at startup, drive the refresh path N times, and assert that previous PIDs are no longer alive by the time the N-th refresh completes.

What is the expected behavior?

When an McpConnectionManager is replaced or dropped, every MCP child process it spawned should be terminated (SIGTERM, escalating to SIGKILL after a bounded grace period) during that drop. Teardown should not depend on Arc<RunningService> refcount reaching zero, because cloned Arcs legitimately escape the manager's ownership boundary during in-flight operations.

Additional information

Root cause

McpConnectionManager (codex-rs/codex-mcp/src/mcp_connection_manager.rs:657) has no Drop impl. Child-process cleanup is entirely downstream of Arc<RunningService> refcount reaching zero.

StdioServerTransport owns both a TokioChildProcess with kill_on_drop(true) and a ProcessGroupGuard that SIGTERMs the child's process group on drop:

  • codex-rs/rmcp-client/src/stdio_server_launcher.rs:211, command.kill_on_drop(true)
  • codex-rs/rmcp-client/src/stdio_server_launcher.rs:218, command.process_group(0)
  • codex-rs/rmcp-client/src/stdio_server_launcher.rs:290-296, impl Drop for ProcessGroupGuard calling terminate_process_group

But StdioServerTransport is held inside RunningService, which is held inside Arc<RunningService<...>> inside ClientState::Ready, which is held inside RmcpClient, which is wrapped by AsyncManagedClient as a cloneable shared future:

  • codex-rs/codex-mcp/src/mcp_connection_manager.rs:478-483, AsyncManagedClient { client: Shared<BoxFuture<..., Result<ManagedClient, _>>>, ... }
  • codex-rs/rmcp-client/src/rmcp_client.rs:494, pub struct RmcpClient
  • codex-rs/rmcp-client/src/rmcp_client.rs:1014-1049, run_service_operation clones the Arc<RunningService> into every in-flight operation

Every in-flight run_service_operation_once clones the Arc. Dropping the manager drops only its reference; any concurrent or detached task holding an Arc clone keeps the transport, and therefore the child PID, alive indefinitely. The Shared<BoxFuture<...>> at AsyncManagedClient.client additionally retains the completed ManagedClient for replay; any other clone of that Shared extends client lifetime past manager drop.

Why the observed symptoms match this mechanism

  • PPID equals daemon: LocalStdioServerLauncher::launch_server spawns via tokio::process::Command directly in the daemon (stdio_server_launcher.rs:209-225). No subagent intermediary.
  • 123 of each of 4 servers: each McpConnectionManager::new(...) call spawns one process per configured server, so multiples of 4 are expected.
  • Roughly 35% CPU per leaked child: consistent with the child's stdio reader loop spinning after the parent dropped its pipe end without delivering EOF or SIGTERM.
  • Scales with uptime, not startup: the daemon boots once; refresh paths fire repeatedly.

Replacement sites with no teardown

codex-rs/core/src/session/mcp.rs:200-229:

let (refreshed_manager, cancel_token) = McpConnectionManager::new(...).await;
...
let mut manager = self.services.mcp_connection_manager.write().await;
*manager = refreshed_manager;        // old manager dropped here, no shutdown()

codex-rs/core/src/connectors.rs:238-320 (compute_accessible_connectors):

let (mcp_connection_manager, cancel_token) = McpConnectionManager::new(...).await;
...
// function returns; manager falls out of scope, no shutdown()

Proposed fix direction

A. Explicit async shutdown. Add pub async fn shutdown(&mut self) on McpConnectionManager that drives each AsyncManagedClient to readiness and calls a new RmcpClient::shutdown() which closes the transport and kills the process group. Invoke shutdown() at both the replacement and the scope-exit sites.

B. Manager-level Drop plus hoisted PGID. Hoist the child's process group ID out of the StdioServerTransport and Arc<RunningService> chain and store it at the AsyncManagedClient level, outside the Arc. Add impl Drop for McpConnectionManager that iterates self.clients and calls terminate_process_group on each stored PGID synchronously (the existing ProcessGroupGuard::drop is already sync via libc kill, so no async is required). This handles panic-driven drops and requires no caller changes.

Option B more closely matches the invariant that the manager owns these processes' lifetime. Per docs/contributing.md external PRs are invitation-only. Happy to prepare one if helpful.

Operational workaround

Restart the daemon to reap leaked children.

Environment

  • Platform: Linux, kernel 6.17.0-20-generic
  • Codex: codex-cli 0.122.0
  • MCP transports: 4 configured stdio servers (listed above)

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingmcpIssues related to the use of model context protocol (MCP) servers

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions