MCP child processes leak when McpConnectionManager is replaced

## What issue are you seeing?

A long-running `codex --dangerously-bypass-approvals-and-sandbox` daemon leaks stdio MCP child processes over time. A single daemon with roughly 15 hours of uptime accumulated 492 orphaned MCP children, 123 of each of 4 configured stdio MCP servers, all direct children (PPID) of the daemon process. Each leaked child was busy-looping at roughly 35% CPU, and cumulative RSS of the leaked set was about 82 GB. The spawn/leak rate was approximately one full cycle per 7 minutes.

Configured stdio MCP servers in the affected daemon:

- `mcp-codebase-index`
- `codebase-memory-mcp`
- `github-mcp-server --toolsets actions,issues,pull_requests,repos stdio`
- `ast-grep-server`

Confirmed via `ps`:

- 492 processes with PPID equal to the daemon PID
- exactly 123 instances per configured server (123 × 4 = 492)
- killing the daemon PID reaped all 492 children immediately, ruling out a non-codex parent

## What steps can reproduce the bug?

I do not have a minimal isolated repro yet. Observed on a long-lived daemon with four configured stdio MCP servers. Based on code inspection the leak should reproduce by driving either of these paths repeatedly:

1. Session MCP refresh, via `refresh_mcp_servers_inner` in `codex-rs/core/src/session/mcp.rs:177`.
2. Accessible-connectors refresh, via `compute_accessible_connectors` in `codex-rs/core/src/connectors.rs` (reached through `list_accessible_connectors_from_mcp_tools*`).

Both paths construct a fresh `McpConnectionManager` (which spawns all configured MCP server processes), perform tool listing, then release the manager. Field leak rate (about one cycle per 7 minutes across 15 hours) matches "one refresh cycle per interval" rather than "one per daemon start".

A targeted integration test would configure a stub stdio MCP server that records its own PID at startup, drive the refresh path N times, and assert that previous PIDs are no longer alive by the time the N-th refresh completes.

## What is the expected behavior?

When an `McpConnectionManager` is replaced or dropped, every MCP child process it spawned should be terminated (SIGTERM, escalating to SIGKILL after a bounded grace period) during that drop. Teardown should not depend on `Arc<RunningService>` refcount reaching zero, because cloned `Arc`s legitimately escape the manager's ownership boundary during in-flight operations.

## Additional information

### Root cause

`McpConnectionManager` (`codex-rs/codex-mcp/src/mcp_connection_manager.rs:657`) has no `Drop` impl. Child-process cleanup is entirely downstream of `Arc<RunningService>` refcount reaching zero.

`StdioServerTransport` owns both a `TokioChildProcess` with `kill_on_drop(true)` and a `ProcessGroupGuard` that SIGTERMs the child's process group on drop:

- `codex-rs/rmcp-client/src/stdio_server_launcher.rs:211`, `command.kill_on_drop(true)`
- `codex-rs/rmcp-client/src/stdio_server_launcher.rs:218`, `command.process_group(0)`
- `codex-rs/rmcp-client/src/stdio_server_launcher.rs:290-296`, `impl Drop for ProcessGroupGuard` calling `terminate_process_group`

But `StdioServerTransport` is held inside `RunningService`, which is held inside `Arc<RunningService<...>>` inside `ClientState::Ready`, which is held inside `RmcpClient`, which is wrapped by `AsyncManagedClient` as a cloneable shared future:

- `codex-rs/codex-mcp/src/mcp_connection_manager.rs:478-483`, `AsyncManagedClient { client: Shared<BoxFuture<..., Result<ManagedClient, _>>>, ... }`
- `codex-rs/rmcp-client/src/rmcp_client.rs:494`, `pub struct RmcpClient`
- `codex-rs/rmcp-client/src/rmcp_client.rs:1014-1049`, `run_service_operation` clones the `Arc<RunningService>` into every in-flight operation

Every in-flight `run_service_operation_once` clones the `Arc`. Dropping the manager drops only its reference; any concurrent or detached task holding an `Arc` clone keeps the transport, and therefore the child PID, alive indefinitely. The `Shared<BoxFuture<...>>` at `AsyncManagedClient.client` additionally retains the completed `ManagedClient` for replay; any other clone of that `Shared` extends client lifetime past manager drop.

### Why the observed symptoms match this mechanism

- **PPID equals daemon**: `LocalStdioServerLauncher::launch_server` spawns via `tokio::process::Command` directly in the daemon (`stdio_server_launcher.rs:209-225`). No subagent intermediary.
- **123 of each of 4 servers**: each `McpConnectionManager::new(...)` call spawns one process per configured server, so multiples of 4 are expected.
- **Roughly 35% CPU per leaked child**: consistent with the child's stdio reader loop spinning after the parent dropped its pipe end without delivering EOF or SIGTERM.
- **Scales with uptime, not startup**: the daemon boots once; refresh paths fire repeatedly.

### Replacement sites with no teardown

`codex-rs/core/src/session/mcp.rs:200-229`:

```rust
let (refreshed_manager, cancel_token) = McpConnectionManager::new(...).await;
...
let mut manager = self.services.mcp_connection_manager.write().await;
*manager = refreshed_manager;        // old manager dropped here, no shutdown()
```

`codex-rs/core/src/connectors.rs:238-320` (`compute_accessible_connectors`):

```rust
let (mcp_connection_manager, cancel_token) = McpConnectionManager::new(...).await;
...
// function returns; manager falls out of scope, no shutdown()
```

### Proposed fix direction

**A. Explicit async shutdown.** Add `pub async fn shutdown(&mut self)` on `McpConnectionManager` that drives each `AsyncManagedClient` to readiness and calls a new `RmcpClient::shutdown()` which closes the transport and kills the process group. Invoke `shutdown()` at both the replacement and the scope-exit sites.

**B. Manager-level `Drop` plus hoisted PGID.** Hoist the child's process group ID out of the `StdioServerTransport` and `Arc<RunningService>` chain and store it at the `AsyncManagedClient` level, outside the `Arc`. Add `impl Drop for McpConnectionManager` that iterates `self.clients` and calls `terminate_process_group` on each stored PGID synchronously (the existing `ProcessGroupGuard::drop` is already sync via `libc` kill, so no async is required). This handles panic-driven drops and requires no caller changes.

Option B more closely matches the invariant that the manager owns these processes' lifetime. Per `docs/contributing.md` external PRs are invitation-only. Happy to prepare one if helpful.

### Operational workaround

Restart the daemon to reap leaked children.

### Environment

- Platform: Linux, kernel `6.17.0-20-generic`
- Codex: `codex-cli 0.122.0`
- MCP transports: 4 configured stdio servers (listed above)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MCP child processes leak when McpConnectionManager is replaced #18881

What issue are you seeing?

What steps can reproduce the bug?

What is the expected behavior?

Additional information

Root cause

Why the observed symptoms match this mechanism

Replacement sites with no teardown

Proposed fix direction

Operational workaround

Environment

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

MCP child processes leak when McpConnectionManager is replaced #18881

Description

What issue are you seeing?

What steps can reproduce the bug?

What is the expected behavior?

Additional information

Root cause

Why the observed symptoms match this mechanism

Replacement sites with no teardown

Proposed fix direction

Operational workaround

Environment

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions