What issue are you seeing?
A long-running codex --dangerously-bypass-approvals-and-sandbox daemon leaks stdio MCP child processes over time. A single daemon with roughly 15 hours of uptime accumulated 492 leaked MCP children (123 for each of the 4 configured stdio MCP servers), all with the daemon process as their direct parent (PPID). Each leaked child was busy-looping at roughly 35% CPU, and the cumulative RSS of the leaked set was about 82 GB. The leak rate was approximately one full cycle (one child per configured server) every 7 minutes.
Configured stdio MCP servers in the affected daemon:
- mcp-codebase-index
- codebase-memory-mcp
- github-mcp-server --toolsets actions,issues,pull_requests,repos stdio
- ast-grep-server
Confirmed via ps:
- 492 processes with PPID equal to the daemon PID
- exactly 123 instances per configured server (123 × 4 = 492)
- killing the daemon PID reaped all 492 children immediately, ruling out a non-codex parent
What steps can reproduce the bug?
I do not have a minimal isolated repro yet. Observed on a long-lived daemon with four configured stdio MCP servers. Based on code inspection the leak should reproduce by driving either of these paths repeatedly:
- Session MCP refresh, via refresh_mcp_servers_inner in codex-rs/core/src/session/mcp.rs:177.
- Accessible-connectors refresh, via compute_accessible_connectors in codex-rs/core/src/connectors.rs (reached through list_accessible_connectors_from_mcp_tools*).
Both paths construct a fresh McpConnectionManager (which spawns all configured MCP server processes), perform tool listing, then release the manager. The observed leak rate (123 cycles in about 15 hours, roughly one every 7 minutes) is consistent with one leak per refresh cycle rather than one per daemon start.
A targeted integration test would configure a stub stdio MCP server that records its own PID at startup, drive the refresh path N times, and assert that previous PIDs are no longer alive by the time the N-th refresh completes.
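The shape of that test can be sketched with the standard library alone. Everything named here is hypothetical: StubManager stands in for McpConnectionManager (with the teardown this issue asks for already in place), `sleep 60` stands in for the stub stdio MCP server, and liveness is checked through /proc, so the sketch is Linux-only. A real test would drive the actual refresh path instead.

```rust
use std::io;
use std::path::Path;
use std::process::{Child, Command, Stdio};

/// Hypothetical stand-in for McpConnectionManager, with teardown on drop.
struct StubManager {
    children: Vec<Child>,
}

impl StubManager {
    /// Spawns one stub process per configured server.
    fn spawn(servers: usize) -> io::Result<Self> {
        // Build the struct first so Drop cleans up if a later spawn fails.
        let mut manager = StubManager { children: Vec::with_capacity(servers) };
        for _ in 0..servers {
            // `sleep 60` is an assumption standing in for a stub MCP server.
            let child = Command::new("sleep")
                .arg("60")
                .stdin(Stdio::null())
                .stdout(Stdio::null())
                .spawn()?;
            manager.children.push(child);
        }
        Ok(manager)
    }

    fn pids(&self) -> Vec<u32> {
        self.children.iter().map(Child::id).collect()
    }
}

impl Drop for StubManager {
    fn drop(&mut self) {
        for child in &mut self.children {
            let _ = child.kill(); // SIGKILL on Unix
            let _ = child.wait(); // reap, so no zombie entry lingers in /proc
        }
    }
}

/// Linux-only liveness check: a reaped PID has no /proc entry.
fn is_alive(pid: u32) -> bool {
    Path::new(&format!("/proc/{pid}")).exists()
}

/// Replaces the manager `refreshes` times and returns every PID from a
/// replaced manager that is still alive afterwards (the leak set).
fn leaked_after_refreshes(servers: usize, refreshes: usize) -> io::Result<Vec<u32>> {
    let mut manager = StubManager::spawn(servers)?;
    let mut previous = Vec::new();
    for _ in 0..refreshes {
        previous.extend(manager.pids());
        // The old manager is dropped by this assignment, as in the refresh path.
        manager = StubManager::spawn(servers)?;
    }
    Ok(previous.into_iter().filter(|pid| is_alive(*pid)).collect())
}
```

Calling leaked_after_refreshes(4, 3) should return an empty list when teardown works. Against a manager with the current behavior, the same assertion would be expected to fail with a multiple of 4 surviving PIDs.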
What is the expected behavior?
When an McpConnectionManager is replaced or dropped, every MCP child process it spawned should be terminated (SIGTERM, escalating to SIGKILL after a bounded grace period) during that drop. Teardown should not depend on Arc<RunningService> refcount reaching zero, because cloned Arcs legitimately escape the manager's ownership boundary during in-flight operations.
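For clarity, the escalation I have in mind looks roughly like the sketch below. It is standard-library only, so it shells out to the `kill` utility to deliver SIGTERM; that is an assumption made to keep the example dependency-free, and real code would presumably reuse terminate_process_group instead. The function names and the `sleep 60` child are hypothetical.

```rust
use std::io;
use std::process::{Child, Command, ExitStatus};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Asks the child to exit with SIGTERM, then escalates to SIGKILL once the
/// grace period has elapsed. Always reaps the child before returning.
fn terminate_with_grace(child: &mut Child, grace: Duration) -> io::Result<ExitStatus> {
    // Polite request first. Shelling out to `kill` is an assumption to stay
    // std-only; if the utility is missing we simply fall through to SIGKILL.
    let _ = Command::new("kill")
        .arg("-TERM")
        .arg(child.id().to_string())
        .status();

    // Bounded wait: poll until the child exits or the deadline passes.
    let deadline = Instant::now() + grace;
    while Instant::now() < deadline {
        if let Some(status) = child.try_wait()? {
            return Ok(status);
        }
        sleep(Duration::from_millis(20));
    }

    // Grace period expired: force termination and reap.
    child.kill()?;
    child.wait()
}

/// Spawns a throwaway child (`sleep 60`, an assumption) and terminates it.
fn spawn_and_terminate(grace: Duration) -> io::Result<ExitStatus> {
    let mut child = Command::new("sleep").arg("60").spawn()?;
    terminate_with_grace(&mut child, grace)
}
```

The polling loop blocks the calling thread for at most the grace period, which matters if this runs inside a synchronous Drop; a short grace period or an explicit async shutdown avoids stalling the runtime.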
Additional information
Root cause
McpConnectionManager (codex-rs/codex-mcp/src/mcp_connection_manager.rs:657) has no Drop impl. Child-process cleanup is entirely downstream of Arc<RunningService> refcount reaching zero.
StdioServerTransport owns both a TokioChildProcess with kill_on_drop(true) and a ProcessGroupGuard that SIGTERMs the child's process group on drop:
- codex-rs/rmcp-client/src/stdio_server_launcher.rs:211, command.kill_on_drop(true)
- codex-rs/rmcp-client/src/stdio_server_launcher.rs:218, command.process_group(0)
- codex-rs/rmcp-client/src/stdio_server_launcher.rs:290-296, impl Drop for ProcessGroupGuard calling terminate_process_group
But StdioServerTransport is held inside RunningService, which is held inside Arc<RunningService<...>> inside ClientState::Ready, which is held inside RmcpClient, which is wrapped by AsyncManagedClient as a cloneable shared future:
- codex-rs/codex-mcp/src/mcp_connection_manager.rs:478-483, AsyncManagedClient { client: Shared<BoxFuture<..., Result<ManagedClient, _>>>, ... }
- codex-rs/rmcp-client/src/rmcp_client.rs:494, pub struct RmcpClient
- codex-rs/rmcp-client/src/rmcp_client.rs:1014-1049, run_service_operation clones the Arc<RunningService> into every in-flight operation
Every in-flight run_service_operation_once clones the Arc. Dropping the manager drops only its reference; any concurrent or detached task holding an Arc clone keeps the transport, and therefore the child PID, alive indefinitely. The Shared<BoxFuture<...>> at AsyncManagedClient.client additionally retains the completed ManagedClient for replay; any other clone of that Shared extends client lifetime past manager drop.
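The refcount behavior can be shown in isolation. This is a minimal model, not the codex types: Transport and Manager are hypothetical stand-ins, and the AtomicBool flag replaces the actual kill so the example runs anywhere.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Stand-in for the transport: its Drop is the only place the "child" dies.
struct Transport {
    killed: Arc<AtomicBool>,
}

impl Drop for Transport {
    fn drop(&mut self) {
        // In the real code, kill_on_drop and ProcessGroupGuard fire here.
        self.killed.store(true, Ordering::SeqCst);
    }
}

/// Stand-in for the manager: holds one Arc reference and has no Drop impl.
struct Manager {
    service: Arc<Transport>,
}

/// Returns (killed after the manager is dropped, killed after the last clone is dropped).
fn drop_manager_with_inflight_clone() -> (bool, bool) {
    let killed = Arc::new(AtomicBool::new(false));
    let manager = Manager {
        service: Arc::new(Transport { killed: Arc::clone(&killed) }),
    };

    // An in-flight operation holds its own clone of the Arc.
    let inflight = Arc::clone(&manager.service);

    // Dropping the manager only decrements the refcount from 2 to 1.
    drop(manager);
    let after_manager_drop = killed.load(Ordering::SeqCst);

    // Only when the last clone goes away does Transport::drop run.
    drop(inflight);
    let after_clone_drop = killed.load(Ordering::SeqCst);

    (after_manager_drop, after_clone_drop)
}
```

The function returns (false, true): the transport survives the manager and is torn down only when the escaped clone is released. If that clone is held by a task that never finishes, teardown never happens.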
Why the observed symptoms match this mechanism
- PPID equals daemon: LocalStdioServerLauncher::launch_server spawns via tokio::process::Command directly in the daemon (stdio_server_launcher.rs:209-225). No subagent intermediary.
- 123 of each of 4 servers: each McpConnectionManager::new(...) call spawns one process per configured server, so multiples of 4 are expected.
- Roughly 35% CPU per leaked child: consistent with the child's stdio reader loop spinning after the parent dropped its pipe end without delivering EOF or SIGTERM.
- Scales with uptime, not startup: the daemon boots once; refresh paths fire repeatedly.
Replacement sites with no teardown
codex-rs/core/src/session/mcp.rs:200-229:
let (refreshed_manager, cancel_token) = McpConnectionManager::new(...).await;
...
let mut manager = self.services.mcp_connection_manager.write().await;
*manager = refreshed_manager; // old manager dropped here, no shutdown()
codex-rs/core/src/connectors.rs:238-320 (compute_accessible_connectors):
let (mcp_connection_manager, cancel_token) = McpConnectionManager::new(...).await;
...
// function returns; manager falls out of scope, no shutdown()
Proposed fix direction
A. Explicit async shutdown. Add pub async fn shutdown(&mut self) on McpConnectionManager that drives each AsyncManagedClient to readiness and calls a new RmcpClient::shutdown() which closes the transport and kills the process group. Invoke shutdown() at both the replacement and the scope-exit sites.
B. Manager-level Drop plus hoisted PGID. Hoist the child's process group ID out of the StdioServerTransport and Arc<RunningService> chain and store it at the AsyncManagedClient level, outside the Arc. Add impl Drop for McpConnectionManager that iterates self.clients and calls terminate_process_group on each stored PGID synchronously (the existing ProcessGroupGuard::drop is already sync via libc kill, so no async is required). This handles panic-driven drops and requires no caller changes.
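A rough sketch of the ownership shape for option B follows. All types here are hypothetical stand-ins rather than the codex definitions, and the terminate callback records PGIDs instead of signalling so the example is runnable without any processes; in the real fix it would be terminate_process_group.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for terminate_process_group: injected as a callback
/// so this sketch can record calls instead of sending signals.
type Terminator = Arc<dyn Fn(i32) + Send + Sync>;

/// Stand-in for RunningService; deliberately has no Drop of its own.
struct RunningService;

/// Per-client slot with the PGID hoisted outside the Arc.
struct ClientSlot {
    pgid: i32,
    service: Arc<RunningService>,
}

struct Manager {
    clients: Vec<ClientSlot>,
    terminate: Terminator,
}

impl Drop for Manager {
    fn drop(&mut self) {
        // Runs on replacement, scope exit, and panic unwinding alike,
        // regardless of how many Arc clones are still outstanding.
        for client in &self.clients {
            (self.terminate)(client.pgid);
        }
    }
}

/// Builds a manager for the given PGIDs, leaks an Arc clone of every service
/// out of it, drops the manager, and returns the PGIDs that were terminated.
fn terminated_on_drop(pgids: &[i32]) -> Vec<i32> {
    let log = Arc::new(Mutex::new(Vec::<i32>::new()));
    let sink = Arc::clone(&log);
    let terminate: Terminator = Arc::new(move |pgid: i32| {
        sink.lock().unwrap().push(pgid);
    });

    let manager = Manager {
        clients: pgids
            .iter()
            .map(|&pgid| ClientSlot { pgid, service: Arc::new(RunningService) })
            .collect(),
        terminate,
    };

    // Simulate in-flight operations that still hold Arc clones.
    let escaped: Vec<Arc<RunningService>> =
        manager.clients.iter().map(|c| Arc::clone(&c.service)).collect();

    drop(manager);
    let terminated = log.lock().unwrap().clone();
    drop(escaped);
    terminated
}
```

With PGIDs [101, 202, 303, 404], every group is terminated at manager drop even though all four Arc clones are still alive at that point. One caveat a real implementation would need to handle: the PGID is only known once the client future has resolved, so a manager dropped mid-startup needs a separate path.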
Option B more closely matches the invariant that the manager owns these processes' lifetime. Per docs/contributing.md, external PRs are invitation-only; I am happy to prepare one if helpful.
Operational workaround
Restart the daemon to reap leaked children.
Environment
- Platform: Linux, kernel 6.17.0-20-generic
- Codex: codex-cli 0.122.0
- MCP transports: 4 configured stdio servers (listed above)