openclaw agent / openclaw infer CLI processes don't exit; MCP stdio server children orphaned and accumulating (~66 MB each) #71457

@spartoviMD

Summary

On a host where the OpenClaw gateway runs continuously and openclaw agent is invoked many times (in our case, once per inbound email via a third-party Microsoft Graph bridge), MCP stdio server children spawned for the local-bridge MCP integration are never reaped. They accumulate at roughly 66 MB RSS each with no upper bound until the gateway is restarted.

The MCP server itself is a textbook stdio implementation using @modelcontextprotocol/sdk@^1.12.0:

import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js';
const server = new McpServer({ name: 'local-bridge-mcp', version: '0.1.0' });
// ... server.registerTool(...) ×N ...
const transport = new StdioServerTransport();
await server.connect(transport);

The same server.js works correctly under Claude Code and other MCP hosts — the children exit cleanly when the host closes our stdin or sends a shutdown request, and Node exits naturally because nothing is left holding the event loop. So the MCP server code is not the issue.
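The "exit when the host closes stdin" behavior can also be made explicit instead of relying on the event loop draining. The following is a minimal sketch, not part of the original server.js; `installStdinEofExit` is a hypothetical helper name, and whether the SDK transport already reacts to stdin closing on its own is not verified here.

```javascript
// Hypothetical helper: exit as soon as the host closes our stdin.
// `stdin` and `exit` are injectable so the behavior can be exercised in isolation.
function installStdinEofExit(stdin = process.stdin, exit = process.exit) {
  let done = false;
  const onEof = () => {
    if (done) return; // 'end' and 'close' may both fire; act only once
    done = true;
    console.error('[mcp] stdin closed by host, exiting');
    exit(0);
  };
  // Adding these listeners does not start reading; the transport still owns the data flow.
  stdin.on('end', onEof);
  stdin.on('close', onEof);
}

// Usage, assuming it is called once after `await server.connect(transport)`:
// installStdinEofExit();
```

This only helps when the host actually closes the pipe; it does nothing for the case reported here, where the parent stays alive and keeps the pipe open.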

Root cause (hypothesis)

The leak is an OpenClaw-side lifecycle bug, not an MCP-protocol or SDK-side bug. The diagnostic that points there:

$ ps -A -o pid,ppid,etime,command | grep -E 'openclaw-(agent|infer|gateway)'
10828 10826   58:54 openclaw-infer
11260 11257   39:09 openclaw-agent
11463     1   28:58 openclaw-gateway

The first two are one-shot CLI invocations that should have exited within seconds of finishing a single agent turn, yet they stay alive for many minutes after the work completes. The third is the long-running gateway daemon, which is expected to persist but should still release its per-turn children. Because the CLI processes don't exit, they never close the stdio pipes to their MCP server children. Having completed server.connect(transport), the children keep waiting for input on stdin indefinitely, exactly as the MCP SDK design says they should.

lsof on a leaked MCP child confirms its stdin pipe is still actively connected to the parent's writing end:

$ lsof -p <leaked-mcp-pid>
node    12719 <user>   4     PIPE 0xc6b48af03ed5df52     16384  ->0x799250810c02e91f
node    12719 <user>   5     PIPE 0x799250810c02e91f     16384  ->0xc6b48af03ed5df52

Reproduction

  1. OpenClaw 2026.4.23 (a979721) installed and running as ai.openclaw.gateway LaunchAgent on macOS (Apple Silicon).
  2. An MCP server registered in ~/.openclaw/openclaw.json under mcp.servers.local-bridge using stdio transport. The server's only behavior is to register a few tools that proxy (via an httpJson helper) to a localhost HTTP service.
  3. Repeatedly invoke openclaw agent --message ... --json --timeout 60 (we do this from a third-party bridge, but a shell loop reproduces).
  4. Observe that:
    • Every openclaw agent invocation lingers as an openclaw-agent process for many minutes after returning.
    • Each invocation also leaves behind one node .../mcp-server/server.js child of either the gateway or the agent process.
    • ps | grep mcp-server | wc -l grows monotonically.
    • Total RSS climbs by ~66 MB per turn.
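The reproduction steps above can be sketched as a small Node script. This is a rough sketch under assumptions: it uses only the openclaw agent flags quoted in step 3, the message text is a placeholder, and `countMcpChildren` and `runRepro` are hypothetical helper names.

```javascript
const { execFileSync } = require('node:child_process');

// Count `ps` output lines that look like the MCP server child process.
function countMcpChildren(psText) {
  return psText.split('\n').filter((line) => line.includes('mcp-server/server.js')).length;
}

// Run `turns` agent invocations and record the MCP child count after each one.
function runRepro(turns) {
  const counts = [];
  for (let i = 0; i < turns; i++) {
    try {
      execFileSync(
        'openclaw',
        ['agent', '--message', `repro turn ${i}`, '--json', '--timeout', '60'],
        { stdio: 'ignore' },
      );
    } catch (err) {
      console.error(`turn ${i} failed: ${err.message}`);
    }
    const psText = execFileSync('ps', ['-A', '-o', 'command'], { encoding: 'utf8' });
    counts.push(countMcpChildren(psText));
  }
  return counts;
}

// Usage: console.log(runRepro(5));
// If the leak is present, the recorded counts should keep growing from turn to turn.
```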

Observed scale

On a host that had processed ~17 inbound emails since the last gateway restart:

$ ps -A -o pid,rss,etime,command | grep '[m]cp-server/server.js' \
    | awk '{rss+=$2; n++} END {printf "%d processes, total RSS: %.1f MB\n", n, rss/1024}'
17 processes, total RSS: 1130.2 MB

15 of the 17 children are direct children of the gateway daemon; the other 2 are children of the abandoned openclaw-agent/openclaw-infer parents shown above.
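The split by parent process can be computed rather than read off by eye. A minimal sketch, assuming input in the column order produced by `ps -A -o pid,ppid,rss,command` (RSS in KB); `summarizeByParent` is a hypothetical helper name.

```javascript
// Group leaked MCP children by parent PID and total their RSS in MB.
// Expects text in the column order: PID PPID RSS COMMAND.
function summarizeByParent(psText, needle = 'mcp-server/server.js') {
  const byParent = new Map();
  for (const line of psText.split('\n')) {
    const m = line.match(/^\s*(\d+)\s+(\d+)\s+(\d+)\s+(.*)$/);
    if (!m || !m[4].includes(needle)) continue; // skips the header and unrelated processes
    const ppid = Number(m[2]);
    const entry = byParent.get(ppid) ?? { count: 0, rssMb: 0 };
    entry.count += 1;
    entry.rssMb += Number(m[3]) / 1024; // ps reports RSS in kilobytes
    byParent.set(ppid, entry);
  }
  return byParent;
}

// Usage:
// const { execFileSync } = require('node:child_process');
// const text = execFileSync('ps', ['-A', '-o', 'pid,ppid,rss,command'], { encoding: 'utf8' });
// for (const [ppid, s] of summarizeByParent(text)) {
//   console.log(`parent ${ppid}: ${s.count} children, ${s.rssMb.toFixed(1)} MB`);
// }
```

Children whose parent PID matches the gateway point at the daemon's per-turn cleanup; the rest belong to lingering CLI processes.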

Expected behavior

After an agent turn completes:

  1. Any one-shot openclaw agent / openclaw infer CLI invocation exits, returning its result.
  2. Any spawned MCP server children have their stdin closed (or receive a shutdown request), see EOF, and exit naturally.
  3. The parent reaps the child via wait() so it doesn't become a zombie.
  4. The long-running openclaw gateway daemon does the same on every turn: spawn, use, shut down, reap.

This is what other MCP hosts (Claude Code, Cursor, etc.) do with the same server.js, and it's the contract the MCP SDK assumes.
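The expected host-side sequence can be sketched with Node's child_process module. This illustrates the pattern only and is not OpenClaw's actual code: `shutdownChild` and `withMcpChild` are hypothetical names, and the grace periods are arbitrary. In Node, the runtime reaps the child itself; the 'exit' event is how the host observes that.

```javascript
const { spawn } = require('node:child_process');

// Close the child's stdin, wait for it to exit, and escalate if it does not.
function shutdownChild(child, { graceMs = 2000, killMs = 2000 } = {}) {
  return new Promise((resolve) => {
    if (child.exitCode !== null || child.signalCode !== null) {
      resolve({ code: child.exitCode, signal: child.signalCode, escalated: false });
      return;
    }
    let escalated = false;
    let killTimer;
    const termTimer = setTimeout(() => {
      escalated = true;
      child.kill('SIGTERM'); // polite request first
      killTimer = setTimeout(() => child.kill('SIGKILL'), killMs); // last resort
    }, graceMs);
    child.once('exit', (code, signal) => {
      clearTimeout(termTimer);
      clearTimeout(killTimer);
      resolve({ code, signal, escalated });
    });
    child.stdin.end(); // the child sees EOF on its stdin
  });
}

// Per-turn lifecycle: spawn, use, shut down, reap.
async function withMcpChild(serverPath, useChild) {
  const child = spawn(process.execPath, [serverPath], { stdio: ['pipe', 'pipe', 'inherit'] });
  try {
    return await useChild(child);
  } finally {
    await shutdownChild(child);
  }
}
```

Putting the shutdown in a `finally` block means the child is released even when the turn throws or times out, which is the case most likely to leak.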

Workaround we deployed (defensive, not a fix)

We added a hard max-lifetime + parent-watch in our MCP server to cap the leak:

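const MAX_LIFETIME_MS = parseInt(process.env.MCP_MAX_LIFETIME_MS || '600000', 10);
// Hard cap: exit after MAX_LIFETIME_MS even if the host never closes stdin.
setTimeout(() => {
  console.error(`[mcp] max lifetime ${MAX_LIFETIME_MS}ms reached, exiting`);
  process.exit(0);
}, MAX_LIFETIME_MS).unref();

// Parent watch: process.ppid is re-read on each access, so remember the
// original parent and treat a changed ppid (re-parenting) as "parent gone".
const ORIGINAL_PPID = process.ppid;
setInterval(() => {
  try {
    if (process.ppid !== ORIGINAL_PPID) throw new Error('reparented');
    process.kill(ORIGINAL_PPID, 0); // signal 0 only checks that the PID exists
  } catch { console.error('[mcp] parent gone, exiting'); process.exit(0); }
}, 30_000).unref();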

Both timers are unref()'d so they don't keep the event loop alive on their own. This bounds the leak per child but doesn't address the underlying lifecycle issue in OpenClaw, which has to be fixed upstream.

Environment

  • OpenClaw 2026.4.23 (a979721) installed via npm i -g openclaw
  • macOS (Apple Silicon Mac mini)
  • Node.js >=20
  • @modelcontextprotocol/sdk@^1.12.0
  • Gateway running as ai.openclaw.gateway LaunchAgent on port 18789

Why this matters

For an agent that runs continuously (e.g., embedded in always-on automation such as email, calendar, or cron-like flows), the leak is unbounded. At ~66 MB per turn and steady traffic of even tens of turns per day, a host can run out of RAM within a week or two without external intervention. The workaround above is acceptable as a belt-and-suspenders measure, but the right fix has to be on the host (OpenClaw) side: shut the child down as the MCP spec describes for stdio (close its stdin, then escalate to SIGTERM and SIGKILL if it does not exit) and reap it after each turn.
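As a rough check on that estimate, the arithmetic can be written out. The 66 MB per turn figure comes from the measurements above; the 16 GB of headroom and 20 turns per day are illustrative assumptions, not measurements from the affected host.

```javascript
// Days until leaked MCP children consume `headroomMb` of memory,
// assuming one leaked child of `mbPerTurn` MB per turn and no restarts.
function daysUntilExhausted(headroomMb, turnsPerDay, mbPerTurn = 66) {
  return headroomMb / (mbPerTurn * turnsPerDay);
}

// Hypothetical host: 16 GB of headroom, 20 turns per day.
console.log(daysUntilExhausted(16 * 1024, 20).toFixed(1)); // prints "12.4"
```

With the 10-minute max-lifetime workaround in place, at most the children spawned in the last 10 minutes are alive at once, so the footprint is bounded by traffic rate rather than by uptime.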
