
Gateway lazy-spawns duplicate stdio MCP children with identical ppid+config (memory + CPU leak) #75621

@zackchiutw

Description


Summary

On 2026.4.29, the OpenClaw gateway sometimes lazy-spawns two stdio MCP child processes for the same configured server, with identical ppid and identical --config argument. Both children stay alive indefinitely, doubling memory cost and CPU during the index-rebuild / handshake phase that runs at child startup. Restarting the gateway is currently the only way to clear the duplicates.

This appears related to but distinct from #75437: that issue is about per-message bundle-tools restaging; this is about persistent duplicate child processes from a single MCP config entry.

Evidence (steady state, no recent gateway restart)

$ ps -eo pid,ppid,etimes,%mem,rss,cmd --no-headers | grep 'graphiti.*main\.py'
 464302  460593    471  5.2 850940 .../graphiti/mcp_server/.venv/bin/python .../main.py --config .../config-openclaw.yaml
 473502  460593    131  5.2 850908 .../graphiti/mcp_server/.venv/bin/python .../main.py --config .../config-openclaw.yaml
   6260    1273  21470  5.2 849132 .../graphiti/mcp_server/.venv/bin/python .../main.py --config .../config-zackchiu.yaml
   6032    1273  21480  5.2 851336 .../graphiti/mcp_server/.venv/bin/python .../main.py --config .../config-zackchiu.yaml

PID 460593 is the OpenClaw gateway. It owns two graphiti-mcp children with the same config — etimes shows they were spawned ~340 s (≈5.7 min) apart (471 s vs 131 s), suggesting the second was spawned during a later agent session even though the first was still healthy.

(For context, PIDs 6032/6260 are an unrelated supervisor exhibiting the same pattern; reporting only the OpenClaw side here.)

After systemctl --user restart openclaw-gateway, both children disappear (clean shutdown). On the next inbound agent session that needs the MCP, exactly one new child is spawned — but eventually a second one appears again.

Symptoms when this happens

Concurrent gateway logs:

[diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu
  eventLoopDelayMaxMs=11316.2 eventLoopUtilization=1 cpuCoreRatio=1.033
[fetch-timeout] fetch timeout reached; aborting operation        (×N, repeated)
[telegram] sendChatAction failed: Network request for 'sendChatAction' failed!
typing TTL reached (2m); stopping typing indicator
[trace:embedded-run] prep stages: phase=stream-ready totalMs=102822 stages=
  workspace-sandbox:2217ms,
  core-plugin-tools:11841ms,
  bootstrap-context:5364ms,
  bundle-tools:29245ms,        ← duplicate stdio MCP startup running
  system-prompt:26264ms,
  session-resource-loader:9203ms,
  agent-session:7ms,
  stream-setup:18680ms

Two graphiti-mcp processes simultaneously running build_indices_and_constraints() against the same Neo4j instance (each fires ~30 index queries) saturate the gateway event loop, leading to the cascading fetch-timeout aborts above. End-user symptom on Telegram: messages take 100+ seconds and the typing indicator times out before the reply lands. Killing the younger duplicate with kill <youngest-pid> immediately drops gateway CPU and restores normal latency.

Recovery / mitigation we deployed

Until upstream fixes this, we run a per-5-min cron that:

  1. parses ps -eo pid,ppid,etimes,cmd --no-headers for graphiti.*main\.py,
  2. groups by (ppid, --config),
  3. for any group with count > 1, keeps the child with the largest etimes (the oldest) and SIGTERMs the rest,
  4. notifies via Telegram if duplicates are found or if [diagnostic] liveness warning lines exceed a threshold in the last 10 min.

This works as a band-aid but should not be necessary.
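For reference, the dedupe logic in steps 1–3 can be sketched as a pure function. This is TypeScript for readability (the gateway is a Node process); the helper names (`parsePsLine`, `duplicatesToKill`) are illustrative, not our exact cron script, which shells out to `ps` directly.

```typescript
// Sketch of the cron dedupe logic: given `ps -eo pid,ppid,etimes,cmd` lines,
// group children by (ppid, --config) and return the PIDs of all but the
// oldest child in each group.
interface Proc {
  pid: number;
  ppid: number;
  etimes: number; // seconds since spawn; larger = older
  config: string; // value of the --config argument
}

function parsePsLine(line: string): Proc | null {
  const m = line.trim().match(/^(\d+)\s+(\d+)\s+(\d+)\s+(.*)$/);
  if (!m) return null;
  const cfg = m[4].match(/--config\s+(\S+)/);
  if (!cfg) return null;
  return { pid: +m[1], ppid: +m[2], etimes: +m[3], config: cfg[1] };
}

function duplicatesToKill(procs: Proc[]): number[] {
  const groups = new Map<string, Proc[]>();
  for (const p of procs) {
    const key = `${p.ppid}|${p.config}`;
    const bucket = groups.get(key);
    if (bucket) bucket.push(p);
    else groups.set(key, [p]);
  }
  const victims: number[] = [];
  for (const group of groups.values()) {
    if (group.length < 2) continue;
    // sort oldest-first; keep the oldest, SIGTERM the rest
    const sorted = [...group].sort((a, b) => b.etimes - a.etimes);
    victims.push(...sorted.slice(1).map((p) => p.pid));
  }
  return victims;
}
```

Fed the evidence above, this keeps 464302 and 6032 and flags 473502 and 6260 for SIGTERM.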

Suggested investigation

The bundle-mcp / lazy-load path appears to start a new child without checking whether an existing healthy child for the same (plugin, config) tuple is already running. Possible causes:

  • Race in agent-session bootstrap (two concurrent inbound messages each spawn before the registry sees the other's child).
  • The earlier child never being registered in the per-gateway "live MCP children" map, because the previous gateway restart's child reaping completed asynchronously.
  • The MCP server registry keying on something that differs across attempts (PID, run-id) instead of (plugin id, config path).

A simple fix worth considering: hold a Map<configKey, Promise<Child>> and have all callers await the same in-flight spawn promise.
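In sketch form, the single-flight guard could look like the following. All names here (`Child`, `spawnChild`, `getOrSpawn`) are hypothetical — this is an illustration of the pattern, not a claim about the gateway's actual internals. The key property is that concurrent callers for the same (plugin, config) key await one shared in-flight promise, so a second spawn can never start while the first is pending or healthy:

```typescript
// Single-flight spawn guard: the map stores the *promise* of the child, not
// the child itself, so callers that race during spawn all join the same
// in-flight attempt instead of each spawning their own process.
interface Child {
  pid: number;
  exited: boolean;
}

const liveChildren = new Map<string, Promise<Child>>();

async function getOrSpawn(
  pluginId: string,
  configPath: string,
  spawnChild: () => Promise<Child>,
): Promise<Child> {
  const key = `${pluginId}\u0000${configPath}`;
  const existing = liveChildren.get(key);
  if (existing) {
    const child = await existing;
    if (!child.exited) return child; // healthy child already running
    liveChildren.delete(key); // stale entry: child died, respawn below
  }
  const inflight = spawnChild().catch((err) => {
    liveChildren.delete(key); // a failed spawn must not poison the map
    throw err;
  });
  liveChildren.set(key, inflight);
  return inflight;
}
```

An exit listener on the real child would additionally delete the map entry, so a crashed child is respawned on the next request rather than served as a stale entry.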

Environment

OpenClaw   2026.4.29 (a448042)
Node       v22.22.2
Platform   Linux 6.17.0 (Ubuntu)
MCP server graphiti-mcp (stdio) via openclaw bundle-mcp; LLM=Gemini, DB=Neo4j (3 isolated instances)

Happy to attach a fuller journalctl span or strace/perf capture if useful — just say the word.
