Summary
On 2026.4.29, the OpenClaw gateway sometimes lazy-spawns two stdio MCP child processes for the same configured server, with identical ppid and identical --config argument. Both children stay alive indefinitely, doubling memory cost and CPU during the index-rebuild / handshake phase that runs at child startup. Restarting the gateway is currently the only way to clear the duplicates.
This appears related to but distinct from #75437: that issue is about per-message bundle-tools restaging; this is about persistent duplicate child processes from a single MCP config entry.
Evidence (steady state, no recent gateway restart)
$ ps -eo pid,ppid,etimes,%mem,rss,cmd --no-headers | grep 'graphiti.*main\.py'
464302 460593 471 5.2 850940 .../graphiti/mcp_server/.venv/bin/python .../main.py --config .../config-openclaw.yaml
473502 460593 131 5.2 850908 .../graphiti/mcp_server/.venv/bin/python .../main.py --config .../config-openclaw.yaml
6260 1273 21470 5.2 849132 .../graphiti/mcp_server/.venv/bin/python .../main.py --config .../config-zackchiu.yaml
6032 1273 21480 5.2 851336 .../graphiti/mcp_server/.venv/bin/python .../main.py --config .../config-zackchiu.yaml
PID 460593 is the OpenClaw gateway. It owns two graphiti-mcp children with the same config — etimes shows they were spawned ~340 s apart (471 s vs 131 s), suggesting the second was spawned during a later agent session even though the first was still healthy.
(For context, PIDs 6032/6260 are an unrelated supervisor exhibiting the same pattern; reporting only the OpenClaw side here.)
After systemctl --user restart openclaw-gateway, both children disappear (clean shutdown). On the next inbound agent session that needs the MCP, exactly one new child is spawned — but eventually a second one appears again.
Symptoms when this happens
Concurrent gateway logs:
[diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu
eventLoopDelayMaxMs=11316.2 eventLoopUtilization=1 cpuCoreRatio=1.033
[fetch-timeout] fetch timeout reached; aborting operation (×N, repeated)
[telegram] sendChatAction failed: Network request for 'sendChatAction' failed!
typing TTL reached (2m); stopping typing indicator
[trace:embedded-run] prep stages: phase=stream-ready totalMs=102822 stages=
workspace-sandbox:2217ms,
core-plugin-tools:11841ms,
bootstrap-context:5364ms,
bundle-tools:29245ms, ← duplicate stdio MCP startup running
system-prompt:26264ms,
session-resource-loader:9203ms,
agent-session:7ms,
stream-setup:18680ms
Two graphiti-mcp processes simultaneously running build_indices_and_constraints() against the same Neo4j instance (each fires ~30 index queries) saturate the gateway event loop, leading to the cascading fetch-timeout aborts above. End-user symptom on Telegram: messages take 100+ seconds and the typing indicator times out before the reply lands. Killing the younger duplicate with kill <youngest-pid> immediately drops gateway CPU and restores normal latency.
Recovery / mitigation we deployed
Until upstream fixes this, we run a cron job every 5 minutes that (see the sketch after this list):
- parses ps -eo pid,ppid,etimes,cmd --no-headers output for lines matching graphiti.*main\.py,
- groups the matches by (ppid, --config),
- for any group with count > 1, keeps the child with the largest etimes (the oldest) and SIGTERMs the rest,
- notifies via Telegram if duplicates are found or if [diagnostic] liveness warning fires more than a threshold number of times in the last 10 min.
This works as a band-aid but should not be necessary.
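For reference, a minimal sketch of that reaper logic, assuming Node is available on the host. The script name and all helpers are ours, not OpenClaw's; the grouping key and regexes mirror the steps above, and the Telegram notification is elided:

// dedupe-mcp-children.ts: cron reaper sketch (hypothetical helper, not OpenClaw code).
import { execSync } from "node:child_process";

interface Proc { pid: number; ppid: number; etimes: number; cmd: string }

// Snapshot all graphiti MCP children currently running.
const procs: Proc[] = execSync("ps -eo pid,ppid,etimes,cmd --no-headers", { encoding: "utf8" })
  .split("\n")
  .filter((line) => /graphiti.*main\.py/.test(line))
  .map((line) => {
    const [pid, ppid, etimes, ...cmd] = line.trim().split(/\s+/);
    return { pid: Number(pid), ppid: Number(ppid), etimes: Number(etimes), cmd: cmd.join(" ") };
  });

// Group by (ppid, --config value): one group per configured MCP server per gateway.
const groups = new Map<string, Proc[]>();
for (const p of procs) {
  const config = p.cmd.match(/--config\s+(\S+)/)?.[1] ?? "";
  const key = `${p.ppid}:${config}`;
  groups.set(key, [...(groups.get(key) ?? []), p]);
}

// Keep the oldest child (largest etimes) in each group; SIGTERM the younger duplicates.
for (const [key, members] of groups) {
  if (members.length < 2) continue;
  members.sort((a, b) => b.etimes - a.etimes);
  for (const dup of members.slice(1)) {
    console.log(`duplicate for ${key}: SIGTERM ${dup.pid}`);
    process.kill(dup.pid, "SIGTERM");
  }
}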
Suggested investigation
The bundle-mcp / lazy-load path appears to start a new child without checking whether an existing healthy child for the same (plugin, config) tuple is already running. Possible causes:
- Race in agent-session bootstrap (two concurrent inbound messages each spawn before the registry sees the other's child).
- The earlier child never being registered in the per-gateway "live MCP children" map, because child reaping from the previous gateway restart completed asynchronously.
- The MCP server registry keying on something that differs across attempts (PID, run-id) instead of (plugin id, config path).
A simple fix worth considering: hold a Map<configKey, Promise<Child>> and have all callers await the same in-flight spawn promise.
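A minimal sketch of that shape, with hypothetical names (ensureChild, spawnMcpChild, McpChild) standing in for whatever the bundle-mcp path actually uses; only the dedup pattern matters:

// Sketch: dedupe concurrent lazy spawns by keying on the config, not the attempt.
interface McpChild { pid: number }

// Stand-in for the real spawn + handshake routine.
declare function spawnMcpChild(configPath: string): Promise<McpChild>;

const children = new Map<string, Promise<McpChild>>();

function ensureChild(pluginId: string, configPath: string): Promise<McpChild> {
  const key = `${pluginId}:${configPath}`;
  let pending = children.get(key);
  if (!pending) {
    // First caller spawns; concurrent callers await the same in-flight promise,
    // so two inbound messages arriving together can never double-spawn.
    pending = spawnMcpChild(configPath).catch((err) => {
      children.delete(key); // a failed spawn should not poison the key forever
      throw err;
    });
    children.set(key, pending);
  }
  return pending;
}

A child-exit handler would also need to delete the map entry so the next session can respawn; the essential part is that the registry is keyed on (plugin id, config path) and caches the promise itself, making check-and-spawn atomic within the event loop.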
Environment
OpenClaw 2026.4.29 (a448042)
Node v22.22.2
Platform Linux 6.17.0 (Ubuntu)
MCP server graphiti-mcp (stdio) via openclaw bundle-mcp; LLM=Gemini, DB=Neo4j (3 isolated instances)
Happy to attach a fuller journalctl span or strace/perf capture if useful — just say the word.