
Gateway lazy-spawns duplicate stdio MCP children with identical ppid+config (memory + CPU leak) #75621

@zackchiutw

Description


Summary

On 2026.4.29, the OpenClaw gateway sometimes lazy-spawns two stdio MCP child processes for the same configured server, with identical ppid and identical --config argument. Both children stay alive indefinitely, doubling memory cost and CPU during the index-rebuild / handshake phase that runs at child startup. Restarting the gateway is currently the only way to clear the duplicates.

This appears related to but distinct from #75437: that issue is about per-message bundle-tools restaging; this is about persistent duplicate child processes from a single MCP config entry.

Evidence (steady state, no recent gateway restart)

$ ps -eo pid,ppid,etimes,%mem,rss,cmd --no-headers | grep 'graphiti.*main\.py'
 464302  460593    471  5.2 850940 .../graphiti/mcp_server/.venv/bin/python .../main.py --config .../config-openclaw.yaml
 473502  460593    131  5.2 850908 .../graphiti/mcp_server/.venv/bin/python .../main.py --config .../config-openclaw.yaml
   6260    1273  21470  5.2 849132 .../graphiti/mcp_server/.venv/bin/python .../main.py --config .../config-zackchiu.yaml
   6032    1273  21480  5.2 851336 .../graphiti/mcp_server/.venv/bin/python .../main.py --config .../config-zackchiu.yaml

PID 460593 is the OpenClaw gateway. It owns two graphiti-mcp children with the same config — etimes shows they were spawned ~340 s (≈5.7 min) apart (471 s vs 131 s), suggesting the second was spawned during a later agent session even though the first was still healthy.

(For context, PIDs 6032/6260 are an unrelated supervisor exhibiting the same pattern; reporting only the OpenClaw side here.)

After systemctl --user restart openclaw-gateway, both children disappear (clean shutdown). On the next inbound agent session that needs the MCP, exactly one new child is spawned — but eventually a second one appears again.

Symptoms when this happens

Concurrent gateway logs:

[diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu
  eventLoopDelayMaxMs=11316.2 eventLoopUtilization=1 cpuCoreRatio=1.033
[fetch-timeout] fetch timeout reached; aborting operation        (×N, repeated)
[telegram] sendChatAction failed: Network request for 'sendChatAction' failed!
typing TTL reached (2m); stopping typing indicator
[trace:embedded-run] prep stages: phase=stream-ready totalMs=102822 stages=
  workspace-sandbox:2217ms,
  core-plugin-tools:11841ms,
  bootstrap-context:5364ms,
  bundle-tools:29245ms,        ← duplicate stdio MCP startup running
  system-prompt:26264ms,
  session-resource-loader:9203ms,
  agent-session:7ms,
  stream-setup:18680ms

Two graphiti-mcp processes simultaneously running build_indices_and_constraints() against the same Neo4j instance (each fires ~30 index queries) saturate the gateway event loop, leading to the cascading fetch-timeout aborts above. End-user symptom on Telegram: messages take 100+ seconds and the typing indicator times out before the reply lands. Killing the younger duplicate with kill <youngest-pid> immediately drops gateway CPU and restores normal latency.

Recovery / mitigation we deployed

Until upstream fixes this, we run a per-5-min cron that:

  1. parses ps -eo pid,ppid,etimes,cmd --no-headers for graphiti.*main\.py,
  2. groups by (ppid, --config),
  3. for any group with count > 1, keeps the child with the largest etimes (the oldest) and SIGTERMs the rest,
  4. notifies via Telegram if duplicates are found or if [diagnostic] liveness warning lines exceed a threshold in the last 10 min.

This works as a band-aid but should not be necessary.
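For reference, the dedupe logic in steps 1–3 can be sketched as a pure function. This is TypeScript for readability (the gateway is a Node process); the helper names (`parsePsLine`, `duplicatesToKill`) are illustrative, not our exact cron script, which shells out to `ps` directly.

```typescript
// Sketch of the cron dedupe logic: given `ps -eo pid,ppid,etimes,cmd` lines,
// group children by (ppid, --config) and return the PIDs of all but the
// oldest child in each group.
interface Proc {
  pid: number;
  ppid: number;
  etimes: number; // seconds since spawn; larger = older
  config: string; // value of the --config argument
}

function parsePsLine(line: string): Proc | null {
  const m = line.trim().match(/^(\d+)\s+(\d+)\s+(\d+)\s+(.*)$/);
  if (!m) return null;
  const cfg = m[4].match(/--config\s+(\S+)/);
  if (!cfg) return null;
  return { pid: +m[1], ppid: +m[2], etimes: +m[3], config: cfg[1] };
}

function duplicatesToKill(procs: Proc[]): number[] {
  const groups = new Map<string, Proc[]>();
  for (const p of procs) {
    const key = `${p.ppid}|${p.config}`;
    const bucket = groups.get(key);
    if (bucket) bucket.push(p);
    else groups.set(key, [p]);
  }
  const victims: number[] = [];
  for (const group of groups.values()) {
    if (group.length < 2) continue;
    // sort oldest-first; keep the oldest, SIGTERM the rest
    const sorted = [...group].sort((a, b) => b.etimes - a.etimes);
    victims.push(...sorted.slice(1).map((p) => p.pid));
  }
  return victims;
}
```

Fed the evidence above, this keeps 464302 and 6032 and flags 473502 and 6260 for SIGTERM.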

Suggested investigation

The bundle-mcp / lazy-load path appears to start a new child without checking whether an existing healthy child for the same (plugin, config) tuple is already running. Possible causes:

  • Race in agent-session bootstrap (two concurrent inbound messages each spawn before the registry sees the other's child).
  • The earlier child never being registered in the per-gateway "live MCP children" map, because the previous gateway restart's child reaping completed asynchronously.
  • The MCP server registry keying on something that differs across attempts (PID, run-id) instead of (plugin id, config path).

A simple fix worth considering: hold a Map<configKey, Promise<Child>> and have all callers await the same in-flight spawn promise.
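In sketch form, the single-flight guard could look like the following. All names here (`Child`, `spawnChild`, `getOrSpawn`) are hypothetical — this is an illustration of the pattern, not a claim about the gateway's actual internals. The key property is that concurrent callers for the same (plugin, config) key await one shared in-flight promise, so a second spawn can never start while the first is pending or healthy:

```typescript
// Single-flight spawn guard: the map stores the *promise* of the child, not
// the child itself, so callers that race during spawn all join the same
// in-flight attempt instead of each spawning their own process.
interface Child {
  pid: number;
  exited: boolean;
}

const liveChildren = new Map<string, Promise<Child>>();

async function getOrSpawn(
  pluginId: string,
  configPath: string,
  spawnChild: () => Promise<Child>,
): Promise<Child> {
  const key = `${pluginId}\u0000${configPath}`;
  const existing = liveChildren.get(key);
  if (existing) {
    const child = await existing;
    if (!child.exited) return child; // healthy child already running
    liveChildren.delete(key); // stale entry: child died, respawn below
  }
  const inflight = spawnChild().catch((err) => {
    liveChildren.delete(key); // a failed spawn must not poison the map
    throw err;
  });
  liveChildren.set(key, inflight);
  return inflight;
}
```

An exit listener on the real child would additionally delete the map entry, so a crashed child is respawned on the next request rather than served as a stale entry.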

Environment

OpenClaw   2026.4.29 (a448042)
Node       v22.22.2
Platform   Linux 6.17.0 (Ubuntu)
MCP server graphiti-mcp (stdio) via openclaw bundle-mcp; LLM=Gemini, DB=Neo4j (3 isolated instances)

Happy to attach a fuller journalctl span or strace/perf capture if useful — just say the word.
