[Bug] v2026.4.29 — Three compounding regressions on Linux: startup event-loop block, embedded-agent RSS spike → ERR_MODULE_NOT_FOUND crash, orphaned child processes after restart

## Environment
- **OpenClaw:** `2026.4.29` (regression from `2026.2.26`)
- **Node:** `v24.15.0`
- **Platform:** Linux x86_64 (Debian 13), systemd user service via `openclaw gateway install`
- **Provider:** OpenRouter (`openrouter/moonshotai/kimi-k2.6` as primary)
- **Channels:** Telegram, Slack, webchat

## Summary

Three distinct bugs were observed after upgrading from `2026.2.26` to `2026.4.29`. Together they make the gateway unusable on Linux: startup blocks the event loop for 28 seconds before serving any request, each embedded agent run causes a ~1GB RSS spike that eventually OOMs and crashes with `ERR_MODULE_NOT_FOUND`, and child processes orphaned by restarts accumulate indefinitely spinning at 100%+ CPU. Downgrading to `2026.2.26` resolved all symptoms once the orphaned processes were manually killed.

---

## Bug 1 — 28-second event loop block at gateway startup (no requests yet)

Immediately after startup, before any user request is processed, the event loop blocks for **28 seconds**:

```
diagnostic.liveness.warning  t+0s
  eventLoopDelayMaxMs: 28119
  eventLoopUtilization: 0.64
  cpuCoreRatio: 0.20
  active: 0  queued: 0
```

This correlates exactly with the gateway log showing plugin-runtime-deps being installed synchronously at boot:

```
[gateway] [plugins] staging bundled runtime deps before gateway startup (40 specs)
[gateway] [plugins] installed bundled runtime deps in 16063ms
[plugins] acpx staging bundled runtime deps (47 specs)
[plugins] acpx installed bundled runtime deps in 21351ms
```

That is ~37 seconds of npm install running on first boot. On Linux VMs or slower storage, this appears to block the event loop. `2026.2.26` ships all deps inside the npm package and does not have this startup install step.

**Recurring pattern:** Even after startup, 11–14 second event loop blocks recur roughly every 2–3 minutes throughout normal operation, always when `active >= 1`. The gateway becomes unresponsive for those windows.

Full liveness warning timeline from the stability log:

```
t+0s:    eventLoopDelayMax=28119ms  utilization=0.64  cpuRatio=0.20  active=0  queued=0
t+164s:  eventLoopDelayMax=1925ms   utilization=0.13  cpuRatio=0.22  active=0  queued=0
t+285s:  eventLoopDelayMax=13749ms  utilization=0.45  cpuRatio=0.46  active=1  queued=1
t+448s:  eventLoopDelayMax=12625ms  utilization=0.43  cpuRatio=0.44  active=1  queued=1
t+630s:  eventLoopDelayMax=1068ms   utilization=0.07  cpuRatio=0.10  active=0  queued=0
t+844s:  eventLoopDelayMax=11467ms  utilization=0.34  cpuRatio=0.35  active=1  queued=1
t+995s:  eventLoopDelayMax=13522ms  utilization=1.00  cpuRatio=1.02  active=1  queued=1
t+1119s: eventLoopDelayMax=31256ms  utilization=0.93  cpuRatio=1.19  active=1  queued=1  <- crash
```

---

## Bug 2 — Embedded agent run causes ~926MB RSS spike → memory pressure → `ERR_MODULE_NOT_FOUND` crash

A single webchat message triggers a near-1GB RSS jump in the main gateway process:

```
t+1085s  RSS: 748 MB   heap: 407 MB   (idle, no active runs)
t+1119s  RSS: 1674 MB  heap: 1196 MB  <- single webchat message queued at t+1086s
         *** diagnostic.memory.pressure: rss_threshold (1.5GB) ***
         eventLoopDelayMax: 31256ms  utilization: 0.93  cpuRatio: 1.19
```

Immediately after the memory pressure event, the process crashes with an unhandled rejection:

```json
{
  "reason": "unhandled_rejection",
  "error": {
    "name": "Error",
    "code": "ERR_MODULE_NOT_FOUND"
  }
}
```

The ~926MB spike appears to be the embedded agent harness loading module trees into the main process that are not released after dispatch. This did not occur in `2026.2.26`, which did not use the runtime-dep installation pattern introduced around `2026.4.24`.

The `ERR_MODULE_NOT_FOUND` is likely a secondary effect: a dynamic `import()` in the embedded harness fails under memory pressure, becomes an unhandled rejection, and kills the process.

A single request agent startup stage breakdown (from gateway log):

```
startup stages: totalMs=15630
  workspace:0ms  runtime-plugins:2ms  hooks:0ms
  model-resolution:4217ms  auth:5621ms
  context-engine:0ms  attempt-dispatch:5790ms
```

Model resolution and auth each block for 4–5 seconds per run even when the process is not yet under memory pressure.

---

## Bug 3 — Child processes orphaned on service restart; `openclaw gateway install` writes `KillMode=process`

The systemd unit written by `openclaw gateway install` includes:

```ini
KillMode=process
```

This means `systemctl restart` (or `stop`) only sends SIGTERM to the main gateway PID. Any subprocesses spawned — including the embedded agent harness subprocess — are **left running**.

After a restart triggered by Bug 2's crash, the orphaned subprocess continued running at **102% CPU / 12.9GB virtual memory for 9+ minutes** with no owner. It survived a version downgrade to `2026.2.26` and caused "still broken after rollback" symptoms that appeared to be a config problem. Only a manual `kill -9` resolved it.

**Suggested fix:** Change the generated unit to `KillMode=mixed`, which SIGTERMs the main process and SIGKILLs any remaining children in the cgroup after `TimeoutStopSec`. One-line change in the `gateway install` template:

```ini
# Before
KillMode=process

# After
KillMode=mixed
```

---

## Reproduction notes

- Bugs 1 and 3 appear reliably reproducible on any Linux systemd install.
- Bug 2 was reproduced on the first webchat message after ~18 minutes of uptime using `openrouter/moonshotai/kimi-k2.6` as primary model. May be general to any embedded agent run.
- The stability log (`~/.openclaw/logs/stability/`) captures the full event sequence — happy to share the sanitized JSON if useful.

## Workaround (until fixed)

1. Pin to `2026.2.26`: `npm install -g openclaw@2026.2.26`
2. Kill orphaned children after downgrade: `ps aux | grep openclaw` then `kill -9` anything not owned by the current service PID
3. Manually patch the systemd unit: change `KillMode=process` to `KillMode=mixed` and run `systemctl --user daemon-reload`


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Bug] v2026.4.29 — Three compounding regressions on Linux: startup event-loop block, embedded-agent RSS spike → ERR_MODULE_NOT_FOUND crash, orphaned child processes after restart #75747

Environment

Summary

Bug 1 — 28-second event loop block at gateway startup (no requests yet)

Bug 2 — Embedded agent run causes ~926MB RSS spike → memory pressure → `ERR_MODULE_NOT_FOUND` crash

Bug 3 — Child processes orphaned on service restart; `openclaw gateway install` writes `KillMode=process`

Reproduction notes

Workaround (until fixed)

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

[Bug] v2026.4.29 — Three compounding regressions on Linux: startup event-loop block, embedded-agent RSS spike → ERR_MODULE_NOT_FOUND crash, orphaned child processes after restart #75747

Description

Environment

Summary

Bug 1 — 28-second event loop block at gateway startup (no requests yet)

Bug 2 — Embedded agent run causes ~926MB RSS spike → memory pressure → ERR_MODULE_NOT_FOUND crash

Bug 3 — Child processes orphaned on service restart; openclaw gateway install writes KillMode=process

Reproduction notes

Workaround (until fixed)

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Bug 2 — Embedded agent run causes ~926MB RSS spike → memory pressure → `ERR_MODULE_NOT_FOUND` crash

Bug 3 — Child processes orphaned on service restart; `openclaw gateway install` writes `KillMode=process`