Environment
- OpenClaw: 2026.4.29 (regression from 2026.2.26)
- Node: v24.15.0
- Platform: Linux x86_64 (Debian 13), systemd user service via openclaw gateway install
- Provider: OpenRouter (openrouter/moonshotai/kimi-k2.6 as primary)
- Channels: Telegram, Slack, webchat
Summary
Three distinct bugs were observed after upgrading from 2026.2.26 to 2026.4.29. Together they make the gateway unusable on Linux: startup blocks the event loop for 28 seconds before any request is served; an embedded agent run causes a ~1GB RSS spike that trips the memory-pressure threshold, after which the process crashes with ERR_MODULE_NOT_FOUND; and child processes orphaned by restarts accumulate indefinitely, spinning at 100%+ CPU. Downgrading to 2026.2.26 resolved all symptoms once the orphaned processes were manually killed.
Bug 1 — 28-second event loop block at gateway startup (no requests yet)
Immediately after startup, before any user request is processed, the event loop blocks for 28 seconds:
diagnostic.liveness.warning t+0s
eventLoopDelayMaxMs: 28119
eventLoopUtilization: 0.64
cpuCoreRatio: 0.20
active: 0 queued: 0
This coincides with the gateway log showing plugin-runtime-deps being installed synchronously at boot:
[gateway] [plugins] staging bundled runtime deps before gateway startup (40 specs)
[gateway] [plugins] installed bundled runtime deps in 16063ms
[plugins] acpx staging bundled runtime deps (47 specs)
[plugins] acpx installed bundled runtime deps in 21351ms
That is ~37 seconds (16063 ms + 21351 ms) of npm install running on first boot. On Linux VMs or slower storage, this appears to block the event loop. 2026.2.26 ships all deps inside the npm package and does not have this startup install step.
Recurring pattern: Even after startup, 11–14 second event loop blocks recur roughly every 2–3 minutes throughout normal operation, always when active >= 1. The gateway becomes unresponsive for those windows.
Full liveness warning timeline from the stability log:
t+0s: eventLoopDelayMax=28119ms utilization=0.64 cpuRatio=0.20 active=0 queued=0
t+164s: eventLoopDelayMax=1925ms utilization=0.13 cpuRatio=0.22 active=0 queued=0
t+285s: eventLoopDelayMax=13749ms utilization=0.45 cpuRatio=0.46 active=1 queued=1
t+448s: eventLoopDelayMax=12625ms utilization=0.43 cpuRatio=0.44 active=1 queued=1
t+630s: eventLoopDelayMax=1068ms utilization=0.07 cpuRatio=0.10 active=0 queued=0
t+844s: eventLoopDelayMax=11467ms utilization=0.34 cpuRatio=0.35 active=1 queued=1
t+995s: eventLoopDelayMax=13522ms utilization=1.00 cpuRatio=1.02 active=1 queued=1
t+1119s: eventLoopDelayMax=31256ms utilization=0.93 cpuRatio=1.19 active=1 queued=1 <- crash
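For anyone who wants to check the "always when active >= 1" observation, here is a minimal Python sketch that parses the timeline above. The 10-second cutoff (`STALL_THRESHOLD_MS`) is my own assumption for what counts as a long block, not an OpenClaw setting, and the line format is simply what I pasted above rather than the raw stability JSON.

```python
import re

# Liveness warnings copied verbatim from the timeline above.
TIMELINE = """\
t+0s: eventLoopDelayMax=28119ms utilization=0.64 cpuRatio=0.20 active=0 queued=0
t+164s: eventLoopDelayMax=1925ms utilization=0.13 cpuRatio=0.22 active=0 queued=0
t+285s: eventLoopDelayMax=13749ms utilization=0.45 cpuRatio=0.46 active=1 queued=1
t+448s: eventLoopDelayMax=12625ms utilization=0.43 cpuRatio=0.44 active=1 queued=1
t+630s: eventLoopDelayMax=1068ms utilization=0.07 cpuRatio=0.10 active=0 queued=0
t+844s: eventLoopDelayMax=11467ms utilization=0.34 cpuRatio=0.35 active=1 queued=1
t+995s: eventLoopDelayMax=13522ms utilization=1.00 cpuRatio=1.02 active=1 queued=1
t+1119s: eventLoopDelayMax=31256ms utilization=0.93 cpuRatio=1.19 active=1 queued=1
"""

LINE_RE = re.compile(
    r"t\+(\d+)s: eventLoopDelayMax=(\d+)ms .* active=(\d+) queued=(\d+)"
)

# Assumed cutoff for a "long" block; not an OpenClaw threshold.
STALL_THRESHOLD_MS = 10_000


def parse_timeline(text):
    """Turn each timeline line into a dict of integers."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            t, delay, active, queued = map(int, m.groups())
            rows.append(
                {"t": t, "delay_ms": delay, "active": active, "queued": queued}
            )
    return rows


rows = parse_timeline(TIMELINE)
stalls = [r for r in rows if r["delay_ms"] >= STALL_THRESHOLD_MS]
# Everything after the boot-time block at t+0s.
post_startup_stalls = [r for r in stalls if r["t"] > 0]

for r in stalls:
    print(f"t+{r['t']}s  {r['delay_ms'] / 1000:.1f}s block  active={r['active']}")
```

With this cutoff, every long block after startup has active=1, and both short samples (t+164s and t+630s) have active=0.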
Bug 2 — Embedded agent run causes ~926MB RSS spike → memory pressure → ERR_MODULE_NOT_FOUND crash
A single webchat message triggers a near-1GB RSS jump in the main gateway process:
t+1085s RSS: 748 MB heap: 407 MB (idle, no active runs)
t+1119s RSS: 1674 MB heap: 1196 MB <- single webchat message queued at t+1086s
*** diagnostic.memory.pressure: rss_threshold (1.5GB) ***
eventLoopDelayMax: 31256ms utilization: 0.93 cpuRatio: 1.19
Immediately after the memory pressure event, the process crashes with an unhandled rejection:
{
"reason": "unhandled_rejection",
"error": {
"name": "Error",
"code": "ERR_MODULE_NOT_FOUND"
}
}
The ~926MB spike appears to be the embedded agent harness loading module trees into the main process that are not released after dispatch. This did not occur in 2026.2.26, which did not use the runtime-dep installation pattern introduced around 2026.4.24.
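The arithmetic behind the ~926MB figure, as a small Python sketch using only the two snapshots above. I am assuming the "heap" value in the stability log is the V8 heap of the main process; if that assumption holds, most of the RSS growth is JS heap rather than native memory, which would be consistent with (but does not prove) the module-loading theory.

```python
# Figures copied from the two memory snapshots above, in MB.
idle = {"t": 1085, "rss_mb": 748, "heap_mb": 407}
peak = {"t": 1119, "rss_mb": 1674, "heap_mb": 1196}

rss_delta = peak["rss_mb"] - idle["rss_mb"]      # 926 MB
heap_delta = peak["heap_mb"] - idle["heap_mb"]   # 789 MB
elapsed_s = peak["t"] - idle["t"]                # 34 s

# Share of the RSS growth that shows up as heap growth
# (assumes "heap" means the V8 heap of the gateway process).
heap_share = heap_delta / rss_delta

print(f"RSS grew {rss_delta} MB in {elapsed_s} s")
print(f"heap grew {heap_delta} MB ({heap_share:.0%} of the RSS growth)")
```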
The ERR_MODULE_NOT_FOUND is likely a secondary effect: a dynamic import() in the embedded harness fails under memory pressure, becomes an unhandled rejection, and kills the process.
Startup stage breakdown for a single agent request (from the gateway log):
startup stages: totalMs=15630
workspace:0ms runtime-plugins:2ms hooks:0ms
model-resolution:4217ms auth:5621ms
context-engine:0ms attempt-dispatch:5790ms
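A quick sanity check of the breakdown above, as a Python sketch: it parses the stage timings as pasted, confirms they add up to totalMs, and ranks them by share. The regex assumes the `name:NNNms` layout shown here and nothing about the real log schema.

```python
import re

# Stage breakdown copied verbatim from the gateway log excerpt above.
STAGES_LOG = """\
startup stages: totalMs=15630
workspace:0ms runtime-plugins:2ms hooks:0ms
model-resolution:4217ms auth:5621ms
context-engine:0ms attempt-dispatch:5790ms
"""

total_ms = int(re.search(r"totalMs=(\d+)", STAGES_LOG).group(1))

# Matches "name:123ms" pairs; "startup stages:" is skipped because
# no digits follow the colon directly.
stages = {
    name: int(ms) for name, ms in re.findall(r"([a-z-]+):(\d+)ms", STAGES_LOG)
}

# Largest stage first.
ranked = sorted(stages.items(), key=lambda item: item[1], reverse=True)

for name, ms in ranked:
    print(f"{name:18s} {ms:6d} ms  {ms / total_ms:6.1%}")
print(f"sum of stages: {sum(stages.values())} ms (totalMs={total_ms})")
```

The stages sum exactly to totalMs, so model-resolution, auth, and attempt-dispatch account for essentially all of the 15.6 seconds.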
Model resolution and auth take about 4.2 and 5.6 seconds respectively per run, even when the process is not yet under memory pressure.
Bug 3 — Child processes orphaned on service restart; openclaw gateway install writes KillMode=process
The systemd unit written by openclaw gateway install includes:
KillMode=process
This means systemctl restart (or stop) only sends SIGTERM to the main gateway PID. Any subprocesses spawned, including the embedded agent harness subprocess, are left running.
After a restart triggered by Bug 2's crash, the orphaned subprocess continued running at 102% CPU / 12.9GB virtual memory for 9+ minutes with no owner. It survived a version downgrade to 2026.2.26 and caused "still broken after rollback" symptoms that appeared to be a config problem. Only a manual kill -9 resolved it.
Suggested fix: Change the generated unit to KillMode=mixed, which SIGTERMs the main process and SIGKILLs any remaining children in the cgroup after TimeoutStopSec. One-line change in the gateway install template:
# Before
KillMode=process
# After
KillMode=mixed
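Until the template is changed, an already-installed unit can be patched with the same one-line edit. Below is a minimal Python sketch of that edit as a pure text transform. The sample unit is hypothetical (I have not pasted the real generated file, and the ExecStart path is a placeholder), and `set_kill_mode` is my own helper name, not part of OpenClaw.

```python
def set_kill_mode(unit_text, mode="mixed"):
    """Return unit_text with KillMode=<mode> set in the [Service] section.

    Replaces an existing KillMode= line, or appends one at the end of
    [Service] if none is present. Other lines are left untouched.
    """
    out, section, done = [], None, False
    for line in unit_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Leaving [Service] without having seen KillMode=: add it now.
            if section == "Service" and not done:
                out.append(f"KillMode={mode}")
                done = True
            section = stripped[1:-1]
        elif section == "Service" and stripped.startswith("KillMode="):
            line = f"KillMode={mode}"
            done = True
        out.append(line)
    if section == "Service" and not done:
        out.append(f"KillMode={mode}")
    return "\n".join(out) + "\n"


# Hypothetical unit for illustration only; the real file will differ.
SAMPLE_UNIT = """\
[Unit]
Description=Sample gateway unit (hypothetical)

[Service]
ExecStart=/path/to/gateway
KillMode=process

[Install]
WantedBy=default.target
"""

patched = set_kill_mode(SAMPLE_UNIT)
print(patched)
```

After writing the patched text back to the unit file, systemctl --user daemon-reload is still needed for systemd to pick it up.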
Reproduction notes
- Bugs 1 and 3 appear reliably reproducible on any Linux systemd install.
- Bug 2 was reproduced on the first webchat message after ~18 minutes of uptime, using openrouter/moonshotai/kimi-k2.6 as the primary model. It may apply to any embedded agent run.
- The stability log (~/.openclaw/logs/stability/) captures the full event sequence; happy to share the sanitized JSON if useful.
Workaround (until fixed)
- Pin to 2026.2.26: npm install -g openclaw@2026.2.26
- Kill orphaned children after downgrade: ps aux | grep openclaw, then kill -9 anything not owned by the current service PID
- Manually patch the systemd unit: change KillMode=process to KillMode=mixed and run systemctl --user daemon-reload
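To make "anything not owned by the current service PID" less error-prone than eyeballing ps output, here is a rough Python sketch that lists openclaw processes that are neither the service's main PID nor one of its descendants. The process table and command names in the sample are made up for illustration. In practice the input could come from `ps -eo pid=,ppid=,args=` and the main PID from `systemctl --user show -p MainPID --value <unit>`, where `<unit>` is whatever name the installer used (I have not included it here).

```python
def parse_ps(output):
    """Parse `ps -eo pid=,ppid=,args=` output into (pid, ppid, command) tuples."""
    rows = []
    for line in output.splitlines():
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
            rows.append((int(parts[0]), int(parts[1]), parts[2]))
    return rows


def find_orphans(processes, main_pid, needle="openclaw"):
    """Return PIDs whose command contains `needle` but that are not
    main_pid or one of its descendants."""
    processes = list(processes)
    children = {}
    for pid, ppid, _ in processes:
        children.setdefault(ppid, []).append(pid)

    # Walk the tree below main_pid to collect everything the service owns.
    owned, stack = set(), [main_pid]
    while stack:
        pid = stack.pop()
        if pid in owned:
            continue
        owned.add(pid)
        stack.extend(children.get(pid, []))

    return sorted(
        pid for pid, _, cmd in processes if needle in cmd and pid not in owned
    )


# Hypothetical process table; PIDs and command lines are invented.
SAMPLE_PS = """\
    1     0 /sbin/init
  900     1 /lib/systemd/systemd --user
 3100   900 openclaw harness (left over from an earlier restart)
 4200   900 openclaw gateway (current service main PID)
 4210  4200 openclaw harness (child of the current service)
 5000   900 bash
"""

orphans = find_orphans(parse_ps(SAMPLE_PS), main_pid=4200)
print(orphans)
```

The sketch only reports PIDs. I would still check each one by hand before sending kill -9.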