[Bug] v2026.4.29 — Three compounding regressions on Linux: startup event-loop block, embedded-agent RSS spike → ERR_MODULE_NOT_FOUND crash, orphaned child processes after restart #75747

@mdxwired

Environment

  • OpenClaw: 2026.4.29 (regression from 2026.2.26)
  • Node: v24.15.0
  • Platform: Linux x86_64 (Debian 13), systemd user service via openclaw gateway install
  • Provider: OpenRouter (openrouter/moonshotai/kimi-k2.6 as primary)
  • Channels: Telegram, Slack, webchat

Summary

Three distinct bugs appeared after upgrading from 2026.2.26 to 2026.4.29. Together they make the gateway unusable on Linux: startup blocks the event loop for 28 seconds before any request is served; an embedded agent run causes a ~1GB RSS spike that trips the memory-pressure threshold and ends in an ERR_MODULE_NOT_FOUND crash; and child processes orphaned by restarts keep spinning at 100%+ CPU until killed by hand. Downgrading to 2026.2.26 resolved all symptoms once the orphaned processes were manually killed.


Bug 1 — 28-second event loop block at gateway startup (no requests yet)

Immediately after startup, before any user request is processed, the event loop blocks for 28 seconds:

diagnostic.liveness.warning  t+0s
  eventLoopDelayMaxMs: 28119
  eventLoopUtilization: 0.64
  cpuCoreRatio: 0.20
  active: 0  queued: 0

This coincides with the gateway log showing plugin-runtime-deps being installed synchronously at boot:

[gateway] [plugins] staging bundled runtime deps before gateway startup (40 specs)
[gateway] [plugins] installed bundled runtime deps in 16063ms
[plugins] acpx staging bundled runtime deps (47 specs)
[plugins] acpx installed bundled runtime deps in 21351ms

Together that is ~37 seconds (16,063 ms + 21,351 ms) of npm install running on first boot. On Linux VMs or slower storage, this appears to block the event loop. 2026.2.26 ships all deps inside the npm package and has no startup install step.

Recurring pattern: Even after startup, 11–14 second event loop blocks recur every few minutes during normal operation (roughly 2.5 to 6.5 minutes apart in the timeline below), and only while a run is active (active >= 1). The gateway is unresponsive for those windows.

Full liveness warning timeline from the stability log:

t+0s:    eventLoopDelayMax=28119ms  utilization=0.64  cpuRatio=0.20  active=0  queued=0
t+164s:  eventLoopDelayMax=1925ms   utilization=0.13  cpuRatio=0.22  active=0  queued=0
t+285s:  eventLoopDelayMax=13749ms  utilization=0.45  cpuRatio=0.46  active=1  queued=1
t+448s:  eventLoopDelayMax=12625ms  utilization=0.43  cpuRatio=0.44  active=1  queued=1
t+630s:  eventLoopDelayMax=1068ms   utilization=0.07  cpuRatio=0.10  active=0  queued=0
t+844s:  eventLoopDelayMax=11467ms  utilization=0.34  cpuRatio=0.35  active=1  queued=1
t+995s:  eventLoopDelayMax=13522ms  utilization=1.00  cpuRatio=1.02  active=1  queued=1
t+1119s: eventLoopDelayMax=31256ms  utilization=0.93  cpuRatio=1.19  active=1  queued=1  <- crash
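
The link between long blocks and active runs can be checked mechanically. Below is a minimal Python sketch, assuming the liveness samples are available as plain-text lines in the format shown above. The helper names and the 10-second threshold are illustrative choices, not part of OpenClaw.

```python
import re

# Liveness samples copied from the timeline above.
TIMELINE = """\
t+0s:    eventLoopDelayMax=28119ms  utilization=0.64  cpuRatio=0.20  active=0  queued=0
t+164s:  eventLoopDelayMax=1925ms   utilization=0.13  cpuRatio=0.22  active=0  queued=0
t+285s:  eventLoopDelayMax=13749ms  utilization=0.45  cpuRatio=0.46  active=1  queued=1
t+448s:  eventLoopDelayMax=12625ms  utilization=0.43  cpuRatio=0.44  active=1  queued=1
t+630s:  eventLoopDelayMax=1068ms   utilization=0.07  cpuRatio=0.10  active=0  queued=0
t+844s:  eventLoopDelayMax=11467ms  utilization=0.34  cpuRatio=0.35  active=1  queued=1
t+995s:  eventLoopDelayMax=13522ms  utilization=1.00  cpuRatio=1.02  active=1  queued=1
t+1119s: eventLoopDelayMax=31256ms  utilization=0.93  cpuRatio=1.19  active=1  queued=1
"""

SAMPLE_RE = re.compile(
    r"t\+(?P<t>\d+)s:\s+eventLoopDelayMax=(?P<delay>\d+)ms\s+.*?active=(?P<active>\d+)"
)

def parse_timeline(text):
    """Return one dict per sample: seconds since start, max delay, active runs."""
    samples = []
    for line in text.splitlines():
        match = SAMPLE_RE.search(line)
        if match:
            samples.append({
                "t": int(match["t"]),
                "delay_ms": int(match["delay"]),
                "active": int(match["active"]),
            })
    return samples

def long_blocks(samples, threshold_ms=10_000):
    """Samples whose max event-loop delay meets or exceeds the threshold."""
    return [s for s in samples if s["delay_ms"] >= threshold_ms]

samples = parse_timeline(TIMELINE)
for s in long_blocks(samples):
    phase = "startup" if s["t"] == 0 else f"active={s['active']}"
    print(f"t+{s['t']}s: {s['delay_ms']} ms ({phase})")
```

On this data every post-startup block of 10 seconds or more has active=1, while the idle samples after startup stay under 2 seconds.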

Bug 2 — Embedded agent run causes ~926MB RSS spike → memory pressure → ERR_MODULE_NOT_FOUND crash

A single webchat message triggers a near-1GB RSS jump in the main gateway process:

t+1085s  RSS: 748 MB   heap: 407 MB   (idle, no active runs)
t+1119s  RSS: 1674 MB  heap: 1196 MB  <- single webchat message queued at t+1086s
         *** diagnostic.memory.pressure: rss_threshold (1.5GB) ***
         eventLoopDelayMax: 31256ms  utilization: 0.93  cpuRatio: 1.19

Immediately after the memory pressure event, the process crashes with an unhandled rejection:

{
  "reason": "unhandled_rejection",
  "error": {
    "name": "Error",
    "code": "ERR_MODULE_NOT_FOUND"
  }
}

The ~926MB spike appears to be the embedded agent harness loading module trees into the main process that are not released after dispatch. This did not occur in 2026.2.26, which did not use the runtime-dep installation pattern introduced around 2026.4.24.
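
For reference, the arithmetic behind the ~926MB figure, as a small Python sketch using only the two memory samples quoted above. It assumes the "heap" value in the log is the JS heap, which the log excerpt does not state explicitly.

```python
# Memory samples from the stability log excerpt above, in MB.
idle = {"rss": 748, "heap": 407}      # t+1085s, no active runs
loaded = {"rss": 1674, "heap": 1196}  # t+1119s, one webchat message queued

rss_delta = loaded["rss"] - idle["rss"]      # total resident growth
heap_delta = loaded["heap"] - idle["heap"]   # growth in the reported heap figure
off_heap_delta = rss_delta - heap_delta      # growth not accounted for by the heap
elapsed_s = 1119 - 1085                      # seconds between the two samples

print(f"RSS  +{rss_delta} MB in {elapsed_s}s")
print(f"heap +{heap_delta} MB ({heap_delta / rss_delta:.0%} of the RSS growth)")
print(f"off-heap +{off_heap_delta} MB")
```

Under that assumption, most of the growth (789 of 926 MB) sits in the heap figure rather than in off-heap allocations.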

The ERR_MODULE_NOT_FOUND is likely a secondary effect: a dynamic import() in the embedded harness fails under memory pressure, becomes an unhandled rejection, and kills the process.

Startup stage breakdown for a single agent request (from the gateway log):

startup stages: totalMs=15630
  workspace:0ms  runtime-plugins:2ms  hooks:0ms
  model-resolution:4217ms  auth:5621ms
  context-engine:0ms  attempt-dispatch:5790ms

Model resolution and auth take roughly 4–6 seconds each per run (4,217 ms and 5,621 ms here), even when the process is not yet under memory pressure.
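
To see where the 15.6 seconds go, the stage line can be split into per-stage shares. This is a minimal Python sketch that parses the log excerpt above; the `name:<n>ms` token pattern is inferred from that excerpt and the helper name is illustrative.

```python
import re

# Stage breakdown copied from the gateway log excerpt above.
LOG = """\
startup stages: totalMs=15630
  workspace:0ms  runtime-plugins:2ms  hooks:0ms
  model-resolution:4217ms  auth:5621ms
  context-engine:0ms  attempt-dispatch:5790ms
"""

def parse_stages(text):
    """Map stage name -> duration in ms for every `name:<n>ms` token."""
    return {name: int(ms) for name, ms in re.findall(r"([\w-]+):(\d+)ms", text)}

stages = parse_stages(LOG)
total_ms = int(re.search(r"totalMs=(\d+)", LOG).group(1))

# Print stages from slowest to fastest with their share of the total.
for name, ms in sorted(stages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<18}{ms:>6} ms  {ms / total_ms:6.1%}")
```

The stage durations sum exactly to totalMs. Model resolution and auth together account for 9,838 ms, about 63% of the run's startup time, with attempt-dispatch making up nearly all of the rest.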


Bug 3 — Child processes orphaned on service restart; openclaw gateway install writes KillMode=process

The systemd unit written by openclaw gateway install includes:

KillMode=process

This means systemctl restart (or stop) only sends SIGTERM to the main gateway PID. Any subprocesses spawned — including the embedded agent harness subprocess — are left running.

After a restart triggered by Bug 2's crash, the orphaned subprocess continued running at 102% CPU / 12.9GB virtual memory for 9+ minutes with no owner. It survived a version downgrade to 2026.2.26 and caused "still broken after rollback" symptoms that appeared to be a config problem. Only a manual kill -9 resolved it.

Suggested fix: Change the generated unit to KillMode=mixed, which SIGTERMs the main process and SIGKILLs any remaining children in the cgroup after TimeoutStopSec. One-line change in the gateway install template:

# Before
KillMode=process

# After
KillMode=mixed
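
Until the template is fixed, the same substitution can be applied to an already-installed unit. Below is a minimal Python sketch that operates on the unit text only; the sample unit contents and the function name are hypothetical, and the location of the real unit file is not assumed here.

```python
import re

def patch_kill_mode(unit_text, new_mode="mixed"):
    """Rewrite an existing `KillMode=process` line; leave other lines untouched."""
    return re.sub(r"(?m)^KillMode=process[ \t]*$", f"KillMode={new_mode}", unit_text)

# Hypothetical unit contents for illustration only; the real generated
# unit will differ apart from the KillMode line quoted above.
sample_unit = """\
[Service]
ExecStart=/usr/bin/example-gateway
Restart=always
KillMode=process
"""

print(patch_kill_mode(sample_unit))
```

The function is idempotent, so running it on an already-patched unit changes nothing. After writing the result back, systemd still needs a systemctl --user daemon-reload to pick up the change.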

Reproduction notes

  • Bugs 1 and 3 appear reliably reproducible on any Linux systemd install.
  • Bug 2 was reproduced on the first webchat message after ~18 minutes of uptime using openrouter/moonshotai/kimi-k2.6 as primary model. May be general to any embedded agent run.
  • The stability log (~/.openclaw/logs/stability/) captures the full event sequence — happy to share the sanitized JSON if useful.

Workaround (until fixed)

  1. Pin to 2026.2.26: npm install -g openclaw@2026.2.26
  2. Kill orphaned children after the downgrade: list candidates with ps aux | grep openclaw, then kill -9 every match that is neither the current service's main PID nor one of its children
  3. Manually patch the systemd unit: change KillMode=process to KillMode=mixed and run systemctl --user daemon-reload
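
Step 2 depends on telling orphans apart from the live service's own children. Below is a minimal Python sketch of that filter, assuming process rows in the form printed by ps -eo pid,ppid,args and a known main PID for the service (for example from systemctl --user show -p MainPID on the gateway unit). The sample rows, command lines, and helper names are invented for illustration, and matching on the substring "openclaw" is only a heuristic.

```python
def parse_ps(text):
    """Parse `ps -eo pid,ppid,args` output into (pid, ppid, args) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 2)
        # The header row is skipped because its first fields are not numeric.
        if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
            rows.append((int(parts[0]), int(parts[1]), parts[2]))
    return rows

def descendants(rows, root_pid):
    """All PIDs reachable from root_pid by following parent -> child links."""
    children = {}
    for pid, ppid, _ in rows:
        children.setdefault(ppid, []).append(pid)
    seen, stack = set(), [root_pid]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def orphan_candidates(rows, main_pid, needle="openclaw"):
    """Matching processes that are neither the main PID nor below it."""
    owned = descendants(rows, main_pid) | {main_pid}
    return [(pid, args) for pid, _, args in rows
            if needle in args and pid not in owned]

# Invented sample: 4242 is the live gateway, 4300 its child, 3111 an orphan.
SAMPLE_PS = """\
    PID    PPID COMMAND
      1       0 /sbin/init
    900       1 /usr/lib/systemd/systemd --user
   4242     900 node example-openclaw-gateway
   4300    4242 node example-openclaw-harness
   3111     900 node example-openclaw-harness
"""

for pid, args in orphan_candidates(parse_ps(SAMPLE_PS), main_pid=4242):
    print(f"candidate orphan: pid={pid} cmd={args}")
```

The sketch only prints candidates and leaves the kill to the operator, since a substring match can also catch unrelated processes such as an editor with an openclaw file open.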
