Every release of OpenClaw since 2026.4.24 produces the same boot-time WebSocket handler starvation on a Raspberry Pi 5 / ARM64 native systemd install. The gateway's [gateway] ready line fires, plugin/channel providers register, but the gateway pegs ~100% on a single CPU core and the WS handler on 127.0.0.1:18789 either never responds or responds intermittently with degraded latency. Internal subsystems (memory-core ↔ cron-service) cannot reach each other while the event loop is starved. The 2026.4.23 rollback is clean every time — within ~30s the gateway reports Connect: ok ~30 ms with idle CPU.
This was previously reported via comments on #73655 and #74323 (closed as a duplicate of #73655). I'm filing a dedicated tracking issue because the symptom now reproduces unchanged across three consecutive upgrade attempts on this host (twice on 4.26 with different Node versions, once on 4.29; 4.27 was skipped after reading its release notes) — including 4.29, which carries explicit Refs #73655 and Refs #72338 fixes intended to address it.
Environment
OpenClaw: tested 2026.4.26 (be8c246), 2026.4.29 (a448042). Last known-good: 2026.4.23 (a979721).
Hardware: Raspberry Pi 5 (8 GB), ARM64 (aarch64)
OS: Linux 6.12.62+rpt-rpi-2712 (64-bit, Debian-derived)
Node: tested both v22.22.2 and v24.15.0 (NodeSource apt). Symptom changes shape with Node version (see below) but does not go away.
Install: npm i -g openclaw into ~/.npm-global
Gateway: systemd user service openclaw-gateway, managed via systemctl --user (loopback only, no remote access, no reverse proxy)
Reproduction
The gateway log shows:
[gateway] ready (within 8s)
[plugins] embedded acpx runtime backend registered (+12s)
[browser/server] Browser control listening on http://127.0.0.1:18791/ (+15s, this works)
[telegram] [<account>] starting provider (+30-180s, varies wildly)
[discord] [default] starting provider
[plugins] memory-core: managed dreaming cron could not be reconciled (cron service unavailable). ← starvation marker
[ws] closed before connect ... code=1006 (repeating, while CPU pegged)
ss -tln | grep :18789 shows the listener up with Recv-Q=1 (one queued connection unserviced). top -bn1 -p $(MainPID) shows 90-280% CPU sustained. curl -m3 -H 'Connection:Upgrade' -H 'Upgrade:websocket' -H 'Sec-WebSocket-Key: dGVzdA==' -H 'Sec-WebSocket-Version: 13' http://127.0.0.1:18789/ returns 0 bytes after 3s.
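For anyone reproducing, the probe sequence above boils down to the following snippet (the unit name openclaw-gateway and the MainPID lookup are specific to this host's systemd user setup):

# Resolve the gateway PID from the user unit
PID=$(systemctl --user show -p MainPID --value openclaw-gateway)

# Listener state: Recv-Q > 0 means an accepted-but-unserviced connection
ss -tln | grep :18789

# Sustained CPU for the gateway process during the peg
top -bn1 -p "$PID"

# Raw WS upgrade probe; on the affected releases this returns 0 bytes and hits the 3s timeout
curl -m3 -sS -o /dev/null -w 'http=%{http_code} bytes=%{size_download} t=%{time_total}s\n' \
  -H 'Connection: Upgrade' -H 'Upgrade: websocket' \
  -H 'Sec-WebSocket-Key: dGVzdA==' -H 'Sec-WebSocket-Version: 13' \
  http://127.0.0.1:18789/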
Symptom variations across attempts
Attempt | Versions | Symptom
--- | --- | ---
1 | Node 22.22.2 + 4.26 | WS handler completely dead from boot; Recv-Q stuck; probe times out indefinitely (>5 min waited)
2 | Node 24.15.0 + 4.26 | WS handler intermittently responds after ~3 min; one probe succeeds with a 435 ms RPC, the next times out; CPU pegged at 100%+
3 | Node 24.15.0 + 4.29 | Same as attempt 2; 8+ min after boot CPU still pegged at 100%, 6/6 fresh probes time out, internal cron-service unreachable
The 4.26 changelog #72720 fix ("skip CLI startup self-respawn for foreground gateway runs so low-memory Linux/Node 24 hosts start through the same path … without hanging before logs") does help: on Node 22 the gateway never reaches a state where any probe can succeed, while on Node 24 it eventually serves a few. But that fix only addresses observable liveness, not the underlying event-loop starvation.
What 4.29 changes (and what it does not)
What worked as documented:
Fixes #74692 (sqlite-vec mirrored into bundled-plugin runtime-deps) — verified after rollback that memory-graph stats returns the correct rows; vector extension loads fine. So the fix is good, just lands inside a release whose other regressions block the upgrade for this host.
What did not change the symptom on this hardware:
Refs #73655 (conservative stuck-session recovery that releases only stale session lanes…) — no observable effect on the boot-time CPU peg. The starvation appears to begin during plugin/runtime-deps materialization, before any session lane could be marked stuck.
Refs #72338 (subagents/runs.json file-signature caching) — no observable effect; CPU stays pegged for >8 min post-restart with the new cache in place.
Possible root cause angle
This is one host's read, not a diagnosis: the symptom feels less like leak (1) of #73655 (the Manifest EADDRINUSE retry loop) and more like a pre-session starvation source — runtime-deps materialization, mirror staging, model-pricing fetch, or plugin-discovery work — that owns the event loop during and after [gateway] ready. Two specific observations:
The cron service unavailable log line fires during the CPU peg, not after a queued reply. That suggests internal RPC contention rather than session-write-lock starvation.
Whatever it is, it scales with something Pi 5-specific (slow disk? ARM64 JIT warm-up? lower core count vs the M1/M2 reporters?), since the same 4.26 → 4.29 upgrade reportedly works on faster ARM64 Mac hosts (per #74323's macOS reporter, after they reverted to a working 4.26).
Workaround
npm i -g openclaw@2026.4.23
systemctl --user daemon-reload && systemctl --user restart openclaw-gateway
# Reachable in ~30s, RPC ~30ms, idle CPU 0%, RSS ~650MB. No data loss across the round-trip.
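To confirm the rollback took, I wait for the WS listener and re-run the same upgrade probe as in the reproduction section (a rough sketch of what I run, not an official check):

# Wait up to 60s for the WS listener, then probe the handshake once
for i in $(seq 1 60); do ss -tln | grep -q ':18789' && break; sleep 1; done
curl -m3 -sS -o /dev/null -w 'http=%{http_code} t=%{time_total}s\n' \
  -H 'Connection: Upgrade' -H 'Upgrade: websocket' \
  -H 'Sec-WebSocket-Key: dGVzdA==' -H 'Sec-WebSocket-Version: 13' \
  http://127.0.0.1:18789/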
What I can capture next time
If a maintainer wants targeted evidence on the next attempt, I can capture any of the following (a rough command sketch follows the list):
OPENCLAW_GATEWAY_STARTUP_TRACE=1 per-phase timings and lookup-table counts
per-thread CPU samples on the gateway PID (top -H -p $PID)
signal-handler state from /proc/$PID/status | grep Sig, plus the MaxListenersExceededWarning mentioned in #73655 if it appears
lsof -p $PID snapshot during the peg
strace -c -p $PID 30s sample to see which syscalls dominate
raw gateway log (StandardOutput=append: is configured)
any other capture you'd like
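A sketch of how I would wire that capture on the next attempt (the set-environment step for OPENCLAW_GATEWAY_STARTUP_TRACE and the output file names are my assumptions, not documented OpenClaw behaviour; happy to follow whatever procedure the maintainers prefer instead):

# Export the startup trace flag to the user manager, upgrade, restart
systemctl --user set-environment OPENCLAW_GATEWAY_STARTUP_TRACE=1
npm i -g openclaw@2026.4.29
systemctl --user restart openclaw-gateway
PID=$(systemctl --user show -p MainPID --value openclaw-gateway)

# While the CPU peg is active
top -H -b -n1 -p "$PID" > top-threads.txt              # per-thread CPU split
timeout 30 strace -c -p "$PID" 2> strace-summary.txt   # 30s syscall profile (may need sudo depending on ptrace_scope)
lsof -p "$PID" > lsof-snapshot.txt                     # open files/sockets during the peg
grep Sig /proc/"$PID"/status > sigmask.txt             # pending/blocked/caught signal masks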
Happy to retry the upgrade on this host with capture instrumentation as soon as someone is in a position to consume the evidence. Holding on 4.23 until then.