
Gateway WS handler CPU-spin starvation reproduces on Raspberry Pi 5 / ARM64 across 4.26 → 4.29; clean rollback to 4.23 every time #75703

@jsc2304

Summary

Every release of OpenClaw since 2026.4.24 produces the same boot-time WebSocket handler starvation on a Raspberry Pi 5 / ARM64 native systemd install. The gateway's [gateway] ready line fires, plugin/channel providers register, but the gateway pegs ~100% on a single CPU core and the WS handler on 127.0.0.1:18789 either never responds or responds intermittently with degraded latency. Internal subsystems (memory-core ↔ cron-service) cannot reach each other while the event loop is starved. The 2026.4.23 rollback is clean every time — within ~30s the gateway reports Connect: ok ~30 ms with idle CPU.

This was previously reported via comments on #73655 and #74323 (closed as duplicate of #73655). Filing this as a dedicated tracking issue because the symptom now reproduces unchanged across three consecutive upgrade attempts on this host (4.26 twice, then 4.29; 4.27 was skipped after reading its release notes), including 4.29, which carries explicit Refs #73655 and Refs #72338 fixes intended to address it.

Environment

  • Host: Raspberry Pi 5 (ARM64), native systemd user-service install
  • Node: 22.22.2 (attempt 1), 24.15.0 (attempts 2 and 3)
  • OpenClaw: 2026.4.23 known-good; 2026.4.26 and 2026.4.29 reproduce the starvation
  • Gateway endpoints: WS on 127.0.0.1:18789, browser control on 127.0.0.1:18791

Reproduction

# Pre-flight (every attempt)
cp ~/.openclaw/openclaw.json ~/.openclaw/openclaw.json.bak-pre-update
rm -rf ~/.openclaw/plugin-runtime-deps/openclaw-unknown-*
rm -rf ~/.openclaw/plugin-runtime-deps/openclaw-2026.4.<old>-*

# Update
openclaw update --yes
# → "Update Result: OK", doctor passes (~60-160s, varies)

# Health
openclaw gateway probe
# → "Connect: failed - timeout" repeatedly for 4-8+ minutes

The gateway log shows:

[gateway] ready                                           (within 8s)
[plugins] embedded acpx runtime backend registered        (+12s)
[browser/server] Browser control listening on http://127.0.0.1:18791/   (+15s, this works)
[telegram] [<account>] starting provider                  (+30-180s, varies wildly)
[discord] [default] starting provider
[plugins] memory-core: managed dreaming cron could not be reconciled (cron service unavailable).   ← starvation marker
[ws] closed before connect ... code=1006                  (repeating, while CPU pegged)

The listener is up but unserviced, the gateway process is pegged, and a raw WebSocket handshake gets nothing back:

ss -tln | grep :18789
# → listener up, Recv-Q=1 (one queued connection unserviced)
top -bn1 -p "$(systemctl --user show --property=MainPID --value openclaw-gateway)"
# → 90-280% CPU sustained
curl -m3 -H 'Connection: Upgrade' -H 'Upgrade: websocket' \
     -H 'Sec-WebSocket-Key: dGVzdA==' -H 'Sec-WebSocket-Version: 13' \
     http://127.0.0.1:18789/
# → 0 bytes after 3s
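
To see when the peg begins relative to the log lines above, a coarse per-second sampler can run across a restart (a sketch; the unit name openclaw-gateway is taken from the rollback commands below):

# Sample Recv-Q on :18789 and gateway %CPU once a second. Sketch only -
# re-reads MainPID each tick in case the unit restarts mid-capture.
while sleep 1; do
  pid=$(systemctl --user show --property=MainPID --value openclaw-gateway)
  printf '%s recvq=%s cpu=%s\n' \
    "$(date -Is)" \
    "$(ss -tln | awk '/:18789/ { print $2 }')" \
    "$(top -bn1 -p "$pid" | awk 'END { print $9 }')"
done | tee /tmp/gateway-starvation.log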

Symptom variations across attempts

Attempt 1 (Node 22.22.2 + 4.26): WS handler completely dead from boot; Recv-Q stuck; probe times out indefinitely (>5 min waited)
Attempt 2 (Node 24.15.0 + 4.26): WS handler intermittently responds after ~3 min; one probe succeeds with 435 ms RPC, the next times out; CPU pegged at 100%+
Attempt 3 (Node 24.15.0 + 4.29): same as attempt 2; 8+ min after boot the CPU is still pegged at 100%, 6/6 fresh probes time out, internal cron-service unreachable

The 4.26 changelog #72720 fix ("skip CLI startup self-respawn for foreground gateway runs so low-memory Linux/Node 24 hosts start through the same path … without hanging before logs") does help: on Node 22 the gateway never reaches a state where any probe can succeed, while on Node 24 it eventually serves a few. But that fix only addresses the observable liveness, not the underlying event-loop starvation.

What 4.29 changes (and what it does not)

What worked as documented:

  • Fixes #74692 (sqlite-vec mirrored into bundled-plugin runtime-deps): verified after rollback that memory-graph stats returns the correct rows; the vector extension loads fine. So the fix itself is good; it just lands inside a release whose other regressions block the upgrade for this host.

What did not change the symptom on this hardware:

  • Refs #73655 (conservative stuck-session recovery that releases only stale session lanes…) — no observable effect on the boot-time CPU peg. The starvation appears to begin during plugin/runtime-deps materialization, before any session lane could be marked stuck.
  • Refs #72338 (subagents/runs.json file-signature caching) — no observable effect; CPU stays pegged for >8 min post-restart with the new cache in place.

Possible root cause angle

This is one host's read, not a diagnosis: the symptom feels less like the manifest EADDRINUSE retry leak and more like a pre-session starvation source (runtime-deps materialization, mirror staging, model-pricing fetch, or plugin-discovery work) that owns the event loop during and after [gateway] ready. One specific observation:

Whatever it is, it scales with something Pi5-specific (slow disk? ARM64 JIT warm-up? lower core count than the M1/M2 reporters?), since the same 4.26 → 4.29 upgrade reportedly works on faster ARM64 macOS hosts (per #74323's macOS reporter, after they had reverted to a working 4.26).
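
A cheap way to split the slow-disk vs CPU-bound guesses on the next attempt (a sketch; assumes the sysstat package is installed):

# If %usr dominates, the peg is compute-bound (parsing/JIT/hashing work);
# if %system or iowait dominates, suspect the Pi 5's storage. Sketch only.
pid=$(systemctl --user show --property=MainPID --value openclaw-gateway)
pidstat -u -p "$pid" 1 10    # per-second %usr / %system for the gateway
iostat -dx 1 10              # device utilization and await over the same window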

Workaround

npm i -g openclaw@2026.4.23
systemctl --user daemon-reload && systemctl --user restart openclaw-gateway
# Reachable in ~30s, RPC ~30ms, idle CPU 0%, RSS ~650MB. No data loss across the round-trip.

What I can capture next time

If a maintainer wants targeted evidence on the next attempt, I can capture whatever is most useful: the timestamped probe and sampler logs sketched above, the pidstat/iostat split, or a CPU profile of the pegged process, as in the sketch below.
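
For example, a native-stack CPU profile of whatever owns the event loop (a sketch; assumes the perf tool is available on this host, and JS frames would only be named if the gateway's node ran with --perf-basic-prof):

# 30-second CPU profile of the pegged gateway process. Sketch only -
# assumes perf is installed (the linux-perf package on Raspberry Pi OS).
pid=$(systemctl --user show --property=MainPID --value openclaw-gateway)
perf record -F 99 -g -p "$pid" -- sleep 30
perf report --stdio | head -50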

Happy to retry the upgrade on this host with capture instrumentation as soon as someone is in a position to consume the evidence. Holding on 4.23 until then.

Cross-references

  • #73655: original WS-starvation report; this host's earlier comments live there
  • #74323: closed as duplicate of #73655; its macOS reporter is the faster-ARM64 data point above
  • #72720: 4.26 foreground-gateway self-respawn fix (partial mitigation, see above)
  • #72338: subagents/runs.json file-signature caching (no observable effect here)
  • #74692: sqlite-vec runtime-deps fix (verified good after rollback)
