
Gateway WS handler CPU-spin starvation reproduces on Raspberry Pi 5 / ARM64 across 4.26 → 4.29; clean rollback to 4.23 every time #75703

@jsc2304

Summary

Every release of OpenClaw since 2026.4.24 produces the same boot-time WebSocket handler starvation on a Raspberry Pi 5 / ARM64 native systemd install. The gateway's [gateway] ready line fires, plugin/channel providers register, but the gateway pegs ~100% on a single CPU core and the WS handler on 127.0.0.1:18789 either never responds or responds intermittently with degraded latency. Internal subsystems (memory-core ↔ cron-service) cannot reach each other while the event loop is starved. The 2026.4.23 rollback is clean every time — within ~30s the gateway reports Connect: ok ~30 ms with idle CPU.

This was previously reported via comments on #73655 and #74323 (closed as duplicate of #73655). Filing this as a dedicated tracking issue because the symptom now reproduces unchanged across three consecutive upgrade attempts on this host (4.26 twice, then 4.29; 4.27 was skipped after reading its release notes), including 4.29, which carries explicit Refs #73655 and Refs #72338 fixes intended to address it.

Environment

  • Host: Raspberry Pi 5 (ARM64), native systemd user-service install
  • Node: 22.22.2 (attempt 1), 24.15.0 (attempts 2 and 3)
  • OpenClaw: 2026.4.23 known-good; 2026.4.26 and 2026.4.29 reproduce the starvation
  • Gateway endpoints: WS on 127.0.0.1:18789, browser control on 127.0.0.1:18791

Reproduction

# Pre-flight (every attempt)
cp ~/.openclaw/openclaw.json ~/.openclaw/openclaw.json.bak-pre-update
rm -rf ~/.openclaw/plugin-runtime-deps/openclaw-unknown-*
rm -rf ~/.openclaw/plugin-runtime-deps/openclaw-2026.4.<old>-*

# Update
openclaw update --yes
# → "Update Result: OK", doctor passes (~60-160s, varies)

# Health
openclaw gateway probe
# → "Connect: failed - timeout" repeatedly for 4-8+ minutes

The gateway log shows:

[gateway] ready                                           (within 8s)
[plugins] embedded acpx runtime backend registered        (+12s)
[browser/server] Browser control listening on http://127.0.0.1:18791/   (+15s, this works)
[telegram] [<account>] starting provider                  (+30-180s, varies wildly)
[discord] [default] starting provider
[plugins] memory-core: managed dreaming cron could not be reconciled (cron service unavailable).   ← starvation marker
[ws] closed before connect ... code=1006                  (repeating, while CPU pegged)

The listener is up but unserviced, the gateway process is pegged, and a raw WebSocket handshake gets nothing back:

ss -tln | grep :18789
# → listener up, Recv-Q=1 (one queued connection unserviced)
top -bn1 -p "$(systemctl --user show --property=MainPID --value openclaw-gateway)"
# → 90-280% CPU sustained
curl -m3 -H 'Connection: Upgrade' -H 'Upgrade: websocket' \
     -H 'Sec-WebSocket-Key: dGVzdA==' -H 'Sec-WebSocket-Version: 13' \
     http://127.0.0.1:18789/
# → 0 bytes after 3s
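
To see when the peg begins relative to the log lines above, a coarse per-second sampler can run across a restart (a sketch; the unit name openclaw-gateway is taken from the rollback commands below):

# Sample Recv-Q on :18789 and gateway %CPU once a second. Sketch only -
# re-reads MainPID each tick in case the unit restarts mid-capture.
while sleep 1; do
  pid=$(systemctl --user show --property=MainPID --value openclaw-gateway)
  printf '%s recvq=%s cpu=%s\n' \
    "$(date -Is)" \
    "$(ss -tln | awk '/:18789/ { print $2 }')" \
    "$(top -bn1 -p "$pid" | awk 'END { print $9 }')"
done | tee /tmp/gateway-starvation.log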

Symptom variations across attempts

Attempt 1 (Node 22.22.2 + 4.26): WS handler completely dead from boot; Recv-Q stuck; probe times out indefinitely (>5 min waited)
Attempt 2 (Node 24.15.0 + 4.26): WS handler intermittently responds after ~3 min; one probe succeeds with 435 ms RPC, the next times out; CPU pegged at 100%+
Attempt 3 (Node 24.15.0 + 4.29): same as attempt 2; 8+ min after boot the CPU is still pegged at 100%, 6/6 fresh probes time out, internal cron-service unreachable

The 4.26 changelog #72720 fix ("skip CLI startup self-respawn for foreground gateway runs so low-memory Linux/Node 24 hosts start through the same path … without hanging before logs") does help: on Node 22 the gateway never reaches a state where any probe can succeed, while on Node 24 it eventually serves a few. But that fix only addresses the observable liveness, not the underlying event-loop starvation.

What 4.29 changes (and what it does not)

What worked as documented:

  • Fixes #74692 (sqlite-vec mirrored into bundled-plugin runtime-deps): verified after rollback that memory-graph stats returns the correct rows; the vector extension loads fine. So the fix itself is good; it just lands inside a release whose other regressions block the upgrade for this host.

What did not change the symptom on this hardware:

  • Refs #73655 (conservative stuck-session recovery that releases only stale session lanes…) — no observable effect on the boot-time CPU peg. The starvation appears to begin during plugin/runtime-deps materialization, before any session lane could be marked stuck.
  • Refs #72338 (subagents/runs.json file-signature caching) — no observable effect; CPU stays pegged for >8 min post-restart with the new cache in place.

Possible root cause angle

This is one host's read, not a diagnosis: the symptom feels less like the manifest EADDRINUSE retry leak and more like a pre-session starvation source (runtime-deps materialization, mirror staging, model-pricing fetch, or plugin-discovery work) that owns the event loop during and after [gateway] ready. One specific observation:

Whatever it is, it scales with something Pi5-specific (slow disk? ARM64 JIT warm-up? lower core count than the M1/M2 reporters?), since the same 4.26 → 4.29 upgrade reportedly works on faster ARM64 macOS hosts (per #74323's macOS reporter, after they had reverted to a working 4.26).
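
A cheap way to split the slow-disk vs CPU-bound guesses on the next attempt (a sketch; assumes the sysstat package is installed):

# If %usr dominates, the peg is compute-bound (parsing/JIT/hashing work);
# if %system or iowait dominates, suspect the Pi 5's storage. Sketch only.
pid=$(systemctl --user show --property=MainPID --value openclaw-gateway)
pidstat -u -p "$pid" 1 10    # per-second %usr / %system for the gateway
iostat -dx 1 10              # device utilization and await over the same window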

Workaround

npm i -g openclaw@2026.4.23
systemctl --user daemon-reload && systemctl --user restart openclaw-gateway
# Reachable in ~30s, RPC ~30ms, idle CPU 0%, RSS ~650MB. No data loss across the round-trip.

What I can capture next time

If a maintainer wants targeted evidence on the next attempt, I can capture whatever is most useful: the timestamped probe and sampler logs sketched above, the pidstat/iostat split, or a CPU profile of the pegged process, as in the sketch below.
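
For example, a native-stack CPU profile of whatever owns the event loop (a sketch; assumes the perf tool is available on this host, and JS frames would only be named if the gateway's node ran with --perf-basic-prof):

# 30-second CPU profile of the pegged gateway process. Sketch only -
# assumes perf is installed (the linux-perf package on Raspberry Pi OS).
pid=$(systemctl --user show --property=MainPID --value openclaw-gateway)
perf record -F 99 -g -p "$pid" -- sleep 30
perf report --stdio | head -50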

Happy to retry the upgrade on this host with capture instrumentation as soon as someone is in a position to consume the evidence. Holding on 4.23 until then.

Cross-references

  • #73655: original WS-starvation report; this host's earlier comments live there
  • #74323: closed as duplicate of #73655; its macOS reporter is the faster-ARM64 data point above
  • #72720: 4.26 foreground-gateway self-respawn fix (partial mitigation, see above)
  • #72338: subagents/runs.json file-signature caching (no observable effect here)
  • #74692: sqlite-vec runtime-deps fix (verified good after rollback)
