Bug type
Behavior bug (incorrect output/state without crash)
Beta release blocker
No
Summary
Native hook relay CLI invocations (openclaw hooks relay --provider … --relay-id … --event …, process title openclaw-hooks) can remain alive indefinitely after their work completes or fails. Each invocation loads the full CLI bundle (~300–450 MB RSS). On a gateway driven by periodic heartbeat agent turns, stuck relays accumulated at roughly 2 per turn until the host ran out of memory. The failure chain: 49 stuck openclaw-hooks processes holding 12.4 GB RSS on an 18 GB host (no swap configured) → the kernel's global OOM killer killed the gateway (highest oom_score_adj) → that freed only ~0.5 GB because the leaked relays are independent processes → the host livelocked (SSH key exchange could not complete) until a hard reboot ~40 h later.
This is distinct from #89325 (stale relay registration after restart): the stale-registration errors were also observed, but the bug here is that the relay processes themselves never exit, regardless of whether the gateway call succeeds, fails, or hits the stale-registration path.
Steps to reproduce
- Run a gateway with a CLI-backend agent (claude-cli and codex app-server harnesses in this case) and heartbeat enabled, so hook events fire regularly.
- Let it run for several hours.
- Run ps -eo comm,rss | grep openclaw-hooks: stuck relay processes accumulate (each ~300–450 MB RSS) instead of exiting after their ~5 s useful lifetime.
Mechanism, from the shipped bundle (2026.6.1, commit 2e08f0f), dist hooks-cli chunk, runNativeHookRelayCli:
- readStreamText(stdin) does for await (const chunk of stream) with no timeout. If the spawning harness keeps the stdin pipe open, the relay blocks forever before any timeout logic applies.
- The commander action sets process.exitCode = await runNativeHookRelayCli(opts) and returns; there is no process.exit(). Any handle still referenced after the run (e.g. the gateway WS connection from callGateway, or stdin still open) keeps the Node process alive even though its work is done.
- The default --timeout 5000 bounds only the gateway RPC, not the stdin read and not process lifetime.
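To illustrate the first item, here is a minimal sketch of a stdin read that cannot block forever. readStreamTextBounded and its return shape are hypothetical names for illustration, not the shipped readStreamText; the only assumption is that the relay reads a Node Readable stream to completion before doing anything else.

```javascript
"use strict";

// Hypothetical helper (not the shipped readStreamText): collect a stream as
// UTF-8 text, but stop waiting after `ms` milliseconds.
async function readStreamTextBounded(stream, ms) {
  const chunks = [];

  // Same pattern as the unbounded read: iterate until the writer closes the pipe.
  const drained = (async () => {
    for await (const chunk of stream) chunks.push(Buffer.from(chunk));
    return "end";
  })();

  let timer;
  const deadline = new Promise((resolve) => {
    timer = setTimeout(() => resolve("timeout"), ms);
  });

  try {
    const outcome = await Promise.race([drained, deadline]);
    if (outcome === "timeout") {
      // Destroying the stream can make the pending iteration reject; swallow
      // that, then release the handle so it cannot keep the process alive.
      drained.catch(() => {});
      stream.destroy();
    }
    return {
      text: Buffer.concat(chunks).toString("utf8"),
      timedOut: outcome === "timeout",
    };
  } finally {
    clearTimeout(timer);
  }
}

// Possible use inside the CLI action (the timeout value is an assumption):
// const { text, timedOut } = await readStreamTextBounded(process.stdin, 5000);
```

Returning whatever was read so far, together with a timedOut flag, leaves the decision about partial input to the caller; rejecting on timeout would be an equally reasonable choice.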
Expected behavior
Relay invocations are strictly bounded: a hard process deadline (e.g. an unref'd setTimeout(() => process.exit(124), …) armed at action start), a bounded stdin read, and/or an explicit process.exit(exitCode) after flushing stdout/stderr. A relay process should never outlive its gateway timeout by more than a few seconds.
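A rough sketch of the deadline-plus-explicit-exit shape described above. runWithHardDeadline, flush, and the injectable exit parameter are hypothetical and exist only so the behaviour can be exercised without terminating the caller; the real action would pass the body that currently awaits runNativeHookRelayCli(opts).

```javascript
"use strict";

// Resolve once everything previously written to `stream` has been handed off.
function flush(stream) {
  return new Promise((resolve) => stream.write("", () => resolve()));
}

// Hypothetical wrapper: `run` stands in for the CLI action body and resolves
// to an exit code; `exit` defaults to process.exit.
async function runWithHardDeadline(run, deadlineMs, exit = process.exit) {
  // unref() stops this timer from holding the event loop open by itself, but
  // it still fires if some other handle (socket, stdin) keeps the process alive.
  const timer = setTimeout(() => exit(124), deadlineMs);
  timer.unref();

  let code;
  try {
    code = await run();
  } catch {
    code = 1; // treat a thrown error as a generic failure
  }

  // The deadline stays armed while flushing, so a stuck pipe cannot hang us.
  await flush(process.stdout);
  await flush(process.stderr);
  clearTimeout(timer);

  // Explicit exit: leftover handles no longer decide the process lifetime.
  exit(code);
}

// Possible use in the commander action (the deadline value is an assumption):
// await runWithHardDeadline(() => runNativeHookRelayCli(opts), 15000);
```

The deadline would need to be somewhat larger than the --timeout value so that a slow but successful gateway call is not cut off.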
Actual behavior
Relay processes survive indefinitely. Kernel OOM task dump at the time of the kill showed 49 processes with comm openclaw-hooks at ~85k–117k pages each (≈0.33–0.45 GB), totalling 12.4 GB RSS:
Out of memory: Killed process <pid> (node) total-vm:43791120kB, anon-rss:465964kB, file-rss:2136kB, shmem-rss:0kB, UID:1001 pgtables:12680kB oom_score_adj:200
(The killed process was the gateway itself; the leaked relays survived outside its cgroup, so memory pressure persisted after the gateway auto-restarted.)
After a gateway restart, stale relays additionally logged:
[ws] ⇄ res ✗ nativeHook.invoke 20ms errorCode=INVALID_REQUEST errorMessage=native hook relay not found
OpenClaw version
2026.6.1 (2e08f0f)
Operating system
Ubuntu 24.04 LTS (aarch64 cloud VM, 18 GB RAM)
Install method
npm (global)
Model
claude-cli backend (Opus) + codex app-server harness
Provider / routing chain
gateway → CLI backends (claude-cli, codex app-server), heartbeat-driven turns
Additional provider/model setup details
No response
Logs, screenshots, and evidence
Counts/cadence: ~26 heartbeat-triggered agent turns over ~13 h produced 49 leaked relays (~2 per turn). Each leaked process held the full CLI bundle resident. Host identifiers redacted from log excerpts above.
Impact and severity
High for unattended/always-on deployments: steadily leaking a few-hundred-MB process per hook event eventually exhausts host memory. On a swapless host this presents as a full livelock (gateway unresponsive, SSH unreachable, instance still "running" at the cloud-provider level), requiring an out-of-band hard reboot.
Additional information
Workaround in use: a systemd user timer that SIGKILLs any openclaw-hooks process older than 5 minutes (legitimate relays live ~5 s), plus a cgroup memory cap on the user slice so a recurrence cannot take down the host.
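For anyone wanting to replicate the reaper half of the workaround, the selection logic can be sketched roughly as follows. This is not the exact script in use: selectStaleRelays is a hypothetical name, the systemd timer unit and the cgroup cap are not shown, and it assumes a procps-style ps that supports the etimes (elapsed seconds) column.

```javascript
"use strict";

// Hypothetical helper: given output of
//   ps -e -o pid= -o etimes= -o comm=
// return the PIDs whose command name matches and whose age exceeds maxAgeSec.
function selectStaleRelays(psOutput, maxAgeSec = 300, comm = "openclaw-hooks") {
  const stale = [];
  for (const line of psOutput.split("\n")) {
    // Columns: pid, elapsed seconds, command name.
    const match = line.trim().match(/^(\d+)\s+(\d+)\s+(.+)$/);
    if (!match) continue; // skip blank or unexpected lines
    const [, pid, etimes, name] = match;
    if (name === comm && Number(etimes) > maxAgeSec) stale.push(Number(pid));
  }
  return stale;
}

// Possible use from a script run by the timer:
// const { execFileSync } = require("node:child_process");
// const out = execFileSync(
//   "ps", ["-e", "-o", "pid=", "-o", "etimes=", "-o", "comm="],
//   { encoding: "utf8" },
// );
// for (const pid of selectStaleRelays(out)) process.kill(pid, "SIGKILL");
```

The 300-second default mirrors the 5-minute threshold above; legitimate relays live ~5 s, so the margin is wide.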