[Bug]: native hook relay CLI processes (`openclaw-hooks`) never exit and accumulate until host OOM #90993

### Bug type

Behavior bug (incorrect output/state without crash)

### Beta release blocker

No

### Summary

Native hook relay CLI invocations (`openclaw hooks relay --provider … --relay-id … --event …`, process title `openclaw-hooks`) can remain alive indefinitely after their work completes or fails. Each invocation loads the full CLI bundle (~300–450 MB RSS). On a gateway driven by periodic heartbeat agent turns, stuck relays accumulated at roughly 2 per turn until the host ran out of memory. The failure sequence was:

1. **49 stuck `openclaw-hooks` processes held 12.4 GB RSS on an 18 GB host** with no swap configured.
2. The kernel's global OOM killer killed the gateway, which had the highest `oom_score_adj`.
3. Killing the gateway freed only ~0.5 GB, because the leaked relays are independent processes.
4. The host livelocked (SSH key exchange could not complete) until a hard reboot ~40 h later.

This is distinct from #89325 (stale relay registration after restart): the stale-registration errors were also observed, but the bug here is that the **relay processes themselves never exit**, regardless of whether the gateway call succeeds, fails, or hits the stale-registration path.

### Steps to reproduce

1. Run a gateway with a CLI-backend agent (claude-cli and codex app-server harnesses in this case) and heartbeat enabled, so hook events fire regularly.
2. Let it run for several hours.
3. `ps -eo comm,rss | grep openclaw-hooks` — stuck relay processes accumulate (each ~300–450 MB RSS) instead of exiting after their ~5 s useful lifetime.

Mechanism, from the shipped bundle (2026.6.1, commit 2e08f0f), dist hooks-cli chunk, `runNativeHookRelayCli`:

1. `readStreamText(stdin)` does `for await (const chunk of stream)` with **no timeout** — if the spawning harness keeps the stdin pipe open, the relay blocks forever before any timeout logic applies.
2. The commander action sets `process.exitCode = await runNativeHookRelayCli(opts)` and returns — there is **no `process.exit()`**. Any handle still referenced after the run (e.g. the gateway WS connection from `callGateway`, or stdin still open) keeps the Node process alive even though its work is done.
3. The default `--timeout 5000` bounds only the gateway RPC, not the stdin read and not process lifetime.
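
For illustration, the unbounded read in item 1 could be bounded roughly as follows. This is a minimal sketch, not the shipped code: `readStreamTextBounded` and `timeoutMs` are hypothetical names, and it assumes the relay may treat a stalled stdin as a failure.

```javascript
// Hypothetical replacement for an unbounded stdin read; names are illustrative.
async function readStreamTextBounded(stream, timeoutMs) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`stdin read timed out after ${timeoutMs} ms`)),
      timeoutMs,
    );
  });

  // Same shape as the reported loop: collect chunks until end-of-input.
  const read = (async () => {
    const chunks = [];
    for await (const chunk of stream) chunks.push(Buffer.from(chunk));
    return Buffer.concat(chunks).toString("utf8");
  })();

  try {
    // Whichever settles first wins: end-of-input or the deadline.
    return await Promise.race([read, deadline]);
  } finally {
    clearTimeout(timer);
    // Release the handle so a pipe left open by the parent cannot keep
    // the process alive.
    stream.destroy();
  }
}
```

With this shape, a harness that never closes the pipe produces a rejected promise after `timeoutMs` instead of a relay that blocks forever.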

### Expected behavior

Relay invocations are strictly bounded: a hard process deadline (e.g. an unref'd `setTimeout(() => process.exit(124), …)` armed at action start), a bounded stdin read, and/or an explicit `process.exit(exitCode)` after flushing stdout/stderr. A relay process should never outlive its gateway timeout by more than seconds.
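
The hard deadline and explicit exit described above might look like the sketch below. It is an assumption-laden illustration rather than a proposed patch: `runBounded`, `flush`, and the `deadlineMs`/`exit` options are hypothetical names, and only the exit code 124 and the unref'd `setTimeout` come from the text above.

```javascript
// Hypothetical wrapper around a CLI action; names are illustrative.
function flush(stream) {
  // The callback of an empty write runs after earlier queued writes
  // on the same stream have been handled.
  return new Promise((resolve) => stream.write("", () => resolve()));
}

async function runBounded(action, { deadlineMs = 10_000, exit = process.exit } = {}) {
  // Hard deadline: fires even when stdin or a socket keeps the event loop
  // alive. unref() means the timer itself never holds the process open.
  const timer = setTimeout(() => exit(124), deadlineMs);
  timer.unref();

  let code;
  try {
    code = await action();
  } catch (err) {
    console.error(err);
    code = 1;
  }

  // The deadline stays armed while output is flushed.
  await flush(process.stdout);
  await flush(process.stderr);
  clearTimeout(timer);
  exit(code); // explicit exit: do not wait for lingering handles
}
```

The `exit` parameter exists only so the behavior can be exercised without terminating the calling process. A real action would presumably pass the relay entry point as `action` and leave `exit` at its default.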

### Actual behavior

Relay processes survive indefinitely. Kernel OOM task dump at the time of the kill showed 49 processes with comm `openclaw-hooks` at ~85k–117k pages each (≈0.33–0.45 GB), totalling 12.4 GB RSS:

```
Out of memory: Killed process <pid> (node) total-vm:43791120kB, anon-rss:465964kB, file-rss:2136kB, shmem-rss:0kB, UID:1001 pgtables:12680kB oom_score_adj:200
```

(The killed process was the gateway itself; the leaked relays survived outside its cgroup, so memory pressure persisted after the gateway auto-restarted.)

After a gateway restart, stale relays additionally logged:

```
[ws] ⇄ res ✗ nativeHook.invoke 20ms errorCode=INVALID_REQUEST errorMessage=native hook relay not found
```

### OpenClaw version

2026.6.1 (2e08f0f)

### Operating system

Ubuntu 24.04 LTS (aarch64 cloud VM, 18 GB RAM)

### Install method

npm (global)

### Model

claude-cli backend (Opus) + codex app-server harness

### Provider / routing chain

gateway → CLI backends (claude-cli, codex app-server), heartbeat-driven turns

### Additional provider/model setup details

_No response_

### Logs, screenshots, and evidence

Counts/cadence: ~26 heartbeat-triggered agent turns over ~13 h produced 49 leaked relays (~2 per turn). Each leaked process held the full CLI bundle resident. Host identifiers redacted from log excerpts above.

### Impact and severity

High for unattended/always-on deployments: every leaked relay pins a few hundred MB, so a steady leak eventually exhausts host memory. On a swapless host this presents as a full livelock (gateway unresponsive, SSH unreachable, instance still "running" at the cloud-provider level) that requires an out-of-band hard reboot.

### Additional information

Workaround in use: a systemd user timer that SIGKILLs any `openclaw-hooks` process older than 5 minutes (legitimate relays live ~5 s), plus a cgroup memory cap on the user slice so a recurrence cannot take down the host.
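
For anyone needing a similar stopgap, the reaper half of that workaround can be approximated as below. This is a hypothetical equivalent, not the timer unit actually in use: `selectStale` and `reapStaleRelays` are illustrative names, and the sketch assumes a procps-style `ps` that supports the `etimes` (elapsed seconds) column.

```javascript
const { execFileSync } = require("node:child_process");

// Pure helper: given `ps -e -o pid= -o etimes= -o comm=` output (one
// "pid seconds comm" row per line), return PIDs of processes named `comm`
// that are older than maxAgeSec.
function selectStale(psOutput, comm, maxAgeSec) {
  const stale = [];
  for (const line of psOutput.split("\n")) {
    const [pid, etimes, name] = line.trim().split(/\s+/);
    if (name === comm && Number(etimes) > maxAgeSec) stale.push(Number(pid));
  }
  return stale;
}

// Assumed policy from the workaround: kill relays older than 5 minutes.
function reapStaleRelays(maxAgeSec = 300) {
  const out = execFileSync(
    "ps",
    ["-e", "-o", "pid=", "-o", "etimes=", "-o", "comm="],
    { encoding: "utf8" },
  );
  for (const pid of selectStale(out, "openclaw-hooks", maxAgeSec)) {
    try {
      process.kill(pid, "SIGKILL");
    } catch {
      // The process exited between the listing and the kill.
    }
  }
}
```

Calling `reapStaleRelays()` from a periodic job gives the same effect as the timer described above. It does not replace the cgroup memory cap, which still has to be configured separately.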

