[Bug]: native hook relay CLI processes (openclaw-hooks) never exit and accumulate until host OOM #90993

@clem-git

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Summary

Native hook relay CLI invocations (openclaw hooks relay --provider … --relay-id … --event …, process title openclaw-hooks) can remain alive indefinitely after their work completes or fails. Each invocation loads the full CLI bundle (~300–450 MB RSS). On a gateway driven by periodic heartbeat agent turns, stuck relays accumulated at roughly 2 per turn until the host ran out of memory. The failure chain was:

  1. 49 stuck openclaw-hooks processes held 12.4 GB RSS on an 18 GB host with no swap configured.
  2. The kernel's global OOM killer killed the gateway (highest oom_score_adj).
  3. Killing the gateway freed only ~0.5 GB, because the leaked relays are independent processes.
  4. The host livelocked (SSH key exchange could not complete) until a hard reboot ~40 h later.

This is distinct from #89325 (stale relay registration after restart): the stale-registration errors were also observed, but the bug here is that the relay processes themselves never exit, regardless of whether the gateway call succeeds, fails, or hits the stale-registration path.

Steps to reproduce

  1. Run a gateway with a CLI-backend agent (claude-cli and codex app-server harnesses in this case) and heartbeat enabled, so hook events fire regularly.
  2. Let it run for several hours.
  3. Run ps -eo comm,rss | grep openclaw-hooks. Stuck relay processes accumulate (each ~300–450 MB RSS) instead of exiting after their ~5 s useful lifetime.

Mechanism, from the shipped bundle (2026.6.1, commit 2e08f0f), dist hooks-cli chunk, runNativeHookRelayCli:

  1. readStreamText(stdin) iterates with for await (const chunk of stream) and has no timeout. If the spawning harness keeps the stdin pipe open, the relay blocks forever before any timeout logic applies.
  2. The commander action sets process.exitCode = await runNativeHookRelayCli(opts) and returns without calling process.exit(). Any handle still referenced after the run (e.g. the gateway WS connection from callGateway, or stdin still open) keeps the Node process alive even though its work is done.
  3. The default --timeout 5000 bounds only the gateway RPC, not the stdin read and not process lifetime.
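
To illustrate point 1, here is a minimal sketch of a stdin read that gives up after a deadline. It is not the project's code: readStreamTextBounded and its timeoutMs parameter are hypothetical names, and the sketch assumes a plain (non-object-mode) Node.js Readable such as process.stdin.

```javascript
// Hypothetical replacement for an unbounded `for await` read: collect the
// stream into a string, but reject if it has not ended within `timeoutMs`.
function readStreamTextBounded(stream, timeoutMs) {
  return new Promise((resolve, reject) => {
    const chunks = [];

    const onData = (chunk) => chunks.push(Buffer.from(chunk));
    const onEnd = () => {
      cleanup();
      resolve(Buffer.concat(chunks).toString("utf8"));
    };
    const onError = (err) => {
      cleanup();
      reject(err);
    };

    // If the writer never closes the pipe, stop waiting and destroy the
    // stream so the open handle cannot keep the process alive.
    const timer = setTimeout(() => {
      cleanup();
      stream.destroy();
      reject(new Error(`stdin read timed out after ${timeoutMs} ms`));
    }, timeoutMs);

    function cleanup() {
      clearTimeout(timer);
      stream.off("data", onData);
      stream.off("end", onEnd);
      stream.off("error", onError);
    }

    stream.on("data", onData);
    stream.on("end", onEnd);
    stream.on("error", onError);
  });
}
```

A caller would use something like await readStreamTextBounded(process.stdin, 5000) and treat a rejection as a failed relay run. Destroying the stream on timeout matters as much as rejecting, since an open stdin handle is one of the things that can keep the process alive.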

Expected behavior

Relay invocations are strictly bounded: a hard process deadline (e.g. an unref'd setTimeout(() => process.exit(124), …) armed at action start), a bounded stdin read, and/or an explicit process.exit(exitCode) after flushing stdout/stderr. A relay process should never outlive its gateway timeout by more than seconds.
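
A rough sketch of the two process-level guards described above, offered as one possible shape rather than the intended fix. armProcessDeadline and exitAfterFlush are hypothetical helper names, and the exit parameter exists only so the helpers can be exercised without terminating the caller.

```javascript
// Arm a hard deadline for the whole process. The timer is unref'd, so it
// never keeps an otherwise finished process alive; it only fires if some
// other handle (open stdin, a WS connection) is still holding the loop open.
function armProcessDeadline(deadlineMs, exit = process.exit) {
  const timer = setTimeout(() => exit(124), deadlineMs);
  timer.unref();
  return timer;
}

// Exit explicitly instead of relying on process.exitCode plus an empty
// event loop. The zero-length writes are used as a flush barrier: each
// callback runs once earlier writes on that stream have been handled.
function exitAfterFlush(code, exit = process.exit) {
  let pending = 2;
  const done = () => {
    pending -= 1;
    if (pending === 0) exit(code);
  };
  process.stdout.write("", done);
  process.stderr.write("", done);
}

// Hypothetical wiring inside the CLI action (runNativeHookRelayCli is the
// function named in this report; the 10 s deadline is an assumed value):
//
//   armProcessDeadline(10_000);
//   const code = await runNativeHookRelayCli(opts);
//   exitAfterFlush(code);
```

The deadline covers the case where the action never returns (for example, a blocked stdin read), while the explicit exit covers the case where it returns but a referenced handle remains.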

Actual behavior

Relay processes survive indefinitely. Kernel OOM task dump at the time of the kill showed 49 processes with comm openclaw-hooks at ~85k–117k pages each (≈0.33–0.45 GB), totalling 12.4 GB RSS:

Out of memory: Killed process <pid> (node) total-vm:43791120kB, anon-rss:465964kB, file-rss:2136kB, shmem-rss:0kB, UID:1001 pgtables:12680kB oom_score_adj:200

(The killed process was the gateway itself; the leaked relays survived outside its cgroup, so memory pressure persisted after the gateway auto-restarted.)

After a gateway restart, stale relays additionally logged:

[ws] ⇄ res ✗ nativeHook.invoke 20ms errorCode=INVALID_REQUEST errorMessage=native hook relay not found

OpenClaw version

2026.6.1 (2e08f0f)

Operating system

Ubuntu 24.04 LTS (aarch64 cloud VM, 18 GB RAM)

Install method

npm (global)

Model

claude-cli backend (Opus) + codex app-server harness

Provider / routing chain

gateway → CLI backends (claude-cli, codex app-server), heartbeat-driven turns

Additional provider/model setup details

No response

Logs, screenshots, and evidence

Counts/cadence: ~26 heartbeat-triggered agent turns over ~13 h produced 49 leaked relays (~2 per turn). Each leaked process held the full CLI bundle resident. Host identifiers redacted from log excerpts above.

Impact and severity

High for unattended/always-on deployments: steadily leaking a few-hundred-MB process per hook event eventually exhausts host memory. On a swapless host this presents as a full livelock (gateway unresponsive, SSH unreachable, instance still "running" at the cloud-provider level), requiring an out-of-band hard reboot.

Additional information

Workaround in use: a systemd user timer that SIGKILLs any openclaw-hooks process older than 5 minutes (legitimate relays live ~5 s), plus a cgroup memory cap on the user slice so a recurrence cannot take down the host.
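
For anyone who needs a similar stopgap, the reaper half of this workaround could look roughly like the sketch below. This is not the script actually in use: selectStaleRelays and reapStaleRelays are hypothetical names, and the sketch assumes a Linux host whose procps ps supports the etimes (elapsed seconds) column.

```javascript
const { execFileSync } = require("node:child_process");

// Parse `ps -eo pid=,etimes=,comm=` output and return the PIDs of processes
// whose command name matches `comm` and whose age exceeds `maxAgeSeconds`.
function selectStaleRelays(psOutput, maxAgeSeconds, comm = "openclaw-hooks") {
  const stale = [];
  for (const line of psOutput.split("\n")) {
    const match = line.trim().match(/^(\d+)\s+(\d+)\s+(.+)$/);
    if (!match) continue; // blank or unparseable line
    const [, pid, elapsed, name] = match;
    if (name === comm && Number(elapsed) > maxAgeSeconds) {
      stale.push(Number(pid));
    }
  }
  return stale;
}

// SIGKILL every relay older than the threshold (default: 5 minutes).
// Intended to be run periodically, e.g. from a timer unit.
function reapStaleRelays(maxAgeSeconds = 300) {
  const out = execFileSync("ps", ["-eo", "pid=,etimes=,comm="], {
    encoding: "utf8",
  });
  for (const pid of selectStaleRelays(out, maxAgeSeconds)) {
    try {
      process.kill(pid, "SIGKILL");
    } catch {
      // Process already exited or belongs to another user; ignore.
    }
  }
}
```

Keeping the selection logic separate from the kill makes it easy to dry-run against captured ps output before letting it send signals.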

Labels

P1 (High-priority user-facing bug, regression, or broken workflow.)
clawsweeper:needs-live-repro (ClawSweeper needs live local, crabbox, or manual validation to confirm this issue.)
impact:crash-loop (Crash, hang, restart loop, or process-level availability failure.)
issue-rating: 🐚 platinum hermit (Good issue quality with a plausible reproduction path needing some confirmation.)
