[Bug]: openclaw-agent ignores SIGTERM under cron, accumulates hung process chains and exhausts host RAM/swap #71710

### Bug type
Behavior bug (incorrect output/state without crash)

### Beta release blocker
No

### Summary
`openclaw agent` processes spawned by cron with `timeout 600 openclaw agent ...` (without `-k <delay>`) accumulate as long-lived hung chains because the agent does not exit on SIGTERM. Within ~2.5 days a multi-firing-per-day cron schedule exhausts host RAM and swap.

### Steps to reproduce
1. On a Linux host with cron, schedule any agent with a wrapper that has no SIGKILL escalation, e.g.:
   ```cron
   0 2,8,14,20 * * * timeout 600 openclaw agent --agent <X> --message "HEARTBEAT: ..." >> ~/logs/<X>.log 2>&1
   0 */2 * * *      timeout 600 openclaw agent --agent <X> --message "HEARTBEAT: ..." >> ~/logs/<X>.log 2>&1
   ```
2. Let it run for ≥48 hours. Each time the agent does something that blocks past the 600s budget (synchronous gateway call, slow tool, networked search, etc.), the wrapper sends SIGTERM at 600s — the agent does not exit.
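3. Inspect `ps -eo pid,ppid,stat,etimes,rss,cmd | grep -E 'openclaw[ -]agent' | sort -k4,4n` (`etimes` is elapsed time in seconds, so the sort is numeric by age; the pattern matches both the `openclaw agent` wrapper command lines and the `openclaw-agent` process title). You will see process trees layered as `/bin/sh -c "timeout 600 openclaw agent ..."` -> `timeout` -> `openclaw-agent`, all of them surviving across many cron firings.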
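The mechanism in step 2 can be reproduced without OpenClaw at all. The sketch below is an assumption-laden stand-in: a shell that ignores SIGTERM plays the role of the agent, and the 1s/3s durations replace the real 600s budget. It shows that coreutils `timeout` without `-k` keeps waiting for as long as the child chooses to live.

```shell
# Stand-in for the agent (hypothetical, not openclaw code): a shell that
# ignores SIGTERM and then "works" for 3 seconds. Ignored signals are
# inherited, so the sleep child ignores SIGTERM as well.
start=$(date +%s)
timeout 1 sh -c 'trap "" TERM; sleep 3'
status=$?
elapsed=$(( $(date +%s) - start ))

# timeout sent SIGTERM at 1s, but it only returned once the child exited on
# its own. Status 124 means "the time limit was hit", not "the child is gone
# on schedule".
echo "timeout exit status: $status, elapsed: ${elapsed}s"
```

With a child that never finishes (a blocked gateway call, for example), the same wrapper would wait indefinitely, which matches the layered chains described in step 3.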

### Expected behavior
One of:
- (a) The agent installs an async-safe SIGTERM handler that drains in-flight work and exits within a small bounded grace window (e.g. ≤30s), or
- (b) The shipped CLI/cron documentation explicitly requires `timeout -k <delay> <duration>` (or equivalent) for any cron-spawned agent and warns about the leak otherwise.

Either is acceptable; (a) is the proper fix, (b) is the minimal documentation fix that prevents new users from being burned.

### Actual behavior
Real incident timeline observed on this host:

- **2026-04-17 17:34 UTC** — first occurrence. 14 hung `openclaw-agent` processes accumulated across two cron schedules (4×/day karma + 12×/day intel). Swap 96%, RAM 4.0 GiB used / 11 GiB available; Opus turn failed because the local Ollama RAM-fallback couldn't allocate 13.7 GiB.
- **Initial mitigation (same day)** — wrapped both cron lines in `timeout 300 openclaw agent ...`. This stopped *future* indefinite growth but did not prevent SIGTERM-swallowed leaks beyond the 300s budget. Eight days later the chains were back, just slower.
- **2026-04-25 00:14 UTC** — second occurrence after we extended the budget to `timeout 600`. 23 hung chains accumulated over 2.5 days, RSS ~7 GiB combined, swap 100% (4.0 / 4.0 GiB), available 3.6 GiB. RAM used 11 / 15 GiB. Gateway PID was at risk of OOM-kill.

Process layering observed (from `ps -ef` and `pstree`):
```
/bin/sh -c "timeout 600 openclaw agent --agent social-director --message ..."
 └── timeout 600 openclaw agent --agent ... (waiting for child after SIGTERM at t=600s)
      └── openclaw-agent (still alive, RSS 40-730 MB depending on age)
```
`pkill -TERM` on `openclaw-agent` had no effect — only `pkill -9` (SIGKILL) freed the chain. Cleanup required two passes:
1. `pkill -9 -f "/bin/sh -c timeout 600 openclaw agent --agent social-director"` — outer shells
2. `pkill -9 -f "^openclaw-agent" --older 600` — orphaned agents reparented to init

After cleanup: processes 69 → 0 (23 chains × 3 processes each), RAM 11 → 4.6 GiB, swap 4.0 GiB → 374 MiB. Gateway untouched, uptime preserved.
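Before running a cleanup like the one above, it can help to size the leak first. The following is a minimal sketch, not part of OpenClaw: `summarize_agents` is a hypothetical helper, and the sample rows are synthetic values for illustration, not measurements from this incident. It assumes procps `ps`, where `etimes` is elapsed seconds and `rss` is reported in KiB.

```shell
# summarize_agents MIN_AGE
# Reads "etimes rss cmd..." rows on stdin and reports how many openclaw-agent
# processes are older than MIN_AGE seconds, plus their combined RSS.
summarize_agents() {
  awk -v min_age="$1" '
    $3 ~ /^openclaw-agent/ && $1 > min_age { n++; kib += $2 }
    END { printf "%d hung, %.1f MiB RSS\n", n, kib / 1024 }
  '
}

# Synthetic sample rows (age in seconds, RSS in KiB, command line).
sample='7200 409600 openclaw-agent
120 51200 openclaw-agent
9000 2048 timeout 600 openclaw agent --agent social-director'

summary=$(printf '%s\n' "$sample" | summarize_agents 600)
echo "$summary"
```

On a live host the same function could be fed from `ps -eo etimes=,rss=,cmd= | summarize_agents 600`; the trailing `=` on each column suppresses the header row.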

### OpenClaw version
2026.4.23 (incident reproduced on the same version family across 2026.4.17–2026.4.23)

### Operating system
Ubuntu 24.04.4 LTS (kernel 6.8.0-107-generic), systemd 255

### Install method
npm global (`/home/ubuntu/.npm-global/lib/node_modules/openclaw`), Node v22.22.1

### Model
claude-cli/claude-opus-4-7 (the agent under cron is `social-director` running on this primary)

### Provider / routing chain
cron -> /bin/sh -> coreutils `timeout` -> `openclaw agent` (CLI) -> openclaw gateway (systemd --user) -> auth-profile `anthropic:claude-cli` -> claude (CLI) -> anthropic.com

### Additional provider/model setup details
Workaround currently in production:
```cron
0 2,8,14,20 * * * timeout -k 60 600 openclaw agent --agent social-director --message "..." >> ~/logs/social-director.log 2>&1
0 */2 * * *      timeout -k 60 600 openclaw agent --agent social-director --message "..." >> ~/logs/social-director.log 2>&1
```
The `-k 60` flag tells coreutils-`timeout` to send SIGKILL 60s after the SIGTERM if the child hasn't exited. This *mitigates* the symptom but does not address the root cause — the agent still ignores SIGTERM, which means any other supervisor that doesn't escalate to SIGKILL has the same leak.
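The escalation can be checked in isolation with a stand-in child. This is a sketch under the same assumptions as the workaround: coreutils `timeout`, with a SIGTERM-ignoring shell in place of the agent and 1s values in place of `-k 60 600`.

```shell
# Stand-in for the agent (hypothetical): ignores SIGTERM and would otherwise
# run for 30 seconds.
start=$(date +%s)
timeout -k 1 1 sh -c 'trap "" TERM; sleep 30'
status=$?
elapsed=$(( $(date +%s) - start ))

# SIGTERM goes out at 1s and is ignored; SIGKILL follows 1s later and cannot
# be ignored. coreutils documents exit status 137 (128 + 9) for this case
# instead of the usual 124.
echo "timeout exit status: $status, elapsed: ${elapsed}s"
```

A cron wrapper that logs or alerts on exit status can therefore tell a clean timeout (124) from a forced kill (137).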

### Suggested direction
- **Primary:** install an async-safe SIGTERM handler on the agent process that:
  - cancels any pending synchronous gateway HTTP call,
  - flushes the current message/log buffers,
  - exits within a bounded grace window (≤30s default, configurable).
- **Secondary:** document, in the CLI/cron docs, that `timeout -k <delay> <duration>` (or an equivalent supervisor that escalates to SIGKILL, e.g. `systemd-run --wait -p RuntimeMaxSec=... -p KillMode=mixed`) is required for cron-spawned agents until the primary fix lands.
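The contract described under **Primary** can be illustrated with a shell analogue. This is only a sketch of the expected behavior, not OpenClaw code: the agent itself runs on Node, and the worker script, the `on_term` function, and the `sleep 30` stand-in for a blocking gateway call are all hypothetical.

```shell
# Write a stand-in worker that handles SIGTERM instead of ignoring it.
script=$(mktemp)
cat > "$script" <<'EOF'
sleep 30 &                       # stand-in for a blocking gateway call
inflight=$!
on_term() {
  kill "$inflight" 2>/dev/null   # cancel the in-flight work
  echo "flushed buffers"         # stand-in for flushing message/log buffers
  exit 143                       # 128 + 15: conventional "terminated by SIGTERM"
}
trap on_term TERM
wait "$inflight"                 # wait is interrupted when the trap fires
EOF

start=$(date +%s)
output=$(timeout -k 5 1 sh "$script")
status=$?
elapsed=$(( $(date +%s) - start ))
rm -f "$script"

# The worker exits right after SIGTERM, so the -k escalation never fires.
# timeout still reports 124 because the limit was hit; the worker's own 143
# would be visible with --preserve-status.
echo "output: $output, status: $status, elapsed: ${elapsed}s"
```

A process that behaves this way is safe under any supervisor that sends SIGTERM, whether or not it escalates to SIGKILL.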

Related signal-handling / supervisor issues that touch the same surface but from different angles:
- #66399 — `Process supervisor: graceful signal escalation and drain timeout for exec tool` — directly relevant; same direction of fix needed for the `agent` CLI entry-point.
- #70026 — `Supervisor sends SIGKILL instead of SIGTERM for long-running agents — causes session lock cascade` — opposite end of the same problem (supervisor side).
- #65650 — `feat: wire SQLite message store into active gateway for SIGTERM resilience` — orthogonal; addresses message persistence across restarts.

---
*Reported by @nikolaykazakovvs-ux via Cognitor (claude-opus-4-7 substrate).*

