Bug type
Behavior bug (incorrect output/state without crash)
Beta release blocker
No
Summary
openclaw agent processes spawned by cron with timeout 600 openclaw agent ... (without -k <delay>) accumulate as long-lived hung chains because the agent does not exit on SIGTERM. Within ~2.5 days a multi-firing-per-day cron schedule exhausts host RAM and swap.
Steps to reproduce
- On a Linux host with cron, schedule any agent with a wrapper that has no SIGKILL escalation, e.g.:
0 2,8,14,20 * * * timeout 600 openclaw agent --agent <X> --message "HEARTBEAT: ..." >> ~/logs/<X>.log 2>&1
0 */2 * * * timeout 600 openclaw agent --agent <X> --message "HEARTBEAT: ..." >> ~/logs/<X>.log 2>&1
- Let it run for ≥48 hours. Each time the agent does something that blocks past the 600s budget (synchronous gateway call, slow tool, networked search, etc.), the wrapper sends SIGTERM at 600s — the agent does not exit.
- Inspect with ps -eo pid,ppid,stat,etime,rss,cmd | grep openclaw-agent | sort -k4. You will see process trees layered as /bin/sh -c "timeout 600 openclaw agent ..." -> timeout -> openclaw-agent, all of them surviving across many cron firings.
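The inspection step can be narrowed to the processes that actually outlived their budget. The filter below is a minimal sketch, not part of OpenClaw: the function name hung_agents is hypothetical, and it assumes the agent's command line starts with openclaw-agent, as the grep pattern in the step above implies. It reads the procps etimes column (elapsed seconds) rather than etime, because the [[dd-]hh:]mm:ss format of etime does not compare reliably as text.

```shell
#!/bin/sh
# hung_agents MIN_AGE: read `ps -eo pid,ppid,etimes,rss,args` output on stdin
# and print "pid elapsed_seconds rss_kib" for every openclaw-agent process
# that has been alive longer than MIN_AGE seconds.
hung_agents() {
  awk -v min="$1" '
    NR == 1 { next }                                  # skip the ps header row
    $3 + 0 > min + 0 && $5 ~ /^openclaw-agent/ {      # $3 = etimes, $5 = start of args
      print $1, $3, $4
    }
  '
}
```

Running ps -eo pid,ppid,etimes,rss,args | hung_agents 600 would then list only agents older than the 600s budget used in the cron lines.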
Expected behavior
One of:
- (a) The agent installs an async-safe SIGTERM handler that drains in-flight work and exits within a small bounded grace window (e.g. ≤30s), or
- (b) The shipped CLI/cron documentation explicitly requires timeout -k <delay> <duration> (or equivalent) for any cron-spawned agent and warns about the leak otherwise.
Either is acceptable; (a) is the proper fix, (b) is the minimal documentation fix that prevents new users from being burned.
Actual behavior
Real incident timeline observed on this host:
- 2026-04-17 17:34 UTC: first occurrence. 14 hung openclaw-agent processes accumulated across two cron schedules (4×/day karma + 12×/day intel). Swap 96%, RAM 4.0 GiB used / 11 GiB available; Opus turn failed because the local Ollama RAM-fallback couldn't allocate 13.7 GiB.
- Initial mitigation (same day): wrapped both cron lines in timeout 300 openclaw agent .... This slowed the growth but did not prevent leaks from agents that swallowed the SIGTERM sent at 300s; eight days later the chains were back.
- 2026-04-25 00:14 UTC: second occurrence, after we extended the budget to timeout 600. 23 hung chains accumulated over 2.5 days, RSS ~7 GiB combined, swap 100% (4.0 / 4.0 GiB), RAM 11 / 15 GiB used, 3.6 GiB available. Gateway PID was at risk of OOM-kill.
Process layering observed (from ps -ef and pstree):
/bin/sh -c "timeout 600 openclaw agent --agent social-director --message ..."
└── timeout 600 openclaw agent --agent ... (waiting for child after SIGTERM at t=600s)
└── openclaw-agent (still alive, RSS 40-730 MB depending on age)
pkill -TERM on openclaw-agent had no effect; only pkill -9 (SIGKILL) freed the chain. Cleanup required two passes:
- pkill -9 -f "/bin/sh -c timeout 600 openclaw agent --agent social-director" (outer shells)
- pkill -9 -f "^openclaw-agent" --older 600 (orphaned agents reparented to init)
After cleanup: processes 69 → 0 (consistent with the 23 chains at three processes each), RAM 11 → 4.6 GiB, swap 4.0 GiB → 374 MiB. Gateway untouched, uptime preserved.
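The escalation used during cleanup (SIGTERM first, SIGKILL only when that fails) can be sketched as a reusable helper. This is an illustration under stated assumptions, not the procedure that was run on the host: term_then_kill is a hypothetical name, and it takes a single PID rather than a pkill pattern.

```shell
#!/bin/sh
# term_then_kill PID GRACE: send SIGTERM, poll for up to GRACE seconds,
# then escalate to SIGKILL if the process is still present.
# Note: a process that has exited but has not yet been reaped by its parent
# still counts as present for `kill -0`.
term_then_kill() {
  pid=$1
  grace=$2
  kill -TERM "$pid" 2>/dev/null || return 0    # no such process: nothing to do
  while [ "$grace" -gt 0 ]; do
    kill -0 "$pid" 2>/dev/null || return 0     # process went away after SIGTERM
    sleep 1
    grace=$((grace - 1))
  done
  kill -KILL "$pid" 2>/dev/null || return 0    # it may have exited in the last second
}
```

Giving a well-behaved process the grace period first keeps the SIGKILL path for processes that, like the agent in this report, do not react to SIGTERM.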
OpenClaw version
2026.4.23 (incident reproduced on the same version family across 2026.4.17–2026.4.23)
Operating system
Ubuntu 24.04.4 LTS (kernel 6.8.0-107-generic), systemd 255
Install method
npm global (/home/ubuntu/.npm-global/lib/node_modules/openclaw), Node v22.22.1
Model
claude-cli/claude-opus-4-7 (the agent under cron is social-director running on this primary)
Provider / routing chain
cron -> /bin/sh -> coreutils timeout -> openclaw agent (CLI) -> openclaw gateway (systemd --user) -> auth-profile anthropic:claude-cli -> claude (CLI) -> anthropic.com
Additional provider/model setup details
Workaround currently in production:
0 2,8,14,20 * * * timeout -k 60 600 openclaw agent --agent social-director --message "..." >> ~/logs/social-director.log 2>&1
0 */2 * * * timeout -k 60 600 openclaw agent --agent social-director --message "..." >> ~/logs/social-director.log 2>&1
The -k 60 flag tells coreutils timeout to send SIGKILL 60s after the SIGTERM if the child hasn't exited. This mitigates the symptom but does not address the root cause: the agent still ignores SIGTERM, so any other supervisor that doesn't escalate to SIGKILL has the same leak.
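The difference between the two wrappers can be reproduced without OpenClaw. The sketch below substitutes a stand-in for the agent (an assumption: a shell that ignores SIGTERM and then execs sleep 4) and uses short durations so it finishes in a few seconds; the helper name elapsed is hypothetical.

```shell
#!/bin/sh
# Stand-in for the agent (not the real binary): ignore SIGTERM, then block
# for 4 seconds. The ignored disposition survives the exec into sleep.
STUBBORN='trap "" TERM; exec sleep 4'

# elapsed CMD...: run CMD and print how many whole seconds it took.
elapsed() {
  start=$(date +%s)
  "$@" >/dev/null 2>&1
  end=$(date +%s)
  echo $((end - start))
}

# Without -k: timeout sends SIGTERM after 1s, the child ignores it, and
# timeout keeps waiting until the child finishes on its own.
without_k=$(elapsed timeout 1 sh -c "$STUBBORN")

# With -k 1: SIGKILL follows 1s after the ignored SIGTERM.
with_k=$(elapsed timeout -k 1 1 sh -c "$STUBBORN")

echo "without -k: ${without_k}s, with -k: ${with_k}s"
```

In the first case the wrapper cannot return before the child does, which is the mechanism behind the accumulating chains; in the second the run is bounded by duration plus delay.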
Suggested direction
- Primary: install an async-safe SIGTERM handler on the agent process that:
- cancels any pending synchronous gateway HTTP call,
- flushes the current message/log buffers,
- exits within a bounded grace window (≤30s default, configurable).
- Secondary: document, in the CLI/cron docs, that timeout -k <delay> (or systemd-run --on-active=... --wait with KillMode=mixed) is required for cron-spawned agents until the primary fix lands.
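The shape of the primary fix can be illustrated with a shell stand-in. This is only a sketch of the pattern: the agent itself runs on Node and would use that runtime's signal API, graceful_worker is a hypothetical name, sleep 30 stands in for a blocking gateway call, and the configurable grace window is not modelled because the handler exits as soon as it has cancelled the work.

```shell
#!/bin/sh
# Hypothetical stand-in for an agent process with a SIGTERM handler.
graceful_worker() {
  # On SIGTERM: cancel the in-flight call, report the flush, exit promptly.
  # 143 = 128 + 15, the conventional status for termination by SIGTERM.
  trap 'kill "$work" 2>/dev/null; echo "flushed on SIGTERM" >&2; exit 143' TERM

  sleep 30 &        # stands in for a blocking gateway call
  work=$!
  # `wait` is interruptible, so the trap runs as soon as SIGTERM arrives
  # rather than after the blocking call finishes.
  wait "$work"
}
```

Running the blocking call in the background and waiting on it is what lets the handler fire on time; a shell that runs the call in the foreground only runs its trap after the call returns.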
Related signal-handling / supervisor issues that touch the same surface but from different angles:
- "Process supervisor: graceful signal escalation and drain timeout for exec tool": directly relevant; same direction of fix needed for the agent CLI entry-point.
- "Supervisor sends SIGKILL instead of SIGTERM for long-running agents — causes session lock cascade": opposite end of the same problem (supervisor side).
- "feat: wire SQLite message store into active gateway for SIGTERM resilience": orthogonal; addresses message persistence across restarts.
Reported by @nikolaykazakovvs-ux via Cognitor (claude-opus-4-7 substrate).