[Bug]: openclaw-agent ignores SIGTERM under cron, accumulates hung process chains and exhausts host RAM/swap #71710

@nikolaykazakovvs-ux

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Summary

openclaw agent processes spawned by cron as timeout 600 openclaw agent ... (without -k <delay>) accumulate as long-lived hung process chains because the agent does not exit on SIGTERM. With a cron schedule that fires several times a day, these chains exhaust host RAM and swap within ~2.5 days.

Steps to reproduce

  1. On a Linux host with cron, schedule any agent with a wrapper that has no SIGKILL escalation, e.g.:
    0 2,8,14,20 * * * timeout 600 openclaw agent --agent <X> --message "HEARTBEAT: ..." >> ~/logs/<X>.log 2>&1
    0 */2 * * *      timeout 600 openclaw agent --agent <X> --message "HEARTBEAT: ..." >> ~/logs/<X>.log 2>&1
    
  2. Let it run for ≥48 hours. Whenever the agent blocks past the 600s budget (synchronous gateway call, slow tool, networked search, etc.), the wrapper sends SIGTERM at 600s, but the agent does not exit.
  3. Inspect ps -eo pid,ppid,stat,etime,rss,cmd | grep -E 'openclaw[ -]agent' | sort -k4 (the pattern matches both the openclaw-agent process and the "openclaw agent" wrapper command lines). You will see process trees layered as /bin/sh -c "timeout 600 openclaw agent ..." -> timeout -> openclaw-agent, all of them surviving across many cron firings.
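
The wrapper behavior in step 2 can be reproduced in a few seconds without OpenClaw. The sketch below is only an illustration: the sh -c stand-in (an assumption, not the real agent) ignores SIGTERM and blocks for 4 seconds, while timeout is given a 1 second budget and no -k.

```shell
# Stand-in for the agent (hypothetical): ignores SIGTERM and blocks for 4s.
# The real agent is a Node process; only the signal behavior is mimicked here.
start=$(date +%s)
timeout 1 sh -c 'trap "" TERM; sleep 4'
status=$?
elapsed=$(( $(date +%s) - start ))

# timeout reports 124 (timed out), but only after the child finished on its
# own schedule, well past the 1s budget.
echo "exit status: $status, elapsed: ${elapsed}s"
```

Because timeout has no escalation here, it keeps waiting for the child after sending SIGTERM. With a child that never returns, the wrapper and the /bin/sh above it would stay alive indefinitely, which matches the layering described in step 3.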

Expected behavior

One of:

  • (a) The agent installs an async-safe SIGTERM handler that drains in-flight work and exits within a small bounded grace window (e.g. ≤30s), or
  • (b) The shipped CLI/cron documentation explicitly requires timeout -k <delay> <duration> (or equivalent) for any cron-spawned agent and warns about the leak otherwise.

Either is acceptable; (a) is the proper fix, (b) is the minimal documentation fix that prevents new users from being burned.

Actual behavior

Real incident timeline observed on this host:

  • 2026-04-17 17:34 UTC — first occurrence. 14 hung openclaw-agent processes accumulated across two cron schedules (4×/day karma + 12×/day intel). Swap 96%, RAM 4.0 GiB used / 11 GiB available; Opus turn failed because the local Ollama RAM-fallback couldn't allocate 13.7 GiB.
  • Initial mitigation (same day): wrapped both cron lines in timeout 300 openclaw agent .... This was meant to cap each run, but it did not stop the leak: whenever a run exceeded the 300s budget, the agent swallowed the SIGTERM and stayed alive. Eight days later the chains were back, just slower.
  • 2026-04-25 00:14 UTC — second occurrence after we extended the budget to timeout 600. 23 hung chains accumulated over 2.5 days, RSS ~7 GiB combined, swap 100% (4.0 / 4.0 GiB), available 3.6 GiB. RAM used 11 / 15 GiB. Gateway PID was at risk of OOM-kill.

Process layering observed (from ps -ef and pstree):

/bin/sh -c "timeout 600 openclaw agent --agent social-director --message ..."
 └── timeout 600 openclaw agent --agent ... (waiting for child after SIGTERM at t=600s)
      └── openclaw-agent (still alive, RSS 40-730 MB depending on age)

pkill -TERM on openclaw-agent had no effect — only pkill -9 (SIGKILL) freed the chain. Cleanup required two passes:

  1. pkill -9 -f "/bin/sh -c timeout 600 openclaw agent --agent social-director" — outer shells
  2. pkill -9 -f "^openclaw-agent" --older 600 — orphaned agents reparented to init

After cleanup: chains 69 → 0, RAM 11 → 4.6 GiB, swap 4.0 GiB → 374 MiB. Gateway untouched, uptime preserved.
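
Before the second pass, a dry run that lists which agent PIDs are older than the budget can help avoid killing a healthy in-flight run. The helper below is a sketch under assumptions: older_than is a hypothetical name, and the filter assumes the agent's command line starts with openclaw-agent, as in the ps output above.

```shell
# older_than SECONDS
# Reads "PID ELAPSED_SECONDS ARGS..." lines on stdin and prints the PIDs of
# openclaw-agent processes that have been running longer than SECONDS.
older_than() {
  awk -v min="$1" '$2 + 0 > min && $3 ~ /^openclaw-agent/ { print $1 }'
}

# Usage on a live host (etimes is elapsed time in seconds, procps-ng ps):
#   ps -eo pid=,etimes=,args= | older_than 600
```

The printed PIDs can be reviewed first and only then passed to kill -9.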

OpenClaw version

2026.4.23 (incident reproduced on the same version family across 2026.4.17–2026.4.23)

Operating system

Ubuntu 24.04.4 LTS (kernel 6.8.0-107-generic), systemd 255

Install method

npm global (/home/ubuntu/.npm-global/lib/node_modules/openclaw), Node v22.22.1

Model

claude-cli/claude-opus-4-7 (the agent under cron is social-director running on this primary)

Provider / routing chain

cron -> /bin/sh -> coreutils timeout -> openclaw agent (CLI) -> openclaw gateway (systemd --user) -> auth-profile anthropic:claude-cli -> claude (CLI) -> anthropic.com

Additional provider/model setup details

Workaround currently in production:

0 2,8,14,20 * * * timeout -k 60 600 openclaw agent --agent social-director --message "..." >> ~/logs/social-director.log 2>&1
0 */2 * * *      timeout -k 60 600 openclaw agent --agent social-director --message "..." >> ~/logs/social-director.log 2>&1

The -k 60 flag tells coreutils timeout to send SIGKILL 60s after the SIGTERM if the child has not exited. This mitigates the symptom but does not address the root cause: the agent still ignores SIGTERM, so any other supervisor that does not escalate to SIGKILL has the same leak.
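
The escalation can be checked in isolation with a stand-in that ignores SIGTERM. This is a minimal sketch: the sh -c child is an assumption standing in for the agent, and the durations are shortened to 1s plus a 1s kill delay.

```shell
# Stand-in for the agent (hypothetical): ignores SIGTERM, would block for 30s.
start=$(date +%s)
timeout -k 1 1 sh -c 'trap "" TERM; sleep 30'
status=$?
elapsed=$(( $(date +%s) - start ))

# With -k, SIGKILL follows 1s after the ignored SIGTERM, so the run ends after
# about 2s. coreutils timeout reports 137 (128 + 9) when KILL was needed.
echo "exit status: $status, elapsed: ${elapsed}s"
```

An exit status of 137 rather than 124 in the cron log is therefore a usable signal that a run had to be force-killed.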

Suggested direction

  • Primary: install an async-safe SIGTERM handler on the agent process that:
    • cancels any pending synchronous gateway HTTP call,
    • flushes the current message/log buffers,
    • exits within a bounded grace window (≤30s default, configurable).
  • Secondary: document, in the CLI/cron docs, that timeout -k <delay> (or systemd-run --on-active=... --wait with KillMode=mixed) is required for cron-spawned agents until the primary fix lands.
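
The contract proposed in the primary item can be illustrated with a shell stand-in. This is only a sketch of the intended behavior, not the agent's implementation (the agent is a Node process): the sleep is a placeholder for the blocking gateway call, and on_term is a hypothetical handler name.

```shell
#!/usr/bin/env bash
# Stand-in for the agent (hypothetical): blocks on "work" but honors SIGTERM.
standin='
  on_term() {
    kill "$work_pid" 2>/dev/null   # cancel the in-flight blocking call
    echo "flushing buffers" >&2    # placeholder for flushing message/log buffers
    exit 143                       # 128 + 15 (SIGTERM)
  }
  trap on_term TERM
  sleep 600 &                      # placeholder for a synchronous gateway call
  work_pid=$!
  wait "$work_pid"                 # wait returns early when a trapped signal arrives
'

start=$(date +%s)
timeout 1 bash -c "$standin"
status=$?
elapsed=$(( $(date +%s) - start ))
echo "exit status: $status, elapsed: ${elapsed}s"
```

With a handler like this, a plain timeout without -k is sufficient: the wrapper reports 124 and the whole chain exits about one second after the budget expires. The blocking work runs in the background because a shell defers trap handlers until a foreground child returns.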

Related signal-handling / supervisor issues that touch the same surface but from different angles:


Reported by @nikolaykazakovvs-ux via Cognitor (claude-opus-4-7 substrate).

Metadata

    Labels

      • P1: High-priority user-facing bug, regression, or broken workflow.
      • clawsweeper:linked-pr-open: ClawSweeper found an open linked pull request for this issue.
      • clawsweeper:no-new-fix-pr: ClawSweeper does not recommend queueing a new automated fix PR for this issue.
      • clawsweeper:source-repro: ClawSweeper found a high-confidence source-level issue reproduction.
      • impact:crash-loop: Crash, hang, restart loop, or process-level availability failure.
      • issue-rating: 🦞 diamond lobster: Very strong issue quality with high-confidence source-level or clear reproduction.
