Bug Report: Supervisor Sends SIGKILL Instead of SIGTERM for Long-Running Agents
Summary
When an agent run lasts longer than roughly 60-90 seconds, the OpenClaw supervisor sends SIGKILL instead of SIGTERM to terminate the process. Because SIGKILL cannot be handled, the process skips all cleanup (including session lock removal), which directly causes the session lock issue reported in #70004.
Environment
- OpenClaw Version: v2026.4.20 (115f05d)
- OS: macOS 15.4.1 (Darwin 25.4.0 arm64)
- Node.js: v25.8.1
Observed Behavior
Pattern 1: Agent Runs Killed After ~60-90s
# Long-running agent (e.g., Kimi K2.6 with web search + code generation)
openclaw agent --agent coder --message "complex task" --timeout 300
# Observed: Process killed around 60-90s mark
# Result: SIGKILL (no cleanup possible)
# Evidence: Session lock file remains (.jsonl.lock)
# Process no longer exists but lock persists
Pattern 2: SIGTERM vs SIGKILL
SIGTERM (graceful - gateway shutdown):
{"subsystem":"gateway","message":"signal SIGTERM received"}
{"subsystem":"gateway","message":"received SIGTERM; shutting down"}
→ Gateway handles this gracefully, cleans up resources
SIGKILL (abrupt - agent runs):
# No log entries - process is killed without warning
# Lock file: agents/coder/sessions/<uuid>.jsonl.lock
# Lock owner PID no longer exists
# All subsequent agent runs fail with "session file locked"
→ No cleanup, session lock persists indefinitely
Pattern 3: Reproducible Steps
- Start a complex agent run (e.g., researcher with web search):
openclaw agent --agent researcher \
--message "Research GLM alternatives, 10+ sources" \
--timeout 300
- Agent starts processing, makes API calls
- Around 60-90s: Process disappears (SIGKILL)
- Lock file remains: sessions/<uuid>.jsonl.lock
- Check: ps aux | grep <pid> → PID no longer exists
- New agent run: Fails with "session file locked (timeout 10000ms)"
Root Cause Analysis
Evidence Points to Supervisor Timeout:
- Timeout mismatch:
  - User sets --timeout 300 (5 minutes)
  - Gateway timeout: 630000ms (10.5 minutes)
  - Supervisor timeout: Likely 60-90s (hardcoded?)
- Process lifecycle:
  - Gateway receives SIGTERM → graceful shutdown
  - Agent run receives no catchable signal and logs nothing → abruptly killed (SIGKILL)
  - Suggests supervisor/process manager is killing the agent, not the gateway
- SIGKILL characteristics:
  - Cannot be caught or handled
  - No cleanup possible
  - Process state shows "killed" or missing PID
  - Lock files remain orphaned
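The difference between the two signals can be checked in isolation. The following is a minimal sketch, not OpenClaw code: it spawns a throwaway Node.js child whose SIGTERM handler stands in for lock cleanup, then reports how the child ended. The names `childSource` and `runAndSignal` are made up for this illustration, and POSIX signal semantics (macOS, Linux) are assumed.

```javascript
const { spawn } = require('node:child_process');

// Child script: the SIGTERM handler stands in for session lock cleanup.
const childSource = `
  process.on('SIGTERM', () => { console.log('cleanup ran'); process.exit(0); });
  console.log('ready');
  setInterval(() => {}, 1000); // keep the process alive
`;

// Start the child, wait until it prints "ready", send `signal`,
// and resolve with how the child ended and whether cleanup ran.
function runAndSignal(signal) {
  return new Promise((resolve, reject) => {
    const child = spawn(process.execPath, ['-e', childSource], {
      stdio: ['ignore', 'pipe', 'inherit'],
    });
    let output = '';
    let signalled = false;
    child.stdout.setEncoding('utf8');
    child.stdout.on('data', (chunk) => {
      output += chunk;
      if (!signalled && output.includes('ready')) {
        signalled = true;
        child.kill(signal);
      }
    });
    child.on('error', reject);
    // 'close' fires after the child has ended and its stdout is drained.
    child.on('close', (code, signalName) => {
      resolve({ code, signal: signalName, cleanupRan: output.includes('cleanup ran') });
    });
  });
}
```

With SIGTERM the handler runs and the child exits with code 0. With SIGKILL the child ends with signal SIGKILL and the handler never runs, which matches the orphaned lock files described above.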
Impact
Suggested Fix
Option 1: Use SIGTERM with Grace Period (Recommended)
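// Sketch for the supervisor; `child` is assumed to be the agent's ChildProcess
child.kill('SIGTERM'); // ask the agent to shut down and release its lock
const forceKill = setTimeout(() => {
  if (child.exitCode === null && child.signalCode === null) {
    child.kill('SIGKILL'); // Force kill only after grace period
  }
}, 5000); // 5s grace period for cleanup
child.once('exit', () => clearTimeout(forceKill)); // no escalation needed once it exits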
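A slightly fuller version of the same idea, written as a reusable helper, might look like the sketch below. `terminateGracefully` is a hypothetical name, not an existing OpenClaw function; it assumes the supervisor holds a Node.js `ChildProcess` for the agent run.

```javascript
// Send SIGTERM, then escalate to SIGKILL only if the child is still
// running after `graceMs`. Resolves with how the child actually ended.
function terminateGracefully(child, graceMs = 5000) {
  return new Promise((resolve) => {
    // Already exited: nothing to signal.
    if (child.exitCode !== null || child.signalCode !== null) {
      resolve({ code: child.exitCode, signal: child.signalCode, escalated: false });
      return;
    }
    let escalated = false;
    const timer = setTimeout(() => {
      escalated = true;
      child.kill('SIGKILL'); // last resort: cleanup handlers will not run
    }, graceMs);
    child.once('exit', (code, signal) => {
      clearTimeout(timer);
      resolve({ code, signal, escalated });
    });
    child.kill('SIGTERM'); // gives the agent a chance to release its lock
  });
}
```

Returning `escalated` lets the supervisor log when a forced kill was needed, so orphaned locks can be traced back to a specific run.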
Option 2: Extend Supervisor Timeout
- Make supervisor timeout configurable or match the --timeout flag
- If user sets --timeout 300, supervisor should wait 300s before any kill
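One way to express this is to derive both supervisor deadlines from the user's value instead of a fixed limit. This is only a sketch: `supervisorDeadlines` and its field names are hypothetical, and the 5s grace default is borrowed from Option 1.

```javascript
// Hypothetical helper: compute supervisor deadlines from the --timeout
// value (in seconds) rather than from a hardcoded limit.
function supervisorDeadlines(timeoutSeconds, graceMs = 5000) {
  if (!Number.isFinite(timeoutSeconds) || timeoutSeconds <= 0) {
    throw new RangeError('timeout must be a positive number of seconds');
  }
  const softDeadlineMs = timeoutSeconds * 1000; // earliest point to send SIGTERM
  const hardDeadlineMs = softDeadlineMs + graceMs; // earliest point to send SIGKILL
  return { softDeadlineMs, hardDeadlineMs };
}
```

For `--timeout 300` this yields a SIGTERM no earlier than 300000ms and a SIGKILL no earlier than 305000ms.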
Option 3: Pre-Kill Hook
- Register cleanup function before kill:
process.on('SIGTERM', () => {
  releaseSessionLock();
  process.exit(0);
});
- Then use SIGTERM instead of SIGKILL
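A more defensive variant of this hook is sketched below. `installLockCleanup` and the `lockPath` argument are hypothetical; how OpenClaw actually creates and names its lock is not known from this report. The sketch also handles SIGINT and normal exit, and exits with the conventional 128 + signal number so callers can tell the run was interrupted.

```javascript
const fs = require('node:fs');
const os = require('node:os');

// Hypothetical: remove `lockPath` on SIGTERM, SIGINT, and normal exit.
function installLockCleanup(lockPath) {
  // `force: true` makes removal a no-op if the lock is already gone.
  const release = () => fs.rmSync(lockPath, { force: true });
  for (const signal of ['SIGTERM', 'SIGINT']) {
    process.once(signal, () => {
      release();
      process.exit(128 + os.constants.signals[signal]);
    });
  }
  process.once('exit', release); // also covers normal completion
  return release;
}
```

None of these handlers run under SIGKILL, so this only helps once the supervisor sends SIGTERM first.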
Workarounds
User-Level (Current):
# After every killed agent run:
pkill -f "openclaw agent"   # stop any leftover agent processes first
rm -f ~/.openclaw/agents/coder/sessions/*.lock
Script-Level:
# Wrap agent calls with cleanup
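run_agent() {
  openclaw agent "$@"
  local status=$?   # preserve the agent's exit status
  sleep 1
  # Removes every lock, so only use this when no other agent run is active
  rm -f ~/.openclaw/agents/*/sessions/*.lock
  return "$status"
}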
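The blanket rm in these workarounds also deletes locks held by runs that are still alive. A narrower cleanup removes a lock only when its owner PID is gone, sketched below. Both function names are hypothetical, and the owner PID is passed in explicitly because the lock file's on-disk format is not documented in this report.

```javascript
const fs = require('node:fs');

// Signal 0 checks for existence and permission without delivering a signal.
function isProcessAlive(pid) {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but belongs to another user.
    // ESRCH (or anything else): treat the process as gone.
    return err.code === 'EPERM';
  }
}

// Remove the lock only when its owner is gone; returns true if removed.
function removeLockIfStale(lockPath, ownerPid) {
  if (isProcessAlive(ownerPid)) return false;
  fs.rmSync(lockPath, { force: true });
  return true;
}
```

PIDs can be reused by the operating system, so a live PID does not prove the original owner is still running; the check only guarantees that a lock is never removed while some process with that PID exists.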
Related Issues
Additional Context
- This may be related to openclaw agent using embedded runs vs gateway runs
- Embedded runs might have different supervisor logic than gateway-managed runs
- The 60-90s timeout suggests a hardcoded limit, not the user-specified --timeout
Attachments
Reported by: Johannes Huijbregts via Echo assistant
Date: 2026-04-22
OpenClaw Version: v2026.4.20 (115f05d)