BUG: Supervisor sends SIGKILL instead of SIGTERM for long-running agents — causes session lock cascade #70026

# Bug Report: Supervisor Sends SIGKILL Instead of SIGTERM for Long-Running Agents

## Summary
When an agent run lasts longer than roughly 60-90 seconds, the OpenClaw supervisor appears to terminate the process with **SIGKILL** instead of SIGTERM. Because SIGKILL cannot be caught, the agent gets no chance to clean up, so its session lock is never removed. This directly causes the session lock issue reported in #70004.

## Environment
- **OpenClaw Version:** v2026.4.20 (115f05d)
- **OS:** macOS 15.4.1 (Darwin 25.4.0 arm64)
- **Node.js:** v25.8.1

## Observed Behavior

### Pattern 1: Agent Runs Killed After ~60-90s
```
# Long-running agent (e.g., Kimi K2.6 with web search + code generation)
openclaw agent --agent coder --message "complex task" --timeout 300

# Observed: Process killed around 60-90s mark
# Result: SIGKILL (no cleanup possible)
# Evidence: Session lock file remains (.jsonl.lock)
# Process no longer exists but lock persists
```

### Pattern 2: SIGTERM vs SIGKILL

**SIGTERM (graceful - gateway shutdown):**
```
{"subsystem":"gateway","message":"signal SIGTERM received"}
{"subsystem":"gateway","message":"received SIGTERM; shutting down"}
```
→ Gateway handles this gracefully, cleans up resources

**SIGKILL (abrupt - agent runs):**
```
# No log entries - process is killed without warning
# Lock file: agents/coder/sessions/<uuid>.jsonl.lock
# Lock owner PID no longer exists
# All subsequent agent runs fail with "session file locked"
```
→ No cleanup, session lock persists indefinitely

### Pattern 3: Reproducible Steps
1. Start a complex agent run (e.g., researcher with web search):
   ```bash
   openclaw agent --agent researcher \
     --message "Research GLM alternatives, 10+ sources" \
     --timeout 300
   ```
2. Agent starts processing, makes API calls
3. Around 60-90s: Process disappears (SIGKILL)
4. Lock file remains: `sessions/<uuid>.jsonl.lock`
5. Check: `ps aux | grep <pid>` → PID no longer exists
6. New agent run: Fails with "session file locked (timeout 10000ms)"

## Root Cause Analysis

### Evidence Points to Supervisor Timeout:
1. **Timeout mismatch:**
   - User sets `--timeout 300` (5 minutes)
   - Gateway timeout: 630000ms (10.5 minutes)
   - Supervisor timeout: Likely 60-90s (hardcoded?)

2. **Process lifecycle:**
   - Gateway receives SIGTERM → logs it and shuts down gracefully
   - Agent run logs no signal at all → consistent with SIGKILL, which the target process can neither catch nor log
   - Suggests supervisor/process manager is killing the agent, not the gateway

3. **SIGKILL characteristics:**
   - Cannot be caught or handled
   - No cleanup possible
   - Process state shows "killed" or missing PID
   - Lock files remain orphaned

## Impact
- **Session Lock Issue (#70004):** Direct cause - locks not cleaned up
- **Data Loss:** Agent output lost mid-generation
- **Resource Waste:** Failed runs consume API tokens without completion
- **User Experience:** Requires manual lock cleanup after every long run

## Suggested Fix

### Option 1: Use SIGTERM with Grace Period (Recommended)
```javascript
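// Sketch for the supervisor; `child` is the ChildProcess returned by spawn()
function terminate(child, graceMs = 5000) { // 5s grace period for cleanup
  child.kill('SIGTERM'); // ask the agent to exit and release its lock
  const timer = setTimeout(() => {
    // Force kill only if the agent is still running after the grace period
    if (child.exitCode === null && child.signalCode === null) {
      child.kill('SIGKILL');
    }
  }, graceMs);
  child.once('exit', () => clearTimeout(timer)); // no escalation after a clean exit
}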
```

### Option 2: Extend Supervisor Timeout
- Make supervisor timeout configurable or match `--timeout` flag
- If user sets `--timeout 300`, supervisor should wait 300s before any kill
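One way to read Option 2 is to derive both supervisor deadlines from the user's `--timeout` value. This is a sketch under assumptions: `supervisorDeadlines`, `DEFAULT_TIMEOUT_S`, and the 5-second grace period are hypothetical and do not reflect OpenClaw's actual configuration.

```javascript
// Hypothetical fallback for runs started without --timeout.
const DEFAULT_TIMEOUT_S = 60;

// Returns when to send SIGTERM and when to escalate to SIGKILL,
// both in milliseconds from the start of the run.
function supervisorDeadlines(userTimeoutSeconds, graceMs = 5000) {
  const seconds =
    Number.isFinite(userTimeoutSeconds) && userTimeoutSeconds > 0
      ? userTimeoutSeconds
      : DEFAULT_TIMEOUT_S;
  const termAtMs = seconds * 1000;
  return { termAtMs, killAtMs: termAtMs + graceMs };
}
```

For `--timeout 300` this gives SIGTERM at 300000 ms and SIGKILL at 305000 ms, so the supervisor never fires before the limit the user asked for.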

### Option 3: Pre-Kill Hook
- Register cleanup function before kill:
```javascript
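// releaseSessionLock() stands for whatever removes <uuid>.jsonl.lock; it must be synchronous
process.on('SIGTERM', () => {
  releaseSessionLock();
  process.exit(143); // 128 + 15: conventional exit status for SIGTERM, so the run is not reported as successful
});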
```
- Then use SIGTERM instead of SIGKILL
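A slightly fuller version of the hook might look like the sketch below. It assumes the lock is a plain file that can be deleted synchronously; `installLockCleanup` is a hypothetical name, not an existing OpenClaw function. It also covers SIGINT and normal exit, and makes the release idempotent so it is safe to call from several paths.

```javascript
const fs = require('node:fs');
const os = require('node:os');

// Registers handlers that remove `lockPath` on SIGTERM, SIGINT, and normal exit.
// Returns the release function so the caller can also release explicitly.
function installLockCleanup(lockPath) {
  let released = false;
  const release = () => {
    if (released) return; // idempotent: several paths may call this
    released = true;
    fs.rmSync(lockPath, { force: true }); // synchronous; force ignores a missing file
  };

  process.once('exit', release);
  for (const sig of ['SIGTERM', 'SIGINT']) {
    process.once(sig, () => {
      release();
      // Exit with 128 + signal number so the parent sees the run was interrupted.
      process.exit(128 + os.constants.signals[sig]);
    });
  }
  return release;
}
```

None of these handlers run under SIGKILL, which is why the hook only helps once the supervisor sends SIGTERM first.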

## Workarounds

### User-Level (Current):
```bash
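# After every killed agent run: stop leftover agent processes first,
# then remove the orphaned locks (this also deletes locks held by live "coder" runs)
pkill -f "openclaw agent"
rm -f ~/.openclaw/agents/coder/sessions/*.lock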
```

### Script-Level:
```bash
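# Wrap agent calls with cleanup; not safe if other agent runs are active,
# because it removes every session lock
run_agent() {
  openclaw agent "$@"
  local status=$?   # keep the agent's exit status
  sleep 1
  rm -f ~/.openclaw/agents/*/sessions/*.lock
  return "$status"
}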
```

## Related Issues
- #70004: Session Lock Not Released After Crash/SIGKILL
- Possibly related: Gateway timeout (630000ms) configuration

## Additional Context
- This may be related to `openclaw agent` using embedded runs vs gateway runs
- Embedded runs might have different supervisor logic than gateway-managed runs
- The 60-90s timeout suggests a hardcoded limit, not the user-specified `--timeout`

## Attachments
- [ ] Full log excerpt showing agent start → disappearance
- [ ] Process monitor output (ps aux timestamps)
- [ ] Session lock files with timestamps

---
*Reported by: Johannes Huijbregts via Echo assistant*
*Date: 2026-04-22*
*OpenClaw Version: v2026.4.20 (115f05d)*

