BUG: Supervisor sends SIGKILL instead of SIGTERM for long-running agents — causes session lock cascade #70026

@Johannes0402

Bug Report: Supervisor Sends SIGKILL Instead of SIGTERM for Long-Running Agents

Summary

When an agent run exceeds a certain duration (observed at 60-90+ seconds), the OpenClaw supervisor sends SIGKILL instead of SIGTERM to terminate the process. SIGKILL prevents any cleanup (including session lock removal), directly causing the session lock issue reported in #70004.

Environment

  • OpenClaw Version: v2026.4.20 (115f05d)
  • OS: macOS 15.4.1 (Darwin 25.4.0 arm64)
  • Node.js: v25.8.1

Observed Behavior

Pattern 1: Agent Runs Killed After ~60-90s

# Long-running agent (e.g., Kimi K2.6 with web search + code generation)
openclaw agent --agent coder --message "complex task" --timeout 300

# Observed: Process killed around 60-90s mark
# Result: SIGKILL (no cleanup possible)
# Evidence: Session lock file remains (.jsonl.lock)
# Process no longer exists but lock persists

Pattern 2: SIGTERM vs SIGKILL

SIGTERM (graceful - gateway shutdown):

{"subsystem":"gateway","message":"signal SIGTERM received"}
{"subsystem":"gateway","message":"received SIGTERM; shutting down"}

→ Gateway handles this gracefully, cleans up resources

SIGKILL (abrupt - agent runs):

# No log entries - process is killed without warning
# Lock file: agents/coder/sessions/<uuid>.jsonl.lock
# Lock owner PID no longer exists
# All subsequent agent runs fail with "session file locked"

→ No cleanup, session lock persists indefinitely

Pattern 3: Reproducible Steps

  1. Start a complex agent run (e.g., researcher with web search):
    openclaw agent --agent researcher \
      --message "Research GLM alternatives, 10+ sources" \
      --timeout 300
  2. Agent starts processing, makes API calls
  3. Around 60-90s: Process disappears (SIGKILL)
  4. Lock file remains: sessions/<uuid>.jsonl.lock
  5. Check: ps -p <pid> → PID no longer exists
  6. New agent run: Fails with "session file locked (timeout 10000ms)"
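Step 3 currently infers SIGKILL from missing log lines. A wrapper can record directly how long the run lived and which signal ended it. This is a minimal Node.js sketch that assumes the `openclaw agent` process itself is the one being killed. If the CLI hands the work to a separate child process, the wrapper only sees the CLI's own exit. `runAndReport` is a hypothetical helper and is not part of OpenClaw:

```javascript
const { spawn } = require('node:child_process');

// Hypothetical helper: run a command, then report its lifetime and how it ended.
function runAndReport(cmd, args) {
  return new Promise((resolve, reject) => {
    const started = Date.now();
    const child = spawn(cmd, args, { stdio: 'inherit' });
    child.once('error', reject); // e.g. command not found
    child.once('exit', (code, signal) => {
      // Node guarantees one of `code` / `signal` is non-null;
      // a SIGKILL shows up as signal === 'SIGKILL' and code === null.
      resolve({ pid: child.pid, code, signal, elapsedMs: Date.now() - started });
    });
  });
}

// Example against the reproduction from step 1 (uncomment to run):
// runAndReport('openclaw', ['agent', '--agent', 'researcher',
//   '--message', 'Research GLM alternatives, 10+ sources', '--timeout', '300'])
//   .then((report) => console.log(report));
```

Suppose the report shows `signal: 'SIGKILL'` with `elapsedMs` somewhere in the 60,000-90,000 range. That result would confirm Pattern 1 without relying on the absence of log entries.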

Root Cause Analysis

Evidence Points to Supervisor Timeout:

  1. Timeout mismatch:

    • User sets --timeout 300 (5 minutes)
    • Gateway timeout: 630000ms (10.5 minutes)
    • Supervisor timeout: Likely 60-90s (hardcoded?)
  2. Process lifecycle:

    • Gateway receives SIGTERM → logs it and shuts down gracefully
    • Agent run logs no signal at all before disappearing → consistent with SIGKILL, which cannot be caught or logged
    • Suggests that a supervisor/process manager is killing the agent, not the gateway
  3. SIGKILL characteristics:

    • Cannot be caught or handled
    • No cleanup possible
    • The launching shell reports the process as killed (exit status 137, i.e. 128 + 9), or the PID is simply missing
    • Lock files remain orphaned
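The SIGKILL characteristics above can be checked without OpenClaw. The Node.js sketch below spawns a stand-in child whose SIGTERM handler prints `cleanup`, which is a placeholder for releasing the session lock. It then terminates one copy with SIGTERM and another with SIGKILL. `killAndCapture` and the child script are illustrative only:

```javascript
const { spawn } = require('node:child_process');

// Stand-in for an agent run. On SIGTERM it prints "cleanup" (a placeholder for
// releasing the session lock), waits for the write to flush, then exits with 143.
const CHILD_SRC = `
  process.on('SIGTERM', () => {
    process.stdout.write('cleanup', () => process.exit(143));
  });
  process.stdout.write('ready');
  setInterval(() => {}, 1000); // keep the child alive until it is signalled
`;

function killAndCapture(signal) {
  return new Promise((resolve, reject) => {
    const child = spawn(process.execPath, ['-e', CHILD_SRC]);
    let out = '';
    child.once('error', reject);
    child.stdout.on('data', (chunk) => {
      out += chunk;
      // Send the signal only after "ready", i.e. once the SIGTERM handler is installed.
      if (out.includes('ready') && !child.killed) child.kill(signal);
    });
    // 'close' (unlike 'exit') fires only after stdout has been fully drained.
    child.once('close', (code, exitSignal) => {
      resolve({ sent: signal, code, exitSignal, cleanedUp: out.includes('cleanup') });
    });
  });
}

Promise.all([killAndCapture('SIGTERM'), killAndCapture('SIGKILL')])
  .then((results) => console.table(results));
```

With SIGTERM, the handler runs and the child exits with status 143. With SIGKILL, the handler never runs and Node reports `exitSignal: 'SIGKILL'`. That second case matches the orphaned-lock behavior.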

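Impact

  • Agent runs that legitimately need more than ~60-90s cannot complete, regardless of --timeout
  • The orphaned .jsonl.lock makes all subsequent agent runs fail with "session file locked (timeout 10000ms)" until the lock is removed by hand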

Suggested Fix

Option 1: Use SIGTERM with Grace Period (Recommended)

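// Supervisor side (Node.js): `child` is the ChildProcess returned by spawn() for the agent run
function terminateAgent(child, graceMs = 5000) {
  // Force-kill only if the agent is still running after the grace period
  const forceKill = setTimeout(() => child.kill('SIGKILL'), graceMs);
  // Agent exited on its own (e.g. after releasing its session lock): cancel the SIGKILL
  child.once('exit', () => clearTimeout(forceKill));
  // SIGTERM can be caught, so the agent gets a chance to clean up
  child.kill('SIGTERM');
}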

Option 2: Extend Supervisor Timeout

  • Make supervisor timeout configurable or match --timeout flag
  • If user sets --timeout 300, supervisor should wait 300s before any kill
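Option 2 could be combined with Option 1 so that the kill timer comes from the user's `--timeout` rather than a fixed 60-90s. This sketch assumes a policy: SIGTERM at the requested timeout, then SIGKILL after a grace period, with the SIGKILL capped so it lands no later than the 630000 ms gateway timeout. The function names and the capping rule are hypothetical and are not OpenClaw's actual supervisor:

```javascript
// Hypothetical policy: SIGTERM at the user's --timeout, SIGKILL `graceMs` later,
// and the SIGKILL never later than the gateway timeout.
function sigtermDelayMs(userTimeoutSec, { graceMs = 5000, gatewayTimeoutMs = 630000 } = {}) {
  if (!Number.isFinite(userTimeoutSec) || userTimeoutSec <= 0) {
    throw new RangeError('--timeout must be a positive number of seconds');
  }
  return Math.min(userTimeoutSec * 1000, gatewayTimeoutMs - graceMs);
}

// Arm both timers for one spawned agent run; returns a function that disarms them.
function armSupervisor(child, userTimeoutSec, { graceMs = 5000, gatewayTimeoutMs = 630000 } = {}) {
  let forceKill; // set only once SIGTERM has been sent
  const term = setTimeout(() => {
    child.kill('SIGTERM'); // catchable: the agent can release its session lock
    forceKill = setTimeout(() => child.kill('SIGKILL'), graceMs); // last resort
  }, sigtermDelayMs(userTimeoutSec, { graceMs, gatewayTimeoutMs }));
  const disarm = () => {
    clearTimeout(term);
    clearTimeout(forceKill);
  };
  child.once('exit', disarm); // run finished or was stopped: cancel pending signals
  return disarm;
}

// Usage sketch (spawn from node:child_process):
// armSupervisor(spawn('openclaw', ['agent', '--agent', 'coder', '--message', 'complex task', '--timeout', '300']), 300);
```

Under these assumptions, `--timeout 300` yields SIGTERM at 300000 ms and SIGKILL at 305000 ms at the latest. Neither signal fires if the run finishes first.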

Option 3: Pre-Kill Hook

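  • Register a cleanup handler in the agent process so that SIGTERM releases the session lock before exiting (releaseSessionLock() stands for whatever removes the .jsonl.lock):
process.on('SIGTERM', () => {
  releaseSessionLock(); // must be synchronous, or exit only after it has completed
  process.exit(143);    // 128 + 15: conventional status for "terminated by SIGTERM"
});
  • Then have the supervisor send SIGTERM instead of SIGKILL (no handler can run on SIGKILL)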
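Option 3 can be extended on the agent side. This is a minimal Node.js sketch. It assumes the lock is a file next to the session (`<uuid>.jsonl.lock`) that records the owner PID as plain text. OpenClaw's real lock format and helpers may differ, and `acquireSessionLock` is hypothetical. The sketch releases the lock on normal exit, SIGTERM and SIGINT, which covers every exit path except SIGKILL:

```javascript
const fs = require('node:fs');
const os = require('node:os');

// Hypothetical agent-side lock: created exclusively, stores our PID,
// and is removed on every exit path that Node can observe.
function acquireSessionLock(lockPath) {
  // 'wx' fails with EEXIST if another run already holds the lock.
  fs.writeFileSync(lockPath, String(process.pid), { flag: 'wx' });

  let released = false;
  const release = () => {
    if (released) return;
    released = true;
    try {
      fs.unlinkSync(lockPath); // synchronous, so it also works inside an 'exit' listener
    } catch (err) {
      if (err.code !== 'ENOENT') throw err; // already removed is fine
    }
  };

  const onSignal = (signal) => {
    release();
    process.exit(128 + os.constants.signals[signal]); // e.g. 143 for SIGTERM
  };
  process.once('exit', release);     // normal completion or process.exit()
  process.once('SIGTERM', onSignal); // graceful stop requested by a supervisor
  process.once('SIGINT', onSignal);  // Ctrl+C in an interactive run
  return release;
}
```

No handler runs on SIGKILL, so this only helps once the supervisor sends SIGTERM first.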

Workarounds

User-Level (Current):

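# After every killed agent run. Stop stray agent processes first, then remove the locks.
# Caution: this also kills and unlocks any other agent runs that are still active.
pkill -f "openclaw agent"
rm -f ~/.openclaw/agents/coder/sessions/*.lock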

Script-Level:

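# Wrap agent calls with cleanup (assumes no other agent run is active at the same time)
run_agent() {
  openclaw agent "$@"
  local status=$?   # keep the agent's exit status so callers can still detect failures
  sleep 1
  rm -f ~/.openclaw/agents/*/sessions/*.lock
  return "$status"
}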
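Both workarounds above delete every lock, including locks held by runs that are still alive. A more careful cleanup removes a lock only when its owner PID no longer exists, which is the state described in Pattern 2. The Node.js sketch below assumes the lock file stores the owner PID as plain text or in a JSON `pid` field. That format is a guess about OpenClaw, and all function names here are hypothetical:

```javascript
const fs = require('node:fs');
const path = require('node:path');
const os = require('node:os');

// True if a process with this PID exists. Signal 0 only checks existence/permission.
function isPidAlive(pid) {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    return err.code === 'EPERM'; // exists but belongs to another user; ESRCH means gone
  }
}

// ASSUMPTION: the lock stores the owner PID as plain text or as a JSON `pid` field.
function readLockPid(lockPath) {
  const raw = fs.readFileSync(lockPath, 'utf8').trim();
  try {
    const parsed = JSON.parse(raw);
    if (Number.isInteger(parsed)) return parsed;
    if (parsed && Number.isInteger(parsed.pid)) return parsed.pid;
  } catch {
    // not JSON: fall through to plain-text parsing
  }
  const pid = Number.parseInt(raw, 10);
  return Number.isInteger(pid) ? pid : null;
}

// Delete *.lock files whose owner PID is gone; locks with an unreadable PID are kept.
function removeStaleLocks(sessionsDir) {
  const removed = [];
  for (const name of fs.readdirSync(sessionsDir)) {
    if (!name.endsWith('.lock')) continue;
    const lockPath = path.join(sessionsDir, name);
    const pid = readLockPid(lockPath);
    if (pid !== null && pid > 0 && !isPidAlive(pid)) {
      fs.unlinkSync(lockPath);
      removed.push(name);
    }
  }
  return removed;
}

// Usage (directory taken from the workaround above):
// console.log(removeStaleLocks(path.join(os.homedir(), '.openclaw', 'agents', 'coder', 'sessions')));
```

The `pid > 0` guard matters because `process.kill(0, 0)` targets the whole process group. If a PID has been recycled, a stale lock will look alive. The sketch then keeps the lock rather than risk deleting a live one.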

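Related Issues

  • #70004: session lock issue (orphaned .jsonl.lock); this SIGKILL behavior appears to be a direct cause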

Additional Context

  • This may be related to openclaw agent using embedded runs vs gateway runs
  • Embedded runs might have different supervisor logic than gateway-managed runs
  • The 60-90s timeout suggests a hardcoded limit, not the user-specified --timeout

Attachments

  • Full log excerpt showing agent start → disappearance
  • Process monitor output (ps aux timestamps)
  • Session lock files with timestamps

Reported by: Johannes Huijbregts via Echo assistant
Date: 2026-04-22
OpenClaw Version: v2026.4.20 (115f05d)
