
BUG: Agent session lock not released after crash/SIGKILL — blocks all subsequent runs #70004

@Johannes0402

Bug Report: Agent Session Lock Not Released After Crash/SIGKILL

Summary

When an agent run is killed with SIGKILL (for example by the supervisor after a timeout), the session lock file (*.jsonl.lock) is not removed. Every subsequent run of that agent then fails to start, regardless of which model is requested, because every model in the fallback chain waits on the same lock file.

Environment

  • OpenClaw Version: v2026.4.15 (also observed on v2026.4.14)
  • OS: macOS 15.4.1 (Darwin 25.4.0 arm64)
  • Node.js: v25.8.1
  • Shell: zsh
  • Host: Mac mini (Apple Silicon)

Configuration

"agents": {
  "defaults": {
    "model": {
      "primary": "ollama/glm-5.1:cloud",
      "fallbacks": [
        "ollama/qwen3.5:397b-cloud",
        "ollama/glm-5.1:cloud",
        "xai/grok-4-1-fast-reasoning",
        "anthropic/claude-haiku-4-5"
      ]
    }
  }
}

Steps to Reproduce

  1. Start a long-running agent: openclaw agent --agent coder --message "complex task" --timeout 300
  2. While the agent is running (e.g., at 12+ minutes), send a new agent command OR the gateway sends a heartbeat check
  3. First agent gets SIGKILL'd by supervisor (timeout or new request)
  4. Lock file remains: agents/coder/sessions/<uuid>.jsonl.lock
  5. All subsequent agent runs fail immediately with:
    Error: session file locked (timeout 10000ms): pid=<old_pid> /path/to/<uuid>.jsonl.lock
    
  6. This cascades through ALL fallback models (5 attempts, all fail with same lock)

Observed Behavior

Error Pattern (repeated every ~10 seconds for 5 models):

session file locked (timeout 10000ms): pid=25358 /Users/botje/.openclaw/agents/coder/sessions/b68174ac-1d06-4f19-a8f7-f055b2fa51af.jsonl.lock

Full Fallback Chain Failure:

All models failed (5):
- ollama/kimi-k2.6:cloud: session file locked (timeout 10000ms): pid=25358 ...jsonl.lock (timeout)
- ollama/qwen3.5:397b-cloud: session file locked (timeout 10000ms): pid=25358 ...jsonl.lock (timeout)
- ollama/glm-5.1:cloud: session file locked (timeout 10000ms): pid=25358 ...jsonl.lock (timeout)
- xai/grok-4-1-fast: session file locked (timeout 10000ms): pid=25358 ...jsonl.lock (timeout)
- anthropic/claude-haiku-4-5: session file locked (timeout 10000ms): pid=25358 ...jsonl.lock (timeout)

Key Observations:

  1. Lock persists until manually deleted: the lock from pid=25358 (from a run at ~07:00) was still blocking runs at ~07:43 (40+ minutes later)
  2. Lock blocks ALL models: not just the original model, but every fallback model fails on the SAME lock
  3. SIGKILL doesn't clean up: when the supervisor kills the process, the lock file remains on disk (SIGKILL cannot be caught, so the process never gets a chance to remove it)
  4. Hardcoded timeout: the 10000ms (10s) timeout appears to be hardcoded, not configurable
  5. No automatic cleanup: there appears to be no mechanism to detect stale locks (e.g., checking if the PID is still alive)

Log Evidence

Repeated Lock Errors (40+ minutes):

# From 07:03 to 07:43 — same lock file blocks all attempts
# pid=22626: initial run
# pid=25358: subsequent run
# Both locks persisted until manually deleted

Model Fallback Decisions (all failing on same lock):

{
  "event": "model_fallback_decision",
  "runId": "cd6ed7c8-3cb3-47e2-a572-3c44c80d5ec0",
  "decision": "candidate_failed",
  "requestedModel": "kimi-k2.6:cloud",
  "candidateModel": "kimi-k2.6:cloud",
  "attempt": 1,
  "reason": "timeout",
  "errorPreview": "session file locked (timeout 10000ms): pid=25358 ...jsonl.lock"
}

Workarounds Found

Manual (User-Level):

# Must be done after EVERY stuck agent run
rm -f ~/.openclaw/agents/coder/sessions/*.lock
pkill -f "openclaw agent --agent coder"
# Then retry agent run

Issues with Workaround:

  1. Must run BEFORE each new agent command (otherwise new command fails)
  2. Loses session history for debugging
  3. User must detect the stuck state manually
  4. Not feasible for automated/scripted agent runs
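A wrapper script could partly address issues 3 and 4 by reacting to the error text itself. The sketch below is TypeScript and is not OpenClaw code. It parses the error line quoted under Observed Behavior, checks whether the reported PID is still alive, and deletes only that one lock if the owner is gone. `parseLockError`, `clearIfStale`, and `isProcessAlive` are hypothetical names. The only thing taken from OpenClaw is the message format shown above. It expects the full lock path (not the truncated `...jsonl.lock` form) and does not handle paths containing spaces.

```typescript
import * as fs from "node:fs";

export interface LockError {
  timeoutMs: number;
  pid: number;
  lockPath: string;
}

// Parses "session file locked (timeout 10000ms): pid=25358 /path/x.jsonl.lock".
// Returns null for any message that does not match that shape.
export function parseLockError(message: string): LockError | null {
  const m =
    /session file locked \(timeout (\d+)ms\): pid=(\d+) (\S+\.jsonl\.lock)/.exec(
      message,
    );
  if (m === null) return null;
  return { timeoutMs: Number(m[1]), pid: Number(m[2]), lockPath: String(m[3]) };
}

// Signal 0 sends nothing; it only checks that the PID exists.
export function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as NodeJS.ErrnoException).code === "EPERM";
  }
}

// Removes the lock named in the error only when its owner is no longer running.
// Returns true if the lock was treated as stale and removed.
export function clearIfStale(message: string): boolean {
  const info = parseLockError(message);
  if (info === null || isProcessAlive(info.pid)) return false;
  fs.rmSync(info.lockPath, { force: true });
  return true;
}
```

Unlike `rm -f *.lock`, this leaves locks held by live processes alone. A recycled PID could still make a dead owner look alive.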

Expected Behavior

  1. Stale lock detection: If PID in lock file is no longer alive, automatically remove the lock
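  2. Signal cleanup: Register SIGTERM/SIGINT handlers that remove the lock before the process exits. SIGKILL cannot be caught, so that case has to be covered by stale lock detection (item 1)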
  3. Lock timeout: Configurable timeout (not hardcoded 10s), or at least attempt cleanup on timeout
  4. Per-run locks: Each agent invocation should get its own lock, not share a single lock file

Suggested Fix

Option A: PID-based stale lock detection (Recommended)

// Pseudocode for lock acquisition
if (lockFileExists()) {
  const lockPid = readLockFile();
  if (!isProcessAlive(lockPid)) {
    deleteLockFile(); // Stale lock: owner is gone, safe to remove
  } else {
    waitForLock(); // Live lock: owner still running, wait up to the timeout
  }
}
createLockFile(); // Take the lock once it is free
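The pseudocode above could look roughly like this in TypeScript on Node.js. This is a minimal sketch under stated assumptions, not OpenClaw's actual locking code. It assumes the lock file contains the owner's PID as decimal text, which has not been checked against the real on-disk format. `tryAcquireLock`, `readLockPid`, and `isProcessAlive` are hypothetical names. It makes a single non-blocking attempt; waiting and retrying up to the timeout is left to the caller.

```typescript
import * as fs from "node:fs";

// Signal 0 sends nothing; it only checks that the PID exists.
export function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as NodeJS.ErrnoException).code === "EPERM";
  }
}

// ASSUMPTION: the lock file holds the owner's PID as decimal text.
function readLockPid(lockPath: string): number | null {
  try {
    const pid = Number.parseInt(fs.readFileSync(lockPath, "utf8").trim(), 10);
    return Number.isInteger(pid) && pid > 0 ? pid : null;
  } catch {
    return null; // Lock vanished or is unreadable.
  }
}

// One acquisition attempt. Reclaims the lock if its recorded owner is dead.
export function tryAcquireLock(lockPath: string): boolean {
  for (let attempt = 0; attempt < 2; attempt++) {
    try {
      // "wx" creates the file and fails with EEXIST if it already exists.
      fs.writeFileSync(lockPath, String(process.pid), { flag: "wx" });
      return true;
    } catch (err) {
      if ((err as NodeJS.ErrnoException).code !== "EEXIST") throw err;
    }
    const owner = readLockPid(lockPath);
    // Unparseable or live owner: be conservative and leave the lock alone.
    if (owner === null || isProcessAlive(owner)) return false;
    fs.rmSync(lockPath, { force: true }); // Owner is dead: remove and retry.
  }
  return false;
}
```

Two processes reclaiming the same stale lock at the same moment can still race between the liveness check and the delete, so a real fix would need to close that window. PID reuse can also make a dead owner look alive.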

Option B: Process signal handlers

// On SIGTERM/SIGINT (SIGKILL cannot be caught, so Option A is still needed for that case)
cleanupLockFile();
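A hedged sketch of how such handlers might be registered in Node.js follows. `registerLockCleanup` is a hypothetical name, and the sketch assumes the process that registers it owns the lock at `lockPath`. It covers SIGINT, SIGTERM, and normal exit only. Node.js does not allow a listener for SIGKILL, so the supervisor kill described in this report would still leave the lock behind.

```typescript
import * as fs from "node:fs";

// Registers best-effort lock removal for the exits a process can observe.
// Returns the release function so the caller can also unlock explicitly.
export function registerLockCleanup(lockPath: string): () => void {
  const release = (): void => {
    // force: true makes this a no-op when the lock is already gone.
    fs.rmSync(lockPath, { force: true });
  };
  const onSignal = (signal: NodeJS.Signals): void => {
    release();
    // Conventional exit codes: 128 + signal number (SIGINT=2, SIGTERM=15).
    process.exit(signal === "SIGINT" ? 130 : 143);
  };
  process.once("SIGINT", onSignal);
  process.once("SIGTERM", onSignal);
  process.once("exit", release); // Normal exit and process.exit() paths.
  return release;
}
```

If the supervisor sent SIGTERM first and escalated to SIGKILL only after a grace period, this handler would get a chance to run. Whether the supervisor does that is not known from the logs.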

Option C: Lock file with timestamp

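// Lock includes timestamp, auto-expire after configurable timeout
// e.g., lock older than 30s is considered stale
// (a long-running owner would have to refresh the timestamp periodically,
// otherwise its live lock would expire mid-run)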
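One possible shape for the timestamp approach uses the lock file's modification time as the timestamp. This is a sketch under assumptions, not a description of how OpenClaw stores its locks. `isLockExpired`, `refreshLock`, and `DEFAULT_STALE_MS` are hypothetical names, and the 30s default is taken from the example above. Because runs in this report last 10+ minutes, the owner would need to call `refreshLock` on a timer well inside the expiry window.

```typescript
import * as fs from "node:fs";

// Hypothetical default: age after which an unrefreshed lock counts as stale.
export const DEFAULT_STALE_MS = 30_000;

// True if the lock file exists and its mtime is older than staleMs.
export function isLockExpired(
  lockPath: string,
  staleMs: number = DEFAULT_STALE_MS,
  now: number = Date.now(),
): boolean {
  try {
    return now - fs.statSync(lockPath).mtimeMs > staleMs;
  } catch {
    return false; // No readable lock file, so nothing to expire.
  }
}

// Heartbeat: the owner bumps the lock's mtime while it is still working.
export function refreshLock(lockPath: string): void {
  const t = new Date();
  fs.utimesSync(lockPath, t, t);
}
```

Compared with the PID check in Option A, this also covers an owner that is alive but hung, at the cost of depending on a heartbeat and on reasonably accurate clocks.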

Impact

  • High: Completely blocks all agent functionality
  • Frequency: Reproducible on every long-running (> 10min) agent
  • Affected Users: Anyone using openclaw agent with timeout > 60s
  • Regression: Likely introduced in recent session persistence feature

Additional Context

Related Issues

Attachments

  • Full OpenClaw log file (openclaw-2026-04-22.log)
  • Session lock files (if preserved)
  • openclaw.json configuration (sanitized)

Reported by: Johannes Huijbregts via Echo assistant
Date: 2026-04-22
OpenClaw Version: v2026.4.15
