BUG: Agent session lock not released after crash/SIGKILL — blocks all subsequent runs

# Bug Report: Agent Session Lock Not Released After Crash/SIGKILL

## Summary
When an agent run crashes (for example, after a SIGKILL) or times out, its session lock file (`*.jsonl.lock`) is not released. Every subsequent run for that agent then fails to start, regardless of which model is requested, because each model in the fallback chain waits on the same lock.

## Environment
- **OpenClaw Version:** v2026.4.15 (also observed on v2026.4.14)
- **OS:** macOS 15.4.1 (Darwin 25.4.0 arm64)
- **Node.js:** v25.8.1
- **Shell:** zsh
- **Host:** Mac mini (Apple Silicon)

## Configuration
```json
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "ollama/glm-5.1:cloud",
        "fallbacks": [
          "ollama/qwen3.5:397b-cloud",
          "ollama/glm-5.1:cloud",
          "xai/grok-4-1-fast-reasoning",
          "anthropic/claude-haiku-4-5"
        ]
      }
    }
  }
}
```

## Steps to Reproduce
1. Start a long-running agent: `openclaw agent --agent coder --message "complex task" --timeout 300`
2. While the agent is running (e.g., at 12+ minutes), send a new agent command, or wait for the gateway to send a heartbeat check
3. The supervisor kills the first agent with SIGKILL (on timeout or because of the new request)
4. Lock file remains: `agents/coder/sessions/<uuid>.jsonl.lock`
5. All subsequent agent runs fail immediately with:
   ```
   Error: session file locked (timeout 10000ms): pid=<old_pid> /path/to/<uuid>.jsonl.lock
   ```
6. The failure cascades through the entire fallback chain (5 attempts, each failing on the same lock)

## Observed Behavior

### Error Pattern (repeated every ~10 seconds for 5 models):
```
session file locked (timeout 10000ms): pid=25358 /Users/botje/.openclaw/agents/coder/sessions/b68174ac-1d06-4f19-a8f7-f055b2fa51af.jsonl.lock
```

### Full Fallback Chain Failure:
```
All models failed (5):
- ollama/kimi-k2.6:cloud: session file locked (timeout 10000ms): pid=25358 ...jsonl.lock (timeout)
- ollama/qwen3.5:397b-cloud: session file locked (timeout 10000ms): pid=25358 ...jsonl.lock (timeout)
- ollama/glm-5.1:cloud: session file locked (timeout 10000ms): pid=25358 ...jsonl.lock (timeout)
- xai/grok-4-1-fast: session file locked (timeout 10000ms): pid=25358 ...jsonl.lock (timeout)
- anthropic/claude-haiku-4-5: session file locked (timeout 10000ms): pid=25358 ...jsonl.lock (timeout)
```

### Key Observations:
1. **Lock persists until manually deleted:** The lock held by pid=25358 (from a run at ~07:00) was still blocking runs at ~07:43, 40+ minutes later
2. **Lock blocks all models:** Every fallback model fails on the same lock, not just the originally requested model
3. **SIGKILL doesn't clean up:** When the supervisor kills the process, the lock file remains on disk
4. **Hardcoded timeout:** The 10000ms (10s) lock timeout appears to be hardcoded rather than configurable
5. **No automatic cleanup:** Nothing appears to detect stale locks (e.g., by checking whether the recorded PID is still alive)
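
Observation 3 is what signal semantics would predict: SIGKILL cannot be caught, so no exit hook in the killed process gets a chance to run. The sketch below reproduces this in isolation. It is a hypothetical demo, not OpenClaw code: `demoLockSurvivesSigkill`, the `DEMO_LOCK_PATH` variable, and the temp-directory lock file are all made up for illustration.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");
const { spawnSync } = require("node:child_process");

// Hypothetical demo: a child creates a lock file, registers cleanup hooks,
// and is then killed with SIGKILL. The hooks never run.
function demoLockSurvivesSigkill() {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "lock-demo-"));
  const lockPath = path.join(dir, "demo.jsonl.lock");

  const childSource = `
    const fs = require("node:fs");
    const lockPath = process.env.DEMO_LOCK_PATH;
    fs.writeFileSync(lockPath, String(process.pid));
    process.on("exit", () => fs.rmSync(lockPath, { force: true }));
    process.on("SIGTERM", () => process.exit(1));
    setInterval(() => {}, 1000); // keep the child alive until it is killed
  `;

  // spawnSync sends killSignal to the child once the timeout elapses.
  const result = spawnSync(process.execPath, ["-e", childSource], {
    env: { ...process.env, DEMO_LOCK_PATH: lockPath },
    timeout: 2000,
    killSignal: "SIGKILL",
  });

  const lockSurvived = fs.existsSync(lockPath);
  fs.rmSync(dir, { recursive: true, force: true }); // tidy up the demo directory
  return { signal: result.signal, lockSurvived };
}
```

Changing `killSignal` to `"SIGTERM"` should let the child's handler run and remove the file, which is why the choice of signal used by the supervisor matters here.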

## Log Evidence

### Repeated Lock Errors (40+ minutes):
```
# From 07:03 to 07:43 — same lock file blocks all attempts
# pid=22626: initial run
# pid=25358: subsequent run
# Both locks persisted until manually deleted
```

### Model Fallback Decisions (all failing on same lock):
```json
{
  "event": "model_fallback_decision",
  "runId": "cd6ed7c8-3cb3-47e2-a572-3c44c80d5ec0",
  "decision": "candidate_failed",
  "requestedModel": "kimi-k2.6:cloud",
  "candidateModel": "kimi-k2.6:cloud",
  "attempt": 1,
  "reason": "timeout",
  "errorPreview": "session file locked (timeout 10000ms): pid=25358 ...jsonl.lock"
}
```

## Workarounds Found

### Manual (User-Level):
```bash
# Must be done after EVERY stuck agent run
pkill -f "openclaw agent --agent coder"          # stop any leftover run first
rm -f ~/.openclaw/agents/coder/sessions/*.lock   # then remove the stale lock
# Then retry the agent run
```

### Issues with Workaround:
1. Must run BEFORE each new agent command (otherwise new command fails)
2. Loses session history for debugging
3. User must detect the stuck state manually
4. Not feasible for automated/scripted agent runs
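
Issue 3 (manual detection) could be reduced in scripted setups by parsing the error line and checking whether the reported PID still exists. The sketch below assumes the error text keeps the format quoted in this report; `parseLockError` and `lockOwnerIsGone` are hypothetical helper names, not part of OpenClaw.

```javascript
// Matches the error text shown in this report; the real format may differ between versions.
const LOCK_ERROR_PATTERN = /session file locked \(timeout (\d+)ms\): pid=(\d+) (.+?\.lock)/;

// Extracts the timeout, owner PID, and lock path from one error line, or returns null.
function parseLockError(line) {
  const match = LOCK_ERROR_PATTERN.exec(line);
  if (!match) return null;
  return {
    timeoutMs: Number(match[1]),
    pid: Number(match[2]),
    lockPath: match[3],
  };
}

// Returns true when no process with this PID exists, i.e. the lock is likely stale.
function lockOwnerIsGone(pid) {
  try {
    process.kill(pid, 0); // signal 0 only checks existence, nothing is delivered
    return false;
  } catch (err) {
    return err.code === "ESRCH"; // ESRCH: no such process
  }
}
```

A wrapper script could call `parseLockError` on the failed run's stderr and only remove the lock when `lockOwnerIsGone` returns true, instead of deleting every `*.lock` file unconditionally.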

## Expected Behavior
1. **Stale lock detection:** If PID in lock file is no longer alive, automatically remove the lock
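2. **Signal cleanup:** Register handlers for catchable signals (SIGTERM, SIGINT) to release the lock before the process exits. SIGKILL cannot be caught, so that case has to be covered by stale lock detection (item 1)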
3. **Lock timeout:** Configurable timeout (not hardcoded 10s), or at least attempt cleanup on timeout
4. **Per-run locks:** Each agent invocation should get its own lock, not share a single lock file

## Suggested Fix

### Option A: PID-based stale lock detection (Recommended)
```javascript
// Pseudocode for lock acquisition
if (lockFileExists()) {
  const lockPid = readLockFile();
  if (!isProcessAlive(lockPid)) {
    deleteLockFile(); // Stale lock, safe to remove
  } else {
    waitForLock(); // Real lock, wait
  }
}
```
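
A more concrete version of the pseudocode above might look like the following. This is a minimal sketch under stated assumptions, not OpenClaw's actual locking code: it assumes the lock file body is just the owner's PID as text, and `isProcessAlive` and `tryAcquireLock` are hypothetical names. It does not close every race (PID reuse, or two processes reclaiming the same stale lock at once).

```javascript
const fs = require("node:fs");

// Signal 0 performs the existence/permission check without delivering a signal.
function isProcessAlive(pid) {
  if (!Number.isInteger(pid) || pid <= 0) return false;
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but belongs to another user.
    return err.code === "EPERM";
  }
}

// Assumption: the lock file contains only the owner's PID.
// Returns true if this process now holds the lock, false if a live owner holds it.
function tryAcquireLock(lockPath) {
  for (let attempt = 0; attempt < 2; attempt++) {
    try {
      // The "wx" flag fails with EEXIST if the file already exists,
      // so only one process can create the lock.
      fs.writeFileSync(lockPath, String(process.pid), { flag: "wx" });
      return true;
    } catch (err) {
      if (err.code !== "EEXIST") throw err;
    }

    let ownerPid;
    try {
      ownerPid = Number.parseInt(fs.readFileSync(lockPath, "utf8"), 10);
    } catch (err) {
      if (err.code === "ENOENT") continue; // released in the meantime, retry
      throw err;
    }

    if (isProcessAlive(ownerPid)) return false; // real lock, caller should wait
    fs.rmSync(lockPath, { force: true }); // stale lock, remove and retry once
  }
  return false;
}
```

The caller would keep its existing wait-and-timeout loop around `tryAcquireLock`; the only behavioral change is that a lock whose owner no longer exists is reclaimed immediately instead of timing out.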

### Option B: Process signal handlers
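```javascript
// On SIGTERM/SIGINT only: SIGKILL cannot be caught, so it still needs Option A
for (const signal of ["SIGTERM", "SIGINT"]) {
  process.once(signal, () => { cleanupLockFile(); process.exit(1); });
}
```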

### Option C: Lock file with timestamp
```javascript
// Lock includes a timestamp and auto-expires after a configurable timeout,
// e.g., a lock that has not been refreshed for 30s is considered stale
```
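
One way to sketch Option C is to use the lock file's modification time as the timestamp. This is an illustration under assumptions rather than a proposed patch: `isLockExpired` and `refreshLock` are hypothetical names, and the approach only works if the lock owner refreshes the file more often than the expiry window, since agent runs in this report last well over 30s.

```javascript
const fs = require("node:fs");

// Hypothetical helper: a lock counts as expired when its mtime is older than maxAgeMs.
function isLockExpired(lockPath, maxAgeMs, now = Date.now()) {
  let stats;
  try {
    stats = fs.statSync(lockPath);
  } catch (err) {
    if (err.code === "ENOENT") return false; // no lock file, nothing to expire
    throw err;
  }
  return now - stats.mtimeMs > maxAgeMs;
}

// The lock owner would call this periodically (e.g., from a timer)
// so that a live run never looks expired.
function refreshLock(lockPath) {
  const now = new Date();
  fs.utimesSync(lockPath, now, now);
}
```

Compared with Option A, this avoids PID reuse problems but adds a heartbeat requirement, and a stalled (not dead) owner could lose its lock while still running.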

## Impact
- **Severity:** High. Completely blocks all agent functionality
- **Frequency:** Reproducible on every long-running (> 10min) agent
- **Affected Users:** Anyone using `openclaw agent` with timeout > 60s
- **Regression:** Likely introduced in recent session persistence feature

## Additional Context
- Also observed: `Gateway agent failed; falling back to embedded: Error: gateway timeout after 630000ms` — suggesting the gateway timeout (10.5 min) conflicts with agent run timeout
- When gateway restarts or sends heartbeat, it may trigger agent runs that conflict with existing long-running agents
- The SIGKILL from supervisor (OpenClaw issue #66359/#66399) exacerbates this — killed agents leave locks behind

## Related Issues
- SIGKILL instead of SIGTERM: OpenClaw #66359/#66399
- Gateway timeout: 630000ms (10.5 minutes) vs agent timeout

## Attachments
- [ ] Full OpenClaw log file (openclaw-2026-04-22.log)
- [ ] Session lock files (if preserved)
- [ ] openclaw.json configuration (sanitized)

---
*Reported by: Johannes Huijbregts via Echo assistant*
*Date: 2026-04-22*
*OpenClaw Version: v2026.4.15*


*Issue: #70004*