Bug Report: Agent Session Lock Not Released After Crash/SIGKILL
Summary
When an agent run crashes (SIGKILL) or times out, the session lock file (*.jsonl.lock) is NOT released. This prevents ALL subsequent agent runs from starting, regardless of which model is requested. The lock blocks the entire agent for ALL models in the fallback chain.
Environment
- OpenClaw Version: v2026.4.15 (also observed on v2026.4.14)
- OS: macOS 15.4.1 (Darwin 25.4.0 arm64)
- Node.js: v25.8.1
- Shell: zsh
- Host: Mac mini (Apple Silicon)
Configuration
"agents": {
  "defaults": {
    "model": {
      "primary": "ollama/glm-5.1:cloud",
      "fallbacks": [
        "ollama/qwen3.5:397b-cloud",
        "ollama/glm-5.1:cloud",
        "xai/grok-4-1-fast-reasoning",
        "anthropic/claude-haiku-4-5"
      ]
    }
  }
}
Steps to Reproduce
- Start a long-running agent:
openclaw agent --agent coder --message "complex task" --timeout 300
- While the agent is running (e.g., at 12+ minutes), send a new agent command OR the gateway sends a heartbeat check
- First agent gets SIGKILL'd by supervisor (timeout or new request)
- Lock file remains:
agents/coder/sessions/<uuid>.jsonl.lock
- All subsequent agent runs fail immediately with:
Error: session file locked (timeout 10000ms): pid=<old_pid> /path/to/<uuid>.jsonl.lock
- This cascades through ALL fallback models (5 attempts, all fail with same lock)
Observed Behavior
Error Pattern (repeated every ~10 seconds for 5 models):
session file locked (timeout 10000ms): pid=25358 /Users/botje/.openclaw/agents/coder/sessions/b68174ac-1d06-4f19-a8f7-f055b2fa51af.jsonl.lock
Full Fallback Chain Failure:
All models failed (5):
- ollama/kimi-k2.6:cloud: session file locked (timeout 10000ms): pid=25358 ...jsonl.lock (timeout)
- ollama/qwen3.5:397b-cloud: session file locked (timeout 10000ms): pid=25358 ...jsonl.lock (timeout)
- ollama/glm-5.1:cloud: session file locked (timeout 10000ms): pid=25358 ...jsonl.lock (timeout)
- xai/grok-4-1-fast: session file locked (timeout 10000ms): pid=25358 ...jsonl.lock (timeout)
- anthropic/claude-haiku-4-5: session file locked (timeout 10000ms): pid=25358 ...jsonl.lock (timeout)
Key Observations:
- Lock persists until manually deleted: the lock from pid=25358 (from a run at ~07:00) was still blocking runs at ~07:43 (40+ minutes later)
- Lock blocks ALL models: not just the original model; every fallback model fails on the SAME lock
- SIGKILL doesn't clean up: when the supervisor kills the process, the lock file remains on disk
- Hardcoded timeout: the 10000ms (10s) timeout appears to be hardcoded, not configurable
- No automatic cleanup: there appears to be no mechanism to detect stale locks (e.g., checking if the PID is still alive)
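The SIGKILL observation can be illustrated outside OpenClaw with a small Node.js sketch. It is a demonstration under assumptions, not OpenClaw's code: the function name `demoLockSurvivesSigkill`, the `LOCK_PATH` environment variable, and the lock file contents (a bare PID) are all hypothetical. A child process writes a lock file and registers an `exit` handler to remove it; the parent then kills it with SIGKILL, and the handler never runs.

```javascript
// Demonstration only: an 'exit' handler cannot clean up after SIGKILL.
// The lock path, env var name, and lock contents are hypothetical.
const { spawnSync } = require("node:child_process");
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

function demoLockSurvivesSigkill() {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "lock-demo-"));
  const lockPath = path.join(dir, "session.jsonl.lock");

  // Child: write its PID to the lock, register cleanup, then idle forever.
  const childScript = `
    const fs = require("node:fs");
    const lockPath = process.env.LOCK_PATH;
    fs.writeFileSync(lockPath, String(process.pid));
    process.on("exit", () => fs.rmSync(lockPath, { force: true }));
    setInterval(() => {}, 1000);
  `;

  // spawnSync kills the child with killSignal once the timeout elapses.
  const result = spawnSync(process.execPath, ["-e", childScript], {
    env: { ...process.env, LOCK_PATH: lockPath },
    timeout: 2000,
    killSignal: "SIGKILL",
  });

  const lockStillExists = fs.existsSync(lockPath);
  fs.rmSync(dir, { recursive: true, force: true }); // tidy up the demo directory
  return { signal: result.signal, lockStillExists };
}
```

Calling `demoLockSurvivesSigkill()` on a POSIX system should report that the child ended on SIGKILL while its lock file was still on disk, which matches the behavior described in this report.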
Log Evidence
Repeated Lock Errors (40+ minutes):
# From 07:03 to 07:43 — same lock file blocks all attempts
# pid=22626: initial run
# pid=25358: subsequent run
# Both locks persisted until manually deleted
Model Fallback Decisions (all failing on same lock):
{
  "event": "model_fallback_decision",
  "runId": "cd6ed7c8-3cb3-47e2-a572-3c44c80d5ec0",
  "decision": "candidate_failed",
  "requestedModel": "kimi-k2.6:cloud",
  "candidateModel": "kimi-k2.6:cloud",
  "attempt": 1,
  "reason": "timeout",
  "errorPreview": "session file locked (timeout 10000ms): pid=25358 ...jsonl.lock"
}
Workarounds Found
Manual (User-Level):
# Must be done after EVERY stuck agent run
rm -f ~/.openclaw/agents/coder/sessions/*.lock
pkill -f "openclaw agent --agent coder"
# Then retry agent run
Issues with Workaround:
- Must run BEFORE each new agent command (otherwise new command fails)
- Loses session history for debugging
- User must detect the stuck state manually
- Not feasible for automated/scripted agent runs
Expected Behavior
- Stale lock detection: If PID in lock file is no longer alive, automatically remove the lock
- Signal cleanup: Register SIGTERM/SIGINT handlers to clean up locks before the process terminates (SIGKILL cannot be caught, so a killed run still needs stale lock detection)
- Lock timeout: Configurable timeout (not hardcoded 10s), or at least attempt cleanup on timeout
- Per-run locks: Each agent invocation should get its own lock, not share a single lock file
Suggested Fix
Option A: PID-based stale lock detection (Recommended)
// Pseudocode for lock acquisition
if (lockFileExists()) {
  const lockPid = readLockFile();
  if (!isProcessAlive(lockPid)) {
    deleteLockFile(); // Stale lock, safe to remove
  } else {
    waitForLock(); // Real lock, wait
  }
}
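A runnable version of the Option A pseudocode might look like the following. This is a minimal sketch under assumptions: it assumes the lock file contains only the holder's PID as text (the real `*.jsonl.lock` format is not shown in this report), and `isProcessAlive` and `tryAcquireLock` are hypothetical helper names, not OpenClaw functions. Liveness is probed with `process.kill(pid, 0)`, which sends no signal and only checks whether the PID exists.

```javascript
const fs = require("node:fs");

// True if a process with this PID exists. Signal 0 only probes; nothing is sent.
function isProcessAlive(pid) {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but belongs to another user.
    return err.code === "EPERM";
  }
}

// Tries to take the lock, removing it first if its holder is dead.
// Returns true on success, false if a live process holds the lock.
function tryAcquireLock(lockPath) {
  for (let attempt = 0; attempt < 2; attempt++) {
    try {
      // "wx" fails with EEXIST if the file already exists (atomic create).
      fs.writeFileSync(lockPath, String(process.pid), { flag: "wx" });
      return true;
    } catch (err) {
      if (err.code !== "EEXIST") throw err;
    }

    let holder = NaN;
    try {
      holder = Number.parseInt(fs.readFileSync(lockPath, "utf8"), 10);
    } catch (err) {
      if (err.code !== "ENOENT") throw err; // lock vanished: just retry
    }

    if (Number.isInteger(holder) && holder > 0 && isProcessAlive(holder)) {
      return false; // real lock, caller should wait and retry
    }
    fs.rmSync(lockPath, { force: true }); // stale or unreadable: remove, retry
  }
  return false;
}
```

Two caveats apply to this approach in general: PIDs can be reused by unrelated processes, and two processes may race to remove the same stale lock. Storing a process start time alongside the PID is one common way to reduce the first risk.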
Option B: Process signal handlers
// On SIGTERM/SIGINT (SIGKILL cannot be caught, so that case needs Option A or C)
cleanupLockFile();
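Option B could be sketched in Node.js as below. The helper name `installLockCleanup` is hypothetical, and the exit codes follow the common shell convention of 128 plus the signal number. Note that this only covers signals a process can intercept; a supervisor that sends SIGKILL bypasses it entirely.

```javascript
const fs = require("node:fs");

// Registers lock cleanup for normal exit and for catchable signals.
// SIGKILL and SIGSTOP cannot be intercepted by any process.
function installLockCleanup(lockPath) {
  // force: true makes removal a no-op if the lock is already gone.
  const release = () => fs.rmSync(lockPath, { force: true });

  process.once("exit", release);
  for (const sig of ["SIGINT", "SIGTERM"]) {
    process.once(sig, () => {
      release();
      // Conventional exit codes: 128 + signal number (2 for INT, 15 for TERM).
      process.exit(sig === "SIGINT" ? 130 : 143);
    });
  }
  return release; // callers can also release explicitly when the run ends
}
```

If the supervisor can be changed to send SIGTERM first and SIGKILL only after a grace period, a handler like this gets a chance to run in the common case.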
Option C: Lock file with timestamp
// Lock includes timestamp, auto-expire after configurable timeout
// e.g., lock older than 30s is considered stale
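Option C could be sketched using the lock file's modification time as the timestamp. This is an assumption for illustration: the report does not say where a timestamp would be stored, and `isLockExpired` and `refreshLock` are hypothetical names. Because agent runs can last many minutes, a short expiry such as 30s only works if the holder refreshes the timestamp periodically while it is alive.

```javascript
const fs = require("node:fs");

// True if the lock's mtime is older than maxAgeMs.
// Returns false for a fresh lock or a missing lock file.
function isLockExpired(lockPath, maxAgeMs, now = Date.now()) {
  try {
    return now - fs.statSync(lockPath).mtimeMs > maxAgeMs;
  } catch (err) {
    if (err.code === "ENOENT") return false;
    throw err;
  }
}

// The lock holder calls this on an interval so a live lock never looks expired.
function refreshLock(lockPath) {
  const now = new Date();
  fs.utimesSync(lockPath, now, now);
}
```

A holder might call `refreshLock` every few seconds from a `setInterval` timer; once the process is killed the refreshes stop, and the lock ages past the configured limit.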
Impact
- High: Completely blocks all agent functionality
- Frequency: Reproducible on every long-running (> 10min) agent
- Affected Users: Anyone using openclaw agent with timeout > 60s
- Regression: Likely introduced in recent session persistence feature
Additional Context
- Gateway agent failed; falling back to embedded: Error: gateway timeout after 630000ms. This suggests the gateway timeout (10.5 min) conflicts with the agent run timeout.
Related Issues
Attachments
Reported by: Johannes Huijbregts via Echo assistant
Date: 2026-04-22
OpenClaw Version: v2026.4.15