Summary
When a model API call fails (e.g., Anthropic quota exceeded, 429 rate limit, timeout), the gateway acquires a session .jsonl.lock but never releases it. This deadlocks all subsequent messages to that session, requiring a full gateway restart.
Reproduction
- Start a session via Telegram using a model provider with limited quota (e.g., Anthropic)
- Send messages until the provider quota is exceeded mid-API-call
- The API call hangs/fails, but the
.jsonl.lock is never released
- All subsequent messages to the session block on the stale lock → gateway timeout (60s)
openclaw doctor reports "Main session transcript missing"
- Only
openclaw gateway restart resolves the issue
Evidence
a7de8453-...-a7f0240e29cd.jsonl.lock existed without a corresponding .jsonl transcript file
- A second stale lock
fbf913ec-...-4aef61a514f0.jsonl.lock from 3 weeks earlier (Jan 26) — same pattern, showing this is recurring
- Gateway process (PID 1242800) ran for 10+ hours but did not respond to WebSocket messages for the affected session
- Cron sessions continued to work fine (different session keys, not blocked by the lock)
- After gateway restart, agent responded immediately
Observed session metadata
{
"sessionId": "a7de8453-d74e-4138-b14a-a7f0240e29cd",
"model": "claude-opus-4.6",
"modelProvider": "anthropic",
"providerOverride": "github-copilot",
"inputTokens": 3,
"outputTokens": 49,
"abortedLastRun": false,
"sessionFile": "...sessions/a7de8453-...cd.jsonl" // file did not exist
}
Expected behavior
- Lock should be released on API error (try/finally or equivalent)
- Error entry should be written to the transcript
- User should receive an error message (e.g., "Model API quota exceeded, try again later")
- Session should remain functional after the error
Suggested fixes
- Wrap lock acquire + API call + transcript write in try/finally to always release lock
- Add lock timeout (e.g., 120s) as a safety net for unexpected hangs
- On gateway startup, clean stale
.jsonl.lock files (check if holding PID is dead)
- Write error entry to transcript on API failure instead of silently hanging
openclaw doctor should detect and offer to remove stale lock files (PID check)
Environment
- OpenClaw 2026.2.13 (203b5bd)
- Model provider: Anthropic (claude-opus-4-6)
- Channel: Telegram
- OS: Ubuntu (Proxmox KVM guest)
🦞 Diagnosed with help from Claude Code
Summary
When a model API call fails (e.g., Anthropic quota exceeded, 429 rate limit, timeout), the gateway acquires a session
.jsonl.lockbut never releases it. This deadlocks all subsequent messages to that session, requiring a full gateway restart.Reproduction
.jsonl.lockis never releasedopenclaw doctorreports "Main session transcript missing"openclaw gateway restartresolves the issueEvidence
a7de8453-...-a7f0240e29cd.jsonl.lockexisted without a corresponding.jsonltranscript filefbf913ec-...-4aef61a514f0.jsonl.lockfrom 3 weeks earlier (Jan 26) — same pattern, showing this is recurringObserved session metadata
{ "sessionId": "a7de8453-d74e-4138-b14a-a7f0240e29cd", "model": "claude-opus-4.6", "modelProvider": "anthropic", "providerOverride": "github-copilot", "inputTokens": 3, "outputTokens": 49, "abortedLastRun": false, "sessionFile": "...sessions/a7de8453-...cd.jsonl" // file did not exist }Expected behavior
Suggested fixes
.jsonl.lockfiles (check if holding PID is dead)openclaw doctorshould detect and offer to remove stale lock files (PID check)Environment
🦞 Diagnosed with help from Claude Code