Skip to content

Gateway deadlocks session when model API call fails (stale .jsonl.lock) #18060

@Grynn

Description

@Grynn

Summary

When a model API call fails (e.g., Anthropic quota exceeded, 429 rate limit, timeout), the gateway acquires a session .jsonl.lock but never releases it. This deadlocks all subsequent messages to that session, requiring a full gateway restart.

Reproduction

  1. Start a session via Telegram using a model provider with limited quota (e.g., Anthropic)
  2. Send messages until the provider quota is exceeded mid-API-call
  3. The API call hangs/fails, but the .jsonl.lock is never released
  4. All subsequent messages to the session block on the stale lock → gateway timeout (60s)
  5. openclaw doctor reports "Main session transcript missing"
  6. Only openclaw gateway restart resolves the issue

Evidence

  • a7de8453-...-a7f0240e29cd.jsonl.lock existed without a corresponding .jsonl transcript file
  • A second stale lock fbf913ec-...-4aef61a514f0.jsonl.lock from 3 weeks earlier (Jan 26) — same pattern, showing this is recurring
  • Gateway process (PID 1242800) ran for 10+ hours but did not respond to WebSocket messages for the affected session
  • Cron sessions continued to work fine (different session keys, not blocked by the lock)
  • After gateway restart, agent responded immediately

Observed session metadata

{
  "sessionId": "a7de8453-d74e-4138-b14a-a7f0240e29cd",
  "model": "claude-opus-4.6",
  "modelProvider": "anthropic",
  "providerOverride": "github-copilot",
  "inputTokens": 3,
  "outputTokens": 49,
  "abortedLastRun": false,
  "sessionFile": "...sessions/a7de8453-...cd.jsonl"  // file did not exist
}

Expected behavior

  • Lock should be released on API error (try/finally or equivalent)
  • Error entry should be written to the transcript
  • User should receive an error message (e.g., "Model API quota exceeded, try again later")
  • Session should remain functional after the error

Suggested fixes

  1. Wrap lock acquire + API call + transcript write in try/finally to always release lock
  2. Add lock timeout (e.g., 120s) as a safety net for unexpected hangs
  3. On gateway startup, clean stale .jsonl.lock files (check if holding PID is dead)
  4. Write error entry to transcript on API failure instead of silently hanging
  5. openclaw doctor should detect and offer to remove stale lock files (PID check)

Environment

  • OpenClaw 2026.2.13 (203b5bd)
  • Model provider: Anthropic (claude-opus-4-6)
  • Channel: Telegram
  • OS: Ubuntu (Proxmox KVM guest)

🦞 Diagnosed with help from Claude Code

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions