
fix: release stale session locks and add watchdog for hung API calls#18096

Merged: steipete merged 1 commit into openclaw:main from Grynn:fix/session-lock-deadlock-18060 on Feb 16, 2026
@Grynn (Contributor) commented on Feb 16, 2026

Summary

When a model API call hangs indefinitely (e.g., an Anthropic quota is exceeded mid-call, or a network timeout does not respect abort signals), the gateway has already acquired the session's .jsonl.lock, and because the promise never settles, the try/finally block never reaches release(). Because the owning PID is the gateway itself, isPidAlive() always returns true, so stale detection cannot help. The session stays deadlocked until the gateway restarts.

Changes

1. In-process lock watchdog (session-write-lock.ts)

  • Track acquiredAt timestamp on each held lock
  • 60-second interval timer checks all held locks
  • Auto-releases any lock held longer than maxHoldMs (default 5 min)
  • This is the primary fix — catches the hung-API-call case that try/finally cannot
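
A minimal sketch of how such a watchdog could look, for readers who have not opened the diff. Only the names runLockWatchdogCheck, HELD_LOCKS, acquiredAt, and maxHoldMs come from this PR; the HeldLock shape, trackLock, startLockWatchdog, and the synchronous release callback are assumptions made for illustration, not the code in session-write-lock.ts.

```typescript
// Hypothetical shapes; the real session-write-lock.ts may differ.
interface HeldLock {
  lockPath: string;
  acquiredAt: number; // epoch ms recorded when the lock was acquired
  release: () => void; // assumed synchronous here to keep the sketch short
}

const HELD_LOCKS = new Map<string, HeldLock>();
const DEFAULT_MAX_HOLD_MS = 5 * 60 * 1000; // 5 minutes
const WATCHDOG_INTERVAL_MS = 60 * 1000; // 60 seconds

// Hypothetical helper: register a lock together with its acquisition time.
function trackLock(lockPath: string, release: () => void, now: number = Date.now()): void {
  HELD_LOCKS.set(lockPath, { lockPath, acquiredAt: now, release });
}

// One watchdog pass: force-release every lock held longer than maxHoldMs.
// Returns the released paths so the caller can log them.
function runLockWatchdogCheck(
  now: number = Date.now(),
  maxHoldMs: number = DEFAULT_MAX_HOLD_MS,
): string[] {
  const released: string[] = [];
  for (const [lockPath, held] of Array.from(HELD_LOCKS.entries())) {
    if (now - held.acquiredAt > maxHoldMs) {
      HELD_LOCKS.delete(lockPath);
      held.release();
      released.push(lockPath);
    }
  }
  return released;
}

// Hypothetical starter: run a pass every 60 seconds.
function startLockWatchdog(): ReturnType<typeof setInterval> {
  const timer = setInterval(() => runLockWatchdogCheck(), WATCHDOG_INTERVAL_MS);
  timer.unref(); // Node.js: the timer alone should not keep the process alive
  return timer;
}
```

Passing `now` as a parameter keeps the check testable without fake timers, which is presumably why the check is exposed as its own function.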

2. Gateway startup cleanup (server-startup.ts)

  • On boot, scan all agent session directories for *.jsonl.lock files
  • Remove locks with dead PIDs or older than staleMs (30 min)
  • Log each cleaned lock for diagnostics
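
The startup scan can be sketched roughly as follows. The function name cleanStaleLockFiles and the 30-minute staleMs come from this PR; the lock payload format (JSON with pid and createdAt), the mtime fallback for locks without metadata, and the single flat directory are assumptions for illustration only.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// ASSUMPTION: a lock file holds JSON like {"pid": 1234, "createdAt": "<ISO timestamp>"}.
interface LockPayload {
  pid?: number;
  createdAt?: string;
}

function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0); // signal 0 only checks that the process exists
    return true;
  } catch (err) {
    // EPERM: the process exists but belongs to another user
    return (err as { code?: string }).code === "EPERM";
  }
}

function cleanStaleLockFiles(
  sessionsDir: string,
  staleMs: number = 30 * 60 * 1000,
  now: number = Date.now(),
): string[] {
  const removed: string[] = [];
  for (const name of fs.readdirSync(sessionsDir)) {
    if (!name.endsWith(".jsonl.lock")) continue;
    const lockPath = path.join(sessionsDir, name);

    let payload: LockPayload = {};
    try {
      const parsed: unknown = JSON.parse(fs.readFileSync(lockPath, "utf8"));
      if (parsed !== null && typeof parsed === "object") payload = parsed as LockPayload;
    } catch {
      // Unreadable or non-JSON lock: treat it as having no metadata.
    }

    const createdAt = payload.createdAt ? Date.parse(payload.createdAt) : NaN;
    // ASSUMPTION: fall back to the file's mtime when createdAt is missing.
    const ageMs = Number.isNaN(createdAt)
      ? now - fs.statSync(lockPath).mtimeMs
      : now - createdAt;
    const deadPid = typeof payload.pid === "number" && !isPidAlive(payload.pid);

    if (deadPid || ageMs > staleMs) {
      fs.rmSync(lockPath, { force: true });
      removed.push(lockPath);
      console.warn(`removed stale session lock: ${lockPath}`);
    }
  }
  return removed;
}
```

A real implementation would run this once per agent session directory at boot, before any session acquires a new lock.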

3. openclaw doctor stale lock detection (doctor-session-locks.ts)

  • New health check scans for .jsonl.lock files
  • Reports PID status and age of each lock found
  • In --fix mode, removes stale locks automatically
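
To make the reporting behaviour concrete, here is a small sketch of a health check that reports PID status and age and removes stale locks only when a fix flag is set. The LockInfo and DoctorOptions shapes, the function name checkSessionLocks, and the report wording are all hypothetical; the actual doctor-session-locks.ts output is not shown in this PR description.

```typescript
// Hypothetical input: one record per *.jsonl.lock file already found on disk.
interface LockInfo {
  path: string;
  pid: number;
  pidAlive: boolean;
  ageMs: number;
}

interface DoctorOptions {
  fix: boolean; // mirrors the --fix flag
  staleMs: number;
  remove: (lockPath: string) => void; // injected so the check stays side-effect free in tests
}

// Builds one report line per lock; removes stale locks only in fix mode.
function checkSessionLocks(locks: LockInfo[], opts: DoctorOptions): string[] {
  const lines: string[] = [];
  for (const lock of locks) {
    const stale = !lock.pidAlive || lock.ageMs > opts.staleMs;
    const pidLabel = `pid ${lock.pid} ${lock.pidAlive ? "alive" : "dead"}`;
    const ageLabel = `age ${Math.round(lock.ageMs / 60000)}m`;
    let line = `${lock.path}: ${pidLabel}, ${ageLabel}, ${stale ? "stale" : "ok"}`;
    if (stale && opts.fix) {
      opts.remove(lock.path);
      line += " (removed)";
    }
    lines.push(line);
  }
  return lines;
}
```

Without the fix flag the check only reports, so running the doctor command is safe on a live gateway.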

4. Transcript error entry on API failure (attempt.ts)

  • When promptError is set, write an error marker to the session transcript before releasing the lock
  • Preserves conversation history even on model API failures
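
The ordering matters here: the error marker is written while the lock is still held, and the lock is released afterwards. A sketch of that ordering, assuming a hypothetical entry shape and injected dependencies (AttemptDeps, runAttempt, and the "prompt-error" custom type are illustrative names, not taken from attempt.ts):

```typescript
// Hypothetical entry shape and dependency names; the real attempt.ts may differ.
interface TranscriptEntry {
  type: "custom";
  customType: string;
  data: { message: string };
  timestamp: string;
}

interface AttemptDeps {
  acquireLock: () => Promise<() => Promise<void>>; // resolves to a release function
  callModel: () => Promise<string>;
  appendToTranscript: (entry: TranscriptEntry) => void;
}

async function runAttempt(deps: AttemptDeps): Promise<string | undefined> {
  const release = await deps.acquireLock();
  let promptError: unknown;
  let reply: string | undefined;
  try {
    try {
      reply = await deps.callModel();
    } catch (err) {
      promptError = err;
    }
    if (promptError !== undefined) {
      // Record the failure before releasing, so the write happens under the lock.
      deps.appendToTranscript({
        type: "custom",
        customType: "prompt-error",
        data: { message: promptError instanceof Error ? promptError.message : String(promptError) },
        timestamp: new Date().toISOString(),
      });
    }
  } finally {
    await release();
  }
  return reply;
}
```

Note that this path only runs when the model call actually rejects; a call that never settles still needs the watchdog.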

Testing

  • Added unit tests for the watchdog logic (runLockWatchdogCheck)
  • Added unit tests for cleanStaleLockFiles
  • Added test for doctor session lock health check
  • All existing tests pass, tsc --noEmit clean

Root Cause Analysis

See the detailed analysis in the comment on issue #18060.

The key insight: the existing try/finally in attempt.ts does release the lock on normal errors. The problem is specifically when the API call hangs indefinitely — the abort() mechanism fires but abortable() may not force the underlying HTTP promise to reject if the connection doesn't respect the abort signal. The watchdog is the safety net for this edge case.
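
The difference between a failing call and a hung call can be reproduced in a few lines. This is a standalone illustration of the JavaScript semantics, not code from the gateway; attemptWithLock, hungCall, and failingCall are hypothetical names.

```typescript
interface LockState {
  released: boolean;
}

// Mirrors the try/finally shape: the lock is released only once the awaited promise settles.
async function attemptWithLock(
  state: LockState,
  callModel: () => Promise<unknown>,
): Promise<void> {
  try {
    await callModel();
  } finally {
    state.released = true;
  }
}

// A call that rejects: finally runs and the lock is released.
const failingCall = (): Promise<never> => Promise.reject(new Error("quota exceeded"));

// A call that never settles: the await never returns, so finally never runs.
const hungCall = (): Promise<never> => new Promise<never>(() => {});
```

With failingCall the state flips to released as soon as the rejection propagates. With hungCall it stays unreleased no matter how long the process waits, which is the case a timer-based watchdog is meant to cover.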

Closes #18060

Greptile Summary

This PR implements a multi-layered solution to prevent session deadlocks when model API calls hang indefinitely. The implementation adds an in-process watchdog timer that automatically releases locks held longer than 5 minutes, startup cleanup to remove stale lock files from crashed processes, and openclaw doctor integration for manual diagnostics. The code also improves error handling by persisting prompt transport errors to session transcripts.

Key changes:

  • Watchdog timer checks in-process locks every 60 seconds and force-releases any held longer than maxHoldMs (default 5 min)
  • Gateway startup scans for and removes stale .jsonl.lock files (dead PID or older than 30 min)
  • openclaw doctor command now reports session lock status with --fix mode to clean stale locks
  • Prompt errors are now written to transcripts as custom entries before lock release
  • Comprehensive test coverage for watchdog, cleanup, and edge cases

Implementation quality:

  • Proper object identity checks (current !== held) prevent stale release handles from affecting new locks
  • releasePromise pattern ensures cleanup operations are idempotent and wait for in-progress releases
  • Stale detection correctly distinguishes between dead PIDs, missing metadata, and age-based staleness
  • Lock acquisition checks in-process HELD_LOCKS map before file-based stale detection, preventing false positives
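
The first two points can be illustrated with a small sketch. The identity check (current !== held), the releasePromise field, and the HELD_LOCKS map are named in the review above; everything else (acquireLock, releaseHeld, the injected removeFile, and the absence of real file creation) is an assumption made to keep the example self-contained.

```typescript
// Hypothetical shapes; file creation and stale detection are omitted.
interface HeldLock {
  lockPath: string;
  acquiredAt: number;
  releasePromise?: Promise<void>;
}

type RemoveFile = (lockPath: string) => Promise<void>;

const HELD_LOCKS = new Map<string, HeldLock>();

async function doRelease(held: HeldLock, removeFile: RemoveFile): Promise<void> {
  try {
    await removeFile(held.lockPath);
  } finally {
    // Only clear the map entry if it still refers to this exact lock object.
    if (HELD_LOCKS.get(held.lockPath) === held) HELD_LOCKS.delete(held.lockPath);
  }
}

function releaseHeld(held: HeldLock, removeFile: RemoveFile): Promise<void> {
  const current = HELD_LOCKS.get(held.lockPath);
  if (current !== held) {
    // Stale handle: this lock was already released (or force-released and re-acquired).
    return held.releasePromise ?? Promise.resolve();
  }
  if (!held.releasePromise) {
    held.releasePromise = doRelease(held, removeFile);
  }
  return held.releasePromise; // concurrent callers wait on the same release
}

function acquireLock(lockPath: string, removeFile: RemoveFile): { release: () => Promise<void> } {
  const held: HeldLock = { lockPath, acquiredAt: Date.now() };
  HELD_LOCKS.set(lockPath, held);
  return { release: () => releaseHeld(held, removeFile) };
}
```

The identity comparison is what makes a watchdog force-release safe: if the original caller eventually wakes up and calls its old release handle, it cannot remove a lock that a later attempt has since acquired for the same path.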

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The implementation is well-designed with comprehensive safeguards. Tests verify critical edge cases including watchdog force-release, stale handle behavior, and startup cleanup. The code correctly handles race conditions through object identity checks and idempotent release operations. The watchdog timer uses unref() to prevent blocking process exit, and all cleanup operations are best-effort with proper error handling. The changes are focused and don't modify core session logic beyond lock management.
  • No files require special attention

Last reviewed commit: a3c9e64


@openclaw-barnacle (bot) added the labels gateway (Gateway runtime), commands (Command implementations), agents (Agent runtime and tooling), and size: L on Feb 16, 2026
@steipete steipete merged commit e91a5b0 into openclaw:main Feb 16, 2026
28 checks passed
@Grynn Grynn deleted the fix/session-lock-deadlock-18060 branch February 16, 2026 23:12

Linked issue: Gateway deadlocks session when model API call fails (stale .jsonl.lock)
