fix: release stale session locks and add watchdog for hung API calls#18096
Merged
Conversation
…penclaw#18060) When a model API call hangs indefinitely (e.g. Anthropic quota exceeded mid-call), the gateway acquires a session .jsonl.lock but the promise never resolves, so the try/finally block never reaches release(). Since the owning PID is the gateway itself, stale detection cannot help — isPidAlive() always returns true. This commit adds four layers of defense: 1. **In-process lock watchdog** (session-write-lock.ts) - Track acquiredAt timestamp on each held lock - 60-second interval timer checks all held locks - Auto-releases any lock held longer than maxHoldMs (default 5 min) - Catches the hung-API-call case that try/finally cannot 2. **Gateway startup cleanup** (server-startup.ts) - On boot, scan all agent session directories for *.jsonl.lock files - Remove locks with dead PIDs or older than staleMs (30 min) - Log each cleaned lock for diagnostics 3. **openclaw doctor stale lock detection** (doctor-session-locks.ts) - New health check scans for .jsonl.lock files - Reports PID status and age of each lock found - In --fix mode, removes stale locks automatically 4. **Transcript error entry on API failure** (attempt.ts) - When promptError is set, write an error marker to the session transcript before releasing the lock - Preserves conversation history even on model API failures Closes openclaw#18060
5 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
When a model API call hangs indefinitely (e.g., Anthropic quota exceeded mid-call, network timeout that doesn't respect abort signals), the gateway acquires a session
.jsonl.lockbut the promise never resolves, so thetry/finallyblock never reachesrelease(). Since the owning PID is the gateway itself,isPidAlive()always returnstrue— stale detection cannot help. The session is permanently deadlocked until gateway restart.Changes
1. In-process lock watchdog (
session-write-lock.ts)acquiredAttimestamp on each held lockmaxHoldMs(default 5 min)2. Gateway startup cleanup (
server-startup.ts)*.jsonl.lockfilesstaleMs(30 min)3.
openclaw doctorstale lock detection (doctor-session-locks.ts).jsonl.lockfiles--fixmode, removes stale locks automatically4. Transcript error entry on API failure (
attempt.ts)promptErroris set, write an error marker to the session transcript before releasing the lockTesting
runLockWatchdogCheck)cleanStaleLockFilestsc --noEmitcleanRoot Cause Analysis
See the detailed analysis in issue #18060 comment.
The key insight: the existing
try/finallyinattempt.tsdoes release the lock on normal errors. The problem is specifically when the API call hangs indefinitely — theabort()mechanism fires butabortable()may not force the underlying HTTP promise to reject if the connection doesn't respect the abort signal. The watchdog is the safety net for this edge case.Closes #18060
Greptile Summary
This PR implements a multi-layered solution to prevent session deadlocks when model API calls hang indefinitely. The implementation adds an in-process watchdog timer that automatically releases locks held longer than 5 minutes, startup cleanup to remove stale lock files from crashed processes, and
openclaw doctorintegration for manual diagnostics. The code also improves error handling by persisting prompt transport errors to session transcripts.Key changes:
maxHoldMs(default 5 min).jsonl.lockfiles (dead PID or older than 30 min)openclaw doctorcommand now reports session lock status with--fixmode to clean stale locksImplementation quality:
current !== held) prevent stale release handles from affecting new locksreleasePromisepattern ensures cleanup operations are idempotent and wait for in-progress releasesHELD_LOCKSmap before file-based stale detection, preventing false positivesConfidence Score: 5/5
unref()to prevent blocking process exit, and all cleanup operations are best-effort with proper error handling. The changes are focused and don't modify core session logic beyond lock management.Last reviewed commit: a3c9e64
(4/5) You can add custom instructions or style guidelines for the agent here!