Description
There is a fundamental flaw in how acquireSessionWriteLock and inspectLockPayload handle orphaned locks held by a continuously running process (e.g., the Gateway daemon).
Steps to Reproduce (Conceptual)
- A long-running Gateway process (PID A) acquires a session write lock (*.jsonl.lock) to append incoming data.
- Due to an unhandled rejection, stream disconnect, or internal error within the withFileLock context, the promise chain breaks without reaching the release() handler in the finally block.
- The .lock file remains on disk containing PID A.
- Subsequent RPC calls or CLI commands (e.g., openclaw agent --deliver) attempt to access the session. They wait for timeoutMs (10s) and fail with: Error: session file locked (timeout 10000ms): pid=A.
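
The following is a minimal sketch of one way such a leak can occur, not the actual withFileLock implementation. All names in it (withLock, locksOnDisk, the lock paths) are hypothetical, and an in-memory set stands in for lock files on disk. It assumes the lock is released in a finally block, which only runs once the wrapped promise settles:

```typescript
// In-memory stand-in for lock files on disk; all names here are hypothetical.
const locksOnDisk = new Set<string>();

async function withLock<T>(lockPath: string, task: () => Promise<T>): Promise<T> {
  locksOnDisk.add(lockPath); // "acquire"
  try {
    return await task();
  } finally {
    locksOnDisk.delete(lockPath); // "release": runs only once task() settles
  }
}

// A task that rejects still releases its lock, because finally runs.
const rejected = withLock("a.jsonl.lock", async () => {
  throw new Error("append failed");
}).catch(() => "released");

// A task whose promise never settles (for example, a stream that disconnects
// without emitting an error) never reaches finally, so its lock stays held.
const hung = withLock("b.jsonl.lock", () => new Promise<void>(() => {}));
```

In this sketch the rejected task cleans up after itself, while the hung task holds b.jsonl.lock for as long as the process lives. That matches the symptom reported here: the holder's PID stays alive and the lock is never released.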
Root Cause
In src/plugin-sdk/json-store.ts (or the transpiled bundle):
- DEFAULT_STALE_MS is hardcoded to 1800 * 1000 (30 minutes).
- inspectLockPayload checks whether the PID is alive (isPidAlive(pid)). Because PID A is the Gateway daemon, which keeps running, this check always returns true.
- Therefore, the staleReasons array does NOT include "dead-pid".
- It will only include "too-old" if the lock is older than 30 minutes.
- This means a single unhandled rejection that leaks a lock will effectively paralyze that specific session for 30 minutes before shouldReclaimContendedLockFile finally allows the rm command to break the contention.
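
The staleness decision described above can be sketched as follows. This is an illustration modeled on the description in this issue, not the code from json-store.ts: the function name staleReasonsFor, the LockPayload shape, and the example PID and timestamps are assumptions.

```typescript
// Hypothetical shapes and names; the real inspectLockPayload may differ.
interface LockPayload {
  pid: number;
  createdAt: string; // ISO-8601 timestamp
}

type StaleReason = "dead-pid" | "too-old";

const DEFAULT_STALE_MS = 1800 * 1000; // 30 minutes, as reported above

function staleReasonsFor(
  payload: LockPayload,
  nowMs: number,
  isPidAlive: (pid: number) => boolean,
  staleMs: number = DEFAULT_STALE_MS,
): StaleReason[] {
  const reasons: StaleReason[] = [];
  if (!isPidAlive(payload.pid)) reasons.push("dead-pid");
  if (nowMs - Date.parse(payload.createdAt) > staleMs) reasons.push("too-old");
  return reasons;
}

// A lock leaked by a daemon that is still running.
const createdAt = "2026-03-13T12:00:00.000Z";
const createdMs = Date.parse(createdAt);
const leaked: LockPayload = { pid: 4242, createdAt };
const gatewayAlive = () => true; // the Gateway never exits

// After 10 minutes there is no reason to reclaim the lock.
const reasonsAt10Min = staleReasonsFor(leaked, createdMs + 10 * 60 * 1000, gatewayAlive);
// Only after the 30-minute threshold does "too-old" appear.
const reasonsAt31Min = staleReasonsFor(leaked, createdMs + 31 * 60 * 1000, gatewayAlive);
```

With a live holder, the only path to reclamation in this sketch is the age threshold, so the length of the outage is set entirely by DEFAULT_STALE_MS.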
Impact
- A single lock leak can paralyze a session for 30+ minutes
- Requires manual intervention (deleting lock files) to recover
- Affects all agents trying to access the locked session
Suggested Fix
- Implement a leased lock (heartbeat) system instead of a static "created at + 30m" cutoff for long-running processes. If a lock is genuinely held for a long text generation, the holder should touch the mtime of the lock file every few seconds, so that a lock whose mtime stops advancing can be treated as stale.
- Reduce DEFAULT_STALE_MS for session files: 30 minutes is excessive for a chat session append operation; 2-5 minutes is a far safer boundary.
- Introduce an active release mechanism: flush the local HELD_LOCKS map in the Node Gateway's global exception handlers.
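
A rough sketch of the first and third suggestions, under stated assumptions: the Lease shape, the 15-second TTL, and the function names are hypothetical, and the map of release callbacks is only a stand-in for HELD_LOCKS, whose real structure is not shown in this issue.

```typescript
// Hypothetical lease record; field names are illustrative.
interface Lease {
  pid: number;
  renewedAtMs: number; // time of the last heartbeat, e.g. the lock file's mtime
}

const LEASE_TTL_MS = 15 * 1000; // assumed: stale after 15s without a heartbeat

// Called by the lock holder every few seconds while it is doing real work.
function renewLease(lease: Lease, nowMs: number): Lease {
  return { ...lease, renewedAtMs: nowMs };
}

// Called by a waiter: a live PID is not enough, the lease must also be fresh.
function isLeaseExpired(lease: Lease, nowMs: number, ttlMs: number = LEASE_TTL_MS): boolean {
  return nowMs - lease.renewedAtMs > ttlMs;
}

// Stand-in for a HELD_LOCKS map: lock path -> release callback.
function releaseAllHeldLocks(locks: Map<string, () => void>): number {
  let released = 0;
  locks.forEach((release) => {
    try {
      release();
      released++;
    } catch (_err) {
      // Keep going: one failing release should not block the others.
    }
  });
  locks.clear();
  return released;
}

const original: Lease = { pid: 4242, renewedAtMs: 0 };
const renewed = renewLease(original, 10 * 1000);

const freshAt20s = isLeaseExpired(renewed, 20 * 1000);  // 10s since last heartbeat
const staleAt20s = isLeaseExpired(original, 20 * 1000); // 20s since last heartbeat
```

In a Node process, the heartbeat could plausibly be a setInterval that updates the lock file's mtime (for example with fs.utimes), with waiters comparing the mtimeMs from fs.stat against the TTL. A function like releaseAllHeldLocks could be registered on the process-level uncaughtException and unhandledRejection events. Whether that fits the Gateway's existing handlers is not something this issue establishes.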
Temporary Workaround
We created a cron job that runs every 5 minutes to clean stale lock files:
```powershell
# ~/.openclaw/scripts/clean-session-locks.ps1
# Remove session lock files whose createdAt timestamp is more than 5 minutes old.
Get-ChildItem "C:\Users\host\.openclaw\agents\*\sessions\*.lock" | ForEach-Object {
    $lock = Get-Content $_.FullName -Raw | ConvertFrom-Json
    $age = (New-TimeSpan -Start $lock.createdAt).TotalMinutes
    if ($age -gt 5) { Remove-Item $_.FullName -Force }
}
```
Environment
- OpenClaw Version: 2026.3.13
- Node.js Version: 24.13.1
- OS: Windows 11