Session write locks leak when Gateway encounters unhandled promise rejections / streams, causing >30min deadlocks #49157

## Description

There is a fundamental flaw in how cquireSessionWriteLock and inspectLockPayload handle orphaned locks held by a continuously running process (e.g., the Gateway daemon).

## Steps to Reproduce (Conceptual)

1. A long-running Gateway process (PID A) acquires a session write lock (`*.jsonl.lock`) to append incoming data.
2. Due to an unhandled rejection, stream disconnect, or internal error within the `withFileLock` context, the promise chain breaks without reaching the `release()` handler in the `finally` block.
3. The `.lock` file remains on disk containing PID A.
4. Subsequent RPC calls or CLI commands (e.g., `openclaw agent --deliver`) attempt to access the session. They wait for `timeoutMs` (10s) and fail with: `Error: session file locked (timeout 10000ms): pid=A`.
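
One way step 2 can occur is sketched below, under the assumption that the wrapped operation never settles (for example, a stream that stalls without emitting an error). A `finally` block only runs once the awaited promise resolves or rejects, so a promise that stays pending keeps the lock held. `heldLocks`, `withLockSketch`, and `GATEWAY_PID` are hypothetical stand-ins for illustration, not the actual `withFileLock` implementation.

```typescript
// Hypothetical in-memory stand-in for the on-disk lock; names are illustrative only.
const GATEWAY_PID = 4242; // made-up PID standing in for "PID A"
const heldLocks = new Map<string, number>(); // lock path -> owning pid

async function withLockSketch<T>(lockPath: string, fn: () => Promise<T>): Promise<T> {
  if (heldLocks.has(lockPath)) {
    throw new Error(`session file locked: pid=${heldLocks.get(lockPath)}`);
  }
  heldLocks.set(lockPath, GATEWAY_PID); // "acquire"
  try {
    return await fn();
  } finally {
    heldLocks.delete(lockPath); // "release": runs only once fn() settles
  }
}

// A stalled stream modeled as a promise that never resolves or rejects.
const stalledStream = new Promise<void>(() => {});

// The call never settles, so the finally block never runs and the lock stays held.
void withLockSketch("session.jsonl.lock", () => stalledStream);
```

After this call, `heldLocks` still contains the entry for `session.jsonl.lock`, which mirrors the `.lock` file left on disk in step 3.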

## Root Cause

In `src/plugin-sdk/json-store.ts` (or the transpiled bundle):

- `DEFAULT_STALE_MS` is hardcoded to `1800 * 1000` (30 minutes).
- `inspectLockPayload` checks if the PID is alive (`isPidAlive(pid)`). Since PID A is the Gateway daemon, it is always `true`.
- Therefore, the `staleReasons` array does NOT include `"dead-pid"`.
- It will only include `"too-old"` if the lock is older than 30 minutes.
- **This means a single unhandled rejection that leaks a lock will effectively paralyze that specific session for 30 minutes** before `shouldReclaimContendedLockFile` finally allows the `rm` command to break the contention.
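
The decision described above can be approximated with the sketch below. It is a reconstruction from the behavior listed in this issue, not the code in `json-store.ts`: the payload shape, the function name `staleReasonsSketch`, and the injected `isPidAlive` callback are assumptions made so the example is self-contained.

```typescript
// Hypothetical payload shape and helper names; not copied from json-store.ts.
interface LockPayload {
  pid: number;
  createdAt: string; // ISO timestamp
}

const DEFAULT_STALE_MS = 1800 * 1000; // 30 minutes, as described above

function staleReasonsSketch(
  payload: LockPayload,
  nowMs: number,
  isPidAlive: (pid: number) => boolean,
  staleMs: number = DEFAULT_STALE_MS,
): string[] {
  const reasons: string[] = [];
  if (!isPidAlive(payload.pid)) reasons.push("dead-pid");
  if (nowMs - Date.parse(payload.createdAt) > staleMs) reasons.push("too-old");
  return reasons;
}

// A lock leaked 10 minutes ago by a daemon that is still running.
const now = Date.parse("2026-03-13T12:00:00Z");
const leaked: LockPayload = { pid: 4242, createdAt: "2026-03-13T11:50:00Z" };
const daemonAlive = () => true;

// 10 minutes old, owner alive: no stale reason, so the lock cannot be reclaimed.
const at10Min = staleReasonsSketch(leaked, now, daemonAlive);
// 31 minutes old: only now does "too-old" appear.
const at31Min = staleReasonsSketch(leaked, now + 21 * 60 * 1000, daemonAlive);
```

With a live owner, the only path to reclamation in this sketch is the age check, which is why the session stays blocked for the full 30 minutes.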

## Impact

- A single lock leak can paralyze a session for 30+ minutes
- Requires manual intervention (deleting lock files) to recover
- Affects all agents trying to access the locked session

## Suggested Fix

1. **Implement a Leased Lock (Heartbeat) System**: Replace the static "created at + 30m" rule with a lease for long-running processes. If a lock is genuinely held for a massive text generation, the process should `touch` the lock file (refreshing its mtime) every few seconds, so that a lock whose mtime has stopped advancing can be treated as stale.

2. **Reduce the `DEFAULT_STALE_MS` for session files**: 30 minutes is excessive for a chat session append operation. 2-5 minutes is a far safer boundary.

3. **Introduce an active release mechanism on global exception handlers**: Flush local `HELD_LOCKS` maps in the Node Gateway's global exception handlers.
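
A minimal sketch of the lease idea from fix 1, assuming staleness is judged by the lock file's mtime. The constants `LEASE_MS` and `HEARTBEAT_MS` and the helpers `makeHeartbeat` and `isLeaseExpired` are hypothetical and do not exist in the current code base. The file calls used (`fs.utimesSync`, `fs.statSync`) are standard Node.js APIs.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

const LEASE_MS = 15 * 1000;    // hypothetical: stale after 15s without a heartbeat
const HEARTBEAT_MS = 5 * 1000; // hypothetical: touch at most once every 5s

// Returns a function the lock holder calls whenever it makes progress.
// It refreshes the lock's mtime, throttled to one touch per minIntervalMs.
function makeHeartbeat(lockPath: string, minIntervalMs = HEARTBEAT_MS): (nowMs?: number) => boolean {
  let lastTouchMs = 0;
  return (nowMs = Date.now()) => {
    if (nowMs - lastTouchMs < minIntervalMs) return false; // throttled
    fs.utimesSync(lockPath, new Date(nowMs), new Date(nowMs));
    lastTouchMs = nowMs;
    return true;
  };
}

// A contender treats the lock as reclaimable once the lease has lapsed.
function isLeaseExpired(lockPath: string, nowMs = Date.now(), leaseMs = LEASE_MS): boolean {
  return nowMs - fs.statSync(lockPath).mtimeMs > leaseMs;
}

// Demo against a temporary lock file.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "lease-sketch-"));
const lockPath = path.join(dir, "session.jsonl.lock");
fs.writeFileSync(lockPath, JSON.stringify({ pid: process.pid }));

// Simulate a holder that stopped making progress one minute ago.
const minuteAgo = new Date(Date.now() - 60 * 1000);
fs.utimesSync(lockPath, minuteAgo, minuteAgo);
const expiredBefore = isLeaseExpired(lockPath);

const beat = makeHeartbeat(lockPath);
const touched = beat();   // holder reports progress, mtime is refreshed
const throttled = beat(); // second call within HEARTBEAT_MS is skipped
const expiredAfter = isLeaseExpired(lockPath);

fs.rmSync(dir, { recursive: true, force: true });
```

The heartbeat here is driven by the holder's own progress rather than a free-running timer. That is a deliberate choice in this sketch: a timer inside a live daemon would keep renewing the lease even when the operation holding the lock has stalled, which is the failure mode this issue describes.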

## Temporary Workaround

We created a cron job that runs every 5 minutes to clean stale lock files:

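```powershell
# ~/.openclaw/scripts/clean-session-locks.ps1
# Remove session lock files whose recorded createdAt is more than 5 minutes old.
Get-ChildItem "C:\Users\host\.openclaw\agents\*\sessions\*.lock" | ForEach-Object {
    $lock = Get-Content $_.FullName -Raw | ConvertFrom-Json
    # createdAt comes from the lock payload; -End defaults to the current time.
    $age = (New-TimeSpan -Start $lock.createdAt).TotalMinutes
    if ($age -gt 5) { Remove-Item $_.FullName -Force }
}
```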

## Environment

- OpenClaw Version: 2026.3.13
- Node.js Version: 24.13.1
- OS: Windows 11
