Session write locks leak when Gateway encounters unhandled promise rejections / streams, causing >30min deadlocks #49157

@zhangxue1985122219

Description

There is a fundamental flaw in how acquireSessionWriteLock and inspectLockPayload handle orphaned locks held by a continuously running process (e.g., the Gateway daemon).

Steps to Reproduce (Conceptual)

  1. A long-running Gateway process (PID A) acquires a session write lock (*.jsonl.lock) to append incoming data.
  2. Due to an unhandled rejection, stream disconnect, or internal error within the withFileLock context, the promise chain breaks without reaching the release() handler in the finally block.
  3. The .lock file remains on disk containing PID A.
  4. Subsequent RPC calls or CLI commands (e.g., openclaw agent --deliver) attempt to access the session. They wait for timeoutMs (10s) and fail with: Error: session file locked (timeout 10000ms): pid=A.

Root Cause

In src/plugin-sdk/json-store.ts (or the transpiled bundle):

  • DEFAULT_STALE_MS is hardcoded to 1800 * 1000 (30 minutes).
  • inspectLockPayload checks whether the PID is alive (isPidAlive(pid)). Since PID A is the Gateway daemon, which keeps running, the check always returns true.
  • Therefore, the staleReasons array does NOT include "dead-pid".
  • It will only include "too-old" if the lock is older than 30 minutes.
  • This means a single unhandled rejection that leaks a lock will effectively paralyze that specific session for 30 minutes before shouldReclaimContendedLockFile finally allows the rm command to break the contention.
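
To make the reasoning above concrete, here is a minimal sketch of the staleness decision as described in this issue. It is not the actual code from json-store.ts: the payload shape, the function signature, and the name inspectLockPayloadSketch are assumptions for illustration. Only DEFAULT_STALE_MS, isPidAlive, "dead-pid" and "too-old" come from the report.

```typescript
// Hypothetical reconstruction of the check described above, not the real source.
const DEFAULT_STALE_MS = 1800 * 1000; // 30 minutes, as reported in this issue

// Assumed payload shape: the lock file is said to contain the PID, and the
// workaround script reads a createdAt field.
interface LockPayload {
  pid: number;
  createdAt: string; // ISO timestamp
}

type StaleReason = "dead-pid" | "too-old";

function inspectLockPayloadSketch(
  payload: LockPayload,
  nowMs: number,
  isPidAlive: (pid: number) => boolean,
  staleMs: number = DEFAULT_STALE_MS,
): StaleReason[] {
  const reasons: StaleReason[] = [];
  if (!isPidAlive(payload.pid)) reasons.push("dead-pid");
  const ageMs = nowMs - Date.parse(payload.createdAt);
  if (ageMs > staleMs) reasons.push("too-old");
  return reasons;
}

// A leaked lock whose owner (the Gateway daemon) is still running.
const gatewayIsAlive = (): boolean => true;
const leaked: LockPayload = { pid: 4242, createdAt: "2026-03-13T10:00:00.000Z" };
const t0 = Date.parse(leaked.createdAt);

// 10 minutes after the leak: no stale reasons, so the lock cannot be reclaimed.
const after10Min = inspectLockPayloadSketch(leaked, t0 + 10 * 60 * 1000, gatewayIsAlive);
// 31 minutes after the leak: only now does "too-old" appear.
const after31Min = inspectLockPayloadSketch(leaked, t0 + 31 * 60 * 1000, gatewayIsAlive);
console.log(after10Min, after31Min);
```

Under these assumptions, the liveness check can never help when the leaking process is the daemon itself, so the age threshold is the only path to recovery.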

Impact

  • A single lock leak can paralyze a session for 30+ minutes
  • Requires manual intervention (deleting lock files) to recover
  • Affects all agents trying to access the locked session

Suggested Fix

  1. Implement a Leased Lock (Heartbeat) System: Replace the static "created at + 30m" rule for long-running processes with a lease. If a lock is genuinely held for a massive text generation, the process should touch the mtime of the lock every few seconds, so a lock whose mtime stops advancing can be treated as stale much sooner.

  2. Reduce the DEFAULT_STALE_MS for session files: 30 minutes is excessive for a chat session append operation. 2-5 minutes is a far safer boundary.

  3. Introduce an active release mechanism on global exception handlers: Flush local HELD_LOCKS maps in the Node Gateway's global exception handlers.
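
A rough sketch of how suggestions 1 and 3 could fit together, under stated assumptions: the lease length, heartbeat interval, and every function name below are hypothetical, and HELD_LOCKS is modeled here as a map from lock path to heartbeat timer, which may not match the real structure. It only uses standard Node.js fs and process calls.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Assumed values for illustration only.
const LEASE_MS = 30 * 1000; // lock counts as stale if not renewed for 30s
const HEARTBEAT_MS = 5 * 1000; // holder renews every 5s

// Hypothetical shape: lock path -> heartbeat timer.
const HELD_LOCKS = new Map<string, ReturnType<typeof setInterval>>();

// Contenders judge staleness from the last renewal (mtime), not from createdAt.
function isLeaseExpired(mtimeMs: number, nowMs: number, leaseMs: number = LEASE_MS): boolean {
  return nowMs - mtimeMs > leaseMs;
}

// Renew the lease by bumping the lock file's atime and mtime.
function touchLease(lockPath: string, now: Date = new Date()): void {
  fs.utimesSync(lockPath, now, now);
}

function startHeartbeat(lockPath: string, intervalMs: number = HEARTBEAT_MS): void {
  const timer = setInterval(() => {
    try {
      touchLease(lockPath);
    } catch {
      // Lock file is gone; nothing left to renew.
    }
  }, intervalMs);
  HELD_LOCKS.set(lockPath, timer);
}

function releaseLock(lockPath: string): void {
  const timer = HELD_LOCKS.get(lockPath);
  if (timer !== undefined) clearInterval(timer);
  HELD_LOCKS.delete(lockPath);
  fs.rmSync(lockPath, { force: true });
}

// Suggestion 3: flush every held lock from a global handler.
function releaseAllHeldLocks(): void {
  for (const lockPath of [...HELD_LOCKS.keys()]) releaseLock(lockPath);
}

// Not called here; a daemon would install this once at startup.
function installReleaseOnFatalError(): void {
  const onFatal = (err: unknown): void => {
    releaseAllHeldLocks();
    console.error(err);
    process.exit(1);
  };
  process.on("uncaughtException", onFatal);
  process.on("unhandledRejection", onFatal);
}

// Demo: a lock that stopped being renewed 10 minutes ago, then renewed once.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "lease-demo-"));
const demoLock = path.join(demoDir, "session.jsonl.lock");
fs.writeFileSync(demoLock, JSON.stringify({ pid: process.pid, createdAt: new Date().toISOString() }));
const tenMinAgo = new Date(Date.now() - 10 * 60 * 1000);
fs.utimesSync(demoLock, tenMinAgo, tenMinAgo);
const expiredBefore = isLeaseExpired(fs.statSync(demoLock).mtimeMs, Date.now());
startHeartbeat(demoLock);
touchLease(demoLock); // what the heartbeat does on each tick
const expiredAfter = isLeaseExpired(fs.statSync(demoLock).mtimeMs, Date.now());
releaseAllHeldLocks();
```

With a lease like this, a leaked lock from a live process would become reclaimable once its heartbeat stops. The catch is that if the daemon leaks the lock while the heartbeat timer keeps firing, the lease never expires, so the heartbeat would need to be tied to the operation that owns the lock rather than to the process.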

Temporary Workaround

We created a cron job that runs every 5 minutes to clean stale lock files:

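```powershell
# ~/.openclaw/scripts/clean-session-locks.ps1
# Remove session lock files whose recorded createdAt is more than 5 minutes old.
Get-ChildItem "C:\Users\host\.openclaw\agents\*\sessions\*.lock" | ForEach-Object {
    $lock = Get-Content $_.FullName -Raw | ConvertFrom-Json
    # Cast so New-TimeSpan receives a DateTime even if createdAt was parsed as a string
    $age = (New-TimeSpan -Start ([datetime]$lock.createdAt)).TotalMinutes
    if ($age -gt 5) { Remove-Item $_.FullName -Force }
}
```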
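
For hosts without PowerShell, the same cleanup could be sketched in TypeScript roughly as follows. The function name cleanSessionLocks is hypothetical. The agents/*/sessions/*.lock layout and the createdAt field are taken from the script above, and the demo at the end runs against a throwaway temp directory rather than a real ~/.openclaw.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

const MAX_AGE_MINUTES = 5; // same threshold as the PowerShell workaround

// Remove every <root>/agents/*/sessions/*.lock whose createdAt is older than
// MAX_AGE_MINUTES. Returns the paths that were removed.
function cleanSessionLocks(root: string, nowMs: number = Date.now()): string[] {
  const removed: string[] = [];
  const agentsDir = path.join(root, "agents");
  if (!fs.existsSync(agentsDir)) return removed;

  for (const agent of fs.readdirSync(agentsDir)) {
    const sessionsDir = path.join(agentsDir, agent, "sessions");
    if (!fs.existsSync(sessionsDir)) continue;

    for (const name of fs.readdirSync(sessionsDir)) {
      if (!name.endsWith(".lock")) continue;
      const lockPath = path.join(sessionsDir, name);
      try {
        const lock = JSON.parse(fs.readFileSync(lockPath, "utf8")) as { createdAt?: string };
        // A missing or unparsable createdAt yields NaN, which never exceeds
        // the threshold, so such files are left alone.
        const ageMinutes = (nowMs - Date.parse(lock.createdAt ?? "")) / 60000;
        if (ageMinutes > MAX_AGE_MINUTES) {
          fs.rmSync(lockPath, { force: true });
          removed.push(lockPath);
        }
      } catch {
        // Unreadable or already deleted; skip it.
      }
    }
  }
  return removed;
}

// Demo against a temporary directory with one old and one fresh lock.
const demoRoot = fs.mkdtempSync(path.join(os.tmpdir(), "openclaw-demo-"));
const demoSessions = path.join(demoRoot, "agents", "main", "sessions");
fs.mkdirSync(demoSessions, { recursive: true });
const demoNow = Date.now();
const oldLock = path.join(demoSessions, "old.jsonl.lock");
const freshLock = path.join(demoSessions, "fresh.jsonl.lock");
fs.writeFileSync(oldLock, JSON.stringify({ pid: 1, createdAt: new Date(demoNow - 10 * 60000).toISOString() }));
fs.writeFileSync(freshLock, JSON.stringify({ pid: 1, createdAt: new Date(demoNow).toISOString() }));
const removedLocks = cleanSessionLocks(demoRoot, demoNow);
console.log(removedLocks);
```

Like the PowerShell version, this deletes by age alone and cannot tell a leaked lock from one that is legitimately held for longer than 5 minutes, which is why it is only a stopgap.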

Environment

  • OpenClaw Version: 2026.3.13
  • Node.js Version: 24.13.1
  • OS: Windows 11
