Session write lock timeout freezes gateway when session file is large (>300KB)

## Summary

When a session transcript grows large (300KB+), writing to the session file can take longer than the default lock acquisition timeout (60 seconds). This causes repeated `SessionWriteLockTimeoutError` failures, rendering the gateway unresponsive and requiring manual restarts.

## Environment

- OpenClaw version: v2026.5.22-beta.1
- OS: Linux (Ubuntu 24.04, x64)
- Session file size at failure: ~400KB
- Session activity: 40+ turn debugging session with multiple file reads/writes

## Steps to Reproduce

1. Run a long conversation (40+ turns) with many tool calls and file modifications.
2. Let the session file grow to 300KB+.
3. Have multiple processes write to the same session file (e.g. main gateway + completion sub-process).
4. The lock holder takes more than 60s to complete its write.
5. The other processes wait for the lock and time out after 60s with `SessionWriteLockTimeoutError`.

## Expected Behavior

If a session write lock is held by a process that exceeds `maxHoldMs` (currently 5 minutes), the lock should be reclaimable by other processes during acquisition.

## Actual Behavior

The `shouldReclaim` callback in `acquireSessionWriteLock` checks six stale conditions (missing PID, dead PID, recycled PID, too old per `staleMs`, non-openclaw owner, orphan self-lock) but does NOT check `maxHoldMs`. So even when a live process holds the lock for well over 5 minutes, other processes wait 60 seconds and then throw `SessionWriteLockTimeoutError`.

Log evidence:
```
error: Embedded agent failed before reply: session file locked (timeout 60000ms): 
  pid=98386 /root/.openclaw/agents/main/sessions/xxx.jsonl.lock
```

The watchdog timer (`runLockWatchdogCheck`) DOES enforce `maxHoldMs`, but it only runs every 60 seconds and is a separate mechanism from lock acquisition. The acquisition timeout fires before the watchdog can act.
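The timing gap can be sketched with simple arithmetic. This is a simplified model, not code from the repository: the constant names and `waiterTimesOutBeforeWatchdog` are hypothetical, and it assumes the best case where the watchdog acts exactly when the hold reaches `maxHoldMs` (in practice it may be up to one check interval later).

```typescript
// Values taken from this report; names are illustrative only.
const ACQUIRE_TIMEOUT_MS = 60 * 1000; // default acquisition timeout
const MAX_HOLD_MS = 5 * 60 * 1000;    // default maxHoldMs

// holdAgeAtArrivalMs: how long the lock had been held when the waiter
// started trying to acquire it.
// Returns true if the waiter gives up before the watchdog could
// possibly release the lock (best case for the watchdog).
function waiterTimesOutBeforeWatchdog(holdAgeAtArrivalMs: number): boolean {
  const waiterGivesUpAtMs = holdAgeAtArrivalMs + ACQUIRE_TIMEOUT_MS;
  return waiterGivesUpAtMs < MAX_HOLD_MS;
}

// A waiter arriving right as the hold starts gives up at 60s,
// long before the 5 minute mark.
const earlyWaiter = waiterTimesOutBeforeWatchdog(0);

// A waiter arriving 250s into the hold gives up at 310s,
// after the watchdog's earliest possible action at 300s.
const lateWaiter = waiterTimesOutBeforeWatchdog(250 * 1000);
```

Under these assumptions, any waiter that starts during the first four minutes of a long hold fails before the watchdog has a chance to intervene.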

## Root Cause

In `src/agents/session-write-lock.ts`, the `shouldReclaim` callback inside `acquireSessionWriteLock` never consults `maxHoldMs`. The `maxHoldMs` value is stored as metadata but only used by the background watchdog, not during lock acquisition.

The relevant `shouldReclaim` logic only uses `staleMs` (30min default), never `maxHoldMs` (5min default). A lock could be held for 10 minutes by a live process, and `shouldReclaim` would still return `false` because the PID is alive and the age hasn't exceeded `staleMs`.
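To make the gap concrete, here is a minimal model of the decision described above. It is a sketch rather than the code in `session-write-lock.ts`: the `LockInfo` shape, `currentShouldReclaim`, and the `pidAlive` flag are hypothetical, and only two of the six stale conditions (dead PID and age beyond `staleMs`) are modelled.

```typescript
// Hypothetical lock metadata; field names are assumptions for illustration.
interface LockInfo {
  pid: number;
  createdAtMs: number;
}

const STALE_MS = 30 * 60 * 1000;   // staleMs default per this report
const MAX_HOLD_MS = 5 * 60 * 1000; // maxHoldMs default per this report

// Models the behavior reported here: the age is compared against
// staleMs only, and maxHoldMs is never consulted.
function currentShouldReclaim(
  lock: LockInfo,
  nowMs: number,
  pidAlive: boolean,
): boolean {
  if (!pidAlive) return true; // dead PID: reclaimable
  return nowMs - lock.createdAtMs > STALE_MS; // too old per staleMs
}

// A live process has held the lock for 10 minutes.
const heldLock: LockInfo = { pid: 98386, createdAtMs: 0 };
const tenMinutesMs = 10 * 60 * 1000;

// false: the PID is alive and 10 min is less than the 30 min staleMs.
const reclaimedAtTenMinutes = currentShouldReclaim(heldLock, tenMinutesMs, true);

// true: the same lock is already past maxHoldMs, but nothing checks this.
const pastMaxHold = tenMinutesMs - heldLock.createdAtMs > MAX_HOLD_MS;
```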

## Workaround

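Set these environment variables to raise the acquisition timeout to 120 seconds and lower the staleness threshold to 5 minutes (the same value as the `maxHoldMs` default), so that the existing `staleMs` check can reclaim long-held locks: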
```
OPENCLAW_SESSION_WRITE_LOCK_ACQUIRE_TIMEOUT_MS=120000
OPENCLAW_SESSION_WRITE_LOCK_STALE_MS=300000
```

## Proposed Fix

In `shouldReclaim`, also enforce `maxHoldMs`: if `nowMs - createdAt` exceeds `maxHoldMs`, treat the lock as reclaimable (add "hold-exceeded" to staleReasons). This makes the acquisition path respect the same `maxHoldMs` constraint that the watchdog already enforces.
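A minimal sketch of what this could look like, assuming a simplified lock payload. Apart from the "hold-exceeded" reason named above, everything here is hypothetical: the `LockPayload` and `ReclaimPolicy` shapes, the `collectStaleReasons` helper, the injected `isPidAlive` callback, and the other reason strings are illustrative and do not reflect the actual signatures in `session-write-lock.ts`. The remaining stale conditions are omitted for brevity.

```typescript
// Hypothetical shapes; the real payload and options may differ.
interface LockPayload {
  pid: number;
  createdAtMs: number;
}

interface ReclaimPolicy {
  staleMs: number;   // 30 min default per this report
  maxHoldMs: number; // 5 min default per this report
}

// Collects the reasons a lock may be treated as reclaimable.
function collectStaleReasons(
  lock: LockPayload,
  nowMs: number,
  policy: ReclaimPolicy,
  isPidAlive: (pid: number) => boolean,
): string[] {
  const staleReasons: string[] = [];
  const ageMs = nowMs - lock.createdAtMs;

  if (!isPidAlive(lock.pid)) staleReasons.push("dead-pid"); // illustrative label
  if (ageMs > policy.staleMs) staleReasons.push("too-old"); // illustrative label

  // Proposed addition: apply the same limit the watchdog enforces.
  if (ageMs > policy.maxHoldMs) staleReasons.push("hold-exceeded");

  return staleReasons;
}

function shouldReclaim(
  lock: LockPayload,
  nowMs: number,
  policy: ReclaimPolicy,
  isPidAlive: (pid: number) => boolean,
): boolean {
  return collectStaleReasons(lock, nowMs, policy, isPidAlive).length > 0;
}

// Usage: a live process that has held the lock for 10 minutes.
const policy: ReclaimPolicy = { staleMs: 30 * 60 * 1000, maxHoldMs: 5 * 60 * 1000 };
const longHeld: LockPayload = { pid: 98386, createdAtMs: 0 };
const alwaysAlive = (_pid: number): boolean => true;

const reasons = collectStaleReasons(longHeld, 10 * 60 * 1000, policy, alwaysAlive);
const reclaimable = shouldReclaim(longHeld, 10 * 60 * 1000, policy, alwaysAlive);
```

With this check in place, the 10 minute hold yields a single "hold-exceeded" reason and the waiter can reclaim the lock instead of timing out. Whether reclaiming from a live holder is safe depends on how the holder reacts to losing the lock, which this sketch does not model.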

