Session write lock timeout freezes gateway when session file is large (>300KB) #85762

@njuboy11

Description

Summary

When a session transcript grows large (300KB+), writing to the session file can take longer than the default lock acquisition timeout (60 seconds). This causes repeated SessionWriteLockTimeoutError failures, rendering the gateway unresponsive and requiring manual restarts.

Environment

  • OpenClaw version: v2026.5.22-beta.1
  • OS: Linux (Ubuntu 24.04, x64)
  • Session file size at failure: ~400KB
  • Session activity: debugging session of 40+ turns with multiple file reads/writes

Steps to Reproduce

  1. Run a long conversation (40+ turns) with many tool calls and file modifications
  2. Session file grows to 300KB+
  3. Multiple processes try to write to the same session file (e.g. main gateway + completion sub-process)
  4. Lock holder takes >60s to complete write
  5. Other processes wait for the lock and time out after 60s -> SessionWriteLockTimeoutError

Expected Behavior

If a session write lock is held by a process that exceeds maxHoldMs (currently 5 minutes), the lock should be reclaimable by other processes during acquisition.

Actual Behavior

The shouldReclaim callback in acquireSessionWriteLock checks 6 stale conditions (missing PID, dead PID, recycled PID, too-old per staleMs, non-openclaw owner, orphan self-lock) but does NOT check maxHoldMs. So even when a live process holds the lock for well over 5 minutes, other processes wait 60 seconds then throw SessionWriteLockTimeoutError.

Log evidence:

error: Embedded agent failed before reply: session file locked (timeout 60000ms): 
  pid=98386 /root/.openclaw/agents/main/sessions/xxx.jsonl.lock

The watchdog timer (runLockWatchdogCheck) DOES enforce maxHoldMs, but it only runs every 60 seconds and is a separate mechanism from lock acquisition. The acquisition timeout fires before the watchdog can act.
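The timing gap can be checked with a few lines of arithmetic. This is only a sketch using the defaults quoted in this issue (60s acquisition timeout, 5 min maxHoldMs, 60s watchdog interval). The constant and function names are hypothetical, and it assumes the watchdog acts on the first tick after maxHoldMs is exceeded.

```typescript
// Defaults as quoted in this issue; names are illustrative, not from the source.
const ACQUIRE_TIMEOUT_MS = 60_000;    // waiter gives up after this long
const MAX_HOLD_LIMIT_MS = 300_000;    // maxHoldMs default (5 minutes)
const WATCHDOG_INTERVAL_MS = 60_000;  // watchdog tick period

// Assumption: the watchdog acts on its first tick after maxHoldMs is exceeded,
// so it acts somewhere between these two lock ages.
const earliestWatchdogActionMs = MAX_HOLD_LIMIT_MS;
const latestWatchdogActionMs = MAX_HOLD_LIMIT_MS + WATCHDOG_INTERVAL_MS;

// Lock age at which a waiter throws, given the lock's age when it began waiting.
function waiterTimeoutAtMs(lockAgeAtWaitStartMs: number): number {
  return lockAgeAtWaitStartMs + ACQUIRE_TIMEOUT_MS;
}

// True when the waiter is guaranteed to time out before the watchdog can act.
function timesOutBeforeWatchdog(lockAgeAtWaitStartMs: number): boolean {
  return waiterTimeoutAtMs(lockAgeAtWaitStartMs) < earliestWatchdogActionMs;
}

console.log(timesOutBeforeWatchdog(0));       // waiter arrives as the lock is taken
console.log(timesOutBeforeWatchdog(250_000)); // waiter arrives late in the hold
console.log(latestWatchdogActionMs);
```

Under these assumptions, any waiter that starts waiting during the first 4 minutes of a hold times out before the watchdog has a chance to release the lock.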

Root Cause

In src/agents/session-write-lock.ts, the shouldReclaim callback inside acquireSessionWriteLock never consults maxHoldMs. The maxHoldMs value is stored as metadata but only used by the background watchdog, not during lock acquisition.

The relevant shouldReclaim logic only uses staleMs (30min default), never maxHoldMs (5min default). A lock could be held for 10 minutes by a live process, and shouldReclaim would still return false because the PID is alive and the age hasn't exceeded staleMs.
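A simplified model of this decision illustrates the gap. It is a sketch, not the code in session-write-lock.ts: the LockSnapshot shape, the function name, and the reason strings are hypothetical, and only two of the six checks (dead PID and age versus staleMs) are modeled.

```typescript
// Hypothetical model of the reclaim decision; not copied from the source file.
interface LockSnapshot {
  pidAlive: boolean;   // owner process is still running
  createdAtMs: number; // when the lock was created
}

const DEFAULT_STALE_MS = 30 * 60 * 1000;   // staleMs default per this issue
const DEFAULT_MAX_HOLD_MS = 5 * 60 * 1000; // maxHoldMs default per this issue

// Models only the dead-PID and too-old checks; maxHoldMs is never consulted.
function currentStaleReasons(lock: LockSnapshot, nowMs: number): string[] {
  const reasons: string[] = [];
  if (!lock.pidAlive) reasons.push("dead-pid");
  if (nowMs - lock.createdAtMs > DEFAULT_STALE_MS) reasons.push("too-old");
  return reasons;
}

// A live owner that has held the lock for 10 minutes.
const sampleNowMs = 1_000_000_000;
const tenMinuteLock: LockSnapshot = {
  pidAlive: true,
  createdAtMs: sampleNowMs - 10 * 60 * 1000,
};

const currentReasons = currentStaleReasons(tenMinuteLock, sampleNowMs);
const exceedsMaxHold = sampleNowMs - tenMinuteLock.createdAtMs > DEFAULT_MAX_HOLD_MS;

// The hold is past maxHoldMs, yet no stale reason is produced.
console.log({ reasons: currentReasons, exceedsMaxHold });
```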

Workaround

Set environment variables to increase the acquisition timeout (to 120 seconds) and reduce the staleness threshold (to 5 minutes); both values are in milliseconds:

OPENCLAW_SESSION_WRITE_LOCK_ACQUIRE_TIMEOUT_MS=120000
OPENCLAW_SESSION_WRITE_LOCK_STALE_MS=300000

Proposed Fix

In shouldReclaim, also enforce maxHoldMs: if nowMs - createdAt exceeds maxHoldMs, treat the lock as reclaimable (add "hold-exceeded" to staleReasons). This makes the acquisition path respect the same maxHoldMs constraint that the watchdog already enforces.
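A minimal sketch of that change, assuming the lock metadata exposes createdAt and maxHoldMs as numbers. The HeldLock shape, the helper names, and every reason string other than "hold-exceeded" are hypothetical; the real shouldReclaim performs further checks that are omitted here.

```typescript
// Hypothetical lock metadata shape; field names are assumptions.
interface HeldLock {
  pidAlive: boolean;   // owner process is still running
  createdAtMs: number; // when the lock was created
  maxHoldMs: number;   // stored with the lock as metadata, per this issue
}

// Collects stale reasons, now including the proposed maxHoldMs check.
function collectStaleReasons(lock: HeldLock, nowMs: number, staleMs: number): string[] {
  const reasons: string[] = [];
  const ageMs = nowMs - lock.createdAtMs;
  if (!lock.pidAlive) reasons.push("dead-pid");
  if (ageMs > staleMs) reasons.push("too-old");
  // Proposed addition: same limit the watchdog already enforces.
  if (ageMs > lock.maxHoldMs) reasons.push("hold-exceeded");
  return reasons;
}

// The lock is reclaimable when at least one stale reason applies.
function shouldReclaimSketch(lock: HeldLock, nowMs: number, staleMs: number): boolean {
  return collectStaleReasons(lock, nowMs, staleMs).length > 0;
}

const fixNowMs = 1_000_000_000;
const staleMs = 30 * 60 * 1000;
const longHold: HeldLock = {
  pidAlive: true,
  createdAtMs: fixNowMs - 10 * 60 * 1000, // held for 10 minutes
  maxHoldMs: 5 * 60 * 1000,
};
const shortHold: HeldLock = { ...longHold, createdAtMs: fixNowMs - 60 * 1000 };

console.log(collectStaleReasons(longHold, fixNowMs, staleMs));
console.log(shouldReclaimSketch(shortHold, fixNowMs, staleMs));
```

With this check in place, a waiter would reclaim the 10-minute lock during acquisition instead of waiting out the 60-second timeout, while a lock held for one minute is left alone.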

Labels

  • P1 - High-priority user-facing bug, regression, or broken workflow.
  • clawsweeper:linked-pr-open - ClawSweeper found an open linked pull request for this issue.
  • clawsweeper:no-new-fix-pr - ClawSweeper does not recommend queueing a new automated fix PR for this issue.
  • clawsweeper:source-repro - ClawSweeper found a high-confidence source-level issue reproduction.
  • impact:crash-loop - Crash, hang, restart loop, or process-level availability failure.
  • impact:session-state - Session, memory, transcript, context, or agent state can drift or corrupt.
  • issue-rating: 🦞 diamond lobster - Very strong issue quality with high-confidence source-level or clear reproduction.
