Make session write lock configurable + narrow lock scope (avoid timeout=All models failed) #13744

@ruslanavapro

Problem

OpenClaw uses a file lock, ${sessionFile}.lock, for session writes. In the current build (npm openclaw@2026.2.9), the lock defaults are:

  • timeoutMs = 10000 (10s)
  • staleMs = 30min

However, call sites invoke it without overrides:

  • acquireSessionWriteLock({ sessionFile })
    so timeoutMs and staleMs are effectively hardcoded and cannot be configured via openclaw.json.
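To make the gap concrete, here is a minimal sketch of how the lock options could be resolved if they were configurable. The `sessionWriteLock` shape, the environment variable names, and the `resolveLockOptions` helper are all hypothetical; only the default values (10s and 30min) come from the bundle.

```typescript
// Hypothetical option shape. The env variable names used below are
// assumptions for illustration, not existing OpenClaw settings.
interface SessionWriteLockOptions {
  timeoutMs: number;
  staleMs: number;
}

// Defaults reported in this issue: 10 s timeout, 30 min stale threshold.
const DEFAULT_LOCK_OPTIONS: SessionWriteLockOptions = {
  timeoutMs: 10000,
  staleMs: 30 * 60 * 1000,
};

type Env = Record<string, string | undefined>;

// Parse a positive integer; return undefined for missing or invalid input.
function parsePositiveInt(value: string | undefined): number | undefined {
  if (value === undefined) return undefined;
  const n = Number(value);
  return Number.isInteger(n) && n > 0 ? n : undefined;
}

// Precedence: call-site override > env > config file > built-in defaults.
function resolveLockOptions(
  config: Partial<SessionWriteLockOptions> = {},
  env: Env = {},
  override: Partial<SessionWriteLockOptions> = {},
): SessionWriteLockOptions {
  return {
    timeoutMs:
      override.timeoutMs ??
      parsePositiveInt(env["OPENCLAW_SESSION_LOCK_TIMEOUT_MS"]) ??
      config.timeoutMs ??
      DEFAULT_LOCK_OPTIONS.timeoutMs,
    staleMs:
      override.staleMs ??
      parsePositiveInt(env["OPENCLAW_SESSION_LOCK_STALE_MS"]) ??
      config.staleMs ??
      DEFAULT_LOCK_OPTIONS.staleMs,
  };
}
```

With something like this in place, call sites could keep passing only `sessionFile` and still pick up values from config or env.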

Additionally, the lock appears to be held across a broad portion of the embedded run (including LLM/tool execution), not just during the actual transcript append/flush. Under concurrent inbound messages to the same session, this produces:

  • Error: session file locked (timeout 10000ms)
  • which then propagates as All models failed / failover, even though this is not a model/provider error.
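One way to keep lock contention out of the failover path is to give it its own error type and classify it before the model-failure handling runs. The sketch below is an assumption about how that could look; `SessionLockTimeoutError`, `RunFailure`, and `classifyRunError` are hypothetical names, not identifiers from the bundle.

```typescript
// Hypothetical error type. Per this issue, the current code surfaces a plain
// "session file locked (timeout 10000ms)" error that ends up as "All models failed".
class SessionLockTimeoutError extends Error {
  readonly timeoutMs: number;

  constructor(timeoutMs: number) {
    super(`session file locked (timeout ${timeoutMs}ms)`);
    this.name = "SessionLockTimeoutError";
    this.timeoutMs = timeoutMs;
    // Keep instanceof working when compiled to older targets.
    Object.setPrototypeOf(this, SessionLockTimeoutError.prototype);
  }
}

type RunFailure =
  | { kind: "session-busy"; status: 409; retryAfterMs: number; message: string }
  | { kind: "model-failure"; message: string };

// Map a thrown error to a failure category so that lock contention is
// reported as "busy, please retry" and never counted against a provider.
function classifyRunError(err: unknown, retryAfterMs = 1000): RunFailure {
  if (err instanceof SessionLockTimeoutError) {
    return {
      kind: "session-busy",
      status: 409,
      retryAfterMs,
      message: "Session is busy, please retry.",
    };
  }
  const message = err instanceof Error ? err.message : String(err);
  return { kind: "model-failure", message };
}
```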

Why it matters

If two channels (e.g., Telegram + Webchat) end up hitting the same session concurrently, the second request times out after 10s and fails the run, causing user-visible outages. This is lock contention, not a provider failure.

Requested changes

  1. Configurable lock timeouts

    • Allow configuring sessionWriteLock.timeoutMs and sessionWriteLock.staleMs via openclaw.json (global defaults), and/or via env.
  2. Narrow lock scope

    • Hold the lock only during critical sections that read/repair/prewarm/append to the session file.
    • Avoid holding the lock across the entire LLM run.
    • Alternative: per-session async queue / single-writer actor to serialize writes without blocking the whole run.
  3. Better error handling

    • Distinguish lock-timeout from model/provider failures.
    • Prefer a retry/backoff or a graceful 409/429-style response with “please retry” instead of surfacing All models failed.
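The per-session queue alternative from item 2 can be sketched as follows. This is a minimal in-process illustration under stated assumptions: `SessionWriteQueue`, `handleMessage`, `callModel`, and `appendTranscript` are hypothetical names, and a queue like this only serializes writers inside one process, so a short-lived file lock would still be needed if several processes share a session file.

```typescript
// Hypothetical single-writer queue: serializes short write sections per
// session file without holding anything during LLM/tool execution.
class SessionWriteQueue {
  private readonly tails = new Map<string, Promise<void>>();

  enqueue<T>(sessionFile: string, write: () => Promise<T>): Promise<T> {
    // The stored tail never rejects, so one failed write cannot block later ones.
    const prev = this.tails.get(sessionFile) ?? Promise.resolve();
    const result = prev.then(() => write());
    const tail = result.then(
      () => undefined,
      () => undefined,
    );
    this.tails.set(sessionFile, tail);
    // Drop the entry once the queue drains so the map does not grow unbounded.
    void tail.then(() => {
      if (this.tails.get(sessionFile) === tail) this.tails.delete(sessionFile);
    });
    return result;
  }
}

// Usage sketch: the model call runs outside the queue, and only the
// transcript append is serialized.
async function handleMessage(
  queue: SessionWriteQueue,
  sessionFile: string,
  callModel: () => Promise<string>,
  appendTranscript: (reply: string) => Promise<void>,
): Promise<string> {
  const reply = await callModel();
  await queue.enqueue(sessionFile, () => appendTranscript(reply));
  return reply;
}
```

With this shape, two concurrent messages to the same session both run their model calls in parallel and only wait on each other for the brief append step.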

Evidence (from dist bundle)

In dist/reply-*.js:

  • Defaults inside acquireSessionWriteLock() include timeoutMs ?? 1e4 and staleMs ?? 1800*1e3.
  • Multiple call sites use acquireSessionWriteLock({ sessionFile }) (no overrides).
  • The lock is acquired early and released after the run finishes.

Workarounds we are using

  • External cleanup of .jsonl.lock files whose owning PID is no longer alive.
  • Avoiding cross-channel contention by ensuring different channels don’t share the same sessionId.
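The staleness decision behind the first workaround can be sketched as a small pure function. The `LockInfo` shape is an assumption, since the on-disk format of the lock file is not described here, and the liveness probe is injected so the logic stays testable. In Node, such a probe is commonly built on `process.kill(pid, 0)`, which throws when the process does not exist.

```typescript
// Assumed lock metadata; the actual contents of the .lock file may differ.
interface LockInfo {
  pid: number;
  createdAtMs: number;
}

// A lock is treated as removable when its owner process is gone, or when it
// is older than the stale threshold even though the owner is still alive.
function isLockStale(
  lock: LockInfo,
  nowMs: number,
  staleMs: number,
  isProcessAlive: (pid: number) => boolean,
): boolean {
  if (!isProcessAlive(lock.pid)) return true;
  return nowMs - lock.createdAtMs > staleMs;
}
```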

Happy to provide exact line snippets from the bundle if you want, or test a PR.

Metadata

    Labels

    • P2: Normal backlog priority with limited blast radius.
    • clawsweeper:linked-pr-open: ClawSweeper found an open linked pull request for this issue.
    • clawsweeper:needs-product-decision: ClawSweeper marked this issue as needing a product or behavior decision.
    • clawsweeper:no-new-fix-pr: ClawSweeper does not recommend queueing a new automated fix PR for this issue.
    • clawsweeper:source-repro: ClawSweeper found a high-confidence source-level issue reproduction.
    • enhancement: New feature or request.
    • impact:message-loss: Channel message delivery can be lost, duplicated, or misrouted.
    • impact:session-state: Session, memory, transcript, context, or agent state can drift or corrupt.
