[Bug]: Session file lock not released properly by watchdog #87483

@sally-lemo

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Summary

Session write lock files persist beyond maxHoldMs timeout; watchdog fails to reclaim stale locks, causing "session file locked" errors on subsequent requests.

Steps to reproduce

  1. Start OpenClaw 2026.5.22 gateway and let it run for extended periods (overnight)
  2. Use multiple sessions or let sessions accumulate
  3. Observe lock files in ~/.openclaw/agents/main/sessions/
  4. Attempt to use OpenClaw after lock has exceeded maxHoldMs
  5. Observe "session file locked" error

Expected behavior

Lock files should be automatically released when:

  • The holding process exits
  • maxHoldMs timeout (300000ms / 5 minutes) is exceeded
  • Lock is marked as stale after staleMs (1800000ms / 30 minutes)

The watchdog should reclaim stale locks without manual intervention.
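
The release rules above can be sketched as a small decision function. This is a minimal illustration in Python, not OpenClaw's watchdog code (which is not shown in this report): `reclaim_reason`, `parse_created_at`, and the constant names are hypothetical, and the lock fields are assumed to match the lock file quoted under "Logs, screenshots, and evidence". The sketch applies the configured limits rather than the `maxHoldMs` value recorded in the lock file.

```python
from datetime import datetime
from typing import Optional

# Limits taken from the configuration reported in this issue.
MAX_HOLD_MS = 300_000    # session.writeLock.maxHoldMs (5 minutes)
STALE_MS = 1_800_000     # session.writeLock.staleMs (30 minutes)


def parse_created_at(value: str) -> datetime:
    """Parse a timestamp such as 2026-05-28T01:12:54.261Z."""
    # Older Python versions reject a trailing "Z", so normalise it first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def reclaim_reason(lock: dict, now: datetime, holder_alive: bool) -> Optional[str]:
    """Return why a lock may be reclaimed, or None if it should be kept.

    `holder_alive` is supplied by the caller so that the rule itself stays
    independent of how process liveness is detected.
    """
    age_ms = (now - parse_created_at(lock["createdAt"])).total_seconds() * 1000
    if not holder_alive:
        return "holder-exited"
    if age_ms > STALE_MS:
        return "stale"
    if age_ms > MAX_HOLD_MS:
        return "max-hold-exceeded"
    return None
```

Under these rules, a lock created at 01:12:54 UTC and still present 8 hours later would be reported as "stale" even though its holder is still running, which is the case described in this issue.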

Actual behavior

  • Lock files persist for 8+ hours despite maxHoldMs being 300000ms (5 minutes)
  • Lock file shows maxHoldMs: 1020000 (17 minutes) instead of configured 300000ms
  • Watchdog does not reclaim stale locks automatically
  • User must manually delete lock files or restart gateway to resolve
  • Error message: "session file locked (timeout 60000ms): pid=16834 /path/to/session.jsonl.lock"

OpenClaw version

2026.5.22 (a374c3a)

Operating system

macOS Darwin 25.5.0 (arm64)

Install method

npm global

Model

qwen/kimi-k2.5

Provider / routing chain

openclaw -> modelstudio/qwen3.5-plus

Additional provider/model setup details

No response

Logs, screenshots, and evidence

Lock file content:

{
  "pid": 16834,
  "createdAt": "2026-05-28T01:12:54.261Z",
  "maxHoldMs": 1020000
}


Process status:

pid 16834  83.3% CPU  openclaw gateway

(Process running for 170+ minutes, lock held for 8+ hours)

Configuration:
- session.writeLock.acquireTimeoutMs: 60000
- session.writeLock.staleMs: 1800000
- session.writeLock.maxHoldMs: 300000
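
For reference, the deadlines implied by the lock file and configuration above can be worked out directly. The snippet below is only arithmetic on the quoted values; the variable names are illustrative and nothing here reflects how OpenClaw computes them internally.

```python
from datetime import datetime, timedelta

# Lock file content and configuration as quoted in this report.
lock = {"pid": 16834, "createdAt": "2026-05-28T01:12:54.261Z", "maxHoldMs": 1020000}
configured_max_hold_ms = 300000
configured_stale_ms = 1800000

# Normalise the trailing "Z" so older Python versions can parse the timestamp.
created = datetime.fromisoformat(lock["createdAt"].replace("Z", "+00:00"))


def deadline(offset_ms: int) -> datetime:
    """Return the instant `offset_ms` milliseconds after lock creation."""
    return created + timedelta(milliseconds=offset_ms)


recorded_deadline = deadline(lock["maxHoldMs"])        # limit written in the lock file
configured_deadline = deadline(configured_max_hold_ms)  # limit from configuration
stale_deadline = deadline(configured_stale_ms)
mismatch_ms = lock["maxHoldMs"] - configured_max_hold_ms

print("maxHoldMs deadline (recorded):  ", recorded_deadline.isoformat())
print("maxHoldMs deadline (configured):", configured_deadline.isoformat())
print("staleMs deadline:               ", stale_deadline.isoformat())
print("recorded minus configured (ms): ", mismatch_ms)
```

All three deadlines fall within 30 minutes of the lock's creation time, so a lock still present 8 hours later has passed every limit, whichever maxHoldMs value is in effect.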

Impact and severity

Affected: All OpenClaw users on 2026.5.22 with long-running gateway
Severity: Medium (requires manual intervention or workaround)
Frequency: Observed multiple times after overnight operation
Consequence: Agents fail to respond, user must manually delete lock files or restart gateway

Additional information

Workaround applied:

  1. Extended timeouts via environment variables:

    • OPENCLAW_SESSION_WRITE_LOCK_ACQUIRE_TIMEOUT_MS=120000
    • OPENCLAW_SESSION_WRITE_LOCK_STALE_MS=3600000
    • OPENCLAW_SESSION_WRITE_LOCK_MAX_HOLD_MS=600000
  2. Created cleanup script via crontab to remove stale locks every 10 minutes
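
The cleanup script itself is not included in this report. A minimal sketch of what such a sweep might look like, in Python, is shown here. Everything in it is an assumption made for illustration: the function names, the `*.lock` glob (inferred from the `session.jsonl.lock` name in the error message), and the policy of removing a lock when it is older than the stale threshold or when its holder pid no longer exists. Removing a lock whose holder is still running can let two writers touch the same session file, so this is a stopgap rather than a fix.

```python
import json
import os
from datetime import datetime, timezone
from pathlib import Path


def pid_alive(pid: int) -> bool:
    """Best-effort liveness check using signal 0 (POSIX only, e.g. macOS/Linux)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def sweep_stale_locks(lock_dir, stale_ms, now=None, is_alive=pid_alive):
    """Delete lock files that are past stale_ms or whose holder has exited.

    Returns the list of removed paths. Files that cannot be read or parsed
    are left untouched.
    """
    now = now or datetime.now(timezone.utc)
    removed = []
    for path in sorted(Path(lock_dir).glob("*.lock")):
        try:
            lock = json.loads(path.read_text())
            created = datetime.fromisoformat(lock["createdAt"].replace("Z", "+00:00"))
            pid = int(lock["pid"])
            age_ms = (now - created).total_seconds() * 1000
        except (OSError, ValueError, KeyError, TypeError, AttributeError):
            continue  # unexpected format: do not guess, skip the file
        if age_ms > stale_ms or not is_alive(pid):
            path.unlink(missing_ok=True)
            removed.append(path)
    return removed


if __name__ == "__main__":
    # Directory from the reproduction steps; threshold from the reported staleMs.
    sessions = Path.home() / ".openclaw" / "agents" / "main" / "sessions"
    if sessions.is_dir():
        for lock_path in sweep_stale_locks(sessions, stale_ms=1_800_000):
            print(f"removed {lock_path}")
```

A script along these lines could be scheduled from crontab at the 10-minute interval mentioned above.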

Possible root causes:

  1. Watchdog timer not properly checking lock expiration
  2. maxHoldMs value being overridden somewhere (1020000 vs 300000)
  3. Process exit detection not triggering lock cleanup
  4. Race condition in lock acquisition/release

Labels

  • P2: Normal backlog priority with limited blast radius.
  • bug: Something isn't working
  • clawsweeper:needs-live-repro: ClawSweeper needs live local, crabbox, or manual validation to confirm this issue.
  • clawsweeper:needs-maintainer-review: ClawSweeper marked this issue as needing maintainer review before automation.
  • clawsweeper:needs-product-decision: ClawSweeper marked this issue as needing a product or behavior decision.
  • clawsweeper:no-new-fix-pr: ClawSweeper does not recommend queueing a new automated fix PR for this issue.
  • impact:session-state: Session, memory, transcript, context, or agent state can drift or corrupt.
  • issue-rating: 🐚 platinum hermit: Good issue quality with a plausible reproduction path needing some confirmation.
