Bug: Session store lock contention causes /new and /reset to timeout under long-context + cron write load (OpenClaw 2026.2.12) #15507

@quqi1599

Environment

  • Date observed: 2026-02-13
  • OpenClaw: 2026.2.12 (running 2026.2.9 locally; config last written by 2026.2.12)
  • Node.js: 22.22.0
  • OS: macOS 26.3 (arm64)
  • Channel: Telegram
  • State dir: default (~/.openclaw)

Summary

When session history is large and cron writes happen concurrently, /new or /reset can appear to "do nothing". Logs show a session-store lock timeout and related write-lock contention.

Precondition

  • ~/.openclaw/agents/main/sessions/sessions.json grows large
  • ~/.openclaw/agents/main/sessions/ contains many transcript files
  • There is active session traffic plus cron jobs writing state

Actual Environment Output

$ openclaw --version
2026.2.9
Config was last written by a newer OpenClaw (2026.2.12); current version is 2026.2.9.

$ du -sh ~/.openclaw/agents/main/sessions ~/.openclaw/agents/main/sessions/sessions.json
242M    ~/.openclaw/agents/main/sessions
540K    ~/.openclaw/agents/main/sessions/sessions.json

$ ls -la ~/.openclaw/agents/main/sessions/ | wc -l
35 files (including dirs)

Key Error Signals (from logs)

2026-02-13T14:39:14.838Z [diagnostic] lane task error: lane=cron durationMs=600046 error="FailoverError: LLM request timed out."
2026-02-13T14:39:14.842Z [diagnostic] lane task error: lane=session:agent:main:cron:48ab4446-9dca-465a-ae04-56b12b7c797d durationMs=600052 error="FailoverError: LLM request timed out."
2026-02-13T14:49:42.759Z [diagnostic] lane task error: lane=cron durationMs=600111 error="FailoverError: LLM request timed out."
2026-02-13T14:56:41.082Z [diagnostic] lane wait exceeded: lane=session:agent:main:telegram:direct:7688058064 waitedMs=24273 queueAhead=0

Suspected Root Cause

  • Session store is a single shared JSON file with a global lock.
  • Lock acquisition timeout is fixed at 10s (timeoutMs = 10000), which waiters can exceed under heavy concurrent writers.
  • Long sessions + cron writes increase lock hold time and contention probability.
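
To make the suspected mechanism concrete, here is a minimal model (not OpenClaw code) of writers queuing on one global lock with a fixed acquisition timeout. The function name `simulateQueue`, the FIFO ordering, and the hold times are assumptions for illustration; only the 10s timeout comes from the issue.

```typescript
// Hypothetical model: all writers arrive at t=0 and queue FIFO on one lock.
const LOCK_TIMEOUT_MS = 10_000; // fixed timeout cited in this issue

interface WriterOutcome {
  position: number; // 0 = first in the queue
  waitedMs: number; // time spent waiting before acquiring or giving up
  acquired: boolean;
}

function simulateQueue(
  writers: number,
  holdMs: number, // assumed time each writer holds the lock (whole-store rewrite)
  timeoutMs: number = LOCK_TIMEOUT_MS,
): WriterOutcome[] {
  const outcomes: WriterOutcome[] = [];
  let lockFreeAt = 0; // time at which the lock next becomes free
  for (let position = 0; position < writers; position++) {
    if (lockFreeAt < timeoutMs) {
      // The lock frees up before this writer's timeout expires.
      outcomes.push({ position, waitedMs: lockFreeAt, acquired: true });
      lockFreeAt += holdMs;
    } else {
      // The writer gives up after timeoutMs and never holds the lock.
      outcomes.push({ position, waitedMs: timeoutMs, acquired: false });
    }
  }
  return outcomes;
}
```

In this model, `simulateQueue(6, 3000)` lets the first four writers through and times out the last two, so a /reset queued behind a few slow whole-store writes would fail even though nothing is deadlocked. The 3000 ms hold time is an assumed figure, not a measurement from this report.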

Relevant Code References

  • src/config/sessions/paths.ts:33 (default session store path = sessions.json)
  • src/config/sessions/store.ts:208 (whole-store JSON stringify/write)
  • src/config/sessions/store.ts:302 (lock file path)
  • src/config/sessions/store.ts:339 (timeout acquiring lock error)

Impact

  • User-facing reliability issue in production-like usage.
  • Reset/new commands become non-deterministic under load.
  • Perceived "bot freeze" in Telegram/interactive channels.

Temporary Workaround

  1. Stop gateway.
  2. Archive old session transcripts out of hot path.
  3. Reset sessions.json to a small store.
  4. Restart gateway.
  5. Reduce cron frequency / concurrent writes.
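
Step 2 could be scripted roughly as follows. This is a sketch under assumptions, not an OpenClaw tool: `archiveOldTranscripts` is a hypothetical helper, it assumes transcripts are regular files directly inside the sessions directory, and it should only run while the gateway is stopped.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical helper: moves files older than maxAgeMs out of sessionsDir.
// Assumes the index file is named sessions.json (the name used in this issue).
// Other non-transcript files (for example lock files) would need their own exclusion.
function archiveOldTranscripts(
  sessionsDir: string,
  archiveDir: string,
  maxAgeMs: number,
  now: number = Date.now(),
): string[] {
  fs.mkdirSync(archiveDir, { recursive: true });
  const moved: string[] = [];
  for (const entry of fs.readdirSync(sessionsDir, { withFileTypes: true })) {
    // Leave subdirectories and the session index untouched.
    if (!entry.isFile() || entry.name === "sessions.json") continue;
    const source = path.join(sessionsDir, entry.name);
    // Skip files modified more recently than the cutoff.
    if (now - fs.statSync(source).mtimeMs < maxAgeMs) continue;
    // renameSync fails with EXDEV across filesystems, so keep archiveDir on the same volume.
    fs.renameSync(source, path.join(archiveDir, entry.name));
    moved.push(entry.name);
  }
  return moved.sort();
}
```

Called with the sessions directory from this report, an archive directory outside it, and a cutoff such as seven days in milliseconds, it returns the names of the files it moved. It does not touch sessions.json, so step 3 still has to be done separately.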

Proposed Fix Direction

  • Move from single global session index write to sharded/per-session metadata storage or append-only journal.
  • Add lock retry/backoff and better degradation for reset path.
  • Make /new and /reset resilient when lock acquisition fails (queue, retry, or a user-visible reason).
  • Add load/concurrency regression tests for session-store contention.
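
One possible shape for the retry/backoff item, including a user-visible reason on final failure, is sketched below. The names `withRetry`, `RetryOptions`, and `RetryResult` are hypothetical and do not exist in OpenClaw; `attempt` stands in for whatever currently acquires the lock and performs the reset.

```typescript
// Hypothetical sketch: retry a lock-guarded operation with exponential backoff and jitter.
type RetryResult<T> = { ok: true; value: T } | { ok: false; reason: string };

interface RetryOptions {
  retries: number; // retries after the first attempt
  baseDelayMs: number; // delay before the first retry
  maxDelayMs: number; // cap on the un-jittered delay
  sleep?: (ms: number) => Promise<void>; // injectable for tests
  random?: () => number; // injectable for tests; defaults to Math.random
}

async function withRetry<T>(
  attempt: () => Promise<T>,
  opts: RetryOptions,
): Promise<RetryResult<T>> {
  const sleep = opts.sleep ?? ((ms: number) => new Promise<void>((r) => setTimeout(r, ms)));
  const random = opts.random ?? Math.random;
  let lastMessage = "unknown error";
  for (let i = 0; i <= opts.retries; i++) {
    try {
      return { ok: true, value: await attempt() };
    } catch (err) {
      lastMessage = err instanceof Error ? err.message : String(err);
      if (i === opts.retries) break; // no sleep after the final attempt
      // Exponential backoff, capped, scaled by a jitter factor between 0.5 and 1.
      const capped = Math.min(opts.maxDelayMs, opts.baseDelayMs * 2 ** i);
      await sleep(capped * (0.5 + 0.5 * random()));
    }
  }
  // A reason the channel could surface to the user instead of a silent no-op.
  return {
    ok: false,
    reason: `session store busy after ${opts.retries + 1} attempts: ${lastMessage}`,
  };
}
```

A /reset handler could wrap its lock-guarded write in `withRetry` and, when `ok` is false, send `reason` back to the Telegram chat so the command no longer appears to do nothing. Retries only reduce the symptom; they do not remove the underlying contention on the single store file.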

Metadata

Labels: stale (marked as stale due to inactivity)