Session store lock contention causes Telegram handler timeouts under concurrent load #5092

@epanonymous

Summary

When multiple Telegram threads or groups send messages concurrently, some messages get no reply because of session store lock contention. The only trace is an error in the log:

[telegram] handler failed: Error: timeout acquiring session store lock: /home/ubuntu/.clawdbot/agents/main/sessions/sessions.json.lock

Root Cause

The session store uses a single global lock (sessions.json.lock) for all session metadata updates. In src/config/sessions/store.ts:269-337:

async function withSessionStoreLock<T>(
  storePath: string,
  fn: () => Promise<T>,
  opts: SessionStoreLockOptions = {},
): Promise<T> {
  const timeoutMs = opts.timeoutMs ?? 10_000;  // 10 second timeout
  const pollIntervalMs = opts.pollIntervalMs ?? 25;
  // ...

When multiple Telegram handlers try to update session metadata simultaneously:

  1. First handler acquires the lock
  2. Other handlers queue up, polling every 25ms
  3. If the first handler holds the lock for more than 10 seconds (common with API calls, tool execution), the others time out
  4. Timed-out handlers log the error but send nothing to the user, so those messages get no response
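
The sequence above can be reproduced in isolation. The following is a minimal sketch, not the code from store.ts: PollingLock is a hypothetical in-memory stand-in for the lock file, and the timings are scaled down (100 ms timeout and 5 ms poll interval instead of 10 s and 25 ms) so it runs quickly. It assumes the first handler keeps the lock for the whole duration of its slow work.

```typescript
const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical in-memory stand-in for sessions.json.lock.
class PollingLock {
  private held = false;

  async withLock<T>(
    fn: () => Promise<T>,
    timeoutMs: number,
    pollIntervalMs: number,
  ): Promise<T> {
    const deadline = Date.now() + timeoutMs;
    // Poll until the lock is free or the deadline passes.
    while (this.held) {
      if (Date.now() >= deadline) {
        throw new Error("timeout acquiring session store lock");
      }
      await sleep(pollIntervalMs);
    }
    this.held = true;
    try {
      return await fn();
    } finally {
      this.held = false;
    }
  }
}

async function simulateContention(): Promise<string[]> {
  const lock = new PollingLock();
  // First handler: holds the lock for 300 ms, longer than the 100 ms timeout.
  const slow = lock.withLock(
    async () => {
      await sleep(300);
      return "slow handler: ok";
    },
    100,
    5,
  );
  // Second handler: polls, gives up at the deadline, and never does its work.
  const waiting = lock
    .withLock(async () => "second handler: ok", 100, 5)
    .catch((err: Error) => `second handler: ${err.message}`);
  return Promise.all([slow, waiting]);
}
```

In this sketch the second handler's work is fast, yet it still fails, because how long it waits depends only on how long the first handler keeps the lock.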

Observed Behavior

From production logs:

01:55:29 [telegram] handler failed: Error: timeout acquiring session store lock
01:55:50 [telegram] handler failed: Error: timeout acquiring session store lock  
01:58:54 [telegram] handler failed: Error: timeout acquiring session store lock
01:59:06 [telegram] handler failed: Error: timeout acquiring session store lock
01:59:19 [telegram] handler failed: Error: timeout acquiring session store lock
01:59:56 [telegram] handler failed: Error: timeout acquiring session store lock

Some threads work while others don't - it depends on which thread happens to acquire the lock first.

Environment

  • Server: AWS Lightsail (3.7GB RAM)
  • Multiple Telegram groups/threads active simultaneously
  • Gateway running with claude-opus-4-5 model (longer response times)

Suggested Solutions

  1. Per-session locking: Use individual locks per session ID rather than a global lock
  2. Lock-free updates: Use atomic file operations or a lightweight database (SQLite with WAL mode)
  3. Increased timeout with backoff: Longer timeout with exponential backoff (temporary mitigation)
  4. Queue-based approach: Serialize session updates through a single writer with a queue
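
To make options 1 and 4 concrete, here is a rough sketch of per-session serialization inside a single process. KeyedMutex, demoPerSessionLock, and the session keys are hypothetical names, not part of the existing codebase. Each session ID gets its own promise chain, so updates to the same session run in order while different sessions do not wait on each other. Nothing polls, so there is no acquisition timeout to hit.

```typescript
const delay = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: serializes async tasks per key within one process.
class KeyedMutex {
  private tails = new Map<string, Promise<void>>();

  run<T>(key: string, fn: () => Promise<T>): Promise<T> {
    // Tails never reject, so a failed task does not block later ones.
    const prev = this.tails.get(key) ?? Promise.resolve();
    const result = prev.then(fn);
    const tail: Promise<void> = result.then(
      () => undefined,
      () => undefined,
    );
    this.tails.set(key, tail);
    // Drop the entry once this key's queue drains so the map stays small.
    tail.then(() => {
      if (this.tails.get(key) === tail) this.tails.delete(key);
    });
    return result;
  }
}

async function demoPerSessionLock(): Promise<string[]> {
  const mutex = new KeyedMutex();
  const events: string[] = [];
  const task = (label: string, ms: number) => async () => {
    events.push(`${label} start`);
    await delay(ms);
    events.push(`${label} end`);
  };
  await Promise.all([
    mutex.run("session-a", task("a1", 40)),
    mutex.run("session-a", task("a2", 5)),
    mutex.run("session-b", task("b1", 5)),
  ]);
  return events;
}
```

In the demo, a2 starts only after a1 ends, while b1 finishes without waiting for session-a. Two caveats: this only coordinates handlers in one process, and it assumes each session's metadata can be written independently. If all sessions still share one sessions.json, the final file write would still need a short critical section of its own.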

Workaround

Restarting the gateway clears the backlog but kills active sessions/subagents, so this is not sustainable.

Impact

  • Messages in some Telegram threads get no response
  • Users perceive the bot as unreliable
  • No user-visible error - messages just disappear

Reported from production server running openclaw gateway

Labels: bug
