Stale session resume at gateway startup blocks lane=main indefinitely — no session GC or startup budget #44687

@spikefcz

Summary

Gateway startup resumes all sessions referenced in sessions.json bindings via embedded LLM runs, regardless of session age, failure state, or count. These embedded runs share lane=main with user messages and have a 600,000ms (10 min) timeout. In production with 3 agents accumulating sessions over days/weeks, this causes:

  1. Lane starvation — all user messages queue behind stale embedded runs
  2. Cascading timeouts — one stuck embedded run blocks the lane for 10 minutes
  3. Silent bot failure — Telegram users see "typing..." indefinitely, then nothing

Root Cause Analysis

Three compounding issues:

Issue 1: No session garbage collection

sessions.json bindings accumulate indefinitely. There is no mechanism to:

  • Expire bindings by age or idle time
  • Limit the number of bindings per agent
  • Remove bindings for sessions that ended in error

Evidence: After 10 days of operation, we found:

  • illarion: 32 bindings (including 10 cron task bindings)
  • main: 21 bindings
  • marketer: 13 bindings
  • Total: 66 bindings, 56 session files
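The missing garbage collection could look roughly like the sketch below. It is not OpenClaw code: the real sessions.json schema is not shown in this issue, so the `Binding` shape, the `GcPolicy` fields, and `pruneBindings` are all hypothetical names chosen for illustration. It applies the three rules listed above: drop bindings whose last run ended in error, drop bindings past an age or idle limit, and cap the number kept per agent.

```typescript
// Hypothetical binding record; the actual sessions.json schema may differ.
interface Binding {
  key: string; // e.g. "agent:illarion:telegram:direct:40382952"
  sessionId: string;
  createdAtMs: number;
  lastActiveAtMs: number;
  lastStatus: "ok" | "error";
}

// Hypothetical policy; the field names echo session.threadBindings but are assumptions.
interface GcPolicy {
  maxAgeHours: number;
  idleHours: number;
  maxBindingsPerAgent: number;
}

const HOUR_MS = 60 * 60 * 1000;

// Returns the bindings of one agent that survive garbage collection.
function pruneBindings(bindings: Binding[], policy: GcPolicy, nowMs: number): Binding[] {
  const live = bindings.filter(
    (b) =>
      b.lastStatus !== "error" &&
      nowMs - b.createdAtMs <= policy.maxAgeHours * HOUR_MS &&
      nowMs - b.lastActiveAtMs <= policy.idleHours * HOUR_MS,
  );
  // When over the per-agent cap, keep the most recently active bindings.
  return [...live]
    .sort((a, b) => b.lastActiveAtMs - a.lastActiveAtMs)
    .slice(0, policy.maxBindingsPerAgent);
}
```

Running a pass like this at startup, before any resume is scheduled, would also shrink the resume set described in Issue 2.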

Issue 2: Aggressive startup resume

On gateway restart, every binding in sessions.json triggers a session resume via an embedded LLM run. Resumes are not filtered by:

  • Session age (we had sessions from 5+ days ago)
  • Session completion state (failed sessions get resumed)
  • Session file validity (corrupt/incomplete .jsonl files)

There is no startup budget or concurrency limit for resume operations.
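A startup budget could combine a pre-filter with a small concurrency limit. The sketch below is an illustration only: `ResumeCandidate`, `selectResumable`, and `runWithLimit` are hypothetical names, and how the gateway actually schedules resumes is not shown in this issue.

```typescript
// Hypothetical view of a session that is a candidate for resume at startup.
interface ResumeCandidate {
  sessionId: string;
  lastActiveAtMs: number;
  lastStatus: "ok" | "error";
  fileValid: boolean; // e.g. the .jsonl file parsed without errors
}

// Skip sessions that are too old, ended in error, or have an invalid file.
function selectResumable(
  candidates: ResumeCandidate[],
  maxAgeMs: number,
  nowMs: number,
): ResumeCandidate[] {
  return candidates.filter(
    (c) => c.fileValid && c.lastStatus === "ok" && nowMs - c.lastActiveAtMs <= maxAgeMs,
  );
}

type Outcome<T> = { ok: true; value: T } | { ok: false; error: unknown };

// Run tasks with at most `limit` in flight; one failure does not stop the rest.
async function runWithLimit<T>(
  tasks: Array<() => Promise<T>>,
  limit: number,
): Promise<Outcome<T>[]> {
  const results: Outcome<T>[] = new Array(tasks.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const i = next++;
      const task = tasks[i];
      if (task === undefined) return;
      try {
        results[i] = { ok: true, value: await task() };
      } catch (error) {
        results[i] = { ok: false, error };
      }
    }
  }
  const workerCount = Math.max(0, Math.min(limit, tasks.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

With a limit of 2 (the `maxConcurrentResumes: 2` value suggested under Expected Behavior), at most two resumes would hold slots at any moment, whatever the number of bindings.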

Issue 3: Embedded runs share lane=main

Embedded run resumes use the same lane (main) as incoming user messages. With the default maxConcurrent of 4:

  • 3 agents × multiple stale sessions = lane slots exhausted instantly
  • All new user messages wait in FIFO queue
  • Each stuck embedded run holds its slot for up to 600,000ms
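To put a number on the starvation, here is a back-of-the-envelope sketch. It assumes the worst case: every stale resume runs to its full timeout, and all resumes share one lane with a single global maxConcurrent. Whether the limit is global or per agent is not stated here, so treat the result as an upper bound under those assumptions. `worstCaseDrainMs` is a hypothetical helper, not a gateway function.

```typescript
// Worst-case time until the lane has worked through all stale resumes,
// assuming each one holds its slot for the full timeout.
function worstCaseDrainMs(
  staleResumes: number,
  maxConcurrent: number,
  timeoutMs: number,
): number {
  if (maxConcurrent <= 0) throw new Error("maxConcurrent must be positive");
  const waves = Math.ceil(staleResumes / maxConcurrent); // batches of concurrent runs
  return waves * timeoutMs;
}

// Figures from this report: 66 bindings, maxConcurrent 4, 600,000 ms timeout.
const drainMs = worstCaseDrainMs(66, 4, 600000);
const drainMinutes = drainMs / 60000; // 17 waves × 10 min = 170 min
```

Under these assumptions a user message queued behind the resumes could wait close to three hours, which matches the "typing..." then nothing behavior described in the summary.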

Reproduction

  1. Run gateway with 3 agents for several days
  2. Accumulate sessions naturally (user messages, cron tasks, etc.)
  3. Restart gateway (systemctl --user restart openclaw-gateway)
  4. Observe: all stale sessions resume simultaneously, blocking lane=main

Production Log Evidence

[session/resume] agent=illarion sessionId=f4b7f86b binding=agent:illarion:telegram:direct:40382952
[embedded-run/start] sessionId=f4b7f86b lane=main timeout=600000
[model-fallback/decision] decision=skip_candidate requested=anthropic/claude-sonnet-4-6 reason=auth_permanent
[embedded-run/timeout] sessionId=f4b7f86b elapsed=600000 lane=main
[lane/wait-exceeded] lane=main queue=7 maxConcurrent=4

Pattern repeats across all 3 agents on every gateway restart.

Scale

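| Agent    | Bindings | Session Files | Oldest Session |
|----------|----------|---------------|----------------|
| illarion | 32       | 19            | 5+ days        |
| main     | 21       | 22            | 4+ days        |
| marketer | 13       | 15            | 3+ days        |
| Total    | 66       | 56            |                |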

Expected Behavior

  1. Session GC: Bindings should expire based on configurable maxAgeHours / idleHours, similar to the existing session.threadBindings settings, which do not appear to apply to sessions.json
  2. Startup budget: Limit concurrent session resumes at startup (e.g., maxConcurrentResumes: 2)
  3. Stale session filtering: Skip sessions older than a threshold or in error state
  4. Separate lane for embedded runs: Embedded run resumes should not compete with lane=main user messages, or at minimum have lower priority
  5. Session resume timeout: A shorter timeout for startup resumes (e.g., 60s instead of 600s)
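Item 4 could be approximated without a second lane by giving user messages priority and capping how many slots resumes may hold. The sketch below is one possible shape under that assumption; `Lane`, `Job`, and `maxResumeSlots` are hypothetical names and say nothing about how the gateway's lane system is actually implemented.

```typescript
type JobKind = "user" | "resume";

interface Job {
  id: string;
  kind: JobKind;
}

// A lane that always serves user messages first and never lets
// startup resumes occupy more than `maxResumeSlots` slots.
class Lane {
  private readonly maxConcurrent: number;
  private readonly maxResumeSlots: number;
  private readonly queue: Job[] = [];
  private readonly running = new Map<string, JobKind>();
  private runningResumes = 0;

  constructor(maxConcurrent: number, maxResumeSlots: number) {
    this.maxConcurrent = maxConcurrent;
    this.maxResumeSlots = maxResumeSlots;
  }

  enqueue(job: Job): void {
    this.queue.push(job);
  }

  // Mark a running job as done, freeing its slot.
  finish(id: string): void {
    const kind = this.running.get(id);
    if (kind === undefined) return;
    if (kind === "resume") this.runningResumes--;
    this.running.delete(id);
  }

  // Start as many queued jobs as the limits allow; returns the jobs started.
  dispatch(): Job[] {
    const started: Job[] = [];
    while (this.running.size < this.maxConcurrent) {
      const idx = this.pickNextIndex();
      if (idx === -1) break;
      const job = this.queue.splice(idx, 1)[0];
      if (job === undefined) break;
      this.running.set(job.id, job.kind);
      if (job.kind === "resume") this.runningResumes++;
      started.push(job);
    }
    return started;
  }

  // Oldest user job first; a resume only if the resume cap has room.
  private pickNextIndex(): number {
    const userIdx = this.queue.findIndex((j) => j.kind === "user");
    if (userIdx !== -1) return userIdx;
    if (this.runningResumes >= this.maxResumeSlots) return -1;
    return this.queue.findIndex((j) => j.kind === "resume");
  }
}
```

With `new Lane(4, 2)`, two slots always remain available to user traffic even if dozens of resumes are queued, so a stuck resume can delay other resumes but not user messages.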

Current Workaround

Manual cleanup before restart:

# 1. Clear all session bindings (this discards every binding, not only stale ones)
for agent in main illarion marketer; do
  echo '{}' > "/home/user/.openclaw/agents/$agent/sessions/sessions.json"
done

# 2. Archive stale session files
#    (-mtime +1 matches files last modified at least two days ago)
find /home/user/.openclaw/agents/*/sessions/ -name '*.jsonl' -mtime +1 \
  -exec mv {} {}.stuck-bak \;

# 3. Restart gateway
systemctl --user restart openclaw-gateway

This must be done on every restart, which is not sustainable.

Related Configuration

The following session.threadBindings settings exist in openclaw.json but do not appear to affect sessions.json binding accumulation:

"session": {
  "threadBindings": {
    "maxAgeHours": 120,
    "idleHours": 24,
    "reset": { "mode": "daily", "atHour": 4 }
  }
}

Environment

  • OpenClaw: 2026.3.11
  • Node.js: 22.x
  • Platform: WSL2 (Ubuntu 24.04) on Windows 11
  • 3 agents, Telegram channel, cron tasks active
  • Models: Claude Sonnet 4.6 (via proxy), Ollama qwen2.5-coder:32b (fallback)

Suggested Labels

bug, session-management, lane-system
