Summary
Gateway startup resumes all sessions referenced in sessions.json bindings via embedded LLM runs, regardless of session age, failure state, or count. These embedded runs share lane=main with user messages and have a 600,000ms (10 min) timeout. In production with 3 agents accumulating sessions over days/weeks, this causes:
- Lane starvation — all user messages queue behind stale embedded runs
- Cascading timeouts — one stuck embedded run blocks the lane for 10 minutes
- Silent bot failure — Telegram users see "typing..." indefinitely, then nothing
Root Cause Analysis
Three compounding issues:
Issue 1: No session garbage collection
sessions.json bindings accumulate indefinitely. There is no mechanism to:
- Expire bindings by age or idle time
- Limit the number of bindings per agent
- Remove bindings for sessions that ended in error
Evidence: After 10 days of operation, we found:
- illarion: 32 bindings (including 10 cron task bindings)
- main: 21 bindings
- marketer: 13 bindings
- Total: 66 bindings → 56 session files
Issue 2: Aggressive startup resume
On gateway restart, every binding in sessions.json triggers a session resume via an embedded LLM run. Resumes are not filtered by:
- Session age (we had sessions from 5+ days ago)
- Session completion state (failed sessions get resumed)
- Session file validity (corrupt/incomplete .jsonl files)
There is no startup budget or concurrency limit for resume operations.
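To gauge how many old session files are sitting next to the bindings before a restart, a small diagnostic can help. The following is a minimal sketch, assuming the directory layout implied by the workaround paths (`<base>/agents/<agent>/sessions/*.jsonl`) and GNU `find`; the function name `count_stale_sessions` is hypothetical and not part of OpenClaw. It counts files by modification time only and does not inspect sessions.json itself.

```shell
#!/usr/bin/env bash
# Hypothetical diagnostic: per agent, count session files older than N days.
# Assumed layout: <base>/agents/<agent>/sessions/*.jsonl
count_stale_sessions() {
  local base="$1" days="$2" dir agent count
  for dir in "$base"/agents/*/sessions; do
    # Skip the unexpanded glob pattern when no agent directories exist
    [ -d "$dir" ] || continue
    agent="$(basename "$(dirname "$dir")")"
    # -mtime +N matches files whose age in whole days is greater than N;
    # the arithmetic expansion strips any padding wc may add
    count=$(( $(find "$dir" -maxdepth 1 -type f -name '*.jsonl' -mtime "+$days" | wc -l) ))
    printf '%s %s\n' "$agent" "$count"
  done
}

# Example: count_stale_sessions "$HOME/.openclaw" 1
```

Each output line is an agent name followed by its stale-file count, which makes it easy to compare against the binding counts reported above.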
Issue 3: Embedded runs share lane=main
Embedded run resumes use the same lane (main) as incoming user messages. With the default maxConcurrent of 4:
- 3 agents × multiple stale sessions = lane slots exhausted instantly
- All new user messages wait in FIFO queue
- Each stuck embedded run holds its slot for up to 600,000ms
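The slot arithmetic can be made concrete. This is a rough worst-case sketch, assuming every binding is resumed, every resumed run hangs until its timeout, and maxConcurrent caps the whole lane; the function name `backlog_drain_minutes` is illustrative only.

```shell
#!/usr/bin/env bash
# Upper bound on how long a resume backlog can occupy a lane.
# Assumes each run holds its slot for the full timeout and the lane
# runs at most max_concurrent runs at once.
backlog_drain_minutes() {
  local runs="$1" max_concurrent="$2" timeout_ms="$3"
  # Ceiling division: number of back-to-back waves needed to run everything
  local waves=$(( (runs + max_concurrent - 1) / max_concurrent ))
  # Each wave lasts one full timeout; convert milliseconds to minutes
  echo $(( waves * timeout_ms / 60000 ))
}

backlog_drain_minutes 66 4 600000   # prints 170
```

With 66 bindings, 4 slots, and a 600,000 ms timeout, the backlog needs 17 waves, or up to 170 minutes, to drain. This is a ceiling under the stated assumptions, not a measurement.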
Reproduction
- Run gateway with 3 agents for several days
- Accumulate sessions naturally (user messages, cron tasks, etc.)
- Restart gateway (systemctl --user restart openclaw-gateway)
- Observe: all stale sessions resume simultaneously, blocking lane=main
Production Log Evidence
[session/resume] agent=illarion sessionId=f4b7f86b binding=agent:illarion:telegram:direct:40382952
[embedded-run/start] sessionId=f4b7f86b lane=main timeout=600000
[model-fallback/decision] decision=skip_candidate requested=anthropic/claude-sonnet-4-6 reason=auth_permanent
[embedded-run/timeout] sessionId=f4b7f86b elapsed=600000 lane=main
[lane/wait-exceeded] lane=main queue=7 maxConcurrent=4
Pattern repeats across all 3 agents on every gateway restart.
Scale
| Agent | Bindings | Session Files | Oldest Session |
|---|---|---|---|
| illarion | 32 | 19 | 5+ days |
| main | 21 | 22 | 4+ days |
| marketer | 13 | 15 | 3+ days |
| Total | 66 | 56 | n/a |
Expected Behavior
- Session GC: Bindings should expire based on configurable maxAgeHours / idleHours (similar to session.threadBindings settings that exist but don't seem to apply to sessions.json)
- Startup budget: Limit concurrent session resumes at startup (e.g., maxConcurrentResumes: 2)
- Stale session filtering: Skip sessions older than a threshold or in error state
- Separate lane for embedded runs: Embedded run resumes should not compete with lane=main user messages, or at minimum have lower priority
- Session resume timeout: A shorter timeout for startup resumes (e.g., 60s instead of 600s)
Current Workaround
Manual cleanup before restart:
# 1. Clear all session bindings
for agent in main illarion marketer; do
  echo '{}' > "/home/user/.openclaw/agents/$agent/sessions/sessions.json"
done
# 2. Archive stale session files
find /home/user/.openclaw/agents/*/sessions/ -name "*.jsonl" -mtime +1 \
  -exec mv {} {}.stuck-bak \;
# 3. Restart gateway
systemctl --user restart openclaw-gateway
This must be done on every restart, which is not sustainable.
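Until a fix lands, the manual steps can be wrapped in a function that keeps a backup of each sessions.json before resetting it. This is a sketch only, assuming the same directory layout as the commands above; `clean_sessions` and the `.bak-<timestamp>` suffix are hypothetical names, and the restart is left to the caller.

```shell
#!/usr/bin/env bash
# Hypothetical helper: back up and reset bindings, then archive old session files.
# Assumed layout: <base>/agents/<agent>/sessions/
clean_sessions() {
  local base="$1" days="${2:-1}" stamp dir
  stamp="$(date +%Y%m%d-%H%M%S)"
  for dir in "$base"/agents/*/sessions; do
    [ -d "$dir" ] || continue
    if [ -f "$dir/sessions.json" ]; then
      # Keep a copy so bindings can be restored if the reset was a mistake
      cp -p "$dir/sessions.json" "$dir/sessions.json.bak-$stamp"
      echo '{}' > "$dir/sessions.json"
    fi
    # Rename stale session files instead of deleting them
    find "$dir" -maxdepth 1 -type f -name '*.jsonl' -mtime "+$days" \
      -exec sh -c 'mv "$1" "$1.stuck-bak"' _ {} \;
  done
}

# Example: clean_sessions "$HOME/.openclaw" 1 && systemctl --user restart openclaw-gateway
```

Iterating over whatever agent directories exist avoids hard-coding agent names, and the backup makes the reset reversible. It still has to be run before every restart, so it does not remove the need for built-in garbage collection.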
Related Configuration
The following session.threadBindings settings exist in openclaw.json but do not appear to affect sessions.json binding accumulation:
"session": {
  "threadBindings": {
    "maxAgeHours": 120,
    "idleHours": 24,
    "reset": { "mode": "daily", "atHour": 4 }
  }
}
Environment
- OpenClaw: 2026.3.11
- Node.js: 22.x
- Platform: WSL2 (Ubuntu 24.04) on Windows 11
- 3 agents, Telegram channel, cron tasks active
- Models: Claude Sonnet 4.6 (via proxy), Ollama qwen2.5-coder:32b (fallback)
Suggested Labels
bug, session-management, lane-system