Stale session resume at gateway startup blocks lane=main indefinitely — no session GC or startup budget

## Summary

Gateway startup resumes **all** sessions referenced in `sessions.json` bindings via embedded LLM runs, regardless of session age, failure state, or count. These embedded runs share `lane=main` with user messages and have a 600,000ms (10 min) timeout. In production with 3 agents accumulating sessions over days/weeks, this causes:

1. **Lane starvation**: user messages queue behind stale embedded runs once all lane slots are taken
2. **Cascading timeouts**: each stuck embedded run holds its lane slot for the full 10 minutes, and the runs queued behind it can then do the same
3. **Silent bot failure**: Telegram users see "typing..." indefinitely, then nothing

## Root Cause Analysis

Three compounding issues:

### Issue 1: No session garbage collection

`sessions.json` bindings accumulate indefinitely. There is no mechanism to:
- Expire bindings by age or idle time
- Limit the number of bindings per agent
- Remove bindings for sessions that ended in error

**Evidence:** After 10 days of operation, we found:
- `illarion`: 32 bindings (including 10 cron task bindings)
- `main`: 21 bindings
- `marketer`: 13 bindings
- **Total: 66 bindings, 56 session files**
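A minimal sketch of the kind of GC pass we have in mind, assuming a hypothetical binding shape. The real `sessions.json` schema may differ, and `Binding`, `GcPolicy`, and `pruneBindings` are illustrative names, not existing OpenClaw APIs:

```typescript
// Hypothetical binding shape: the real sessions.json schema may differ.
interface Binding {
  sessionId: string;
  createdAtMs: number;    // when the binding was created
  lastActiveAtMs: number; // last message or run on this session
  endedInError: boolean;  // whether the last run failed
}

// Hypothetical policy, named after the existing threadBindings keys.
interface GcPolicy {
  maxAgeHours: number;
  idleHours: number;
}

const HOUR_MS = 60 * 60 * 1000;

// Returns only the bindings that survive garbage collection.
function pruneBindings(
  bindings: Record<string, Binding>,
  policy: GcPolicy,
  nowMs: number,
): Record<string, Binding> {
  const kept: Record<string, Binding> = {};
  for (const key of Object.keys(bindings)) {
    const b = bindings[key];
    const tooOld = nowMs - b.createdAtMs > policy.maxAgeHours * HOUR_MS;
    const idle = nowMs - b.lastActiveAtMs > policy.idleHours * HOUR_MS;
    if (tooOld || idle || b.endedInError) continue; // drop this binding
    kept[key] = b;
  }
  return kept;
}
```

Running a pass like this before startup resume (and periodically afterwards) would bound the number of bindings that can ever reach the resume step.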

### Issue 2: Aggressive startup resume

On gateway restart, **every** binding in `sessions.json` triggers a session resume via an embedded LLM run. Nothing filters by:
- Session age (we had sessions from 5+ days ago)
- Session completion state (failed sessions get resumed)
- Session file validity (corrupt/incomplete `.jsonl` files)

There is no startup budget or concurrency limit for resume operations.
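For the file-validity check, a cheap pre-resume test could look like the sketch below. It assumes session files hold one JSON object per line, which we have not confirmed against the actual `.jsonl` format, and `isValidJsonl` is a hypothetical helper name:

```typescript
// Returns true when every non-empty line parses as a JSON object.
// Assumption: session files store one JSON object per line.
function isValidJsonl(content: string): boolean {
  const lines = content.split("\n").filter((line) => line.trim() !== "");
  if (lines.length === 0) return false; // treat an empty file as incomplete
  for (const line of lines) {
    try {
      const value: unknown = JSON.parse(line);
      // Reject scalars, null, and arrays: only plain objects count as records.
      if (typeof value !== "object" || value === null || Array.isArray(value)) {
        return false;
      }
    } catch {
      return false; // truncated or corrupt line
    }
  }
  return true;
}
```

A session whose file fails this check could be skipped at startup instead of being handed to an embedded run.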

### Issue 3: Embedded runs share `lane=main`

Embedded run resumes use the same lane (`main`) as incoming user messages. With the default `maxConcurrent` of 4:
- 3 agents × multiple stale sessions = lane slots exhausted instantly
- All new user messages wait in FIFO queue
- Each stuck embedded run holds its slot for up to 600,000ms
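As a rough upper bound (an estimate, not a measurement), the wait for a user message can be worked out under three assumptions: the lane is strict FIFO, every stale resume is enqueued before the user message, and every resume runs until its timeout. `worstCaseWaitMs` is a hypothetical helper for the arithmetic only:

```typescript
// Worst-case delay before a user message gets a lane slot, assuming strict
// FIFO, all stale resumes queued ahead of it, and every resume timing out.
function worstCaseWaitMs(
  staleResumes: number,
  maxConcurrent: number,
  timeoutMs: number,
): number {
  // Each full "wave" of maxConcurrent resumes ahead of the message
  // occupies every slot for one whole timeout period.
  const fullWavesAhead = Math.floor(staleResumes / maxConcurrent);
  return fullWavesAhead * timeoutMs;
}
```

With the values from this report (`maxConcurrent` of 4, 600,000ms timeout), 7 resumes ahead of the message give 600,000ms, and all 66 bindings resuming would give 9,600,000ms (160 minutes).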

## Reproduction

1. Run gateway with 3 agents for several days
2. Accumulate sessions naturally (user messages, cron tasks, etc.)
3. Restart gateway (`systemctl --user restart openclaw-gateway`)
4. Observe: all stale sessions resume simultaneously, blocking `lane=main`

## Production Log Evidence

```
[session/resume] agent=illarion sessionId=f4b7f86b binding=agent:illarion:telegram:direct:40382952
[embedded-run/start] sessionId=f4b7f86b lane=main timeout=600000
[model-fallback/decision] decision=skip_candidate requested=anthropic/claude-sonnet-4-6 reason=auth_permanent
[embedded-run/timeout] sessionId=f4b7f86b elapsed=600000 lane=main
[lane/wait-exceeded] lane=main queue=7 maxConcurrent=4
```

Pattern repeats across all 3 agents on every gateway restart.

## Scale

| Agent | Bindings | Session Files | Oldest Session |
|-------|----------|---------------|----------------|
| illarion | 32 | 19 | 5+ days |
| main | 21 | 22 | 4+ days |
| marketer | 13 | 15 | 3+ days |
| **Total** | **66** | **56** | — |

## Expected Behavior

1. **Session GC**: Bindings should expire based on configurable `maxAgeHours` / `idleHours` (similar to `session.threadBindings` settings that exist but don't seem to apply to `sessions.json`)
2. **Startup budget**: Limit concurrent session resumes at startup (e.g., `maxConcurrentResumes: 2`)
3. **Stale session filtering**: Skip sessions older than a threshold or in error state
4. **Separate lane for embedded runs**: Embedded run resumes should not compete with `lane=main` user messages, or at minimum have lower priority
5. **Session resume timeout**: A shorter timeout for startup resumes (e.g., 60s instead of 600s)
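Items 2 and 4 could be combined in one scheduler. The sketch below is a simplified synchronous model, not the gateway's actual lane implementation: `Lane`, `Job`, and `maxConcurrentResumes` are hypothetical names. User messages are started before resumes, and resumes are capped separately so they can never take every slot:

```typescript
type JobKind = "user" | "resume";

interface Job {
  id: string;
  kind: JobKind;
}

class Lane {
  private queue: Job[] = [];
  private running: Job[] = [];
  private maxConcurrent: number;
  private maxConcurrentResumes: number;

  constructor(maxConcurrent: number, maxConcurrentResumes: number) {
    this.maxConcurrent = maxConcurrent;
    this.maxConcurrentResumes = maxConcurrentResumes;
  }

  enqueue(job: Job): void {
    this.queue.push(job);
  }

  // Starts as many queued jobs as the limits allow and returns them.
  // User messages go first; FIFO order is kept within each kind.
  startNext(): Job[] {
    const started: Job[] = [];
    const order: JobKind[] = ["user", "resume"];
    for (const kind of order) {
      const remaining: Job[] = [];
      for (const job of this.queue) {
        if (job.kind === kind && this.canStart(kind)) {
          this.running.push(job);
          started.push(job);
        } else {
          remaining.push(job);
        }
      }
      this.queue = remaining;
    }
    return started;
  }

  // Frees the slot held by a finished or timed-out job.
  finish(id: string): void {
    this.running = this.running.filter((job) => job.id !== id);
  }

  private canStart(kind: JobKind): boolean {
    if (this.running.length >= this.maxConcurrent) return false;
    if (kind === "user") return true;
    const resumes = this.running.filter((job) => job.kind === "resume").length;
    return resumes < this.maxConcurrentResumes;
  }
}
```

With `new Lane(4, 2)`, five queued resumes followed by one user message would start the user message plus two resumes, leaving one slot free for the next user message rather than filling all four slots with stale sessions.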

## Current Workaround

Manual cleanup before restart:

```bash
# 1. Clear all session bindings
for agent in main illarion marketer; do
  echo '{}' > "/home/user/.openclaw/agents/$agent/sessions/sessions.json"
done

# 2. Archive stale session files
#    (GNU find: -mtime +1 matches files last modified at least 2 days ago)
find /home/user/.openclaw/agents/*/sessions/ -type f -name "*.jsonl" -mtime +1 \
  -exec mv {} {}.stuck-bak \;

# 3. Restart gateway
systemctl --user restart openclaw-gateway
```

This must be done on every restart, which is not sustainable.

## Related Configuration

The following `session.threadBindings` settings exist in `openclaw.json` but do not appear to affect `sessions.json` binding accumulation:

```json
"session": {
  "threadBindings": {
    "maxAgeHours": 120,
    "idleHours": 24,
    "reset": { "mode": "daily", "atHour": 4 }
  }
}
```

## Environment

- OpenClaw: 2026.3.11
- Node.js: 22.x
- Platform: WSL2 (Ubuntu 24.04) on Windows 11
- 3 agents, Telegram channel, cron tasks active
- Models: Claude Sonnet 4.6 (via proxy), Ollama qwen2.5-coder:32b (fallback)

## Suggested Labels

`bug`, `session-management`, `lane-system`
