Problem
When compaction times out, selectCompactionTimeoutSnapshot() falls back to an already-overflowed snapshot (e.g., 234k tokens in a 200k context window). The LLM call then hangs or fails repeatedly, blocking the entire lane. All subsequent messages to that agent queue up indefinitely — the bot appears "dead."
Expected behavior
On compaction timeout, there should be a forced recovery path that prevents the session from becoming permanently stuck. For example:
- Truncate to system/bootstrap prompt + last N turns
- Archive the overflowed transcript and start a fresh session
- Skip the failed session and process other queued messages
Suggestion
Add a compaction.timeoutAction setting:
"compaction": {
"timeoutAction": "truncate" // "reset" | "truncate" | "fallback"
}
"truncate" — keep system prompt + bootstrap + last N turns, discard the rest
"reset" — archive transcript, create a new empty session
"fallback" — current behavior (use timeout snapshot as-is)
Additional: Lane isolation
A single session's compaction failure should not block the entire lane. Other sessions/messages should continue to be processed. Consider per-session error isolation so one stuck session doesn't take down the agent.
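One way to get this isolation is to wrap each message's handling in its own error boundary and quarantine sessions that throw, so the rest of the lane keeps draining. A minimal sketch, with illustrative names (drainLane, the queue shape) that are not OpenClaw APIs:

```typescript
type Msg = { sessionId: string; text: string };

// Process a lane's queue; a failure in one session quarantines only that
// session instead of blocking every message behind it.
function drainLane(queue: Msg[], handle: (m: Msg) => void): string[] {
  const failed: string[] = [];
  for (const msg of queue) {
    if (failed.includes(msg.sessionId)) continue; // skip known-stuck sessions
    try {
      handle(msg);
    } catch {
      failed.push(msg.sessionId); // quarantine this session, keep the lane moving
    }
  }
  return failed; // caller can surface or retry quarantined sessions later
}
```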
Current workaround
- Aggressive compaction settings (maxHistoryShare: 0.4, recentTurnsPreserve: 3, early memory flush)
- External session_overflow_guard.sh that scans sessions.json for >90% token usage and archives/removes overflow sessions
- Called from self_heal.sh (every 5 minutes via cron)
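The check the guard script performs amounts to something like the following sketch, assuming a simple sessions.json shape with per-session token counts (the entry shape and function name are illustrative, not the actual script):

```typescript
interface SessionEntry {
  id: string;
  tokens: number; // current transcript size in tokens
}

// Return the ids of sessions above `threshold` of the context window,
// i.e. the ones the guard would archive/remove before they hang the lane.
function findOverflowSessions(
  entries: SessionEntry[],
  contextWindow: number,
  threshold = 0.9,
): string[] {
  return entries
    .filter((e) => e.tokens > contextWindow * threshold)
    .map((e) => e.id);
}
```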
Environment
- OpenClaw 2026.3.7
- Model: gpt-5.4 (200k context window)
- Observed at 234k/200k (117% overflow)