Summary
Compaction timeouts create an unrecoverable deadlock on the main session lane. When compaction fails (timeout at 300s or 600s), recovery commands (/new, /reset, --reset-session) queue behind the compaction in the same session lane and cannot execute. The only recovery path is kill -9 + manual session file rename — which took the user ~1 hour to discover.
This has occurred twice in three days (March 6 and March 8, 2026).
Incident 1 — March 6, 2026
Trigger: Large toolResult payloads in session history (single blobs up to 399,999 and 167,483 chars).
Compaction failures:
Session f84eb979 (Anthropic claude-sonnet-4-6):
- 12:31 PM — compaction start.
pre.estTokens=417895, pre.toolResultChars=1,292,139. Top contributor: toolResult:gateway = 399,999 chars
- 12:36 PM — timeout after 300,119ms
- 12:36 PM — retry.
pre.estTokens=260736, pre.toolResultChars=666,940
- 12:41 PM — timeout after 300,150ms
Session 46d47d54 (openai-codex/gpt-5.3-codex):
- 6:56 PM — compaction start.
pre.estTokens=192823, pre.toolResultChars=514,373. Top contributor: toolResult:exec = 167,483 chars
- 7:01 PM — timeout after 300,070ms
- 7:03 PM — retry with gpt-5.2-codex
- 7:08 PM — timeout after 300,071ms
Additional failure mode: Anthropic summarization returned repeated 429 rate-limit errors during compaction (~6:49–6:50 PM), causing both full and partial summarization to fail before the timeout even hit.
Incident 2 — March 8, 2026
Trigger: Main Telegram DM session (cd8786f3) grew to ~3MB / 759 messages / ~1.19M characters with compactionCount: 0 — compaction had never completed successfully on this session.
Timeline (EST):
- ~4:15–4:32 PM — Telegram polling stalls begin. Six stall detections with increasing backoff (2s → 30s).
- 4:20 PM — First compaction timeout.
runId=9ad93d4f, timeoutMs=600000. Gateway fell back to current snapshot.
- 5:19 PM — Second compaction timeout.
runId=33a3c6ef, timeoutMs=600000. Lane wait hit 506,539ms (8.4 minutes) with zero jobs ahead — the compaction itself was the blocker.
- 5:22–5:25 PM — Subagent announce retries (4 attempts) all failed with gateway timeout (60,000ms each).
- 5:26–5:48 PM — Six gateway restarts via SIGTERM. Each restart: gateway starts → Telegram poller connects → typing indicator shows ~2 min → typing TTL expires → no response → SIGTERM. Gateway could not break the cycle.
- ~5:50 PM — User tried
/new in TUI. TUI had stale auth token (v2026.2.26 token mismatch — 112 occurrences). Command did not execute.
- ~5:55 PM — User tried
openclaw acp --session "agent:main:main" --reset-session. Command hung — session locked in compaction, reset queued behind it.
- ~6:00 PM — User tried new ACP session with
uuidgen. Opened but did not affect Telegram DM routing (pinned to agent:main:main).
- ~9:45 PM — Resolution:
kill -9, manually renamed session .jsonl to .jsonl.reset.manual, LaunchAgent restarted gateway with fresh session.
The deadlock:
Every incoming Telegram message triggered safeguard-mode compaction → compaction timed out after 10 minutes → blocked the session lane → all recovery commands (/new, /reset) entered the same lane queue → could not execute until compaction completed → compaction never completed.
Root Cause
- Session lane is single-threaded. Compaction, message processing, and administrative commands (
/new, /reset) all share the same lane. A timed-out compaction blocks everything.
- No compaction circuit breaker. Sessions that fail compaction repeatedly will keep attempting it on every incoming message, consuming the full timeout window each time.
- No out-of-band session reset. All reset paths go through the gateway session lane. If the lane is blocked, there is no recovery without filesystem surgery.
Expected Behavior
/new and /reset should preempt or abort an active compaction, not queue behind it
- Compaction should have a circuit breaker — after N failures, stop retrying on every message
- Session size should trigger a warning or auto-action before compaction becomes untenable (e.g., >500K chars or >500 messages)
- A CLI command should exist for direct session file operations without going through the gateway (e.g.,
openclaw sessions reset --agent main --force)
Environment
- OpenClaw gateway (LaunchAgent, macOS)
- Compaction providers:
anthropic/claude-sonnet-4-6, openai-codex/gpt-5.3-codex, openai-codex/gpt-5.2-codex
- Compaction timeouts: 300s (March 6), 600s (March 8)
- Channel: Telegram DM
Log Sources
- Gateway logs:
~/.openclaw/logs/gateway.err.log
- Session logs:
/tmp/openclaw/openclaw-2026-03-06.log, /tmp/openclaw/openclaw-2026-03-08.log
Summary
Compaction timeouts create an unrecoverable deadlock on the main session lane. When compaction fails (timeout at 300s or 600s), recovery commands (
/new,/reset,--reset-session) queue behind the compaction in the same session lane and cannot execute. The only recovery path iskill -9+ manual session file rename — which took the user ~1 hour to discover.This has occurred twice in three days (March 6 and March 8, 2026).
Incident 1 — March 6, 2026
Trigger: Large
toolResultpayloads in session history (single blobs up to 399,999 and 167,483 chars).Compaction failures:
Session
f84eb979(Anthropic claude-sonnet-4-6):pre.estTokens=417895,pre.toolResultChars=1,292,139. Top contributor:toolResult:gateway = 399,999 charspre.estTokens=260736,pre.toolResultChars=666,940Session
46d47d54(openai-codex/gpt-5.3-codex):pre.estTokens=192823,pre.toolResultChars=514,373. Top contributor:toolResult:exec = 167,483 charsAdditional failure mode: Anthropic summarization returned repeated
429rate-limit errors during compaction (~6:49–6:50 PM), causing both full and partial summarization to fail before the timeout even hit.Incident 2 — March 8, 2026
Trigger: Main Telegram DM session (
cd8786f3) grew to ~3MB / 759 messages / ~1.19M characters withcompactionCount: 0— compaction had never completed successfully on this session.Timeline (EST):
runId=9ad93d4f,timeoutMs=600000. Gateway fell back to current snapshot.runId=33a3c6ef,timeoutMs=600000. Lane wait hit 506,539ms (8.4 minutes) with zero jobs ahead — the compaction itself was the blocker./newin TUI. TUI had stale auth token (v2026.2.26 token mismatch — 112 occurrences). Command did not execute.openclaw acp --session "agent:main:main" --reset-session. Command hung — session locked in compaction, reset queued behind it.uuidgen. Opened but did not affect Telegram DM routing (pinned toagent:main:main).kill -9, manually renamed session.jsonlto.jsonl.reset.manual, LaunchAgent restarted gateway with fresh session.The deadlock:
Every incoming Telegram message triggered safeguard-mode compaction → compaction timed out after 10 minutes → blocked the session lane → all recovery commands (
/new,/reset) entered the same lane queue → could not execute until compaction completed → compaction never completed.Root Cause
/new,/reset) all share the same lane. A timed-out compaction blocks everything.Expected Behavior
/newand/resetshould preempt or abort an active compaction, not queue behind itopenclaw sessions reset --agent main --force)Environment
anthropic/claude-sonnet-4-6,openai-codex/gpt-5.3-codex,openai-codex/gpt-5.2-codexLog Sources
~/.openclaw/logs/gateway.err.log/tmp/openclaw/openclaw-2026-03-06.log,/tmp/openclaw/openclaw-2026-03-08.log