Problem
When the primary model (e.g. Opus) is degraded or overloaded, context compaction can enter a death spiral:
- Context hits the limit, compaction triggers
- Compaction sends the full context to the model, and the request times out after 10 minutes (timeoutMs=600000)
- Failure kills Telegram polling ("Unsubscribed during compaction")
- Next inbound message triggers another compaction attempt
- Repeat indefinitely
This blocked all message processing for over an hour tonight (2026-03-13, ~9:25 PM to ~10:30 PM ET). The gateway was healthy the entire time. The agent session was completely unresponsive.
From the error log:
[agent/embedded] embedded run timeout: runId=... timeoutMs=600000
[agent/embedded] using current snapshot: timed out during compaction
[telegram] Restarting polling after unhandled network error: Unsubscribed during compaction
[telegram] polling runner stopped (unhandled network error); restarting in 23.29s
[diagnostic] lane wait exceeded: lane=session:agent:main:main waitedMs=600160 queueAhead=0
This repeated 5+ times in sequence.
Proposed solutions
- Fallback model for compaction: If compaction fails 2x in a row with the primary model, retry with a faster/larger-context fallback (e.g. Sonnet, which is cheaper and has a bigger context window). Could be configurable: agents.defaults.compactionFallbackModel.
- Force-truncate after N failures: If compaction fails 3x total (including fallback), hard-truncate the context (drop oldest messages) rather than retrying indefinitely. Lossy but better than total unresponsiveness.
- Don't block Telegram polling during compaction: The compaction failure currently crashes the polling connection. Compaction should not take down the channel transport. Even if the agent can't respond yet, it should still be receiving messages.
- Expose compaction health in gateway status: openclaw gateway status should show if compaction is currently running, how many times it's failed, and whether the session is effectively stuck.
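For illustration, here is a rough sketch of how the first two proposals (fallback model, then force-truncate) could fit together. Everything in it is hypothetical: the `Message`, `CompactionPolicy`, and `Summarize` types, the function names, and the choice to preserve system messages during truncation are assumptions for the sketch, not OpenClaw's actual internals. The `summarize` callback stands in for whatever call currently performs compaction.

```typescript
// Hypothetical types for the sketch; none of these names come from OpenClaw.
type Message = { role: "system" | "user" | "assistant"; text: string };
type Summarize = (model: string, messages: Message[]) => Promise<Message[]>;

interface CompactionPolicy {
  primaryModel: string;
  fallbackModel: string;     // would map to the proposed agents.defaults.compactionFallbackModel
  primaryAttempts: number;   // 2 in the proposal
  maxTotalAttempts: number;  // 3 in the proposal, fallback included
  keepLastMessages: number;  // assumed knob: how many recent messages survive truncation
}

interface CompactionResult {
  messages: Message[];
  strategy: "primary" | "fallback" | "truncated";
  failures: number; // could feed a status/health readout
}

// Lossy last resort: drop the oldest non-system messages.
// Keeping system messages is an assumption, not something the proposal specifies.
function hardTruncate(messages: Message[], keepLast: number): Message[] {
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  const kept = keepLast > 0 ? rest.slice(-keepLast) : [];
  return [...system, ...kept];
}

async function compactWithFallback(
  messages: Message[],
  policy: CompactionPolicy,
  summarize: Summarize,
): Promise<CompactionResult> {
  let failures = 0;
  while (failures < policy.maxTotalAttempts) {
    // Switch to the fallback model once the primary has used up its attempts.
    const useFallback = failures >= policy.primaryAttempts;
    const model = useFallback ? policy.fallbackModel : policy.primaryModel;
    try {
      const compacted = await summarize(model, messages);
      return {
        messages: compacted,
        strategy: useFallback ? "fallback" : "primary",
        failures,
      };
    } catch {
      // A timeout or provider error counts as one failed attempt.
      failures += 1;
    }
  }
  // Bounded retries: never loop forever, truncate instead.
  return {
    messages: hardTruncate(messages, policy.keepLastMessages),
    strategy: "truncated",
    failures,
  };
}
```

The key property is that the loop is bounded: after `maxTotalAttempts` failures the session gets a smaller (lossy) context instead of another 10-minute attempt. The returned `failures` count and `strategy` are the kind of data the status output could surface.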
Environment
- OpenClaw 2026.3.12
- Model: anthropic/claude-opus-4-6
- Channel: Telegram
- OS: macOS (Darwin 25.3.0, arm64)