Summary
After upgrading to OpenClaw 2026.4.8, very large Telegram sessions repeatedly hit context overflow and compaction timeout. In this incident, that appeared to cascade into prolonged gateway draining/task rejection, repeated subagent announce failures, and surprising fallback behavior.
This does not look like just “Anthropic fallback is broken.”
The stronger bug shape is a failure chain:
- huge-session overflow + compaction timeout
- gateway enters draining/restart state and rejects new tasks
- subagent completion announce retries fail/give up
- fallback decisions during drain can still route into stale runtime fallback candidates
The user reports this behavior did not happen before this version.
Environment
- OpenClaw: 2026.4.8
- OS: macOS arm64
- Channel: Telegram (multiple accounts)
- Large long-lived sessions (hundreds to >1300 messages)
Evidence (local)
1) Massive overflow + compaction timeout
From artifacts/openclaw-2026-04-08-incident-extract.txt:
- estimatedPromptTokens=1014988 / overflowTokens=759372 (the implied context budget is worked out below)
- messages=1331 / messages=1351+ on affected Telegram sessions
- multiple compaction failures at ~900s:
  - outcome=failed reason=timeout durationMs=900342
  - outcome=failed reason=timeout durationMs=900533
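For scale: taken together, the two token counters above imply an effective context budget of roughly 256k tokens. A minimal sketch of that arithmetic (field names are from the log lines; the limit itself is inferred, not read from OpenClaw):

```ts
// Values copied from the incident extract; the context limit is derived,
// not confirmed from OpenClaw itself.
const estimatedPromptTokens = 1_014_988;
const overflowTokens = 759_372;

// overflow = estimated - limit  =>  limit = estimated - overflow
const impliedContextLimit = estimatedPromptTokens - overflowTokens;
console.log(impliedContextLimit); // 255616 (~256k tokens)
```

In other words, these sessions were roughly 4x over budget before compaction even started, which may be why compaction runs hit the ~900s timeout.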
2) During the failure chain, the gateway rejects tasks as draining
From incident extract and ~/.openclaw/logs/gateway.err.log:
- GatewayDrainingError: Gateway is draining for restart; new tasks are not accepted
- repeated: drain timeout reached; proceeding with restart (this reject-while-draining pattern is sketched below)
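These lines are consistent with a gateway that hard-rejects new work for the entire drain window rather than queueing it. A minimal sketch of that pattern (hypothetical names throughout; this is not OpenClaw source):

```ts
// Hypothetical sketch of the hard-reject drain pattern the logs suggest;
// none of these names are confirmed OpenClaw internals.
class GatewayDrainingError extends Error {}

class Gateway {
  private draining = false;

  beginDrain(timeoutMs: number): void {
    this.draining = true;
    // After the drain window elapses, restart proceeds regardless
    // ("drain timeout reached; proceeding with restart").
    setTimeout(() => this.restart(), timeoutMs);
  }

  async submitTask<T>(task: () => Promise<T>): Promise<T> {
    if (this.draining) {
      // Hard reject: every caller, including subagent announces, must
      // cope with this error for the whole drain window.
      throw new GatewayDrainingError(
        "Gateway is draining for restart; new tasks are not accepted",
      );
    }
    return task();
  }

  private restart(): void {
    /* restart elided */
  }
}
```

If the drain window is sized to wait out in-flight compaction (~900s here), every rejection-sensitive caller has to survive a very long outage.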
3) Subagent completion announce failures/retries during this state
- Subagent completion direct announce failed ... GatewayDrainingError
- Subagent announce completion ... transient failure, retrying
- Subagent announce give up (retry-limit) (the retry-budget interaction is sketched below)
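The give-up line is consistent with a fixed retry budget that is shorter than the drain window. A hedged sketch of that interaction (retry counts, delays, and names are illustrative assumptions, not OpenClaw's actual values):

```ts
// Same hypothetical error type as in the sketch above.
class GatewayDrainingError extends Error {}

// Illustrative retry loop: if the drain window (~900s compaction timeouts
// here) outlasts maxRetries * delayMs, the announce is dropped.
async function announceCompletion(
  send: () => Promise<void>,
  maxRetries = 5,
  delayMs = 2_000,
): Promise<void> {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await send(); // "Subagent completion direct announce"
    } catch (err) {
      if (!(err instanceof GatewayDrainingError)) throw err;
      // "transient failure, retrying"
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  // "Subagent announce give up (retry-limit)"
  throw new Error("Subagent announce give up (retry-limit)");
}
```

With these illustrative numbers an announce survives at most ~10s of drain, so any drain window measured in minutes guarantees loss.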
4) Anthropic fallback still appeared in live fallback decisions during drain
Even after removing Anthropic fallback from config on disk, log lines during draining still showed:
- next=anthropic/claude-haiku-4-5 detail=Gateway is draining for restart; new tasks are not accepted
- and auth failure attempts: candidate=anthropic/claude-haiku-4-5 reason=auth ... HTTP 401 authentication_error: invalid x-api-key
5) On-disk config had Anthropic removed, but runtime lagged until restart
- Current ~/.openclaw/openclaw.json fallback list is only: openai-codex/gpt-5.4 -> openai-codex/gpt-5.3-codex
- Local commit removing Anthropic fallback: 19172db Remove Anthropic model fallback config
- openclaw.json metadata shows it was touched at 2026-04-08T17:10:23.793Z
- But runtime logs still showed next=anthropic/claude-haiku-4-5 at 2026-04-08T17:10:27.547+01:00

This suggests the live runtime config/fallback chain can remain stale until a gateway restart/reload.
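One mechanism that would produce exactly this staleness (an assumption, not confirmed from OpenClaw source) is a load-once config: the fallback chain is parsed at gateway start and cached in memory, so later edits to ~/.openclaw/openclaw.json are invisible until restart/reload:

```ts
import { readFileSync } from "node:fs";

// Hypothetical load-once pattern; not confirmed OpenClaw code.
type Config = { fallbacks: string[] };

const CONFIG_PATH = `${process.env.HOME}/.openclaw/openclaw.json`;

// Read once at process start and cached for the gateway's lifetime...
const config: Config = JSON.parse(readFileSync(CONFIG_PATH, "utf8"));

function nextFallback(current: string): string | undefined {
  // ...so a provider removed on disk (e.g. anthropic/claude-haiku-4-5)
  // can still be selected here until the process restarts or re-reads
  // the file.
  const i = config.fallbacks.indexOf(current);
  return i >= 0 ? config.fallbacks[i + 1] : undefined;
}
```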
Doctor note
openclaw doctor --fix was run locally, but this alone did not reload the running gateway process.
Expected behavior
- Very large-session overflow/compaction failure should degrade gracefully without cascading into prolonged drain/task rejection loops.
- Subagent completion announce should not be lost/give-up during gateway drain windows.
- Runtime fallback chain should not continue using removed fallback providers after config changes are applied to disk.
- If restart/reload is required for fallback-chain changes, surface this clearly in CLI/doctor output (one possible approach is sketched after this list).
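One cheap way to surface the "restart required" state (purely illustrative; the idea of the gateway reporting a loaded-config hash is an assumption, not an existing OpenClaw feature): have doctor compare a hash of the on-disk config against whatever hash the running gateway says it loaded, and warn on mismatch:

```ts
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

// Illustrative only: assumes the running gateway can report a hash of
// the config it actually loaded (no such OpenClaw endpoint is confirmed).
function hashConfig(raw: string): string {
  return createHash("sha256").update(raw).digest("hex");
}

function warnIfRestartRequired(configPath: string, loadedHash: string): void {
  const onDiskHash = hashConfig(readFileSync(configPath, "utf8"));
  if (onDiskHash !== loadedHash) {
    console.warn(
      "doctor: on-disk config differs from the running gateway's loaded " +
        "config; restart/reload is required for changes to take effect",
    );
  }
}
```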
Actual behavior
- Overflow + compaction timeout chain coincided with gateway draining errors and task rejection.
- Subagent announce retries frequently failed/gave up.
- Fallback routing still referenced Anthropic during draining, causing 401 auth errors, despite Anthropic fallback being removed on disk.
Potentially related (but not exact duplicate)
- prior reports about openclaw doctor --fix expectations (it edits on-disk state but does not reload the running gateway)
Request
Please investigate this as a possible 2026.4.8 regression/failure-chain interaction:
- large-session overflow/compaction timeout
- gateway drain/restart task rejection behavior
- subagent announce resilience during drain
- runtime config/fallback-chain reload semantics (especially after removing providers)
If useful, I can provide the extracted incident artifacts/log snippets listed above.