Summary
On a long-running embedded-agent session, OpenClaw can:
- detect context overflow,
- successfully auto-compact the session,
- retry the prompt,
- hit a 120s LLM idle timeout on the retried run,
- exhaust fallback candidates,
- and still collapse the user-facing result into the generic:
⚠️ Something went wrong while processing your request. Please try again, or use /new to start a fresh session.
This is confusing because the real failure chain is more specific and more actionable than the final message suggests.
This looks related to, but not the same as:
- #58957 ("Model switch can fail silently when carried-over session context is too large")
- closed #20910 ("Auto-reset session when all models time out")
The new evidence here is that the failure still happens even after compaction succeeds. The user is left with the same generic fallback text instead of a compaction/timeout-aware recovery path.
Steps to reproduce
- Run a long-lived embedded-agent session with enough accumulated conversation/tool history to trigger context-overflow handling.
- Use openai-codex/gpt-5.4 as the active model.
- Ensure fallback models exist, but make them unavailable or exhausted so the retried primary-model result matters.
- Send another message on the long-running session.
- Observe this sequence:
  - context overflow detected
  - auto-compaction succeeds
  - retry begins
  - the retried run later hits LLM idle timeout (120s): no response from model
  - fallback candidates are unavailable
  - the user receives the generic "Something went wrong..." + /new message
Expected behavior
After a successful compaction attempt, if the retried run still fails, the user should receive a failure message that preserves the real cause chain, for example:
- compaction was attempted and succeeded
- the retried model call timed out after the LLM idle-timeout window
- fallback models were unavailable
- suggested next action should be tied to that state, not only a generic /new
At minimum, the final user-facing message should not erase the fact that:
- this was a long-session / compaction-related incident
- the timeout happened after compaction
- the channel itself was healthy
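As a rough illustration of the expected behavior above, here is a minimal TypeScript sketch of a message builder that keeps the cause chain. Every name in it (`FailureChain`, `CandidateFailure`, `describeFailure`, `GENERIC_FAILURE`) is hypothetical and not taken from the OpenClaw codebase, and the message wording is only a suggestion. It assumes the runner can already tell whether compaction succeeded and why each candidate failed.

```typescript
// Hypothetical types: none of these names come from the OpenClaw codebase.
type CandidateFailure = {
  model: string;
  reason: "timeout" | "billing" | "other";
};

interface FailureChain {
  compactionSucceeded: boolean;
  failedAfterCompaction: boolean;
  idleTimeoutSeconds?: number;
  candidates: CandidateFailure[];
  transportHealthy: boolean;
}

const GENERIC_FAILURE =
  "⚠️ Something went wrong while processing your request. Please try again, or use /new to start a fresh session.";

function describeFailure(chain: FailureChain): string {
  const timedOut = chain.candidates.some((c) => c.reason === "timeout");

  // Only specialize the message for the chain described in this issue:
  // compaction succeeded, then the retried run timed out.
  if (!(chain.compactionSucceeded && chain.failedAfterCompaction && timedOut)) {
    return GENERIC_FAILURE;
  }

  const lines: string[] = [
    "⚠️ This session was compacted successfully, but the retried request still failed.",
  ];

  const timeoutWindow =
    chain.idleTimeoutSeconds !== undefined
      ? ` (${chain.idleTimeoutSeconds}s idle timeout)`
      : "";
  lines.push(`- The model did not respond after compaction${timeoutWindow}.`);

  // Report every candidate that failed for a reason other than a timeout.
  const unavailable = chain.candidates.filter((c) => c.reason !== "timeout");
  if (unavailable.length > 0) {
    const summary = unavailable
      .map((c) => `${c.model} (${c.reason})`)
      .join(", ");
    lines.push(`- Fallback models were unavailable: ${summary}.`);
  }

  if (chain.transportHealthy) {
    lines.push("- The channel itself is healthy.");
  }

  lines.push("Try again later, or use /new if this session keeps timing out.");
  return lines.join("\n");
}
```

Any failure that does not match the post-compaction timeout chain still gets the existing generic text, so the sketch changes only the case reported here.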
Actual behavior
The embedded run falls through to the generic external-run failure text in src/auto-reply/reply/agent-runner-execution.ts:
return "⚠️ Something went wrong while processing your request. Please try again, or use /new to start a fresh session.";
That makes a long-session + post-compaction + timeout failure look the same as many unrelated errors.
OpenClaw version
v2026.4.14 runtime on Linux, from a live dev checkout with dist/ aligned to the running commit.
Operating system
Ubuntu 25.10
Install method
Git clone + local build / user systemd gateway
Model
openai-codex/gpt-5.4
Provider / routing chain
Observed request path:
- primary: openai-codex/gpt-5.4
- fallbacks configured via Anthropic models
In the reproduced incident, Anthropic fallbacks were unavailable because the auth profile was in a billing-failure state, which exposed the final user-facing failure path.
Additional provider/model setup details
- Channel where this was observed: WhatsApp group
- Important non-finding: this was not a WhatsApp transport outage
- The gateway and WhatsApp provider stayed healthy throughout the incident
Logs, screenshots, and evidence
Redacted timestamps from one live incident on 2026-04-16:
12:42:19 [agent/embedded] [context-overflow-diag] ... error=Context overflow: estimated context size exceeds safe threshold during tool loop.
12:42:19 [agent/embedded] context overflow detected (attempt 1/3); attempting auto-compaction for openai-codex/gpt-5.4
12:44:51 [agent/embedded] auto-compaction succeeded for openai-codex/gpt-5.4; retrying prompt
12:48:52 [diagnostic] lane task error ... error="FailoverError: LLM request timed out."
12:48:52 [model-fallback/decision] ... candidate_failed requested=openai-codex/gpt-5.4 ... reason=timeout
12:48:52 [model-fallback/decision] ... skip_candidate ... anthropic ... reason=billing
12:48:52 Embedded agent failed before reply: All models failed (3): openai-codex/gpt-5.4: LLM request timed out. (timeout) | anthropic/...: Provider anthropic has billing issue ...
Session artifact from the same run recorded:
LLM idle timeout (120s): no response from model | LLM idle timeout (120s): no response from model
Additional local facts from the same session:
- session artifact size at time of inspection: 392 lines / 2.2M
- the failure occurred on the retried run after compaction, not on the original overflow detection
- gateway stayed active/running
Impact and severity
Medium to high.
This creates a misleading operator experience:
- the channel looks broken even though transport is healthy
- the user gets a generic /new recovery message even though OpenClaw already attempted compaction
- the actionable distinction between "context too large", "post-compaction retry timed out", and "fallbacks unavailable" is lost
On long-running channel sessions, this also risks repeated retry/death-spiral behavior if the same session keeps getting retried.
Additional information
Two upstream directions seem plausible:
- Preserve failure-cause specificity after successful compaction, instead of collapsing to the generic external-run failure text.
- Revisit the closed #20910 class of "all models timed out on a bloated session" recovery, but for the newer path where compaction already succeeded and the retried run still timed out.
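To make the second direction more concrete, the sketch below shows one way a recovery decision could distinguish a repeated post-compaction timeout from a first occurrence, which would also limit the repeated-retry risk mentioned under impact. It is not OpenClaw's recovery logic: `SessionFailureState`, `chooseRecovery`, the `Recovery` values, and the threshold of 2 are all assumptions made up for illustration.

```typescript
// Hypothetical names throughout; this is not OpenClaw's recovery logic.
type Recovery = "retry" | "report-fallbacks-unavailable" | "offer-reset";

interface SessionFailureState {
  compactionSucceeded: boolean;
  primaryTimedOut: boolean;
  fallbacksUnavailable: boolean;
  // How many times in a row this session has timed out after compaction.
  consecutivePostCompactionTimeouts: number;
}

// Assumed threshold: stop retrying the same session after two such timeouts.
const MAX_POST_COMPACTION_TIMEOUTS = 2;

function chooseRecovery(state: SessionFailureState): Recovery {
  const postCompactionTimeout =
    state.compactionSucceeded && state.primaryTimedOut;

  // Repeated timeouts after a successful compaction: retrying the same
  // session is unlikely to help, so offer a reset instead.
  if (
    postCompactionTimeout &&
    state.consecutivePostCompactionTimeouts >= MAX_POST_COMPACTION_TIMEOUTS
  ) {
    return "offer-reset";
  }

  // Fallbacks exist but cannot be used (for example a billing failure):
  // tell the user that, rather than implying the session is the problem.
  if (state.fallbacksUnavailable) {
    return "report-fallbacks-unavailable";
  }

  return "retry";
}
```

With the incident described here (compaction succeeded, primary timed out, fallbacks blocked by billing), the first occurrence would report the unavailable fallbacks, and a second consecutive occurrence would offer a reset.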
I am filing this as a new issue rather than only commenting on #58957 or closed #20910 because this reproduction is on a current v2026.4.14 runtime and the key distinguishing detail is:
- compaction succeeded,
- then the retried run timed out,
- and the user still got the same generic fallback.