[Bug]: Successful auto-compaction can still end in a 120s embedded timeout and generic /new fallback #67750

@Jackten

Summary

On a long-running embedded-agent session, OpenClaw can:

  1. detect context overflow,
  2. successfully auto-compact the session,
  3. retry the prompt,
  4. hit a 120s LLM idle timeout on the retried run,
  5. exhaust fallback candidates,
  6. and still collapse the user-facing result into the generic:

⚠️ Something went wrong while processing your request. Please try again, or use /new to start a fresh session.

This is confusing because the real failure chain is more specific and more actionable than the final message suggests.

This looks related to, but not the same as:

  • #58957 ("Model switch can fail silently when carried-over session context is too large")
  • closed #20910 ("Auto-reset session when all models time out")

The new evidence here is that the failure still happens even after compaction succeeds. The user is left with the same generic fallback text instead of a compaction/timeout-aware recovery path.

Steps to reproduce

  1. Run a long-lived embedded-agent session with enough accumulated conversation/tool history to trigger context-overflow handling.
  2. Use openai-codex/gpt-5.4 as the active model.
  3. Ensure fallback models are configured, but make them unavailable or exhausted so that the outcome of the retried primary-model run determines the final result.
  4. Send another message on the long-running session.
  5. Observe this sequence:
    • context overflow detected
    • auto-compaction succeeds
    • retry begins
    • the retried run later hits LLM idle timeout (120s): no response from model
    • fallback candidates are unavailable
    • the user receives the generic "Something went wrong..." + /new message

Expected behavior

After a successful compaction attempt, if the retried run still fails, the user should receive a failure message that preserves the real cause chain, for example:

  • compaction was attempted and succeeded
  • the retried model call timed out after the LLM idle-timeout window
  • fallback models were unavailable
  • a suggested next action tied to that state, rather than only a generic /new

At minimum, the final user-facing message should not erase the fact that:

  • this was a long-session / compaction-related incident
  • the timeout happened after compaction
  • the channel itself was healthy
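
A minimal sketch of what a cause-preserving message could look like, assuming the runner can see the compaction outcome and the per-candidate failure reasons at the point where it builds the reply. The names EmbeddedFailureContext, CandidateFailure, and describeEmbeddedFailure are hypothetical and not taken from the OpenClaw source. Only the generic text is quoted from it.

```typescript
// Hypothetical types: these do not mirror any real OpenClaw structure.
type CandidateFailure = {
  model: string;
  reason: "timeout" | "billing" | "other";
};

type EmbeddedFailureContext = {
  compactionSucceeded: boolean;
  idleTimeoutSeconds?: number;
  candidates: CandidateFailure[];
};

// The generic text currently returned, quoted from the issue.
const GENERIC_FAILURE =
  "⚠️ Something went wrong while processing your request. Please try again, or use /new to start a fresh session.";

function describeEmbeddedFailure(ctx: EmbeddedFailureContext): string {
  const timedOut = ctx.candidates.some((c) => c.reason === "timeout");
  // Only specialize the message for the chain described in this issue:
  // compaction succeeded, then the retried run timed out.
  if (!(ctx.compactionSucceeded && timedOut)) {
    return GENERIC_FAILURE;
  }
  const timeoutNote =
    ctx.idleTimeoutSeconds !== undefined
      ? ` (${ctx.idleTimeoutSeconds}s idle timeout)`
      : "";
  const lines: string[] = [
    "⚠️ This long session was compacted successfully, but the retried request still failed.",
  ];
  for (const c of ctx.candidates) {
    if (c.reason === "timeout") {
      lines.push(`• ${c.model}: no response from model${timeoutNote}`);
    } else if (c.reason === "billing") {
      lines.push(`• ${c.model}: unavailable (billing issue)`);
    } else {
      lines.push(`• ${c.model}: failed`);
    }
  }
  lines.push(
    "The channel itself is healthy. Try again later, or use /new to start a fresh session."
  );
  return lines.join("\n");
}
```

Any failure that does not match the compaction-then-timeout chain still gets the existing generic text, so the sketch only narrows the one path reported here.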

Actual behavior

The embedded run falls through to the generic external-run failure text in src/auto-reply/reply/agent-runner-execution.ts:

return "⚠️ Something went wrong while processing your request. Please try again, or use /new to start a fresh session.";

That makes a long-session + post-compaction + timeout failure look the same as many unrelated errors.

OpenClaw version

v2026.4.14 runtime on Linux, from a live dev checkout with dist/ aligned to the running commit.

Operating system

Ubuntu 25.10

Install method

Git clone + local build / user systemd gateway

Model

openai-codex/gpt-5.4

Provider / routing chain

Observed request path:

  • primary: openai-codex/gpt-5.4
  • fallbacks: Anthropic models

In the reproduced incident, Anthropic fallbacks were unavailable because the auth profile was in a billing-failure state, which exposed the final user-facing failure path.

Additional provider/model setup details

  • Channel where this was observed: WhatsApp group
  • Important non-finding: this was not a WhatsApp transport outage
  • The gateway and WhatsApp provider stayed healthy throughout the incident

Logs, screenshots, and evidence

Redacted log excerpts, with timestamps, from one live incident on 2026-04-16:

12:42:19 [agent/embedded] [context-overflow-diag] ... error=Context overflow: estimated context size exceeds safe threshold during tool loop.
12:42:19 [agent/embedded] context overflow detected (attempt 1/3); attempting auto-compaction for openai-codex/gpt-5.4
12:44:51 [agent/embedded] auto-compaction succeeded for openai-codex/gpt-5.4; retrying prompt
12:48:52 [diagnostic] lane task error ... error="FailoverError: LLM request timed out."
12:48:52 [model-fallback/decision] ... candidate_failed requested=openai-codex/gpt-5.4 ... reason=timeout
12:48:52 [model-fallback/decision] ... skip_candidate ... anthropic ... reason=billing
12:48:52 Embedded agent failed before reply: All models failed (3): openai-codex/gpt-5.4: LLM request timed out. (timeout) | anthropic/...: Provider anthropic has billing issue ...

Session artifact from the same run recorded:

LLM idle timeout (120s): no response from model | LLM idle timeout (120s): no response from model

Additional local facts from the same session:

  • session artifact size at time of inspection: 392 lines / 2.2M
  • the failure occurred on the retried run after compaction, not on the original overflow detection
  • gateway stayed active/running

Impact and severity

Medium to high.

This creates a misleading operator experience:

  • the channel looks broken even though transport is healthy
  • the user gets a generic /new recovery message even though OpenClaw already attempted compaction
  • the actionable distinction between "context too large", "post-compaction retry timed out", and "fallbacks unavailable" is lost

On long-running channel sessions, this also risks repeated retry/death-spiral behavior if the same session keeps getting retried.
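
One way to bound that retry loop is a per-session counter of consecutive failures. The sketch below is illustrative only: SessionFailureGuard, its default threshold of 2, and the session keys are assumptions made for this issue, not existing OpenClaw code.

```typescript
// Hypothetical guard: tracks consecutive failed runs per session so a caller
// can stop retrying the same bloated session and surface a specific message.
class SessionFailureGuard {
  private readonly failures = new Map<string, number>();
  private readonly maxConsecutiveFailures: number;

  constructor(maxConsecutiveFailures = 2) {
    this.maxConsecutiveFailures = maxConsecutiveFailures;
  }

  // Call when a run on this session ends in failure.
  recordFailure(sessionKey: string): void {
    this.failures.set(sessionKey, (this.failures.get(sessionKey) ?? 0) + 1);
  }

  // Call when a run on this session produces a reply; resets the counter.
  recordSuccess(sessionKey: string): void {
    this.failures.delete(sessionKey);
  }

  // True once the session has failed the configured number of times in a row.
  shouldStopRetrying(sessionKey: string): boolean {
    return (this.failures.get(sessionKey) ?? 0) >= this.maxConsecutiveFailures;
  }
}
```

A caller would check shouldStopRetrying before starting another run on the same session, and switch to a compaction/timeout-aware message instead of retrying again.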

Additional information

Two upstream directions seem plausible:

  1. Preserve failure-cause specificity after successful compaction, instead of collapsing to the generic external-run failure text.
  2. Revisit the closed #20910 class of "all models timed out on a bloated session" recovery, but for the newer path where compaction already succeeded and the retried run still timed out.

I am filing this as a new issue rather than only commenting on #58957 or closed #20910 because this reproduction is on a current v2026.4.14 runtime and the key distinguishing detail is:

  • compaction succeeded,
  • then the retried run timed out,
  • and the user still got the same generic fallback.

Metadata

    Labels

      • P2 ("Normal backlog priority with limited blast radius.")
      • clawsweeper:linked-pr-open ("ClawSweeper found an open linked pull request for this issue.")
      • clawsweeper:no-new-fix-pr ("ClawSweeper does not recommend queueing a new automated fix PR for this issue.")
      • clawsweeper:source-repro ("ClawSweeper found a high-confidence source-level issue reproduction.")
      • impact:auth-provider ("Auth, provider routing, model choice, or SecretRef resolution may break.")
      • impact:session-state ("Session, memory, transcript, context, or agent state can drift or corrupt.")
      • issue-rating: 🦞 diamond lobster ("Very strong issue quality with high-confidence source-level or clear reproduction.")
      • stale ("Marked as stale due to inactivity")
