[Bug]: Successful auto-compaction can still end in a 120s embedded timeout and generic `/new` fallback #67750

## Summary

On a long-running embedded-agent session, OpenClaw can:

1. detect context overflow,
2. successfully auto-compact the session,
3. retry the prompt,
4. hit a 120s LLM idle timeout on the retried run,
5. exhaust fallback candidates,
6. and still collapse the user-facing result into the generic:

`⚠️ Something went wrong while processing your request. Please try again, or use /new to start a fresh session.`

This is confusing because the real failure chain is more specific and more actionable than the final message suggests.

This looks related to, but not the same as:

- `#58957` ("Model switch can fail silently when carried-over session context is too large")
- closed `#20910` ("Auto-reset session when all models time out")

The new evidence here is that the failure still happens even after compaction succeeds. The user is left with the same generic fallback text instead of a compaction/timeout-aware recovery path.

## Steps to reproduce

1. Run a long-lived embedded-agent session with enough accumulated conversation/tool history to trigger context-overflow handling.
2. Use `openai-codex/gpt-5.4` as the active model.
3. Configure fallback models, but make them unavailable or exhausted so that the retried primary-model run determines the final result.
4. Send another message on the long-running session.
5. Observe this sequence:
   - context overflow detected
   - auto-compaction succeeds
   - retry begins
   - the retried run later hits `LLM idle timeout (120s): no response from model`
   - fallback candidates are unavailable
   - the user receives the generic "Something went wrong..." + `/new` message

## Expected behavior

After a successful compaction attempt, if the retried run still fails, the user should receive a failure message that preserves the real cause chain, for example:

- compaction was attempted and succeeded
- the retried model call timed out after the LLM idle-timeout window
- fallback models were unavailable
- the suggested next action is tied to that state, rather than being only a generic `/new`

At minimum, the final user-facing message should not erase the fact that:

- this was a long-session / compaction-related incident
- the timeout happened after compaction
- the channel itself was healthy
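
As a rough illustration of the message shape described above, here is a minimal TypeScript sketch that builds a cause-aware failure text from a small state record. Everything in it is hypothetical: `EmbeddedFailureContext`, `formatEmbeddedFailure`, and the wording of the specific message are assumptions for illustration, not OpenClaw's actual types or strings. Only the generic text is quoted from the current behavior.

```ts
// Hypothetical types and names for illustration; not OpenClaw's actual API.
interface EmbeddedFailureContext {
  compactionSucceeded: boolean; // auto-compaction ran and reported success
  retryTimedOut: boolean; // the retried model call hit the LLM idle timeout
  idleTimeoutSeconds: number; // 120 in the reported incident
  fallbacksUnavailable: boolean; // every fallback candidate was skipped or failed
}

// The generic text currently shown to the user, quoted from this report.
const GENERIC_FAILURE =
  "⚠️ Something went wrong while processing your request. Please try again, or use /new to start a fresh session.";

function formatEmbeddedFailure(ctx: EmbeddedFailureContext): string {
  // Keep the generic text unless the post-compaction timeout chain is present.
  if (!(ctx.compactionSucceeded && ctx.retryTimedOut)) {
    return GENERIC_FAILURE;
  }
  const parts: string[] = [
    "⚠️ This session was compacted successfully, but the retried request",
    `timed out after ${ctx.idleTimeoutSeconds}s with no response from the model.`,
  ];
  if (ctx.fallbacksUnavailable) {
    parts.push("Fallback models were unavailable.");
  }
  parts.push("Try again shortly, or use /new to start a fresh session.");
  return parts.join(" ");
}

console.log(
  formatEmbeddedFailure({
    compactionSucceeded: true,
    retryTimedOut: true,
    idleTimeoutSeconds: 120,
    fallbacksUnavailable: true,
  }),
);
```

The sketch deliberately falls back to the existing generic text for every other failure, so only the chain described in this report gets a more specific message.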

## Actual behavior

The embedded run falls through to the generic external-run failure text in `src/auto-reply/reply/agent-runner-execution.ts`:

```ts
return "⚠️ Something went wrong while processing your request. Please try again, or use /new to start a fresh session.";
```

That makes a long-session + post-compaction + timeout failure look the same as many unrelated errors.

## OpenClaw version

`v2026.4.14` runtime on Linux, from a live dev checkout with `dist/` aligned to the running commit.

## Operating system

Ubuntu 25.10

## Install method

Git clone + local build / user systemd gateway

## Model

`openai-codex/gpt-5.4`

## Provider / routing chain

Observed request path:

- primary: `openai-codex/gpt-5.4`
- fallbacks configured via Anthropic models

In the reproduced incident, Anthropic fallbacks were unavailable because the auth profile was in a billing-failure state, which exposed the final user-facing failure path.

## Additional provider/model setup details

- Channel where this was observed: WhatsApp group
- Important non-finding: this was **not** a WhatsApp transport outage
- The gateway and WhatsApp provider stayed healthy throughout the incident

## Logs, screenshots, and evidence

Redacted log excerpt from one live incident on 2026-04-16:

```text
12:42:19 [agent/embedded] [context-overflow-diag] ... error=Context overflow: estimated context size exceeds safe threshold during tool loop.
12:42:19 [agent/embedded] context overflow detected (attempt 1/3); attempting auto-compaction for openai-codex/gpt-5.4
12:44:51 [agent/embedded] auto-compaction succeeded for openai-codex/gpt-5.4; retrying prompt
12:48:52 [diagnostic] lane task error ... error="FailoverError: LLM request timed out."
12:48:52 [model-fallback/decision] ... candidate_failed requested=openai-codex/gpt-5.4 ... reason=timeout
12:48:52 [model-fallback/decision] ... skip_candidate ... anthropic ... reason=billing
12:48:52 Embedded agent failed before reply: All models failed (3): openai-codex/gpt-5.4: LLM request timed out. (timeout) | anthropic/...: Provider anthropic has billing issue ...
```

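The session artifact from the same run recorded two idle timeouts, which is consistent with the roughly four-minute gap between the retry at 12:44:51 and the failure at 12:48:52: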

```text
LLM idle timeout (120s): no response from model | LLM idle timeout (120s): no response from model
```

Additional local facts from the same session:

- session artifact size at time of inspection: `392` lines / `2.2M`
- the failure occurred on the retried run after compaction, not on the original overflow detection
- gateway stayed `active/running`
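
For triage, the log excerpt in this section can be reduced to an ordered list of steps and checked for the "timeout after successful compaction" pattern. The TypeScript sketch below is only an illustration: the `ChainStep` names, `extractChain`, and `isPostCompactionTimeout` are hypothetical helpers, and the regular expressions are assumptions based solely on the redacted lines shown here, so they may not match the full log format.

```ts
// Hypothetical step labels for the failure chain seen in the redacted logs.
type ChainStep =
  | "overflow"
  | "compaction_succeeded"
  | "timeout"
  | "fallback_skipped"
  | "all_failed";

// Patterns are guesses derived from the excerpt in this report only.
const PATTERNS: Array<[ChainStep, RegExp]> = [
  ["overflow", /context overflow detected/],
  ["compaction_succeeded", /auto-compaction succeeded/],
  ["timeout", /candidate_failed .*reason=timeout/],
  ["fallback_skipped", /skip_candidate .*reason=/],
  ["all_failed", /All models failed/],
];

// Map each log line to at most one step, preserving log order.
function extractChain(lines: string[]): ChainStep[] {
  const steps: ChainStep[] = [];
  for (const line of lines) {
    for (const [step, pattern] of PATTERNS) {
      if (pattern.test(line)) {
        steps.push(step);
        break;
      }
    }
  }
  return steps;
}

// True only when a timeout appears after a successful compaction.
function isPostCompactionTimeout(steps: ChainStep[]): boolean {
  const compacted = steps.indexOf("compaction_succeeded");
  if (compacted === -1) return false;
  return steps.indexOf("timeout", compacted + 1) !== -1;
}
```

Run against the excerpt above, a check like this would separate this incident from a plain timeout on a session that was never compacted.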

## Impact and severity

Medium to high.

This creates a misleading operator experience:

- the channel looks broken even though transport is healthy
- the user gets a generic `/new` recovery message even though OpenClaw already attempted compaction
- the actionable distinction between "context too large", "post-compaction retry timed out", and "fallbacks unavailable" is lost

On long-running channel sessions, this also risks repeated retry/death-spiral behavior if the same session keeps getting retried.

## Additional information

Two upstream directions seem plausible:

1. Preserve failure-cause specificity after successful compaction, instead of collapsing to the generic external-run failure text.
2. Revisit the closed `#20910` class of "all models timed out on a bloated session" recovery, but for the newer path where compaction already succeeded and the retried run still timed out.
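
To make the second direction more concrete, one possible shape is a small per-session guard that counts consecutive post-compaction timeouts and recommends a reset once a threshold is reached. This TypeScript sketch is an assumption-heavy illustration: `PostCompactionTimeoutGuard`, its method names, and the default threshold of 2 are all hypothetical and not taken from OpenClaw or from `#20910`.

```ts
// Hypothetical outcome labels; a real implementation may need more states.
type GuardDecision = "retry" | "recommend_reset";

// Counts consecutive post-compaction timeouts per session (illustrative only).
class PostCompactionTimeoutGuard {
  private readonly counts = new Map<string, number>();
  private readonly maxConsecutive: number;

  // The default threshold of 2 is an arbitrary illustrative value.
  constructor(maxConsecutive: number = 2) {
    this.maxConsecutive = maxConsecutive;
  }

  // Call when a retried run times out after a successful compaction.
  recordTimeout(sessionId: string): GuardDecision {
    const next = (this.counts.get(sessionId) ?? 0) + 1;
    this.counts.set(sessionId, next);
    return next >= this.maxConsecutive ? "recommend_reset" : "retry";
  }

  // Call when the session produces a reply, which clears the streak.
  recordSuccess(sessionId: string): void {
    this.counts.delete(sessionId);
  }
}
```

A guard like this would address the repeated-retry risk on long-running channel sessions without resetting a session on its first post-compaction timeout.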

I am filing this as a new issue rather than only commenting on `#58957` or closed `#20910` because this reproduction is on a current `v2026.4.14` runtime and the key distinguishing detail is:

- compaction succeeded,
- then the retried run timed out,
- and the user still got the same generic fallback.

