[Bug]: Agent goes completely silent for 30+ minutes after any provider error — no reply, no error message, no fallback (reproduced on Anthropic AND OpenAI)

## Bug type

Regression - critical correctness failure, not a quality issue

## Summary

**OpenClaw can enter a state where a user message is accepted and tool progress is visible, then the session goes completely silent - no reply, no error message, no fallback - for 30 minutes or more.**

This was reproduced on two separate model providers (`anthropic/claude-sonnet-4-6` and `openai-codex/gpt-5.4`), which means this is **not a provider-specific overload problem**. It is a systemic silent-finalization bug in the embedded runner / reply pipeline.

From the user side it looks like this:
- you send a message
- the agent does one or two tool calls
- then nothing. no typing indicator, no error, no reply
- you wait. and wait. and wait
- 30 minutes later, nothing has changed

## Evidence

### Incident timeline (2026-03-16, Europe/London)

| Time | Event |
|------|-------|
| 15:09:26Z | User sends message. Session starts on `anthropic/claude-sonnet-4-6`. |
| 15:09:29Z | Assistant ends with `stopReason: "error"`, `overloaded_error`. **Empty visible content.** |
| 15:09:36Z | Assistant emits `memory_search` tool call - session *looks* alive. |
| 15:09:38Z | Tool result returns successfully. |
| 15:09:42Z | Assistant ends again with `overloaded_error`. **Still no visible text.** |
| 15:13:55Z | User sends a follow-up message. `exec` tool calls fire. |
| 15:13:57Z–15:14:01Z | Tool results return successfully. |
| 15:14:04Z | Assistant ends again with `overloaded_error`. **Still no reply to user.** |
| 15:38:48Z | User manually switches model to `openai-codex/gpt-5.4`. |
| 15:39:06Z | First visible reply finally appears - about 30 minutes after the original message. |

### Gateway log evidence

```
2026-03-16T15:09:29Z embedded run agent end: runId=605aeb28 isError=true model=claude-sonnet-4-6 provider=anthropic error=overloaded_error
2026-03-16T15:13:48Z embedded run agent end: runId=c26a034d isError=true model=claude-sonnet-4-6 provider=anthropic error=overloaded_error
```

Gateway/ws timings throughout were healthy (~60–170ms). No gateway restart during the incident window.

### Provider-agnostic confirmation (diagnosis 2)

Further session transcript inspection under `~/.openclaw/agents/navi/sessions/` confirmed that **`gpt-5.4` (OpenAI/Codex) exhibits structurally identical failures:**

- Session `d31236e3`: gpt-5.4 emits a tool call, tool returns successfully, then assistant records `stopReason: "error"` with `server_error` and empty content. No user-visible reply.
- Session `1ab38720`: gpt-5.4 completes real work, final assistant message has `stopReason: "stop"` but `text: ""`. Empty visible reply on a successful run.

This means the root cause is **not Anthropic overload**. Anthropic overload is one trigger. OpenAI server errors are another. Both surface the same broken finalization path.
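Both failure shapes seen in the transcripts reduce to one check: the final assistant turn carries no visible text. A minimal sketch of that check follows. The `AssistantTurn` shape is an assumption built only from the fields quoted in this report (`stopReason`, `text`); the `"toolUse"` value and the `errorType` field are hypothetical, and the real transcript schema may differ.

```typescript
// Hypothetical shape: only `stopReason` and `text` are taken from this report.
// "toolUse" and `errorType` are illustrative assumptions.
interface AssistantTurn {
  stopReason: "stop" | "error" | "toolUse";
  text: string;
  errorType?: string; // e.g. "overloaded_error" or "server_error"
}

type SilentKind = "error_without_text" | "stop_with_empty_text";

// Classifies a session's final assistant turn.
// Returns null when the turn has visible text or is not a terminal turn.
function classifySilentEnding(turn: AssistantTurn): SilentKind | null {
  const hasVisibleText = turn.text.trim().length > 0;
  if (hasVisibleText) return null;
  if (turn.stopReason === "error") return "error_without_text";
  if (turn.stopReason === "stop") return "stop_with_empty_text";
  return null; // a tool-use turn is followed by more turns, so it is not terminal
}
```

Under these assumptions, session `d31236e3` would classify as `error_without_text` and session `1ab38720` as `stop_with_empty_text`, regardless of which provider produced the turn.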

## Root Cause (diagnosis 2 conclusion)

The embedded runner / reply bridge **lacks a hard invariant** that every user-visible run must terminate in a user-visible reply. Specifically:

1. In `src/auto-reply/reply/agent-runner.ts`: if `payloadArray.length === 0` the code returns `finalizeWithFollowup(undefined, ...)` - producing **no message at all**.
2. When tool calls have already been emitted, the user perceives the session as live. When the follow-up assistant turn fails and the reply bridge silently returns nothing, the user gets silence with no indication of what happened.
3. Retry/fallback logic (in `src/agents/pi-embedded-runner/run.ts` and `src/agents/model-fallback.ts`) runs silently in the background with no user-facing status during retries.
4. Per-session Discord lane serialization (`src/plugin-sdk/keyed-async-queue.ts`) with a 30-minute worker timeout (`extensions/discord/src/monitor/timeouts.ts`) means subsequent user messages also queue behind the dead run.
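Point 4 is ordinary head-of-line blocking. The sketch below is an illustrative stand-in, not the code in `keyed-async-queue.ts`: the class name, method names, and the `"worker timeout"` error text are all assumptions. It shows how a task that never settles holds up later tasks on the same key until the timeout fires, while other keys are unaffected.

```typescript
// Minimal keyed serial queue: tasks that share a key run strictly one after another.
// Illustrative only; names and behaviour are assumptions, not the project's implementation.
class KeyedSerialQueue {
  private tails = new Map<string, Promise<void>>();

  enqueue<T>(key: string, task: () => Promise<T>, timeoutMs: number): Promise<T> {
    const previous = this.tails.get(key) ?? Promise.resolve();
    // Start only after the previous task for this key has settled.
    const current = previous.then(() => withTimeout(task(), timeoutMs));
    // Store a tail that never rejects so one failure does not poison the lane.
    this.tails.set(key, current.then(() => undefined, () => undefined));
    return current;
  }
}

// Rejects with "worker timeout" if `work` has not settled within `timeoutMs`.
function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("worker timeout")), timeoutMs);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}
```

With `timeoutMs` set to 30 minutes, a follow-up message enqueued on the same session key cannot start until the dead run either settles or times out, which matches the lockout described in this report.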

## Steps to reproduce

1. Use a Discord session for any agent on any provider.
2. Prompt a workflow that triggers at least one tool call.
3. Force or wait for a provider error (`overloaded_error`, `server_error`, or similar) on the assistant turn *after* a tool result returns.
4. Observe: tool results are visible, but no assistant reply follows. Ever.

Alternatively: complete a real multi-step task and observe whether the final assistant `text` is empty.

## Expected behaviour

If a user-visible run starts (tool calls visible, typing indicator shown, or assistant acknowledgement sent), OpenClaw **must** produce one of:
- a visible final reply,
- a visible error/blocked message,
- an explicit `NO_REPLY` that is semantically correct for that channel/task.

Silent finalization for interactive channels like Discord should be treated as a **bug**, not a valid outcome.

## Actual behaviour

The run silently terminates. No message is sent. The session lane may remain effectively blocked for up to 30 minutes (Discord worker timeout ceiling). The user has no idea what happened.

## Implicated files

- `src/auto-reply/reply/agent-runner.ts` - empty-payload exit path
- `src/auto-reply/reply/agent-runner-execution.ts` - tool/typing bridge
- `src/auto-reply/reply/reply-dispatcher.ts` - final dispatch; does not synthesize failure reply
- `src/agents/pi-embedded-runner/run.ts` - retry/fallback orchestration; no user status during retries
- `src/agents/pi-embedded-runner/run/payloads.ts` - error payload construction
- `src/agents/pi-embedded-subscribe.handlers.lifecycle.ts` - lifecycle end logging
- `extensions/discord/src/monitor/inbound-job.ts` - per-session queue key
- `src/plugin-sdk/keyed-async-queue.ts` - serial lane per session
- `extensions/discord/src/monitor/timeouts.ts` - 30-minute worker timeout

## Proposed fix

**Minimum viable patch (highest priority):**

In `src/auto-reply/reply/agent-runner.ts`: if a run has become user-visible and then ends with no payloads or with empty text, synthesize a visible failure reply instead of returning nothing. Something like: `⚠️ The AI encountered an error and could not complete your request. Please try again.`
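A rough sketch of that guard, to make the intended behaviour concrete. `ReplyPayload`, `ensureVisibleReply`, and the `runWasUserVisible` flag are hypothetical names; the real payload type and the way visibility is tracked in `agent-runner.ts` may look quite different.

```typescript
// Hypothetical payload shape; the real reply payload type may differ.
interface ReplyPayload {
  text: string;
}

const FALLBACK_TEXT =
  "⚠️ The AI encountered an error and could not complete your request. Please try again.";

// Guarantees at least one non-blank payload once a run has been user-visible.
// `runWasUserVisible` is an assumed flag meaning tool progress, a typing
// indicator, or an acknowledgement has already been shown to the user.
function ensureVisibleReply(
  payloads: ReplyPayload[],
  runWasUserVisible: boolean,
): ReplyPayload[] {
  // Blank or whitespace-only payloads do not count as a reply.
  const visible = payloads.filter((p) => p.text.trim().length > 0);
  if (visible.length > 0 || !runWasUserVisible) return visible;
  return [{ text: FALLBACK_TEXT }];
}
```

Runs that were never user-visible still return an empty array here, so a semantically correct `NO_REPLY` stays possible; only the visible-then-silent case is rewritten.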

**Second priority:**

Add a watchdog: if tool progress has been emitted but no terminal visible reply arrives within N seconds, surface a status update.
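One possible shape for such a watchdog, as a sketch only. `ReplyWatchdog` and its method names are made up for illustration, and where it would hook into the tool/typing bridge is left open.

```typescript
// Fires `onStall` at most once if tool progress was seen but no terminal
// reply arrived within `timeoutMs`. All names here are illustrative.
class ReplyWatchdog {
  private readonly timeoutMs: number;
  private readonly onStall: () => void;
  private timer: ReturnType<typeof setTimeout> | undefined = undefined;
  private finished = false;

  constructor(timeoutMs: number, onStall: () => void) {
    this.timeoutMs = timeoutMs;
    this.onStall = onStall;
  }

  // Call whenever tool progress becomes visible to the user; restarts the countdown.
  noteToolProgress(): void {
    if (this.finished) return;
    this.clear();
    this.timer = setTimeout(() => {
      this.finished = true;
      this.onStall();
    }, this.timeoutMs);
  }

  // Call once a visible final reply or error message has been sent.
  noteTerminalReply(): void {
    this.finished = true;
    this.clear();
  }

  private clear(): void {
    if (this.timer !== undefined) clearTimeout(this.timer);
    this.timer = undefined;
  }
}
```

Restarting the countdown on each tool event keeps long multi-tool runs from triggering a false alarm, while a run that stops making progress still surfaces a status after N seconds.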

**Third priority:**

Emit a one-shot user-visible status during retries/fallback: `⏳ provider overloaded, retrying...` so the user knows the system is still trying.
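A minimal sketch of the one-shot behaviour, assuming a generic retry loop. `runWithRetryStatus` and `notifyUser` are hypothetical names, not functions from `run.ts` or `model-fallback.ts`, and backoff and model fallback are deliberately left out.

```typescript
// Retries `attempt` up to `maxAttempts` times and tells the user at most once,
// before the first retry, that the system is still working. Names are illustrative.
async function runWithRetryStatus<T>(
  attempt: () => Promise<T>,
  maxAttempts: number,
  notifyUser: (text: string) => void,
): Promise<T> {
  if (maxAttempts < 1) throw new RangeError("maxAttempts must be at least 1");
  let notified = false;
  let lastError: unknown;
  for (let i = 1; i <= maxAttempts; i++) {
    try {
      return await attempt();
    } catch (err) {
      lastError = err;
      // Only notify when another attempt will actually follow.
      if (i < maxAttempts && !notified) {
        notifyUser("⏳ provider overloaded, retrying...");
        notified = true;
      }
    }
  }
  throw lastError;
}
```

The status text is fixed here for brevity; a real version would probably vary it by error type, since `server_error` is not an overload.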

**Lower priority:**

Add logging on every `payloadArray.length === 0` exit and every blank final text emission on interactive channels.

## OpenClaw version

Built from source, commits dated 2026-03-13 to 2026-03-16.

## Operating system

Ubuntu (Linux 6.17.0-19-generic x64)

## Install method

Source (`~/oclaw`)

## Model / provider

Reproduced on both:
- `anthropic/claude-sonnet-4-6`
- `openai-codex/gpt-5.4`

## Impact and severity

**Critical.** This is not a cosmetic bug. A user can be locked out of their agent for 30+ minutes with no indication of what went wrong, no way to know if the system is retrying, and no error message. Subsequent messages may also queue behind the dead run. The experience is equivalent to the agent becoming a black hole.

## Related issues

- #38792 - agent loop silently stalls after API error + tool chains
- #48342 - delay talking to agent when doing nothing
- #40631 - execution-state bug: assistant confirms action but nothing follows

## Additional notes

Investigation performed with OpenCode against live gateway logs and session transcripts. Diagnosis went through two iterations - the second confirmed the provider-agnostic nature of the bug via direct transcript evidence from GPT-5.4 sessions showing identical structural failures.

[Bug]: Agent goes completely silent for 30+ minutes after any provider error — no reply, no error message, no fallback (reproduced on Anthropic AND OpenAI) #48361
