Bug type
Regression - critical correctness failure, not a quality issue
Summary
OpenClaw can enter a state where a user message is accepted and tool progress is visible, then the session goes completely silent - no reply, no error message, no fallback - for 30 minutes or more.
This was reproduced on two separate model providers (anthropic/claude-sonnet-4-6 and openai-codex/gpt-5.4), which means this is not a provider-specific overload problem. It is a systemic silent-finalization bug in the embedded runner / reply pipeline.
From the user side it looks like this:
- you send a message
- the agent does one or two tool calls
- then nothing. no typing indicator, no error, no reply
- you wait. and wait. and wait
- 30 minutes later, nothing has changed
Evidence
Incident timeline (2026-03-16, Europe/London)
| Time | Event |
| --- | --- |
| 15:09:26Z | User sends message. Session starts on anthropic/claude-sonnet-4-6. |
| 15:09:29Z | Assistant ends with stopReason: "error", overloaded_error. Empty visible content. |
| 15:09:36Z | Assistant emits memory_search tool call - session looks alive. |
| 15:09:38Z | Tool result returns successfully. |
| 15:09:42Z | Assistant ends again with overloaded_error. Still no visible text. |
| 15:13:55Z | User sends a follow-up message. exec tool calls fire. |
| 15:13:57Z–15:14:01Z | Tool results return successfully. |
| 15:14:04Z | Assistant ends again with overloaded_error. Still no reply to user. |
| 15:38:48Z | User manually switches model to openai-codex/gpt-5.4. |
| 15:39:06Z | First visible reply finally appears - 30 minutes after the original message. |
Gateway log evidence
2026-03-16T15:09:29Z embedded run agent end: runId=605aeb28 isError=true model=claude-sonnet-4-6 provider=anthropic error=overloaded_error
2026-03-16T15:13:48Z embedded run agent end: runId=c26a034d isError=true model=claude-sonnet-4-6 provider=anthropic error=overloaded_error
Gateway/ws timings throughout were healthy (~60–170ms). No gateway restart during the incident window.
Provider-agnostic confirmation (diagnosis 2)
Further session transcript inspection under ~/.openclaw/agents/navi/sessions/ confirmed that gpt-5.4 (OpenAI/Codex) exhibits structurally identical failures:
- Session d31236e3: gpt-5.4 emits a tool call, tool returns successfully, then assistant records stopReason: "error" with server_error and empty content. No user-visible reply.
- Session 1ab38720: gpt-5.4 completes real work, final assistant message has stopReason: "stop" but text: "". Empty visible reply on a successful run.
This means the root cause is not Anthropic overload. Anthropic overload is one trigger. OpenAI server errors are another. Both surface the same broken finalization path.
Root Cause (diagnosis 2 conclusion)
The embedded runner / reply bridge lacks a hard invariant that every user-visible run must terminate in a user-visible reply. Specifically:
- In src/auto-reply/reply/agent-runner.ts: if payloadArray.length === 0, the code returns finalizeWithFollowup(undefined, ...) - producing no message at all.
- When tool calls have already been emitted, the user perceives the session as live. When the follow-up assistant turn fails and the reply bridge silently returns nothing, the user gets silence with no indication of what happened.
- Retry/fallback logic (in src/agents/pi-embedded-runner/run.ts and src/agents/model-fallback.ts) runs silently in the background with no user-facing status during retries.
- Per-session Discord lane serialization (src/plugin-sdk/keyed-async-queue.ts) with a 30-minute worker timeout (extensions/discord/src/monitor/timeouts.ts) means subsequent user messages also queue behind the dead run.
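To illustrate the lane-blocking effect in the last point, here is a minimal sketch of a per-key serial queue. It is not the code in src/plugin-sdk/keyed-async-queue.ts; `KeyedSerialQueue` and `enqueue` are hypothetical names, and the sketch only assumes that tasks sharing a session key run strictly one after another.

```typescript
// Hypothetical per-key serial queue: tasks that share a key run one at a time.
class KeyedSerialQueue {
  private tails = new Map<string, Promise<void>>();

  enqueue<T>(key: string, task: () => Promise<T>): Promise<T> {
    // Start from the current tail of this key's lane, or an empty lane.
    const prev = this.tails.get(key) ?? Promise.resolve();
    // The new task starts only after the previous task for this key settles.
    const run = prev.then(() => task());
    // Store a tail that never rejects, so one failed task does not poison the lane.
    this.tails.set(key, run.then(() => undefined, () => undefined));
    return run;
  }
}
```

Under this model, if the first task for a session never settles, every later `enqueue` call with the same key waits behind it, while other keys proceed normally. A worker timeout only bounds how long that wait can last.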
Steps to reproduce
- Use a Discord session for any agent on any provider.
- Prompt a workflow that triggers at least one tool call.
- Force or wait for a provider error (overloaded_error, server_error, or similar) on the assistant turn after a tool result returns.
- Observe: tool results are visible, but no assistant reply follows. Ever.
Alternatively: complete a real multi-step task and observe whether the final assistant text is empty.
Expected behaviour
If a user-visible run starts (tool calls visible, typing indicator shown, or assistant acknowledgement sent), OpenClaw must produce one of:
- a visible final reply,
- a visible error/blocked message,
- an explicit NO_REPLY that is semantically correct for that channel/task.
Silent finalization for interactive channels like Discord should be treated as a bug, not a valid outcome.
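One way to make this expectation checkable is to model the three acceptable outcomes as a discriminated union and reject everything else. This is only a sketch: `TerminalOutcome` and `isValidTerminalOutcome` are hypothetical names, not existing OpenClaw types.

```typescript
// Hypothetical model of the three acceptable terminal outcomes for a visible run.
type TerminalOutcome =
  | { kind: "reply"; text: string }
  | { kind: "error"; text: string }
  | { kind: "no_reply"; reason: string };

// Returns true only if the outcome is visible to the user or explicitly justified.
function isValidTerminalOutcome(outcome: TerminalOutcome | undefined): boolean {
  if (outcome === undefined) return false; // silent finalization
  switch (outcome.kind) {
    case "reply":
    case "error":
      return outcome.text.trim().length > 0; // blank text counts as silence
    case "no_reply":
      return outcome.reason.trim().length > 0; // NO_REPLY must carry a reason
  }
}
```

Both failure shapes from the transcripts would be rejected by this check: the missing payload (`undefined`) and the successful run whose final text is `""`.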
Actual behaviour
The run silently terminates. No message is sent. The session lane may remain effectively blocked for up to 30 minutes (Discord worker timeout ceiling). The user has no idea what happened.
Implicated files
src/auto-reply/reply/agent-runner.ts - empty-payload exit path
src/auto-reply/reply/agent-runner-execution.ts - tool/typing bridge
src/auto-reply/reply/reply-dispatcher.ts - final dispatch; does not synthesize failure reply
src/agents/pi-embedded-runner/run.ts - retry/fallback orchestration; no user status during retries
src/agents/pi-embedded-runner/run/payloads.ts - error payload construction
src/agents/pi-embedded-subscribe.handlers.lifecycle.ts - lifecycle end logging
extensions/discord/src/monitor/inbound-job.ts - per-session queue key
src/plugin-sdk/keyed-async-queue.ts - serial lane per session
extensions/discord/src/monitor/timeouts.ts - 30-minute worker timeout
Proposed fix
Minimum viable patch (highest priority):
In src/auto-reply/reply/agent-runner.ts: enforce that if a run becomes user-visible and ends with no payloads or empty text, synthesize a visible failure reply. Something like: ⚠️ The AI encountered an error and could not complete your request. Please try again.
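A rough sketch of what that guard could look like follows. The `ReplyPayload` shape, the `ensureVisibleReply` helper and the `runWasUserVisible` flag are assumptions made for illustration; the real payload type and the way agent-runner.ts tracks visibility may differ.

```typescript
// Hypothetical payload shape; the real type in agent-runner.ts may differ.
interface ReplyPayload {
  text?: string;
}

const FALLBACK_TEXT =
  "⚠️ The AI encountered an error and could not complete your request. Please try again.";

// Guarantees at least one payload with visible text for runs the user has already seen.
function ensureVisibleReply(
  payloads: ReplyPayload[],
  runWasUserVisible: boolean,
): ReplyPayload[] {
  const hasText = payloads.some((p) => (p.text ?? "").trim().length > 0);
  if (hasText || !runWasUserVisible) return payloads;
  // Append rather than replace, so any non-text payloads are not dropped.
  return [...payloads, { text: FALLBACK_TEXT }];
}
```

The guard would sit in front of the existing empty-payload exit, so a run that never became user-visible keeps its current behaviour.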
Second priority:
Add a watchdog: if tool progress has been emitted but no terminal visible reply arrives within N seconds, surface a status update.
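The watchdog could be sketched as a small state machine driven by explicit timestamps, which keeps it easy to test. `ProgressWatchdog`, its method names and the status text are all hypothetical; a real implementation would presumably call `check` from a timer and choose N per channel.

```typescript
// Hypothetical watchdog, driven by explicit timestamps instead of real timers.
class ProgressWatchdog {
  private thresholdMs: number;
  private lastProgressAt: number | undefined = undefined;
  private done = false;
  private warned = false;

  constructor(thresholdMs: number) {
    this.thresholdMs = thresholdMs;
  }

  // Call whenever tool progress becomes visible to the user.
  onToolProgress(nowMs: number): void {
    this.lastProgressAt = nowMs;
  }

  // Call once a terminal visible reply (or error message) has been sent.
  onTerminalReply(): void {
    this.done = true;
  }

  // Returns a status message at most once, when the run has been silent too long.
  check(nowMs: number): string | undefined {
    if (this.done || this.warned || this.lastProgressAt === undefined) return undefined;
    if (nowMs - this.lastProgressAt < this.thresholdMs) return undefined;
    this.warned = true;
    return "Still working on your request; no reply has been produced yet."; // placeholder wording
  }
}
```

With a 30-second threshold, for example, the silent gap after the 15:09:38Z tool result would have produced a status message instead of nothing.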
Third priority:
Emit a one-shot user-visible status during retries/fallback: ⏳ provider overloaded, retrying... so the user knows the system is still trying.
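A minimal sketch of the one-shot behaviour, assuming the retry loop can call a hook with the run ID. `RetryStatusNotifier`, `onRetry` and the `send` callback are hypothetical names, not part of run.ts or model-fallback.ts.

```typescript
// Hypothetical one-shot notifier: at most one status message per run.
class RetryStatusNotifier {
  private sentForRun = new Set<string>();
  private send: (text: string) => void;

  constructor(send: (text: string) => void) {
    this.send = send;
  }

  // Returns true if a status message was sent for this retry, false if suppressed.
  onRetry(runId: string, reason: string): boolean {
    if (this.sentForRun.has(runId)) return false; // already notified for this run
    this.sentForRun.add(runId);
    this.send(`⏳ ${reason}, retrying...`);
    return true;
  }
}
```

Keying on the run ID means repeated retries within one run stay quiet after the first notice, while a new run gets its own status message.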
Lower priority:
Add logging on every payloadArray.length === 0 exit and every blank final text emission on interactive channels.
OpenClaw version
Built from source, commit era 2026-03-13 to 2026-03-16.
Operating system
Ubuntu (Linux 6.17.0-19-generic x64)
Install method
Source (~/oclaw)
Model / provider
Reproduced on both:
anthropic/claude-sonnet-4-6
openai-codex/gpt-5.4
Impact and severity
Critical. This is not a cosmetic bug. A user can be locked out of their agent for 30+ minutes with no indication of what went wrong, no way to know if the system is retrying, and no error message. Subsequent messages may also queue behind the dead run. The experience is equivalent to the agent becoming a black hole.
Related issues
Additional notes
Investigation performed with OpenCode against live gateway logs and session transcripts. Diagnosis went through two iterations - the second confirmed the provider-agnostic nature of the bug via direct transcript evidence from GPT-5.4 sessions showing identical structural failures.