[Bug]: Agent goes completely silent for 30+ minutes after any provider error — no reply, no error message, no fallback (reproduced on Anthropic AND OpenAI) #48361

@rbutera

Bug type

Regression - critical correctness failure, not a quality issue

Summary

OpenClaw can enter a state where a user message is accepted and tool progress is visible, then the session goes completely silent (no reply, no error message, no fallback) for 30 minutes or more.

This was reproduced on two separate model providers (anthropic/claude-sonnet-4-6 and openai-codex/gpt-5.4), which means this is not a provider-specific overload problem. It is a systemic silent-finalization bug in the embedded runner / reply pipeline.

From the user side it looks like this:

  • you send a message
  • the agent does one or two tool calls
  • then nothing. no typing indicator, no error, no reply
  • you wait. and wait. and wait
  • 30 minutes later, nothing has changed

Evidence

Incident timeline (2026-03-16, Europe/London)

| Time | Event |
|---|---|
| 15:09:26Z | User sends message. Session starts on anthropic/claude-sonnet-4-6. |
| 15:09:29Z | Assistant ends with stopReason: "error", overloaded_error. Empty visible content. |
| 15:09:36Z | Assistant emits memory_search tool call - session looks alive. |
| 15:09:38Z | Tool result returns successfully. |
| 15:09:42Z | Assistant ends again with overloaded_error. Still no visible text. |
| 15:13:55Z | User sends a follow-up message. exec tool calls fire. |
| 15:13:57Z–15:14:01Z | Tool results return successfully. |
| 15:14:04Z | Assistant ends again with overloaded_error. Still no reply to user. |
| 15:38:48Z | User manually switches model to openai-codex/gpt-5.4. |
| 15:39:06Z | First visible reply finally appears - 30 minutes after the original message. |

Gateway log evidence

2026-03-16T15:09:29Z embedded run agent end: runId=605aeb28 isError=true model=claude-sonnet-4-6 provider=anthropic error=overloaded_error
2026-03-16T15:13:48Z embedded run agent end: runId=c26a034d isError=true model=claude-sonnet-4-6 provider=anthropic error=overloaded_error

Gateway/ws timings throughout were healthy (~60–170ms). No gateway restart during the incident window.

Provider-agnostic confirmation (diagnosis 2)

Further session transcript inspection under ~/.openclaw/agents/navi/sessions/ confirmed that gpt-5.4 (OpenAI/Codex) exhibits structurally identical failures:

  • Session d31236e3: gpt-5.4 emits a tool call, tool returns successfully, then assistant records stopReason: "error" with server_error and empty content. No user-visible reply.
  • Session 1ab38720: gpt-5.4 completes real work, final assistant message has stopReason: "stop" but text: "". Empty visible reply on a successful run.

This means the root cause is not Anthropic overload. Anthropic overload is one trigger. OpenAI server errors are another. Both surface the same broken finalization path.

Root Cause (diagnosis 2 conclusion)

The embedded runner / reply bridge lacks a hard invariant that every user-visible run must terminate in a user-visible reply. Specifically:

  1. In src/auto-reply/reply/agent-runner.ts: if payloadArray.length === 0 the code returns finalizeWithFollowup(undefined, ...) - producing no message at all.
  2. When tool calls have already been emitted, the user perceives the session as live. When the follow-up assistant turn fails and the reply bridge silently returns nothing, the user gets silence with no indication of what happened.
  3. Retry/fallback logic (in src/agents/pi-embedded-runner/run.ts and src/agents/model-fallback.ts) runs silently in the background with no user-facing status during retries.
  4. Per-session Discord lane serialization (src/plugin-sdk/keyed-async-queue.ts), combined with a 30-minute worker timeout (extensions/discord/src/monitor/timeouts.ts), means subsequent user messages in the same session also queue behind the dead run for up to 30 minutes.
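
To illustrate point 4, here is a minimal sketch of how a per-key serial queue produces head-of-line blocking. It is not the code in keyed-async-queue.ts; createKeyedSerialQueue and enqueue are hypothetical names, and the real implementation may differ.

```typescript
// Hypothetical per-key serial queue: each key owns one promise chain,
// so a job only starts after the previous job for the same key settles.
function createKeyedSerialQueue() {
  const tails = new Map<string, Promise<void>>();
  return {
    enqueue<T>(key: string, job: () => Promise<T>): Promise<T> {
      const prev = tails.get(key) ?? Promise.resolve();
      const result = prev.then(() => job());
      // The stored tail swallows the outcome so one failed job does not
      // reject every later job queued under the same key.
      tails.set(key, result.then(() => undefined, () => undefined));
      return result;
    },
  };
}
```

With this shape, a job that never settles holds up every later job queued under the same session key, while other keys keep running. That matches the symptom above: follow-up messages in the same Discord session wait behind the stuck run until something (here, presumably the worker timeout) ends it.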

Steps to reproduce

  1. Use a Discord session for any agent on any provider.
  2. Prompt a workflow that triggers at least one tool call.
  3. Force or wait for a provider error (overloaded_error, server_error, or similar) on the assistant turn after a tool result returns.
  4. Observe: tool results are visible, but no assistant reply follows. Ever.

Alternatively: complete a real multi-step task and observe whether the final assistant text is empty.

Expected behaviour

If a user-visible run starts (tool calls visible, typing indicator shown, or assistant acknowledgement sent), OpenClaw must produce one of:

  • a visible final reply,
  • a visible error/blocked message,
  • an explicit NO_REPLY that is semantically correct for that channel/task.
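
One way to state this expectation in code is a closed set of terminal outcomes plus a check that rejects "nothing at all". This is a sketch only; TerminalOutcome and isValidTerminalOutcome are hypothetical names that do not come from the OpenClaw codebase.

```typescript
// Hypothetical types for illustration.
type TerminalOutcome =
  | { kind: "reply"; text: string }
  | { kind: "error"; text: string }
  | { kind: "no_reply"; reason: string };

// True only when the run ended with something meaningful: non-blank text
// for replies and errors, or a stated reason for an explicit NO_REPLY.
function isValidTerminalOutcome(outcome: TerminalOutcome | undefined): boolean {
  if (outcome === undefined) return false; // silent finalization
  switch (outcome.kind) {
    case "reply":
    case "error":
      return outcome.text.trim().length > 0;
    case "no_reply":
      return outcome.reason.trim().length > 0;
  }
}
```

Under this model, both failure shapes from the transcripts (no payload at all, and a "stop" with text: "") would be rejected.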

Silent finalization for interactive channels like Discord should be treated as a bug, not a valid outcome.

Actual behaviour

The run silently terminates. No message is sent. The session lane may remain effectively blocked for up to 30 minutes (Discord worker timeout ceiling). The user has no idea what happened.

Implicated files

  • src/auto-reply/reply/agent-runner.ts - empty-payload exit path
  • src/auto-reply/reply/agent-runner-execution.ts - tool/typing bridge
  • src/auto-reply/reply/reply-dispatcher.ts - final dispatch; does not synthesize failure reply
  • src/agents/pi-embedded-runner/run.ts - retry/fallback orchestration; no user status during retries
  • src/agents/pi-embedded-runner/run/payloads.ts - error payload construction
  • src/agents/pi-embedded-subscribe.handlers.lifecycle.ts - lifecycle end logging
  • extensions/discord/src/monitor/inbound-job.ts - per-session queue key
  • src/plugin-sdk/keyed-async-queue.ts - serial lane per session
  • extensions/discord/src/monitor/timeouts.ts - 30-minute worker timeout

Proposed fix

Minimum viable patch (highest priority):

In src/auto-reply/reply/agent-runner.ts: if a run becomes user-visible and ends with no payloads or empty text, synthesize a visible failure reply. Something like: ⚠️ The AI encountered an error and could not complete your request. Please try again.
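
A rough sketch of what that guard could look like. ReplyPayload, ensureVisibleReply, and the runWasVisible flag are hypothetical; the actual payload type in agent-runner.ts may carry more than text (media, for example), which this sketch ignores.

```typescript
// Hypothetical payload shape; assumes only text counts as "visible".
interface ReplyPayload {
  text?: string;
}

const FALLBACK_TEXT =
  "⚠️ The AI encountered an error and could not complete your request. Please try again.";

// If the run was visible to the user and produced nothing readable,
// substitute a single fallback payload instead of returning nothing.
function ensureVisibleReply(payloads: ReplyPayload[], runWasVisible: boolean): ReplyPayload[] {
  const hasText = payloads.some((p) => (p.text ?? "").trim().length > 0);
  if (hasText || !runWasVisible) return payloads;
  return [{ text: FALLBACK_TEXT }];
}
```

Runs that were never user-visible pass through unchanged, so a legitimate silent outcome on a non-interactive path is not turned into an error message.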

Second priority:

Add a watchdog: if tool progress has been emitted but no terminal visible reply arrives within N seconds, surface a status update.
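
A minimal sketch of such a watchdog, assuming the runner can call a hook when tool progress is emitted and another when a terminal reply is sent. createReplyWatchdog, both hook names, and the injectable Timers interface are hypothetical; the timers are injectable only so the logic can be exercised without real delays.

```typescript
// Hypothetical timer abstraction; defaults to the global setTimeout/clearTimeout.
type TimerHandle = unknown;
interface Timers {
  set(fn: () => void, ms: number): TimerHandle;
  clear(handle: TimerHandle): void;
}

const realTimers: Timers = {
  set: (fn, ms) => setTimeout(fn, ms),
  clear: (h) => clearTimeout(h as ReturnType<typeof setTimeout>),
};

function createReplyWatchdog(timeoutMs: number, onStall: () => void, timers: Timers = realTimers) {
  let handle: TimerHandle | undefined;
  let fired = false;

  const cancel = (): void => {
    if (handle !== undefined) {
      timers.clear(handle);
      handle = undefined;
    }
  };

  return {
    // Call whenever tool progress becomes visible; restarts the countdown.
    noteToolProgress(): void {
      if (fired) return; // surface at most one stall notice per run
      cancel();
      handle = timers.set(() => {
        fired = true;
        handle = undefined;
        onStall();
      }, timeoutMs);
    },
    // Call once a terminal visible reply (or error message) has been sent.
    noteTerminalReply(): void {
      cancel();
    },
  };
}
```

The onStall callback is where a status message would be posted to the channel. The value of N (timeoutMs here) is left open, as in the proposal.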

Third priority:

Emit a one-shot user-visible status during retries/fallback: ⏳ provider overloaded, retrying... so the user knows the system is still trying.
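
A sketch of the one-shot behaviour, assuming a retry loop that can call a notifier between attempts. Both function names are hypothetical, and the loop is synchronous for brevity; the real retry/fallback code in run.ts and model-fallback.ts is presumably asynchronous and structured differently.

```typescript
// Hypothetical helper: wraps a "send status" function so that it fires
// at most once, no matter how many retries happen afterwards.
function createOneShotStatus(send: (text: string) => void): (text: string) => boolean {
  let sent = false;
  return (text) => {
    if (sent) return false;
    sent = true;
    send(text);
    return true;
  };
}

// Hypothetical retry loop: notifies before each retry, rethrows the last
// error once attempts are exhausted.
function runWithRetries<T>(attempt: () => T, maxAttempts: number, notify: (text: string) => boolean): T {
  let lastError: unknown;
  for (let i = 0; i < maxAttempts; i++) {
    try {
      return attempt();
    } catch (err) {
      lastError = err;
      if (i < maxAttempts - 1) notify("⏳ provider overloaded, retrying...");
    }
  }
  throw lastError;
}
```

Because the notifier is one-shot, three failed attempts still produce a single status message rather than spamming the channel.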

Lower priority:

Add logging on every payloadArray.length === 0 exit and every blank final text emission on interactive channels.
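
For illustration, a small helper that decides whether a finalization deserves such a log line. FinalizationInfo, INTERACTIVE_CHANNELS, and describeSilentExit are hypothetical, and this sketch assumes the "interactive channels" condition applies to both cases.

```typescript
// Hypothetical shape; field names are illustrative, not taken from OpenClaw.
interface FinalizationInfo {
  runId: string;
  channel: string;
  payloadCount: number;
  finalText: string;
}

// Assumption: the set of interactive channels would come from configuration.
const INTERACTIVE_CHANNELS = new Set<string>(["discord"]);

// Returns a log line when a run on an interactive channel ends with nothing
// visible, or null when there is nothing to report.
function describeSilentExit(info: FinalizationInfo): string | null {
  if (!INTERACTIVE_CHANNELS.has(info.channel)) return null;
  const reason =
    info.payloadCount === 0
      ? "empty_payloads"
      : info.finalText.trim() === ""
        ? "blank_final_text"
        : null;
  if (reason === null) return null;
  return `silent finalization: runId=${info.runId} channel=${info.channel} reason=${reason}`;
}
```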

OpenClaw version

Built from source, from commits dated between 2026-03-13 and 2026-03-16.

Operating system

Ubuntu (Linux 6.17.0-19-generic x64)

Install method

Source (~/oclaw)

Model / provider

Reproduced on both:

  • anthropic/claude-sonnet-4-6
  • openai-codex/gpt-5.4

Impact and severity

Critical. This is not a cosmetic bug. A user can be locked out of their agent for 30+ minutes with no indication of what went wrong, no way to know if the system is retrying, and no error message. Subsequent messages may also queue behind the dead run. The experience is equivalent to the agent becoming a black hole.

Additional notes

Investigation performed with OpenCode against live gateway logs and session transcripts. Diagnosis went through two iterations - the second confirmed the provider-agnostic nature of the bug via direct transcript evidence from GPT-5.4 sessions showing identical structural failures.
