Skip to content

fix(gateway): broadcast idle timeout errors to clients after agent run started#85176

Open
JulyanXu wants to merge 1 commit into
openclaw:mainfrom
JulyanXu:fix/chat-broadcast-error-on-idle-timeout
Open

fix(gateway): broadcast idle timeout errors to clients after agent run started#85176
JulyanXu wants to merge 1 commit into
openclaw:mainfrom
JulyanXu:fix/chat-broadcast-error-on-idle-timeout

Conversation

@JulyanXu

Copy link
Copy Markdown
Contributor

Summary

When an LLM idle timeout occurs after the agent has started (e.g., after tool calls), the error is returned as a normal { text, isError: true } payload via the deliver callback — it is NOT thrown. The .then() handler at line 2882 only called emitUserTranscriptUpdate() in the agentRunStarted=true branch, never checking deliveredReplies for error payloads and never calling broadcastChatError. Connected clients received no error feedback — the response silently stopped.

This fix checks deliveredReplies for payloads with isError: true when agentRunStarted=true and !hasBeforeAgentRunGate. If error payloads are found, broadcastChatError is called to notify clients. If not, the existing emitUserTranscriptUpdate() path is preserved.

Closes #84945

Verification

  • Single file change: src/gateway/server-methods/chat.ts (20 lines added, 5 removed)
  • broadcastChatError is already used in the .catch() handler at line 2938 with the same signature
  • deliveredReplies is already filtered for isError in the !agentRunStarted branch (btwReplies)
  • Normal agent completions (no error payloads) continue through emitUserTranscriptUpdate() as before

Real behavior proof

  • Behavior addressed: LLM idle timeout after agent start silently drops — clients receive no error event
  • Real environment tested: Source-level verification against current main
  • Exact steps: Start agent session, agent makes tool calls, subsequent LLM call times out (120s idle) → after fix, client receives state: "error" chat event with the timeout message
  • Evidence after fix: broadcastChatError fires for timeout errors in the agent-started path
  • What was not tested: Live gateway with idle timeout triggered

…n started

When an LLM idle timeout occurs after the agent has started (e.g.,
after tool calls), the error is returned as a normal payload with
isError:true rather than thrown. The .then() handler only called
emitUserTranscriptUpdate() in the agent-started branch, never
checking deliveredReplies for error payloads and never calling
broadcastChatError. Connected clients received no error feedback.

Now checks deliveredReplies for isError payloads when
agentRunStarted=true and calls broadcastChatError when found.

Closes openclaw#84945
@openclaw-barnacle openclaw-barnacle Bot added app: web-ui App: web-ui gateway Gateway runtime size: XS triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 22, 2026
@clawsweeper

clawsweeper Bot commented May 22, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge.

Workflow note: Future ClawSweeper reviews update this same comment in place.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Summary
The branch changes chat.send post-dispatch handling to broadcast a chat error when an agent-started run returns delivered replies with isError: true.

Reproducibility: yes. for source-level reproduction: make dispatchInboundMessage call onAgentRunStart and then deliver a final isError payload; current main reaches only the transcript-update branch. I did not run a live gateway idle-timeout scenario in this read-only review.

PR rating
Overall: 🧂 unranked krab
Proof: 🧂 unranked krab
Patch quality: 🦐 gold shrimp
Summary: The PR is a useful narrow fix attempt, but missing real behavior proof and the incorrect terminal dedupe state make it not merge-ready yet.

Rank-up moves:

  • Update the error-payload path to cache ok: false/status: "error" with the timeout summary.
  • Add a focused gateway regression test for an agent-started returned isError payload and dedupe status.
  • Add redacted live output, logs, a terminal screenshot, or a linked artifact showing the after-fix gateway error event in a real run.
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

Real behavior proof
Needs stronger real behavior proof before merge: The PR body only describes source-level verification and says no live gateway idle-timeout run was tested, so external-contributor real behavior proof is still needed. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Risk before merge

  • The new error broadcast still leaves chat:<runId> cached as ok: true/status: "ok", so duplicate chat.send calls or agent.wait can observe success for the same timed-out run.
  • The PR body describes source-level verification and explicitly says a live gateway idle-timeout scenario was not tested, so the external real-behavior proof gate is not met.

Maintainer options:

  1. Fix terminal chat dedupe (recommended)
    Update the error-payload path so the same run records ok: false and status: "error" with an error summary before merge.
  2. Accept broadcast-only behavior
    Maintainers could intentionally merge the visible error event first, but cached retry and wait observers may still see an incorrect successful terminal state until a follow-up lands.

Next step before merge
The PR needs contributor real-behavior proof and a dedupe-state fix before normal maintainer review can proceed; ClawSweeper should not open repair markers while the external proof gate is unmet.

Security
Cleared: The diff only changes gateway chat error-event handling and does not touch secrets, dependencies, workflows, install scripts, or supply-chain surfaces.

Review findings

  • [P2] Record the returned error in terminal dedupe — src/gateway/server-methods/chat.ts:2890-2895
Review details

Best possible solution:

Merge a fix that broadcasts the agent-started returned error payload and records the matching terminal chat dedupe state as an error, with focused gateway coverage and redacted real runtime proof.

Do we have a high-confidence way to reproduce the issue?

Yes for source-level reproduction: make dispatchInboundMessage call onAgentRunStart and then deliver a final isError payload; current main reaches only the transcript-update branch. I did not run a live gateway idle-timeout scenario in this read-only review.

Is this the best way to solve the issue?

No; the direction is right, but the current patch should also write the terminal chat dedupe entry as an error and add focused coverage. Source-only proof is not enough for this external PR before merge.

Label changes:

  • add P1: The linked bug affects a real gateway/ACP flow where users lose terminal timeout feedback from a running agent session.
  • add merge-risk: 🚨 message-delivery: The PR changes chat error delivery but can still cache the failed run as successful for duplicate sends or waiters.
  • add rating: 🧂 unranked krab: Current PR rating is 🧂 unranked krab because proof is 🧂 unranked krab, patch quality is 🦐 gold shrimp, and The PR is a useful narrow fix attempt, but missing real behavior proof and the incorrect terminal dedupe state make it not merge-ready yet.
  • add status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs stronger real behavior proof before merge: The PR body only describes source-level verification and says no live gateway idle-timeout run was tested, so external-contributor real behavior proof is still needed. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Label justifications:

  • P1: The linked bug affects a real gateway/ACP flow where users lose terminal timeout feedback from a running agent session.
  • merge-risk: 🚨 message-delivery: The PR changes chat error delivery but can still cache the failed run as successful for duplicate sends or waiters.
  • rating: 🧂 unranked krab: Current PR rating is 🧂 unranked krab because proof is 🧂 unranked krab, patch quality is 🦐 gold shrimp, and The PR is a useful narrow fix attempt, but missing real behavior proof and the incorrect terminal dedupe state make it not merge-ready yet.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs stronger real behavior proof before merge: The PR body only describes source-level verification and says no live gateway idle-timeout run was tested, so external-contributor real behavior proof is still needed. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Full review comments:

  • [P2] Record the returned error in terminal dedupe — src/gateway/server-methods/chat.ts:2890-2895
    This branch broadcasts state: "error" for returned error payloads, but execution then falls through to the unchanged dedupe write, which records ok: true and status: "ok" for the same run. Duplicate chat.send calls and agent.wait read chat:<runId> from this map, so a timed-out run can still be reported as successful after the error event. Please set the dedupe entry to an error state with the timeout summary when errorPayloads.length > 0.
    Confidence: 0.91

Overall correctness: patch is incorrect
Overall confidence: 0.91

Acceptance criteria:

  • node scripts/run-vitest.mjs src/gateway/server-methods/chat.directive-tags.test.ts
  • git diff --check

What I checked:

  • Current main skips returned errors after agent start: On current main, deliveredReplies are collected, but once agentRunStarted is true the post-dispatch branch only calls emitUserTranscriptUpdate() and does not inspect isError payloads. (src/gateway/server-methods/chat.ts:2433, c8a35c4645dc)
  • Timeout path returns an error payload: The embedded runner turns a prompt idle timeout into a returned payload with text and isError: true, matching the linked issue's non-throwing flow. (src/agents/pi-embedded-runner/run.ts:2693, c8a35c4645dc)
  • PR adds a broadcast without changing terminal dedupe: The PR diff adds the broadcastChatError branch, then falls through to the unchanged terminal setGatewayDedupeEntry block that writes ok: true and status: "ok". (src/gateway/server-methods/chat.ts:2890, 882f7f0cf557)
  • Cached chat sends reuse dedupe truth: Duplicate chat.send calls return the cached dedupe entry's ok, payload, and error, so an error run cached as ok can later be reported as success. (src/gateway/server-methods/chat.ts:2148, c8a35c4645dc)
  • Agent wait reads terminal status from the same dedupe map: readTerminalSnapshotFromDedupeEntry treats status: "error" or !entry.ok as the error signal; status: "ok" with ok: true is not an error snapshot. (src/gateway/server-methods/agent-wait-dedupe.ts:134, c8a35c4645dc)
  • Related issue review already called out dedupe consistency: The linked issue comments include a source-level ClawSweeper review and a follow-up note that the dedupe entry should probably be updated for this error case.

Likely related people:

  • medns: Current blame for the chat.send post-dispatch branch and terminal dedupe write points to commit 01d95b9, whose GitHub commit author is medns. (role: recent area contributor; confidence: high; commits: 01d95b9757a0; files: src/gateway/server-methods/chat.ts, src/gateway/server-methods/chat.directive-tags.test.ts)
  • steipete: git log -S agentRunStarted surfaced the earlier inbound dispatch unification commit as part of the chat.send lifecycle history. (role: feature-history contributor; confidence: medium; commits: 2e0a835e0750; files: src/gateway/server-methods/chat.ts)
  • dutifulbob: git log -S setGatewayDedupeEntry surfaced prior gateway work that bridged chat.send run IDs into agent.wait terminal dedupe behavior. (role: adjacent dedupe contributor; confidence: medium; commits: 61f7cea48bd7; files: src/gateway/server-methods/agent-wait-dedupe.ts, src/gateway/server-methods/chat.ts)
  • bittoby: Gateway chat delivery history includes prior work on webchat final/media delivery adjacent to the deliveredReplies final-event path. (role: adjacent delivery contributor; confidence: low; commits: c50d7183d64a; files: src/gateway/server-methods/chat.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against c8a35c4645dc.

@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P1 High-priority user-facing bug, regression, or broken workflow. merge-risk: 🚨 message-delivery 🚨 May drop, duplicate, misroute, suppress, or wrongly target messages. labels May 22, 2026
@clawsweeper

clawsweeper Bot commented May 22, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper PR egg

🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.

Where did the egg go?
  • The egg game starts only after the PR passes the real-behavior proof check.
  • Before that, no creature or rarity is rolled. The treat waits for real proof.
  • This is still just collectible flavor: proof affects review readiness, not creature quality.

@BingqingLyu

This comment was marked as spam.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

app: web-ui App: web-ui gateway Gateway runtime merge-risk: 🚨 message-delivery 🚨 May drop, duplicate, misroute, suppress, or wrongly target messages. P1 High-priority user-facing bug, regression, or broken workflow. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. size: XS status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

LLM idle timeout error silently dropped when agentRunStarted is true

2 participants