fix(gateway): stop chat timeout fallback cascade#87085
Conversation
|
Codex review: needs maintainer review before merge. Reviewed May 26, 2026, 10:48 PM ET / 02:48 UTC. Summary PR surface: Source +62, Tests +255. Total +317 across 13 files. Reproducibility: yes. The linked report has concrete WebChat logs showing every fallback candidate receiving the same timeout abort, and the patched source/tests now trace that caller abort signal through the fallback boundary; I did not run a live Control UI repro in this read-only review. Review metrics: 1 noteworthy metric.
Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Risk before merge
Maintainer options:
Next step before merge Security Review detailsBest possible solution: Land this scoped gateway/WebChat timeout fix if maintainers accept the fallback compatibility change, then keep the broader terminal-abort sources in #62682. Do we have a high-confidence way to reproduce the issue? Yes. The linked report has concrete WebChat logs showing every fallback candidate receiving the same timeout abort, and the patched source/tests now trace that caller abort signal through the fallback boundary; I did not run a live Control UI repro in this read-only review. Is this the best way to solve the issue? Yes for the scoped WebChat/gateway outage path. Tagging the timeout source and carrying the admitted run abort signal to the shared fallback boundary is narrower than changing provider timeout behavior, while broader abort sources remain separated for the related PR. AGENTS.md: found and applied where relevant. Codex review notes: model gpt-5.5, reasoning high; reviewed against 9119492f158a. Label changesLabel changes:
Label justifications:
Evidence reviewedPR surface: Source +62, Tests +255. Total +317 across 13 files. View PR surface stats
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
There was a problem hiding this comment.
Pull request overview
This PR fixes WebChat/Control UI timeout handling so a stale chat.send maintenance timeout is treated as a terminal abort and does not cascade through every model fallback candidate.
Changes:
- Tags chat timeout aborts with a
TimeoutErrorabort reason. - Threads the caller abort signal into model fallback and stops fallback on terminal timeout/client-disconnect aborts.
- Adds regression coverage and a changelog entry for the WebChat timeout cascade fix.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
src/gateway/chat-abort.ts |
Adds timeout-specific abort reason tagging for chat run aborts. |
src/gateway/chat-abort.test.ts |
Verifies timeout aborts carry a TimeoutError reason. |
src/agents/model-fallback.ts |
Detects terminal caller aborts and stops fallback immediately. |
src/agents/model-fallback.test.ts |
Covers terminal timeout abort behavior through runWithModelFallback. |
src/agents/agent-command.ts |
Passes the agent command abort signal into fallback orchestration. |
CHANGELOG.md |
Documents the Control UI/WebChat timeout cascade fix. |
|
ClawSweeper PR egg ✨ Hatched: 🥚 common Gilded Review Wisp Hatch commandComment Hatchability rules:
Rarity: 🥚 common. What is this egg doing here?
|
|
@clawsweeper re-review |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
|
@BunsDev — heads up on overlap with #62682 (you have it in Three additional terminal-abort sources #62682 covers that #87085 doesn't, in case useful for one combined landing:
Happy to coordinate landing order — if #87085 merges first, I'll rebase #62682 and trim the chat-side |
|
@simonusa thanks for the overlap map. Maintainer preference here is to land #87085 first as the narrow WebChat/chat-send timeout fix for #83962. The pieces I’d keep in #87085 are the chat timeout source tagging, the agent-command abort-signal plumbing, and the caller-signal terminal abort check at the model-fallback boundary. The three extra sources you called out are real, but I’d keep them in #62682 / #60388 rather than folding them into this P1 PR:
So if #87085 lands first, please rebase #62682 afterward and trim the duplicated |
|
Blocking path mismatch found in review. The core idea looks right, but this PR does not currently wire the new terminal-abort gate into the webchat path from #83962. Evidence:
Requested fix:
I would not merge this as-is: the design is sound, but the current diff misses the user-reported path. |
d4dd9bd to
f2a46a3
Compare
|
Verification after rewrite: Behavior addressed: gateway/chat fallback timeout aborts now propagate the admitted run abort signal through fallback orchestration, memory flush fallback, CLI/embedded candidates, and async gateway agent terminal handling. Provider/runtime timeouts without the gateway abort signal still surface as errors instead of successful aborts. Real environment tested: local macOS checkout plus Blacksmith Testbox changed gate. Exact steps or command run after this patch:
Evidence after fix:
Observed result after fix: timeout-aborted fallback attempts stop instead of cascading, wrapped timeout aborts preserve timeout status/stopReason, and provider TimeoutError rejections without the gateway abort signal remain errors. What was not tested: live provider timeout against a real remote provider. |
f2a46a3 to
8f42085
Compare
|
Verification after final rebase: Behavior addressed: gateway/chat fallback timeout aborts now propagate the admitted run abort signal through fallback orchestration, memory flush fallback, CLI/embedded candidates, and async gateway agent terminal handling. Provider/runtime timeouts without the gateway abort signal still surface as errors instead of successful aborts. Real environment tested: local macOS checkout plus Blacksmith Testbox changed gate. Exact steps or command run after this patch:
Evidence after fix:
Observed result after fix: timeout-aborted fallback attempts stop instead of cascading, wrapped timeout aborts preserve timeout status/stopReason, and provider TimeoutError rejections without the gateway abort signal remain errors. What was not tested: live provider timeout against a real remote provider. |
…alAbort (closes openclaw#60388) ClientDisconnectError on signal.reason) plus abortSignal threading through the model-fallback and chat-side callers. This change adds the coverage - ClientDisconnectError class + wiring in http-common.ts so the reason.name === "ClientDisconnectError" branch openclaw#87085 added is actually reachable (upstream watchClientDisconnect still aborts bare). - cron run-budget string reasons (prefix match) — cron timer aborts with a plain string, which the Error-only base check skips. - .cause-chain walking + isTerminalAbortFromError gated on the OPENCLAW_ABORTABLE_WRAPPER marker, for the embedded run-budget timer that aborts a private controller (not the caller signal). - compaction-path abortSignal forwarding (compactEmbeddedPiSessionDirect). - timedOutByRunBudget plumbing through attempt/failover-policy/assistant-failover so run-budget timeouts skip the fallback chain and wasted compaction.
…alAbort (closes openclaw#60388) PR openclaw#87085 landed the base isTerminalAbort(signal) check (TimeoutError / ClientDisconnectError on signal.reason) plus abortSignal threading through the model-fallback and chat-side callers. This change adds the coverage that PR openclaw#87085 did not include: - ClientDisconnectError class + wiring in http-common.ts so the reason.name === "ClientDisconnectError" branch PR openclaw#87085 added is actually reachable (upstream watchClientDisconnect still aborts bare). - cron run-budget string reasons (prefix match) — cron timer aborts with a plain string, which the Error-only base check skips. - .cause-chain walking + isTerminalAbortFromError gated on the OPENCLAW_ABORTABLE_WRAPPER marker, for the embedded run-budget timer that aborts a private controller (not the caller signal). - compaction-path abortSignal forwarding (compactEmbeddedPiSessionDirect). - timedOutByRunBudget plumbing through attempt/failover-policy/assistant-failover so run-budget timeouts skip the fallback chain and wasted compaction.
…alAbort (closes openclaw#60388) PR openclaw#87085 landed the base isTerminalAbort(signal) check (TimeoutError / ClientDisconnectError on signal.reason) plus abortSignal threading through the model-fallback and chat-side callers. This change adds the coverage that PR openclaw#87085 did not include: - ClientDisconnectError class + wiring in http-common.ts so the reason.name === "ClientDisconnectError" branch PR openclaw#87085 added is actually reachable (upstream watchClientDisconnect still aborts bare). - cron run-budget string reasons (prefix match) — cron timer aborts with a plain string, which the Error-only base check skips. - .cause-chain walking + isTerminalAbortFromError gated on the OPENCLAW_ABORTABLE_WRAPPER marker, for the embedded run-budget timer that aborts a private controller (not the caller signal). - compaction-path abortSignal forwarding (compactEmbeddedPiSessionDirect). - timedOutByRunBudget plumbing through attempt/failover-policy/assistant-failover so run-budget timeouts skip the fallback chain and wasted compaction.
…alAbort (closes openclaw#60388) PR openclaw#87085 landed the base isTerminalAbort(signal) check (TimeoutError / ClientDisconnectError on signal.reason) plus abortSignal threading through the model-fallback and chat-side callers. This change adds the coverage that PR openclaw#87085 did not include: - ClientDisconnectError class + wiring in http-common.ts so the reason.name === "ClientDisconnectError" branch PR openclaw#87085 added is actually reachable (upstream watchClientDisconnect still aborts bare). - cron run-budget string reasons (prefix match) — cron timer aborts with a plain string, which the Error-only base check skips. - .cause-chain walking + isTerminalAbortFromError gated on the OPENCLAW_ABORTABLE_WRAPPER marker, for the embedded run-budget timer that aborts a private controller (not the caller signal). - compaction-path abortSignal forwarding (compactEmbeddedPiSessionDirect). - timedOutByRunBudget plumbing through attempt/failover-policy/assistant-failover so run-budget timeouts skip the fallback chain and wasted compaction.
…alAbort (closes openclaw#60388) PR openclaw#87085 landed the base isTerminalAbort(signal) check (TimeoutError / ClientDisconnectError on signal.reason) plus abortSignal threading through the model-fallback and chat-side callers. This change adds the coverage that PR openclaw#87085 did not include: - ClientDisconnectError class + wiring in http-common.ts so the reason.name === "ClientDisconnectError" branch PR openclaw#87085 added is actually reachable (upstream watchClientDisconnect still aborts bare). - cron run-budget string reasons (prefix match) — cron timer aborts with a plain string, which the Error-only base check skips. - .cause-chain walking + isTerminalAbortFromError gated on the OPENCLAW_ABORTABLE_WRAPPER marker, for the embedded run-budget timer that aborts a private controller (not the caller signal). - compaction-path abortSignal forwarding (compactEmbeddedPiSessionDirect). - timedOutByRunBudget plumbing through attempt/failover-policy/assistant-failover so run-budget timeouts skip the fallback chain and wasted compaction.
Summary
chat.sendtimeout cleanup aborts with a terminalTimeoutErrorreason.Fixes #83962.
Refs #83963, #62682.
Verification
pnpm installafter rebasing onto currentorigin/mainto refresh missing post-rebase dependencyrastermill; no repo diff retained.node scripts/run-vitest.mjs src/agents/model-fallback.test.ts src/gateway/chat-abort.test.tsnode scripts/run-vitest.mjs src/agents/model-fallback.run-embedded.e2e.test.ts src/gateway/server-methods/agent.test.ts src/gateway/server.chat.gateway-server-chat.test.tsgit diff --check origin/main..HEAD.agents/skills/autoreview/scripts/autoreview --mode branch --base origin/mainReal behavior proof
Behavior addressed: A stale Control UI/WebChat
chat.sendmaintenance timeout now becomes a terminalTimeoutErrorat the shared chat abort controller, and model fallback stops after the active candidate observes that terminal caller abort instead of cascading the same expired signal through every fallback candidate.Real environment tested: Local macOS Codex worktree, branch
meow/control-ui-webchat-timeout-fallback, commitc78e976b12d5, rebased on currentorigin/main6ef0cbb94f38.Exact steps or command run after this patch: Ran the focused unit tests for
chat-abortandmodel-fallback, then ran the embedded fallback and gateway chat/agent server-method tests listed above.Evidence after fix: Focused suite passed 5 files / 160 tests; broader suite passed 3 files / 275 tests; branch autoreview reported no accepted/actionable findings.
Observed result after fix: Timeout-derived chat aborts carry
signal.reason.name === "TimeoutError", preserveabortStopReason === "timeout", and terminal caller aborts stop fallback after one failed candidate while ordinary provider failover remains covered by existing tests.What was not tested: A live browser Control UI startup delay against real provider fallback was not run; this PR uses gateway/fallback boundary regressions for the shipped failure mode.