fix(agents): distinguish terminal aborts from retryable failures (#60388) by simonusa · Pull Request #62682 · openclaw/openclaw

simonusa · 2026-04-07T19:49:04Z

Addresses #60388 (complementary to #52365, see "Relationship to PR #52365" below)

Today the fallback layer cannot tell the difference between two very different aborts:

"This model failed, try another" -> fallback should retry
"The whole run is over" -> fallback should stop immediately

Two situations where the run is over and retrying with another model wastes resources:

Run-budget timeout (Don't trigger model fallback when abort reason is the run's own timeout budget #60388): The embedded runner's scheduleAbortTimer fires runAbortController.abort(makeTimeoutAbortReason()) after the configured agents.defaults.timeoutSeconds. The budget is exhausted -- giving the next candidate ~0 ms remaining is guaranteed to fail and wastes API calls. The user issue description (Don't trigger model fallback when abort reason is the run's own timeout budget #60388) says "On a fleet of ~100 cron jobs: 30+ model-fallback events per day, almost all triggered by run timeouts."
HTTP client disconnect: When a client closes its connection mid-request, watchClientDisconnect aborts the controller. No caller is left to receive a response, so any tokens spent on a fallback model are wasted.

Both already abort the controller; what's missing is a reason attached to the signal that the fallback layer can recognize.

Summary

Problem: Run-budget timeout aborts (and HTTP client disconnects) currently flow through the model fallback chain as if they were retryable provider failures, wasting API calls and lane occupancy.
Why it matters: Each unwanted fallback attempt holds a session lane and burns tokens for no benefit. On busy systems (Don't trigger model fallback when abort reason is the run's own timeout budget #60388 reports 30+ events/day from cron timeouts alone) this is meaningful overhead.
What changed: The fallback layer now checks AbortSignal.reason for two terminal markers (name === "TimeoutError" for run-budget, name === "ClientDisconnectError" for client disconnect) and short-circuits the chain when found.
What did NOT change (scope boundary): No changes to how aborts are triggered, no changes to per-provider timeout/retry semantics, no changes to AbortController lifecycles. Issues Cron job timeout aborts entire model fallback chain via shared AbortController #37505 (cron AbortController sharing) and [bug] Fallback chain aborted by premature primary restore when cooldown expires mid-flight #58578 (mid-flight primary restore) are different abort sources upstream of the fallback layer and remain unchanged.

Change Type

Bug fix

Scope

Gateway / orchestration

Linked Issue

Closes Don't trigger model fallback when abort reason is the run's own timeout budget #60388
This PR fixes a bug or regression

Root Cause

Root cause: shouldRethrowAbort() in model-fallback.ts checks isFallbackAbortError(err) && !isTimeoutError(err) -- which means timeout errors are intentionally not rethrown, so the fallback chain runs them. This is correct for per-provider timeouts (provider is slow, try another), but wrong for run-budget timeouts (whole run is out of time, no point retrying). The two cannot be told apart from the error alone.
Missing detection / guardrail: No check for why the abort happened. The signal.reason carrying TimeoutError (set by pi-embedded-runner/run/attempt.ts:1382-1386 via makeTimeoutAbortReason) was never inspected.
Contributing context: The HTTP client disconnect path added in fix(gateway): propagate AbortController signal on HTTP client disconnect #54388 has the same shape -- it tags the abort with no reason today, so a downstream client disconnect also flows through fallback retries.

Regression Test Plan

Coverage level: [x] Unit test
Target test or file: src/agents/model-fallback.test.ts
Scenario the tests lock in: Six new tests under describe("terminal abort propagation (closes #60388)"):
- signal.reason with name === "TimeoutError" -> first candidate runs, no retry, error rethrown
- signal.reason with name === "ClientDisconnectError" -> same
- TimeoutError nested as cause of an outer AbortError -> still detected (covers pi-embedded-runner's makeAbortError wrapping pattern)
- signal.reason with a generic error -> fallback runs normally (non-terminal)
- No abortSignal passed -> fallback runs normally (back-compat for existing callers)
- abortSignal provided but not aborted -> fallback runs normally (live-signal back-compat)
Why this is the smallest reliable guardrail: The tests construct the abort signal directly and assert on run.mock.calls.length === 1 to verify the chain stopped. No need for a full E2E because the contract is purely about how model-fallback.ts interprets AbortSignal.reason.
Existing test that already covers this: None.
If no new test is added, why not: 6 new tests added.

User-visible / Behavior Changes

When a run-budget timeout fires (the agent run exceeds agents.defaults.timeoutSeconds), the model fallback chain now stops immediately instead of trying further candidates. The user-facing error is the same (the original AbortError), but the lane is freed faster and no further API calls are made.

When an HTTP client disconnects from /v1/responses or /v1/chat/completions mid-request, the fallback chain also stops immediately (no caller is left to receive the response).

Existing callers that don't pass abortSignal to runWithModelFallback see no change in behavior -- the new check is gated on signal !== undefined && signal.aborted.

Diagram

Before:
[run-budget timer fires]
  -> runAbortController.abort(TimeoutError)
  -> agent attempt throws AbortError
  -> shouldRethrowAbort(err) returns false (because isTimeoutError(err) is true)
  -> runFallbackCandidate returns { ok: false } -> tries next candidate
  -> next candidate also times out (~0ms budget left) -> tries next ...
  -> wasted API calls

After:
[run-budget timer fires]
  -> runAbortController.abort(TimeoutError)  // unchanged
  -> agent attempt throws AbortError
  -> isTerminalAbort(signal) returns true (signal.reason.name === "TimeoutError")
  -> shouldRethrowAbort(err, signal) returns true
  -> error rethrown immediately, no further candidates tried

Security Impact

New permissions/capabilities? No
Secrets/tokens handling changed? No
New/changed network calls? No (this REDUCES network calls in the timeout path)
Command/tool execution surface changed? No
Data access scope changed? No

Repro + Verification

Environment

OS: macOS (Docker linux/amd64)
Runtime/container: Node 22 / OpenClaw built from this branch
Test runner: Vitest

Steps

Configure an agent with a small agents.defaults.timeoutSeconds (e.g. 5s) and one or more model fallbacks
Send a request that takes longer than the timeout
Observe the fallback chain behavior

Expected (after fix)

Primary candidate runs, hits the timeout, fallback chain stops, single error returned
Logs show no [model-fallback] candidate_failed entries for fallback candidates beyond the first

Actual (before fix)

Primary candidate runs, hits the timeout
Fallback layer tries the next candidate with ~0ms budget remaining
Next candidate also times out, fallback layer tries the next, and so on
2-3x the API calls before the chain exhausts and returns an error

Evidence

$ npx vitest run src/agents/model-fallback.test.ts

Test Files  1 passed (1)
     Tests  70 passed (70)

The new test cases:

rethrows immediately when signal.reason has name=TimeoutError (run-budget timeout)
rethrows immediately when signal.reason has name=ClientDisconnectError
detects TimeoutError nested as cause of an outer Error
falls back normally when signal is aborted with a non-terminal reason
falls back normally when no abortSignal is passed (back-compat)
falls back normally when signal is provided but not aborted

Human Verification

Verified scenarios:
- All 70 tests in model-fallback.test.ts pass after the change
- All 20 tests in model-fallback.probe.test.ts pass
- All 8 tests in agent-command.live-model-switch.test.ts pass
- End-to-end smoke: container rebuilt from this branch, HTTP client disconnect on /v1/responses produces the expected [openresponses] client disconnected, aborting streaming run runId=... log line and the agent run terminates within ~1s (the upstream watchClientDisconnect plumbing already in aad3bbedd works correctly with the new ClientDisconnectError reason tag)
Edge cases checked: cause-chain walking (one level deep) for TimeoutError/ClientDisconnectError wrapped inside an outer AbortError -- this matches pi-embedded-runner/run/attempt.ts:1387's makeAbortError pattern
What I did not verify:
- Direct E2E of the fallback-skip path with a real cron timeout firing (would require setting timeoutSeconds to a very small value; the unit tests cover the contract)
- The image-model fallback variant (runWithImageModelFallback) -- it gets the new abortSignal? parameter for consistency but no callers currently pass a signal

Compatibility / Migration

Backward compatible? Yes -- the new abortSignal? parameter is optional. Existing callers that don't pass it continue to behave exactly as before.
Config/env changes? No
Migration needed? No

Risks and Mitigations

Risk: A future caller passes an abort signal whose signal.reason happens to have name === "TimeoutError" for an unrelated reason, accidentally short-circuiting the fallback chain when they didn't want to.
- Mitigation: The check is intentionally narrow -- only the exact name strings TimeoutError and ClientDisconnectError match. Both names are already reserved for run-budget timeout (set by makeTimeoutAbortReason in pi-embedded-runner/run/attempt.ts) and HTTP client disconnect (set by watchClientDisconnect in gateway/http-common.ts after this PR).
Risk: TimeoutError from a per-provider request timeout (not run-budget) gets mistaken for a run-budget timeout and skips fallback when it shouldn't.
- Mitigation: Per-provider timeouts come from the LLM SDK's internal fetch timeout, which throws an error directly -- they don't propagate through runAbortController.abort(). Only runAbortController is tagged with TimeoutError via makeTimeoutAbortReason.

Out of scope (for separate PRs)

Issue Cron job timeout aborts entire model fallback chain via shared AbortController #37505 (cron timeout aborts entire fallback chain via shared AbortController) -- different abort source (cron service executeJobCoreWithTimeout), needs per-attempt AbortController isolation. PR fix(cron): per-attempt AbortControllers and deferred execution timeout #42482 already addresses this.
Issue [bug] Fallback chain aborted by premature primary restore when cooldown expires mid-flight #58578 (fallback chain aborted by premature primary restore mid-flight) -- different abort source (requestLiveSessionModelSwitch), needs coordination state between live-model-switch and fallback layers

Relationship to PR #52365

PR #52365 (fix(cron): stop fallback attempts when cron budget is exhausted) addresses the same underlying problem from #60388 but with a different, complementary mechanism:

fix(cron): stop fallback attempts when cron budget is exhausted #52365 is proactive: a new beforeAttempt hook in runWithModelFallback that lets the cron layer check its remaining budget before each attempt and stop the chain if budget is too low.
This PR is reactive: an isTerminalAbort(signal) check in shouldRethrowAbort that inspects signal.reason after an attempt aborts to decide whether to rethrow or retry.

These are complementary, not exclusive. #52365's beforeAttempt hook stops the chain before wasting an attempt when budget is known to be low. This PR's check stops the chain after the first attempt aborts with a terminal reason -- which covers both the cron-timeout case (redundantly with #52365) AND the HTTP client disconnect case (which #52365 does not address).

Notable differences:

Scope: this PR is 8 files / ~276 lines touching only model-fallback.ts, http-common.ts, agent-command.ts, 3 auto-reply callers, and one cron caller. fix(cron): stop fallback attempts when cron budget is exhausted #52365 is 47 files / ~1900 lines (includes unrelated slack/plugin-sdk changes).
Client disconnect coverage: this PR handles the HTTP /v1/responses and /v1/chat/completions client-disconnect case by tagging the abort with ClientDisconnectError in watchClientDisconnect. fix(cron): stop fallback attempts when cron budget is exhausted #52365 does not touch the gateway HTTP path.
Detection layer: this PR checks signal.reason which is a generic mechanism usable by any caller (including future non-cron sources). fix(cron): stop fallback attempts when cron budget is exhausted #52365's beforeAttempt hook requires each caller to implement its own budget-check logic.

Either PR alone fixes the #60388 cron-timeout case. Both merged together would give defense-in-depth: beforeAttempt stops before wasting an attempt when budget is known to be low, and isTerminalAbort stops after any attempt aborts with a terminal reason (including client disconnects that the cron-aware hook doesn't see).

If the maintainers prefer #52365's approach and would rather not have two overlapping mechanisms, I'd suggest retargeting this PR to cover only the ClientDisconnectError branch of isTerminalAbort (the unique coverage) and letting #52365 handle the cron-timeout case via its proactive hook.

greptile-apps · 2026-04-07T19:53:00Z

Greptile Summary

This PR adds terminal abort detection to the model fallback chain to prevent wasteful API retries when a run-budget timeout or HTTP client disconnect has already terminated the run. It introduces isTerminalAbort(signal) which inspects signal.reason.name for "TimeoutError" and "ClientDisconnectError", threads an optional abortSignal parameter through runWithModelFallback and runWithImageModelFallback, and tags watchClientDisconnect aborts with the new ClientDisconnectError reason. The implementation correctly places the terminal-abort guard before coerceToFailoverError to prevent rate-limit-shaped errors from masking a terminal signal, and the 7 new unit tests cover the full contract including the RESOURCE_EXHAUSTED-race edge case.

Confidence Score: 5/5

Safe to merge — the new abortSignal parameter is optional, all existing callers are unaffected, and the terminal-abort path is gated behind an explicit signal.reason name check.

All remaining findings are P2 or below. The implementation is well-scoped, the double isTerminalAbort guard (pre- and post-coerceToFailoverError) is defensive but not harmful, tests cover all advertised scenarios including the failover-normalization race, and backward compatibility is preserved by design.

No files require special attention.

_{Reviews (2): Last reviewed commit: "fix(agents): distinguish terminal aborts..." | Re-trigger Greptile}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 537fee730e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Edge case flagged by greptile on openclaw#62682: if the signal is aborted with a terminal reason (run-budget timeout) but the thrown error also matches a failover-normalizable shape (e.g. Google Vertex RESOURCE_EXHAUSTED), the `shouldRethrowAbort && !normalizedFailover` guard falls through and the chain tries the next candidate anyway. Fix: check isTerminalAbort(signal) before running coerceToFailoverError so the terminal signal cannot be masked by rate-limit normalization. Adds a regression test that locks the contract: aborted signal with TimeoutError reason + 429/AbortError thrown => first candidate runs, error rethrown, no further candidates tried. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7c59102df7

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

simonusa · 2026-04-07T20:29:16Z

@greptileai review

Flagged by codex review on openclaw#62682: `isTerminalAbort` only handled `reason instanceof Error`, but `src/cron/service/timer.ts:90` aborts the run controller with a plain string (`timeoutErrorMessage()`), so cron timeouts were falling through to fallback retries. Fix: add a known-terminal-strings set (currently just "cron: job execution timed out") and match string reasons against it. Also match Error objects whose `.message` equals a known terminal string, to cover callers that wrap the message in `new Error(...)` before passing it to `abort()`. Three new regression tests cover: string reason match, wrapped-in-Error match, and back-compat (unrelated strings still flow through fallback). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

simonusa · 2026-04-09T11:07:58Z

checks-node-test failed with an OOM crash (FATAL ERROR: Reached heap limit) unrelated to this PR. Could a maintainer re-run the check?

simonusa · 2026-04-09T21:44:07Z

check-additional failures are pre-existing upstream lint debt unrelated to this PR:

lint:tmp:no-random-messaging — os.tmpdir() usage in extensions/active-memory, browser, memory-core, etc. (being addressed by #63902)
lint:tmp:channel-agnostic-boundaries — channel id literals in src/agents/acp-spawn.ts
lint:tmp:no-raw-channel-fetch — raw fetch() in extensions/browser/src/browser/client-fetch.ts
All three fail on the current upstream/main HEAD independently of this PR. This PR only touches model-fallback.ts, agent-command.ts, gateway/http-common.ts, and followup-runner.ts.

Edge case flagged by greptile on openclaw#62682: if the signal is aborted with a terminal reason (run-budget timeout) but the thrown error also matches a failover-normalizable shape (e.g. Google Vertex RESOURCE_EXHAUSTED), the `shouldRethrowAbort && !normalizedFailover` guard falls through and the chain tries the next candidate anyway. Fix: check isTerminalAbort(signal) before running coerceToFailoverError so the terminal signal cannot be masked by rate-limit normalization. Adds a regression test that locks the contract: aborted signal with TimeoutError reason + 429/AbortError thrown => first candidate runs, error rethrown, no further candidates tried. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

clawsweeper · 2026-05-01T00:31:01Z

Codex review: needs maintainer review before merge.

Summary
The PR teaches model fallback to stop on terminal run aborts, tags HTTP disconnect aborts, adds an embedded run-budget timeout failover flag, records it in trajectory metadata, and adds regression tests plus a changelog entry.

Reproducibility: yes. A source-level reproduction is an embedded run with a short timeoutMs and fallback candidates: scheduleAbortTimer aborts the private runAbortController with a TimeoutError, which current main can map into timeout failover; the PR adds unit and policy coverage for both the thrown cause-chain and explicit policy-flag paths.

Next step before merge
No narrow automated repair remains; maintainer review should decide whether to merge this terminal-abort path alongside the related cron-budget PRs.

Security
Cleared: The diff changes in-process abort classification, signal plumbing, gateway disconnect tagging, trajectory metadata, tests, and changelog only; it adds no dependency, workflow, package, secret, download, or execution-surface change.

Review details

Best possible solution:

Land this narrow terminal-abort contract if maintainers want reactive coverage for embedded budget exhaustion, cron timeout aborts, and HTTP disconnects, while leaving broader cron controller redesign to the related cron PRs.

Do we have a high-confidence way to reproduce the issue?

Yes. A source-level reproduction is an embedded run with a short timeoutMs and fallback candidates: scheduleAbortTimer aborts the private runAbortController with a TimeoutError, which current main can map into timeout failover; the PR adds unit and policy coverage for both the thrown cause-chain and explicit policy-flag paths.

Is this the best way to solve the issue?

Yes. The latest patch is a narrow maintainable fix for the reactive terminal-abort gap: classify explicit terminal abort markers at the fallback boundary and add a runner-owned timedOutByRunBudget flag, while preserving ordinary LLM/provider timeouts as fallback-eligible.

What I checked:

Current fallback gap on main: Current main only rethrows AbortError when it is not recognized as a timeout, and fallback normalization runs before that guard, so timeout-shaped aborts can continue through the fallback chain. (src/agents/model-fallback.ts:107, 9eb79bcf997b)
Embedded run-budget timeout source: Current main's run-budget timer calls abortRun(true), which aborts the private runAbortController with a TimeoutError reason; the caller abort signal is handled separately in onAbort. (src/agents/pi-embedded-runner/run/attempt.ts:2245, 9eb79bcf997b)
Current failover policy rotates on run timeout: Current main still rotates/falls back for timedOut assistant attempts unless the timeout happened during compaction or tool execution. (src/agents/pi-embedded-runner/run/failover-policy.ts:82, 9eb79bcf997b)
PR classifies terminal aborts before normalization: The PR checks terminal abort state before failover normalization and inspects both the caller signal and the thrown error cause chain, covering the private embedded controller path described in the earlier review. (src/agents/model-fallback.ts:377, 3fa442cb0ef1)
PR marks embedded budget exhaustion explicitly: The PR sets timedOutByRunBudget when scheduleAbortTimer fires, then carries that marker into trajectory artifacts and run results. (src/agents/pi-embedded-runner/run/attempt.ts:2425, 3fa442cb0ef1)
PR preserves fallback for non-budget timeouts: The PR updates shouldRotateAssistant to skip rotation only when timedOutByRunBudget is set, leaving ordinary LLM/provider timeout fallback eligible. (src/agents/pi-embedded-runner/run/failover-policy.ts:88, 3fa442cb0ef1)

Likely related people:

steipete: Recent commit history routes core model-fallback and embedded-runner failover maintenance through this handle, including lazy-loader refactors, model alias fallback fixes, raw fallback schema preservation, and extraction of embedded failover helpers. (role: recent maintainer and adjacent owner; confidence: high; commits: 59fb9e5ca7fe, aec5efed8d43, 5b39be3653ac; files: src/agents/model-fallback.ts, src/agents/pi-embedded-runner/run/attempt.ts, src/agents/pi-embedded-runner/run/failover-policy.ts)
obviyus: Merged and recently touched the closely related tool-execution timeout exemption in the same embedded-runner failover path, which this PR parallels for run-budget exhaustion. (role: recent merger and timeout-failover maintainer; confidence: high; commits: 731f640bca0a, 2605490dbd30; files: src/agents/pi-embedded-runner/run/attempt.ts, src/agents/pi-embedded-runner/run/failover-policy.ts)
simonusa: Beyond authoring this PR, this handle appears in merged history for the adjacent fix(agents): exempt tool-execution timeouts from model fallback (#52147) #75873 timeout-failover exemption, making them relevant to the current policy shape. (role: prior merged contributor in this area; confidence: medium; commits: 2605490dbd30; files: src/agents/pi-embedded-runner/run/attempt.ts, src/agents/pi-embedded-runner/run/failover-policy.ts)
Lellansin: Introduced the merged HTTP client-disconnect abort plumbing that this PR now tags with a terminal abort reason. (role: HTTP disconnect surface contributor; confidence: medium; commits: aad3bbebdd87; files: src/gateway/http-common.ts, src/gateway/openai-http.ts, src/gateway/openresponses-http.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against 9eb79bcf997b.

simonusa · 2026-05-01T00:39:35Z

Rebased onto current upstream/main (commit e47a7448e9). Three patches preserved:

a6444d0a7f — fix(agents): distinguish terminal aborts from retryable failures (closes Don't trigger model fallback when abort reason is the run's own timeout budget #60388)
6e68d934b0 — fix(agents): check isTerminalAbort before coerceToFailoverError
59e763c87f — fix(agents): treat string abort reasons as terminal aborts

Conflicts resolved across 4 files (src/agents/agent-command.ts, src/agents/model-fallback.ts, src/auto-reply/reply/agent-runner-execution.ts, src/auto-reply/reply/followup-runner.ts) — all "keep both" merges where upstream added new options to runWithModelFallback (onFallbackStep, classifyResult) and our patch adds abortSignal. No semantic conflict — both extensions land alongside each other.

Local validation: npx vitest run src/agents/model-fallback.test.ts → 152/152 pass, including the 6 new abort-signal tests added in this PR.

Re: the 2 CI failures (checks-node-core + checks-node-core-runtime-infra):

Both surface the same root cause — src/config/schema.base.generated.test.ts failing with "expected schema to deeply equal generated payload." This is a stale generated baseline, not a real test failure: a recent upstream config schema change wasn't followed by pnpm generate:base-config-schema. Same class as #63902 (the lint-debt failures from my prior CI run that have since been cleared). Unrelated to this PR's changes (which are scoped to src/agents/, src/auto-reply/, no config schema touch).

The original CI failures from Apr 9 (check-additional lint debt + checks-node-test OOM) are now passing — net progress from the rebase.

Ready for review.

Edge case flagged by greptile on openclaw#62682: if the signal is aborted with a terminal reason (run-budget timeout) but the thrown error also matches a failover-normalizable shape (e.g. Google Vertex RESOURCE_EXHAUSTED), the `shouldRethrowAbort && !normalizedFailover` guard falls through and the chain tries the next candidate anyway. Fix: check isTerminalAbort(signal) before running coerceToFailoverError so the terminal signal cannot be masked by rate-limit normalization. Adds a regression test that locks the contract: aborted signal with TimeoutError reason + 429/AbortError thrown => first candidate runs, error rethrown, no further candidates tried. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

openclaw#60388) When a model fallback chain runs after an abort, today the layer cannot tell the difference between: 1. "This model failed, try another" -> fallback should retry 2. "The whole run is over" -> fallback should stop immediately Two situations where the run is over and retrying wastes resources: - **Run-budget timeout**: The embedded runner's `scheduleAbortTimer` fires `runAbortController.abort(makeTimeoutAbortReason())` after the configured `agents.defaults.timeoutSeconds`. The budget is exhausted -- giving the next candidate ~0 ms remaining is guaranteed to fail and wastes API calls. This is openclaw#60388. - **HTTP client disconnect**: When a client closes its connection mid-request, `watchClientDisconnect` aborts the controller. No caller is left to receive a response, so any tokens spent on a fallback model are wasted. Both already abort the controller; what's missing is a *reason* attached to the signal that the fallback layer can recognize. This patch: - Adds `ClientDisconnectError` and tags the abort in `watchClientDisconnect` via `abortController.abort(new ClientDisconnectError())`. The run-timeout case already tags with `name === "TimeoutError"` via `makeTimeoutAbortReason`. - Adds `isTerminalAbort(signal)` in `model-fallback.ts` that walks one cause level (because pi-embedded-runner's `makeAbortError` wraps the original reason as an outer AbortError with `.cause` set) and recognizes both `TimeoutError` and `ClientDisconnectError` as terminal markers. - Wires `shouldRethrowAbort(err, signal)` to short-circuit the fallback chain when the signal carries a terminal reason. - Adds `abortSignal` parameter to `runWithModelFallback` and plumbs it through `runFallbackAttempt` -> `runFallbackCandidate`. Updates the five callers that have a signal in scope (`agent-command.ts`, `agent-runner-execution.ts`, `agent-runner-memory.ts`, `followup-runner.ts`, `cron/isolated-agent/run-executor.ts`). - Six new unit tests in `model-fallback.test.ts` covering: TimeoutError reason, ClientDisconnectError reason, nested cause chain detection, generic reason falls back normally, no signal back-compat, signal-not-aborted back-compat. `runWithImageModelFallback` also gets the `abortSignal?` parameter for consistency. Detection is via `signal.reason` (not `err.message` or `err.name`) because the fetch wrapper re-throws aborts as a generic `AbortError` that loses the original tag in `.name`. The `signal.reason` survives this round-trip. Test results: `model-fallback.test.ts` 70/70 passed, `model-fallback.probe.test.ts` 20/20 passed, `agent-command.live-model-switch.test.ts` 8/8 passed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Edge case flagged by greptile on openclaw#62682: if the signal is aborted with a terminal reason (run-budget timeout) but the thrown error also matches a failover-normalizable shape (e.g. Google Vertex RESOURCE_EXHAUSTED), the `shouldRethrowAbort && !normalizedFailover` guard falls through and the chain tries the next candidate anyway. Fix: check isTerminalAbort(signal) before running coerceToFailoverError so the terminal signal cannot be masked by rate-limit normalization. Adds a regression test that locks the contract: aborted signal with TimeoutError reason + 429/AbortError thrown => first candidate runs, error rethrown, no further candidates tried. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Flagged by codex review on openclaw#62682: `isTerminalAbort` only handled `reason instanceof Error`, but `src/cron/service/timer.ts:90` aborts the run controller with a plain string (`timeoutErrorMessage()`), so cron timeouts were falling through to fallback retries. Fix: add a known-terminal-strings set (currently just "cron: job execution timed out") and match string reasons against it. Also match Error objects whose `.message` equals a known terminal string, to cover callers that wrap the message in `new Error(...)` before passing it to `abort()`. Three new regression tests cover: string reason match, wrapped-in-Error match, and back-compat (unrelated strings still flow through fallback). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Address P2/P3 bot review on PR openclaw#62682: [P2] Embedded run-budget timeouts misclassified — `opts.abortSignal` is the caller-provided signal (HTTP disconnect, cron wrapper). It is NOT the embedded runner's private `runAbortController` — when `scheduleAbortTimer` fires and calls `abortRun(true)`, only the internal controller is aborted with a TimeoutError. The previous patch's `isTerminalAbort(opts.abortSignal)` check therefore missed the openclaw#60388 case entirely; the fallback chain still tried alternative models even though the whole run's deadline had elapsed. Two complementary changes address this: (1) `EmbeddedRunAttemptResult.timedOutByRunBudget` flag — set explicitly in attempt.ts when the run-budget timer fires (NOT when LLM idle watchdog fires; that's still LLM-phase and benefits from fallback). Threaded through to failover-policy `shouldRotateAssistant` alongside the existing `timedOutDuringCompaction` and `timedOutDuringToolExecution` exemptions. Also gates the timeout-triggered compaction branch in run.ts (compacting wouldn't help — the deadline is exhausted). Optional in result type for public harness SDK back-compat (matches openclaw#75873 pattern). (2) `isTerminalAbortFromError(err)` — new helper that mirrors the existing `isTerminalAbort(signal)` checks but reads the thrown error's `.cause` chain. Needed because `abortable()` wraps the embedded controller's TimeoutError in an outer AbortError, and the fallback layer only sees the thrown error (not the embedded controller's signal). Used in `runFallbackCandidate.catch` next to the existing signal check. [P3] CHANGELOG entry added under Unreleased/Fixes. Tests: 197/197 pass in failover-policy, assistant-failover, model-fallback, and trajectory-metadata suites. New regression cases cover: (a) thrown error with TimeoutError in cause chain rethrows; (b) thrown error with ClientDisconnectError in cause chain rethrows; (c) generic AbortError still falls back; (d) failover-policy exempts run-budget timeouts both pre- and post-rotation. Closes openclaw#60388.

simonusa · 2026-05-02T19:55:31Z

Addressed both clawsweeper review findings in 3fa442cb0e:

[P2] Embedded run-budget timeouts now marked as terminal

The bot was correct that the prior approach (isTerminalAbort(opts.abortSignal)) only caught caller-driven aborts (HTTP disconnect, cron wrapper) — it missed the scheduleAbortTimer case described in #60388, where the embedded runner aborts a private runAbortController that the fallback layer never sees.

Two complementary changes address this:

EmbeddedRunAttemptResult.timedOutByRunBudget flag — set explicitly in attempt.ts when the run-budget timer fires (NOT when LLM idle watchdog fires; that's still LLM-phase and benefits from fallback). Threaded through to failover-policy.shouldRotateAssistant alongside the existing timedOutDuringCompaction and timedOutDuringToolExecution exemptions, and gates the timeout-triggered compaction branch in run.ts. Optional in result type for public harness SDK back-compat (matches the pattern proven in fix(agents): exempt tool-execution timeouts from model fallback (#52147) #75873).
isTerminalAbortFromError(err) helper — mirrors the existing isTerminalAbort(signal) checks but inspects the thrown error's .cause chain. Needed because abortable() wraps the embedded controller's TimeoutError in an outer AbortError; the fallback layer only sees the thrown error, not the embedded controller's signal.

[P3] CHANGELOG entry added under Unreleased / Fixes.

Tests: 197/197 pass in model-fallback.test.ts, failover-policy.test.ts, assistant-failover.test.ts, and trajectory/metadata.test.ts. Five new regression cases cover:

Thrown error with TimeoutError in cause chain → rethrows immediately (the embedded run-budget case)
Thrown error with ClientDisconnectError in cause chain → rethrows immediately
Generic AbortError without terminal cause → still falls back normally (back-compat)
Failover-policy: timedOutByRunBudget=true returns continue_normal both pre- and post-rotation

@steipete @obviyus — flagging since you both maintain this area and @obviyus recently merged the closely-related #75873 (timedOutDuringToolExecution exemption for tool-phase timeouts). This PR adds the parallel exemption for run-budget exhaustion, completing the coverage of non-LLM timeout causes that should skip model fallback.

Acceptance criteria from bot review (all run locally, all green):

pnpm test src/agents/model-fallback.test.ts
pnpm test src/agents/pi-embedded-runner/run/failover-policy.test.ts
pnpm test src/agents/pi-embedded-runner/run/assistant-failover.test.ts
pnpm test src/trajectory/metadata.test.ts

Closes #60388.

openclaw-barnacle Bot added gateway Gateway runtime agents Agent runtime and tooling size: M labels Apr 7, 2026

chatgpt-codex-connector Bot reviewed Apr 7, 2026

View reviewed changes

Comment thread src/agents/model-fallback.ts

simonusa force-pushed the fix/terminal-abort-reasons branch from 537fee7 to 7c59102 Compare April 7, 2026 20:16

chatgpt-codex-connector Bot reviewed Apr 7, 2026

View reviewed changes

Comment thread src/agents/model-fallback.ts Outdated

simonusa force-pushed the fix/terminal-abort-reasons branch from 7c59102 to fa41f56 Compare April 7, 2026 20:38

simonusa force-pushed the fix/terminal-abort-reasons branch from fa41f56 to 235912a Compare April 8, 2026 11:36

simonusa force-pushed the fix/terminal-abort-reasons branch from 235912a to 2281256 Compare April 9, 2026 21:25

clawsweeper Bot mentioned this pull request Apr 27, 2026

fix(cron): per-attempt AbortControllers and deferred execution timeout #42482

Closed

3 tasks

BingqingLyu mentioned this pull request Apr 28, 2026

fix(agents): distinguish terminal aborts from retryable failures (#60388) BingqingLyu/openclaw#2114

Open

3 tasks

This was referenced Apr 28, 2026

Don't trigger model fallback when abort reason is the run's own timeout budget #60388

Open

fix(cron): stop fallback attempts when cron budget is exhausted #52365

Open

simonusa force-pushed the fix/terminal-abort-reasons branch from 2281256 to 59e763c Compare May 1, 2026 00:29

simonusa force-pushed the fix/terminal-abort-reasons branch from 59e763c to 72ae084 Compare May 2, 2026 16:45

simonusa force-pushed the fix/terminal-abort-reasons branch from 72ae084 to 8cd3c1a Compare May 2, 2026 17:05

simonusa and others added 4 commits May 2, 2026 12:54

simonusa force-pushed the fix/terminal-abort-reasons branch from 8cd3c1a to 3fa442c Compare May 2, 2026 19:54

openclaw-barnacle Bot added size: L and removed size: M labels May 2, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(agents): distinguish terminal aborts from retryable failures (#60388)#62682

fix(agents): distinguish terminal aborts from retryable failures (#60388)#62682
simonusa wants to merge 4 commits intoopenclaw:mainfrom
simonusa:fix/terminal-abort-reasons

simonusa commented Apr 7, 2026

Uh oh!

greptile-apps Bot commented Apr 7, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

simonusa commented Apr 7, 2026

Uh oh!

simonusa commented Apr 9, 2026

Uh oh!

simonusa commented Apr 9, 2026

Uh oh!

clawsweeper Bot commented May 1, 2026 •

edited

Loading

Uh oh!

simonusa commented May 1, 2026

Uh oh!

simonusa commented May 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

simonusa commented Apr 7, 2026

Summary

Change Type

Scope

Linked Issue

Root Cause

Regression Test Plan

User-visible / Behavior Changes

Diagram

Security Impact

Repro + Verification

Environment

Steps

Expected (after fix)

Actual (before fix)

Evidence

Human Verification

Compatibility / Migration

Risks and Mitigations

Out of scope (for separate PRs)

Relationship to PR #52365

Uh oh!

greptile-apps Bot commented Apr 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

simonusa commented Apr 7, 2026

Uh oh!

simonusa commented Apr 9, 2026

Uh oh!

simonusa commented Apr 9, 2026

Uh oh!

clawsweeper Bot commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

simonusa commented May 1, 2026

Uh oh!

simonusa commented May 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

greptile-apps Bot commented Apr 7, 2026 •

edited

Loading

clawsweeper Bot commented May 1, 2026 •

edited

Loading