Skip to content

fix(agents): distinguish terminal aborts from retryable failures (#60388)#62682

Open
simonusa wants to merge 4 commits intoopenclaw:mainfrom
simonusa:fix/terminal-abort-reasons
Open

fix(agents): distinguish terminal aborts from retryable failures (#60388)#62682
simonusa wants to merge 4 commits intoopenclaw:mainfrom
simonusa:fix/terminal-abort-reasons

Conversation

@simonusa
Copy link
Copy Markdown
Contributor

@simonusa simonusa commented Apr 7, 2026

Addresses #60388 (complementary to #52365, see "Relationship to PR #52365" below)

Today the fallback layer cannot tell the difference between two very different aborts:

  1. "This model failed, try another" -> fallback should retry
  2. "The whole run is over" -> fallback should stop immediately

Two situations where the run is over and retrying with another model wastes resources:

Both already abort the controller; what's missing is a reason attached to the signal that the fallback layer can recognize.

Summary

Change Type

  • Bug fix

Scope

  • Gateway / orchestration

Linked Issue

Root Cause

  • Root cause: shouldRethrowAbort() in model-fallback.ts checks isFallbackAbortError(err) && !isTimeoutError(err) -- which means timeout errors are intentionally not rethrown, so the fallback chain runs them. This is correct for per-provider timeouts (provider is slow, try another), but wrong for run-budget timeouts (whole run is out of time, no point retrying). The two cannot be told apart from the error alone.
  • Missing detection / guardrail: No check for why the abort happened. The signal.reason carrying TimeoutError (set by pi-embedded-runner/run/attempt.ts:1382-1386 via makeTimeoutAbortReason) was never inspected.
  • Contributing context: The HTTP client disconnect path added in fix(gateway): propagate AbortController signal on HTTP client disconnect #54388 has the same shape -- it tags the abort with no reason today, so a downstream client disconnect also flows through fallback retries.

Regression Test Plan

  • Coverage level: [x] Unit test
  • Target test or file: src/agents/model-fallback.test.ts
  • Scenario the tests lock in: Six new tests under describe("terminal abort propagation (closes #60388)"):
    • signal.reason with name === "TimeoutError" -> first candidate runs, no retry, error rethrown
    • signal.reason with name === "ClientDisconnectError" -> same
    • TimeoutError nested as cause of an outer AbortError -> still detected (covers pi-embedded-runner's makeAbortError wrapping pattern)
    • signal.reason with a generic error -> fallback runs normally (non-terminal)
    • No abortSignal passed -> fallback runs normally (back-compat for existing callers)
    • abortSignal provided but not aborted -> fallback runs normally (live-signal back-compat)
  • Why this is the smallest reliable guardrail: The tests construct the abort signal directly and assert on run.mock.calls.length === 1 to verify the chain stopped. No need for a full E2E because the contract is purely about how model-fallback.ts interprets AbortSignal.reason.
  • Existing test that already covers this: None.
  • If no new test is added, why not: 6 new tests added.

User-visible / Behavior Changes

When a run-budget timeout fires (the agent run exceeds agents.defaults.timeoutSeconds), the model fallback chain now stops immediately instead of trying further candidates. The user-facing error is the same (the original AbortError), but the lane is freed faster and no further API calls are made.

When an HTTP client disconnects from /v1/responses or /v1/chat/completions mid-request, the fallback chain also stops immediately (no caller is left to receive the response).

Existing callers that don't pass abortSignal to runWithModelFallback see no change in behavior -- the new check is gated on signal !== undefined && signal.aborted.

Diagram

Before:
[run-budget timer fires]
  -> runAbortController.abort(TimeoutError)
  -> agent attempt throws AbortError
  -> shouldRethrowAbort(err) returns false (because isTimeoutError(err) is true)
  -> runFallbackCandidate returns { ok: false } -> tries next candidate
  -> next candidate also times out (~0ms budget left) -> tries next ...
  -> wasted API calls

After:
[run-budget timer fires]
  -> runAbortController.abort(TimeoutError)  // unchanged
  -> agent attempt throws AbortError
  -> isTerminalAbort(signal) returns true (signal.reason.name === "TimeoutError")
  -> shouldRethrowAbort(err, signal) returns true
  -> error rethrown immediately, no further candidates tried

Security Impact

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No (this REDUCES network calls in the timeout path)
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: macOS (Docker linux/amd64)
  • Runtime/container: Node 22 / OpenClaw built from this branch
  • Test runner: Vitest

Steps

  1. Configure an agent with a small agents.defaults.timeoutSeconds (e.g. 5s) and one or more model fallbacks
  2. Send a request that takes longer than the timeout
  3. Observe the fallback chain behavior

Expected (after fix)

  • Primary candidate runs, hits the timeout, fallback chain stops, single error returned
  • Logs show no [model-fallback] candidate_failed entries for fallback candidates beyond the first

Actual (before fix)

  • Primary candidate runs, hits the timeout
  • Fallback layer tries the next candidate with ~0ms budget remaining
  • Next candidate also times out, fallback layer tries the next, and so on
  • 2-3x the API calls before the chain exhausts and returns an error

Evidence

$ npx vitest run src/agents/model-fallback.test.ts

Test Files  1 passed (1)
     Tests  70 passed (70)

The new test cases:

  • rethrows immediately when signal.reason has name=TimeoutError (run-budget timeout)
  • rethrows immediately when signal.reason has name=ClientDisconnectError
  • detects TimeoutError nested as cause of an outer Error
  • falls back normally when signal is aborted with a non-terminal reason
  • falls back normally when no abortSignal is passed (back-compat)
  • falls back normally when signal is provided but not aborted

Human Verification

  • Verified scenarios:
    • All 70 tests in model-fallback.test.ts pass after the change
    • All 20 tests in model-fallback.probe.test.ts pass
    • All 8 tests in agent-command.live-model-switch.test.ts pass
    • End-to-end smoke: container rebuilt from this branch, HTTP client disconnect on /v1/responses produces the expected [openresponses] client disconnected, aborting streaming run runId=... log line and the agent run terminates within ~1s (the upstream watchClientDisconnect plumbing already in aad3bbedd works correctly with the new ClientDisconnectError reason tag)
  • Edge cases checked: cause-chain walking (one level deep) for TimeoutError/ClientDisconnectError wrapped inside an outer AbortError -- this matches pi-embedded-runner/run/attempt.ts:1387's makeAbortError pattern
  • What I did not verify:
    • Direct E2E of the fallback-skip path with a real cron timeout firing (would require setting timeoutSeconds to a very small value; the unit tests cover the contract)
    • The image-model fallback variant (runWithImageModelFallback) -- it gets the new abortSignal? parameter for consistency but no callers currently pass a signal

Compatibility / Migration

  • Backward compatible? Yes -- the new abortSignal? parameter is optional. Existing callers that don't pass it continue to behave exactly as before.
  • Config/env changes? No
  • Migration needed? No

Risks and Mitigations

  • Risk: A future caller passes an abort signal whose signal.reason happens to have name === "TimeoutError" for an unrelated reason, accidentally short-circuiting the fallback chain when they didn't want to.
    • Mitigation: The check is intentionally narrow -- only the exact name strings TimeoutError and ClientDisconnectError match. Both names are already reserved for run-budget timeout (set by makeTimeoutAbortReason in pi-embedded-runner/run/attempt.ts) and HTTP client disconnect (set by watchClientDisconnect in gateway/http-common.ts after this PR).
  • Risk: TimeoutError from a per-provider request timeout (not run-budget) gets mistaken for a run-budget timeout and skips fallback when it shouldn't.
    • Mitigation: Per-provider timeouts come from the LLM SDK's internal fetch timeout, which throws an error directly -- they don't propagate through runAbortController.abort(). Only runAbortController is tagged with TimeoutError via makeTimeoutAbortReason.

Out of scope (for separate PRs)

Relationship to PR #52365

PR #52365 (fix(cron): stop fallback attempts when cron budget is exhausted) addresses the same underlying problem from #60388 but with a different, complementary mechanism:

  • fix(cron): stop fallback attempts when cron budget is exhausted #52365 is proactive: a new beforeAttempt hook in runWithModelFallback that lets the cron layer check its remaining budget before each attempt and stop the chain if budget is too low.
  • This PR is reactive: an isTerminalAbort(signal) check in shouldRethrowAbort that inspects signal.reason after an attempt aborts to decide whether to rethrow or retry.

These are complementary, not exclusive. #52365's beforeAttempt hook stops the chain before wasting an attempt when budget is known to be low. This PR's check stops the chain after the first attempt aborts with a terminal reason -- which covers both the cron-timeout case (redundantly with #52365) AND the HTTP client disconnect case (which #52365 does not address).

Notable differences:

Either PR alone fixes the #60388 cron-timeout case. Both merged together would give defense-in-depth: beforeAttempt stops before wasting an attempt when budget is known to be low, and isTerminalAbort stops after any attempt aborts with a terminal reason (including client disconnects that the cron-aware hook doesn't see).

If the maintainers prefer #52365's approach and would rather not have two overlapping mechanisms, I'd suggest retargeting this PR to cover only the ClientDisconnectError branch of isTerminalAbort (the unique coverage) and letting #52365 handle the cron-timeout case via its proactive hook.

@openclaw-barnacle openclaw-barnacle Bot added gateway Gateway runtime agents Agent runtime and tooling size: M labels Apr 7, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 7, 2026

Greptile Summary

This PR adds terminal abort detection to the model fallback chain to prevent wasteful API retries when a run-budget timeout or HTTP client disconnect has already terminated the run. It introduces isTerminalAbort(signal) which inspects signal.reason.name for "TimeoutError" and "ClientDisconnectError", threads an optional abortSignal parameter through runWithModelFallback and runWithImageModelFallback, and tags watchClientDisconnect aborts with the new ClientDisconnectError reason. The implementation correctly places the terminal-abort guard before coerceToFailoverError to prevent rate-limit-shaped errors from masking a terminal signal, and the 7 new unit tests cover the full contract including the RESOURCE_EXHAUSTED-race edge case.

Confidence Score: 5/5

Safe to merge — the new abortSignal parameter is optional, all existing callers are unaffected, and the terminal-abort path is gated behind an explicit signal.reason name check.

All remaining findings are P2 or below. The implementation is well-scoped, the double isTerminalAbort guard (pre- and post-coerceToFailoverError) is defensive but not harmful, tests cover all advertised scenarios including the failover-normalization race, and backward compatibility is preserved by design.

No files require special attention.

Reviews (2): Last reviewed commit: "fix(agents): distinguish terminal aborts..." | Re-trigger Greptile

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 537fee730e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/agents/model-fallback.ts
@simonusa simonusa force-pushed the fix/terminal-abort-reasons branch from 537fee7 to 7c59102 Compare April 7, 2026 20:16
simonusa added a commit to simonusa/simons-openclaw that referenced this pull request Apr 7, 2026
Edge case flagged by greptile on openclaw#62682: if the signal
is aborted with a terminal reason (run-budget timeout) but the thrown
error also matches a failover-normalizable shape (e.g. Google Vertex
RESOURCE_EXHAUSTED), the `shouldRethrowAbort && !normalizedFailover`
guard falls through and the chain tries the next candidate anyway.

Fix: check isTerminalAbort(signal) before running coerceToFailoverError
so the terminal signal cannot be masked by rate-limit normalization.

Adds a regression test that locks the contract: aborted signal with
TimeoutError reason + 429/AbortError thrown => first candidate runs,
error rethrown, no further candidates tried.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7c59102df7

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/agents/model-fallback.ts Outdated
@simonusa
Copy link
Copy Markdown
Contributor Author

simonusa commented Apr 7, 2026

@greptileai review

@simonusa simonusa force-pushed the fix/terminal-abort-reasons branch from 7c59102 to fa41f56 Compare April 7, 2026 20:38
simonusa added a commit to simonusa/simons-openclaw that referenced this pull request Apr 7, 2026
Flagged by codex review on openclaw#62682: `isTerminalAbort`
only handled `reason instanceof Error`, but `src/cron/service/timer.ts:90`
aborts the run controller with a plain string (`timeoutErrorMessage()`),
so cron timeouts were falling through to fallback retries.

Fix: add a known-terminal-strings set (currently just
"cron: job execution timed out") and match string reasons against it.
Also match Error objects whose `.message` equals a known terminal
string, to cover callers that wrap the message in `new Error(...)`
before passing it to `abort()`.

Three new regression tests cover: string reason match, wrapped-in-Error
match, and back-compat (unrelated strings still flow through fallback).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@simonusa simonusa force-pushed the fix/terminal-abort-reasons branch from fa41f56 to 235912a Compare April 8, 2026 11:36
@simonusa
Copy link
Copy Markdown
Contributor Author

simonusa commented Apr 9, 2026

checks-node-test failed with an OOM crash (FATAL ERROR: Reached heap limit) unrelated to this PR. Could a maintainer re-run the check?

@simonusa simonusa force-pushed the fix/terminal-abort-reasons branch from 235912a to 2281256 Compare April 9, 2026 21:25
@simonusa
Copy link
Copy Markdown
Contributor Author

simonusa commented Apr 9, 2026

check-additional failures are pre-existing upstream lint debt unrelated to this PR:

lint:tmp:no-random-messaging — os.tmpdir() usage in extensions/active-memory, browser, memory-core, etc. (being addressed by #63902)
lint:tmp:channel-agnostic-boundaries — channel id literals in src/agents/acp-spawn.ts
lint:tmp:no-raw-channel-fetch — raw fetch() in extensions/browser/src/browser/client-fetch.ts
All three fail on the current upstream/main HEAD independently of this PR. This PR only touches model-fallback.ts, agent-command.ts, gateway/http-common.ts, and followup-runner.ts.

@simonusa simonusa force-pushed the fix/terminal-abort-reasons branch from 2281256 to 59e763c Compare May 1, 2026 00:29
simonusa added a commit to simonusa/simons-openclaw that referenced this pull request May 1, 2026
Edge case flagged by greptile on openclaw#62682: if the signal
is aborted with a terminal reason (run-budget timeout) but the thrown
error also matches a failover-normalizable shape (e.g. Google Vertex
RESOURCE_EXHAUSTED), the `shouldRethrowAbort && !normalizedFailover`
guard falls through and the chain tries the next candidate anyway.

Fix: check isTerminalAbort(signal) before running coerceToFailoverError
so the terminal signal cannot be masked by rate-limit normalization.

Adds a regression test that locks the contract: aborted signal with
TimeoutError reason + 429/AbortError thrown => first candidate runs,
error rethrown, no further candidates tried.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@clawsweeper
Copy link
Copy Markdown
Contributor

clawsweeper Bot commented May 1, 2026

Codex review: needs maintainer review before merge.

Summary
The PR teaches model fallback to stop on terminal run aborts, tags HTTP disconnect aborts, adds an embedded run-budget timeout failover flag, records it in trajectory metadata, and adds regression tests plus a changelog entry.

Reproducibility: yes. A source-level reproduction is an embedded run with a short timeoutMs and fallback candidates: scheduleAbortTimer aborts the private runAbortController with a TimeoutError, which current main can map into timeout failover; the PR adds unit and policy coverage for both the thrown cause-chain and explicit policy-flag paths.

Next step before merge
No narrow automated repair remains; maintainer review should decide whether to merge this terminal-abort path alongside the related cron-budget PRs.

Security
Cleared: The diff changes in-process abort classification, signal plumbing, gateway disconnect tagging, trajectory metadata, tests, and changelog only; it adds no dependency, workflow, package, secret, download, or execution-surface change.

Review details

Best possible solution:

Land this narrow terminal-abort contract if maintainers want reactive coverage for embedded budget exhaustion, cron timeout aborts, and HTTP disconnects, while leaving broader cron controller redesign to the related cron PRs.

Do we have a high-confidence way to reproduce the issue?

Yes. A source-level reproduction is an embedded run with a short timeoutMs and fallback candidates: scheduleAbortTimer aborts the private runAbortController with a TimeoutError, which current main can map into timeout failover; the PR adds unit and policy coverage for both the thrown cause-chain and explicit policy-flag paths.

Is this the best way to solve the issue?

Yes. The latest patch is a narrow maintainable fix for the reactive terminal-abort gap: classify explicit terminal abort markers at the fallback boundary and add a runner-owned timedOutByRunBudget flag, while preserving ordinary LLM/provider timeouts as fallback-eligible.

What I checked:

Likely related people:

  • steipete: Recent commit history routes core model-fallback and embedded-runner failover maintenance through this handle, including lazy-loader refactors, model alias fallback fixes, raw fallback schema preservation, and extraction of embedded failover helpers. (role: recent maintainer and adjacent owner; confidence: high; commits: 59fb9e5ca7fe, aec5efed8d43, 5b39be3653ac; files: src/agents/model-fallback.ts, src/agents/pi-embedded-runner/run/attempt.ts, src/agents/pi-embedded-runner/run/failover-policy.ts)
  • obviyus: Merged and recently touched the closely related tool-execution timeout exemption in the same embedded-runner failover path, which this PR parallels for run-budget exhaustion. (role: recent merger and timeout-failover maintainer; confidence: high; commits: 731f640bca0a, 2605490dbd30; files: src/agents/pi-embedded-runner/run/attempt.ts, src/agents/pi-embedded-runner/run/failover-policy.ts)
  • simonusa: Beyond authoring this PR, this handle appears in merged history for the adjacent fix(agents): exempt tool-execution timeouts from model fallback (#52147) #75873 timeout-failover exemption, making them relevant to the current policy shape. (role: prior merged contributor in this area; confidence: medium; commits: 2605490dbd30; files: src/agents/pi-embedded-runner/run/attempt.ts, src/agents/pi-embedded-runner/run/failover-policy.ts)
  • Lellansin: Introduced the merged HTTP client-disconnect abort plumbing that this PR now tags with a terminal abort reason. (role: HTTP disconnect surface contributor; confidence: medium; commits: aad3bbebdd87; files: src/gateway/http-common.ts, src/gateway/openai-http.ts, src/gateway/openresponses-http.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against 9eb79bcf997b.

@simonusa
Copy link
Copy Markdown
Contributor Author

simonusa commented May 1, 2026

Rebased onto current upstream/main (commit e47a7448e9). Three patches preserved:

Conflicts resolved across 4 files (src/agents/agent-command.ts, src/agents/model-fallback.ts, src/auto-reply/reply/agent-runner-execution.ts, src/auto-reply/reply/followup-runner.ts) — all "keep both" merges where upstream added new options to runWithModelFallback (onFallbackStep, classifyResult) and our patch adds abortSignal. No semantic conflict — both extensions land alongside each other.

Local validation: npx vitest run src/agents/model-fallback.test.ts152/152 pass, including the 6 new abort-signal tests added in this PR.

Re: the 2 CI failures (checks-node-core + checks-node-core-runtime-infra):

Both surface the same root cause — src/config/schema.base.generated.test.ts failing with "expected schema to deeply equal generated payload." This is a stale generated baseline, not a real test failure: a recent upstream config schema change wasn't followed by pnpm generate:base-config-schema. Same class as #63902 (the lint-debt failures from my prior CI run that have since been cleared). Unrelated to this PR's changes (which are scoped to src/agents/, src/auto-reply/, no config schema touch).

The original CI failures from Apr 9 (check-additional lint debt + checks-node-test OOM) are now passing — net progress from the rebase.

Ready for review.

@simonusa simonusa force-pushed the fix/terminal-abort-reasons branch from 59e763c to 72ae084 Compare May 2, 2026 16:45
simonusa added a commit to simonusa/simons-openclaw that referenced this pull request May 2, 2026
Edge case flagged by greptile on openclaw#62682: if the signal
is aborted with a terminal reason (run-budget timeout) but the thrown
error also matches a failover-normalizable shape (e.g. Google Vertex
RESOURCE_EXHAUSTED), the `shouldRethrowAbort && !normalizedFailover`
guard falls through and the chain tries the next candidate anyway.

Fix: check isTerminalAbort(signal) before running coerceToFailoverError
so the terminal signal cannot be masked by rate-limit normalization.

Adds a regression test that locks the contract: aborted signal with
TimeoutError reason + 429/AbortError thrown => first candidate runs,
error rethrown, no further candidates tried.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
simonusa added a commit to simonusa/simons-openclaw that referenced this pull request May 2, 2026
Edge case flagged by greptile on openclaw#62682: if the signal
is aborted with a terminal reason (run-budget timeout) but the thrown
error also matches a failover-normalizable shape (e.g. Google Vertex
RESOURCE_EXHAUSTED), the `shouldRethrowAbort && !normalizedFailover`
guard falls through and the chain tries the next candidate anyway.

Fix: check isTerminalAbort(signal) before running coerceToFailoverError
so the terminal signal cannot be masked by rate-limit normalization.

Adds a regression test that locks the contract: aborted signal with
TimeoutError reason + 429/AbortError thrown => first candidate runs,
error rethrown, no further candidates tried.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@simonusa simonusa force-pushed the fix/terminal-abort-reasons branch from 72ae084 to 8cd3c1a Compare May 2, 2026 17:05
simonusa and others added 4 commits May 2, 2026 12:54
 openclaw#60388)

When a model fallback chain runs after an abort, today the layer cannot tell
the difference between:

  1. "This model failed, try another"  -> fallback should retry
  2. "The whole run is over"            -> fallback should stop immediately

Two situations where the run is over and retrying wastes resources:

  - **Run-budget timeout**: The embedded runner's `scheduleAbortTimer` fires
    `runAbortController.abort(makeTimeoutAbortReason())` after the configured
    `agents.defaults.timeoutSeconds`. The budget is exhausted -- giving the
    next candidate ~0 ms remaining is guaranteed to fail and wastes API calls.
    This is openclaw#60388.

  - **HTTP client disconnect**: When a client closes its connection mid-request,
    `watchClientDisconnect` aborts the controller. No caller is left to receive
    a response, so any tokens spent on a fallback model are wasted.

Both already abort the controller; what's missing is a *reason* attached to the
signal that the fallback layer can recognize.

This patch:

  - Adds `ClientDisconnectError` and tags the abort in `watchClientDisconnect`
    via `abortController.abort(new ClientDisconnectError())`. The run-timeout
    case already tags with `name === "TimeoutError"` via `makeTimeoutAbortReason`.

  - Adds `isTerminalAbort(signal)` in `model-fallback.ts` that walks one cause
    level (because pi-embedded-runner's `makeAbortError` wraps the original
    reason as an outer AbortError with `.cause` set) and recognizes both
    `TimeoutError` and `ClientDisconnectError` as terminal markers.

  - Wires `shouldRethrowAbort(err, signal)` to short-circuit the fallback chain
    when the signal carries a terminal reason.

  - Adds `abortSignal` parameter to `runWithModelFallback` and plumbs it through
    `runFallbackAttempt` -> `runFallbackCandidate`. Updates the five callers
    that have a signal in scope (`agent-command.ts`, `agent-runner-execution.ts`,
    `agent-runner-memory.ts`, `followup-runner.ts`, `cron/isolated-agent/run-executor.ts`).

  - Six new unit tests in `model-fallback.test.ts` covering: TimeoutError reason,
    ClientDisconnectError reason, nested cause chain detection, generic reason
    falls back normally, no signal back-compat, signal-not-aborted back-compat.

`runWithImageModelFallback` also gets the `abortSignal?` parameter for consistency.

Detection is via `signal.reason` (not `err.message` or `err.name`) because the
fetch wrapper re-throws aborts as a generic `AbortError` that loses the original
tag in `.name`. The `signal.reason` survives this round-trip.

Test results: `model-fallback.test.ts` 70/70 passed, `model-fallback.probe.test.ts`
20/20 passed, `agent-command.live-model-switch.test.ts` 8/8 passed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Edge case flagged by greptile on openclaw#62682: if the signal
is aborted with a terminal reason (run-budget timeout) but the thrown
error also matches a failover-normalizable shape (e.g. Google Vertex
RESOURCE_EXHAUSTED), the `shouldRethrowAbort && !normalizedFailover`
guard falls through and the chain tries the next candidate anyway.

Fix: check isTerminalAbort(signal) before running coerceToFailoverError
so the terminal signal cannot be masked by rate-limit normalization.

Adds a regression test that locks the contract: aborted signal with
TimeoutError reason + 429/AbortError thrown => first candidate runs,
error rethrown, no further candidates tried.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Flagged by codex review on openclaw#62682: `isTerminalAbort`
only handled `reason instanceof Error`, but `src/cron/service/timer.ts:90`
aborts the run controller with a plain string (`timeoutErrorMessage()`),
so cron timeouts were falling through to fallback retries.

Fix: add a known-terminal-strings set (currently just
"cron: job execution timed out") and match string reasons against it.
Also match Error objects whose `.message` equals a known terminal
string, to cover callers that wrap the message in `new Error(...)`
before passing it to `abort()`.

Three new regression tests cover: string reason match, wrapped-in-Error
match, and back-compat (unrelated strings still flow through fallback).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Address P2/P3 bot review on PR openclaw#62682:

[P2] Embedded run-budget timeouts misclassified — `opts.abortSignal` is
the caller-provided signal (HTTP disconnect, cron wrapper). It is NOT
the embedded runner's private `runAbortController` — when
`scheduleAbortTimer` fires and calls `abortRun(true)`, only the internal
controller is aborted with a TimeoutError. The previous patch's
`isTerminalAbort(opts.abortSignal)` check therefore missed the openclaw#60388
case entirely; the fallback chain still tried alternative models even
though the whole run's deadline had elapsed.

Two complementary changes address this:

(1) `EmbeddedRunAttemptResult.timedOutByRunBudget` flag — set explicitly
    in attempt.ts when the run-budget timer fires (NOT when LLM idle
    watchdog fires; that's still LLM-phase and benefits from fallback).
    Threaded through to failover-policy `shouldRotateAssistant`
    alongside the existing `timedOutDuringCompaction` and
    `timedOutDuringToolExecution` exemptions. Also gates the
    timeout-triggered compaction branch in run.ts (compacting wouldn't
    help — the deadline is exhausted). Optional in result type for
    public harness SDK back-compat (matches openclaw#75873 pattern).

(2) `isTerminalAbortFromError(err)` — new helper that mirrors the
    existing `isTerminalAbort(signal)` checks but reads the thrown
    error's `.cause` chain. Needed because `abortable()` wraps the
    embedded controller's TimeoutError in an outer AbortError, and the
    fallback layer only sees the thrown error (not the embedded
    controller's signal). Used in `runFallbackCandidate.catch` next to
    the existing signal check.

[P3] CHANGELOG entry added under Unreleased/Fixes.

Tests: 197/197 pass in failover-policy, assistant-failover,
model-fallback, and trajectory-metadata suites. New regression cases
cover: (a) thrown error with TimeoutError in cause chain rethrows; (b)
thrown error with ClientDisconnectError in cause chain rethrows; (c)
generic AbortError still falls back; (d) failover-policy exempts
run-budget timeouts both pre- and post-rotation.

Closes openclaw#60388.
@simonusa simonusa force-pushed the fix/terminal-abort-reasons branch from 8cd3c1a to 3fa442c Compare May 2, 2026 19:54
@simonusa
Copy link
Copy Markdown
Contributor Author

simonusa commented May 2, 2026

Addressed both clawsweeper review findings in 3fa442cb0e:

[P2] Embedded run-budget timeouts now marked as terminal

The bot was correct that the prior approach (isTerminalAbort(opts.abortSignal)) only caught caller-driven aborts (HTTP disconnect, cron wrapper) — it missed the scheduleAbortTimer case described in #60388, where the embedded runner aborts a private runAbortController that the fallback layer never sees.

Two complementary changes address this:

  1. EmbeddedRunAttemptResult.timedOutByRunBudget flag — set explicitly in attempt.ts when the run-budget timer fires (NOT when LLM idle watchdog fires; that's still LLM-phase and benefits from fallback). Threaded through to failover-policy.shouldRotateAssistant alongside the existing timedOutDuringCompaction and timedOutDuringToolExecution exemptions, and gates the timeout-triggered compaction branch in run.ts. Optional in result type for public harness SDK back-compat (matches the pattern proven in fix(agents): exempt tool-execution timeouts from model fallback (#52147) #75873).

  2. isTerminalAbortFromError(err) helper — mirrors the existing isTerminalAbort(signal) checks but inspects the thrown error's .cause chain. Needed because abortable() wraps the embedded controller's TimeoutError in an outer AbortError; the fallback layer only sees the thrown error, not the embedded controller's signal.

[P3] CHANGELOG entry added under Unreleased / Fixes.

Tests: 197/197 pass in model-fallback.test.ts, failover-policy.test.ts, assistant-failover.test.ts, and trajectory/metadata.test.ts. Five new regression cases cover:

  • Thrown error with TimeoutError in cause chain → rethrows immediately (the embedded run-budget case)
  • Thrown error with ClientDisconnectError in cause chain → rethrows immediately
  • Generic AbortError without terminal cause → still falls back normally (back-compat)
  • Failover-policy: timedOutByRunBudget=true returns continue_normal both pre- and post-rotation

@steipete @obviyus — flagging since you both maintain this area and @obviyus recently merged the closely-related #75873 (timedOutDuringToolExecution exemption for tool-phase timeouts). This PR adds the parallel exemption for run-budget exhaustion, completing the coverage of non-LLM timeout causes that should skip model fallback.

Acceptance criteria from bot review (all run locally, all green):

  • pnpm test src/agents/model-fallback.test.ts
  • pnpm test src/agents/pi-embedded-runner/run/failover-policy.test.ts
  • pnpm test src/agents/pi-embedded-runner/run/assistant-failover.test.ts
  • pnpm test src/trajectory/metadata.test.ts

Closes #60388.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling gateway Gateway runtime size: L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Don't trigger model fallback when abort reason is the run's own timeout budget

1 participant