
fix(agents): distinguish terminal aborts from retryable failures (#60388)#2114

Open
BingqingLyu wants to merge 1 commit into main from fork-pr-62682-fix-terminal-abort-reasons

Conversation

BingqingLyu (Owner) commented Apr 28, 2026

Addresses openclaw#60388 (complementary to openclaw#52365, see "Relationship to PR openclaw#52365" below)

Today the fallback layer cannot tell the difference between two very different aborts:

  1. "This model failed, try another" -> fallback should retry
  2. "The whole run is over" -> fallback should stop immediately

Two situations where the run is over and retrying with another model wastes resources:

  • Run-budget timeout: the embedded runner's scheduleAbortTimer fires runAbortController.abort(makeTimeoutAbortReason()) after the configured agents.defaults.timeoutSeconds. The budget is exhausted, so giving the next candidate ~0 ms of remaining time is guaranteed to fail. This is openclaw#60388.
  • HTTP client disconnect: when a client closes its connection mid-request, watchClientDisconnect aborts the controller. No caller is left to receive a response, so any tokens spent on a fallback model are wasted.

Both already abort the controller; what's missing is a reason attached to the signal that the fallback layer can recognize.

Summary

Change Type

  • Bug fix

Scope

  • Gateway / orchestration

Linked Issue

  • openclaw#60388

Root Cause

  • Root cause: shouldRethrowAbort() in model-fallback.ts checks isFallbackAbortError(err) && !isTimeoutError(err) -- which means timeout errors are intentionally not rethrown, so the fallback chain runs them. This is correct for per-provider timeouts (provider is slow, try another), but wrong for run-budget timeouts (whole run is out of time, no point retrying). The two cannot be told apart from the error alone.
  • Missing detection / guardrail: No check for why the abort happened. The signal.reason carrying TimeoutError (set by pi-embedded-runner/run/attempt.ts:1382-1386 via makeTimeoutAbortReason) was never inspected.
  • Contributing context: The HTTP client disconnect path added in fix(gateway): propagate AbortController signal on HTTP client disconnect openclaw/openclaw#54388 has the same shape -- it tags the abort with no reason today, so a downstream client disconnect also flows through fallback retries.
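
The shape of the fix this root cause implies can be sketched as follows. This is an illustrative reconstruction, not the actual patch: isFallbackAbortError and isTimeoutError are simplified stand-ins for the real helpers in model-fallback.ts, and the terminal-name set mirrors the two reserved names described in this PR.

```typescript
// Simplified stand-ins for the pre-existing helpers in model-fallback.ts.
const isFallbackAbortError = (err: unknown): boolean =>
  err instanceof Error && err.name === "AbortError";
const isTimeoutError = (err: unknown): boolean =>
  err instanceof Error && err.name === "TimeoutError";

const TERMINAL_NAMES = new Set(["TimeoutError", "ClientDisconnectError"]);

// Walks one cause level, because makeAbortError wraps the original reason
// as an outer AbortError with .cause set.
function isTerminalAbort(signal?: AbortSignal): boolean {
  if (!signal?.aborted) return false; // no signal, or still live: non-terminal
  const reason = signal.reason as (Error & { cause?: Error }) | undefined;
  if (reason?.name && TERMINAL_NAMES.has(reason.name)) return true;
  const cause = reason?.cause;
  return Boolean(cause?.name && TERMINAL_NAMES.has(cause.name));
}

// Before: isFallbackAbortError(err) && !isTimeoutError(err), so timeout
// aborts fell through to the fallback chain. After: a terminal signal
// reason forces a rethrow regardless of what the error object looks like.
function shouldRethrowAbort(err: unknown, signal?: AbortSignal): boolean {
  if (isTerminalAbort(signal)) return true;
  return isFallbackAbortError(err) && !isTimeoutError(err);
}
```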

Regression Test Plan

  • Coverage level: [x] Unit test
  • Target test or file: src/agents/model-fallback.test.ts
  • Scenario the tests lock in: Six new tests under describe("terminal abort propagation (closes #60388)"):
    • signal.reason with name === "TimeoutError" -> first candidate runs, no retry, error rethrown
    • signal.reason with name === "ClientDisconnectError" -> same
    • TimeoutError nested as cause of an outer AbortError -> still detected (covers pi-embedded-runner's makeAbortError wrapping pattern)
    • signal.reason with a generic error -> fallback runs normally (non-terminal)
    • No abortSignal passed -> fallback runs normally (back-compat for existing callers)
    • abortSignal provided but not aborted -> fallback runs normally (live-signal back-compat)
  • Why this is the smallest reliable guardrail: The tests construct the abort signal directly and assert on run.mock.calls.length === 1 to verify the chain stopped. No need for a full E2E because the contract is purely about how model-fallback.ts interprets AbortSignal.reason.
  • Existing test that already covers this: None.
  • If no new test is added, why not: 6 new tests added.
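
The contract the tests lock in can be modeled with a toy fallback chain. This is a sketch only: the real runWithModelFallback is async and takes richer candidate descriptors; candidates here are synchronous functions for brevity.

```typescript
// Toy fallback chain modeling the new contract: a terminally aborted signal
// stops the chain after the first failure; otherwise the chain retries.
type Candidate = () => string;

const TERMINAL = new Set(["TimeoutError", "ClientDisconnectError"]);

function isTerminalAbort(signal?: AbortSignal): boolean {
  if (!signal?.aborted) return false; // no-signal / live-signal back-compat
  const name = (signal.reason as Error | undefined)?.name ?? "";
  return TERMINAL.has(name);
}

function runWithFallback(candidates: Candidate[], signal?: AbortSignal): string {
  let lastErr: unknown;
  for (const run of candidates) {
    try {
      return run();
    } catch (err) {
      if (isTerminalAbort(signal)) throw err; // terminal: rethrow immediately
      lastErr = err; // retryable: fall through to the next candidate
    }
  }
  throw lastErr;
}
```

With a signal aborted via a TimeoutError-named reason, only the first candidate runs; with no signal at all, every candidate is tried, matching the back-compat scenarios above.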

User-visible / Behavior Changes

When a run-budget timeout fires (the agent run exceeds agents.defaults.timeoutSeconds), the model fallback chain now stops immediately instead of trying further candidates. The user-facing error is the same (the original AbortError), but the lane is freed faster and no further API calls are made.

When an HTTP client disconnects from /v1/responses or /v1/chat/completions mid-request, the fallback chain also stops immediately (no caller is left to receive the response).

Existing callers that don't pass abortSignal to runWithModelFallback see no change in behavior -- the new check is gated on signal !== undefined && signal.aborted.

Diagram

Before:
[run-budget timer fires]
  -> runAbortController.abort(TimeoutError)
  -> agent attempt throws AbortError
  -> shouldRethrowAbort(err) returns false (because isTimeoutError(err) is true)
  -> runFallbackCandidate returns { ok: false } -> tries next candidate
  -> next candidate also times out (~0ms budget left) -> tries next ...
  -> wasted API calls

After:
[run-budget timer fires]
  -> runAbortController.abort(TimeoutError)  // unchanged
  -> agent attempt throws AbortError
  -> isTerminalAbort(signal) returns true (signal.reason.name === "TimeoutError")
  -> shouldRethrowAbort(err, signal) returns true
  -> error rethrown immediately, no further candidates tried

Security Impact

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No (this REDUCES network calls in the timeout path)
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: macOS (Docker linux/amd64)
  • Runtime/container: Node 22 / OpenClaw built from this branch
  • Test runner: Vitest

Steps

  1. Configure an agent with a small agents.defaults.timeoutSeconds (e.g. 5s) and one or more model fallbacks
  2. Send a request that takes longer than the timeout
  3. Observe the fallback chain behavior

Expected (after fix)

  • Primary candidate runs, hits the timeout, fallback chain stops, single error returned
  • Logs show no [model-fallback] candidate_failed entries for fallback candidates beyond the first

Actual (before fix)

  • Primary candidate runs, hits the timeout
  • Fallback layer tries the next candidate with ~0ms budget remaining
  • Next candidate also times out, fallback layer tries the next, and so on
  • 2-3x the API calls before the chain exhausts and returns an error

Evidence

$ npx vitest run src/agents/model-fallback.test.ts

Test Files  1 passed (1)
     Tests  70 passed (70)

The new test cases:

  • rethrows immediately when signal.reason has name=TimeoutError (run-budget timeout)
  • rethrows immediately when signal.reason has name=ClientDisconnectError
  • detects TimeoutError nested as cause of an outer Error
  • falls back normally when signal is aborted with a non-terminal reason
  • falls back normally when no abortSignal is passed (back-compat)
  • falls back normally when signal is provided but not aborted

Human Verification

  • Verified scenarios:
    • All 70 tests in model-fallback.test.ts pass after the change
    • All 20 tests in model-fallback.probe.test.ts pass
    • All 8 tests in agent-command.live-model-switch.test.ts pass
    • End-to-end smoke: container rebuilt from this branch, HTTP client disconnect on /v1/responses produces the expected [openresponses] client disconnected, aborting streaming run runId=... log line and the agent run terminates within ~1s (the upstream watchClientDisconnect plumbing already in aad3bbedd works correctly with the new ClientDisconnectError reason tag)
  • Edge cases checked: cause-chain walking (one level deep) for TimeoutError/ClientDisconnectError wrapped inside an outer AbortError -- this matches pi-embedded-runner/run/attempt.ts:1387's makeAbortError pattern
  • What I did not verify:
    • Direct E2E of the fallback-skip path with a real cron timeout firing (would require setting timeoutSeconds to a very small value; the unit tests cover the contract)
    • The image-model fallback variant (runWithImageModelFallback) -- it gets the new abortSignal? parameter for consistency but no callers currently pass a signal

Compatibility / Migration

  • Backward compatible? Yes -- the new abortSignal? parameter is optional. Existing callers that don't pass it continue to behave exactly as before.
  • Config/env changes? No
  • Migration needed? No

Risks and Mitigations

  • Risk: A future caller passes an abort signal whose signal.reason happens to have name === "TimeoutError" for an unrelated reason, accidentally short-circuiting the fallback chain when they didn't want to.
    • Mitigation: The check is intentionally narrow -- only the exact name strings TimeoutError and ClientDisconnectError match. Both names are already reserved for run-budget timeout (set by makeTimeoutAbortReason in pi-embedded-runner/run/attempt.ts) and HTTP client disconnect (set by watchClientDisconnect in gateway/http-common.ts after this PR).
  • Risk: TimeoutError from a per-provider request timeout (not run-budget) gets mistaken for a run-budget timeout and skips fallback when it shouldn't.
    • Mitigation: Per-provider timeouts come from the LLM SDK's internal fetch timeout, which throws an error directly -- they don't propagate through runAbortController.abort(). Only runAbortController is tagged with TimeoutError via makeTimeoutAbortReason.

Out of scope (for separate PRs)

Relationship to PR openclaw#52365

PR openclaw#52365 (fix(cron): stop fallback attempts when cron budget is exhausted) addresses the same underlying problem from openclaw#60388 but with a different, complementary mechanism:

These are complementary, not exclusive. openclaw#52365's beforeAttempt hook stops the chain before wasting an attempt when budget is known to be low. This PR's check stops the chain after the first attempt aborts with a terminal reason -- which covers both the cron-timeout case (redundantly with openclaw#52365) AND the HTTP client disconnect case (which openclaw#52365 does not address).

Notable differences:

Either PR alone fixes the openclaw#60388 cron-timeout case. Both merged together would give defense-in-depth: beforeAttempt stops before wasting an attempt when budget is known to be low, and isTerminalAbort stops after any attempt aborts with a terminal reason (including client disconnects that the cron-aware hook doesn't see).

If the maintainers prefer openclaw#52365's approach and would rather not have two overlapping mechanisms, I'd suggest retargeting this PR to cover only the ClientDisconnectError branch of isTerminalAbort (the unique coverage) and letting openclaw#52365 handle the cron-timeout case via its proactive hook.

fix(agents): distinguish terminal aborts from retryable failures (openclaw#60388)

When a model fallback chain runs after an abort, today the layer cannot tell
the difference between:

  1. "This model failed, try another"  -> fallback should retry
  2. "The whole run is over"            -> fallback should stop immediately

Two situations where the run is over and retrying wastes resources:

  - **Run-budget timeout**: The embedded runner's `scheduleAbortTimer` fires
    `runAbortController.abort(makeTimeoutAbortReason())` after the configured
    `agents.defaults.timeoutSeconds`. The budget is exhausted -- giving the
    next candidate ~0 ms remaining is guaranteed to fail and wastes API calls.
    This is openclaw#60388.

  - **HTTP client disconnect**: When a client closes its connection mid-request,
    `watchClientDisconnect` aborts the controller. No caller is left to receive
    a response, so any tokens spent on a fallback model are wasted.

Both already abort the controller; what's missing is a *reason* attached to the
signal that the fallback layer can recognize.

This patch:

  - Adds `ClientDisconnectError` and tags the abort in `watchClientDisconnect`
    via `abortController.abort(new ClientDisconnectError())`. The run-timeout
    case already tags with `name === "TimeoutError"` via `makeTimeoutAbortReason`.

  - Adds `isTerminalAbort(signal)` in `model-fallback.ts` that walks one cause
    level (because pi-embedded-runner's `makeAbortError` wraps the original
    reason as an outer AbortError with `.cause` set) and recognizes both
    `TimeoutError` and `ClientDisconnectError` as terminal markers.

  - Wires `shouldRethrowAbort(err, signal)` to short-circuit the fallback chain
    when the signal carries a terminal reason.

  - Adds `abortSignal` parameter to `runWithModelFallback` and plumbs it through
    `runFallbackAttempt` -> `runFallbackCandidate`. Updates the five callers
    that have a signal in scope (`agent-command.ts`, `agent-runner-execution.ts`,
    `agent-runner-memory.ts`, `followup-runner.ts`, `cron/isolated-agent/run-executor.ts`).

  - Six new unit tests in `model-fallback.test.ts` covering: TimeoutError reason,
    ClientDisconnectError reason, nested cause chain detection, generic reason
    falls back normally, no signal back-compat, signal-not-aborted back-compat.

`runWithImageModelFallback` also gets the `abortSignal?` parameter for consistency.
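
The tagging side can be sketched as below. Names follow this patch's description (`ClientDisconnectError`, the abort call in `watchClientDisconnect`); the wrapper function is a hypothetical stand-in, not the real gateway code.

```typescript
// A dedicated error class whose name survives as signal.reason once the
// controller aborts, so the fallback layer can recognize the disconnect.
class ClientDisconnectError extends Error {
  name = "ClientDisconnectError";
  constructor() {
    super("HTTP client disconnected before the response completed");
  }
}

// Stand-in for the tagging done inside watchClientDisconnect per this patch:
// instead of a bare abort(), the abort carries a recognizable reason.
function onClientDisconnect(abortController: AbortController): void {
  abortController.abort(new ClientDisconnectError());
}
```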

Detection is via `signal.reason` (not `err.message` or `err.name`) because the
fetch wrapper re-throws aborts as a generic `AbortError` that loses the original
tag in `.name`. The `signal.reason` survives this round-trip.
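
The round-trip can be demonstrated with a hypothetical fetch-style wrapper (illustrative only; `fetchLike` and `observedNames` are not real project functions):

```typescript
// The wrapper discards the original reason and throws its own generic error,
// modeling how a fetch layer loses the tag in err.name.
function fetchLike(signal: AbortSignal): never {
  void signal;
  throw Object.assign(new Error("The operation was aborted"), { name: "AbortError" });
}

// Compare what the caught error reports vs. what the signal still carries.
function observedNames(signal: AbortSignal): { errName: string; reasonName: string } {
  try {
    fetchLike(signal);
  } catch (err) {
    return {
      errName: (err as Error).name,               // generic "AbortError": tag lost
      reasonName: (signal.reason as Error).name,  // original tag survives here
    };
  }
  throw new Error("unreachable");
}
```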

Test results: `model-fallback.test.ts` 70/70 passed, `model-fallback.probe.test.ts`
20/20 passed, `agent-command.live-model-switch.test.ts` 8/8 passed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


Successfully merging this pull request may close these issues.

Don't trigger model fallback when abort reason is the run's own timeout budget
