
fix(agents): distinguish terminal aborts from retryable failures (#60388)#2114

Open
BingqingLyu wants to merge 1 commit into main from fork-pr-62682-fix-terminal-abort-reasons

Conversation

BingqingLyu (Owner) commented Apr 28, 2026

Addresses openclaw#60388 (complementary to openclaw#52365, see "Relationship to PR openclaw#52365" below)

Today the fallback layer cannot tell the difference between two very different aborts:

  1. "This model failed, try another" -> fallback should retry
  2. "The whole run is over" -> fallback should stop immediately

Two situations where the run is over and retrying with another model wastes resources:

  • Run-budget timeout: the embedded runner's scheduleAbortTimer fires runAbortController.abort(makeTimeoutAbortReason()) after the configured agents.defaults.timeoutSeconds. The budget is exhausted, so giving the next candidate ~0 ms of remaining time is guaranteed to fail. This is openclaw#60388.
  • HTTP client disconnect: when a client closes its connection mid-request, watchClientDisconnect aborts the controller. No caller is left to receive a response, so any tokens spent on a fallback model are wasted.

Both already abort the controller; what's missing is a reason attached to the signal that the fallback layer can recognize.

Summary

Change Type

  • Bug fix

Scope

  • Gateway / orchestration

Linked Issue

  • openclaw#60388

Root Cause

  • Root cause: shouldRethrowAbort() in model-fallback.ts checks isFallbackAbortError(err) && !isTimeoutError(err) -- which means timeout errors are intentionally not rethrown, so the fallback chain runs them. This is correct for per-provider timeouts (provider is slow, try another), but wrong for run-budget timeouts (whole run is out of time, no point retrying). The two cannot be told apart from the error alone.
  • Missing detection / guardrail: No check for why the abort happened. The signal.reason carrying TimeoutError (set by pi-embedded-runner/run/attempt.ts:1382-1386 via makeTimeoutAbortReason) was never inspected.
  • Contributing context: The HTTP client disconnect path added in fix(gateway): propagate AbortController signal on HTTP client disconnect openclaw/openclaw#54388 has the same shape -- it tags the abort with no reason today, so a downstream client disconnect also flows through fallback retries.
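
The shape of the fix this root cause implies can be sketched as follows. This is an illustrative reconstruction, not the actual patch: isFallbackAbortError and isTimeoutError are simplified stand-ins for the real helpers in model-fallback.ts, and the terminal-name set mirrors the two reserved names described in this PR.

```typescript
// Simplified stand-ins for the pre-existing helpers in model-fallback.ts.
const isFallbackAbortError = (err: unknown): boolean =>
  err instanceof Error && err.name === "AbortError";
const isTimeoutError = (err: unknown): boolean =>
  err instanceof Error && err.name === "TimeoutError";

const TERMINAL_NAMES = new Set(["TimeoutError", "ClientDisconnectError"]);

// Walks one cause level, because makeAbortError wraps the original reason
// as an outer AbortError with .cause set.
function isTerminalAbort(signal?: AbortSignal): boolean {
  if (!signal?.aborted) return false; // no signal, or still live: non-terminal
  const reason = signal.reason as (Error & { cause?: Error }) | undefined;
  if (reason?.name && TERMINAL_NAMES.has(reason.name)) return true;
  const cause = reason?.cause;
  return Boolean(cause?.name && TERMINAL_NAMES.has(cause.name));
}

// Before: isFallbackAbortError(err) && !isTimeoutError(err), so timeout
// aborts fell through to the fallback chain. After: a terminal signal
// reason forces a rethrow regardless of what the error object looks like.
function shouldRethrowAbort(err: unknown, signal?: AbortSignal): boolean {
  if (isTerminalAbort(signal)) return true;
  return isFallbackAbortError(err) && !isTimeoutError(err);
}
```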

Regression Test Plan

  • Coverage level: [x] Unit test
  • Target test or file: src/agents/model-fallback.test.ts
  • Scenario the tests lock in: Six new tests under describe("terminal abort propagation (closes #60388)"):
    • signal.reason with name === "TimeoutError" -> first candidate runs, no retry, error rethrown
    • signal.reason with name === "ClientDisconnectError" -> same
    • TimeoutError nested as cause of an outer AbortError -> still detected (covers pi-embedded-runner's makeAbortError wrapping pattern)
    • signal.reason with a generic error -> fallback runs normally (non-terminal)
    • No abortSignal passed -> fallback runs normally (back-compat for existing callers)
    • abortSignal provided but not aborted -> fallback runs normally (live-signal back-compat)
  • Why this is the smallest reliable guardrail: The tests construct the abort signal directly and assert on run.mock.calls.length === 1 to verify the chain stopped. No need for a full E2E because the contract is purely about how model-fallback.ts interprets AbortSignal.reason.
  • Existing test that already covers this: None.
  • If no new test is added, why not: 6 new tests added.
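
The contract the tests lock in can be modeled with a toy fallback chain. This is a sketch only: the real runWithModelFallback is async and takes richer candidate descriptors; candidates here are synchronous functions for brevity.

```typescript
// Toy fallback chain modeling the new contract: a terminally aborted signal
// stops the chain after the first failure; otherwise the chain retries.
type Candidate = () => string;

const TERMINAL = new Set(["TimeoutError", "ClientDisconnectError"]);

function isTerminalAbort(signal?: AbortSignal): boolean {
  if (!signal?.aborted) return false; // no-signal / live-signal back-compat
  const name = (signal.reason as Error | undefined)?.name ?? "";
  return TERMINAL.has(name);
}

function runWithFallback(candidates: Candidate[], signal?: AbortSignal): string {
  let lastErr: unknown;
  for (const run of candidates) {
    try {
      return run();
    } catch (err) {
      if (isTerminalAbort(signal)) throw err; // terminal: rethrow immediately
      lastErr = err; // retryable: fall through to the next candidate
    }
  }
  throw lastErr;
}
```

With a signal aborted via a TimeoutError-named reason, only the first candidate runs; with no signal at all, every candidate is tried, matching the back-compat scenarios above.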

User-visible / Behavior Changes

When a run-budget timeout fires (the agent run exceeds agents.defaults.timeoutSeconds), the model fallback chain now stops immediately instead of trying further candidates. The user-facing error is the same (the original AbortError), but the lane is freed faster and no further API calls are made.

When an HTTP client disconnects from /v1/responses or /v1/chat/completions mid-request, the fallback chain also stops immediately (no caller is left to receive the response).

Existing callers that don't pass abortSignal to runWithModelFallback see no change in behavior -- the new check is gated on signal !== undefined && signal.aborted.

Diagram

Before:
[run-budget timer fires]
  -> runAbortController.abort(TimeoutError)
  -> agent attempt throws AbortError
  -> shouldRethrowAbort(err) returns false (because isTimeoutError(err) is true)
  -> runFallbackCandidate returns { ok: false } -> tries next candidate
  -> next candidate also times out (~0ms budget left) -> tries next ...
  -> wasted API calls

After:
[run-budget timer fires]
  -> runAbortController.abort(TimeoutError)  // unchanged
  -> agent attempt throws AbortError
  -> isTerminalAbort(signal) returns true (signal.reason.name === "TimeoutError")
  -> shouldRethrowAbort(err, signal) returns true
  -> error rethrown immediately, no further candidates tried

Security Impact

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No (this REDUCES network calls in the timeout path)
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: macOS (Docker linux/amd64)
  • Runtime/container: Node 22 / OpenClaw built from this branch
  • Test runner: Vitest

Steps

  1. Configure an agent with a small agents.defaults.timeoutSeconds (e.g. 5s) and one or more model fallbacks
  2. Send a request that takes longer than the timeout
  3. Observe the fallback chain behavior

Expected (after fix)

  • Primary candidate runs, hits the timeout, fallback chain stops, single error returned
  • Logs show no [model-fallback] candidate_failed entries for fallback candidates beyond the first

Actual (before fix)

  • Primary candidate runs, hits the timeout
  • Fallback layer tries the next candidate with ~0ms budget remaining
  • Next candidate also times out, fallback layer tries the next, and so on
  • 2-3x the API calls before the chain exhausts and returns an error

Evidence

$ npx vitest run src/agents/model-fallback.test.ts

Test Files  1 passed (1)
     Tests  70 passed (70)

The new test cases:

  • rethrows immediately when signal.reason has name=TimeoutError (run-budget timeout)
  • rethrows immediately when signal.reason has name=ClientDisconnectError
  • detects TimeoutError nested as cause of an outer Error
  • falls back normally when signal is aborted with a non-terminal reason
  • falls back normally when no abortSignal is passed (back-compat)
  • falls back normally when signal is provided but not aborted

Human Verification

  • Verified scenarios:
    • All 70 tests in model-fallback.test.ts pass after the change
    • All 20 tests in model-fallback.probe.test.ts pass
    • All 8 tests in agent-command.live-model-switch.test.ts pass
    • End-to-end smoke: container rebuilt from this branch, HTTP client disconnect on /v1/responses produces the expected [openresponses] client disconnected, aborting streaming run runId=... log line and the agent run terminates within ~1s (the upstream watchClientDisconnect plumbing already in aad3bbedd works correctly with the new ClientDisconnectError reason tag)
  • Edge cases checked: cause-chain walking (one level deep) for TimeoutError/ClientDisconnectError wrapped inside an outer AbortError -- this matches pi-embedded-runner/run/attempt.ts:1387's makeAbortError pattern
  • What I did not verify:
    • Direct E2E of the fallback-skip path with a real cron timeout firing (would require setting timeoutSeconds to a very small value; the unit tests cover the contract)
    • The image-model fallback variant (runWithImageModelFallback) -- it gets the new abortSignal? parameter for consistency but no callers currently pass a signal

Compatibility / Migration

  • Backward compatible? Yes -- the new abortSignal? parameter is optional. Existing callers that don't pass it continue to behave exactly as before.
  • Config/env changes? No
  • Migration needed? No

Risks and Mitigations

  • Risk: A future caller passes an abort signal whose signal.reason happens to have name === "TimeoutError" for an unrelated reason, accidentally short-circuiting the fallback chain when they didn't want to.
    • Mitigation: The check is intentionally narrow -- only the exact name strings TimeoutError and ClientDisconnectError match. Both names are already reserved for run-budget timeout (set by makeTimeoutAbortReason in pi-embedded-runner/run/attempt.ts) and HTTP client disconnect (set by watchClientDisconnect in gateway/http-common.ts after this PR).
  • Risk: TimeoutError from a per-provider request timeout (not run-budget) gets mistaken for a run-budget timeout and skips fallback when it shouldn't.
    • Mitigation: Per-provider timeouts come from the LLM SDK's internal fetch timeout, which throws an error directly -- they don't propagate through runAbortController.abort(). Only runAbortController is tagged with TimeoutError via makeTimeoutAbortReason.

Out of scope (for separate PRs)

Relationship to PR openclaw#52365

PR openclaw#52365 (fix(cron): stop fallback attempts when cron budget is exhausted) addresses the same underlying problem from openclaw#60388 but with a different, complementary mechanism:

These are complementary, not exclusive. openclaw#52365's beforeAttempt hook stops the chain before wasting an attempt when budget is known to be low. This PR's check stops the chain after the first attempt aborts with a terminal reason -- which covers both the cron-timeout case (redundantly with openclaw#52365) AND the HTTP client disconnect case (which openclaw#52365 does not address).

Notable differences:

Either PR alone fixes the openclaw#60388 cron-timeout case. Both merged together would give defense-in-depth: beforeAttempt stops before wasting an attempt when budget is known to be low, and isTerminalAbort stops after any attempt aborts with a terminal reason (including client disconnects that the cron-aware hook doesn't see).

If the maintainers prefer openclaw#52365's approach and would rather not have two overlapping mechanisms, I'd suggest retargeting this PR to cover only the ClientDisconnectError branch of isTerminalAbort (the unique coverage) and letting openclaw#52365 handle the cron-timeout case via its proactive hook.

fix(agents): distinguish terminal aborts from retryable failures (openclaw#60388)

When a model fallback chain runs after an abort, today the layer cannot tell
the difference between:

  1. "This model failed, try another"  -> fallback should retry
  2. "The whole run is over"            -> fallback should stop immediately

Two situations where the run is over and retrying wastes resources:

  - **Run-budget timeout**: The embedded runner's `scheduleAbortTimer` fires
    `runAbortController.abort(makeTimeoutAbortReason())` after the configured
    `agents.defaults.timeoutSeconds`. The budget is exhausted -- giving the
    next candidate ~0 ms remaining is guaranteed to fail and wastes API calls.
    This is openclaw#60388.

  - **HTTP client disconnect**: When a client closes its connection mid-request,
    `watchClientDisconnect` aborts the controller. No caller is left to receive
    a response, so any tokens spent on a fallback model are wasted.

Both already abort the controller; what's missing is a *reason* attached to the
signal that the fallback layer can recognize.

This patch:

  - Adds `ClientDisconnectError` and tags the abort in `watchClientDisconnect`
    via `abortController.abort(new ClientDisconnectError())`. The run-timeout
    case already tags with `name === "TimeoutError"` via `makeTimeoutAbortReason`.

  - Adds `isTerminalAbort(signal)` in `model-fallback.ts` that walks one cause
    level (because pi-embedded-runner's `makeAbortError` wraps the original
    reason as an outer AbortError with `.cause` set) and recognizes both
    `TimeoutError` and `ClientDisconnectError` as terminal markers.

  - Wires `shouldRethrowAbort(err, signal)` to short-circuit the fallback chain
    when the signal carries a terminal reason.

  - Adds `abortSignal` parameter to `runWithModelFallback` and plumbs it through
    `runFallbackAttempt` -> `runFallbackCandidate`. Updates the five callers
    that have a signal in scope (`agent-command.ts`, `agent-runner-execution.ts`,
    `agent-runner-memory.ts`, `followup-runner.ts`, `cron/isolated-agent/run-executor.ts`).

  - Six new unit tests in `model-fallback.test.ts` covering: TimeoutError reason,
    ClientDisconnectError reason, nested cause chain detection, generic reason
    falls back normally, no signal back-compat, signal-not-aborted back-compat.

`runWithImageModelFallback` also gets the `abortSignal?` parameter for consistency.
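
The tagging side can be sketched as below. Names follow this patch's description (`ClientDisconnectError`, the abort call in `watchClientDisconnect`); the wrapper function is a hypothetical stand-in, not the real gateway code.

```typescript
// A dedicated error class whose name survives as signal.reason once the
// controller aborts, so the fallback layer can recognize the disconnect.
class ClientDisconnectError extends Error {
  name = "ClientDisconnectError";
  constructor() {
    super("HTTP client disconnected before the response completed");
  }
}

// Stand-in for the tagging done inside watchClientDisconnect per this patch:
// instead of a bare abort(), the abort carries a recognizable reason.
function onClientDisconnect(abortController: AbortController): void {
  abortController.abort(new ClientDisconnectError());
}
```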

Detection is via `signal.reason` (not `err.message` or `err.name`) because the
fetch wrapper re-throws aborts as a generic `AbortError` that loses the original
tag in `.name`. The `signal.reason` survives this round-trip.
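
The round-trip can be demonstrated with a hypothetical fetch-style wrapper (illustrative only; `fetchLike` and `observedNames` are not real project functions):

```typescript
// The wrapper discards the original reason and throws its own generic error,
// modeling how a fetch layer loses the tag in err.name.
function fetchLike(signal: AbortSignal): never {
  void signal;
  throw Object.assign(new Error("The operation was aborted"), { name: "AbortError" });
}

// Compare what the caught error reports vs. what the signal still carries.
function observedNames(signal: AbortSignal): { errName: string; reasonName: string } {
  try {
    fetchLike(signal);
  } catch (err) {
    return {
      errName: (err as Error).name,               // generic "AbortError": tag lost
      reasonName: (signal.reason as Error).name,  // original tag survives here
    };
  }
  throw new Error("unreachable");
}
```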

Test results: `model-fallback.test.ts` 70/70 passed, `model-fallback.probe.test.ts`
20/20 passed, `agent-command.live-model-switch.test.ts` 8/8 passed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


Successfully merging this pull request may close these issues.

Don't trigger model fallback when abort reason is the run's own timeout budget
