fix(agents): distinguish terminal aborts from retryable failures (#60388)#2114
Open
BingqingLyu wants to merge 1 commit intomainfrom
Open
fix(agents): distinguish terminal aborts from retryable failures (#60388)#2114BingqingLyu wants to merge 1 commit intomainfrom
BingqingLyu wants to merge 1 commit intomainfrom
Conversation
openclaw#60388) When a model fallback chain runs after an abort, today the layer cannot tell the difference between: 1. "This model failed, try another" -> fallback should retry 2. "The whole run is over" -> fallback should stop immediately Two situations where the run is over and retrying wastes resources: - **Run-budget timeout**: The embedded runner's `scheduleAbortTimer` fires `runAbortController.abort(makeTimeoutAbortReason())` after the configured `agents.defaults.timeoutSeconds`. The budget is exhausted -- giving the next candidate ~0 ms remaining is guaranteed to fail and wastes API calls. This is openclaw#60388. - **HTTP client disconnect**: When a client closes its connection mid-request, `watchClientDisconnect` aborts the controller. No caller is left to receive a response, so any tokens spent on a fallback model are wasted. Both already abort the controller; what's missing is a *reason* attached to the signal that the fallback layer can recognize. This patch: - Adds `ClientDisconnectError` and tags the abort in `watchClientDisconnect` via `abortController.abort(new ClientDisconnectError())`. The run-timeout case already tags with `name === "TimeoutError"` via `makeTimeoutAbortReason`. - Adds `isTerminalAbort(signal)` in `model-fallback.ts` that walks one cause level (because pi-embedded-runner's `makeAbortError` wraps the original reason as an outer AbortError with `.cause` set) and recognizes both `TimeoutError` and `ClientDisconnectError` as terminal markers. - Wires `shouldRethrowAbort(err, signal)` to short-circuit the fallback chain when the signal carries a terminal reason. - Adds `abortSignal` parameter to `runWithModelFallback` and plumbs it through `runFallbackAttempt` -> `runFallbackCandidate`. Updates the five callers that have a signal in scope (`agent-command.ts`, `agent-runner-execution.ts`, `agent-runner-memory.ts`, `followup-runner.ts`, `cron/isolated-agent/run-executor.ts`). - Six new unit tests in `model-fallback.test.ts` covering: TimeoutError reason, ClientDisconnectError reason, nested cause chain detection, generic reason falls back normally, no signal back-compat, signal-not-aborted back-compat. `runWithImageModelFallback` also gets the `abortSignal?` parameter for consistency. Detection is via `signal.reason` (not `err.message` or `err.name`) because the fetch wrapper re-throws aborts as a generic `AbortError` that loses the original tag in `.name`. The `signal.reason` survives this round-trip. Test results: `model-fallback.test.ts` 70/70 passed, `model-fallback.probe.test.ts` 20/20 passed, `agent-command.live-model-switch.test.ts` 8/8 passed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Addresses openclaw#60388 (complementary to openclaw#52365, see "Relationship to PR openclaw#52365" below)
Today the fallback layer cannot tell the difference between two very different aborts:
Two situations where the run is over and retrying with another model wastes resources:
Run-budget timeout (Don't trigger model fallback when abort reason is the run's own timeout budget openclaw/openclaw#60388): The embedded runner's
scheduleAbortTimerfiresrunAbortController.abort(makeTimeoutAbortReason())after the configuredagents.defaults.timeoutSeconds. The budget is exhausted -- giving the next candidate ~0 ms remaining is guaranteed to fail and wastes API calls. The user issue description (Don't trigger model fallback when abort reason is the run's own timeout budget openclaw/openclaw#60388) says "On a fleet of ~100 cron jobs: 30+ model-fallback events per day, almost all triggered by run timeouts."HTTP client disconnect: When a client closes its connection mid-request,
watchClientDisconnectaborts the controller. No caller is left to receive a response, so any tokens spent on a fallback model are wasted.Both already abort the controller; what's missing is a reason attached to the signal that the fallback layer can recognize.
Summary
AbortSignal.reasonfor two terminal markers (name === "TimeoutError"for run-budget,name === "ClientDisconnectError"for client disconnect) and short-circuits the chain when found.Change Type
Scope
Linked Issue
Root Cause
shouldRethrowAbort()inmodel-fallback.tschecksisFallbackAbortError(err) && !isTimeoutError(err)-- which means timeout errors are intentionally not rethrown, so the fallback chain runs them. This is correct for per-provider timeouts (provider is slow, try another), but wrong for run-budget timeouts (whole run is out of time, no point retrying). The two cannot be told apart from the error alone.signal.reasoncarryingTimeoutError(set bypi-embedded-runner/run/attempt.ts:1382-1386viamakeTimeoutAbortReason) was never inspected.Regression Test Plan
src/agents/model-fallback.test.tsdescribe("terminal abort propagation (closes #60388)"):signal.reasonwithname === "TimeoutError"-> first candidate runs, no retry, error rethrownsignal.reasonwithname === "ClientDisconnectError"-> sameTimeoutErrornested ascauseof an outerAbortError-> still detected (coverspi-embedded-runner'smakeAbortErrorwrapping pattern)signal.reasonwith a generic error -> fallback runs normally (non-terminal)abortSignalpassed -> fallback runs normally (back-compat for existing callers)abortSignalprovided but not aborted -> fallback runs normally (live-signal back-compat)run.mock.calls.length === 1to verify the chain stopped. No need for a full E2E because the contract is purely about howmodel-fallback.tsinterpretsAbortSignal.reason.User-visible / Behavior Changes
When a run-budget timeout fires (the agent run exceeds
agents.defaults.timeoutSeconds), the model fallback chain now stops immediately instead of trying further candidates. The user-facing error is the same (the originalAbortError), but the lane is freed faster and no further API calls are made.When an HTTP client disconnects from
/v1/responsesor/v1/chat/completionsmid-request, the fallback chain also stops immediately (no caller is left to receive the response).Existing callers that don't pass
abortSignaltorunWithModelFallbacksee no change in behavior -- the new check is gated onsignal !== undefined && signal.aborted.Diagram
Security Impact
Repro + Verification
Environment
Steps
agents.defaults.timeoutSeconds(e.g. 5s) and one or more model fallbacksExpected (after fix)
[model-fallback] candidate_failedentries for fallback candidates beyond the firstActual (before fix)
Evidence
The new test cases:
rethrows immediately when signal.reason has name=TimeoutError (run-budget timeout)rethrows immediately when signal.reason has name=ClientDisconnectErrordetects TimeoutError nested as cause of an outer Errorfalls back normally when signal is aborted with a non-terminal reasonfalls back normally when no abortSignal is passed (back-compat)falls back normally when signal is provided but not abortedHuman Verification
model-fallback.test.tspass after the changemodel-fallback.probe.test.tspassagent-command.live-model-switch.test.tspass/v1/responsesproduces the expected[openresponses] client disconnected, aborting streaming run runId=...log line and the agent run terminates within ~1s (the upstreamwatchClientDisconnectplumbing already inaad3bbeddworks correctly with the newClientDisconnectErrorreason tag)TimeoutError/ClientDisconnectErrorwrapped inside an outerAbortError-- this matchespi-embedded-runner/run/attempt.ts:1387'smakeAbortErrorpatterntimeoutSecondsto a very small value; the unit tests cover the contract)runWithImageModelFallback) -- it gets the newabortSignal?parameter for consistency but no callers currently pass a signalCompatibility / Migration
abortSignal?parameter is optional. Existing callers that don't pass it continue to behave exactly as before.Risks and Mitigations
signal.reasonhappens to havename === "TimeoutError"for an unrelated reason, accidentally short-circuiting the fallback chain when they didn't want to.TimeoutErrorandClientDisconnectErrormatch. Both names are already reserved for run-budget timeout (set bymakeTimeoutAbortReasoninpi-embedded-runner/run/attempt.ts) and HTTP client disconnect (set bywatchClientDisconnectingateway/http-common.tsafter this PR).TimeoutErrorfrom a per-provider request timeout (not run-budget) gets mistaken for a run-budget timeout and skips fallback when it shouldn't.runAbortController.abort(). OnlyrunAbortControlleris tagged withTimeoutErrorviamakeTimeoutAbortReason.Out of scope (for separate PRs)
executeJobCoreWithTimeout), needs per-attempt AbortController isolation. PR fix(cron): per-attempt AbortControllers and deferred execution timeout openclaw/openclaw#42482 already addresses this.requestLiveSessionModelSwitch), needs coordination state between live-model-switch and fallback layersRelationship to PR openclaw#52365
PR openclaw#52365 (fix(cron): stop fallback attempts when cron budget is exhausted) addresses the same underlying problem from openclaw#60388 but with a different, complementary mechanism:
beforeAttempthook inrunWithModelFallbackthat lets the cron layer check its remaining budget before each attempt and stop the chain if budget is too low.isTerminalAbort(signal)check inshouldRethrowAbortthat inspectssignal.reasonafter an attempt aborts to decide whether to rethrow or retry.These are complementary, not exclusive. openclaw#52365's
beforeAttempthook stops the chain before wasting an attempt when budget is known to be low. This PR's check stops the chain after the first attempt aborts with a terminal reason -- which covers both the cron-timeout case (redundantly with openclaw#52365) AND the HTTP client disconnect case (which openclaw#52365 does not address).Notable differences:
model-fallback.ts,http-common.ts,agent-command.ts, 3 auto-reply callers, and one cron caller. fix(cron): stop fallback attempts when cron budget is exhausted openclaw/openclaw#52365 is 47 files / ~1900 lines (includes unrelated slack/plugin-sdk changes)./v1/responsesand/v1/chat/completionsclient-disconnect case by tagging the abort withClientDisconnectErrorinwatchClientDisconnect. fix(cron): stop fallback attempts when cron budget is exhausted openclaw/openclaw#52365 does not touch the gateway HTTP path.signal.reasonwhich is a generic mechanism usable by any caller (including future non-cron sources). fix(cron): stop fallback attempts when cron budget is exhausted openclaw/openclaw#52365'sbeforeAttempthook requires each caller to implement its own budget-check logic.Either PR alone fixes the openclaw#60388 cron-timeout case. Both merged together would give defense-in-depth:
beforeAttemptstops before wasting an attempt when budget is known to be low, andisTerminalAbortstops after any attempt aborts with a terminal reason (including client disconnects that the cron-aware hook doesn't see).If the maintainers prefer openclaw#52365's approach and would rather not have two overlapping mechanisms, I'd suggest retargeting this PR to cover only the
ClientDisconnectErrorbranch ofisTerminalAbort(the unique coverage) and letting openclaw#52365 handle the cron-timeout case via its proactive hook.