fix(agents): distinguish terminal aborts from retryable failures (#60388)#62682
fix(agents): distinguish terminal aborts from retryable failures (#60388)#62682simonusa wants to merge 4 commits intoopenclaw:mainfrom
Conversation
Greptile SummaryThis PR adds terminal abort detection to the model fallback chain to prevent wasteful API retries when a run-budget timeout or HTTP client disconnect has already terminated the run. It introduces Confidence Score: 5/5Safe to merge — the new abortSignal parameter is optional, all existing callers are unaffected, and the terminal-abort path is gated behind an explicit signal.reason name check. All remaining findings are P2 or below. The implementation is well-scoped, the double isTerminalAbort guard (pre- and post-coerceToFailoverError) is defensive but not harmful, tests cover all advertised scenarios including the failover-normalization race, and backward compatibility is preserved by design. No files require special attention. Reviews (2): Last reviewed commit: "fix(agents): distinguish terminal aborts..." | Re-trigger Greptile |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 537fee730e
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
537fee7 to
7c59102
Compare
Edge case flagged by greptile on openclaw#62682: if the signal is aborted with a terminal reason (run-budget timeout) but the thrown error also matches a failover-normalizable shape (e.g. Google Vertex RESOURCE_EXHAUSTED), the `shouldRethrowAbort && !normalizedFailover` guard falls through and the chain tries the next candidate anyway. Fix: check isTerminalAbort(signal) before running coerceToFailoverError so the terminal signal cannot be masked by rate-limit normalization. Adds a regression test that locks the contract: aborted signal with TimeoutError reason + 429/AbortError thrown => first candidate runs, error rethrown, no further candidates tried. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7c59102df7
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
|
@greptileai review |
7c59102 to
fa41f56
Compare
Flagged by codex review on openclaw#62682: `isTerminalAbort` only handled `reason instanceof Error`, but `src/cron/service/timer.ts:90` aborts the run controller with a plain string (`timeoutErrorMessage()`), so cron timeouts were falling through to fallback retries. Fix: add a known-terminal-strings set (currently just "cron: job execution timed out") and match string reasons against it. Also match Error objects whose `.message` equals a known terminal string, to cover callers that wrap the message in `new Error(...)` before passing it to `abort()`. Three new regression tests cover: string reason match, wrapped-in-Error match, and back-compat (unrelated strings still flow through fallback). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fa41f56 to
235912a
Compare
|
checks-node-test failed with an OOM crash (FATAL ERROR: Reached heap limit) unrelated to this PR. Could a maintainer re-run the check? |
235912a to
2281256
Compare
|
check-additional failures are pre-existing upstream lint debt unrelated to this PR: lint:tmp:no-random-messaging — os.tmpdir() usage in extensions/active-memory, browser, memory-core, etc. (being addressed by #63902) |
2281256 to
59e763c
Compare
Edge case flagged by greptile on openclaw#62682: if the signal is aborted with a terminal reason (run-budget timeout) but the thrown error also matches a failover-normalizable shape (e.g. Google Vertex RESOURCE_EXHAUSTED), the `shouldRethrowAbort && !normalizedFailover` guard falls through and the chain tries the next candidate anyway. Fix: check isTerminalAbort(signal) before running coerceToFailoverError so the terminal signal cannot be masked by rate-limit normalization. Adds a regression test that locks the contract: aborted signal with TimeoutError reason + 429/AbortError thrown => first candidate runs, error rethrown, no further candidates tried. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Codex review: needs maintainer review before merge. Summary Reproducibility: yes. A source-level reproduction is an embedded run with a short timeoutMs and fallback candidates: scheduleAbortTimer aborts the private runAbortController with a TimeoutError, which current main can map into timeout failover; the PR adds unit and policy coverage for both the thrown cause-chain and explicit policy-flag paths. Next step before merge Security Review detailsBest possible solution: Land this narrow terminal-abort contract if maintainers want reactive coverage for embedded budget exhaustion, cron timeout aborts, and HTTP disconnects, while leaving broader cron controller redesign to the related cron PRs. Do we have a high-confidence way to reproduce the issue? Yes. A source-level reproduction is an embedded run with a short timeoutMs and fallback candidates: scheduleAbortTimer aborts the private runAbortController with a TimeoutError, which current main can map into timeout failover; the PR adds unit and policy coverage for both the thrown cause-chain and explicit policy-flag paths. Is this the best way to solve the issue? Yes. The latest patch is a narrow maintainable fix for the reactive terminal-abort gap: classify explicit terminal abort markers at the fallback boundary and add a runner-owned timedOutByRunBudget flag, while preserving ordinary LLM/provider timeouts as fallback-eligible. What I checked:
Likely related people:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 9eb79bcf997b. |
|
Rebased onto current upstream/main (commit
Conflicts resolved across 4 files ( Local validation: Re: the 2 CI failures ( Both surface the same root cause — The original CI failures from Apr 9 ( Ready for review. |
59e763c to
72ae084
Compare
Edge case flagged by greptile on openclaw#62682: if the signal is aborted with a terminal reason (run-budget timeout) but the thrown error also matches a failover-normalizable shape (e.g. Google Vertex RESOURCE_EXHAUSTED), the `shouldRethrowAbort && !normalizedFailover` guard falls through and the chain tries the next candidate anyway. Fix: check isTerminalAbort(signal) before running coerceToFailoverError so the terminal signal cannot be masked by rate-limit normalization. Adds a regression test that locks the contract: aborted signal with TimeoutError reason + 429/AbortError thrown => first candidate runs, error rethrown, no further candidates tried. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Edge case flagged by greptile on openclaw#62682: if the signal is aborted with a terminal reason (run-budget timeout) but the thrown error also matches a failover-normalizable shape (e.g. Google Vertex RESOURCE_EXHAUSTED), the `shouldRethrowAbort && !normalizedFailover` guard falls through and the chain tries the next candidate anyway. Fix: check isTerminalAbort(signal) before running coerceToFailoverError so the terminal signal cannot be masked by rate-limit normalization. Adds a regression test that locks the contract: aborted signal with TimeoutError reason + 429/AbortError thrown => first candidate runs, error rethrown, no further candidates tried. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
72ae084 to
8cd3c1a
Compare
openclaw#60388) When a model fallback chain runs after an abort, today the layer cannot tell the difference between: 1. "This model failed, try another" -> fallback should retry 2. "The whole run is over" -> fallback should stop immediately Two situations where the run is over and retrying wastes resources: - **Run-budget timeout**: The embedded runner's `scheduleAbortTimer` fires `runAbortController.abort(makeTimeoutAbortReason())` after the configured `agents.defaults.timeoutSeconds`. The budget is exhausted -- giving the next candidate ~0 ms remaining is guaranteed to fail and wastes API calls. This is openclaw#60388. - **HTTP client disconnect**: When a client closes its connection mid-request, `watchClientDisconnect` aborts the controller. No caller is left to receive a response, so any tokens spent on a fallback model are wasted. Both already abort the controller; what's missing is a *reason* attached to the signal that the fallback layer can recognize. This patch: - Adds `ClientDisconnectError` and tags the abort in `watchClientDisconnect` via `abortController.abort(new ClientDisconnectError())`. The run-timeout case already tags with `name === "TimeoutError"` via `makeTimeoutAbortReason`. - Adds `isTerminalAbort(signal)` in `model-fallback.ts` that walks one cause level (because pi-embedded-runner's `makeAbortError` wraps the original reason as an outer AbortError with `.cause` set) and recognizes both `TimeoutError` and `ClientDisconnectError` as terminal markers. - Wires `shouldRethrowAbort(err, signal)` to short-circuit the fallback chain when the signal carries a terminal reason. - Adds `abortSignal` parameter to `runWithModelFallback` and plumbs it through `runFallbackAttempt` -> `runFallbackCandidate`. Updates the five callers that have a signal in scope (`agent-command.ts`, `agent-runner-execution.ts`, `agent-runner-memory.ts`, `followup-runner.ts`, `cron/isolated-agent/run-executor.ts`). - Six new unit tests in `model-fallback.test.ts` covering: TimeoutError reason, ClientDisconnectError reason, nested cause chain detection, generic reason falls back normally, no signal back-compat, signal-not-aborted back-compat. `runWithImageModelFallback` also gets the `abortSignal?` parameter for consistency. Detection is via `signal.reason` (not `err.message` or `err.name`) because the fetch wrapper re-throws aborts as a generic `AbortError` that loses the original tag in `.name`. The `signal.reason` survives this round-trip. Test results: `model-fallback.test.ts` 70/70 passed, `model-fallback.probe.test.ts` 20/20 passed, `agent-command.live-model-switch.test.ts` 8/8 passed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Edge case flagged by greptile on openclaw#62682: if the signal is aborted with a terminal reason (run-budget timeout) but the thrown error also matches a failover-normalizable shape (e.g. Google Vertex RESOURCE_EXHAUSTED), the `shouldRethrowAbort && !normalizedFailover` guard falls through and the chain tries the next candidate anyway. Fix: check isTerminalAbort(signal) before running coerceToFailoverError so the terminal signal cannot be masked by rate-limit normalization. Adds a regression test that locks the contract: aborted signal with TimeoutError reason + 429/AbortError thrown => first candidate runs, error rethrown, no further candidates tried. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Flagged by codex review on openclaw#62682: `isTerminalAbort` only handled `reason instanceof Error`, but `src/cron/service/timer.ts:90` aborts the run controller with a plain string (`timeoutErrorMessage()`), so cron timeouts were falling through to fallback retries. Fix: add a known-terminal-strings set (currently just "cron: job execution timed out") and match string reasons against it. Also match Error objects whose `.message` equals a known terminal string, to cover callers that wrap the message in `new Error(...)` before passing it to `abort()`. Three new regression tests cover: string reason match, wrapped-in-Error match, and back-compat (unrelated strings still flow through fallback). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Address P2/P3 bot review on PR openclaw#62682: [P2] Embedded run-budget timeouts misclassified — `opts.abortSignal` is the caller-provided signal (HTTP disconnect, cron wrapper). It is NOT the embedded runner's private `runAbortController` — when `scheduleAbortTimer` fires and calls `abortRun(true)`, only the internal controller is aborted with a TimeoutError. The previous patch's `isTerminalAbort(opts.abortSignal)` check therefore missed the openclaw#60388 case entirely; the fallback chain still tried alternative models even though the whole run's deadline had elapsed. Two complementary changes address this: (1) `EmbeddedRunAttemptResult.timedOutByRunBudget` flag — set explicitly in attempt.ts when the run-budget timer fires (NOT when LLM idle watchdog fires; that's still LLM-phase and benefits from fallback). Threaded through to failover-policy `shouldRotateAssistant` alongside the existing `timedOutDuringCompaction` and `timedOutDuringToolExecution` exemptions. Also gates the timeout-triggered compaction branch in run.ts (compacting wouldn't help — the deadline is exhausted). Optional in result type for public harness SDK back-compat (matches openclaw#75873 pattern). (2) `isTerminalAbortFromError(err)` — new helper that mirrors the existing `isTerminalAbort(signal)` checks but reads the thrown error's `.cause` chain. Needed because `abortable()` wraps the embedded controller's TimeoutError in an outer AbortError, and the fallback layer only sees the thrown error (not the embedded controller's signal). Used in `runFallbackCandidate.catch` next to the existing signal check. [P3] CHANGELOG entry added under Unreleased/Fixes. Tests: 197/197 pass in failover-policy, assistant-failover, model-fallback, and trajectory-metadata suites. New regression cases cover: (a) thrown error with TimeoutError in cause chain rethrows; (b) thrown error with ClientDisconnectError in cause chain rethrows; (c) generic AbortError still falls back; (d) failover-policy exempts run-budget timeouts both pre- and post-rotation. Closes openclaw#60388.
8cd3c1a to
3fa442c
Compare
|
Addressed both clawsweeper review findings in [P2] Embedded run-budget timeouts now marked as terminal The bot was correct that the prior approach ( Two complementary changes address this:
[P3] CHANGELOG entry added under Tests: 197/197 pass in
@steipete @obviyus — flagging since you both maintain this area and @obviyus recently merged the closely-related #75873 ( Acceptance criteria from bot review (all run locally, all green):
Closes #60388. |
Addresses #60388 (complementary to #52365, see "Relationship to PR #52365" below)
Today the fallback layer cannot tell the difference between two very different aborts:
Two situations where the run is over and retrying with another model wastes resources:
Run-budget timeout (Don't trigger model fallback when abort reason is the run's own timeout budget #60388): The embedded runner's
scheduleAbortTimerfiresrunAbortController.abort(makeTimeoutAbortReason())after the configuredagents.defaults.timeoutSeconds. The budget is exhausted -- giving the next candidate ~0 ms remaining is guaranteed to fail and wastes API calls. The user issue description (Don't trigger model fallback when abort reason is the run's own timeout budget #60388) says "On a fleet of ~100 cron jobs: 30+ model-fallback events per day, almost all triggered by run timeouts."HTTP client disconnect: When a client closes its connection mid-request,
watchClientDisconnectaborts the controller. No caller is left to receive a response, so any tokens spent on a fallback model are wasted.Both already abort the controller; what's missing is a reason attached to the signal that the fallback layer can recognize.
Summary
AbortSignal.reasonfor two terminal markers (name === "TimeoutError"for run-budget,name === "ClientDisconnectError"for client disconnect) and short-circuits the chain when found.Change Type
Scope
Linked Issue
Root Cause
shouldRethrowAbort()inmodel-fallback.tschecksisFallbackAbortError(err) && !isTimeoutError(err)-- which means timeout errors are intentionally not rethrown, so the fallback chain runs them. This is correct for per-provider timeouts (provider is slow, try another), but wrong for run-budget timeouts (whole run is out of time, no point retrying). The two cannot be told apart from the error alone.signal.reasoncarryingTimeoutError(set bypi-embedded-runner/run/attempt.ts:1382-1386viamakeTimeoutAbortReason) was never inspected.Regression Test Plan
src/agents/model-fallback.test.tsdescribe("terminal abort propagation (closes #60388)"):signal.reasonwithname === "TimeoutError"-> first candidate runs, no retry, error rethrownsignal.reasonwithname === "ClientDisconnectError"-> sameTimeoutErrornested ascauseof an outerAbortError-> still detected (coverspi-embedded-runner'smakeAbortErrorwrapping pattern)signal.reasonwith a generic error -> fallback runs normally (non-terminal)abortSignalpassed -> fallback runs normally (back-compat for existing callers)abortSignalprovided but not aborted -> fallback runs normally (live-signal back-compat)run.mock.calls.length === 1to verify the chain stopped. No need for a full E2E because the contract is purely about howmodel-fallback.tsinterpretsAbortSignal.reason.User-visible / Behavior Changes
When a run-budget timeout fires (the agent run exceeds
agents.defaults.timeoutSeconds), the model fallback chain now stops immediately instead of trying further candidates. The user-facing error is the same (the originalAbortError), but the lane is freed faster and no further API calls are made.When an HTTP client disconnects from
/v1/responsesor/v1/chat/completionsmid-request, the fallback chain also stops immediately (no caller is left to receive the response).Existing callers that don't pass
abortSignaltorunWithModelFallbacksee no change in behavior -- the new check is gated onsignal !== undefined && signal.aborted.Diagram
Security Impact
Repro + Verification
Environment
Steps
agents.defaults.timeoutSeconds(e.g. 5s) and one or more model fallbacksExpected (after fix)
[model-fallback] candidate_failedentries for fallback candidates beyond the firstActual (before fix)
Evidence
The new test cases:
rethrows immediately when signal.reason has name=TimeoutError (run-budget timeout)rethrows immediately when signal.reason has name=ClientDisconnectErrordetects TimeoutError nested as cause of an outer Errorfalls back normally when signal is aborted with a non-terminal reasonfalls back normally when no abortSignal is passed (back-compat)falls back normally when signal is provided but not abortedHuman Verification
model-fallback.test.tspass after the changemodel-fallback.probe.test.tspassagent-command.live-model-switch.test.tspass/v1/responsesproduces the expected[openresponses] client disconnected, aborting streaming run runId=...log line and the agent run terminates within ~1s (the upstreamwatchClientDisconnectplumbing already inaad3bbeddworks correctly with the newClientDisconnectErrorreason tag)TimeoutError/ClientDisconnectErrorwrapped inside an outerAbortError-- this matchespi-embedded-runner/run/attempt.ts:1387'smakeAbortErrorpatterntimeoutSecondsto a very small value; the unit tests cover the contract)runWithImageModelFallback) -- it gets the newabortSignal?parameter for consistency but no callers currently pass a signalCompatibility / Migration
abortSignal?parameter is optional. Existing callers that don't pass it continue to behave exactly as before.Risks and Mitigations
signal.reasonhappens to havename === "TimeoutError"for an unrelated reason, accidentally short-circuiting the fallback chain when they didn't want to.TimeoutErrorandClientDisconnectErrormatch. Both names are already reserved for run-budget timeout (set bymakeTimeoutAbortReasoninpi-embedded-runner/run/attempt.ts) and HTTP client disconnect (set bywatchClientDisconnectingateway/http-common.tsafter this PR).TimeoutErrorfrom a per-provider request timeout (not run-budget) gets mistaken for a run-budget timeout and skips fallback when it shouldn't.runAbortController.abort(). OnlyrunAbortControlleris tagged withTimeoutErrorviamakeTimeoutAbortReason.Out of scope (for separate PRs)
executeJobCoreWithTimeout), needs per-attempt AbortController isolation. PR fix(cron): per-attempt AbortControllers and deferred execution timeout #42482 already addresses this.requestLiveSessionModelSwitch), needs coordination state between live-model-switch and fallback layersRelationship to PR #52365
PR #52365 (fix(cron): stop fallback attempts when cron budget is exhausted) addresses the same underlying problem from #60388 but with a different, complementary mechanism:
beforeAttempthook inrunWithModelFallbackthat lets the cron layer check its remaining budget before each attempt and stop the chain if budget is too low.isTerminalAbort(signal)check inshouldRethrowAbortthat inspectssignal.reasonafter an attempt aborts to decide whether to rethrow or retry.These are complementary, not exclusive. #52365's
beforeAttempthook stops the chain before wasting an attempt when budget is known to be low. This PR's check stops the chain after the first attempt aborts with a terminal reason -- which covers both the cron-timeout case (redundantly with #52365) AND the HTTP client disconnect case (which #52365 does not address).Notable differences:
model-fallback.ts,http-common.ts,agent-command.ts, 3 auto-reply callers, and one cron caller. fix(cron): stop fallback attempts when cron budget is exhausted #52365 is 47 files / ~1900 lines (includes unrelated slack/plugin-sdk changes)./v1/responsesand/v1/chat/completionsclient-disconnect case by tagging the abort withClientDisconnectErrorinwatchClientDisconnect. fix(cron): stop fallback attempts when cron budget is exhausted #52365 does not touch the gateway HTTP path.signal.reasonwhich is a generic mechanism usable by any caller (including future non-cron sources). fix(cron): stop fallback attempts when cron budget is exhausted #52365'sbeforeAttempthook requires each caller to implement its own budget-check logic.Either PR alone fixes the #60388 cron-timeout case. Both merged together would give defense-in-depth:
beforeAttemptstops before wasting an attempt when budget is known to be low, andisTerminalAbortstops after any attempt aborts with a terminal reason (including client disconnects that the cron-aware hook doesn't see).If the maintainers prefer #52365's approach and would rather not have two overlapping mechanisms, I'd suggest retargeting this PR to cover only the
ClientDisconnectErrorbranch ofisTerminalAbort(the unique coverage) and letting #52365 handle the cron-timeout case via its proactive hook.