Skip to content

Commit 3fa442c

Browse files
committed
fix(agents): mark embedded run-budget timeouts as terminal aborts
Address P2/P3 bot review on PR #62682: [P2] Embedded run-budget timeouts misclassified — `opts.abortSignal` is the caller-provided signal (HTTP disconnect, cron wrapper). It is NOT the embedded runner's private `runAbortController` — when `scheduleAbortTimer` fires and calls `abortRun(true)`, only the internal controller is aborted with a TimeoutError. The previous patch's `isTerminalAbort(opts.abortSignal)` check therefore missed the #60388 case entirely; the fallback chain still tried alternative models even though the whole run's deadline had elapsed. Two complementary changes address this: (1) `EmbeddedRunAttemptResult.timedOutByRunBudget` flag — set explicitly in attempt.ts when the run-budget timer fires (NOT when LLM idle watchdog fires; that's still LLM-phase and benefits from fallback). Threaded through to failover-policy `shouldRotateAssistant` alongside the existing `timedOutDuringCompaction` and `timedOutDuringToolExecution` exemptions. Also gates the timeout-triggered compaction branch in run.ts (compacting wouldn't help — the deadline is exhausted). Optional in result type for public harness SDK back-compat (matches #75873 pattern). (2) `isTerminalAbortFromError(err)` — new helper that mirrors the existing `isTerminalAbort(signal)` checks but reads the thrown error's `.cause` chain. Needed because `abortable()` wraps the embedded controller's TimeoutError in an outer AbortError, and the fallback layer only sees the thrown error (not the embedded controller's signal). Used in `runFallbackCandidate.catch` next to the existing signal check. [P3] CHANGELOG entry added under Unreleased/Fixes. Tests: 197/197 pass in failover-policy, assistant-failover, model-fallback, and trajectory-metadata suites. New regression cases cover: (a) thrown error with TimeoutError in cause chain rethrows; (b) thrown error with ClientDisconnectError in cause chain rethrows; (c) generic AbortError still falls back; (d) failover-policy exempts run-budget timeouts both pre- and post-rotation. Closes #60388.
1 parent 62b58bd commit 3fa442c

13 files changed

Lines changed: 236 additions & 4 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,7 @@ Docs: https://docs.openclaw.ai
1111

1212
### Fixes
1313

14+
- Agents/failover: stop the model fallback chain when a run-level abort is terminal — the embedded run-budget timer (`scheduleAbortTimer`), an HTTP client disconnect, or a cron job timeout. Previously the fallback layer would continue trying additional models even though the whole run's deadline had elapsed (or the caller had gone away), wasting tokens and time on retries that could not produce a useful result. Detection covers both `signal.reason` (caller-driven) and the thrown error's `.cause` chain (embedded private controller). New `timedOutByRunBudget` attempt result flag; failover policy now skips assistant rotation when set. Closes #60388. Thanks @simonusa.
1415
- Agents/sessions: preserve terminal lifecycle state when final run metadata persists from a stale in-memory snapshot, preventing `main` sessions from staying stuck as running after completed or timed-out turns.
1516
- Status: show the `openai-codex` OAuth profile for `openai/gpt-*` sessions running through the native Codex runtime instead of reporting auth as unknown. (#76197) Thanks @mbelinky.
1617
- Plugins/externalization: keep diagnostics ClawHub packages and persisted bundled-plugin relocation on npm-first install metadata for launch, and omit Discord from the core package now that its external package is published. Thanks @vincentkoc.

src/agents/model-fallback.test.ts

Lines changed: 73 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2227,6 +2227,79 @@ describe("runWithModelFallback", () => {
22272227
expect(run).toHaveBeenCalledTimes(1);
22282228
});
22292229

2230+
it("rethrows when thrown error has TimeoutError in cause chain (embedded run-budget timer)", async () => {
2231+
// The embedded runner aborts a *private* runAbortController on
2232+
// run-budget timeout — `params.abortSignal` (the caller signal) is
2233+
// never aborted. abortable() then wraps the rejection in an outer
2234+
// AbortError whose .cause is the TimeoutError. The fallback layer
2235+
// must inspect the thrown error's cause chain (not just the signal).
2236+
// Closes #60388.
2237+
const cfg = makeCfg();
2238+
const innerTimeout = new Error("request timed out");
2239+
innerTimeout.name = "TimeoutError";
2240+
const outerAbort = new Error("aborted", { cause: innerTimeout });
2241+
outerAbort.name = "AbortError";
2242+
const run = vi.fn().mockRejectedValue(outerAbort);
2243+
2244+
// Note: NO abortSignal passed — caller signal is irrelevant for this
2245+
// path. The terminal classification must come from the thrown error.
2246+
await expect(
2247+
runWithModelFallback({
2248+
cfg,
2249+
provider: "anthropic",
2250+
model: "claude-sonnet-4-6",
2251+
run,
2252+
}),
2253+
).rejects.toBe(outerAbort);
2254+
2255+
expect(run).toHaveBeenCalledTimes(1);
2256+
});
2257+
2258+
it("rethrows when thrown error has ClientDisconnectError in cause chain", async () => {
2259+
// Parallel to the run-budget case: HTTP client disconnect can also
2260+
// surface as a wrapped AbortError whose .cause is a
2261+
// ClientDisconnectError. The fallback layer must recognize it via
2262+
// the thrown error even when params.abortSignal is unused.
2263+
const cfg = makeCfg();
2264+
const innerDisconnect = new Error("client disconnected");
2265+
innerDisconnect.name = "ClientDisconnectError";
2266+
const outerAbort = new Error("aborted", { cause: innerDisconnect });
2267+
outerAbort.name = "AbortError";
2268+
const run = vi.fn().mockRejectedValue(outerAbort);
2269+
2270+
await expect(
2271+
runWithModelFallback({
2272+
cfg,
2273+
provider: "anthropic",
2274+
model: "claude-sonnet-4-6",
2275+
run,
2276+
}),
2277+
).rejects.toBe(outerAbort);
2278+
2279+
expect(run).toHaveBeenCalledTimes(1);
2280+
});
2281+
2282+
it("falls back normally when thrown error is generic AbortError without terminal cause", async () => {
2283+
// Sanity: a generic AbortError (e.g. user pressed Esc but signal not
2284+
// tagged) should NOT trigger terminal-from-error detection — it falls
2285+
// through to the existing fallback path.
2286+
const cfg = makeCfg();
2287+
const run = vi
2288+
.fn()
2289+
.mockRejectedValueOnce(new Error("provider transient failure"))
2290+
.mockResolvedValueOnce("ok");
2291+
2292+
const result = await runWithModelFallback({
2293+
cfg,
2294+
provider: "anthropic",
2295+
model: "claude-sonnet-4-6",
2296+
run,
2297+
});
2298+
2299+
expect(result.result).toBe("ok");
2300+
expect(run).toHaveBeenCalledTimes(2);
2301+
});
2302+
22302303
it("falls back normally when signal is aborted with a non-terminal reason", async () => {
22312304
// A signal aborted with a generic error (e.g. provider-specific failure)
22322305
// should NOT skip fallback — that's a regular failover situation.

src/agents/model-fallback.ts

Lines changed: 56 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -192,6 +192,51 @@ function isTerminalAbort(signal: AbortSignal | undefined): boolean {
192192
return false;
193193
}
194194

195+
/**
196+
* Check if a thrown error itself indicates a terminal abort (vs the
197+
* caller-provided signal). Mirrors {@link isTerminalAbort} but inspects the
198+
* error's `.cause` chain instead of `signal.reason`. Needed because the
199+
* embedded runner's run-budget timer aborts a private `runAbortController`,
200+
* not the caller signal — `abortable()` then wraps the rejection in an
201+
* outer AbortError whose `.cause` is the original TimeoutError. Closes #60388.
202+
*/
203+
function isTerminalAbortFromError(err: unknown): boolean {
204+
if (!(err instanceof Error)) {
205+
return false;
206+
}
207+
const candidates: unknown[] = [err];
208+
if ("cause" in err && err.cause !== undefined) {
209+
candidates.push(err.cause);
210+
if (err.cause instanceof Error && "cause" in err.cause && err.cause.cause !== undefined) {
211+
candidates.push(err.cause.cause);
212+
}
213+
}
214+
for (const candidate of candidates) {
215+
if (typeof candidate === "string") {
216+
if (TERMINAL_ABORT_REASON_STRINGS.has(candidate)) {
217+
return true;
218+
}
219+
continue;
220+
}
221+
if (!(candidate instanceof Error)) {
222+
continue;
223+
}
224+
if (candidate.name === "TimeoutError") {
225+
return true;
226+
}
227+
if (candidate.name === "ClientDisconnectError") {
228+
return true;
229+
}
230+
if (
231+
typeof candidate.message === "string" &&
232+
TERMINAL_ABORT_REASON_STRINGS.has(candidate.message)
233+
) {
234+
return true;
235+
}
236+
}
237+
return false;
238+
}
239+
195240
function shouldRethrowAbort(err: unknown, signal?: AbortSignal): boolean {
196241
// Terminal aborts (run timeout, client disconnect) always propagate up.
197242
// The whole run is over — retrying with another model wastes resources.
@@ -319,7 +364,17 @@ async function runFallbackCandidate<T>(params: {
319364
// (e.g. a Google Vertex RESOURCE_EXHAUSTED abort that races with the run-budget
320365
// timer). Check this BEFORE coerceToFailoverError so the normalization path
321366
// cannot mask a terminal reason. Flagged by greptile review on openclaw/openclaw#62682.
322-
if (isTerminalAbort(params.abortSignal)) {
367+
//
368+
// Two paths are checked:
369+
// (1) `params.abortSignal` — the caller-provided signal. Catches HTTP
370+
// client disconnects (caller aborts on socket close) and any other
371+
// caller-driven terminal abort.
372+
// (2) the thrown error itself — catches the embedded runner's
373+
// run-budget-timer case where the embedded `runAbortController` (a
374+
// *private* controller, not `params.abortSignal`) is aborted with a
375+
// TimeoutError; `abortable()` wraps that as an outer AbortError whose
376+
// `.cause` is the TimeoutError. Closes #60388.
377+
if (isTerminalAbort(params.abortSignal) || isTerminalAbortFromError(err)) {
323378
throw err;
324379
}
325380
// Normalize abort-wrapped rate-limit errors (e.g. Google Vertex RESOURCE_EXHAUSTED)

src/agents/pi-embedded-runner/run.ts

Lines changed: 17 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1139,6 +1139,12 @@ export async function runEmbeddedPiAgent(
11391139
currentAttemptAssistant,
11401140
} = attempt;
11411141
const timedOutDuringToolExecution = attempt.timedOutDuringToolExecution ?? false;
1142+
// Optional in the public harness SDK contract; default to false. The
1143+
// embedded runner sets this explicitly when its run-budget timer
1144+
// fires. Used below by the failover-policy and model-fallback layer
1145+
// to skip the fallback chain when the whole-run deadline has elapsed
1146+
// (no other model can help). Closes #60388.
1147+
const timedOutByRunBudget = attempt.timedOutByRunBudget ?? false;
11421148
if (sessionIdUsed && sessionIdUsed !== activeSessionId) {
11431149
activeSessionId = sessionIdUsed;
11441150
}
@@ -1243,8 +1249,15 @@ export async function runEmbeddedPiAgent(
12431249
}
12441250
// ── Timeout-triggered compaction ──────────────────────────────────
12451251
// When the LLM times out with high context usage, compact before
1246-
// retrying to break the death spiral of repeated timeouts.
1247-
if (timedOut && !timedOutDuringCompaction && !timedOutDuringToolExecution) {
1252+
// retrying to break the death spiral of repeated timeouts. Skip when
1253+
// the run-budget timer fired (the whole-run deadline is already
1254+
// exhausted; compacting is wasted work). Closes #60388.
1255+
if (
1256+
timedOut &&
1257+
!timedOutDuringCompaction &&
1258+
!timedOutDuringToolExecution &&
1259+
!timedOutByRunBudget
1260+
) {
12481261
// Only consider prompt-side tokens here. API totals include output
12491262
// tokens, which can make a long generation look like high context
12501263
// pressure even when the prompt itself was small.
@@ -1937,6 +1950,7 @@ export async function runEmbeddedPiAgent(
19371950
timedOut,
19381951
timedOutDuringCompaction,
19391952
timedOutDuringToolExecution,
1953+
timedOutByRunBudget,
19401954
profileRotated: false,
19411955
});
19421956
const assistantFailoverOutcome = await handleAssistantFailover({
@@ -1950,6 +1964,7 @@ export async function runEmbeddedPiAgent(
19501964
idleTimedOut,
19511965
timedOutDuringCompaction,
19521966
timedOutDuringToolExecution,
1967+
timedOutByRunBudget,
19531968
allowSameModelIdleTimeoutRetry:
19541969
timedOut &&
19551970
idleTimedOut &&

src/agents/pi-embedded-runner/run/assistant-failover.test.ts

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -20,6 +20,7 @@ function makeParams(overrides: Partial<Params> = {}): Params {
2020
idleTimedOut: false,
2121
timedOutDuringCompaction: false,
2222
timedOutDuringToolExecution: false,
23+
timedOutByRunBudget: false,
2324
allowSameModelIdleTimeoutRetry: false,
2425
assistantProfileFailureReason: null,
2526
lastProfileId: undefined,

src/agents/pi-embedded-runner/run/assistant-failover.ts

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -43,6 +43,7 @@ export async function handleAssistantFailover(params: {
4343
idleTimedOut: boolean;
4444
timedOutDuringCompaction: boolean;
4545
timedOutDuringToolExecution: boolean;
46+
timedOutByRunBudget: boolean;
4647
allowSameModelIdleTimeoutRetry: boolean;
4748
assistantProfileFailureReason: AuthProfileFailureReason | null;
4849
lastProfileId?: string;
@@ -179,6 +180,7 @@ export async function handleAssistantFailover(params: {
179180
timedOut: params.timedOut,
180181
timedOutDuringCompaction: params.timedOutDuringCompaction,
181182
timedOutDuringToolExecution: params.timedOutDuringToolExecution,
183+
timedOutByRunBudget: params.timedOutByRunBudget,
182184
profileRotated: true,
183185
});
184186
}

src/agents/pi-embedded-runner/run/attempt.ts

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -786,6 +786,11 @@ export async function runEmbeddedAttempt(
786786
let idleTimedOut = false;
787787
let timedOutDuringCompaction = false;
788788
let timedOutDuringToolExecution = false;
789+
// True when the embedded run-budget timer (`scheduleAbortTimer`) fires.
790+
// Distinct from idleTimedOut (per-LLM-call streaming idle) and from
791+
// caller-signal aborts. This is the "the whole run's deadline is exhausted
792+
// — no fallback model can help" case. Closes #60388.
793+
let timedOutByRunBudget = false;
789794
let promptError: unknown = null;
790795
let emitDiagnosticRunCompleted:
791796
| ((outcome: "completed" | "aborted" | "error", err?: unknown) => void)
@@ -2414,6 +2419,10 @@ export async function runEmbeddedAttempt(
24142419
) {
24152420
timedOutDuringCompaction = true;
24162421
}
2422+
// Mark the run-budget exhaustion explicitly so the fallback layer
2423+
// can stop the model-fallback chain (no other model can help once
2424+
// the whole run's deadline has elapsed). Closes #60388.
2425+
timedOutByRunBudget = true;
24172426
abortRun(true);
24182427
if (!abortWarnTimer) {
24192428
abortWarnTimer = setTimeout(() => {
@@ -3477,6 +3486,7 @@ export async function runEmbeddedAttempt(
34773486
idleTimedOut,
34783487
timedOutDuringCompaction,
34793488
timedOutDuringToolExecution,
3489+
timedOutByRunBudget,
34803490
promptError: promptError ? formatErrorMessage(promptError) : undefined,
34813491
promptErrorSource,
34823492
usage: attemptUsage,
@@ -3496,6 +3506,7 @@ export async function runEmbeddedAttempt(
34963506
idleTimedOut,
34973507
timedOutDuringCompaction,
34983508
timedOutDuringToolExecution,
3509+
timedOutByRunBudget,
34993510
promptError: promptError ? formatErrorMessage(promptError) : undefined,
35003511
promptErrorSource,
35013512
usage: attemptUsage,
@@ -3521,6 +3532,7 @@ export async function runEmbeddedAttempt(
35213532
idleTimedOut,
35223533
timedOutDuringCompaction,
35233534
timedOutDuringToolExecution,
3535+
timedOutByRunBudget,
35243536
promptError: promptError ? formatErrorMessage(promptError) : undefined,
35253537
});
35263538
trajectoryEndRecorded = true;
@@ -3535,6 +3547,7 @@ export async function runEmbeddedAttempt(
35353547
idleTimedOut,
35363548
timedOutDuringCompaction,
35373549
timedOutDuringToolExecution,
3550+
timedOutByRunBudget,
35383551
promptError,
35393552
promptErrorSource,
35403553
preflightRecovery,
@@ -3580,6 +3593,7 @@ export async function runEmbeddedAttempt(
35803593
idleTimedOut,
35813594
timedOutDuringCompaction,
35823595
timedOutDuringToolExecution,
3596+
timedOutByRunBudget,
35833597
promptError: promptError ? formatErrorMessage(promptError) : undefined,
35843598
});
35853599
}

src/agents/pi-embedded-runner/run/failover-policy.test.ts

Lines changed: 47 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -73,6 +73,7 @@ describe("resolveRunFailoverDecision", () => {
7373
timedOut: false,
7474
timedOutDuringCompaction: false,
7575
timedOutDuringToolExecution: false,
76+
timedOutByRunBudget: false,
7677
profileRotated: false,
7778
}),
7879
).toEqual({
@@ -93,6 +94,7 @@ describe("resolveRunFailoverDecision", () => {
9394
timedOut: false,
9495
timedOutDuringCompaction: false,
9596
timedOutDuringToolExecution: false,
97+
timedOutByRunBudget: false,
9698
profileRotated: true,
9799
}),
98100
).toEqual({
@@ -113,6 +115,7 @@ describe("resolveRunFailoverDecision", () => {
113115
timedOut: false,
114116
timedOutDuringCompaction: false,
115117
timedOutDuringToolExecution: false,
118+
timedOutByRunBudget: false,
116119
profileRotated: false,
117120
}),
118121
).toEqual({
@@ -149,6 +152,7 @@ describe("resolveRunFailoverDecision", () => {
149152
timedOut: true,
150153
timedOutDuringCompaction: false,
151154
timedOutDuringToolExecution: true,
155+
timedOutByRunBudget: false,
152156
profileRotated: false,
153157
}),
154158
).toEqual({
@@ -168,6 +172,7 @@ describe("resolveRunFailoverDecision", () => {
168172
timedOut: true,
169173
timedOutDuringCompaction: false,
170174
timedOutDuringToolExecution: true,
175+
timedOutByRunBudget: false,
171176
profileRotated: true,
172177
}),
173178
).toEqual({
@@ -187,6 +192,7 @@ describe("resolveRunFailoverDecision", () => {
187192
timedOut: true,
188193
timedOutDuringCompaction: false,
189194
timedOutDuringToolExecution: false,
195+
timedOutByRunBudget: false,
190196
profileRotated: false,
191197
}),
192198
).toEqual({
@@ -207,13 +213,54 @@ describe("resolveRunFailoverDecision", () => {
207213
timedOut: true,
208214
timedOutDuringCompaction: false,
209215
timedOutDuringToolExecution: false,
216+
timedOutByRunBudget: false,
210217
profileRotated: false,
211218
}),
212219
).toEqual({
213220
action: "surface_error",
214221
reason: null,
215222
});
216223
});
224+
225+
it("does not rotate or fallback assistant timeouts that exhausted the run budget (#60388)", () => {
226+
expect(
227+
resolveRunFailoverDecision({
228+
stage: "assistant",
229+
aborted: true,
230+
externalAbort: false,
231+
fallbackConfigured: true,
232+
failoverFailure: false,
233+
failoverReason: null,
234+
timedOut: true,
235+
timedOutDuringCompaction: false,
236+
timedOutDuringToolExecution: false,
237+
timedOutByRunBudget: true,
238+
profileRotated: false,
239+
}),
240+
).toEqual({
241+
action: "continue_normal",
242+
});
243+
});
244+
245+
it("does not fallback assistant run-budget timeouts even after profile rotation exhausted (#60388)", () => {
246+
expect(
247+
resolveRunFailoverDecision({
248+
stage: "assistant",
249+
aborted: true,
250+
externalAbort: false,
251+
fallbackConfigured: true,
252+
failoverFailure: false,
253+
failoverReason: null,
254+
timedOut: true,
255+
timedOutDuringCompaction: false,
256+
timedOutDuringToolExecution: false,
257+
timedOutByRunBudget: true,
258+
profileRotated: true,
259+
}),
260+
).toEqual({
261+
action: "continue_normal",
262+
});
263+
});
217264
});
218265

219266
describe("mergeRetryFailoverReason", () => {

0 commit comments

Comments
 (0)