Commit 0e586bb

fix(agents): improve fallback failure observability
1 parent 63eaf8e commit 0e586bb

14 files changed

Lines changed: 166 additions & 8 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -26,6 +26,7 @@ Docs: https://docs.openclaw.ai
 - Control UI: show loading, reload, and retry states when a lazy dashboard panel cannot load after an upgrade, so the Logs tab no longer appears blank on stale browser bundles. Fixes #72450. Thanks @sobergou.
 - Agents/reasoning: recover fully wrapped unclosed `<think>` replies that would otherwise sanitize to empty text while keeping strict stripping for closed reasoning blocks and unclosed tails after visible text. Fixes #37696; supersedes #51915. Thanks @druide67 and @okuyam2y.
 - Control UI/Gateway: bind WebChat handshakes to their active socket and reject post-close server registrations, so aborted connects no longer leave zombie clients or misleading duplicate WebSocket connection logs. Fixes #72753. Thanks @LumenFromTheFuture.
+- Agents/fallback: split ambiguous provider failures into `empty_response`, `no_error_details`, and `unclassified`, and add flat fallback-step fields to structured fallback logs so primary-model failures stay visible when later fallbacks also fail. Fixes #71922; refs #71744. Thanks @andyk-ms and @nikolaykazakovvs-ux.
 - Plugins/Windows: normalize Windows absolute paths before handing bundled plugin modules to Jiti, so Feishu/Lark message sending no longer fails with unsupported `c:` ESM loader URLs. Fixes #72783. Thanks @jackychen-png.
 - CLI/doctor: run bundled plugin runtime-dependency repairs through the async npm installer with spinner/line progress and heartbeat updates, so long `openclaw doctor --fix` installs no longer look hung in TTY or piped output. Fixes #72775. Thanks @dfpalhano.
 - Feishu/Windows: normalize bundled channel sidecar loads before Jiti evaluates them, so Feishu outbound sends no longer fail with raw `C:` ESM loader errors on Windows. Fixes #72783. Thanks @jackychen-png.

docs/concepts/model-failover.md

Lines changed: 3 additions & 1 deletion
@@ -203,7 +203,7 @@ Defaults:

 ## Model fallback

-If all profiles for a provider fail, OpenClaw moves to the next model in `agents.defaults.model.fallbacks`. This applies to auth failures, rate limits, and timeouts that exhausted profile rotation (other errors do not advance fallback).
+If all profiles for a provider fail, OpenClaw moves to the next model in `agents.defaults.model.fallbacks`. This applies to auth failures, rate limits, and timeouts that exhausted profile rotation (other errors do not advance fallback). Provider errors that do not expose enough detail are still labeled precisely in fallback state: `empty_response` means the provider returned no usable message or status, `no_error_details` means the provider explicitly returned `Unknown error (no error details in response)`, and `unclassified` means OpenClaw preserved the raw preview but no classifier matched it yet.

 Overloaded and rate-limit errors are handled more aggressively than billing cooldowns. By default, OpenClaw allows one same-provider auth-profile retry, then switches to the next configured model fallback without waiting. Provider-busy signals such as `ModelNotReadyException` land in that overloaded bucket. Tune this with `auth.cooldowns.overloadedProfileRotations`, `auth.cooldowns.overloadedBackoffMs`, and `auth.cooldowns.rateLimitedProfileRotations`.

@@ -302,6 +302,8 @@ The persisted fallback override closes that window, and the narrow rollback keep
 - optional status/code
 - human-readable error summary

+Structured `model_fallback_decision` logs also include flat `fallbackStep*` fields when a candidate fails, is skipped, or a later fallback succeeds. These fields make the attempted transition explicit (`fallbackStepFromModel`, `fallbackStepToModel`, `fallbackStepFromFailureReason`, `fallbackStepFromFailureDetail`, `fallbackStepFinalOutcome`) so log and diagnostic exporters can reconstruct the primary failure even when the terminal fallback also fails.
+
 When every candidate fails, OpenClaw throws `FallbackSummaryError`. The outer reply runner can use that to build a more specific message such as "all models are temporarily rate-limited" and include the soonest cooldown expiry when one is known.

 That cooldown summary is model-aware:
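
Note on the `fallbackStep*` fields documented in the hunk above: they lend themselves to a small log-side helper. The following is a minimal sketch, not code from this commit. It assumes records arrive in emission order, and `FallbackStepRecord`, `PrimaryFailure`, and `summarizePrimaryFailure` are hypothetical names; only the `fallbackStep*` field names and the outcome values are taken from the diff.

```typescript
// Sketch only: the fallbackStep* field names and outcome values come from the
// commit; the record/summary types and the helper itself are hypothetical.
type FallbackStepOutcome = "next_fallback" | "succeeded" | "chain_exhausted";

interface FallbackStepRecord {
  event: string;
  fallbackStepType?: "fallback_step";
  fallbackStepFromModel?: string;
  fallbackStepToModel?: string;
  fallbackStepFromFailureReason?: string;
  fallbackStepFromFailureDetail?: string;
  fallbackStepChainPosition?: number;
  fallbackStepFinalOutcome?: FallbackStepOutcome;
}

interface PrimaryFailure {
  model: string;
  reason: string | undefined;
  detail: string | undefined;
  chainOutcome: FallbackStepOutcome;
}

function summarizePrimaryFailure(records: FallbackStepRecord[]): PrimaryFailure | undefined {
  // Keep only decision records that carry the flat step fields.
  const steps = records.filter(
    (r) => r.event === "model_fallback_decision" && r.fallbackStepType === "fallback_step",
  );
  const first = steps[0];
  const last = steps[steps.length - 1];
  if (!first || !last || !first.fallbackStepFromModel || !last.fallbackStepFinalOutcome) {
    return undefined;
  }
  return {
    // Assumption: records are in emission order, so the first step names the
    // first candidate that failed or was skipped.
    model: first.fallbackStepFromModel,
    reason: first.fallbackStepFromFailureReason,
    detail: first.fallbackStepFromFailureDetail,
    // The last step records how the chain ended.
    chainOutcome: last.fallbackStepFinalOutcome,
  };
}
```

Given the two records asserted in `src/agents/model-fallback.probe.test.ts`, this helper would report `openai/gpt-4.1-mini` as the failing model with reason `rate_limit` and a chain outcome of `succeeded`. Because the fields are flat, an exporter would not need to parse nested attempt arrays to get that far.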

src/agents/auth-profiles/types.ts

Lines changed: 3 additions & 0 deletions
@@ -61,6 +61,9 @@ export type AuthProfileFailureReason =
   | "timeout"
   | "model_not_found"
   | "session_expired"
+  | "empty_response"
+  | "no_error_details"
+  | "unclassified"
   | "unknown";

 /** Per-profile usage statistics for round-robin and cooldown tracking */

src/agents/auth-profiles/usage.ts

Lines changed: 8 additions & 1 deletion
@@ -45,6 +45,9 @@ const FAILURE_REASON_PRIORITY: AuthProfileFailureReason[] = [
   "overloaded",
   "timeout",
   "rate_limit",
+  "empty_response",
+  "no_error_details",
+  "unclassified",
   "unknown",
 ];
 const FAILURE_REASON_SET = new Set<AuthProfileFailureReason>(FAILURE_REASON_PRIORITY);
@@ -89,7 +92,11 @@ function shouldProbeWhamForFailure(
 ): boolean {
   return (
     normalizeProviderId(provider ?? "") === "openai-codex" &&
-    (reason === "rate_limit" || reason === "unknown")
+    (reason === "rate_limit" ||
+      reason === "empty_response" ||
+      reason === "no_error_details" ||
+      reason === "unclassified" ||
+      reason === "unknown")
   );
 }

src/agents/failover-policy.test.ts

Lines changed: 18 additions & 0 deletions
@@ -38,6 +38,24 @@ const CASES: ReasonCase[] = [
     useTransientProbeSlot: true,
     preserveTransientProbeSlot: false,
   },
+  {
+    reason: "empty_response",
+    allowCooldownProbe: true,
+    useTransientProbeSlot: true,
+    preserveTransientProbeSlot: false,
+  },
+  {
+    reason: "no_error_details",
+    allowCooldownProbe: true,
+    useTransientProbeSlot: true,
+    preserveTransientProbeSlot: false,
+  },
+  {
+    reason: "unclassified",
+    allowCooldownProbe: true,
+    useTransientProbeSlot: true,
+    preserveTransientProbeSlot: false,
+  },
   {
     reason: "model_not_found",
     allowCooldownProbe: false,

src/agents/failover-policy.ts

Lines changed: 6 additions & 0 deletions
@@ -8,6 +8,9 @@ export function shouldAllowCooldownProbeForReason(
     reason === "overloaded" ||
     reason === "billing" ||
     reason === "unknown" ||
+    reason === "empty_response" ||
+    reason === "no_error_details" ||
+    reason === "unclassified" ||
     reason === "timeout"
   );
 }
@@ -19,6 +22,9 @@ export function shouldUseTransientCooldownProbeSlot(
     reason === "rate_limit" ||
     reason === "overloaded" ||
     reason === "unknown" ||
+    reason === "empty_response" ||
+    reason === "no_error_details" ||
+    reason === "unclassified" ||
     reason === "timeout"
   );
 }

src/agents/model-fallback-observation.ts

Lines changed: 77 additions & 0 deletions
@@ -27,6 +27,68 @@ function buildErrorObservationFields(error?: string): {
   };
 }

+type FallbackStepOutcome = "next_fallback" | "succeeded" | "chain_exhausted";
+
+function formatModelRef(candidate: ModelCandidate): string {
+  return `${candidate.provider}/${candidate.model}`;
+}
+
+function buildFallbackStepFields(params: {
+  decision: "skip_candidate" | "candidate_failed" | "candidate_succeeded";
+  candidate: ModelCandidate;
+  reason?: FailoverReason | null;
+  error?: string;
+  nextCandidate?: ModelCandidate;
+  attempt?: number;
+  previousAttempts?: FallbackAttempt[];
+}):
+  | {
+      fallbackStepType: "fallback_step";
+      fallbackStepFromModel: string;
+      fallbackStepToModel?: string;
+      fallbackStepFromFailureReason?: FailoverReason;
+      fallbackStepFromFailureDetail?: string;
+      fallbackStepChainPosition?: number;
+      fallbackStepFinalOutcome: FallbackStepOutcome;
+    }
+  | undefined {
+  const lastPreviousAttempt = params.previousAttempts?.at(-1);
+  if (params.decision === "candidate_succeeded") {
+    if (!lastPreviousAttempt) {
+      return undefined;
+    }
+    return {
+      fallbackStepType: "fallback_step",
+      fallbackStepFromModel: `${lastPreviousAttempt.provider}/${lastPreviousAttempt.model}`,
+      fallbackStepToModel: formatModelRef(params.candidate),
+      ...(lastPreviousAttempt.reason
+        ? { fallbackStepFromFailureReason: lastPreviousAttempt.reason }
+        : {}),
+      ...(lastPreviousAttempt.error
+        ? { fallbackStepFromFailureDetail: lastPreviousAttempt.error }
+        : {}),
+      ...(typeof params.attempt === "number" ? { fallbackStepChainPosition: params.attempt } : {}),
+      fallbackStepFinalOutcome: "succeeded",
+    };
+  }
+
+  const observed = buildErrorObservationFields(params.error);
+  return {
+    fallbackStepType: "fallback_step",
+    fallbackStepFromModel: formatModelRef(params.candidate),
+    ...(params.nextCandidate ? { fallbackStepToModel: formatModelRef(params.nextCandidate) } : {}),
+    ...(params.reason ? { fallbackStepFromFailureReason: params.reason } : {}),
+    ...((observed.providerErrorMessagePreview ?? observed.errorPreview)
+      ? {
+          fallbackStepFromFailureDetail:
+            observed.providerErrorMessagePreview ?? observed.errorPreview,
+        }
+      : {}),
+    ...(typeof params.attempt === "number" ? { fallbackStepChainPosition: params.attempt } : {}),
+    fallbackStepFinalOutcome: params.nextCandidate ? "next_fallback" : "chain_exhausted",
+  };
+}
+
 export function logModelFallbackDecision(params: {
   decision:
     | "skip_candidate"
@@ -57,6 +119,20 @@ export function logModelFallbackDecision(params: {
   const reasonText = params.reason ?? "unknown";
   const observedError = buildErrorObservationFields(params.error);
   const detailText = observedError.providerErrorMessagePreview ?? observedError.errorPreview;
+  const fallbackStepFields =
+    params.decision === "skip_candidate" ||
+    params.decision === "candidate_failed" ||
+    params.decision === "candidate_succeeded"
+      ? buildFallbackStepFields({
+          decision: params.decision,
+          candidate: params.candidate,
+          reason: params.reason,
+          error: params.error,
+          nextCandidate: params.nextCandidate,
+          attempt: params.attempt,
+          previousAttempts: params.previousAttempts,
+        })
+      : undefined;
   const providerErrorTypeSuffix = observedError.providerErrorType
     ? ` providerErrorType=${sanitizeForLog(observedError.providerErrorType)}`
     : "";
@@ -76,6 +152,7 @@ export function logModelFallbackDecision(params: {
     status: params.status,
     code: params.code,
     ...observedError,
+    ...fallbackStepFields,
     nextCandidateProvider: params.nextCandidate?.provider,
     nextCandidateModel: params.nextCandidate?.model,
     isPrimary: params.isPrimary,

src/agents/model-fallback.probe.test.ts

Lines changed: 12 additions & 0 deletions
@@ -346,6 +346,12 @@ describe("runWithModelFallback – probe logic", () => {
           requestedModelMatched: true,
           nextCandidateProvider: "anthropic",
           nextCandidateModel: "claude-haiku-3-5",
+          fallbackStepType: "fallback_step",
+          fallbackStepFromModel: "openai/gpt-4.1-mini",
+          fallbackStepToModel: "anthropic/claude-haiku-3-5",
+          fallbackStepFromFailureReason: "rate_limit",
+          fallbackStepChainPosition: 1,
+          fallbackStepFinalOutcome: "next_fallback",
         }),
         expect.objectContaining({
           event: "model_fallback_decision",
@@ -354,6 +360,12 @@ describe("runWithModelFallback – probe logic", () => {
           candidateModel: "claude-haiku-3-5",
           isPrimary: false,
           requestedModelMatched: false,
+          fallbackStepType: "fallback_step",
+          fallbackStepFromModel: "openai/gpt-4.1-mini",
+          fallbackStepToModel: "anthropic/claude-haiku-3-5",
+          fallbackStepFromFailureReason: "rate_limit",
+          fallbackStepChainPosition: 2,
+          fallbackStepFinalOutcome: "succeeded",
         }),
       ]),
     );

src/agents/pi-embedded-helpers.isbillingerrormessage.test.ts

Lines changed: 12 additions & 2 deletions
@@ -733,9 +733,9 @@ describe("classifyFailoverReason", () => {
     ).toBeNull();
   });

-  it("classifies OpenAI Responses unknown-no-details message as unknown", () => {
+  it("classifies OpenAI Responses unknown-no-details message distinctly", () => {
     const message = "Unknown error (no error details in response)";
-    expect(classifyFailoverReason(message)).toBe("unknown");
+    expect(classifyFailoverReason(message)).toBe("no_error_details");
     expect(isFailoverErrorMessage(message)).toBe(true);
   });

@@ -1376,6 +1376,16 @@ describe("classifyProviderRuntimeFailureKind", () => {
     ).toBe("replay_invalid");
   });

+  it("splits ambiguous provider runtime failures instead of collapsing to unknown", () => {
+    expect(classifyProviderRuntimeFailureKind({})).toBe("empty_response");
+    expect(classifyProviderRuntimeFailureKind("Unknown error (no error details in response)")).toBe(
+      "no_error_details",
+    );
+    expect(classifyProviderRuntimeFailureKind("provider sent a strange opaque failure")).toBe(
+      "unclassified",
+    );
+  });
+
   it("does not classify generic config errors that mention proxy settings as proxy failures", () => {
     expect(
       classifyProviderRuntimeFailureKind(

src/agents/pi-embedded-helpers/errors.ts

Lines changed: 9 additions & 3 deletions
@@ -270,6 +270,9 @@ export type ProviderRuntimeFailureKind =
   | "schema"
   | "sandbox_blocked"
   | "replay_invalid"
+  | "empty_response"
+  | "no_error_details"
+  | "unclassified"
   | "unknown";

 const BILLING_402_HINTS = [
@@ -851,7 +854,7 @@ function classifyFailoverClassificationFromMessage(
     return toReasonClassification("format");
   }
   if (isExactUnknownNoDetailsError(raw)) {
-    return toReasonClassification("unknown");
+    return toReasonClassification("no_error_details");
   }
   if (isTimeoutErrorMessage(raw)) {
     return toReasonClassification("timeout");
@@ -900,7 +903,7 @@ export function classifyProviderRuntimeFailureKind(
   const status = inferSignalStatus(normalizedSignal);

   if (!message && typeof status !== "number") {
-    return "unknown";
+    return "empty_response";
   }
   if (normalizedSignal.code === "refresh_contention") {
     return "refresh_contention";
@@ -958,7 +961,10 @@ export function classifyProviderRuntimeFailureKind(
   if (message && isTimeoutTransportErrorMessage(message, status)) {
     return "timeout";
   }
-  return "unknown";
+  if (message && isExactUnknownNoDetailsError(message)) {
+    return "no_error_details";
+  }
+  return "unclassified";
 }

 export function formatAssistantErrorText(
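
Taken together, the `errors.ts` hunks change only the ambiguous tail of the classifier. A simplified sketch of that three-way split follows. It is not the project's implementation: `classifyAmbiguousFailure`, `AmbiguousSignal`, and `NO_DETAILS_MESSAGE` are hypothetical names, the exact-string comparison merely stands in for `isExactUnknownNoDetailsError` (whose body is not shown in this diff), and all the specific classifiers that run earlier in the real function are omitted.

```typescript
// Sketch only: mirrors the order of the checks in the diff, with assumed helpers.
type AmbiguousFailureKind = "empty_response" | "no_error_details" | "unclassified";

// Hypothetical input shape; the real function normalizes richer signals.
type AmbiguousSignal = string | { message?: string; status?: number };

// The literal message is quoted from the commit's docs and tests.
const NO_DETAILS_MESSAGE = "Unknown error (no error details in response)";

function classifyAmbiguousFailure(signal: AmbiguousSignal): AmbiguousFailureKind {
  const normalized: { message?: string; status?: number } =
    typeof signal === "string" ? { message: signal } : signal;
  const message = normalized.message?.trim();

  // Nothing usable at all: no message text and no numeric status.
  if (!message && typeof normalized.status !== "number") {
    return "empty_response";
  }
  // The provider explicitly said it has no details (assumed exact match).
  if (message === NO_DETAILS_MESSAGE) {
    return "no_error_details";
  }
  // Something was returned, but no classifier recognized it.
  return "unclassified";
}
```

The ordering matters in this sketch: the empty check runs first so that a bare status code without a message still falls through to `unclassified` rather than being reported as an empty response.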
