fix(runner): throw FailoverError on assistant surface_error so webchat renders provider failures#70848
Conversation
Greptile SummaryThis PR fixes a silent drop of provider failures (billing, auth, rate-limit) in the Confidence Score: 5/5Safe to merge; the fix is well-scoped, the external-abort carve-out is preserved, and all seven new tests pass. The only finding is P2: the new surface_error FailoverError omits rawError that the fallback_model branch includes, which is a minor diagnostic inconsistency with no functional impact on the bug fix itself. No files require special attention. Prompt To Fix All With AIThis is a comment left during a code review.
Path: src/agents/pi-embedded-runner/run/assistant-failover.ts
Line: 221-231
Comment:
**`rawError` omitted from `surface_error` FailoverError**
The `fallback_model` branch passes `rawError: params.lastAssistant?.errorMessage?.trim()` to `FailoverError`, but the new `surface_error` branch does not. If the client or logging pipeline uses `rawError` for debugging (e.g., to surface the original provider error string alongside the resolved message), that context is silently dropped on the `surface_error` path. Consider adding it for consistency.
```suggestion
return {
action: "throw",
overloadProfileRotations,
error: new FailoverError(message, {
reason,
provider: params.activeErrorContext.provider,
model: params.activeErrorContext.model,
profileId: params.lastProfileId,
status,
rawError: params.lastAssistant?.errorMessage?.trim(),
}),
};
```
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "fix(runner): throw FailoverError on assi..." | Re-trigger Greptile |
| return { | ||
| action: "throw", | ||
| overloadProfileRotations, | ||
| error: new FailoverError(message, { | ||
| reason, | ||
| provider: params.activeErrorContext.provider, | ||
| model: params.activeErrorContext.model, | ||
| profileId: params.lastProfileId, | ||
| status, | ||
| }), | ||
| }; |
There was a problem hiding this comment.
rawError omitted from surface_error FailoverError
The fallback_model branch passes rawError: params.lastAssistant?.errorMessage?.trim() to FailoverError, but the new surface_error branch does not. If the client or logging pipeline uses rawError for debugging (e.g., to surface the original provider error string alongside the resolved message), that context is silently dropped on the surface_error path. Consider adding it for consistency.
| return { | |
| action: "throw", | |
| overloadProfileRotations, | |
| error: new FailoverError(message, { | |
| reason, | |
| provider: params.activeErrorContext.provider, | |
| model: params.activeErrorContext.model, | |
| profileId: params.lastProfileId, | |
| status, | |
| }), | |
| }; | |
| return { | |
| action: "throw", | |
| overloadProfileRotations, | |
| error: new FailoverError(message, { | |
| reason, | |
| provider: params.activeErrorContext.provider, | |
| model: params.activeErrorContext.model, | |
| profileId: params.lastProfileId, | |
| status, | |
| rawError: params.lastAssistant?.errorMessage?.trim(), | |
| }), | |
| }; |
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/pi-embedded-runner/run/assistant-failover.ts
Line: 221-231
Comment:
**`rawError` omitted from `surface_error` FailoverError**
The `fallback_model` branch passes `rawError: params.lastAssistant?.errorMessage?.trim()` to `FailoverError`, but the new `surface_error` branch does not. If the client or logging pipeline uses `rawError` for debugging (e.g., to surface the original provider error string alongside the resolved message), that context is silently dropped on the `surface_error` path. Consider adding it for consistency.
```suggestion
return {
action: "throw",
overloadProfileRotations,
error: new FailoverError(message, {
reason,
provider: params.activeErrorContext.provider,
model: params.activeErrorContext.model,
profileId: params.lastProfileId,
status,
rawError: params.lastAssistant?.errorMessage?.trim(),
}),
};
```
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Fixed in 09ff3dfc7b. The throw path now mirrors fallback_model and sets rawError: params.lastAssistant?.errorMessage?.trim() on the FailoverError. New test populates rawError on the thrown FailoverError with the lastAssistant errorMessage locks it in.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1d41aeb28d
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| // no other synthesis path caught them). Throw a FailoverError so | ||
| // the client surface can render it the same way it already | ||
| // renders fallback_model failures. | ||
| if (!params.externalAbort && !params.timedOut) { |
There was a problem hiding this comment.
Restrict surface-error throw to real assistant failures
This new throw branch runs for every non-timeout surface_error, but surface_error is not limited to stopReason === "error": in run.ts, assistantFailoverReason is derived from assistantForFailover?.errorMessage even on successful turns, so stale errorMessage values can still drive rotate_profile -> surface_error. In single-profile/no-fallback sessions, that now throws here and aborts normal payload construction, turning a successful assistant reply into a hard error. Please gate this throw on an actual assistant failure signal (e.g. failoverFailure / errored stop reason) so successful turns with stale error metadata still return via continue_normal.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch. Confirmed by reading run.ts:1476-1481 and errors.ts:1105,1129,1222,1269,1253 — classifyFailoverReason runs as a pure string classifier without a stopReason === "error" gate, so a successful turn whose lastAssistant.errorMessage happens to match billing/auth/rate_limit/etc can drive shouldRotateAssistant through its failoverReason !== null branch. failoverFailure IS gated on stopReason === "error" by the isXAssistantError helpers, so it's the correct signal that a concrete provider failure occurred on this attempt.
Fixed in 09ff3dfc7b. The throw guard is now !params.externalAbort && !params.timedOut && params.failoverFailure. #70124 still lands because Anthropic billing errors have stopReason === "error", which sets billingFailure AND failoverFailure to true. New test leaves successful turns with a stale classified errorMessage on the continue_normal path locks in the gate.
…t renders provider failures Close openclaw#70124. When `handleAssistantFailover` resolved `surface_error`, it logged the decision and then fell through to `continue_normal`. The next iteration of `buildEmbeddedRunPayloads` would see the partial assistant output and drop the provider error silently. Gateway logs recorded `decision=surface_error reason=billing`, but the webchat never rendered anything. The reporter hit this on Anthropic `invalid_request_error` billing responses, and the same path applies to every `surface_error` failure that isn't an external abort. Match the `fallback_model` behavior: on any non-externalAbort `surface_error`, throw a `FailoverError` with the resolved reason, provider, model, profileId, and status. The existing catch in `run.ts` already wraps thrown `FailoverError` instances into the payload the webchat renders, so the one-site change reuses the whole client surface. Three small pieces: - Extract `resolveAssistantFailoverErrorMessage` and share it with the existing `fallback_model` branch. Removes the 22-line duplicate that would otherwise exist between the two throw paths. - Add `resolveSurfaceErrorReason` to coerce a null `decision.reason` onto the most specific failure the run observed. `surface_error` decisions can arrive with `reason: null` when `shouldRotateAssistant` fired on `failoverFailure` or `timedOut` without a classified upstream reason, and `FailoverError` requires a concrete reason. - Preserve the existing external-abort and same-model-idle-timeout retry paths. External aborts still fall through to `continue_normal` so partial assistant output carries the turn; the idle-timeout retry guard is unchanged. AI-assisted; verified via new `assistant-failover.test.ts` covering billing/auth/rate_limit throw paths, null-reason coercion, external abort preservation, idle-timeout retry, and a regression guard on the existing `fallback_model` branch after the message-builder refactor.
The original fix in this PR widened the throw condition to `!externalAbort`, which broke timeout-compaction retry coverage. `run.ts` already synthesizes a timeout-error payload when `buildEmbeddedRunPayloads` produces no assistant content (the `timedOut && !timedOutDuringCompaction && !payloadsWithToolMedia.length` block). Throwing a FailoverError from the surface_error branch for plain timeouts short-circuits that synthesis. Narrow the guard to `!externalAbort && !timedOut`. The throw path now covers only non-timeout, non-external-abort surface_error decisions, which is the true silent-drop case openclaw#70124 reported. Drop the dead `timedOut` branch from `resolveSurfaceErrorReason` now that the helper is only called on the non-timeout path. Test coverage: update the null-reason coercion test to the non-timeout shape, and add a dedicated test locking in that plain timeouts keep falling through to `continue_normal`. AI-assisted.
09ff3df to
bc22bde
Compare
Surface non-retryable assistant provider failures from the embedded runner instead of letting surface_error fall through to continue_normal. - Preserve external abort and plain timeout fall-through paths. - Preserve raw provider error diagnostics on surfaced FailoverError. - Add regression coverage for billing/auth/rate-limit/null-reason/error fall-through cases. - Update changelog. Fixes openclaw#70124. Thanks @truffle-dev.
Surface non-retryable assistant provider failures from the embedded runner instead of letting surface_error fall through to continue_normal. - Preserve external abort and plain timeout fall-through paths. - Preserve raw provider error diagnostics on surfaced FailoverError. - Add regression coverage for billing/auth/rate-limit/null-reason/error fall-through cases. - Update changelog. Fixes openclaw#70124. Thanks @truffle-dev.
Summary
handleAssistantFailoverresolvedsurface_error, logged the decision, then fell through tocontinue_normal.buildEmbeddedRunPayloadssaw the partial assistant output and dropped the provider error silently, so billing/auth/rate_limit failures never reached the webchat.invalid_request_error(billing). Gateway logs showdecision=surface_error reason=billing, but the UI renders nothing. The same path applies to everysurface_errorthat isn't an external abort.surface_error, throw aFailoverErrorwith the resolved reason/provider/model/profileId/status, matching the existingfallback_modelbehavior. The catch inrun.tsalready wraps thrownFailoverErrorinto the payload the webchat renders, so the one-site change reuses the whole client surface.continue_normal), same-model-idle-timeout retry path, and therotate_profile/fallback_modelbranches are untouched.Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
Root Cause (if applicable)
surface_errorbranch inhandleAssistantFailovercalledlogAssistantFailoverDecision("surface_error")and then fell through toreturn { action: "continue_normal", ... }. Control returned tobuildEmbeddedRunPayloads, which treated the partial assistant message as a completed turn and dropped the provider error.handleAssistantFailoverthrows aFailoverErroronsurface_error.failover-policy.test.tscovered decision resolution but not post-decision outcome side effects.fallback_modelbranch already throwsFailoverErrorfor the gateway catch to lift into the webchat payload. The two branches were asymmetric;surface_errorwas the only failure decision that reachedcontinue_normal.Regression Test Plan (if applicable)
src/agents/pi-embedded-runner/run/assistant-failover.test.ts(new).FailoverErrorwith the right status; null decision reasons coerce onto the most specific observed failure (timedOut -> timeout/408);externalAbortstill falls through tocontinue_normal; idle-timeout same-model retry preserved;fallback_modelregression guard after the message-builder refactor.handleAssistantFailoveris pure given its params. Unit-level tests directly assert the outcome action andFailoverErrorfields; no need for a heavier runner harness.failover-policy.test.tscovers decision resolution but not outcome side effects.User-visible / Behavior Changes
Billing, auth, rate_limit, and timeout failures from the assistant turn now surface to the webchat as concrete provider error messages (matching what
fallback_modelalready produced) instead of an empty completion. External aborts (user pressed stop) continue to fall through without synthesizing a provider error.Diagram (if applicable)
Security Impact (required)
NoNoNoNoNoYes, explain risk + mitigation: N/ARepro + Verification
Environment
claude-haiku-4-5-20251001(billing error path)billingwithfailoverFailure=true,fallbackConfigured=false.Steps
invalid_request_erroron the assistant turn (or stub the assistant message to setbillingFailure=truewithfailoverReason=billing).decision=surface_error reason=billing.Expected
FailoverErrorpayload path.Actual
run.tslifts theFailoverErrorinto the payload the client already knows how to render.Evidence
New
assistant-failover.test.tsfails against the pre-patch file (7/7 tests expectaction: "throw"with a concreteFailoverError; pre-patch returnsaction: "continue_normal"). Against the patched file, all 7 pass.Human Verification (required)
pnpm teston the newassistant-failover.test.ts(7/7 pass);pnpm check:changed(239 tests green);pnpm lint:core(0 warnings / 0 errors across 7127 files);pnpm tsgo:coreandpnpm tsgo:test:src(both clean); unit-fast run on the adjacentpayloads.errors,payloads,failover-error,failover-policytests (106 tests / 4 files, all passing).timedOut=truecoerces totimeout/408; null reason with no signal coerces tounknown; external abort still returnscontinue_normal;idleTimedOut + allowSameModelIdleTimeoutRetryreturns retry withretryKind: "same_model_idle_timeout";fallback_modelbranch still throws the sameFailoverErrorwith the shared message builder.Review Conversations
Compatibility / Migration
YesNoNoRisks and Mitigations
surface_errorpath (outside externalAbort) previously relied on the fall-through tocontinue_normaland would now throw instead.continue_normalafter asurface_errordecision was the external-abort case, which is still preserved explicitly. Every othersurface_errordecision represented a concrete provider failure thatcontinue_normalwas silently dropping.FailoverReason, which changes the reported reason on the outgoingFailoverError."unknown"is emitted instead of crashing onnull.AI-assisted: yes