Skip to content

fix(runner): throw FailoverError on assistant surface_error so webchat renders provider failures#70848

Merged
steipete merged 4 commits intoopenclaw:mainfrom
truffle-dev:fix/surface-error-billing-reaches-client
Apr 24, 2026
Merged

fix(runner): throw FailoverError on assistant surface_error so webchat renders provider failures#70848
steipete merged 4 commits intoopenclaw:mainfrom
truffle-dev:fix/surface-error-billing-reaches-client

Conversation

@truffle-dev
Copy link
Copy Markdown
Contributor

Summary

  • Problem: handleAssistantFailover resolved surface_error, logged the decision, then fell through to continue_normal. buildEmbeddedRunPayloads saw the partial assistant output and dropped the provider error silently, so billing/auth/rate_limit failures never reached the webchat.
  • Why it matters: The reporter on [Bug]: Webchat UI shows nothing when Anthropic provider returns billing error — gateway decides surface_error but client never renders it #70124 hit this with Anthropic invalid_request_error (billing). Gateway logs show decision=surface_error reason=billing, but the UI renders nothing. The same path applies to every surface_error that isn't an external abort.
  • What changed: On non-externalAbort surface_error, throw a FailoverError with the resolved reason/provider/model/profileId/status, matching the existing fallback_model behavior. The catch in run.ts already wraps thrown FailoverError into the payload the webchat renders, so the one-site change reuses the whole client surface.
  • What did NOT change (scope boundary): external-abort short-circuit (user-pressed-stop still carries partial output via continue_normal), same-model-idle-timeout retry path, and the rotate_profile / fallback_model branches are untouched.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

Root Cause (if applicable)

  • Root cause: The surface_error branch in handleAssistantFailover called logAssistantFailoverDecision("surface_error") and then fell through to return { action: "continue_normal", ... }. Control returned to buildEmbeddedRunPayloads, which treated the partial assistant message as a completed turn and dropped the provider error.
  • Missing detection / guardrail: No unit test asserted handleAssistantFailover throws a FailoverError on surface_error. failover-policy.test.ts covered decision resolution but not post-decision outcome side effects.
  • Contributing context (if known): The fallback_model branch already throws FailoverError for the gateway catch to lift into the webchat payload. The two branches were asymmetric; surface_error was the only failure decision that reached continue_normal.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/agents/pi-embedded-runner/run/assistant-failover.test.ts (new).
  • Scenario the test should lock in: each failover reason (billing/auth/rate_limit) throws a FailoverError with the right status; null decision reasons coerce onto the most specific observed failure (timedOut -> timeout/408); externalAbort still falls through to continue_normal; idle-timeout same-model retry preserved; fallback_model regression guard after the message-builder refactor.
  • Why this is the smallest reliable guardrail: handleAssistantFailover is pure given its params. Unit-level tests directly assert the outcome action and FailoverError fields; no need for a heavier runner harness.
  • Existing test that already covers this (if any): failover-policy.test.ts covers decision resolution but not outcome side effects.
  • If no new test is added, why not: N/A

User-visible / Behavior Changes

Billing, auth, rate_limit, and timeout failures from the assistant turn now surface to the webchat as concrete provider error messages (matching what fallback_model already produced) instead of an empty completion. External aborts (user pressed stop) continue to fall through without synthesizing a provider error.

Diagram (if applicable)

Before:
[surface_error billing] -> [log decision] -> [continue_normal]
                        -> [buildEmbeddedRunPayloads sees partial msg] -> [empty UI]

After:
[surface_error billing] -> [log decision] -> [throw FailoverError(reason=billing, status=402)]
                        -> [run.ts catch lifts into payload] -> [webchat renders provider error]

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: Linux (container)
  • Runtime/container: Node 22 / pnpm 9
  • Model/provider: Anthropic claude-haiku-4-5-20251001 (billing error path)
  • Integration/channel (if any): pi-embedded-runner / webchat
  • Relevant config (redacted): failover path exercised via an assistant turn classified as billing with failoverFailure=true, fallbackConfigured=false.

Steps

  1. Start a session against a provider that returns a billing-style invalid_request_error on the assistant turn (or stub the assistant message to set billingFailure=true with failoverReason=billing).
  2. Watch the gateway log: it records decision=surface_error reason=billing.
  3. Observe the webchat.

Expected

  • Webchat renders the provider's billing error via the shared FailoverError payload path.

Actual

  • Before this PR: webchat renders nothing; gateway log shows the decision but the error never propagates.
  • After this PR: webchat renders the formatted billing error message; the catch in run.ts lifts the FailoverError into the payload the client already knows how to render.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

New assistant-failover.test.ts fails against the pre-patch file (7/7 tests expect action: "throw" with a concrete FailoverError; pre-patch returns action: "continue_normal"). Against the patched file, all 7 pass.

Human Verification (required)

  • Verified scenarios: focused pnpm test on the new assistant-failover.test.ts (7/7 pass); pnpm check:changed (239 tests green); pnpm lint:core (0 warnings / 0 errors across 7127 files); pnpm tsgo:core and pnpm tsgo:test:src (both clean); unit-fast run on the adjacent payloads.errors, payloads, failover-error, failover-policy tests (106 tests / 4 files, all passing).
  • Edge cases checked: null decision reason with timedOut=true coerces to timeout/408; null reason with no signal coerces to unknown; external abort still returns continue_normal; idleTimedOut + allowSameModelIdleTimeoutRetry returns retry with retryKind: "same_model_idle_timeout"; fallback_model branch still throws the same FailoverError with the shared message builder.
  • What you did not verify: live-provider integration probe against a real Anthropic billing error.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

  • Risk: A legitimate surface_error path (outside externalAbort) previously relied on the fall-through to continue_normal and would now throw instead.
    • Mitigation: The only documented caller of continue_normal after a surface_error decision was the external-abort case, which is still preserved explicitly. Every other surface_error decision represented a concrete provider failure that continue_normal was silently dropping.
  • Risk: Null decision reasons now coerce to a concrete FailoverReason, which changes the reported reason on the outgoing FailoverError.
    • Mitigation: The coercion order (timeout -> billing -> auth -> rate_limit -> unknown) matches the same precedence used elsewhere in the failover path; if nothing matches, "unknown" is emitted instead of crashing on null.

AI-assisted: yes

@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: M labels Apr 24, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 24, 2026

Greptile Summary

This PR fixes a silent drop of provider failures (billing, auth, rate-limit) in the surface_error branch of handleAssistantFailover. Previously the branch fell through to continue_normal, causing buildEmbeddedRunPayloads to see a partial assistant message and discard the error; now it throws a FailoverError, reusing the same catch path that fallback_model already relies on. A new test file covers the key scenarios (billing/auth/rate_limit throws, null-reason coercion, external-abort pass-through, idle-timeout retry, and a fallback_model regression guard).

Confidence Score: 5/5

Safe to merge; the fix is well-scoped, the external-abort carve-out is preserved, and all seven new tests pass.

The only finding is P2: the new surface_error FailoverError omits rawError that the fallback_model branch includes, which is a minor diagnostic inconsistency with no functional impact on the bug fix itself.

No files require special attention.

Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/agents/pi-embedded-runner/run/assistant-failover.ts
Line: 221-231

Comment:
**`rawError` omitted from `surface_error` FailoverError**

The `fallback_model` branch passes `rawError: params.lastAssistant?.errorMessage?.trim()` to `FailoverError`, but the new `surface_error` branch does not. If the client or logging pipeline uses `rawError` for debugging (e.g., to surface the original provider error string alongside the resolved message), that context is silently dropped on the `surface_error` path. Consider adding it for consistency.

```suggestion
      return {
        action: "throw",
        overloadProfileRotations,
        error: new FailoverError(message, {
          reason,
          provider: params.activeErrorContext.provider,
          model: params.activeErrorContext.model,
          profileId: params.lastProfileId,
          status,
          rawError: params.lastAssistant?.errorMessage?.trim(),
        }),
      };
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix(runner): throw FailoverError on assi..." | Re-trigger Greptile

Comment on lines +221 to +231
return {
action: "throw",
overloadProfileRotations,
error: new FailoverError(message, {
reason,
provider: params.activeErrorContext.provider,
model: params.activeErrorContext.model,
profileId: params.lastProfileId,
status,
}),
};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 rawError omitted from surface_error FailoverError

The fallback_model branch passes rawError: params.lastAssistant?.errorMessage?.trim() to FailoverError, but the new surface_error branch does not. If the client or logging pipeline uses rawError for debugging (e.g., to surface the original provider error string alongside the resolved message), that context is silently dropped on the surface_error path. Consider adding it for consistency.

Suggested change
return {
action: "throw",
overloadProfileRotations,
error: new FailoverError(message, {
reason,
provider: params.activeErrorContext.provider,
model: params.activeErrorContext.model,
profileId: params.lastProfileId,
status,
}),
};
return {
action: "throw",
overloadProfileRotations,
error: new FailoverError(message, {
reason,
provider: params.activeErrorContext.provider,
model: params.activeErrorContext.model,
profileId: params.lastProfileId,
status,
rawError: params.lastAssistant?.errorMessage?.trim(),
}),
};
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/pi-embedded-runner/run/assistant-failover.ts
Line: 221-231

Comment:
**`rawError` omitted from `surface_error` FailoverError**

The `fallback_model` branch passes `rawError: params.lastAssistant?.errorMessage?.trim()` to `FailoverError`, but the new `surface_error` branch does not. If the client or logging pipeline uses `rawError` for debugging (e.g., to surface the original provider error string alongside the resolved message), that context is silently dropped on the `surface_error` path. Consider adding it for consistency.

```suggestion
      return {
        action: "throw",
        overloadProfileRotations,
        error: new FailoverError(message, {
          reason,
          provider: params.activeErrorContext.provider,
          model: params.activeErrorContext.model,
          profileId: params.lastProfileId,
          status,
          rawError: params.lastAssistant?.errorMessage?.trim(),
        }),
      };
```

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 09ff3dfc7b. The throw path now mirrors fallback_model and sets rawError: params.lastAssistant?.errorMessage?.trim() on the FailoverError. New test populates rawError on the thrown FailoverError with the lastAssistant errorMessage locks it in.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1d41aeb28d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

// no other synthesis path caught them). Throw a FailoverError so
// the client surface can render it the same way it already
// renders fallback_model failures.
if (!params.externalAbort && !params.timedOut) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Restrict surface-error throw to real assistant failures

This new throw branch runs for every non-timeout surface_error, but surface_error is not limited to stopReason === "error": in run.ts, assistantFailoverReason is derived from assistantForFailover?.errorMessage even on successful turns, so stale errorMessage values can still drive rotate_profile -> surface_error. In single-profile/no-fallback sessions, that now throws here and aborts normal payload construction, turning a successful assistant reply into a hard error. Please gate this throw on an actual assistant failure signal (e.g. failoverFailure / errored stop reason) so successful turns with stale error metadata still return via continue_normal.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Confirmed by reading run.ts:1476-1481 and errors.ts:1105,1129,1222,1269,1253classifyFailoverReason runs as a pure string classifier without a stopReason === "error" gate, so a successful turn whose lastAssistant.errorMessage happens to match billing/auth/rate_limit/etc can drive shouldRotateAssistant through its failoverReason !== null branch. failoverFailure IS gated on stopReason === "error" by the isXAssistantError helpers, so it's the correct signal that a concrete provider failure occurred on this attempt.

Fixed in 09ff3dfc7b. The throw guard is now !params.externalAbort && !params.timedOut && params.failoverFailure. #70124 still lands because Anthropic billing errors have stopReason === "error", which sets billingFailure AND failoverFailure to true. New test leaves successful turns with a stale classified errorMessage on the continue_normal path locks in the gate.

truffle-dev and others added 4 commits April 24, 2026 03:19
…t renders provider failures

Close openclaw#70124.

When `handleAssistantFailover` resolved `surface_error`, it logged the
decision and then fell through to `continue_normal`. The next iteration
of `buildEmbeddedRunPayloads` would see the partial assistant output and
drop the provider error silently. Gateway logs recorded
`decision=surface_error reason=billing`, but the webchat never rendered
anything. The reporter hit this on Anthropic `invalid_request_error`
billing responses, and the same path applies to every `surface_error`
failure that isn't an external abort.

Match the `fallback_model` behavior: on any non-externalAbort
`surface_error`, throw a `FailoverError` with the resolved reason,
provider, model, profileId, and status. The existing catch in
`run.ts` already wraps thrown `FailoverError` instances into the
payload the webchat renders, so the one-site change reuses the whole
client surface.

Three small pieces:

- Extract `resolveAssistantFailoverErrorMessage` and share it with the
  existing `fallback_model` branch. Removes the 22-line duplicate that
  would otherwise exist between the two throw paths.
- Add `resolveSurfaceErrorReason` to coerce a null `decision.reason`
  onto the most specific failure the run observed. `surface_error`
  decisions can arrive with `reason: null` when `shouldRotateAssistant`
  fired on `failoverFailure` or `timedOut` without a classified
  upstream reason, and `FailoverError` requires a concrete reason.
- Preserve the existing external-abort and same-model-idle-timeout
  retry paths. External aborts still fall through to `continue_normal`
  so partial assistant output carries the turn; the idle-timeout retry
  guard is unchanged.

AI-assisted; verified via new `assistant-failover.test.ts` covering
billing/auth/rate_limit throw paths, null-reason coercion, external
abort preservation, idle-timeout retry, and a regression guard on the
existing `fallback_model` branch after the message-builder
refactor.
The original fix in this PR widened the throw condition to
`!externalAbort`, which broke timeout-compaction retry coverage.
`run.ts` already synthesizes a timeout-error payload when
`buildEmbeddedRunPayloads` produces no assistant content (the
`timedOut && !timedOutDuringCompaction && !payloadsWithToolMedia.length`
block). Throwing a FailoverError from the surface_error branch for
plain timeouts short-circuits that synthesis.

Narrow the guard to `!externalAbort && !timedOut`. The throw path
now covers only non-timeout, non-external-abort surface_error
decisions, which is the true silent-drop case openclaw#70124
reported. Drop the dead `timedOut` branch from
`resolveSurfaceErrorReason` now that the helper is only called on
the non-timeout path.

Test coverage: update the null-reason coercion test to the
non-timeout shape, and add a dedicated test locking in that plain
timeouts keep falling through to `continue_normal`.

AI-assisted.
@steipete steipete force-pushed the fix/surface-error-billing-reaches-client branch from 09ff3df to bc22bde Compare April 24, 2026 02:22
@steipete steipete merged commit a958b6e into openclaw:main Apr 24, 2026
69 checks passed
@truffle-dev truffle-dev deleted the fix/surface-error-billing-reaches-client branch April 30, 2026 22:24
ogt-redknie pushed a commit to ogt-redknie/OPENX that referenced this pull request May 2, 2026
Surface non-retryable assistant provider failures from the embedded runner instead of letting surface_error fall through to continue_normal.

- Preserve external abort and plain timeout fall-through paths.
- Preserve raw provider error diagnostics on surfaced FailoverError.
- Add regression coverage for billing/auth/rate-limit/null-reason/error fall-through cases.
- Update changelog.

Fixes openclaw#70124.
Thanks @truffle-dev.
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 9, 2026
Surface non-retryable assistant provider failures from the embedded runner instead of letting surface_error fall through to continue_normal.

- Preserve external abort and plain timeout fall-through paths.
- Preserve raw provider error diagnostics on surfaced FailoverError.
- Add regression coverage for billing/auth/rate-limit/null-reason/error fall-through cases.
- Update changelog.

Fixes openclaw#70124.
Thanks @truffle-dev.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Webchat UI shows nothing when Anthropic provider returns billing error — gateway decides surface_error but client never renders it

2 participants