[Bug]: Overloaded error does not trigger model fallback — session hangs indefinitely with no retry or UI escape

## Summary

When the primary AI model returns an **overloaded / 503 error** (e.g. Gemini returning *"The AI service is temporarily overloaded. Please try again in a moment."*), the session **does not fall back to the configured fallback models**, does not auto-retry, and leaves the session in a permanently hung state with no way to recover from the UI.

## Version

`openclaw 2026.2.22`

## Configuration

```json
"model": {
  "primary": "google/gemini-3.1-pro-preview",
  "fallbacks": [
    "google/gemini-3-flash-preview",
    "google/gemini-3-pro-preview"
  ]
}
```

## Steps to Reproduce

1. Configure an agent with a primary model + fallbacks (e.g. Gemini 3.1 Pro Preview with flash/pro fallbacks)
2. Send a message when the primary model provider is under load
3. Primary model returns a 503 / overloaded error
4. Observe that fallbacks are **never tried** — the session just stops

## Expected Behavior

1. `runWithModelFallback` should catch the overloaded error and cascade to the next fallback candidate
2. The session should auto-retry with the fallback model transparently
3. If all fallbacks fail, show a clear error with a **retry button** and unblock the session
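
The cascade described above can be sketched roughly as follows. This is a minimal illustration, not the openclaw implementation: `runWithFallbackSketch`, the local `FailoverError` class, and the shape of the `attempts` array are all assumptions made for the example.

```js
// Hypothetical sketch: names and shapes are assumptions, not the openclaw source.
class FailoverError extends Error {
  constructor(message, { reason, provider, model } = {}) {
    super(message);
    this.name = "FailoverError";
    this.reason = reason;
    this.provider = provider;
    this.model = model;
  }
}

// Try each candidate in order; move on only for failover-eligible errors.
async function runWithFallbackSketch(candidates, run) {
  const attempts = [];
  for (const model of candidates) {
    try {
      return { model, result: await run(model), attempts };
    } catch (err) {
      if (!(err instanceof FailoverError)) throw err; // non-retryable: surface immediately
      attempts.push({ model, reason: err.reason });
    }
  }
  // Every candidate failed: report all attempts so the UI can offer a retry.
  const summary = attempts.map((a) => `${a.model} (${a.reason})`).join(", ");
  const error = new Error(`All models failed: ${summary}`);
  error.attempts = attempts;
  throw error;
}
```

With the configuration above, `candidates` would be the primary followed by the two fallbacks, and an overloaded primary would show up as one entry in `attempts` rather than as a terminal error.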

## Actual Behavior

- The error message **"The AI service is temporarily overloaded. Please try again in a moment."** is displayed in the chat UI as a banner
- The session **hangs indefinitely** — the send button becomes unresponsive (shows pause icon)
- The `.jsonl.lock` file remains held by the gateway process (PID confirmed alive)
- **No fallback model is attempted**
- The only recovery options are: restart the gateway, or wait and hope the error clears

## Root Cause Analysis (from source inspection)

**The fallback is bypassed because the overloaded error, although classified correctly on the assistant-message path, reaches `runWithModelFallback` in a form that `coerceToFailoverError` does not recognize.**

In `pi-embedded-helpers-B7rXYPuB.js`:
```js
function classifyFailoverReason(raw) {
  if (isOverloadedErrorMessage(raw)) return "rate_limit";  // ✅ classified correctly
  ...
}
function isFailoverErrorMessage(raw) {
  return classifyFailoverReason(raw) !== null;  // ✅ returns true for overloaded
}
function isFailoverAssistantError(msg) {
  if (!msg || msg.stopReason !== "error") return false;  // ⚠️ KEY CHECK
  return isFailoverErrorMessage(msg.errorMessage ?? "");
}
```

In `subagent-registry-C8AjcLWJ.js` (~line 70380):
```js
// This path handles HTTP-level errors (thrown exceptions):
if (isFailoverErrorMessage(errorText) && promptFailoverReason !== "timeout" && await advanceAuthProfile()) continue;
if (fallbackConfigured && isFailoverErrorMessage(errorText)) throw new FailoverError(...)
```

The overload error from Gemini arrives as an **assistant message** (`stopReason: "error"`, `errorMessage: "...overloaded..."`) — NOT as a thrown exception. The assistant-message path at ~line 70409 checks `isFailoverAssistantError`, which correctly identifies it.

However, the outer `runWithModelFallback` loop then checks `isFailoverError(normalized)` (line 24178). The issue is that `coerceToFailoverError` does not recognize the overloaded assistant error pattern, so `isFailoverError(normalized)` returns false and **the error is rethrown immediately** without trying any fallbacks.

Specifically in `runWithModelFallback`:
```js
const normalized = coerceToFailoverError(err, { provider, model });
if (!isFailoverError(normalized)) throw err;  // ← exits without trying fallbacks
```

The `OVERLOADED_ERROR_USER_MESSAGE` string ("The AI service is temporarily overloaded...") is a **user-facing formatted string**, not the raw API error. `coerceToFailoverError` likely does not match it because it looks for raw API error patterns (HTTP status codes, provider-specific error types), not the already-formatted human-readable message.
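
To illustrate the gap, here is a rough sketch of a coercion step that accepts both thrown exceptions and assistant-message-shaped errors, and that matches the formatted user-facing text as well as raw status text. Everything in it is hypothetical: `coerceSketch`, `extractErrorText`, and `OVERLOADED_PATTERN` are illustrative names, and the real `coerceToFailoverError` may be structured quite differently.

```js
// Hypothetical sketch; the real coerceToFailoverError may be structured differently.
// Assumed pattern: covers the formatted "overloaded" text and a raw 503 status.
const OVERLOADED_PATTERN = /overloaded|\b503\b/i;

// Pull error text out of the shapes discussed above.
function extractErrorText(err) {
  if (typeof err === "string") return err;
  if (err && err.stopReason === "error") return err.errorMessage ?? ""; // assistant-message shape
  if (err instanceof Error) return err.message; // thrown exception
  return "";
}

// Return a failover descriptor, or null when the error is not failover-eligible.
function coerceSketch(err, context) {
  const text = extractErrorText(err);
  if (!OVERLOADED_PATTERN.test(text)) return null;
  return { kind: "failover", reason: "rate_limit", message: text, ...context };
}
```

Under these assumptions, an assistant message with `stopReason: "error"` and the overloaded text yields a descriptor with `reason: "rate_limit"`, so the outer loop would continue to the next candidate instead of rethrowing.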

## Secondary Issue: Session Hangs With No Recovery Path

Even when the error is surfaced correctly, the session ends up in a broken state:

1. The `.jsonl.lock` file remains held (gateway process alive but idle)
2. The UI shows the error banner but the **send button is unresponsive**
3. No retry button is shown
4. The only escape is gateway restart

**Expected:** A retry button in the error banner that re-queues the last message. At minimum, the session lock should be released so the user can resend.

## Proposed Fixes

### Fix 1: Ensure `coerceToFailoverError` catches overloaded errors from assistant messages

The assistant-error path should propagate to the fallback loop as a proper `FailoverError`:

```js
// In the assistant-message error handling path, before exiting the inner loop:
if (failoverFailure && fallbackConfigured) {
  throw new FailoverError(lastAssistant.errorMessage, {
    reason: assistantFailoverReason ?? "unknown",
    provider,
    model: modelId,
  });
}
```

### Fix 2: Auto-retry overloaded errors with exponential backoff before falling back

Overloaded errors are transient — a brief retry on the same model before consuming a fallback slot would save fallback capacity:

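```js
// Before falling through to the next candidate, retry the overloaded primary 1-2x.
// `overloadRetries` is a counter declared outside the loop; `maxOverloadRetries`
// and `baseDelayMs` are illustrative settings (e.g. 2 and 2000).
if (reason === "rate_limit" && isOverloadedErrorMessage(errorText) && overloadRetries < maxOverloadRetries) {
  await sleep(baseDelayMs * 2 ** overloadRetries); // 2s, then 4s
  overloadRetries++;
  continue; // retry same candidate
}
```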
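
As a standalone illustration of the bounded retry idea, the sketch below wraps a single model call in a capped exponential backoff. The helper names (`retryOverloaded`, `backoffDelayMs`) and the default values are assumptions chosen for the example, not existing openclaw APIs.

```js
// Hypothetical helper; base delay, cap, and retry limit are illustrative values.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Delay doubles per attempt and is clamped to capMs.
function backoffDelayMs(attempt, baseMs = 2000, capMs = 5000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Retry `run` on overloaded errors up to maxRetries times, then rethrow
// so the caller can move on to the next fallback candidate.
async function retryOverloaded(run, isOverloaded, { maxRetries = 2, baseMs = 2000, capMs = 5000 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await run();
    } catch (err) {
      if (!isOverloaded(err) || attempt >= maxRetries) throw err;
      await sleep(backoffDelayMs(attempt, baseMs, capMs));
    }
  }
}
```

Keeping the retry count small matters here: the goal is to absorb a brief spike on the primary, not to delay the fallback for long.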

### Fix 3: Release session lock on terminal error

When all candidates fail (or error is non-retryable), release the `.jsonl.lock` so the user can resend from the UI without restarting the gateway.
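
One common way to guarantee the release is a `try`/`finally` around the whole run. The sketch below assumes a hypothetical `acquireLock` function that resolves to a `release` callback; it stands in for whatever actually takes the `.jsonl.lock` and is not an existing openclaw API.

```js
// Hypothetical sketch: acquireLock stands in for whatever takes the .jsonl.lock
// and is assumed to resolve to an async release function.
async function withSessionLock(acquireLock, work) {
  const release = await acquireLock();
  try {
    return await work();
  } finally {
    await release(); // runs on success and on any thrown error
  }
}
```

With this shape, a terminal error from the fallback loop still propagates to the caller, but the lock is no longer held when the UI receives it.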

### Fix 4: Add retry affordance in UI

When the overloaded banner is shown, include a **"Retry"** button that re-queues the last user message rather than requiring a full gateway restart.

## Impact

- Users with fallbacks configured get no benefit from them on the most common transient failure mode
- Session becomes permanently stuck requiring manual intervention (gateway restart)
- Particularly impactful for long-running agent sessions where context would be lost on restart
