[Bug]: Overloaded error does not trigger model fallback — session hangs indefinitely with no retry or UI escape #24378

@bellchenx

Summary

When the primary AI model returns an overloaded / 503 error (e.g. Gemini returning "The AI service is temporarily overloaded. Please try again in a moment."), the session does not fall back to the configured fallback models, does not auto-retry, and leaves the session in a permanently hung state with no way to recover from the UI.

Version

`openclaw 2026.2.22`

Configuration

```json
"model": {
  "primary": "google/gemini-3.1-pro-preview",
  "fallbacks": [
    "google/gemini-3-flash-preview",
    "google/gemini-3-pro-preview"
  ]
}
```

Steps to Reproduce

  1. Configure an agent with a primary model + fallbacks (e.g. Gemini 3.1 Pro Preview with flash/pro fallbacks)
  2. Send a message when the primary model provider is under load
  3. Primary model returns a 503 / overloaded error
  4. Observe that fallbacks are never tried — the session just stops

Expected Behavior

  1. runWithModelFallback should catch the overloaded error and cascade to the next fallback candidate
  2. The session should auto-retry with the fallback model transparently
  3. If all fallbacks fail, show a clear error with a retry button and unblock the session
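
The expected cascade above can be sketched as follows. This is a minimal illustration, not the actual openclaw implementation: `runWithFallbackSketch` is a hypothetical name, and the `FailoverError` class here is a local stand-in that only borrows the name and constructor shape used later in this issue.

```js
// Local stand-in for the real FailoverError (assumption: same constructor shape).
class FailoverError extends Error {
  constructor(message, { reason, provider, model } = {}) {
    super(message);
    this.name = "FailoverError";
    this.reason = reason;
    this.provider = provider;
    this.model = model;
  }
}

// Hypothetical cascade: try each candidate in order.
// Only a FailoverError moves on to the next candidate.
async function runWithFallbackSketch(candidates, run) {
  const attempts = [];
  for (const model of candidates) {
    try {
      return { model, result: await run(model), attempts };
    } catch (err) {
      if (!(err instanceof FailoverError)) throw err; // not a failover case: surface immediately
      attempts.push({ model, reason: err.reason });
    }
  }
  // Every candidate failed: raise one terminal error the UI could attach a retry button to.
  const terminal = new Error("All configured models failed");
  terminal.attempts = attempts;
  throw terminal;
}
```

With the configuration above, `candidates` would be the primary followed by the two fallbacks, and an overloaded primary would only advance the loop if it surfaces as a `FailoverError`.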

Actual Behavior

  • The error message "The AI service is temporarily overloaded. Please try again in a moment." is displayed in the chat UI as a banner
  • The session hangs indefinitely — the send button becomes unresponsive (shows pause icon)
  • The .jsonl.lock file remains held by the gateway process (PID confirmed alive)
  • No fallback model is attempted
  • The only recovery options are: restart the gateway, or wait and hope the error clears

Root Cause Analysis (from source inspection)

The fallback is bypassed because the overloaded error never reaches the outer fallback loop as a recognized failover error. The classification helpers themselves handle it correctly.

In pi-embedded-helpers-B7rXYPuB.js:
```js
function classifyFailoverReason(raw) {
  if (isOverloadedErrorMessage(raw)) return "rate_limit"; // ✅ classified correctly
  // ...
}
function isFailoverErrorMessage(raw) {
  return classifyFailoverReason(raw) !== null; // ✅ returns true for overloaded
}
function isFailoverAssistantError(msg) {
  if (!msg || msg.stopReason !== "error") return false; // ⚠️ KEY CHECK
  return isFailoverErrorMessage(msg.errorMessage ?? "");
}
```

In subagent-registry-C8AjcLWJ.js (~line 70380):
```js
// This path handles HTTP-level errors (thrown exceptions):
if (isFailoverErrorMessage(errorText) && promptFailoverReason !== "timeout" && await advanceAuthProfile()) continue;
if (fallbackConfigured && isFailoverErrorMessage(errorText)) throw new FailoverError(...)
```

The overload error from Gemini arrives as an assistant message (stopReason: "error", errorMessage: "...overloaded...") — NOT as a thrown exception. The assistant-message path at ~line 70409 checks isFailoverAssistantError, which correctly identifies it.

However, the outer runWithModelFallback loop then checks isFailoverError(normalized) (line 24178). The issue is that coerceToFailoverError does not recognize the overloaded assistant error pattern, so isFailoverError(normalized) returns false and the error is rethrown immediately without trying any fallbacks.

Specifically in runWithModelFallback:
```js
const normalized = coerceToFailoverError(err, { provider, model });
if (!isFailoverError(normalized)) throw err; // ← exits without trying fallbacks
```

The OVERLOADED_ERROR_USER_MESSAGE string ("The AI service is temporarily overloaded...") is a user-facing formatted string, not the raw API error. coerceToFailoverError likely does not match it because it looks for raw API error patterns (HTTP status codes, provider-specific error types), not the already-formatted human-readable message.
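
The suspected mismatch can be illustrated with a small sketch. Both matcher functions below are hypothetical and were not taken from the openclaw source; the regular expressions are only examples of what "raw API error patterns" might look like. The message constant is the string quoted in this issue.

```js
// The user-facing string quoted in this issue.
const OVERLOADED_ERROR_USER_MESSAGE =
  "The AI service is temporarily overloaded. Please try again in a moment.";

// Hypothetical matcher that only knows raw API shapes (status code, provider status name).
function matchesRawOverloadPattern(text) {
  return /\b503\b/.test(text) || /UNAVAILABLE/.test(text);
}

// Hypothetical matcher that also accepts the already-formatted message.
function matchesOverload(text) {
  return matchesRawOverloadPattern(text) || text === OVERLOADED_ERROR_USER_MESSAGE;
}
```

Under these assumptions, `matchesRawOverloadPattern(OVERLOADED_ERROR_USER_MESSAGE)` is false because the formatted text carries no status code, while `matchesOverload` accepts it. If coerceToFailoverError behaves like the first matcher, that would explain why the error is rethrown.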

Secondary Issue: Session Hangs With No Recovery Path

Even when the error is surfaced correctly, the session ends up in a broken state:

  1. The .jsonl.lock file remains held (gateway process alive but idle)
  2. The UI shows the error banner but the send button is unresponsive
  3. No retry button is shown
  4. The only escape is gateway restart

Expected: A retry button in the error banner that re-queues the last message. At minimum, the session lock should be released so the user can resend.

Proposed Fixes

Fix 1: Ensure coerceToFailoverError catches overloaded errors from assistant messages

The assistant-error path should propagate to the fallback loop as a proper FailoverError:

```js
// In the assistant-message error handling path, before exiting the inner loop:
if (failoverFailure && fallbackConfigured) {
  throw new FailoverError(lastAssistant.errorMessage, {
    reason: assistantFailoverReason ?? "unknown",
    provider,
    model: modelId,
  });
}
```

Fix 2: Auto-retry overloaded errors with exponential backoff before falling back

Overloaded errors are transient — a brief retry on the same model before consuming a fallback slot would save fallback capacity:

```js
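// Before falling through to the next candidate, retry an overloaded primary 1-2x.
// overloadRetries and maxOverloadRetries are hypothetical names; without a bound
// this branch would retry the same candidate forever.
if (
  reason === "rate_limit" &&
  isOverloadedErrorMessage(errorText) &&
  overloadRetries < maxOverloadRetries
) {
  overloadRetries += 1;
  await sleep(retryDelayMs); // 2–5s
  continue; // retry same candidate
}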
```
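
The snippet above uses a fixed delay. A bounded exponential backoff could look roughly like the sketch below. All names (`backoffDelayMs`, `retryOverloaded`, `isOverloaded`) and the default values are assumptions chosen to match the 2–5s range mentioned above, not existing openclaw APIs.

```js
// Hypothetical helper: delay before retry number `attempt` (0-based), doubling each time, capped.
function backoffDelayMs(attempt, baseMs = 2000, capMs = 5000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry `run` on the same candidate while `isOverloaded(err)` holds, at most `maxRetries` times.
// Once the budget is spent, the error is rethrown so the caller can move to a fallback.
async function retryOverloaded(run, isOverloaded, { maxRetries = 2, baseMs = 2000, capMs = 5000 } = {}) {
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await run();
    } catch (err) {
      if (attempt >= maxRetries || !isOverloaded(err)) throw err;
      await sleep(backoffDelayMs(attempt, baseMs, capMs));
    }
  }
}
```

With the defaults, the delays before the first and second retry are 2000 ms and 4000 ms, and any later delay would be capped at 5000 ms.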

Fix 3: Release session lock on terminal error

When all candidates fail (or error is non-retryable), release the .jsonl.lock so the user can resend from the UI without restarting the gateway.
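
One way to guarantee the release is a `try`/`finally` around the turn. The sketch below uses an in-memory object as a stand-in for the .jsonl.lock file; `createSessionLock` and `runTurnWithLock` are hypothetical names, and the real gateway's locking code may be structured differently.

```js
// Hypothetical in-memory stand-in for the session's .jsonl.lock file.
function createSessionLock() {
  let held = false;
  return {
    acquire() {
      if (held) throw new Error("session is locked");
      held = true;
    },
    release() {
      held = false;
    },
    get held() {
      return held;
    },
  };
}

// Run one turn while holding the lock.
// `finally` releases the lock whether the turn resolves or throws.
async function runTurnWithLock(lock, turn) {
  lock.acquire();
  try {
    return await turn();
  } finally {
    lock.release();
  }
}
```

In this shape, a terminal overloaded error still propagates to the caller, but the next send from the UI can acquire the lock again.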

Fix 4: Add retry affordance in UI

When the overloaded banner is shown, include a "Retry" button that re-queues the last user message rather than requiring a full gateway restart.

Impact

  • Users with fallbacks configured get no benefit from them on the most common transient failure mode
  • Session becomes permanently stuck requiring manual intervention (gateway restart)
  • Particularly impactful for long-running agent sessions where context would be lost on restart
