Summary
When the primary AI model returns an overloaded / 503 error (e.g. Gemini returning "The AI service is temporarily overloaded. Please try again in a moment."), the session does not fall back to the configured fallback models, does not auto-retry, and leaves the session in a permanently hung state with no way to recover from the UI.
Version
`openclaw 2026.2.22`
Configuration
```json
"model": {
"primary": "google/gemini-3.1-pro-preview",
"fallbacks": [
"google/gemini-3-flash-preview",
"google/gemini-3-pro-preview"
]
}
```
Steps to Reproduce
- Configure an agent with a primary model + fallbacks (e.g. Gemini 3.1 Pro Preview with flash/pro fallbacks)
- Send a message when the primary model provider is under load
- Primary model returns a 503 / overloaded error
- Observe that fallbacks are never tried — the session just stops
Expected Behavior
- `runWithModelFallback` should catch the overloaded error and cascade to the next fallback candidate
- The session should auto-retry with the fallback model transparently
- If all fallbacks fail, show a clear error with a retry button and unblock the session
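The expected cascade can be sketched as follows. This is an illustrative shape only, using the names visible in the bundle; it is not the actual minified implementation:

```javascript
// Hypothetical sketch of the expected fallback cascade, not openclaw's real code.
class FailoverError extends Error {
  constructor(message, meta = {}) {
    super(message);
    this.name = "FailoverError";
    this.meta = meta; // e.g. { reason: "rate_limit", provider, model }
  }
}

async function runWithModelFallback(candidates, attempt) {
  let lastErr;
  for (const model of candidates) {
    try {
      return await attempt(model); // success: stop cascading
    } catch (err) {
      if (!(err instanceof FailoverError)) throw err; // non-retryable: surface immediately
      lastErr = err; // transient (overloaded/rate-limited): try the next candidate
    }
  }
  throw lastErr; // all candidates exhausted
}
```

Under this shape, an overloaded primary should simply advance the loop to the first fallback; the bug below is that the overloaded error never reaches the loop as a `FailoverError`.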
Actual Behavior
- The error message "The AI service is temporarily overloaded. Please try again in a moment." is displayed in the chat UI as a banner
- The session hangs indefinitely — the send button becomes unresponsive (shows pause icon)
- The `.jsonl.lock` file remains held by the gateway process (PID confirmed alive)
- No fallback model is attempted
- The only recovery options are restarting the gateway, or waiting and hoping the error clears
Root Cause Analysis (from source inspection)
The fallback is bypassed because `isFailoverAssistantError` is never triggered for overloaded errors in this path.
In `pi-embedded-helpers-B7rXYPuB.js`:
```js
function classifyFailoverReason(raw) {
  if (isOverloadedErrorMessage(raw)) return "rate_limit"; // ✅ classified correctly
  ...
}
function isFailoverErrorMessage(raw) {
  return classifyFailoverReason(raw) !== null; // ✅ returns true for overloaded
}
function isFailoverAssistantError(msg) {
  if (!msg || msg.stopReason !== "error") return false; // ⚠️ KEY CHECK
  return isFailoverErrorMessage(msg.errorMessage ?? "");
}
```
In `subagent-registry-C8AjcLWJ.js` (~line 70380):
```js
// This path handles HTTP-level errors (thrown exceptions):
if (isFailoverErrorMessage(errorText) && promptFailoverReason !== "timeout" && await advanceAuthProfile()) continue;
if (fallbackConfigured && isFailoverErrorMessage(errorText)) throw new FailoverError(...)
```
The overload error from Gemini arrives as an assistant message (`stopReason: "error"`, `errorMessage: "...overloaded..."`) — NOT as a thrown exception. The assistant-message path at ~line 70409 checks `isFailoverAssistantError`, which correctly identifies it.
However, the outer `runWithModelFallback` loop then checks `isFailoverError(normalized)` (line 24178). The issue is that `coerceToFailoverError` does not recognize the overloaded assistant error pattern, so `isFailoverError(normalized)` returns false and the error is rethrown immediately without trying any fallbacks.
Specifically in `runWithModelFallback`:
```js
const normalized = coerceToFailoverError(err, { provider, model });
if (!isFailoverError(normalized)) throw err; // ← exits without trying fallbacks
```
The `OVERLOADED_ERROR_USER_MESSAGE` string ("The AI service is temporarily overloaded...") is a user-facing formatted string, not the raw API error. `coerceToFailoverError` likely does not match it because it looks for raw API error patterns (HTTP status codes, provider-specific error types), not the already-formatted human-readable message.
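A minimal sketch of the suspected mismatch: a coercion helper keyed on raw provider error shapes will not match the formatted user-facing string. The patterns and helper below are hypothetical, for illustration only; the real matching logic lives in the minified bundle:

```javascript
// Hypothetical stand-in for the pattern matching inside coerceToFailoverError.
// Raw API error shapes match; the pre-formatted banner text does not.
const RAW_PATTERNS = [/\b503\b/, /overloaded_error/i, /RESOURCE_EXHAUSTED/];

function looksLikeRawFailover(text) {
  return RAW_PATTERNS.some((re) => re.test(text));
}

const rawApiError = 'HTTP 503 {"error":{"status":"RESOURCE_EXHAUSTED"}}';
const userFacing =
  "The AI service is temporarily overloaded. Please try again in a moment.";

looksLikeRawFailover(rawApiError); // → true: would be coerced to a FailoverError
looksLikeRawFailover(userFacing); // → false: pattern miss, fallback loop never engages
```

If this is the failure mode, either the raw error must be preserved alongside the formatted message, or the matcher must also recognize the formatted string.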
Secondary Issue: Session Hangs With No Recovery Path
Even when the error is surfaced correctly, the session ends up in a broken state:
- The `.jsonl.lock` file remains held (gateway process alive but idle)
- The UI shows the error banner but the send button is unresponsive
- No retry button is shown
- The only escape is gateway restart
Expected: A retry button in the error banner that re-queues the last message. At minimum, the session lock should be released so the user can resend.
Proposed Fixes
Fix 1: Propagate overloaded assistant errors to the fallback loop as a proper `FailoverError`
The assistant-error path should throw a `FailoverError` that `coerceToFailoverError` and the outer `runWithModelFallback` loop recognize:
```js
// In the assistant-message error handling path, before exiting the inner loop:
if (failoverFailure && fallbackConfigured) {
throw new FailoverError(lastAssistant.errorMessage, {
reason: assistantFailoverReason ?? "unknown",
provider,
model: modelId,
});
}
```
Fix 2: Auto-retry overloaded errors with exponential backoff before falling back
Overloaded errors are transient — a brief retry on the same model before consuming a fallback slot would save fallback capacity:
```js
// Before falling through to next candidate, retry overloaded primary 1-2x
if (reason === "rate_limit" && isOverloadedErrorMessage(errorText)) {
await sleep(retryDelayMs); // 2–5s
continue; // retry same candidate
}
```
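A possible shape for the backoff schedule referenced in the fix title. The base delay and cap are assumptions, not existing openclaw settings:

```javascript
// Illustrative exponential backoff schedule for same-model retries
// before consuming a fallback slot. Values are assumptions.
function backoffDelayMs(attempt, baseMs = 2000, capMs = 8000) {
  // attempt 0 → 2s, attempt 1 → 4s, attempt 2 → 8s, then capped at 8s
  return Math.min(baseMs * 2 ** attempt, capMs);
}
```

Adding jitter (e.g. multiplying by a random factor in [0.5, 1.0]) would also help avoid retry stampedes when the provider recovers.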
Fix 3: Release session lock on terminal error
When all candidates fail (or the error is non-retryable), release the `.jsonl.lock` so the user can resend from the UI without restarting the gateway.
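A sketch of the intended lock lifecycle, assuming hypothetical `acquireLock`/`release` helpers standing in for whatever currently guards the `.jsonl.lock` file:

```javascript
// Hypothetical sketch: guarantee lock release on every exit path,
// including failover exhaustion and non-retryable errors.
async function runSessionTurn(session, doTurn) {
  const lock = await session.acquireLock();
  try {
    return await doTurn();
  } finally {
    // Runs on success AND on throw, so the UI can resend without
    // restarting the gateway even when every fallback has failed.
    await lock.release();
  }
}
```

The key point is the `finally` block: today the lock appears to leak whenever the turn ends via the unhandled overloaded error.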
Fix 4: Add retry affordance in UI
When the overloaded banner is shown, include a "Retry" button that re-queues the last user message rather than requiring a full gateway restart.
Impact
- Users with fallbacks configured get no benefit from them on the most common transient failure mode
- Session becomes permanently stuck requiring manual intervention (gateway restart)
- Particularly impactful for long-running agent sessions where context would be lost on restart