Treat HTTP 503 (and 502/504) as failover-eligible so model fallback triggers

# OpenClaw: Treat HTTP 503 (and 502/504) as failover-eligible so model fallback triggers

## Summary

When the primary model’s API returns **503 Service Unavailable** (e.g. Google Gemini “This model is currently experiencing high demand…”), OpenClaw retries the same model and never moves to `agents.defaults.model.fallbacks`. The run can then stall with no reply to the user.

Model fallback is documented to apply to “auth failures, rate limits, and timeouts.” Today, **503 (and 502/504) are not treated as failover-eligible**, so they fall through to message-based classification and are often classified as “other”, which does not advance fallback.

## Expected behavior

- **503**, and preferably **502** / **504**, should be treated like rate-limit/transient errors: trigger **model fallback** (next model in `agents.defaults.model.fallbacks`) instead of only retrying the same model.
- The same should apply to error messages that clearly indicate overload (e.g. “UNAVAILABLE”, “experiencing high demand”) when the error object doesn’t carry an HTTP status.

## Actual behavior

- Primary (e.g. `google/gemini-3-flash-preview`) returns 503.
- Run retries the same model; after tool results, the next completion request again returns 503.
- No switch to fallback (e.g. `minimax/MiniMax-M2.5`); user gets no reply.

Observed in session transcript: assistant message with `"code":503,"status":"Service Unavailable"` and message “This model is currently experiencing high demand. Spikes in demand are usually temporary. Please try again later.” Retries stayed on the same model.

## Proposed fix

In **`src/agents/failover-error.ts`**, in the function that resolves failover reason from an error (e.g. `resolveFailoverReasonFromError`):

1. **By status code:** After handling 408 (timeout), add handling for 503 (and optionally 502, 504) and return **`"rate_limit"`** (so existing fallback/cooldown behavior applies).

   Example (pseudocode):

   ```ts
   if (status === 408) return "timeout";
   // Treat server overload / temporary unavailable as rate-limit-like for fallback
   if (status === 503 || status === 502 || status === 504) return "rate_limit";
   ```

2. **By message (optional):** In `classifyFailoverReason` (or equivalent message-based classifier), if the text contains `503`, `UNAVAILABLE`, or “high demand” / “experiencing high demand”, return **`"rate_limit"`** so that when the error is passed without an HTTP status, fallback still triggers.
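
To make the two steps concrete, here is a minimal, self-contained sketch of how status-based and message-based classification could combine. All names here (`FailoverReason`, `reasonFromStatus`, `reasonFromMessage`, `resolveReason`) and the error shape are hypothetical and are not taken from OpenClaw's source; the real `resolveFailoverReasonFromError` and `classifyFailoverReason` likely handle more reasons and differ in detail.

```ts
// Hypothetical sketch, not OpenClaw's actual code.
// Only the two reasons relevant to this issue are modelled.
type FailoverReason = "timeout" | "rate_limit";

// Statuses this issue proposes to treat as overload / temporarily unavailable.
const TRANSIENT_STATUSES = new Set<number>([502, 503, 504]);

// Message patterns for errors that arrive without an HTTP status.
// Assumption: matching "UNAVAILABLE" case-sensitively keeps the check narrow.
const OVERLOAD_PATTERNS: RegExp[] = [/\b503\b/, /\bUNAVAILABLE\b/, /high demand/i];

function reasonFromStatus(status: number | undefined): FailoverReason | null {
  if (status === 408) return "timeout";
  // Server overload / temporary unavailability is treated as rate-limit-like.
  if (status !== undefined && TRANSIENT_STATUSES.has(status)) return "rate_limit";
  return null;
}

function reasonFromMessage(message: string): FailoverReason | null {
  return OVERLOAD_PATTERNS.some((p) => p.test(message)) ? "rate_limit" : null;
}

// Prefer the HTTP status; fall back to the message text when it is absent
// or not recognized. A null result means "not failover-eligible" here.
function resolveReason(err: { status?: number; message?: string }): FailoverReason | null {
  return reasonFromStatus(err.status) ?? reasonFromMessage(err.message ?? "");
}
```

With this shape, an error carrying `status: 503` and an error carrying only the text “This model is currently experiencing high demand.” would both resolve to `"rate_limit"`, while an unrelated 500 with no matching text would stay unclassified.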

## Environment

- OpenClaw version: 2026.2.17 (from npm).
- Config: `agents.defaults.model.primary` = `google/gemini-3-flash-preview`, `agents.defaults.model.fallbacks` = [minimax, kimi, grok-3, ollama].
- Channel: Telegram (direct).

## References

- [Model failover](https://docs.openclaw.ai/concepts/model-failover): “This applies to auth failures, rate limits, and timeouts that exhausted profile rotation (**other errors do not advance fallback**).”
- Session transcript showed repeated 503 on `gemini-3-flash-preview` with no switch to fallback.

Thank you for considering this; it would make configured fallbacks actually apply when the primary provider is overloaded (503) or temporarily down (502/504).

