Problem
When an LLM provider returns server_is_overloaded (HTTP 503) or service_unavailable_error, OpenClaw does not trigger model fallback and instead retries the same endpoint repeatedly until timeout or manual intervention.
In contrast, timeout and rate_limit errors correctly trigger fallback and switch to the next candidate model.
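For illustration, here is a minimal sketch of the kind of error gate that could produce this behavior; the names (FALLBACK_ERRORS, shouldFallback) and the plain-string error codes are assumptions for the sketch, not OpenClaw's actual code.

```typescript
// Hypothetical sketch: fallback is gated on a fixed set of error codes.
// If the 503-style codes are missing from that set, the caller keeps
// retrying the same endpoint instead of advancing to the next candidate
// model, which would match the behavior in the log evidence below.
const FALLBACK_ERRORS = new Set<string>([
  "timeout",
  "rate_limit",
  // "server_is_overloaded" and "service_unavailable_error" are absent
]);

function shouldFallback(errorCode: string): boolean {
  return FALLBACK_ERRORS.has(errorCode);
}
```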
Steps to Reproduce
- Configure an OpenAI model (e.g. gpt-5.4) as the primary model with a fallback candidate (see the config sketch after this list)
- Wait for OpenAI to return server_is_overloaded errors
- Observe: the agent retries the same model repeatedly without falling back
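For reference, a hypothetical repro configuration along these lines; the keys (primaryModel, fallbackModels) are illustrative only and are not claimed to match OpenClaw's real config schema.

```typescript
// Hypothetical repro setup (illustrative keys, not OpenClaw's real schema):
// one OpenAI primary model and one fallback candidate.
const reproConfig = {
  primaryModel: "openai-codex/gpt-5.4",
  fallbackModels: ["zai/glm-5-turbo"],
};
```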
Log Evidence
# All 8 requests hit server_is_overloaded — zero fallbacks triggered:
17:52:18 embedded run agent end ... error=server_is_overloaded
17:54:49 embedded run agent end ... error=server_is_overloaded
18:00:45 embedded run agent end ... error=server_is_overloaded
18:00:48 embedded run agent end ... error=server_is_overloaded
18:01:01 embedded run agent end ... error=server_is_overloaded
18:01:11 embedded run agent end ... error=server_is_overloaded
18:01:39 embedded run agent end ... error=server_is_overloaded
18:02:06 embedded run agent end ... error=server_is_overloaded
# For comparison — timeout and rate_limit correctly trigger fallback:
17:44:30 failover decision: reason=timeout from=gpt-5.4 → next=zai/glm-5-turbo ✅
18:06:46 failover decision: reason=rate_limit from=gpt-5.3-codex → next=zai/glm-5-turbo ✅
Impact
- Downstream session blocking: subagent or main session stuck in retry loop, lane not released, subsequent messages queue indefinitely
- Resource waste: each retry consumes API quota and wall-clock time
- Poor UX: group/DM chats become unresponsive for extended periods
Expected Behavior
server_is_overloaded and service_unavailable_error should be treated the same as timeout and rate_limit — immediately trigger fallback to the next candidate model instead of retrying the same overloaded endpoint.
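A minimal sketch of the expected handling, assuming hypothetical names (runWithFallback, callModel) and a plain string code on the thrown error; this is a proposal illustration, not OpenClaw's implementation.

```typescript
// Hypothetical sketch of the expected behavior: any error code in the
// fallback set advances to the next candidate model immediately,
// rather than retrying the same overloaded endpoint.
const FALLBACK_ERRORS = new Set<string>([
  "timeout",
  "rate_limit",
  "server_is_overloaded",      // proposed addition (HTTP 503)
  "service_unavailable_error", // proposed addition
]);

async function runWithFallback(
  candidates: string[],
  callModel: (model: string) => Promise<string>,
): Promise<string> {
  let lastError: unknown;
  for (const model of candidates) {
    try {
      return await callModel(model);
    } catch (err) {
      lastError = err;
      const code = (err as { code?: string }).code ?? "";
      // Non-fallback errors surface immediately; fallback errors move on
      // to the next candidate model without retrying the same endpoint.
      if (!FALLBACK_ERRORS.has(code)) throw err;
    }
  }
  throw lastError;
}
```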
Environment
- OpenClaw version: latest (2026-04-22)
- Affected models: openai-codex/gpt-5.4, openai-codex/gpt-5.3-codex
- Fallback candidate: zai/glm-5-turbo (works correctly on timeout/rate_limit)