[Bug] server_is_overloaded / service_unavailable_error does not trigger model fallback, causing repeated retries against the same overloaded endpoint #70120

@Sway-Chan

Description

Problem

When an LLM provider returns server_is_overloaded (HTTP 503) or service_unavailable_error, OpenClaw does not trigger model fallback and instead retries the same endpoint repeatedly until timeout or manual intervention.

In contrast, timeout and rate_limit errors correctly trigger fallback and switch to the next candidate model.
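The asymmetry can be sketched as a simple error classifier (a minimal illustration only — `should_fallback` and `FALLBACK_ERRORS` are hypothetical names, not OpenClaw's actual internals): timeout and rate_limit errors are treated as fallback-triggering, while server_is_overloaded falls through to an in-place retry.

```python
# Hypothetical sketch of the observed failover classification.
# Names are illustrative, not OpenClaw's actual API.

# Errors that currently trigger model fallback:
FALLBACK_ERRORS = {"timeout", "rate_limit"}

def should_fallback(error: str) -> bool:
    """Return True if this error should switch to the next candidate model."""
    return error in FALLBACK_ERRORS

# Observed behavior:
#   timeout / rate_limit        -> fallback to next model
#   server_is_overloaded (503)  -> retried against the same endpoint
```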

Steps to Reproduce

  1. Configure an OpenAI model (e.g. gpt-5.4) as the primary model
  2. Wait for OpenAI to return server_is_overloaded errors
  3. Observe: the agent retries the same model repeatedly without falling back

Log Evidence

# All 8 requests hit server_is_overloaded — zero fallbacks triggered:
17:52:18 embedded run agent end ... error=server_is_overloaded
17:54:49 embedded run agent end ... error=server_is_overloaded
18:00:45 embedded run agent end ... error=server_is_overloaded
18:00:48 embedded run agent end ... error=server_is_overloaded
18:01:01 embedded run agent end ... error=server_is_overloaded
18:01:11 embedded run agent end ... error=server_is_overloaded
18:01:39 embedded run agent end ... error=server_is_overloaded
18:02:06 embedded run agent end ... error=server_is_overloaded

# For comparison — timeout and rate_limit correctly trigger fallback:
17:44:30 failover decision: reason=timeout from=gpt-5.4 → next=zai/glm-5-turbo ✅
18:06:46 failover decision: reason=rate_limit from=gpt-5.3-codex → next=zai/glm-5-turbo ✅

Impact

  1. Downstream session blocking: subagent or main session stuck in retry loop, lane not released, subsequent messages queue indefinitely
  2. Resource waste: each retry consumes API quota and wall-clock time
  3. Poor UX: group/DM chats become unresponsive for extended periods

Expected Behavior

server_is_overloaded and service_unavailable_error should be treated the same as timeout and rate_limit: immediately trigger fallback to the next candidate model instead of retrying the same overloaded endpoint.
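The expected behavior could look like the following sketch, which adds both error types to the fallback-triggering set and advances through the candidate list (all names — `FALLBACK_ERRORS`, `next_model` — are hypothetical, not OpenClaw's real code):

```python
# Hypothetical sketch of the proposed fix — illustrative names only.

# server_is_overloaded (HTTP 503) and service_unavailable_error added
# alongside the errors that already trigger fallback today:
FALLBACK_ERRORS = {
    "timeout",
    "rate_limit",
    "server_is_overloaded",       # proposed addition
    "service_unavailable_error",  # proposed addition
}

def next_model(current: str, candidates: list[str], error: str) -> str:
    """On a fallback-triggering error, switch to the next candidate model;
    otherwise keep retrying the current one."""
    if error in FALLBACK_ERRORS:
        i = candidates.index(current)
        if i + 1 < len(candidates):
            return candidates[i + 1]
    return current
```

With the candidate list from this report, a 503 on the primary would then fail over instead of looping, e.g. `next_model("openai-codex/gpt-5.4", ["openai-codex/gpt-5.4", "zai/glm-5-turbo"], "server_is_overloaded")` returning `"zai/glm-5-turbo"`.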

Environment

  • OpenClaw version: latest (2026-04-22)
  • Affected models: openai-codex/gpt-5.4, openai-codex/gpt-5.3-codex
  • Fallback candidate: zai/glm-5-turbo (works correctly on timeout/rate_limit)
