Model failover does not activate on rate limit - agents stay on primary model #19249

@cryptopafi

Description

When the primary model (Anthropic Claude) hits API rate limits, the gateway does NOT automatically switch agents to configured fallback models (OpenRouter/DeepSeek, OpenRouter/Kimi). Agents remain stuck on the rate-limited primary model, causing cascading failures.

Environment

  • OpenClaw version: 2026.2.14
  • OS: macOS (Intel)
  • Node: v22.22.0
  • Gateway mode: local
  • 8 agents configured with fallback chains

Configuration

```json
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "anthropic/claude-sonnet-4-5-20250929",
        "fallbacks": ["openrouter/deepseek/deepseek-v3.2", "openrouter/moonshotai/kimi-k2.5"]
      }
    }
  }
}
```

Auth profiles exist for both anthropic:manual and openrouter:manual in each agent's auth-profiles.json.

Steps to Reproduce

  1. Configure an agent with an Anthropic primary model and OpenRouter fallbacks
  2. Trigger a rate limit on Anthropic (e.g., by activating multiple agents simultaneously)
  3. Observe that the agents do NOT switch to the fallback models

Expected Behavior

Per model-failover docs: "If all profiles for a provider fail, OpenClaw moves to the next model in agents.defaults.model.fallbacks."

Agents should automatically switch to openrouter/deepseek/deepseek-v3.2 when Anthropic is rate-limited.
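The expected behavior quoted above amounts to walking the configured chain in order. The sketch below is a minimal illustration under stated assumptions, not OpenClaw's implementation: `RateLimitError`, `CallModel`, and `runWithFailover` are hypothetical names, and it assumes a rate-limit error is the only signal that should advance to the next model.

```typescript
// Hypothetical error type standing in for a provider rate-limit response.
class RateLimitError extends Error {
  readonly model: string;
  constructor(model: string) {
    super(`rate limit reached on ${model}`);
    this.name = "RateLimitError";
    this.model = model;
  }
}

// Mirrors agents.defaults.model in the configuration above.
interface ModelChain {
  primary: string;
  fallbacks: string[];
}

// Hypothetical signature for a single model call.
type CallModel = (model: string, prompt: string) => Promise<string>;

// Try the primary first, then each fallback in order. Only rate-limit errors
// advance the chain; any other error (e.g. a 401) is rethrown immediately.
async function runWithFailover(
  chain: ModelChain,
  prompt: string,
  callModel: CallModel,
): Promise<{ model: string; reply: string }> {
  let lastError: unknown;
  for (const model of [chain.primary, ...chain.fallbacks]) {
    try {
      const reply = await callModel(model, prompt);
      return { model, reply };
    } catch (err) {
      if (!(err instanceof RateLimitError)) throw err;
      lastError = err; // remember it and move on to the next candidate
    }
  }
  // Every candidate was rate-limited: surface the last error.
  throw lastError;
}
```

With the configuration above, a rate-limited Anthropic call in this sketch would be retried on `openrouter/deepseek/deepseek-v3.2` within the same request, which is what the report expected to see.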

Actual Behavior

  • All agents fail with FailoverError: ⚠️ API rate limit reached. Please try again later.
  • No attempt is made to use the fallback models
  • The gateway error log shows cascading failures across all agents
  • Messages queue up (maximum observed wait: 321 seconds)
  • Eventually the OAuth token expires (HTTP 401), compounding the problem

Testing Done

  1. Manual cooldown in auth-profiles.json: set cooldownUntil to a future time for anthropic:manual → the gateway IGNORES file-based cooldowns for existing sessions
  2. Changed the primary model to DeepSeek in config: the gateway hot-reloads the config, but existing sessions keep the old model
  3. Gateway restart with DeepSeek as primary: the agent responds correctly on DeepSeek → confirms the OpenRouter integration works
  4. Verified the OpenRouter API key: a direct curl request to OpenRouter returns a successful response

Key Finding

The failover mechanism appears to only work at session creation (after gateway restart), not at runtime when rate limits are encountered. File-based cooldowns in auth-profiles.json are not respected by running sessions.
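To make the finding concrete, here is a sketch of model selection that would honor file-based cooldowns because it runs on every request instead of once at session creation. It illustrates the behavior the report expects and does not describe how OpenClaw resolves models. The `AuthProfiles` shape (a profile id mapped to an optional `cooldownUntil` in epoch milliseconds) is an assumption about auth-profiles.json, and `providerOf`, `providerAvailable`, and `pickModel` are hypothetical helpers.

```typescript
interface ModelChain {
  primary: string;
  fallbacks: string[];
}

// ASSUMED shape of auth-profiles.json: profile id (e.g. "anthropic:manual")
// mapped to an optional cooldownUntil in epoch milliseconds. The real format may differ.
type AuthProfiles = Record<string, { cooldownUntil?: number }>;

// "anthropic/claude-..." -> "anthropic", "openrouter/deepseek/..." -> "openrouter"
function providerOf(model: string): string {
  const slash = model.indexOf("/");
  return slash === -1 ? model : model.slice(0, slash);
}

// A provider is usable if at least one of its profiles is not cooling down.
// A provider with no profiles at all is treated as unusable.
function providerAvailable(provider: string, profiles: AuthProfiles, now: number): boolean {
  return Object.entries(profiles)
    .filter(([id]) => id.startsWith(`${provider}:`))
    .some(([, p]) => p.cooldownUntil === undefined || p.cooldownUntil <= now);
}

// Meant to be called on every request, so a cooldown written after the
// session started still changes the model on the next turn.
function pickModel(chain: ModelChain, profiles: AuthProfiles, now: number = Date.now()): string | undefined {
  return [chain.primary, ...chain.fallbacks].find((m) => providerAvailable(providerOf(m), profiles, now));
}
```

If selection were done this way, test 1 above (a future cooldownUntil on anthropic:manual) would route the next message to the first OpenRouter fallback without a restart, and traffic would return to the primary once the cooldown passes.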

Error Log Excerpt

```
2026-02-17T13:45:54Z [diagnostic] lane task error: lane=session:agent:lexi:discord:channel:... error="FailoverError: ⚠️ API rate limit reached."
2026-02-17T13:45:58Z [diagnostic] lane task error: lane=session:agent:ressie:discord:channel:... error="FailoverError: ⚠️ API rate limit reached."
2026-02-17T13:46:01Z [diagnostic] lane task error: lane=session:agent:mark:discord:channel:... error="FailoverError: ⚠️ API rate limit reached."
2026-02-17T13:46:16Z [diagnostic] lane task error: lane=session:agent:dessie:discord:channel:... error="FailoverError: ⚠️ API rate limit reached."
2026-02-17T13:46:18Z [diagnostic] lane task error: lane=session:agent:afy:discord:channel:... error="FailoverError: ⚠️ API rate limit reached."
2026-02-17T13:46:24Z [diagnostic] lane task error: lane=session:agent:tech:discord:channel:... error="FailoverError: ⚠️ API rate limit reached."
2026-02-17T13:51:01Z [discord] Slow listener detected: DiscordMessageListener took 321 seconds
```
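When triaging a longer log, a small helper can tally which agents hit the rate-limit FailoverError. This is a minimal sketch: the regular expression assumes the exact line format in the excerpt above, and `rateLimitedAgents` is a hypothetical name, not part of OpenClaw.

```typescript
// Matches diagnostic lines like the excerpt above and captures the agent name
// from "lane=session:agent:<name>:...". Assumes this exact log format.
const LANE_RATE_LIMIT =
  /^\S+ \[diagnostic\] lane task error: lane=session:agent:([^:\s]+):\S* error="FailoverError: [^"]*rate limit/;

// Count rate-limit FailoverErrors per agent; other lines are ignored.
function rateLimitedAgents(log: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of log.split(/\r?\n/)) {
    const match = LANE_RATE_LIMIT.exec(line);
    if (match) counts.set(match[1], (counts.get(match[1]) ?? 0) + 1);
  }
  return counts;
}
```

Applied to the excerpt, this would list six agents with one error each, which matches the cascade described under Actual Behavior.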

Workaround

Manually switch the model via config and restart the gateway:

```shell
# Edit openclaw.json to change the agent's model
# Then restart the gateway to clear existing sessions
launchctl kickstart -k gui/$(id -u)/ai.openclaw.gateway
```
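The config edit in this workaround can be scripted. A minimal sketch, assuming openclaw.json has the `agents.defaults.model` shape shown under Configuration; `OpenClawConfig` and `promoteFirstFallback` are hypothetical names, and per-agent model overrides (if your setup has them) are not handled.

```typescript
// Only the part of openclaw.json this sketch touches; other keys are preserved by spreading.
interface OpenClawConfig {
  agents: { defaults: { model: { primary: string; fallbacks: string[] } } };
}

// Returns a new config whose primary is the first fallback. The old primary
// moves to the end of the fallback list so it is still tried last.
function promoteFirstFallback(config: OpenClawConfig): OpenClawConfig {
  const { primary, fallbacks } = config.agents.defaults.model;
  if (fallbacks.length === 0) return config; // nothing to promote
  const [next, ...rest] = fallbacks;
  return {
    ...config,
    agents: {
      ...config.agents,
      defaults: {
        ...config.agents.defaults,
        model: { primary: next, fallbacks: [...rest, primary] },
      },
    },
  };
}
```

Parse the file with `JSON.parse`, write the result back with `JSON.stringify(updated, null, 2)`, and then run the `launchctl` command above so new sessions pick up the change.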

Metadata

Assignees

No one assigned

    Labels

    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests