LiveSessionModelSwitchError overrides model failover, creating infinite retry loop #57812

@bill492

Description


When the primary model (e.g., anthropic/claude-opus-4-6) hits rate limits and all auth profiles enter cooldown, the failover engine correctly decides to fall back to the next model in agents.defaults.model.fallbacks (e.g., openai-codex/gpt-5.4). However, the fallback candidate never actually executes — the session is immediately pulled back to the rate-limited primary model, creating an infinite retry loop.

Environment

  • OpenClaw version: 2026.3.28 (f9b1079)
  • Config: agents.defaults.model.primary: anthropic/claude-opus-4-6, agents.defaults.model.fallbacks: [openai-codex/gpt-5.4]
  • models.mode: merge
  • Auth profiles: One profile per provider (anthropic:default, openai-codex:default)
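For reference, the settings above would look roughly like this in config form (the nesting is inferred from the key paths in the report and may not match the actual file layout):

```yaml
agents:
  defaults:
    model:
      primary: anthropic/claude-opus-4-6
      fallbacks:
        - openai-codex/gpt-5.4
models:
  mode: merge
```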

Root Cause

Before each attempt, the embedded run loop (src/agents/live-model-switch.ts; runtime at dist/auth-profiles-*.js) checks the persisted session store for live model switches:

// Pre-attempt check in the embedded run loop
const nextSelection = resolvePersistedLiveSelection();
if (hasDifferentLiveSessionModelSelection(resolveCurrentLiveSelection(), nextSelection)) {
    // Aborts the current attempt and redirects the session to nextSelection
    throw new LiveSessionModelSwitchError(nextSelection);
}

resolvePersistedLiveSelection() reads providerOverride/modelOverride from the session store. When no explicit user override exists (i.e., no /model command was issued), it falls back to defaultModelRef, which resolves to the configured primary model (opus).
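A minimal sketch of that fallback, assuming shapes for the session store and selection types (none of these signatures are the real OpenClaw API):

```typescript
// Hypothetical types; the real store and selection shapes may differ.
interface ModelSelection { provider: string; model: string; }

interface SessionStore {
  providerOverride?: string;
  modelOverride?: string;
}

// Resolved from agents.defaults.model.primary in this scenario.
const defaultModelRef: ModelSelection = { provider: "anthropic", model: "claude-opus-4-6" };

function resolvePersistedLiveSelection(store: SessionStore): ModelSelection {
  if (store.providerOverride && store.modelOverride) {
    // Explicit /model override present in the session store.
    return { provider: store.providerOverride, model: store.modelOverride };
  }
  // No override: silently fall back to the configured primary. The
  // pre-attempt check cannot tell this apart from a deliberate switch.
  return defaultModelRef;
}

// During failover the current candidate is gpt-5.4, but with an empty
// store the persisted selection resolves to the default (opus):
const persisted = resolvePersistedLiveSelection({});
console.log(persisted.model); // "claude-opus-4-6"
```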

When the outer runWithModelFallback() loop hands off to gpt-5.4:

  1. Inner run starts for gpt-5.4
  2. Pre-attempt check sees: current = gpt-5.4, persisted = opus (default)
  3. hasDifferentLiveSessionModelSelection() returns true
  4. LiveSessionModelSwitchError(opus) is thrown — switching back to the rate-limited model
  5. Opus fails again → outer loop advances to gpt-5.4 → inner loop switches back → infinite loop
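The ping-pong above can be simulated in a few lines. Everything here (names, loop structure, the retry cap) is an illustration of the reported behavior, not the actual OpenClaw code:

```typescript
// Candidate rotation as the outer failover loop would see it.
const candidates = ["opus", "gpt-5.4"];
const persistedDefault = "opus";       // what the session store resolves to
const rateLimited = new Set(["opus"]); // opus is in cooldown

// Which model actually ends up running for a given candidate: the
// pre-attempt check yanks any non-default candidate back to the default.
function effectiveModel(candidate: string): string {
  return candidate !== persistedDefault ? persistedDefault : candidate;
}

let attempts = 0;
let succeeded = false;
for (let i = 0; i < 10 && !succeeded; i++) { // cap instead of looping forever
  const candidate = candidates[i % candidates.length];
  const running = effectiveModel(candidate);
  attempts++;
  // Never reached: `running` is always opus, and opus is rate-limited.
  if (!rateLimited.has(running)) succeeded = true;
}
console.log({ attempts, succeeded }); // { attempts: 10, succeeded: false }
```

Without the cap, the loop never terminates, which matches the 800+ retries observed below.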

Observed Behavior

Over a 12-hour window with 3 rate-limit episodes:

  • 389 correct candidate_failed → next=openai-codex/gpt-5.4 fallback decisions
  • 800+ total retry attempts
  • Only 1 successful gpt-5.4 execution (likely during a brief cooldown gap)
  • Manual gateway restarts and config hot-reloads required to restore service

Expected Behavior

When runWithModelFallback() selects a fallback candidate, the inner run loop should execute using that candidate model without the live-switch check pulling it back to the (rate-limited) primary.

Suggested Fix

The live-switch check (designed for user /model commands mid-run) should not fire during failover. Options:

  1. Pass isFallbackCandidate flag to the embedded run; suppress the resolvePersistedLiveSelection() pre-attempt check when true (still allow explicit consumeLiveSessionModelSwitch() for user commands)
  2. Write fallback model to session store as a temporary override during failover, so persisted selection matches the current candidate
  3. Distinguish explicit overrides from defaults in resolvePersistedLiveSelection() — only throw LiveSessionModelSwitchError when there's an actual user override, not when the function is returning the configured default

Option 3 is cleanest: a default value should never be treated as an intentional model switch request.
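A hedged sketch of Option 3: the resolved selection carries an explicit flag, so a value that merely fell back to the default can never masquerade as a switch request (all types and names are illustrative assumptions, not the real API):

```typescript
interface ModelSelection { provider: string; model: string; }

interface LiveSelection {
  selection: ModelSelection;
  explicit: boolean; // true only when the user issued /model
}

const defaultModelRef: ModelSelection = { provider: "anthropic", model: "claude-opus-4-6" };

function resolvePersistedLiveSelection(store: {
  providerOverride?: string;
  modelOverride?: string;
}): LiveSelection {
  if (store.providerOverride && store.modelOverride) {
    return {
      selection: { provider: store.providerOverride, model: store.modelOverride },
      explicit: true,
    };
  }
  // Returning the configured default is now distinguishable from a user switch.
  return { selection: defaultModelRef, explicit: false };
}

function shouldThrowLiveSwitch(current: ModelSelection, next: LiveSelection): boolean {
  // Only an explicit user override may interrupt the current candidate.
  return next.explicit &&
    (next.selection.provider !== current.provider ||
     next.selection.model !== current.model);
}

// Failover to gpt-5.4 with no user override: no switch error fires.
const next = resolvePersistedLiveSelection({});
const current: ModelSelection = { provider: "openai-codex", model: "gpt-5.4" };
console.log(shouldThrowLiveSwitch(current, next)); // false
```

A genuine mid-run /model command (explicit: true with a differing selection) would still trigger the switch, preserving the check's original purpose.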

Labels

bug
