Description
When the primary model (e.g., anthropic/claude-opus-4-6) hits rate limits and all auth profiles enter cooldown, the failover engine correctly decides to fall back to the next model in agents.defaults.model.fallbacks (e.g., openai-codex/gpt-5.4). However, the fallback candidate never actually executes — the session is immediately pulled back to the rate-limited primary model, creating an infinite retry loop.
Environment
- OpenClaw version: 2026.3.28 (f9b1079)
- Config: agents.defaults.model.primary: anthropic/claude-opus-4-6, agents.defaults.model.fallbacks: [openai-codex/gpt-5.4]
- models.mode: merge
- Auth profiles: one profile per provider (anthropic:default, openai-codex:default)
Root Cause
In the embedded run loop (src/agents/live-model-switch.ts / runtime at dist/auth-profiles-*.js), before each attempt, the code checks the persisted session store for live model switches:
const nextSelection = resolvePersistedLiveSelection();
if (hasDifferentLiveSessionModelSelection(resolveCurrentLiveSelection(), nextSelection)) {
  throw new LiveSessionModelSwitchError(nextSelection);
}
resolveLiveSessionModelSelection() reads providerOverride/modelOverride from the session store. When no explicit user override exists (no /model command), it falls back to defaultModelRef — which resolves to the configured primary model (opus).
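A minimal sketch of the resolution behavior described above (the store field names providerOverride/modelOverride/defaultModelRef come from this report; the function body and types are hypothetical, for illustration only):

```typescript
// Hypothetical shape of the persisted session store; only set by a /model command.
interface SessionStore {
  providerOverride?: string;
  modelOverride?: string;
  defaultModelRef: string; // configured primary, e.g. "anthropic/claude-opus-4-6"
}

function resolveLiveSessionModelSelection(store: SessionStore): string {
  if (store.providerOverride && store.modelOverride) {
    return `${store.providerOverride}/${store.modelOverride}`;
  }
  // No explicit override: fall back to the configured default. The caller
  // cannot tell this apart from a deliberate user-requested switch.
  return store.defaultModelRef;
}

const store: SessionStore = { defaultModelRef: "anthropic/claude-opus-4-6" };
console.log(resolveLiveSessionModelSelection(store)); // "anthropic/claude-opus-4-6"
```

Because the default is returned through the same code path as a user override, any caller comparing it against the current model sees a "switch request" to the primary.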
When the outer runWithModelFallback() loop hands off to gpt-5.4:
- Inner run starts for gpt-5.4
- Pre-attempt check sees: current = gpt-5.4, persisted = opus (default)
- hasDifferentLiveSessionModelSelection() returns true
- LiveSessionModelSwitchError(opus) is thrown — switching back to the rate-limited model
- Opus fails again → outer loop advances to gpt-5.4 → inner loop switches back → infinite loop
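The pull-back in the sequence above can be reproduced in isolation. The function and error names come from this report; the bodies are hypothetical sketches of the described behavior:

```typescript
function hasDifferentLiveSessionModelSelection(current: string, persisted: string): boolean {
  return current !== persisted;
}

class LiveSessionModelSwitchError extends Error {
  constructor(public readonly target: string) {
    super(`live switch requested to ${target}`);
  }
}

// Pre-attempt check as described: compares the current candidate against the
// persisted selection, which (with no /model override) resolves to the default.
function preAttemptCheck(currentModel: string, persistedSelection: string): void {
  if (hasDifferentLiveSessionModelSelection(currentModel, persistedSelection)) {
    throw new LiveSessionModelSwitchError(persistedSelection);
  }
}

const persisted = "anthropic/claude-opus-4-6"; // defaultModelRef, no user override

// The primary passes the check (and then fails on rate limits)...
preAttemptCheck("anthropic/claude-opus-4-6", persisted);

// ...but the fallback candidate always trips it and is pulled back to opus.
let pulledBackTo: string | null = null;
try {
  preAttemptCheck("openai-codex/gpt-5.4", persisted);
} catch (e) {
  if (e instanceof LiveSessionModelSwitchError) pulledBackTo = e.target;
}
console.log(pulledBackTo); // "anthropic/claude-opus-4-6"
```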
Observed Behavior
Over a 12-hour window with 3 rate-limit episodes:
- 389 correct candidate_failed → next=openai-codex/gpt-5.4 fallback decisions
- 800+ total retry attempts
- Only 1 successful gpt-5.4 execution (likely during a brief cooldown gap)
- Manual gateway restarts and config hot-reloads required to restore service
Expected Behavior
When runWithModelFallback() selects a fallback candidate, the inner run loop should execute using that candidate model without the live-switch check pulling it back to the (rate-limited) primary.
Suggested Fix
The live-switch check (designed for user /model commands mid-run) should not fire during failover. Options:
1. Pass an isFallbackCandidate flag to the embedded run; suppress the resolvePersistedLiveSelection() pre-attempt check when true (still allow explicit consumeLiveSessionModelSwitch() for user commands)
2. Write the fallback model to the session store as a temporary override during failover, so the persisted selection matches the current candidate
3. Distinguish explicit overrides from defaults in resolveLiveSessionModelSelection() — only throw LiveSessionModelSwitchError when there's an actual user override, not when the function is returning the configured default
Option 3 is cleanest: a default value should never be treated as an intentional model switch request.
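One way option 3 could look, as a hedged sketch (the SelectionSource/LiveSelection types and shouldThrowLiveSwitch helper are hypothetical names introduced here, not existing OpenClaw API):

```typescript
type SelectionSource = "user-override" | "default";

interface LiveSelection {
  modelRef: string;
  source: SelectionSource; // records where the selection came from
}

function resolveLiveSessionModelSelection(store: {
  providerOverride?: string;
  modelOverride?: string;
  defaultModelRef: string;
}): LiveSelection {
  if (store.providerOverride && store.modelOverride) {
    return { modelRef: `${store.providerOverride}/${store.modelOverride}`, source: "user-override" };
  }
  return { modelRef: store.defaultModelRef, source: "default" };
}

function shouldThrowLiveSwitch(current: string, selection: LiveSelection): boolean {
  // Only an explicit user override counts as a switch request; a default that
  // merely differs from the current fallback candidate is ignored.
  return selection.source === "user-override" && selection.modelRef !== current;
}
```

With this shape, the failover candidate gpt-5.4 no longer trips the check when the persisted selection is just the configured default, while a genuine /model override still triggers the switch.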
Labels
bug