LiveSessionModelSwitchError overrides model failover, creating infinite retry loop #57812

@bill492

Description


When the primary model (e.g., anthropic/claude-opus-4-6) hits rate limits and all auth profiles enter cooldown, the failover engine correctly decides to fall back to the next model in agents.defaults.model.fallbacks (e.g., openai-codex/gpt-5.4). However, the fallback candidate never actually executes — the session is immediately pulled back to the rate-limited primary model, creating an infinite retry loop.

Environment

  • OpenClaw version: 2026.3.28 (f9b1079)
  • Config: agents.defaults.model.primary: anthropic/claude-opus-4-6, agents.defaults.model.fallbacks: [openai-codex/gpt-5.4]
  • models.mode: merge
  • Auth profiles: One profile per provider (anthropic:default, openai-codex:default)
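For reference, the settings above would look roughly like this in config form (the nesting is inferred from the key paths in the report and may not match the actual file layout):

```yaml
agents:
  defaults:
    model:
      primary: anthropic/claude-opus-4-6
      fallbacks:
        - openai-codex/gpt-5.4
models:
  mode: merge
```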

Root Cause

Before each attempt, the embedded run loop (src/agents/live-model-switch.ts; runtime at dist/auth-profiles-*.js) checks the persisted session store for live model switches:

// Pre-attempt check in the embedded run loop
const nextSelection = resolvePersistedLiveSelection();
if (hasDifferentLiveSessionModelSelection(resolveCurrentLiveSelection(), nextSelection)) {
    // Aborts the current attempt and redirects the session to nextSelection
    throw new LiveSessionModelSwitchError(nextSelection);
}

resolvePersistedLiveSelection() reads providerOverride/modelOverride from the session store. When no explicit user override exists (i.e., no /model command was issued), it falls back to defaultModelRef, which resolves to the configured primary model (opus).
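A minimal sketch of that fallback, assuming shapes for the session store and selection types (none of these signatures are the real OpenClaw API):

```typescript
// Hypothetical types; the real store and selection shapes may differ.
interface ModelSelection { provider: string; model: string; }

interface SessionStore {
  providerOverride?: string;
  modelOverride?: string;
}

// Resolved from agents.defaults.model.primary in this scenario.
const defaultModelRef: ModelSelection = { provider: "anthropic", model: "claude-opus-4-6" };

function resolvePersistedLiveSelection(store: SessionStore): ModelSelection {
  if (store.providerOverride && store.modelOverride) {
    // Explicit /model override present in the session store.
    return { provider: store.providerOverride, model: store.modelOverride };
  }
  // No override: silently fall back to the configured primary. The
  // pre-attempt check cannot tell this apart from a deliberate switch.
  return defaultModelRef;
}

// During failover the current candidate is gpt-5.4, but with an empty
// store the persisted selection resolves to the default (opus):
const persisted = resolvePersistedLiveSelection({});
console.log(persisted.model); // "claude-opus-4-6"
```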

When the outer runWithModelFallback() loop hands off to gpt-5.4:

  1. Inner run starts for gpt-5.4
  2. Pre-attempt check sees: current = gpt-5.4, persisted = opus (default)
  3. hasDifferentLiveSessionModelSelection() returns true
  4. LiveSessionModelSwitchError(opus) is thrown — switching back to the rate-limited model
  5. Opus fails again → outer loop advances to gpt-5.4 → inner loop switches back → infinite loop
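The ping-pong above can be simulated in a few lines. Everything here (names, loop structure, the retry cap) is an illustration of the reported behavior, not the actual OpenClaw code:

```typescript
// Candidate rotation as the outer failover loop would see it.
const candidates = ["opus", "gpt-5.4"];
const persistedDefault = "opus";       // what the session store resolves to
const rateLimited = new Set(["opus"]); // opus is in cooldown

// Which model actually ends up running for a given candidate: the
// pre-attempt check yanks any non-default candidate back to the default.
function effectiveModel(candidate: string): string {
  return candidate !== persistedDefault ? persistedDefault : candidate;
}

let attempts = 0;
let succeeded = false;
for (let i = 0; i < 10 && !succeeded; i++) { // cap instead of looping forever
  const candidate = candidates[i % candidates.length];
  const running = effectiveModel(candidate);
  attempts++;
  // Never reached: `running` is always opus, and opus is rate-limited.
  if (!rateLimited.has(running)) succeeded = true;
}
console.log({ attempts, succeeded }); // { attempts: 10, succeeded: false }
```

Without the cap, the loop never terminates, which matches the 800+ retries observed below.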

Observed Behavior

Over a 12-hour window with 3 rate-limit episodes:

  • 389 correct candidate_failed → next=openai-codex/gpt-5.4 fallback decisions
  • 800+ total retry attempts
  • Only 1 successful gpt-5.4 execution (likely during a brief cooldown gap)
  • Manual gateway restarts and config hot-reloads required to restore service

Expected Behavior

When runWithModelFallback() selects a fallback candidate, the inner run loop should execute using that candidate model without the live-switch check pulling it back to the (rate-limited) primary.

Suggested Fix

The live-switch check (designed for user /model commands mid-run) should not fire during failover. Options:

  1. Pass isFallbackCandidate flag to the embedded run; suppress the resolvePersistedLiveSelection() pre-attempt check when true (still allow explicit consumeLiveSessionModelSwitch() for user commands)
  2. Write fallback model to session store as a temporary override during failover, so persisted selection matches the current candidate
  3. Distinguish explicit overrides from defaults in resolvePersistedLiveSelection() — only throw LiveSessionModelSwitchError when there's an actual user override, not when the function is returning the configured default

Option 3 is cleanest: a default value should never be treated as an intentional model switch request.
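A hedged sketch of Option 3: the resolved selection carries an explicit flag, so a value that merely fell back to the default can never masquerade as a switch request (all types and names are illustrative assumptions, not the real API):

```typescript
interface ModelSelection { provider: string; model: string; }

interface LiveSelection {
  selection: ModelSelection;
  explicit: boolean; // true only when the user issued /model
}

const defaultModelRef: ModelSelection = { provider: "anthropic", model: "claude-opus-4-6" };

function resolvePersistedLiveSelection(store: {
  providerOverride?: string;
  modelOverride?: string;
}): LiveSelection {
  if (store.providerOverride && store.modelOverride) {
    return {
      selection: { provider: store.providerOverride, model: store.modelOverride },
      explicit: true,
    };
  }
  // Returning the configured default is now distinguishable from a user switch.
  return { selection: defaultModelRef, explicit: false };
}

function shouldThrowLiveSwitch(current: ModelSelection, next: LiveSelection): boolean {
  // Only an explicit user override may interrupt the current candidate.
  return next.explicit &&
    (next.selection.provider !== current.provider ||
     next.selection.model !== current.model);
}

// Failover to gpt-5.4 with no user override: no switch error fires.
const next = resolvePersistedLiveSelection({});
const current: ModelSelection = { provider: "openai-codex", model: "gpt-5.4" };
console.log(shouldThrowLiveSwitch(current, next)); // false
```

A genuine mid-run /model command (explicit: true with a differing selection) would still trigger the switch, preserving the check's original purpose.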

Labels

bug
