[bug] Fallback chain aborted by premature primary restore when cooldown expires mid-flight

## Description

When the primary model (e.g. `anthropic/claude-sonnet-4-6`) returns `overloaded_error` (503), the fallback chain starts correctly. However, the overload cooldown is very short (30s after the 1st error, 60s after the 2nd), so it expires **during** fallback chain execution. On expiry, `requestLiveSessionModelSwitch` fires and aborts the in-flight fallback attempt, forcing a switch back to the primary, which is still overloaded. The result is a loop in which no model responds successfully until the primary itself recovers (about 7 minutes in the logs below).

## Steps to Reproduce

1. Configure primary `anthropic/claude-sonnet-4-6` with fallbacks `[anthropic/claude-opus-4-6, openai-codex/gpt-5.4, google/gemini-2.5-flash]`
2. Wait for Anthropic overload (503)
3. Observe fallback chain starting
4. Observe cooldown expiring (~30-60s) mid-flight
5. `requestLiveSessionModelSwitch` cancels fallback, switches back to Sonnet
6. Sonnet fails again → loop

## Expected Behavior

When a fallback is in-flight, `requestLiveSessionModelSwitch` should NOT cancel it. The primary should only be restored **after** the current fallback attempt completes (success or failure).

## Actual Behavior

Every fallback candidate (Opus, GPT-5.4, Gemini Flash) is aborted mid-attempt with:
```
errorPreview: "Live session model switch requested: anthropic/claude-sonnet-4-6"
```
The agent stays unresponsive for as long as the primary is overloaded, despite having 4 models configured.

## Logs

```
21:29:09 — candidate_failed: openai-codex/gpt-5.4, error: "Live session model switch requested: anthropic/claude-sonnet-4-6"
21:30:01 — candidate_failed: anthropic/claude-opus-4-6, error: "Live session model switch requested: anthropic/claude-sonnet-4-6"
21:30:01 — live session model switch detected: openai-codex/gpt-5.4 -> anthropic/claude-sonnet-4-6
21:30:50 — candidate_failed: anthropic/claude-opus-4-6 (same error, loop continues)
21:36:10 — candidate_succeeded: anthropic/claude-sonnet-4-6 (finally works after ~7 minutes of loop)
```

Pattern repeats every ~50s: Sonnet fails → Opus, GPT-5.4, and Gemini are each aborted by the switch → back to Sonnet.

## Root Cause (from source analysis)

In `auth-profiles-B5ypC5S-.js`:

```javascript
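// Overload cooldown is hardcoded and very short:
function calculateAuthProfileCooldownMs(errorCount) {
    // `normalized` was not defined in the original excerpt; clamping to >= 1 is an assumption.
    const normalized = Math.max(1, errorCount);
    if (normalized <= 1) return 30000;   // 30s
    if (normalized <= 2) return 60000;   // 60s
    return 300000;                        // 5 min
}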
```

In `login-B5O9Mtcp.js`, around line 169326:
```javascript
// This fires when cooldown expires, even during active fallback:
log.info(`live session model switch detected before attempt for ${params.sessionId}`);
throw new LiveSessionModelSwitchError(nextSelection);
```

The cooldown expiry triggers a switch request that aborts any in-flight fallback, regardless of whether the primary is actually healthy.
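
To make the timing concrete, here is a small sketch of why a 30s cooldown lands inside the fallback chain. The function name and the per-attempt durations are hypothetical and not taken from the OpenClaw source; only the 30s and 5 min cooldown values come from the excerpt above.

```javascript
// Hypothetical helper: given a cooldown and the duration of each fallback
// attempt (in order), return the index of the attempt that is in flight
// (or about to start) when the cooldown expires, or -1 if the whole chain
// finishes first.
function attemptInterruptedByCooldown(cooldownMs, attemptDurationsMs) {
  let elapsed = 0;
  for (let i = 0; i < attemptDurationsMs.length; i += 1) {
    const end = elapsed + attemptDurationsMs[i];
    if (cooldownMs < end) return i; // cooldown expires before attempt i completes
    elapsed = end;
  }
  return -1;
}

// Assumed attempt durations of 20s, 25s and 15s for the three fallbacks.
const assumedAttempts = [20000, 25000, 15000];
console.log(attemptInterruptedByCooldown(30000, assumedAttempts));  // second attempt is hit
console.log(attemptInterruptedByCooldown(300000, assumedAttempts)); // chain completes first
```

Under these assumed durations, any cooldown shorter than the total chain time (60s here) interrupts some candidate, which matches the pattern in the logs.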

## Suggested Fix

Option A: Don't fire `requestLiveSessionModelSwitch` if a fallback attempt is currently in-flight. Only check after the current attempt completes.
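
A minimal sketch of Option A, assuming the switch request can be routed through a small gate object. `DeferredModelSwitch` and its methods are hypothetical names for illustration, not existing OpenClaw APIs.

```javascript
// Hypothetical gate: switch requests that arrive while a fallback attempt is
// in flight are remembered and only released once the attempt has finished.
class DeferredModelSwitch {
  constructor() {
    this.inFlight = 0;   // number of fallback attempts currently running
    this.pending = null; // latest switch target requested mid-attempt
  }

  // Call right before a fallback attempt starts.
  beginAttempt() {
    this.inFlight += 1;
  }

  // Call when the attempt settles (success or failure). Returns the deferred
  // switch target, if any, once no attempt is in flight; otherwise null.
  endAttempt() {
    this.inFlight = Math.max(0, this.inFlight - 1);
    if (this.inFlight > 0) return null;
    const model = this.pending;
    this.pending = null;
    return model;
  }

  // Call where the cooldown expiry currently requests the switch. Returns the
  // model to switch to immediately, or null if the request was deferred.
  request(model) {
    if (this.inFlight > 0) {
      this.pending = model;
      return null;
    }
    return model;
  }
}
```

A caller would wrap each attempt in `try { gate.beginAttempt(); ... } finally { const next = gate.endAttempt(); }` and apply `next` only after the attempt has completed.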

Option B: Add a minimum grace period after an overload error before allowing switch-back (e.g., don't switch back within 5 minutes of the last overload from that provider).
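
Option B could look roughly like the following. `OverloadGrace`, its methods, and the 5 minute default are illustrative assumptions taken from the example above, not existing code.

```javascript
// Assumed default, mirroring the "5 minutes" example in Option B.
const OVERLOAD_GRACE_MS = 5 * 60 * 1000;

// Hypothetical tracker: remembers the last overload per provider and refuses
// a switch-back until the grace period has elapsed.
class OverloadGrace {
  constructor(graceMs = OVERLOAD_GRACE_MS) {
    this.graceMs = graceMs;
    this.lastOverloadAt = new Map(); // provider -> timestamp in ms
  }

  // Record an overload error for a provider.
  recordOverload(provider, now = Date.now()) {
    this.lastOverloadAt.set(provider, now);
  }

  // True if the provider has no recorded overload, or the last one is at
  // least `graceMs` old.
  canSwitchBack(provider, now = Date.now()) {
    const last = this.lastOverloadAt.get(provider);
    return last === undefined || now - last >= this.graceMs;
  }
}
```

The `now` parameter is injected only to keep the check deterministic in tests; production callers would rely on the `Date.now()` default.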

Option C: Make the overload cooldown configurable via `auth.cooldowns` (currently only `billingBackoffHours` is configurable; the overload cooldown in `calculateAuthProfileCooldownMs` is hardcoded).
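
For Option C, one possible shape is a cooldown function that takes its steps from configuration and falls back to the current hardcoded values. The function name, the `steps` parameter, and the idea of passing an array (for example from a hypothetical `auth.cooldowns.overloadCooldownsMs` key) are assumptions for illustration.

```javascript
// Current hardcoded values from calculateAuthProfileCooldownMs, used as defaults.
const DEFAULT_OVERLOAD_COOLDOWNS_MS = [30000, 60000, 300000];

// Hypothetical configurable variant: `steps[i]` is the cooldown after the
// (i + 1)-th consecutive overload error; the last step applies to all later errors.
function calculateOverloadCooldownMs(errorCount, steps = DEFAULT_OVERLOAD_COOLDOWNS_MS) {
  const table = Array.isArray(steps) && steps.length > 0 ? steps : DEFAULT_OVERLOAD_COOLDOWNS_MS;
  // Treat missing, zero, or invalid counts as the first error.
  const n = Math.max(1, Math.floor(Number(errorCount) || 1));
  return table[Math.min(n, table.length) - 1];
}

console.log(calculateOverloadCooldownMs(1));                   // default first step
console.log(calculateOverloadCooldownMs(1, [120000, 600000])); // configured first step
```

With the defaults this keeps today's behavior, so the change would be opt-in for users who need a longer first cooldown.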

## Environment

- OpenClaw version: 2026.3.28
- OS: Ubuntu Linux 6.8.0-90-generic (x64)
- Node: v22.22.1
- Primary: anthropic/claude-sonnet-4-6
- Fallbacks: anthropic/claude-opus-4-6, openai-codex/gpt-5.4, google/gemini-2.5-flash

