[bug] Fallback chain aborted by premature primary restore when cooldown expires mid-flight
Description
When the primary model (e.g. anthropic/claude-sonnet-4-6) returns overloaded_error (503), the fallback chain starts correctly. However, the overload cooldown is very short (30s after the 1st error, 60s after the 2nd), so it expires while the fallback chain is still executing. At that point requestLiveSessionModelSwitch fires and aborts the in-flight fallback attempt, forcing a switch back to the primary, which is still overloaded. The result is a loop in which no model responds until the primary itself recovers (about 7 minutes in the logs below).
Steps to Reproduce
- Configure primary anthropic/claude-sonnet-4-6 with fallbacks [anthropic/claude-opus-4-6, openai-codex/gpt-5.4, google/gemini-2.5-flash]
- Wait for Anthropic overload (503)
- Observe fallback chain starting
- Observe cooldown expiring (~30-60s) mid-flight
- requestLiveSessionModelSwitch cancels fallback, switches back to Sonnet
- Sonnet fails again → loop
Expected Behavior
When a fallback is in-flight, requestLiveSessionModelSwitch should NOT cancel it. The primary should only be restored after the current fallback attempt completes (success or failure).
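A minimal sketch of the expected behavior, assuming a small coordinator object sits between the cooldown timer and the fallback runner. The class and method names here (DeferredModelSwitch, request, runAttempt, takePending) are hypothetical and do not come from the OpenClaw source; the point is only that a switch request arriving mid-attempt is recorded rather than thrown.

```javascript
// Hypothetical sketch: none of these names exist in the OpenClaw source.
class DeferredModelSwitch {
  constructor() {
    this.inFlight = false; // true while a fallback attempt is running
    this.pending = null;   // latest switch target requested during an attempt
  }

  // Called when the primary's cooldown expires.
  // Returns the model to switch to now, or null if the switch was deferred.
  request(model) {
    if (this.inFlight) {
      this.pending = model; // remember the request, but do not abort the attempt
      return null;
    }
    return model;
  }

  // Returns and clears any switch that was deferred during the last attempt.
  takePending() {
    const model = this.pending;
    this.pending = null;
    return model;
  }

  // Runs one attempt to completion (success or failure), then reports
  // the deferred switch target so the caller can restore the primary.
  async runAttempt(attemptFn) {
    this.inFlight = true;
    try {
      const result = await attemptFn();
      return { ok: true, result, switchTo: this.takePending() };
    } catch (error) {
      return { ok: false, error, switchTo: this.takePending() };
    } finally {
      this.inFlight = false;
    }
  }
}
```

With this shape, the caller would inspect switchTo only after runAttempt resolves, so a fallback that is about to succeed is never cancelled by the cooldown timer.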
Actual Behavior
Every fallback candidate (Opus, GPT-5.4, Gemini Flash) is aborted mid-attempt with:
errorPreview: "Live session model switch requested: anthropic/claude-sonnet-4-6"
The agent becomes completely unresponsive despite having 4 models configured.
Logs
21:29:09 — candidate_failed: openai-codex/gpt-5.4, error: "Live session model switch requested: anthropic/claude-sonnet-4-6"
21:30:01 — candidate_failed: anthropic/claude-opus-4-6, error: "Live session model switch requested: anthropic/claude-sonnet-4-6"
21:30:01 — live session model switch detected: openai-codex/gpt-5.4 -> anthropic/claude-sonnet-4-6
21:30:50 — candidate_failed: anthropic/claude-opus-4-6 (same error, loop continues)
21:36:10 — candidate_succeeded: anthropic/claude-sonnet-4-6 (finally works after ~7 minutes of loop)
Pattern repeats every ~50s: Sonnet fails → Opus fails → GPT-5.4 aborted by switch → Gemini aborted by switch → back to Sonnet.
Root Cause (from source analysis)
In auth-profiles-B5ypC5S-.js:
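// Overload cooldown is hardcoded and very short:
function calculateAuthProfileCooldownMs(errorCount) {
  // The excerpt omits how `normalized` is derived; clamping errorCount to >= 1 is assumed here.
  const normalized = Math.max(1, errorCount);
  if (normalized <= 1) return 30000; // 30s
  if (normalized <= 2) return 60000; // 60s
  return 300000; // 5 min
}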
In login-B5O9Mtcp.js, around line 169326:
// This fires when cooldown expires, even during active fallback:
log.info(`live session model switch detected before attempt for ${params.sessionId}`);
throw new LiveSessionModelSwitchError(nextSelection);
The cooldown expiry triggers a switch request that aborts any in-flight fallback, regardless of whether the primary is actually healthy.
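The race can be illustrated with a small timeline model. This is only a sketch: the helper name attemptInFlightAt and the per-attempt durations are hypothetical values chosen for illustration, not measurements from the logs. It shows which fallback attempt would be running at the moment a given cooldown expires.

```javascript
// Hypothetical model of the race; attempt durations are illustrative, not measured.
// Attempts run back to back, starting at t = 0 when the primary fails.
function attemptInFlightAt(expiryMs, attempts) {
  let start = 0;
  for (const { model, durationMs } of attempts) {
    const end = start + durationMs;
    if (expiryMs >= start && expiryMs < end) return model; // cooldown expires mid-attempt
    start = end;
  }
  return null; // the whole chain finished before the cooldown expired
}

const chain = [
  { model: "anthropic/claude-opus-4-6", durationMs: 20000 },
  { model: "openai-codex/gpt-5.4", durationMs: 25000 },
  { model: "google/gemini-2.5-flash", durationMs: 15000 },
];

// A 30s cooldown lands inside the second attempt (20s to 45s).
console.log(attemptInFlightAt(30000, chain));  // "openai-codex/gpt-5.4"
// A 5 min cooldown outlasts the whole 60s chain, so nothing is interrupted.
console.log(attemptInFlightAt(300000, chain)); // null
```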
Suggested Fix
Option A: Don't fire requestLiveSessionModelSwitch if a fallback attempt is currently in-flight. Only check after the current attempt completes.
Option B: Add a minimum grace period after an overload error before allowing switch-back (e.g., don't switch back within 5 minutes of the last overload from that provider).
Option C: Make the overload cooldown configurable via auth.cooldowns (currently only billingBackoffHours is configurable; the overload cooldown in calculateAuthProfileCooldownMs is hardcoded).
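Options B and C could look roughly like the sketch below. Everything here is an assumption for illustration: overloadCooldownMs, canRestorePrimary, the stepsMs parameter, and minGraceMs are hypothetical names, and no such keys are known to exist under auth.cooldowns today. The default steps mirror the hardcoded 30s / 60s / 5 min values quoted from calculateAuthProfileCooldownMs.

```javascript
// Hypothetical sketch: these function and option names are assumptions,
// not existing OpenClaw APIs or config keys.
const DEFAULT_OVERLOAD_COOLDOWNS_MS = [30000, 60000, 300000]; // 30s, 60s, 5 min

// Option C: take the cooldown steps from configuration instead of hardcoding them.
// errorCount is clamped so that counts past the last step reuse the last step.
function overloadCooldownMs(errorCount, stepsMs = DEFAULT_OVERLOAD_COOLDOWNS_MS) {
  const index = Math.min(Math.max(errorCount, 1), stepsMs.length) - 1;
  return stepsMs[index];
}

// Option B: refuse to switch back until a minimum grace period has passed
// since the last overload error from that provider.
function canRestorePrimary(nowMs, lastOverloadAtMs, minGraceMs = 5 * 60 * 1000) {
  if (lastOverloadAtMs == null) return true; // no overload recorded
  return nowMs - lastOverloadAtMs >= minGraceMs;
}
```

The two checks are independent, so either could be adopted on its own: a user who raises the first step to several minutes gets most of the benefit of Option B without a separate grace-period setting.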
Environment
- OpenClaw version: 2026.3.28
- OS: Ubuntu Linux 6.8.0-90-generic (x64)
- Node: v22.22.1
- Primary: anthropic/claude-sonnet-4-6
- Fallbacks: anthropic/claude-opus-4-6, openai-codex/gpt-5.4, google/gemini-2.5-flash