[bug] Fallback chain aborted by premature primary restore when cooldown expires mid-flight #58578

@MANFREDFODAO

Description

When the primary model (e.g. anthropic/claude-sonnet-4-6) returns overloaded_error (503), the fallback chain starts correctly. However, the overload cooldown is very short (30s after the first error, 60s after the second), so it expires while the fallback chain is still executing. At that point requestLiveSessionModelSwitch fires and aborts the in-flight fallback attempt, forcing a switch back to the primary, which is still overloaded. The result is a loop in which no model responds until the primary itself recovers (about 7 minutes in the logs below).

Steps to Reproduce

  1. Configure primary anthropic/claude-sonnet-4-6 with fallbacks [anthropic/claude-opus-4-6, openai-codex/gpt-5.4, google/gemini-2.5-flash]
  2. Wait for Anthropic overload (503)
  3. Observe fallback chain starting
  4. Observe cooldown expiring (~30-60s) mid-flight
  5. requestLiveSessionModelSwitch cancels fallback, switches back to Sonnet
  6. Sonnet fails again → loop

Expected Behavior

When a fallback is in-flight, requestLiveSessionModelSwitch should NOT cancel it. The primary should only be restored after the current fallback attempt completes (success or failure).

Actual Behavior

Every fallback candidate (Opus, GPT-5.4, Gemini Flash) is aborted mid-attempt with:

errorPreview: "Live session model switch requested: anthropic/claude-sonnet-4-6"

The agent becomes completely unresponsive despite having 4 models configured.

Logs

21:29:09 — candidate_failed: openai-codex/gpt-5.4, error: "Live session model switch requested: anthropic/claude-sonnet-4-6"
21:30:01 — candidate_failed: anthropic/claude-opus-4-6, error: "Live session model switch requested: anthropic/claude-sonnet-4-6"
21:30:01 — live session model switch detected: openai-codex/gpt-5.4 -> anthropic/claude-sonnet-4-6
21:30:50 — candidate_failed: anthropic/claude-opus-4-6 (same error, loop continues)
21:36:10 — candidate_succeeded: anthropic/claude-sonnet-4-6 (finally works after ~7 minutes of loop)

Pattern repeats every ~50s: Sonnet fails → Opus fails → GPT-5.4 aborted by switch → Gemini aborted by switch → back to Sonnet.
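To separate candidates aborted by the switch from candidates that failed on a provider error, the excerpt above can be tallied with a short script. This is only a sketch: the line shape is inferred from the five lines quoted here, and countSwitchAborts is a hypothetical helper, not part of OpenClaw.

```javascript
// Log lines copied from the excerpt above. The "HH:MM:SS — event: ..." shape
// is inferred from these samples, not from a documented log format.
const logLines = [
  '21:29:09 — candidate_failed: openai-codex/gpt-5.4, error: "Live session model switch requested: anthropic/claude-sonnet-4-6"',
  '21:30:01 — candidate_failed: anthropic/claude-opus-4-6, error: "Live session model switch requested: anthropic/claude-sonnet-4-6"',
  '21:30:01 — live session model switch detected: openai-codex/gpt-5.4 -> anthropic/claude-sonnet-4-6',
  '21:30:50 — candidate_failed: anthropic/claude-opus-4-6 (same error, loop continues)',
  '21:36:10 — candidate_succeeded: anthropic/claude-sonnet-4-6 (finally works after ~7 minutes of loop)',
];

// Hypothetical helper: counts, per model, the candidate_failed lines whose
// error text explicitly names a live session model switch.
function countSwitchAborts(lines) {
  const pattern = /candidate_failed: ([^\s,]+), error: "Live session model switch requested/;
  const counts = {};
  for (const line of lines) {
    const match = line.match(pattern);
    if (match) counts[match[1]] = (counts[match[1]] || 0) + 1;
  }
  return counts;
}

console.log(countSwitchAborts(logLines));
```

Lines that only say "same error" are deliberately not counted, since the sketch matches the explicit error text only.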

Root Cause (from source analysis)

In auth-profiles-B5ypC5S-.js:

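// Overload cooldown is hardcoded and very short:
function calculateAuthProfileCooldownMs(errorCount) {
    // Assumption: the excerpt omits how `normalized` is derived; clamping errorCount to >= 1 is a guess.
    const normalized = Math.max(1, errorCount);
    if (normalized <= 1) return 30000;   // 30s
    if (normalized <= 2) return 60000;   // 60s
    return 300000;                        // 5 min
}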

In login-B5O9Mtcp.js, around line 169326:

// This fires when cooldown expires, even during active fallback:
log.info(`live session model switch detected before attempt for ${params.sessionId}`);
throw new LiveSessionModelSwitchError(nextSelection);

The cooldown expiry triggers a switch request that aborts any in-flight fallback, regardless of whether the primary is actually healthy.
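The timing race can be illustrated with a small model. This is a sketch under stated assumptions, not OpenClaw code: fallbackOutcome and its parameters are hypothetical, the overload is taken to occur at t = 0, and the attempt start time and durations are made-up values chosen only to show the effect.

```javascript
// Cooldown schedule as quoted in this issue (30s, 60s, then 5 min).
function cooldownMs(errorCount) {
  if (errorCount <= 1) return 30000;
  if (errorCount <= 2) return 60000;
  return 300000;
}

// Hypothetical model of the race: a fallback attempt is aborted when the
// primary's cooldown expires while that attempt is still running.
function fallbackOutcome({ errorCount, attemptStartMs, attemptDurationMs }) {
  const cooldownExpiresAt = cooldownMs(errorCount); // overload observed at t = 0
  const attemptEndsAt = attemptStartMs + attemptDurationMs;
  const expiresMidFlight =
    cooldownExpiresAt > attemptStartMs && cooldownExpiresAt < attemptEndsAt;
  return expiresMidFlight ? "aborted" : "completed";
}

// Assumed timings: the fallback starts 2s after the overload.
// 30s cooldown expires inside a 45s attempt:
console.log(fallbackOutcome({ errorCount: 1, attemptStartMs: 2000, attemptDurationMs: 45000 })); // aborted
// 60s cooldown outlasts the same 45s attempt:
console.log(fallbackOutcome({ errorCount: 2, attemptStartMs: 2000, attemptDurationMs: 45000 })); // completed
// ...but a 90s attempt is still cut off by the 60s cooldown:
console.log(fallbackOutcome({ errorCount: 2, attemptStartMs: 2000, attemptDurationMs: 90000 })); // aborted
```

Under this model, any fallback attempt that runs longer than the remaining cooldown is lost, whether or not the primary has actually recovered.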

Suggested Fix

Option A: Don't fire requestLiveSessionModelSwitch if a fallback attempt is currently in-flight. Only check after the current attempt completes.
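One possible shape for Option A is a gate that records a switch request while an attempt is running and hands it back only once that attempt settles. The sketch below is illustrative: DeferredSwitchGate, requestSwitch, and runAttempt are hypothetical names, and how this would hook into the real requestLiveSessionModelSwitch path is not known from the excerpt.

```javascript
// Hypothetical gate; none of these names exist in OpenClaw.
class DeferredSwitchGate {
  constructor() {
    this.inFlight = 0;            // number of fallback attempts currently running
    this.pendingSelection = null; // switch requested while an attempt was running
  }

  // Called when the primary's cooldown expires. Returns the selection to apply
  // immediately, or null if the request was deferred.
  requestSwitch(selection) {
    if (this.inFlight > 0) {
      this.pendingSelection = selection; // remember it, but abort nothing
      return null;
    }
    return selection;
  }

  // Runs one fallback attempt to completion (success or failure) and reports
  // any switch that was deferred while it ran.
  async runAttempt(attempt) {
    this.inFlight += 1;
    let outcome;
    try {
      outcome = { ok: true, value: await attempt() };
    } catch (error) {
      outcome = { ok: false, error };
    }
    this.inFlight -= 1;
    let deferredSwitch = null;
    if (this.inFlight === 0) {
      deferredSwitch = this.pendingSelection;
      this.pendingSelection = null;
    }
    return { ...outcome, deferredSwitch };
  }
}

async function demo() {
  const gate = new DeferredSwitchGate();
  // Stand-in for a fallback request that takes 50ms to answer.
  const attempt = () => new Promise((resolve) => setTimeout(() => resolve("fallback reply"), 50));
  const running = gate.runAttempt(attempt);
  // The cooldown expires while the fallback is still running.
  const immediate = gate.requestSwitch("anthropic/claude-sonnet-4-6");
  const outcome = await running;
  return { immediate, outcome };
}

demo().then((result) => console.log(result));
```

In this sketch the fallback's reply is delivered, and the caller can then decide whether to apply deferredSwitch for the next request.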

Option B: Add a minimum grace period after an overload error before allowing switch-back (e.g., don't switch back within 5 minutes of the last overload from that provider).
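Option B could be as simple as remembering the last overload per provider and refusing switch-back until a grace period has elapsed. The following is a minimal sketch; OverloadTracker and its methods are hypothetical, and the 5-minute default is just the figure suggested above.

```javascript
const OVERLOAD_GRACE_MS = 5 * 60 * 1000; // 5 minutes, as suggested above

// Hypothetical tracker; not an OpenClaw API.
class OverloadTracker {
  constructor(graceMs = OVERLOAD_GRACE_MS) {
    this.graceMs = graceMs;
    this.lastOverloadAt = new Map(); // provider -> timestamp in ms
  }

  recordOverload(provider, nowMs) {
    this.lastOverloadAt.set(provider, nowMs);
  }

  // Switch-back is allowed only if the provider has never been seen overloaded
  // or the grace period since its last overload has fully elapsed.
  canSwitchBack(provider, nowMs) {
    const last = this.lastOverloadAt.get(provider);
    return last === undefined || nowMs - last >= this.graceMs;
  }
}

const tracker = new OverloadTracker();
tracker.recordOverload("anthropic", 0);
console.log(tracker.canSwitchBack("anthropic", 30000));  // false: only 30s have passed
console.log(tracker.canSwitchBack("anthropic", 300000)); // true: 5 minutes have passed
```

Timestamps are passed in explicitly so the behaviour can be checked without waiting on a real clock.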

Option C: Make the overload cooldown configurable via auth.cooldowns (currently only billingBackoffHours is configurable; the overload cooldown in calculateAuthProfileCooldownMs is hardcoded).
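For Option C, the hardcoded schedule could fall back to defaults when no override is configured. The sketch below assumes a made-up key, overloadCooldownsMs, purely for illustration; the issue only establishes that billingBackoffHours is configurable today, and resolveOverloadCooldownMs is a hypothetical function.

```javascript
// Defaults mirror the hardcoded values quoted in the root-cause section.
const DEFAULT_OVERLOAD_COOLDOWNS_MS = [30000, 60000, 300000];

// Hypothetical: `overloadCooldownsMs` is an invented config key, not an
// existing auth.cooldowns option.
function resolveOverloadCooldownMs(errorCount, cooldowns = {}) {
  const override = cooldowns.overloadCooldownsMs;
  const schedule =
    Array.isArray(override) && override.length > 0
      ? override
      : DEFAULT_OVERLOAD_COOLDOWNS_MS;
  // Error counts past the end of the schedule reuse its last entry.
  const index = Math.min(Math.max(1, errorCount), schedule.length) - 1;
  return schedule[index];
}

console.log(resolveOverloadCooldownMs(1)); // 30000 (default)
console.log(resolveOverloadCooldownMs(1, { overloadCooldownsMs: [300000, 600000] })); // 300000
console.log(resolveOverloadCooldownMs(5, { overloadCooldownsMs: [300000, 600000] })); // 600000
```

With no override the function reproduces the current 30s / 60s / 5 min behaviour, so existing setups would be unaffected.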

Environment

  • OpenClaw version: 2026.3.28
  • OS: Ubuntu Linux 6.8.0-90-generic (x64)
  • Node: v22.22.1
  • Primary: anthropic/claude-sonnet-4-6
  • Fallbacks: anthropic/claude-opus-4-6, openai-codex/gpt-5.4, google/gemini-2.5-flash
