Fallback chain empty when session runs non-primary model (dead end after failover) #25912

@Taskle

Bug

When a session is running a non-primary model (e.g. Codex after Claude was rate-limited), resolveFallbackCandidates() returns only the configured primary as a fallback candidate; the configured fallback chain is skipped entirely.

This creates a dead-end scenario:

  1. Claude (primary) hits rate limit → session fails over to Codex (configured fallback)
  2. Codex encounters an error (timeout, 5xx, etc.)
  3. resolveFallbackCandidates() sees Codex ≠ configured primary, so modelFallbacks = []
  4. Only the configured primary (Claude) is added as a fallback candidate
  5. Claude is still in cooldown and at candidate index >0, so shouldProbePrimaryDuringCooldown returns false (it only probes index 0)
  6. All candidates exhausted → hard failure with no recovery
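
The steps above can be reproduced with a small toy model. This is only a sketch under stated assumptions, not the code in src/agents/model-fallback.ts: the ModelRef and FallbackConfig types, the helpers currentCandidates and pickModel, and the "example/backup" model are all hypothetical, and the cooldown rule is reduced to "a cooling-down model is only probed at index 0".

```typescript
// Toy model of the dead end. Every name here is illustrative, not the real API.
type ModelRef = { provider: string; model: string };
type FallbackConfig = { primary: ModelRef; fallbacks: ModelRef[] };

const key = (m: ModelRef): string => `${m.provider}/${m.model}`;

// Simplified stand-in for the current behaviour: the configured chain is
// dropped whenever the running model is not the configured primary.
function currentCandidates(running: ModelRef, cfg: FallbackConfig): ModelRef[] {
  const chain = key(running) === key(cfg.primary) ? cfg.fallbacks : [];
  const seen = new Set<string>();
  const out: ModelRef[] = [];
  for (const m of [running, ...chain, cfg.primary]) {
    if (!seen.has(key(m))) {
      seen.add(key(m));
      out.push(m);
    }
  }
  return out;
}

// Walk the candidates in order. A model that errors is skipped, and a model
// in cooldown is only probed when it sits at index 0 (assumed simplification).
function pickModel(
  candidates: ModelRef[],
  erroring: Set<string>,
  cooldown: Set<string>,
): ModelRef | undefined {
  return candidates.find((m, i) => {
    if (erroring.has(key(m))) return false;
    if (cooldown.has(key(m)) && i > 0) return false;
    return true;
  });
}

const claude: ModelRef = { provider: "anthropic", model: "claude" };
const codex: ModelRef = { provider: "openai", model: "codex" };
const backup: ModelRef = { provider: "example", model: "backup" };
const cfg: FallbackConfig = { primary: claude, fallbacks: [codex, backup] };

// The session already failed over to Codex; Codex now errors and Claude is in cooldown.
const candidates = currentCandidates(codex, cfg);
const picked = pickModel(candidates, new Set([key(codex)]), new Set([key(claude)]));

console.log(candidates.map(key)); // the backup model never appears in the list
console.log(picked); // undefined: every candidate is exhausted
```

In this model the healthy "example/backup" entry is configured but never reachable once the session has left the primary, which is the dead end described in step 6.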

Root cause

In src/agents/model-fallback.ts, resolveFallbackCandidates():

if (!sameModelCandidate(normalizedPrimary, configuredPrimary)) {
  return []; // Override model failed → go straight to configured default
}

This was intended to handle explicit --model overrides, but it also fires when the session is running a failover model. The configured fallback chain (which could include other working models) is discarded.

Impact

  • Post-failover sessions lose resilience: they can only fail back to the primary, which may still be in cooldown
  • If the primary provider is in extended rate limiting (hours/days), sessions on the fallback model are fragile
  • Creates a vicious cycle: failover → no fallback chain → hard failure → manual intervention required

Proposed fix

Remove the early return and always include the configured fallback chain:

// When running a non-default model (e.g. after failover), still include
// the configured fallback chain so all models remain reachable.
return resolveAgentModelFallbackValues(params.cfg?.agents?.defaults?.model);

The createModelCandidateCollector already deduplicates by provider+model, so there's no risk of duplicate candidates. The existing fallbacksOverride path (for explicit overrides via spawn) is preserved and takes priority.
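
To illustrate why always including the chain is safe, here is a minimal sketch of a deduplicating collector and the candidate order it would produce. It is an assumption-laden stand-in, not the real createModelCandidateCollector: createCollector, proposedCandidates, the types, and the "example/backup" model are hypothetical, and placing the configured primary last is an assumed ordering.

```typescript
// Hypothetical sketch of the proposed behaviour; names are illustrative only.
type ModelRef = { provider: string; model: string };
type FallbackConfig = { primary: ModelRef; fallbacks: ModelRef[] };

// Collector that keeps first-seen order and drops repeats of the same
// provider+model pair.
function createCollector() {
  const seen = new Set<string>();
  const list: ModelRef[] = [];
  return {
    add(m: ModelRef): void {
      const k = `${m.provider}/${m.model}`;
      if (seen.has(k)) return; // already collected, skip the duplicate
      seen.add(k);
      list.push(m);
    },
    list,
  };
}

// The configured chain is always included unless an explicit override list
// is supplied, in which case the override takes priority.
function proposedCandidates(
  running: ModelRef,
  cfg: FallbackConfig,
  fallbacksOverride?: ModelRef[],
): ModelRef[] {
  const collector = createCollector();
  collector.add(running);
  for (const m of fallbacksOverride ?? cfg.fallbacks) collector.add(m);
  collector.add(cfg.primary); // assumed: primary is appended last
  return collector.list;
}

const primaryModel: ModelRef = { provider: "anthropic", model: "claude" };
const failoverModel: ModelRef = { provider: "openai", model: "codex" };
const backupModel: ModelRef = { provider: "example", model: "backup" };
const config: FallbackConfig = {
  primary: primaryModel,
  fallbacks: [failoverModel, backupModel],
};

// Session is running the failover model; it also appears in the chain,
// so the collector drops the repeat.
const afterFailover = proposedCandidates(failoverModel, config);
console.log(afterFailover.map((m) => `${m.provider}/${m.model}`));
```

With the session on the failover model, the list in this sketch contains the running model once, then the remaining chain entry, then the primary, so a healthy backup stays reachable even while the primary is cooling down.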

Test changes

Updated 5 tests in model-fallback.test.ts to reflect the new behavior:

  • Override models now fall back through the configured chain (not straight to primary)
  • All 30 tests pass

Environment

  • Discovered while diagnosing persistent flakiness after an Anthropic rate-limit event
  • Affects any setup with model.primary + model.fallbacks configured
  • Workaround: monkey-patch resolveFallbackCandidates in the dist file
