Description
Bug
When a session is running a non-primary model (e.g. Codex after Claude was rate-limited), resolveFallbackCandidates() returns only the configured primary as a fallback — the configured fallback chain is skipped.
This creates a dead-end scenario:
- Claude (primary) hits rate limit → session fails over to Codex (configured fallback)
- Codex encounters an error (timeout, 5xx, etc.)
- resolveFallbackCandidates() sees Codex ≠ configured primary, so modelFallbacks = []
- Only the configured primary (Claude) is added as a fallback candidate
- Claude is still in cooldown and at candidate index >0, so shouldProbePrimaryDuringCooldown returns false (it only probes index 0)
- All candidates exhausted → hard failure with no recovery
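The sequence above can be reproduced with a small simulation. This is only a sketch under stated assumptions: `runWithFallback`, `shouldProbeDuringCooldown`, the provider names, and the `third-model` entry are hypothetical stand-ins, not the actual code in src/agents/model-fallback.ts. It models just the two rules described here: a cooling-down candidate is probed only at index 0, and a failing candidate is skipped.

```typescript
// Hypothetical, simplified model of the candidate loop; not the project's real code.
type Candidate = { provider: string; model: string };

const key = (c: Candidate): string => `${c.provider}/${c.model}`;

// Assumed rule from the report: a cooling-down candidate is only probed at index 0.
function shouldProbeDuringCooldown(index: number): boolean {
  return index === 0;
}

function runWithFallback(
  candidates: Candidate[],
  coolingDown: Set<string>,
  failing: Set<string>,
): string {
  for (let index = 0; index < candidates.length; index++) {
    const id = key(candidates[index]);
    // Cooling-down candidates past index 0 are skipped without an attempt.
    if (coolingDown.has(id) && !shouldProbeDuringCooldown(index)) continue;
    // The attempt is made but fails (timeout, 5xx, ...).
    if (failing.has(id)) continue;
    return id;
  }
  return "exhausted";
}

// Provider and model names are illustrative assumptions.
const claude: Candidate = { provider: "anthropic", model: "claude" };
const codex: Candidate = { provider: "openai", model: "codex" };
const third: Candidate = { provider: "example", model: "third-model" };

const coolingDown = new Set<string>([key(claude)]);
const failing = new Set<string>([key(codex)]);

// Current behaviour: the session model, then only the configured primary.
const deadEnd = runWithFallback([codex, claude], coolingDown, failing);

// With the configured chain included, another model stays reachable.
const recovered = runWithFallback([codex, third, claude], coolingDown, failing);

console.log(deadEnd);   // "exhausted"
console.log(recovered); // "example/third-model"
```

The first call ends in the dead end because Codex fails and Claude sits at index 1 while cooling down. The second call recovers only because a third candidate is present in the list.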
Root cause
In src/agents/model-fallback.ts, resolveFallbackCandidates():
if (!sameModelCandidate(normalizedPrimary, configuredPrimary)) {
  return []; // Override model failed → go straight to configured default
}
This was intended to handle explicit --model overrides, but it also fires when the session is running a failover model. The configured fallback chain (which could include other working models) is discarded.
Impact
- Post-failover sessions lose resilience — they can only fail back to the (possibly still-cooldown) primary
- If the primary provider is in extended rate limiting (hours/days), sessions on the fallback model are fragile
- Creates a vicious cycle: failover → no fallback chain → hard failure → manual intervention required
Proposed fix
Remove the early return and always include the configured fallback chain:
// When running a non-default model (e.g. after failover), still include
// the configured fallback chain so all models remain reachable.
return resolveAgentModelFallbackValues(params.cfg?.agents?.defaults?.model);
The createModelCandidateCollector already deduplicates by provider+model, so there's no risk of duplicate candidates. The existing fallbacksOverride path (for explicit overrides via spawn) is preserved and takes priority.
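To illustrate why always returning the configured chain is safe, here is a rough sketch of how a deduplicating collector would combine the current model, the chain, and the configured primary. `createCollector` and `resolveCandidates` are hypothetical simplifications, and the candidate order (current model, chain, then primary) is an assumption; the real createModelCandidateCollector and resolveFallbackCandidates may differ in signature and ordering.

```typescript
// Hypothetical stand-ins; the real helpers in model-fallback.ts may differ.
type ModelRef = { provider: string; model: string };

function createCollector() {
  const seen = new Set<string>();
  const candidates: ModelRef[] = [];
  return {
    add(ref: ModelRef): void {
      const id = `${ref.provider}/${ref.model}`;
      if (seen.has(id)) return; // a repeated provider+model pair is dropped
      seen.add(id);
      candidates.push(ref);
    },
    candidates,
  };
}

function resolveCandidates(
  current: ModelRef,
  configuredPrimary: ModelRef,
  configuredFallbacks: ModelRef[],
  fallbacksOverride?: ModelRef[],
): ModelRef[] {
  const collector = createCollector();
  collector.add(current);
  // An explicit override takes priority; otherwise the configured chain is
  // always used, even when the current model is not the configured primary.
  const chain = fallbacksOverride !== undefined ? fallbacksOverride : configuredFallbacks;
  chain.forEach((ref) => collector.add(ref));
  collector.add(configuredPrimary);
  return collector.candidates;
}

// Model names below are illustrative assumptions.
const primaryModel: ModelRef = { provider: "anthropic", model: "claude" };
const failoverModel: ModelRef = { provider: "openai", model: "codex" };
const extraModel: ModelRef = { provider: "example", model: "third-model" };

// Session is running the failover model; the chain lists it again.
const resolved = resolveCandidates(failoverModel, primaryModel, [failoverModel, extraModel]);

// An explicit empty override reproduces the "straight to primary" path.
const overridden = resolveCandidates(failoverModel, primaryModel, [failoverModel, extraModel], []);

console.log(resolved.map((r) => r.model));   // ["codex", "third-model", "claude"]
console.log(overridden.map((r) => r.model)); // ["codex", "claude"]
```

In this sketch the failover model appears both as the current model and inside the chain, yet the result contains it only once, which is the property the proposed fix relies on.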
Test changes
Updated 5 tests in model-fallback.test.ts to reflect the new behavior:
- Override models now fall back through the configured chain (not straight to primary)
- All 30 tests pass
Environment
- Discovered while diagnosing persistent flakiness after an Anthropic rate-limit event
- Affects any setup with model.primary + model.fallbacks configured
- Workaround: monkeypatch resolveFallbackCandidates in the dist file