fix: include configured fallback chain when running non-primary model #25922

Closed
Taskle wants to merge 1 commit into openclaw:main from Taskle:fix/model-fallback-chain

Taskle (Contributor) commented Feb 25, 2026

Summary

When a session runs a non-primary model (e.g. after failover from Claude to Codex), resolveFallbackCandidates() skips the configured fallback chain and only adds the configured primary as a fallback. If the primary's provider is still in cooldown and the primary sits at a candidate index >0 (so it is not eligible for probing), every candidate is exhausted, leaving a dead end with no recovery path.

Fixes #25912

Problem

In src/agents/model-fallback.ts, this guard discards the entire fallback chain for non-primary models:

if (!sameModelCandidate(normalizedPrimary, configuredPrimary)) {
  return []; // Override model failed → go straight to configured default
}

This was intended for explicit --model overrides, but it also fires when the session is running a failover model. The result:

  1. Claude (primary) hits rate limit → session fails over to Codex
  2. Codex encounters an error → resolveFallbackCandidates() returns only Claude as a fallback
  3. Claude is at candidate index 1 (not 0), so shouldProbePrimaryDuringCooldown returns false
  4. All candidates exhausted → hard failure, no recovery path
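
The four steps above can be reproduced with a small sketch. Only the function names quoted in this description come from the PR; the types, the helper bodies, the provider names, and the probe rule (a cooling-down provider may be tried only at index 0) are assumptions made for illustration, not the code in src/agents/model-fallback.ts.

```typescript
// Hypothetical model reference; the real type in the repo may differ.
type ModelRef = { provider: string; model: string };

const sameModel = (a: ModelRef, b: ModelRef): boolean =>
  a.provider === b.provider && a.model === b.model;

// Old behavior as described above: a non-primary current model gets an empty
// fallback chain, so the list collapses to [current, primary].
function resolveCandidatesOld(
  current: ModelRef,
  primary: ModelRef,
  fallbacks: ModelRef[],
): ModelRef[] {
  const chain = sameModel(current, primary) ? fallbacks : [];
  const out: ModelRef[] = [];
  for (const c of [current, ...chain, primary]) {
    if (!out.some((seen) => sameModel(seen, c))) out.push(c);
  }
  return out;
}

// Assumed selection rule: skip failing providers, and allow a provider that
// is in cooldown only when it sits at candidate index 0 (probe-eligible).
function firstUsable(
  candidates: ModelRef[],
  coolingDown: Set<string>,
  failing: Set<string>,
): ModelRef | undefined {
  return candidates.find((c, index) => {
    if (failing.has(c.provider)) return false;
    if (coolingDown.has(c.provider) && index > 0) return false;
    return true;
  });
}

// Scenario: Claude (primary) is rate limited, the session runs Codex, and
// Codex now errors. A third configured fallback exists but is never tried.
const claude: ModelRef = { provider: "anthropic", model: "claude" };
const codex: ModelRef = { provider: "openai", model: "codex" };
const gemini: ModelRef = { provider: "google", model: "gemini" };

const oldCandidates = resolveCandidatesOld(codex, claude, [codex, gemini]);
const picked = firstUsable(
  oldCandidates,
  new Set(["anthropic"]), // providers in cooldown
  new Set(["openai"]), // providers currently failing
);
// oldCandidates holds codex and claude only; picked is undefined (dead end).
```

In this sketch the third configured model never enters the candidate list, which is the gap the PR sets out to close.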

Fix

Remove the early return and always include the configured fallback chain. The createModelCandidateCollector already deduplicates by provider+model, so there's no risk of duplicate candidates. The fallbacksOverride path (for explicit spawn overrides) is preserved and takes priority.

Before

Non-primary model fails → candidates: [currentModel, configuredPrimary]

After

Non-primary model fails → candidates: [currentModel, ...configuredFallbacks, configuredPrimary]
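
A minimal sketch of the proposed behavior, assuming a collector that deduplicates on a provider+model key. createCollector is a hypothetical stand-in for createModelCandidateCollector, whose real signature is not shown in this PR, and the model names are placeholders.

```typescript
type ModelRef = { provider: string; model: string };

// Hypothetical stand-in for createModelCandidateCollector: keeps insertion
// order and drops any candidate whose provider+model key was already added.
function createCollector() {
  const seen = new Set<string>();
  const candidates: ModelRef[] = [];
  return {
    add(candidate: ModelRef): void {
      const key = `${candidate.provider}/${candidate.model}`;
      if (seen.has(key)) return;
      seen.add(key);
      candidates.push(candidate);
    },
    candidates,
  };
}

// Proposed behavior: always walk the configured chain, then the primary,
// regardless of whether the current model is the configured primary.
function resolveCandidatesProposed(
  current: ModelRef,
  primary: ModelRef,
  fallbacks: ModelRef[],
): ModelRef[] {
  const collector = createCollector();
  collector.add(current);
  for (const fallback of fallbacks) collector.add(fallback);
  collector.add(primary);
  return collector.candidates;
}

const claude: ModelRef = { provider: "anthropic", model: "claude" };
const codex: ModelRef = { provider: "openai", model: "codex" };
const gemini: ModelRef = { provider: "google", model: "gemini" };

// Session running a failover model: the rest of the chain stays reachable.
const afterFailover = resolveCandidatesProposed(codex, claude, [codex, gemini])
  .map((c) => c.model);

// Session running the primary: the trailing primary is deduplicated away.
const onPrimary = resolveCandidatesProposed(claude, claude, [codex, gemini])
  .map((c) => c.model);
```

Because the collector ignores repeated keys, adding the current model, the whole chain, and the primary unconditionally cannot produce duplicates, which is why the guard can be dropped without extra bookkeeping.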

Changes

  • src/agents/model-fallback.ts: Remove sameModelCandidate guard and configuredPrimary variable (both now unused). Replace with comment explaining the design decision.
  • src/agents/model-fallback.test.ts: Update 5 tests to reflect new behavior — override models now fall back through the configured chain instead of jumping straight to primary. Remove createOverrideFailureRun helper (no longer needed). All 30 tests pass.

Testing

npx vitest run src/agents/model-fallback.test.ts
# ✓ 30 tests passed

Edge cases considered

  • Primary running, primary fails: candidates unchanged — [primary, ...fallbacks, primary(deduped)]
  • Override model running, override fails: improved — now tries [override, ...fallbacks, primary] instead of [override, primary]
  • Fallback model running (after failover), fallback fails: fixed — [fallback, ...otherFallbacks, primary] instead of [fallback, primary]
  • fallbacksOverride set (explicit spawn): unchanged — takes priority before this code path
  • Allowlist enforcement: unchanged — fallback candidates use enforceAllowlist: true, primary/override use false
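
The last two edge cases can be sketched as follows. This is an illustration under assumptions: the option names, the allowlist representation, and the choice to still append the primary after a fallbacksOverride are all hypothetical, since the PR only states that the override takes priority and that fallback candidates are added with enforceAllowlist: true.

```typescript
type ModelRef = { provider: string; model: string };

const keyOf = (m: ModelRef): string => `${m.provider}/${m.model}`;

// Hypothetical options bag; the real resolver's parameters are not shown here.
interface ResolveOptions {
  current: ModelRef;
  primary: ModelRef;
  configuredFallbacks: ModelRef[];
  fallbacksOverride?: ModelRef[]; // explicit spawn override, if any
  allowlist: Set<string>; // keys of models permitted as fallbacks
}

function resolveWithAllowlist(opts: ResolveOptions): string[] {
  const seen = new Set<string>();
  const out: string[] = [];
  const add = (m: ModelRef, enforceAllowlist: boolean): void => {
    const key = keyOf(m);
    if (seen.has(key)) return; // dedupe by provider+model
    if (enforceAllowlist && !opts.allowlist.has(key)) return;
    seen.add(key);
    out.push(key);
  };
  add(opts.current, false); // current/override model: allowlist not enforced
  // Assumption: an explicit override replaces the configured chain entirely.
  for (const m of opts.fallbacksOverride ?? opts.configuredFallbacks) {
    add(m, true); // fallback candidates: allowlist enforced
  }
  add(opts.primary, false); // primary: allowlist not enforced
  return out;
}

const claude: ModelRef = { provider: "anthropic", model: "claude" };
const codex: ModelRef = { provider: "openai", model: "codex" };
const gemini: ModelRef = { provider: "google", model: "gemini" };
const blocked: ModelRef = { provider: "example", model: "blocked" };
const local: ModelRef = { provider: "example", model: "local" };

const allowlist = new Set(["openai/codex", "google/gemini", "example/local"]);

// Configured chain: the non-allowlisted fallback is skipped, the primary stays.
const viaChain = resolveWithAllowlist({
  current: codex,
  primary: claude,
  configuredFallbacks: [codex, blocked, gemini],
  allowlist,
});

// Explicit override: the configured chain is not consulted at all.
const viaOverride = resolveWithAllowlist({
  current: codex,
  primary: claude,
  configuredFallbacks: [codex, gemini],
  fallbacksOverride: [local],
  allowlist,
});
```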

Greptile Summary

Removes early return that skipped the configured fallback chain when running non-primary models, fixing a dead-end scenario where sessions could fail to recover after failover. The fix ensures all models in the configured fallback chain remain reachable even when running override or failover models, while deduplication in createModelCandidateCollector prevents duplicate candidates.

  • Removed unused sameModelCandidate function and configuredPrimary variable (both became unnecessary after removing the guard)
  • Replaced removed logic with explanatory comment documenting the design decision
  • Updated 5 tests to reflect new behavior where override models fall back through the configured chain instead of jumping directly to the primary
  • Removed createOverrideFailureRun test helper (no longer needed with updated test approach)
  • All 30 tests reported passing in PR description

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The change is well-reasoned and addresses a clear bug in the fallback logic. The removal of the early return is straightforward, and deduplication ensures no duplicate candidates are added. Test updates properly reflect the new behavior, with all tests passing. The change improves robustness by preventing dead-end scenarios during failover recovery.
  • No files require special attention

Last reviewed commit: 61b9dcc

When a session runs a non-primary model (e.g. after failover),
resolveFallbackCandidates() previously returned an empty fallback list.
This meant only the configured primary was available as a fallback.

If the primary provider was in cooldown and at candidate index >0
(not probe-eligible), all candidates would be exhausted with no
recovery path — creating a dead end after failover.

Now the full configured fallback chain is included regardless of
whether the current model matches the configured primary, giving
non-primary sessions the same resilience as primary sessions.

Fixes openclaw#25912
openclaw-barnacle (bot) added the "agents" (Agent runtime and tooling) and "size: S" labels Feb 25, 2026
gumadeiras self-assigned this Feb 25, 2026
steipete (Contributor) commented

Reviewed and landed on main as commit bf5a96ad6 with a scoped reimplementation.

What shipped:

  • src/agents/model-fallback.ts
    • Preserves configured fallback-chain traversal when the current run model is itself one of the configured fallbacks (post-failover path).
    • Keeps legacy behavior for ad-hoc override models outside the configured chain (still collapses to primary-only fallback).
    • Result: avoids dead-end candidate sets during fallback-on-fallback retries while minimizing behavior expansion.
  • src/agents/model-fallback.test.ts
    • Added regression: when current model is a configured fallback and fails, resolver continues through remaining configured fallbacks.
  • CHANGELOG.md
    • Added user-facing fix note under 2026.2.24 (Unreleased).
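
The scoped behavior described in this list can be sketched as below. This is not the landed code from bf5a96ad6; the helper names and model names are hypothetical, and including the whole configured chain (with deduplication) rather than only the entries after the current model is an assumption.

```typescript
type ModelRef = { provider: string; model: string };

const sameModel = (a: ModelRef, b: ModelRef): boolean =>
  a.provider === b.provider && a.model === b.model;

// Scoped rule: traverse the configured chain only when the current model is
// the primary or is itself a configured fallback; an ad-hoc override outside
// the chain keeps the legacy primary-only fallback.
function resolveCandidatesScoped(
  current: ModelRef,
  primary: ModelRef,
  fallbacks: ModelRef[],
): ModelRef[] {
  const inConfiguredChain =
    sameModel(current, primary) ||
    fallbacks.some((f) => sameModel(f, current));
  const chain = inConfiguredChain ? fallbacks : [];
  const out: ModelRef[] = [];
  for (const c of [current, ...chain, primary]) {
    if (!out.some((seen) => sameModel(seen, c))) out.push(c);
  }
  return out;
}

const claude: ModelRef = { provider: "anthropic", model: "claude" };
const codex: ModelRef = { provider: "openai", model: "codex" };
const gemini: ModelRef = { provider: "google", model: "gemini" };
const adHoc: ModelRef = { provider: "example", model: "ad-hoc" };

// Post-failover path: the current model is a configured fallback.
const postFailover = resolveCandidatesScoped(codex, claude, [codex, gemini])
  .map((c) => c.model);

// Ad-hoc override outside the chain: legacy primary-only fallback.
const adHocOverride = resolveCandidatesScoped(adHoc, claude, [codex, gemini])
  .map((c) => c.model);
```

Compared with removing the guard outright, this keeps override runs on their previous path while still giving fallback-on-fallback retries somewhere to go.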

Validation:

  • Full gate passed: pnpm lint && pnpm build && pnpm test.
  • Focused tests passed: pnpm test src/agents/model-fallback.test.ts src/agents/model-fallback.probe.test.ts.

Thanks for the issue analysis and patch direction, @Taskle.

steipete (Contributor) commented

Landed on main in bf5a96a with scoped fallback-chain fix + regression coverage; closing in favor of landed commit.

steipete closed this Feb 25, 2026
margulans pushed a commit to margulans/Neiron-AI-assistant that referenced this pull request Feb 25, 2026
Jackson3195 pushed a commit to Jackson3195/openclaw-with-a-personal-touch that referenced this pull request Feb 25, 2026
brianleach pushed a commit to brianleach/openclaw that referenced this pull request Feb 26, 2026
execute008 pushed a commit to execute008/openclaw that referenced this pull request Feb 27, 2026
r4jiv007 pushed a commit to r4jiv007/openclaw that referenced this pull request Feb 28, 2026
zooqueen pushed a commit to hanzoai/bot that referenced this pull request Mar 6, 2026
thebenjaminlee pushed a commit to escape-velocity-ventures/openclaw that referenced this pull request Mar 7, 2026

Development

Successfully merging this pull request may close these issues.

Fallback chain empty when session runs non-primary model (dead end after failover)
