fix(agent): reset fallback chain index when activation fails in interactive sessions#20793
Closed
konsisumer wants to merge 1 commit into
Closed
Conversation
… activated In long-lived interactive sessions, _try_activate_fallback() advances _fallback_index before attempting client resolution. When resolution fails (provider not configured, etc.) the function returns False without ever setting _fallback_activated=True. _restore_primary_runtime() then skips its reset block entirely (guarded by `if not _fallback_activated`), leaving _fallback_index >= len(_fallback_chain) for all subsequent turns. The eager-fallback guard at the top of the retry loop checks `_fallback_index < len(_fallback_chain)`, so the condition fails silently and no fallback is ever attempted again for that session. Cron jobs spawn a fresh AIAgent per run and never hit this path, which is why the same fallback chain works reliably for cron but not interactive. Fix: reset _fallback_index=0 in the `not _fallback_activated` early-return branch so every new turn starts with the full chain available. Fixes NousResearch#20465
Collaborator
Contributor
|
Merged via #27185 — cherry-picked onto current main with your authorship preserved via rebase-merge. Thanks @konsisumer! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changed and why
In long-lived interactive CLI sessions (and gateway agent-cache hits), a single
AIAgentinstance is reused across multiple turns._try_activate_fallback()increments_fallback_indexbefore resolving the fallback provider's client. When client resolution fails (provider not configured, auth missing, etc.) the function returnsFalsewithout ever setting_fallback_activated = True._restore_primary_runtime()was guarded byif not self._fallback_activated: return False, so it skipped its reset block entirely._fallback_indexwas left at>= len(_fallback_chain)across turns. On the next Codex 429, the eager-fallback checkfails silently — no status message, no fallback attempt, just three retries and a hard failure every time.
Cron jobs spawn a fresh
AIAgentper run (index always 0), which is exactly why the same chain works reliably for cron while interactive sessions fail.Fix: add an unconditional
self._fallback_index = 0reset in thenot _fallback_activatedearly-return branch of_restore_primary_runtime(). Every new turn now starts with the full chain available, regardless of what happened in the prior turn.Per the reporter's follow-up on 2026-05-06 confirming that sessions with
msgs=2(effectively single-shot shape) also fail without fallback, and that the one session where fallback fired at 14:51:34 was likely the first turn after a fresh agent start.Fixes #20465
How to test
model.provider: openai-codexand afallback_providerslist with a reachable local provider (e.g. Ollama).hermessession, exhaust the Codex 5-hour quota.Fallback activated: gpt-5.5 → <model> (custom)) instead ofAPI call failed after 3 retries.pytest tests/run_agent/test_primary_runtime_restore.py tests/run_agent/test_provider_fallback.py -q— all 51 tests pass.Alternatively, reproduce with the unit test added in this PR:
What platforms tested on