Skip to content

fix(agent): reset fallback chain index when activation fails in interactive sessions#20793

Closed
konsisumer wants to merge 1 commit into
NousResearch:mainfrom
konsisumer:fix/interactive-fallback-index-not-reset-on-failed-activation
Closed

fix(agent): reset fallback chain index when activation fails in interactive sessions#20793
konsisumer wants to merge 1 commit into
NousResearch:mainfrom
konsisumer:fix/interactive-fallback-index-not-reset-on-failed-activation

Conversation

@konsisumer

Copy link
Copy Markdown
Contributor

What changed and why

In long-lived interactive CLI sessions (and gateway agent-cache hits), a single AIAgent instance is reused across multiple turns. _try_activate_fallback() increments _fallback_index before resolving the fallback provider's client. When client resolution fails (provider not configured, auth missing, etc.) the function returns False without ever setting _fallback_activated = True.

_restore_primary_runtime() was guarded by if not self._fallback_activated: return False, so it skipped its reset block entirely. _fallback_index was left at >= len(_fallback_chain) across turns. On the next Codex 429, the eager-fallback check

if is_rate_limited and self._fallback_index < len(self._fallback_chain):

fails silently — no status message, no fallback attempt, just three retries and a hard failure every time.

Cron jobs spawn a fresh AIAgent per run (index always 0), which is exactly why the same chain works reliably for cron while interactive sessions fail.

Fix: add an unconditional self._fallback_index = 0 reset in the not _fallback_activated early-return branch of _restore_primary_runtime(). Every new turn now starts with the full chain available, regardless of what happened in the prior turn.

Per the reporter's follow-up on 2026-05-06 confirming that sessions with msgs=2 (effectively single-shot shape) also fail without fallback, and that the one session where fallback fired at 14:51:34 was likely the first turn after a fresh agent start.

Fixes #20465

How to test

  1. Configure model.provider: openai-codex and a fallback_providers list with a reachable local provider (e.g. Ollama).
  2. In a single interactive hermes session, exhaust the Codex 5-hour quota.
  3. Send a second message: confirm the fallback activates (log line Fallback activated: gpt-5.5 → <model> (custom)) instead of API call failed after 3 retries.
  4. Unit test path: pytest tests/run_agent/test_primary_runtime_restore.py tests/run_agent/test_provider_fallback.py -q — all 51 tests pass.

Alternatively, reproduce with the unit test added in this PR:

pytest tests/run_agent/test_primary_runtime_restore.py::TestRestorePrimaryRuntime::test_resets_index_when_fallback_not_activated -v

What platforms tested on

  • macOS (Darwin 24.6.0) — unit tests
  • WSL2 Ubuntu 24.04 — reporter confirmed reproduction environment

… activated

In long-lived interactive sessions, _try_activate_fallback() advances
_fallback_index before attempting client resolution.  When resolution
fails (provider not configured, etc.) the function returns False without
ever setting _fallback_activated=True.  _restore_primary_runtime() then
skips its reset block entirely (guarded by `if not _fallback_activated`),
leaving _fallback_index >= len(_fallback_chain) for all subsequent turns.
The eager-fallback guard at the top of the retry loop checks
`_fallback_index < len(_fallback_chain)`, so the condition fails silently
and no fallback is ever attempted again for that session.

Cron jobs spawn a fresh AIAgent per run and never hit this path, which is
why the same fallback chain works reliably for cron but not interactive.

Fix: reset _fallback_index=0 in the `not _fallback_activated` early-return
branch so every new turn starts with the full chain available.

Fixes NousResearch#20465
@alt-glitch alt-glitch added type/bug Something isn't working P1 High — major feature broken, no workaround comp/agent Core agent loop, run_agent.py, prompt builder labels May 6, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Likely duplicate of #17824 — same root cause (fallback_index not reset in _restore_primary_runtime() when activation never succeeded) and same fix (reset _fallback_index=0 in early-return branch). Also overlaps closed #18156.

@teknium1

Copy link
Copy Markdown
Contributor

Merged via #27185 — cherry-picked onto current main with your authorship preserved via rebase-merge. Thanks @konsisumer!

@teknium1 teknium1 closed this May 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder P1 High — major feature broken, no workaround type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Interactive CLI session does not auto-fallback on Codex 429 'usage_limit_reached', while cron jobs with the same fallback chain do

3 participants