fix(agent): reset fallback chain index when activation fails in interactive sessions by konsisumer · Pull Request #20793 · NousResearch/hermes-agent

konsisumer · 2026-05-06T15:22:36Z

What changed and why

In long-lived interactive CLI sessions (and gateway agent-cache hits), a single AIAgent instance is reused across multiple turns. _try_activate_fallback() increments _fallback_index before resolving the fallback provider's client. When client resolution fails (provider not configured, auth missing, etc.) the function returns False without ever setting _fallback_activated = True.

_restore_primary_runtime() was guarded by if not self._fallback_activated: return False, so it skipped its reset block entirely. _fallback_index was left at >= len(_fallback_chain) across turns. On the next Codex 429, the eager-fallback check

if is_rate_limited and self._fallback_index < len(self._fallback_chain):

fails silently — no status message, no fallback attempt, just three retries and a hard failure every time.

Cron jobs spawn a fresh AIAgent per run (index always 0), which is exactly why the same chain works reliably for cron while interactive sessions fail.

Fix: add an unconditional self._fallback_index = 0 reset in the not _fallback_activated early-return branch of _restore_primary_runtime(). Every new turn now starts with the full chain available, regardless of what happened in the prior turn.

Per the reporter's follow-up on 2026-05-06 confirming that sessions with msgs=2 (effectively single-shot shape) also fail without fallback, and that the one session where fallback fired at 14:51:34 was likely the first turn after a fresh agent start.

Fixes #20465

How to test

Configure model.provider: openai-codex and a fallback_providers list with a reachable local provider (e.g. Ollama).
In a single interactive hermes session, exhaust the Codex 5-hour quota.
Send a second message: confirm the fallback activates (log line Fallback activated: gpt-5.5 → <model> (custom)) instead of API call failed after 3 retries.
Unit test path: pytest tests/run_agent/test_primary_runtime_restore.py tests/run_agent/test_provider_fallback.py -q — all 51 tests pass.

Alternatively, reproduce with the unit test added in this PR:

pytest tests/run_agent/test_primary_runtime_restore.py::TestRestorePrimaryRuntime::test_resets_index_when_fallback_not_activated -v

What platforms tested on

macOS (Darwin 24.6.0) — unit tests
WSL2 Ubuntu 24.04 — reporter confirmed reproduction environment

… activated In long-lived interactive sessions, _try_activate_fallback() advances _fallback_index before attempting client resolution. When resolution fails (provider not configured, etc.) the function returns False without ever setting _fallback_activated=True. _restore_primary_runtime() then skips its reset block entirely (guarded by `if not _fallback_activated`), leaving _fallback_index >= len(_fallback_chain) for all subsequent turns. The eager-fallback guard at the top of the retry loop checks `_fallback_index < len(_fallback_chain)`, so the condition fails silently and no fallback is ever attempted again for that session. Cron jobs spawn a fresh AIAgent per run and never hit this path, which is why the same fallback chain works reliably for cron but not interactive. Fix: reset _fallback_index=0 in the `not _fallback_activated` early-return branch so every new turn starts with the full chain available. Fixes NousResearch#20465

alt-glitch · 2026-05-06T15:28:04Z

Likely duplicate of #17824 — same root cause (fallback_index not reset in _restore_primary_runtime() when activation never succeeded) and same fix (reset _fallback_index=0 in early-return branch). Also overlaps closed #18156.

teknium1 · 2026-05-17T00:12:57Z

Merged via #27185 — cherry-picked onto current main with your authorship preserved via rebase-merge. Thanks @konsisumer!

alt-glitch added type/bug Something isn't working P1 High — major feature broken, no workaround comp/agent Core agent loop, run_agent.py, prompt builder labels May 6, 2026

konsisumer mentioned this pull request May 9, 2026

[Bug] Interactive CLI session does not auto-fallback on Codex 429 'usage_limit_reached', while cron jobs with the same fallback chain do #20465

Closed

teknium1 mentioned this pull request May 17, 2026

fix(agent): reset _fallback_index at turn start even when no fallback activated #27185

Merged

teknium1 closed this May 17, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(agent): reset fallback chain index when activation fails in interactive sessions#20793

fix(agent): reset fallback chain index when activation fails in interactive sessions#20793
konsisumer wants to merge 1 commit into
NousResearch:mainfrom
konsisumer:fix/interactive-fallback-index-not-reset-on-failed-activation

konsisumer commented May 6, 2026

Uh oh!

alt-glitch commented May 6, 2026

Uh oh!

teknium1 commented May 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

konsisumer commented May 6, 2026

What changed and why

How to test

What platforms tested on

Uh oh!

alt-glitch commented May 6, 2026

Uh oh!

teknium1 commented May 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants