fix(agent): reset _fallback_index when previous turn exhausted chain without activating (#17446) by yelog · Pull Request #17824 · NousResearch/hermes-agent

yelog · 2026-04-30T07:55:35Z

What

Fix #17446 — sessions get permanently pinned to a broken primary model after one fallback-activation failure, even when a perfectly valid `fallback_model` is configured. Every subsequent turn prints "⚠️ Non-retryable error — trying fallback..." but the fallback never actually engages.

Why

`_try_activate_fallback()` (run_agent.py) increments `_fallback_index` before attempting activation and only flips `_fallback_activated = True` on the success path. If every chain entry fails (e.g. `fb_client is None` from a misconfigured credential, invalid provider/model, or any exception during client resolution), the recursion exhausts the chain leaving:

```
_fallback_index == len(_fallback_chain) # past the end
_fallback_activated == False # never succeeded
```

On the next turn, `run_conversation()` calls `_restore_primary_runtime()` at the top of the loop. The old guard returned early on `not self._fallback_activated`, leaving the stale index. The next `_try_activate_fallback()` then short-circuits at the `if self._fallback_index >= len(...)` bounds check, returning False instantly — so the runner announces fallback, sends nothing different, and the user sees the same primary error forever.
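The sequence above can be reproduced with a toy model. This is a minimal sketch, not the real `run_agent.py` code: `ToyAgent`, `_resolve_client`, and the `usable` flag are hypothetical names introduced for illustration, and only the three state fields and the two method names come from the description.

```python
class ToyAgent:
    """Toy model of the fallback state only; not the real agent class."""

    def __init__(self, fallback_chain):
        self._fallback_chain = list(fallback_chain)
        self._fallback_index = 0
        self._fallback_activated = False

    def _resolve_client(self, entry):
        # Hypothetical helper: returns None when the entry cannot be resolved.
        return object() if entry["usable"] else None

    def _try_activate_fallback(self):
        # Bounds check: an exhausted chain returns False immediately.
        if self._fallback_index >= len(self._fallback_chain):
            return False
        entry = self._fallback_chain[self._fallback_index]
        self._fallback_index += 1  # incremented before activation is attempted
        if self._resolve_client(entry) is None:
            return self._try_activate_fallback()  # recurse to the next entry
        self._fallback_activated = True  # flipped only on the success path
        return True

    def _restore_primary_runtime(self):
        # Old guard: with no active fallback, return without touching the index.
        if not self._fallback_activated:
            return
        self._fallback_activated = False
        self._fallback_index = 0


entry = {"model": "google/gemini-2.0-flash-001", "usable": False}
agent = ToyAgent([entry])

turn1 = agent._try_activate_fallback()  # chain exhausted, nothing activated
stale_index = agent._fallback_index     # now equal to len(chain)

entry["usable"] = True                  # the fallback would now resolve fine
agent._restore_primary_runtime()        # start of turn 2: early return
turn2 = agent._try_activate_fallback()  # still blocked by the bounds check

print(turn1, stale_index, turn2)  # prints: False 1 False
```

Even though the toy entry becomes usable before the second turn, the stale index keeps the chain from ever being tried again.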

Reproducible scenario from the issue:

- Telegram bot configured with `fallback_model: google/gemini-2.0-flash-001` (valid)
- User runs `/model deepseek/deepseek-v4` (invalid OpenRouter ID)
- Turn 1: 400 from primary, fallback activation fails for state-machine reasons, chain exhausted
- Turn 2+: every message permanently aborts; only fix today is hand-editing the session JSON

How

One-liner functional fix in `_restore_primary_runtime()`: when the early-return path runs (no fallback to roll back), still reset `_fallback_index` to 0 if it has drifted past 0. Guarantees a fresh chain attempt on every turn that begins on the primary, regardless of whether the previous turn's fallback path succeeded, partially advanced, or fully exhausted the chain.
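A rough sketch of what that reset could look like, again on a toy model rather than the actual method body. `PatchedToyAgent` and the `usable` flag are hypothetical; the real guard and rollback logic may differ in detail.

```python
class PatchedToyAgent:
    """Toy model with the index reset added to the early-return path."""

    def __init__(self, fallback_chain):
        self._fallback_chain = list(fallback_chain)
        self._fallback_index = 0
        self._fallback_activated = False

    def _try_activate_fallback(self):
        if self._fallback_index >= len(self._fallback_chain):
            return False
        entry = self._fallback_chain[self._fallback_index]
        self._fallback_index += 1
        if not entry["usable"]:  # hypothetical stand-in for `fb_client is None`
            return self._try_activate_fallback()
        self._fallback_activated = True
        return True

    def _restore_primary_runtime(self):
        if not self._fallback_activated:
            # Nothing to roll back, but clear an index left stale by a
            # previous turn whose activation attempts all failed.
            if self._fallback_index != 0:
                self._fallback_index = 0
            return
        self._fallback_activated = False
        self._fallback_index = 0


entry = {"model": "google/gemini-2.0-flash-001", "usable": False}
agent = PatchedToyAgent([entry])

turn1 = agent._try_activate_fallback()   # exhausts the chain, index == 1
entry["usable"] = True                   # fallback becomes resolvable
agent._restore_primary_runtime()         # start of turn 2: index reset to 0
index_after_restore = agent._fallback_index
turn2 = agent._try_activate_fallback()   # fresh chain attempt succeeds

print(turn1, index_after_restore, turn2)  # prints: False 0 True
```

The `!= 0` check keeps the early-return path a pure no-op when the index was never advanced, which matches the "no spurious writes" behavior the tests describe.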

How to test

```
scripts/run_tests.sh tests/run_agent/test_primary_runtime_restore.py
scripts/run_tests.sh tests/run_agent/test_provider_fallback.py tests/run_agent/test_compressor_fallback_update.py tests/run_agent/test_switch_model_fallback_prune.py tests/agent/test_credential_pool_routing.py
```

Three new regression tests in `test_primary_runtime_restore.py`:

- `test_resets_fallback_index_after_failed_activation`: buggy state (chain exhausted + activated False) is reset
- `test_does_not_disturb_index_when_already_zero`: pure no-op when index is already 0
- `test_subsequent_fallback_attempt_succeeds_after_reset`: end-to-end: after reset, a fresh `_try_activate_fallback()` succeeds

All 34 tests in the file pass; 35 related fallback tests across 4 files also green.
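For readers without the test file at hand, the first two checks might look roughly like the following. This is an illustrative sketch only: `restore_primary_runtime` here is a hypothetical standalone stand-in, `WriteCountingAgent` is an invented helper for observing writes, and the test names are deliberately not the ones in `test_primary_runtime_restore.py`.

```python
def restore_primary_runtime(agent):
    """Hypothetical stand-in for the patched restore logic."""
    if not agent._fallback_activated:
        if agent._fallback_index != 0:
            agent._fallback_index = 0
        return
    agent._fallback_activated = False
    agent._fallback_index = 0


class WriteCountingAgent:
    """Invented test double that counts writes to _fallback_index."""

    def __init__(self, index, activated):
        object.__setattr__(self, "writes", 0)
        self._fallback_index = index
        self._fallback_activated = activated
        object.__setattr__(self, "writes", 0)  # ignore the constructor's write

    def __setattr__(self, name, value):
        if name == "_fallback_index":
            object.__setattr__(self, "writes", self.writes + 1)
        object.__setattr__(self, name, value)


def test_index_reset_after_exhausted_chain():
    # Buggy state: index past the end of a 2-entry chain, never activated.
    agent = WriteCountingAgent(index=2, activated=False)
    restore_primary_runtime(agent)
    assert agent._fallback_index == 0
    assert agent._fallback_activated is False


def test_no_write_when_index_already_zero():
    agent = WriteCountingAgent(index=0, activated=False)
    restore_primary_runtime(agent)
    assert agent.writes == 0  # early-return path stays a pure no-op


# Run directly when not using a test runner.
test_index_reset_after_exhausted_chain()
test_no_write_when_index_already_zero()
```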

Platforms

Affects all platforms (CLI, gateway, every messaging adapter) — the bug is in core agent state, not platform-specific code.

…without activating (NousResearch#17446)

When _try_activate_fallback() fails to activate any fallback (e.g. all chain entries return fb_client is None due to misconfigured credentials, or all entries have invalid provider/model), the index is incremented through the chain via recursion but _fallback_activated is never flipped to True. The session ends with _fallback_index == len(_fallback_chain) and _fallback_activated == False.

On the next turn, _restore_primary_runtime() returned early at the `if not self._fallback_activated` guard without resetting the index. Every subsequent turn's _try_activate_fallback() then short-circuited at the bounds check and returned False immediately, so the user saw '⚠️ Non-retryable error — trying fallback...' followed by silence, the broken primary kept being sent on the wire, and the session was permanently pinned to the failing model until the session JSON was hand-edited.

Reproducible scenario from the issue: Telegram bot with fallback_model: google/gemini-2.0-flash-001 (valid), user runs /model deepseek/deepseek-v4 (invalid OpenRouter ID). First turn fails with 400, fallback activation fails (chain entry valid but state machine wedged), every subsequent turn permanently aborts.

Fix: when _restore_primary_runtime() returns early because there's no fallback to roll back, still reset _fallback_index to 0 if it has drifted past 0. This guarantees a fresh chain attempt on every turn that begins on the primary, regardless of whether the previous turn's fallback path succeeded, partially advanced the index, or fully exhausted the chain.

Adds three regression tests covering:

- the buggy state (chain exhausted + activated False) is now reset
- no-op when index is already 0 (no spurious writes)
- end-to-end: after reset, a fresh _try_activate_fallback() succeeds

teknium1 · 2026-06-10T10:03:34Z

This appears to be implemented on current main. Automated hermes-sweeper review found the PR's functional fix present in the refactored helper location rather than the original run_agent.py body.

Evidence:

- `agent/agent_runtime_helpers.py:906` resets `_fallback_index = 0` even when `_fallback_activated` is false, covering the exhausted-index-after-failed-activation path described here.
- `agent/turn_context.py:111` calls `_restore_primary_runtime()` at the start of each turn, so long-lived CLI/gateway sessions get a fresh fallback chain before the next message.
- `agent/chat_completion_helpers.py:1054` and `agent/chat_completion_helpers.py:1058` still show the same bounds-check/advance behavior that made the reset necessary, while `agent/chat_completion_helpers.py:1115` covers the provider-not-configured failure path.
- The implementing commit is 4ab9a06a51268a2864cc66ee36ef34bf6f9ef6e8 ("fix(agent): reset _fallback_index at turn start even when no fallback activated"), which notes it re-applied the pre-refactor fix into `agent.agent_runtime_helpers.restore_primary_runtime`.
- The linked issue #17446 ("[Bug] Fallback announced but never sent: trying fallback... logged when /model-set invalid id triggers HTTP 400, but fallback_model is never invoked and session aborts") has also been closed after considering the maintainer comments relating it to #15286 ("fix(agent): avoid false fallback status on client errors").

alt-glitch added the labels `type/bug` (Something isn't working), `comp/agent` (Core agent loop, run_agent.py, prompt builder), and `P2` (Medium: degraded but workaround exists) on Apr 30, 2026

This was referenced May 1, 2026:

- fix(agent): reset fallback index when activation never succeeded (#16677) #18156 (Closed)
- fix(agent): reset fallback chain index when activation fails in interactive sessions #20793 (Closed)

konsisumer mentioned this pull request May 6, 2026:

- [Bug] Interactive CLI session does not auto-fallback on Codex 429 'usage_limit_reached', while cron jobs with the same fallback chain do #20465 (Closed)

teknium1 closed this Jun 10, 2026

teknium1 added the sweeper:implemented-on-main Sweeper: behavior already present on current main label Jun 10, 2026

yelog wants to merge 1 commit into NousResearch:main from yelog:fix/fallback-index-reset-on-failed-activation

3 participants
