fix(agent): reset _fallback_index when previous turn exhausted chain without activating (#17446) #17824

Closed
yelog wants to merge 1 commit into NousResearch:main from yelog:fix/fallback-index-reset-on-failed-activation

yelog commented Apr 30, 2026

What

Fix #17446 — sessions get permanently pinned to a broken primary model after one fallback-activation failure, even when a perfectly valid `fallback_model` is configured. Every subsequent turn prints "⚠️ Non-retryable error — trying fallback..." but the fallback never actually engages.

Why

`_try_activate_fallback()` (run_agent.py) increments `_fallback_index` before attempting activation and only flips `_fallback_activated = True` on the success path. If every chain entry fails (e.g. `fb_client is None` from a misconfigured credential, invalid provider/model, or any exception during client resolution), the recursion exhausts the chain leaving:

```
_fallback_index == len(_fallback_chain) # past the end
_fallback_activated == False # never succeeded
```

On the next turn, `run_conversation()` calls `_restore_primary_runtime()` at the top of the loop. The old guard returned early on `not self._fallback_activated`, leaving the stale index. The next `_try_activate_fallback()` then short-circuits at the `if self._fallback_index >= len(...)` bounds check, returning False instantly — so the runner announces fallback, sends nothing different, and the user sees the same primary error forever.
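
The wedged state described above can be reproduced with a toy model. This is a minimal sketch, not the code in run_agent.py: `ToyAgent`, its `resolve` callable, and the `_restore_primary_runtime_old` name are hypothetical stand-ins, and only the three attribute names and the increment-before-activate ordering are taken from the description.

```python
from typing import Callable, List, Optional


class ToyAgent:
    """Hypothetical stand-in for the agent's fallback state (not the real class)."""

    def __init__(self, chain: List[str], resolve: Callable[[str], Optional[object]]):
        self._fallback_chain = chain
        self._fallback_index = 0
        self._fallback_activated = False
        self._resolve = resolve  # assumed to return None when no client can be built

    def _try_activate_fallback(self) -> bool:
        # Bounds check: an index past the end means "nothing left to try".
        if self._fallback_index >= len(self._fallback_chain):
            return False
        entry = self._fallback_chain[self._fallback_index]
        self._fallback_index += 1  # advanced before activation is attempted
        if self._resolve(entry) is None:
            return self._try_activate_fallback()  # recurse to the next entry
        self._fallback_activated = True  # only set on the success path
        return True

    def _restore_primary_runtime_old(self) -> None:
        # Old guard: returns early and leaves the stale index in place.
        if not self._fallback_activated:
            return
        self._fallback_activated = False
        self._fallback_index = 0


available = {"ok": False}  # flips to True once the fallback becomes resolvable
agent = ToyAgent(["fallback-a"], lambda entry: object() if available["ok"] else None)

turn1 = agent._try_activate_fallback()  # chain exhausted, nothing activated
available["ok"] = True                  # the fallback is healthy from now on
agent._restore_primary_runtime_old()    # start of turn 2: early return
turn2 = agent._try_activate_fallback()  # short-circuits at the bounds check
```

In this sketch the fallback is perfectly usable on turn 2, yet `_try_activate_fallback()` never reaches it because the index is still past the end of the chain.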

Reproducible scenario from the issue:

  • Telegram bot configured with `fallback_model: google/gemini-2.0-flash-001` (valid)
  • User runs `/model deepseek/deepseek-v4` (invalid OpenRouter ID)
  • Turn 1: primary returns HTTP 400; fallback activation fails even though the chain entry is valid, and the chain ends up exhausted
  • Turn 2+: every message permanently aborts; only fix today is hand-editing the session JSON

How

One-liner functional fix in `_restore_primary_runtime()`: when the early-return path runs (no fallback to roll back), still reset `_fallback_index` to 0 if it has drifted past 0. Guarantees a fresh chain attempt on every turn that begins on the primary, regardless of whether the previous turn's fallback path succeeded, partially advanced, or fully exhausted the chain.
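
The shape of the change can be sketched as follows. This is an illustration under assumptions, not the actual diff: `ToyAgent` is a hypothetical stand-in, the real method also restores primary client and model state (omitted here), and only the guard on `_fallback_activated` and the index reset come from the description.

```python
class ToyAgent:
    """Hypothetical stand-in holding only the two fields the fix touches."""

    def __init__(self, fallback_index: int = 0, fallback_activated: bool = False):
        self._fallback_index = fallback_index
        self._fallback_activated = fallback_activated

    def _restore_primary_runtime(self) -> None:
        if not self._fallback_activated:
            # The fix: nothing to roll back, but a stale index from a failed
            # activation attempt is still cleared before returning early.
            if self._fallback_index > 0:
                self._fallback_index = 0
            return
        # Normal rollback path (restoring the primary runtime is omitted
        # in this sketch).
        self._fallback_activated = False
        self._fallback_index = 0


wedged = ToyAgent(fallback_index=1, fallback_activated=False)
wedged._restore_primary_runtime()

untouched = ToyAgent(fallback_index=0, fallback_activated=False)
untouched._restore_primary_runtime()
```

Guarding the write with `> 0` keeps the early-return path a pure no-op when the index is already 0.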

How to test

```
scripts/run_tests.sh tests/run_agent/test_primary_runtime_restore.py
scripts/run_tests.sh tests/run_agent/test_provider_fallback.py tests/run_agent/test_compressor_fallback_update.py tests/run_agent/test_switch_model_fallback_prune.py tests/agent/test_credential_pool_routing.py
```

Three new regression tests in `test_primary_runtime_restore.py`:

  1. `test_resets_fallback_index_after_failed_activation` — buggy state (chain exhausted + activated False) is reset
  2. `test_does_not_disturb_index_when_already_zero` — pure no-op when index is already 0
  3. `test_subsequent_fallback_attempt_succeeds_after_reset` — end-to-end: after reset, a fresh `_try_activate_fallback()` succeeds
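
For orientation, the three cases could look roughly like this against a stub. The test names match the list above, but the bodies are assumptions: the real tests exercise the agent class, whereas `restore_primary_runtime`, `try_activate_fallback`, and `make_state` below are hypothetical helpers operating on a plain namespace.

```python
from types import SimpleNamespace


def restore_primary_runtime(state) -> None:
    """Hypothetical stand-in for the patched helper."""
    if not state.fallback_activated:
        if state.fallback_index > 0:
            state.fallback_index = 0
        return
    state.fallback_activated = False
    state.fallback_index = 0


def try_activate_fallback(state) -> bool:
    """Hypothetical stand-in; every chain entry is assumed resolvable."""
    if state.fallback_index >= len(state.chain):
        return False
    state.fallback_index += 1
    state.fallback_activated = True
    return True


def make_state(index: int) -> SimpleNamespace:
    return SimpleNamespace(
        chain=["fallback-a"], fallback_index=index, fallback_activated=False
    )


def test_resets_fallback_index_after_failed_activation():
    state = make_state(index=1)  # exhausted chain, never activated
    restore_primary_runtime(state)
    assert state.fallback_index == 0


def test_does_not_disturb_index_when_already_zero():
    state = make_state(index=0)
    restore_primary_runtime(state)
    assert state.fallback_index == 0
    assert state.fallback_activated is False


def test_subsequent_fallback_attempt_succeeds_after_reset():
    state = make_state(index=1)
    assert try_activate_fallback(state) is False  # wedged before the reset
    restore_primary_runtime(state)
    assert try_activate_fallback(state) is True
```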

All 34 tests in the file pass; 35 related fallback tests across 4 files also green.

Platforms

Affects all platforms (CLI, gateway, every messaging adapter) — the bug is in core agent state, not platform-specific code.

…without activating (NousResearch#17446)

When _try_activate_fallback() fails to activate any fallback (e.g. all
chain entries return fb_client is None due to misconfigured credentials,
or all entries have invalid provider/model), the index is incremented
through the chain via recursion but _fallback_activated is never flipped
to True. The session ends with _fallback_index == len(_fallback_chain)
and _fallback_activated == False.

On the next turn, _restore_primary_runtime() returned early at the
`if not self._fallback_activated` guard without resetting the index.
Every subsequent turn's _try_activate_fallback() then short-circuited
at the bounds check and returned False immediately, so the user saw
'⚠️ Non-retryable error — trying fallback...' followed by silence,
the broken primary kept being sent on the wire, and the session was
permanently pinned to the failing model until the session JSON was
hand-edited.

Reproducible scenario from the issue: Telegram bot with
fallback_model: google/gemini-2.0-flash-001 (valid), user runs
/model deepseek/deepseek-v4 (invalid OpenRouter ID). First turn fails
with 400, fallback activation fails (chain entry valid but state machine
wedged), every subsequent turn permanently aborts.

Fix: when _restore_primary_runtime() returns early because there's no
fallback to roll back, still reset _fallback_index to 0 if it has
drifted past 0. This guarantees a fresh chain attempt on every turn
that begins on the primary, regardless of whether the previous turn's
fallback path succeeded, partially advanced the index, or fully
exhausted the chain.

Adds three regression tests covering:
- the buggy state (chain exhausted + activated False) is now reset
- no-op when index is already 0 (no spurious writes)
- end-to-end: after reset, a fresh _try_activate_fallback() succeeds

teknium1 (Contributor) commented

This appears to be implemented on current main. Automated hermes-sweeper review found the PR's functional fix present in the refactored helper location rather than the original run_agent.py body.

Evidence:

  • agent/agent_runtime_helpers.py:906 resets _fallback_index = 0 even when _fallback_activated is false, covering the exhausted-index-after-failed-activation path described here.
  • agent/turn_context.py:111 calls _restore_primary_runtime() at the start of each turn, so long-lived CLI/gateway sessions get a fresh fallback chain before the next message.
  • agent/chat_completion_helpers.py:1054 and agent/chat_completion_helpers.py:1058 still show the same bounds-check/advance behavior that made the reset necessary, while agent/chat_completion_helpers.py:1115 covers the provider-not-configured failure path.
  • The implementing commit is 4ab9a06a51268a2864cc66ee36ef34bf6f9ef6e8 (fix(agent): reset _fallback_index at turn start even when no fallback activated), which notes it re-applied the pre-refactor fix into agent.agent_runtime_helpers.restore_primary_runtime.
  • The linked issue #17446 ("[Bug] Fallback announced but never sent: trying fallback... logged when /model-set invalid id triggers HTTP 400, but fallback_model is never invoked and session aborts") has also been closed after considering the maintainer comments relating it to #15286 ("fix(agent): avoid false fallback status on client errors").

teknium1 closed this Jun 10, 2026
teknium1 added the sweeper:implemented-on-main label (Sweeper: behavior already present on current main) Jun 10, 2026

Labels

  • comp/agent: Core agent loop, run_agent.py, prompt builder
  • P2: Medium, degraded but workaround exists
  • sweeper:implemented-on-main: Sweeper: behavior already present on current main
  • type/bug: Something isn't working
