[Bug] Fallback announced but never sent: `trying fallback...` logged when `/model`-set invalid id triggers HTTP 400, but `fallback_model` is never invoked and session aborts

## Summary

When the active session has a model override (set via `/model <invalid-id>`) that triggers an HTTP 400 `not a valid model ID` from the provider, Hermes logs `⚠️ Non-retryable error (HTTP 400) — trying fallback...` but the configured `fallback_model` is **never actually invoked**: the next API request body still contains the broken override model, and the session aborts. The chat (Telegram in my case) then fails the same way on every subsequent turn until the session JSON is hand-edited.

## Environment

- Hermes Agent: **v0.11.0** (image `nousresearch/hermes-agent:latest`, sha256 `148f233e89d1`)
- Container created: 2026‑04‑28
- OS: Linux (Docker, Ubuntu base)
- Provider: `openrouter`
- Primary (config): `deepseek/deepseek-v4-flash:floor`
- Session model override (broken): `deepseek/deepseek-v4` (invalid OpenRouter ID)
- Configured fallback_model: `google/gemini-2.0-flash-001` (valid, confirmed via OpenRouter `/api/v1/models`)
- Integration: Telegram bot (polling)
- `fallback_providers: []` (only the legacy single-dict `fallback_model` is set)

## Repro

1. Start Hermes with this `config.yaml`:
   ```yaml
   model:
     provider: openrouter
     model: deepseek/deepseek-v4-flash:floor
     default: deepseek/deepseek-v4-pro
   fallback_providers: []
   fallback_model:
     provider: openrouter
     model: google/gemini-2.0-flash-001
   ```
2. Connect a Telegram bot, start a session.
3. In the chat, switch to an invalid model (intentional typo of a real model published 2026‑04‑24, e.g. dropping `-pro`):
   `/model deepseek/deepseek-v4`
   The switch is accepted (no upfront catalog rejection — see related #7922).
4. Send any message. OpenRouter returns:
   ```
   HTTP 400: deepseek/deepseek-v4 is not a valid model ID
   ```
5. Send another message. Same error. **Bot is dead until the session JSON is patched by hand.**
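
To confirm which IDs from the repro exist in the provider catalog, a check along these lines can be used. It is a minimal sketch: `fetch_catalog` and `is_known_model` are hypothetical helper names, and the payload shape `{"data": [{"id": ...}]}` is my assumption about the OpenRouter `/api/v1/models` response.

```python
import json
import urllib.request

OPENROUTER_MODELS_URL = "https://openrouter.ai/api/v1/models"


def fetch_catalog(url: str = OPENROUTER_MODELS_URL) -> dict:
    """Download the public model catalog (performs a network call)."""
    with urllib.request.urlopen(url, timeout=15) as resp:
        return json.load(resp)


def is_known_model(model_id: str, catalog: dict) -> bool:
    """Return True if model_id appears as an `id` in the catalog payload.

    Assumes the payload looks like {"data": [{"id": "..."}, ...]}.
    """
    return any(entry.get("id") == model_id for entry in catalog.get("data", []))
```

Calling `is_known_model("deepseek/deepseek-v4", fetch_catalog())` before accepting a `/model` switch would catch the typo from step 3 up front.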

## Observed log output (gateway, two consecutive user messages)

```
⚠️  API call failed (attempt 1/3): BadRequestError [HTTP 400]
   🔌 Provider: openrouter  Model: deepseek/deepseek-v4
   🌐 Endpoint: https://openrouter.ai/api/v1
   📝 Error: HTTP 400: deepseek/deepseek-v4 is not a valid model ID
⚠️ Non-retryable error (HTTP 400) — trying fallback...
🧾 Request debug dump written to: /opt/data/sessions/request_dump_…_899256.json
❌ Non-retryable error (HTTP 400): HTTP 400: deepseek/deepseek-v4 is not a valid model ID
❌ Non-retryable client error (HTTP 400). Aborting.
   🔌 Provider: openrouter  Model: deepseek/deepseek-v4
   🌐 Endpoint: https://openrouter.ai/api/v1
ERROR root: Non-retryable client error: …'deepseek/deepseek-v4 is not a valid model ID'…

⚠️  API call failed (attempt 1/3): BadRequestError [HTTP 400]
   🔌 Provider: openrouter  Model: deepseek/deepseek-v4
   …same sequence…
```

Notice: the line `🔄 Primary model failed — switching to fallback: <fb_model> via <fb_provider>` (emitted by `_try_activate_fallback` at run_agent.py:7178 after a successful client swap) is **absent**. The `trying fallback...` message at run_agent.py:11819 is emitted, but the subsequent `_try_activate_fallback()` call returns `False`, so execution falls straight through to the abort path.
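
The same check can be run mechanically over a gateway log. This is a rough sketch that relies only on the three substrings visible in the log lines quoted in this report; `count_unfulfilled_fallbacks` is a hypothetical helper, not something Hermes ships.

```python
def count_unfulfilled_fallbacks(log_text: str) -> int:
    """Count 'trying fallback...' announcements that end in an abort
    without an intervening 'switching to fallback' line."""
    pending = False      # a fallback was announced but not yet confirmed
    unfulfilled = 0
    for line in log_text.splitlines():
        if "trying fallback..." in line:
            pending = True
        elif "switching to fallback" in line:
            pending = False  # the client swap actually happened
        elif "Aborting." in line and pending:
            unfulfilled += 1
            pending = False
    return unfulfilled
```

A non-zero result on a log excerpt means at least one turn announced a fallback and then aborted without switching.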

## Evidence: request dump confirms body never carries fallback model

Two dumps from the same session, written for the two consecutive aborted turns:

```json
// request_dump_20260429_123206_c28fa020_20260429_123304_899256.json
{
  "reason": "non_retryable_client_error",
  "request": {
    "url": "https://openrouter.ai/api/v1/chat/completions",
    "body": {
      "model": "deepseek/deepseek-v4",
      "messages": [...],
      "tools": [...],
      "extra_body": {"reasoning": {"enabled": true, "effort": "medium"}}
    }
  },
  "error": {"status_code": 400, "body": {"message": "deepseek/deepseek-v4 is not a valid model ID", "code": 400}}
}
```

The second dump (next turn, 1 minute later) is byte-identical on `body.model`. There is **no dump at all** showing a request to `google/gemini-2.0-flash-001`: the fallback is announced but never sent on the wire.
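
To check every dump in a session directory rather than eyeballing two of them, something like the following can tally `request.body.model`. It is a sketch that assumes all dumps share the structure shown above and match the `request_dump_*.json` naming; `models_on_wire` is a hypothetical helper name.

```python
import json
from collections import Counter
from pathlib import Path


def models_on_wire(dump_dir: str) -> Counter:
    """Tally the `request.body.model` value across all request dumps."""
    counts = Counter()
    for path in sorted(Path(dump_dir).glob("request_dump_*.json")):
        try:
            dump = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or truncated dumps
        if not isinstance(dump, dict):
            continue
        body = (dump.get("request") or {}).get("body") or {}
        model = body.get("model")
        if model is not None:
            counts[model] += 1
    return counts
```

If the fallback had ever been sent, the configured fallback model would appear as a key in the returned counter.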

## Probable root cause

Reading `run_agent.py` v0.11.0:

- `_try_activate_fallback()` (line 6997) advances `self._fallback_index` *before* the activation can fail (line 7022: `self._fallback_index += 1`). With a single-entry chain (legacy `fallback_model:` form, no `fallback_providers:`), the index reaches `len(_fallback_chain)` after the very first attempt.
- `_restore_primary_runtime()` (line ~7196) resets `_fallback_index = 0` only when `self._fallback_activated` is True, i.e. after a *successful* fallback activation. If an activation failed earlier in the session, `_fallback_index` stays stuck past the end of the chain. I can see two ways to get there: (a) the primary earlier hit a transient `No models provided` 400, the fallback briefly succeeded, and `_fallback_activated` was then cleared on the next primary restoration without the index being reset; or (b) any activation path returned `False` via the recursive `return self._try_activate_fallback()` exhaustion guard.
- From that point on, every `_try_activate_fallback()` call returns `False` at the bounds check (line 7018: `if self._fallback_index >= len(self._fallback_chain): return False`), even though a perfectly valid `fallback_model` is configured.

The user-facing symptom is then: `trying fallback...` logged → no client swap → same broken model in body → same 400 → abort. The session is pinned forever.

I haven't fully proven this is the exact cause (would need to instrument `_fallback_index` at runtime), but it's consistent with all observations: configured fallback exists, primary is broken in a way the fallback would not share, dumps show only primary model on the wire, and the `switching to fallback` log line is missing.
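
To make the suspected mechanism concrete, here is a toy model of the index handling as I read it. It is not the Hermes code: the class name and the `activation_ok` parameter are invented for illustration, the recursive retry is collapsed into a plain `return False` (equivalent for a length-1 chain), and only the failed-activation path is modeled.

```python
class FallbackModel:
    """Toy model of the suspected index handling; not the real agent."""

    def __init__(self, chain):
        self._fallback_chain = list(chain)
        self._fallback_index = 0
        self._fallback_activated = False

    def _try_activate_fallback(self, activation_ok: bool) -> bool:
        if self._fallback_index >= len(self._fallback_chain):
            return False  # bounds check: chain looks exhausted
        self._fallback_index += 1  # advanced before activation can fail
        if not activation_ok:
            return False  # failure leaves the index already consumed
        self._fallback_activated = True
        return True

    def _restore_primary_runtime(self) -> bool:
        if not self._fallback_activated:
            return False  # early exit: the index is NOT reset here
        self._fallback_activated = False
        self._fallback_index = 0
        return True


agent = FallbackModel([{"provider": "openrouter",
                        "model": "google/gemini-2.0-flash-001"}])
first = agent._try_activate_fallback(activation_ok=False)   # False, index is now 1
restored = agent._restore_primary_runtime()                 # False, index stays 1
second = agent._try_activate_fallback(activation_ok=True)   # False, bounds check
```

In this model a single failed activation is enough: every later call returns `False` at the bounds check even when the activation itself would succeed.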

## Why this is distinct from existing issues

- **#7922** (provider-prefixed slug sent as raw `model` field) — that one ships the wrong slug to a *valid* endpoint. Here the `model` value is a plausible-but-non-existent OpenRouter ID; the `body.model` is verbatim what `/model` stored.
- **#16677** (DeepSeek V4 Pro crash loop on rate limits / vision aux) — that's gateway-process crashing under 429/aux mis-resolution. Here the gateway stays up; only the conversation aborts cleanly with no fallback attempt.
- **#6380 / #7385** — cosmetic status-bar staleness after fallback. Here the issue is functional: fallback is *announced* but does not happen.
- **#15072** — `.` → `-` normalization in model names. Not the case here; the user-typed `/model deepseek/deepseek-v4` is stored verbatim.

I scanned all open `fallback`-titled issues plus closed `fallback_model` / `session override` issues and none describe the "fallback announced but never sent on the wire" symptom on a session with a `/model`-set invalid override.

## Suggested fixes

1. **Reset `_fallback_index` on every new turn unconditionally**, not only when `_fallback_activated` is True. Move the `self._fallback_index = 0` line in `_restore_primary_runtime()` so that it runs before the `if not self._fallback_activated: return False` early exit, or do the reset at the top of `run_conversation()`.
2. **Don't increment `_fallback_index` until after activation succeeds.** Currently it's incremented before any failure can occur, so a single transient failure can permanently exhaust a length‑1 chain. Increment at the bottom of the success path (just before `return True`), and let the recursive retry inside the function advance through the chain explicitly.
3. **Reject `/model` slugs that don't exist in the resolved provider catalog.** The current acceptance with a warning (already noted in #7922) makes this class of typo a footgun. At minimum, when the override later produces `400 not a valid model ID`, the gateway could automatically clear the override and revert to `model.model` from config.
4. **Don't emit `trying fallback...` until the activation actually succeeds.** Move the status emit into `_try_activate_fallback()` after `self._fallback_activated = True` (line 7095). Right now it's emitted at the call site (line 11819) before knowing whether the swap will happen, which is the misleading UX bit.
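
Fixes 1, 2 and 4 could be combined roughly as follows. This is a sketch under stated assumptions, not a patch against `run_agent.py`: `FallbackChain`, `reset_for_new_turn`, `try_activate` and the `activate` callback are all hypothetical names, and the recursion is replaced by a loop over a local cursor.

```python
from typing import Callable, Optional


class FallbackChain:
    """Hypothetical stand-in for the fallback state kept on the agent."""

    def __init__(self, chain: list, emit: Callable[[str], None] = print):
        self._chain = list(chain)
        self._index = 0
        self.activated = False
        self._emit = emit

    def reset_for_new_turn(self) -> None:
        # Fix 1: reset unconditionally at the start of every turn.
        self._index = 0

    def try_activate(self, activate: Callable[[dict], bool]) -> Optional[dict]:
        """Try remaining entries in order; return the one that activated."""
        cursor = self._index
        while cursor < len(self._chain):
            candidate = self._chain[cursor]
            cursor += 1
            if activate(candidate):
                # Fix 2: commit the index only after a successful swap.
                self._index = cursor
                self.activated = True
                # Fix 4: announce only once the swap has really happened.
                self._emit(
                    f"switching to fallback: {candidate['model']} "
                    f"via {candidate['provider']}"
                )
                return candidate
        return None  # nothing activated; the index is left untouched
```

With this shape a transient activation failure no longer consumes a length‑1 chain, and nothing is logged unless a fallback client is actually in place.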

## Workaround used

Hand-patched `<HERMES_HOME>/sessions/session_<id>.json` setting `model` back to a valid id (`deepseek/deepseek-v4-pro`), and updated `model.default` in `config.yaml` to the same. Bot resumed on the next message without restart.
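
The hand edit can be scripted along these lines. This is a sketch that assumes the override lives in a top-level `model` key of the session JSON; `patch_session_model` is a hypothetical helper, and it writes a `.bak` copy next to the file before changing anything.

```python
import json
import shutil
from pathlib import Path
from typing import Optional


def patch_session_model(session_path: str, new_model: str) -> Optional[str]:
    """Set the session's `model` to new_model; return the previous value."""
    path = Path(session_path)
    session = json.loads(path.read_text(encoding="utf-8"))
    old_model = session.get("model")

    # Keep a backup next to the original before modifying it.
    shutil.copy2(path, path.with_name(path.name + ".bak"))

    session["model"] = new_model  # assumed top-level key
    path.write_text(
        json.dumps(session, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return old_model
```

For the case in this report, the call would be `patch_session_model("<HERMES_HOME>/sessions/session_<id>.json", "deepseek/deepseek-v4-pro")` with the placeholders filled in.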

Happy to provide full session JSONs or instrument `_fallback_index` if helpful.

