[Bug] Fallback announced but never sent: trying fallback... logged when /model-set invalid id triggers HTTP 400, but fallback_model is never invoked and session aborts #17446

@Enerlens

Description

Summary

When the active session has a model override (set via /model <invalid-id>) that the provider rejects with HTTP 400 "not a valid model ID", Hermes logs ⚠️ Non-retryable error (HTTP 400) — trying fallback... but never actually invokes the configured fallback_model: the next API request body still contains the broken model override, and the session aborts. The chat (Telegram in my case) then fails on every subsequent turn until the session JSON is hand-edited.

Environment

  • Hermes Agent: v0.11.0 (image nousresearch/hermes-agent:latest, sha256 148f233e89d1)
  • Container created: 2026‑04‑28
  • OS: Linux (Docker, Ubuntu base)
  • Provider: openrouter
  • Primary (config): deepseek/deepseek-v4-flash:floor
  • Session model override (broken): deepseek/deepseek-v4 (invalid OpenRouter ID)
  • Configured fallback_model: google/gemini-2.0-flash-001 (valid, confirmed via OpenRouter /api/v1/models)
  • Integration: Telegram bot (polling)
  • fallback_providers: [] (only the legacy single-dict fallback_model is set)

Repro

  1. Start Hermes with this config.yaml:
    model:
      provider: openrouter
      model: deepseek/deepseek-v4-flash:floor
      default: deepseek/deepseek-v4-pro
    fallback_providers: []
    fallback_model:
      provider: openrouter
      model: google/gemini-2.0-flash-001
  2. Connect a Telegram bot, start a session.
  3. In the chat, switch to an invalid model (intentional typo of a real model published 2026‑04‑24, e.g. dropping -pro):
    /model deepseek/deepseek-v4
    The switch is accepted with no upfront catalog rejection (see related issue #7922, "Gateway /model command sends provider-prefixed slug as raw model name for non-curated models (HTTP 400 Unknown Model)").
  4. Send any message. OpenRouter returns:
    HTTP 400: deepseek/deepseek-v4 is not a valid model ID
    
  5. Send another message. Same error. Bot is dead until the session JSON is patched by hand.
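
To illustrate the missing upfront check in step 3, here is a minimal sketch of validating a slug against a provider catalog before accepting the switch. It assumes the catalog has already been fetched and parsed into the shape {"data": [{"id": ...}, ...]}. The function name check_model_slug and the sample catalog are hypothetical and are not part of Hermes.

```python
from difflib import get_close_matches


def check_model_slug(slug, catalog):
    """Return (ok, suggestions) for a requested model slug.

    `catalog` is an already-parsed model listing in the assumed shape
    {"data": [{"id": "..."}, ...]}.
    """
    known = [entry["id"] for entry in catalog.get("data", [])]
    if slug in known:
        return True, []
    # Offer near matches so an obvious typo can be corrected by the user.
    return False, get_close_matches(slug, known, n=3, cutoff=0.6)


# Illustrative catalog only; a real one would come from the provider.
sample_catalog = {
    "data": [
        {"id": "deepseek/deepseek-v4-pro"},
        {"id": "google/gemini-2.0-flash-001"},
    ]
}

ok, suggestions = check_model_slug("deepseek/deepseek-v4", sample_catalog)
print(ok, suggestions)
```

With a check like this, the /model command could refuse the switch (or ask for confirmation) instead of storing an override that fails on the first request.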

Observed log output (gateway, two consecutive user messages)

⚠️  API call failed (attempt 1/3): BadRequestError [HTTP 400]
   🔌 Provider: openrouter  Model: deepseek/deepseek-v4
   🌐 Endpoint: https://openrouter.ai/api/v1
   📝 Error: HTTP 400: deepseek/deepseek-v4 is not a valid model ID
⚠️ Non-retryable error (HTTP 400) — trying fallback...
🧾 Request debug dump written to: /opt/data/sessions/request_dump_…_899256.json
❌ Non-retryable error (HTTP 400): HTTP 400: deepseek/deepseek-v4 is not a valid model ID
❌ Non-retryable client error (HTTP 400). Aborting.
   🔌 Provider: openrouter  Model: deepseek/deepseek-v4
   🌐 Endpoint: https://openrouter.ai/api/v1
ERROR root: Non-retryable client error: …'deepseek/deepseek-v4 is not a valid model ID'…

⚠️  API call failed (attempt 1/3): BadRequestError [HTTP 400]
   🔌 Provider: openrouter  Model: deepseek/deepseek-v4
   …same sequence…

Notice: the line 🔄 Primary model failed — switching to fallback: <fb_model> via <fb_provider> (emitted by _try_activate_fallback at run_agent.py:7178 after a successful client swap) is absent. The trying fallback... message at run_agent.py:11819 is emitted, but the subsequent _try_activate_fallback() call returns False, so execution falls straight through to the abort path.

Evidence: request dump confirms body never carries fallback model

Two dumps from the same session, written for the two consecutive aborted turns:

// request_dump_20260429_123206_c28fa020_20260429_123304_899256.json
{
  "reason": "non_retryable_client_error",
  "request": {
    "url": "https://openrouter.ai/api/v1/chat/completions",
    "body": {
      "model": "deepseek/deepseek-v4",
      "messages": [...],
      "tools": [...],
      "extra_body": {"reasoning": {"enabled": true, "effort": "medium"}}
    }
  },
  "error": {"status_code": 400, "body": {"message": "deepseek/deepseek-v4 is not a valid model ID", "code": 400}}
}

The second dump (next turn, 1 minute later) is byte-identical on body.model. There is no second dump showing a request to google/gemini-2.0-flash-001 — the fallback is announced but never sent on the wire.
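
The same check can be scripted for anyone who wants to confirm it on their own dumps. This is a minimal sketch that assumes each dump is valid JSON with the request.body.model layout shown above. The helper names are hypothetical, and the sample is a trimmed stand-in for a real dump file.

```python
import json


def models_on_wire(dump_texts):
    """Return the body.model value of each request dump, in order."""
    models = []
    for text in dump_texts:
        dump = json.loads(text)
        models.append(dump["request"]["body"]["model"])
    return models


def fallback_was_sent(dump_texts, fallback_model):
    """True if any dumped request carried the fallback model."""
    return fallback_model in models_on_wire(dump_texts)


# Trimmed stand-in for the dump shown above (messages/tools omitted).
sample_dump = json.dumps({
    "reason": "non_retryable_client_error",
    "request": {
        "url": "https://openrouter.ai/api/v1/chat/completions",
        "body": {"model": "deepseek/deepseek-v4"},
    },
    "error": {"status_code": 400},
})

dumps = [sample_dump, sample_dump]  # two consecutive aborted turns
print(models_on_wire(dumps))
print(fallback_was_sent(dumps, "google/gemini-2.0-flash-001"))
```

For real files, the text of each dump could be loaded with pathlib.Path(...).read_text() and passed in the same way.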

Probable root cause

Reading run_agent.py v0.11.0:

  • _try_activate_fallback() (line 6997) advances self._fallback_index before the activation can fail (line 7022: self._fallback_index += 1). With a single-entry chain (legacy fallback_model: form, no fallback_providers:), the index reaches len(_fallback_chain) after the very first attempt.
  • _restore_primary_runtime() (line ~7196) resets _fallback_index = 0 only when self._fallback_activated is True, i.e. after a successful fallback activation. If an activation failed earlier in the session, _fallback_index stays stuck past the end of the chain. Two ways this could have happened: (a) the primary earlier hit a transient No models provided 400, the fallback briefly succeeded, and _fallback_activated was cleared on the next primary restoration without the index being reset; or (b) some activation path returned False via the recursive return self._try_activate_fallback() exhaustion guard.
  • From that point on, every _try_activate_fallback() call returns False at the bounds check (line 7018: if self._fallback_index >= len(self._fallback_chain): return False), even though a perfectly valid fallback_model is configured.

The user-facing symptom is then: trying fallback... logged → no client swap → same broken model in body → same 400 → abort. The session is pinned forever.

I haven't fully proven this is the exact cause (would need to instrument _fallback_index at runtime), but it's consistent with all observations: configured fallback exists, primary is broken in a way the fallback would not share, dumps show only primary model on the wire, and the switching to fallback log line is missing.
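
The suspected mechanism can be reproduced in isolation with a toy model. This is a sketch of the bookkeeping as described above, not the actual run_agent.py code. The class FallbackState and its method names are hypothetical, and the boolean activation_succeeds stands in for whatever makes the real client swap succeed or fail.

```python
class FallbackState:
    """Toy model of the index bookkeeping described above (not Hermes code)."""

    def __init__(self, chain):
        self.chain = list(chain)
        self.index = 0
        self.activated = False

    def try_activate(self, activation_succeeds):
        # Bounds check: an exhausted chain can never activate.
        if self.index >= len(self.chain):
            return False
        # The index advances before the outcome of the activation is known.
        self.index += 1
        if not activation_succeeds:
            # Recursive retry; with a single-entry chain this hits the
            # bounds check immediately and returns False.
            return self.try_activate(activation_succeeds)
        self.activated = True
        return True

    def restore_primary(self):
        # The index is reset only if a fallback was actually active.
        if not self.activated:
            return False
        self.activated = False
        self.index = 0
        return True


state = FallbackState(["google/gemini-2.0-flash-001"])
first = state.try_activate(activation_succeeds=False)   # one failed attempt
restored = state.restore_primary()                      # no reset: never activated
second = state.try_activate(activation_succeeds=True)   # would work, but is blocked
print(first, restored, second, state.index)
```

In this model, a single failed activation on a length-1 chain leaves the index at 1, and every later attempt is rejected by the bounds check even when the fallback itself would work.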

Why this is distinct from existing issues

I scanned all open fallback-titled issues plus closed fallback_model / session override issues and none describe the "fallback announced but never sent on the wire" symptom on a session with a /model-set invalid override.

Suggested fixes

  1. Reset _fallback_index on every new turn unconditionally, not only when _fallback_activated is True. Move the self._fallback_index = 0 line so that it runs before the if not self._fallback_activated: return False early exit in _restore_primary_runtime(), or do it at the top of run_conversation().
  2. Don't increment _fallback_index until after activation succeeds. Currently it's incremented before any failure can occur, so a single transient failure can permanently exhaust a length‑1 chain. Increment at the bottom of the success path (just before return True), and let the recursive retry inside the function advance through the chain explicitly.
  3. Reject /model slugs that don't exist in the resolved provider catalog. The current acceptance with a warning (already noted in issue #7922, "Gateway /model command sends provider-prefixed slug as raw model name for non-curated models (HTTP 400 Unknown Model)") makes this class of typo a footgun. At minimum, when the override later produces 400 not a valid model ID, the gateway could automatically clear the override and revert to model.model from config.
  4. Don't emit trying fallback... until the activation actually succeeds. Move the status emit into _try_activate_fallback() after self._fallback_activated = True (line 7095). Right now it's emitted at the call site (line 11819) before knowing whether the swap will happen, which is the misleading UX bit.
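
Fixes 1, 2 and 4 can be sketched together on a toy model. This is an illustration under stated assumptions, not a patch against run_agent.py. The class FixedFallbackState, the can_activate callback (a stand-in for the real client swap) and the notify hook are all hypothetical. It also uses a plain loop rather than the recursive retry mentioned in fix 2.

```python
class FixedFallbackState:
    """Toy model with fixes 1, 2 and 4 applied (not Hermes code)."""

    def __init__(self, chain, notify=print):
        self.chain = list(chain)
        self.index = 0
        self.activated = False
        self.notify = notify

    def restore_primary(self):
        # Fix 1: reset the index unconditionally on every restoration.
        self.index = 0
        was_active = self.activated
        self.activated = False
        return was_active

    def try_activate(self, can_activate):
        # Walk the remaining chain explicitly instead of pre-incrementing.
        for position in range(self.index, len(self.chain)):
            entry = self.chain[position]
            if can_activate(entry):
                # Fix 2: advance the index only after a successful swap.
                self.index = position + 1
                self.activated = True
                # Fix 4: announce the fallback only once it is active.
                self.notify(f"switching to fallback: {entry}")
                return entry
        # Nothing activated: index is unchanged, so a later call can retry.
        return None


messages = []
fixed = FixedFallbackState(["google/gemini-2.0-flash-001"], notify=messages.append)
failed = fixed.try_activate(lambda entry: False)   # transient failure
succeeded = fixed.try_activate(lambda entry: True)  # retry is still possible
print(failed, succeeded, messages)
```

Because a failed attempt leaves the index untouched and nothing is announced until the swap succeeds, a single transient failure can no longer exhaust a one-entry chain or produce a misleading status line.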

Workaround used

Hand-patched <HERMES_HOME>/sessions/session_<id>.json setting model back to a valid id (deepseek/deepseek-v4-pro), and updated model.default in config.yaml to the same. Bot resumed on the next message without restart.

Happy to provide full session JSONs or instrument _fallback_index if helpful.

Labels

  • P2: Medium; degraded but workaround exists
  • comp/agent: Core agent loop, run_agent.py, prompt builder
  • platform/telegram: Telegram bot adapter
  • sweeper:implemented-on-main: Sweeper: behavior already present on current main
  • type/bug: Something isn't working
