[Bug] Fallback announced but never sent: trying fallback... logged when /model-set invalid id triggers HTTP 400, but fallback_model is never invoked and session aborts #17446
When the active session has a model override (set via /model <invalid-id>) that triggers an HTTP 400 not a valid model ID from the provider, Hermes logs ⚠️ Non-retryable error (HTTP 400) — trying fallback... but the configured fallback_model is never actually invoked — the next API request body still contains the broken primary model and the session aborts. The chat (Telegram in my case) becomes permanently unresponsive on every subsequent turn until the session JSON is hand-edited.
HTTP 400: deepseek/deepseek-v4 is not a valid model ID
Send another message. Same error. Bot is dead until the session JSON is patched by hand.
Observed log output (gateway, two consecutive user messages)
⚠️ API call failed (attempt 1/3): BadRequestError [HTTP 400]
🔌 Provider: openrouter Model: deepseek/deepseek-v4
🌐 Endpoint: https://openrouter.ai/api/v1
📝 Error: HTTP 400: deepseek/deepseek-v4 is not a valid model ID
⚠️ Non-retryable error (HTTP 400) — trying fallback...
🧾 Request debug dump written to: /opt/data/sessions/request_dump_…_899256.json
❌ Non-retryable error (HTTP 400): HTTP 400: deepseek/deepseek-v4 is not a valid model ID
❌ Non-retryable client error (HTTP 400). Aborting.
🔌 Provider: openrouter Model: deepseek/deepseek-v4
🌐 Endpoint: https://openrouter.ai/api/v1
ERROR root: Non-retryable client error: …'deepseek/deepseek-v4 is not a valid model ID'…
⚠️ API call failed (attempt 1/3): BadRequestError [HTTP 400]
🔌 Provider: openrouter Model: deepseek/deepseek-v4
…same sequence…
Notice: the line 🔄 Primary model failed — switching to fallback: <fb_model> via <fb_provider> (emitted by _try_activate_fallback at run_agent.py:7178 after a successful client swap) is absent. The trying fallback... message at run_agent.py:11819 is emitted, but the subsequent _try_activate_fallback() call returns False, so execution falls straight through to the abort path.
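To check a longer gateway log for the same pattern, a small scan like the one below may help. It is only a sketch: it assumes the log is available as plain text lines and that the two substrings quoted above ("trying fallback..." and "switching to fallback:") appear verbatim. The function name is made up for illustration.

```python
def unfulfilled_fallback_announcements(log_lines):
    """Return 1-based line numbers of 'trying fallback...' announcements
    that are never followed by a 'switching to fallback:' line."""
    pending = None          # line number of the last unconfirmed announcement
    unfulfilled = []
    for number, line in enumerate(log_lines, start=1):
        if "trying fallback..." in line:
            # A new announcement while one is still pending means the
            # earlier one was never confirmed by a switch.
            if pending is not None:
                unfulfilled.append(pending)
            pending = number
        elif "switching to fallback:" in line:
            pending = None
    if pending is not None:
        unfulfilled.append(pending)
    return unfulfilled
```

Run against the log excerpt above, every announcement would be reported, since no switch line ever follows.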
Evidence: request dump confirms body never carries fallback model
Two dumps from the same session, written for the two consecutive aborted turns:
// request_dump_20260429_123206_c28fa020_20260429_123304_899256.json
{
  "reason": "non_retryable_client_error",
  "request": {
    "url": "https://openrouter.ai/api/v1/chat/completions",
    "body": {
      "model": "deepseek/deepseek-v4",
      "messages": [...],
      "tools": [...],
      "extra_body": {"reasoning": {"enabled": true, "effort": "medium"}}
    }
  },
  "error": {"status_code": 400, "body": {"message": "deepseek/deepseek-v4 is not a valid model ID", "code": 400}}
}
The second dump (next turn, 1 minute later) is byte-identical on body.model. There is no second dump showing a request to google/gemini-2.0-flash-001 — the fallback is announced but never sent on the wire.
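For anyone wanting to repeat this check, the sketch below lists the model carried by each request dump in a directory. It assumes the dumps on disk are valid JSON with the request → body → model layout shown above and a request_dump_*.json file name; the function name is hypothetical.

```python
import json
from pathlib import Path


def models_on_the_wire(dump_dir, pattern="request_dump_*.json"):
    """Return (file name, body.model) for every request dump in dump_dir."""
    results = []
    for path in sorted(Path(dump_dir).glob(pattern)):
        dump = json.loads(path.read_text(encoding="utf-8"))
        # Layout assumed from the dump shown above: request -> body -> model.
        model = dump.get("request", {}).get("body", {}).get("model")
        results.append((path.name, model))
    return results
```

Calling models_on_the_wire("/opt/data/sessions") on the affected host should then show whether any dump ever carried the fallback model.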
Probable root cause
Reading run_agent.py v0.11.0:
_try_activate_fallback() (line 6997) advances self._fallback_index before the activation can fail (line 7022: self._fallback_index += 1). With a single-entry chain (legacy fallback_model: form, no fallback_providers:), the index reaches len(_fallback_chain) after the very first attempt.
_restore_primary_runtime() (line ~7196) resets _fallback_index = 0 only when self._fallback_activated is True, i.e. after a successful fallback activation. If an activation failed earlier in the session, _fallback_index stays permanently stuck past the end of the chain. I can see two ways this could happen: the primary earlier hit a transient No models provided 400, the fallback briefly succeeded, and _fallback_activated was then cleared on the next primary restoration without the index being reset; or some activation path returned False via the recursive return self._try_activate_fallback() exhaustion guard.
From that point on, every _try_activate_fallback() call returns False at the bounds check (line 7018: if self._fallback_index >= len(self._fallback_chain): return False), even though a perfectly valid fallback_model is configured.
The user-facing symptom is then: trying fallback... logged → no client swap → same broken model in body → same 400 → abort. The session is pinned forever.
I haven't fully proven this is the exact cause (would need to instrument _fallback_index at runtime), but it's consistent with all observations: configured fallback exists, primary is broken in a way the fallback would not share, dumps show only primary model on the wire, and the switching to fallback log line is missing.
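The suspected mechanism can be reproduced in isolation with a toy model. This is not the Hermes code: the class, the activation_succeeds parameter, and the way a failed activation is simulated are all assumptions made for illustration. Only the three behaviours described above are modelled (bounds check, increment before the outcome is known, reset only after a successful activation).

```python
class FallbackChainModel:
    """Toy model of the index handling described above; not the Hermes code."""

    def __init__(self, chain):
        self._fallback_chain = list(chain)
        self._fallback_index = 0
        self._fallback_activated = False

    def try_activate_fallback(self, activation_succeeds):
        # Bounds check: an exhausted chain always returns False.
        if self._fallback_index >= len(self._fallback_chain):
            return False
        # The index advances before the activation outcome is known.
        self._fallback_index += 1
        if not activation_succeeds:
            # Recursive retry: moves on to the next entry, or hits the
            # bounds check above when the chain is exhausted.
            return self.try_activate_fallback(activation_succeeds)
        self._fallback_activated = True
        return True

    def restore_primary_runtime(self):
        # The index is reset only if a fallback was actually activated.
        if not self._fallback_activated:
            return False
        self._fallback_activated = False
        self._fallback_index = 0
        return True
```

With a single-entry chain, one failed activation followed by restore_primary_runtime() leaves the index at 1, and every later try_activate_fallback() call returns False even when the activation itself would succeed.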
I scanned all open fallback-titled issues plus closed fallback_model / session override issues and none describe the "fallback announced but never sent on the wire" symptom on a session with a /model-set invalid override.
Suggested fixes
Reset _fallback_index on every new turn unconditionally, not only when _fallback_activated is True. Move the self._fallback_index = 0 line out of the if not self._fallback_activated: return False early-exit branch in _restore_primary_runtime(), or do it at the top of run_conversation().
Don't increment _fallback_index until after activation succeeds. Currently it's incremented before any failure can occur, so a single transient failure can permanently exhaust a length‑1 chain. Increment at the bottom of the success path (just before return True), and let the recursive retry inside the function advance through the chain explicitly.
Don't emit trying fallback... until the activation actually succeeds. Move the status emit into _try_activate_fallback() after self._fallback_activated = True (line 7095). Right now it's emitted at the call site (line 11819) before knowing whether the swap will happen, which is the misleading UX bit.
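A rough sketch of how these suggestions could fit together, again as a standalone toy rather than a patch against run_agent.py. The class, the activate callback, and the status list are hypothetical stand-ins, and a loop is used in place of the recursive retry for brevity.

```python
class PatchedFallbackChain:
    """Toy model of the suggested fixes; names are illustrative only."""

    def __init__(self, chain):
        self._fallback_chain = list(chain)
        self._fallback_index = 0
        self._fallback_activated = False
        self.status = []  # stands in for user-facing status messages

    def try_activate_fallback(self, activate):
        # `activate` is a hypothetical callable: entry -> bool (swap succeeded).
        for index in range(self._fallback_index, len(self._fallback_chain)):
            entry = self._fallback_chain[index]
            if activate(entry):
                # Advance the index only after a successful activation.
                self._fallback_index = index + 1
                self._fallback_activated = True
                # Announce the fallback only once the swap has happened.
                self.status.append(f"switching to fallback: {entry}")
                return True
        # Nothing activated: the index is unchanged, so a later call retries.
        return False

    def restore_primary_runtime(self):
        # Reset the index unconditionally, not only after a success.
        self._fallback_index = 0
        if not self._fallback_activated:
            return False
        self._fallback_activated = False
        return True
```

In this version a transient activation failure on a length-1 chain leaves the chain usable on the next attempt, and no status message is emitted unless the swap succeeded.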
Workaround used
Hand-patched <HERMES_HOME>/sessions/session_<id>.json setting model back to a valid id (deepseek/deepseek-v4-pro), and updated model.default in config.yaml to the same. Bot resumed on the next message without restart.
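The hand patch can be scripted roughly as follows. This assumes the session file is plain JSON with the override stored under a top-level "model" key, which matches what I edited but is not a documented format; the function name is made up, and a .bak copy is kept in case the assumption is wrong.

```python
import json
import shutil
from pathlib import Path


def reset_session_model(session_path, valid_model):
    """Point a session file's model override at a valid ID; return the old value."""
    path = Path(session_path)
    session = json.loads(path.read_text(encoding="utf-8"))
    previous = session.get("model")  # assumed top-level key
    # Keep a copy of the original next to it before overwriting.
    shutil.copy2(path, path.with_name(path.name + ".bak"))
    session["model"] = valid_model
    path.write_text(json.dumps(session, indent=2, ensure_ascii=False), encoding="utf-8")
    return previous
```

For example, reset_session_model(path_to_session_json, "deepseek/deepseek-v4-pro") mirrors the manual edit described above.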
Happy to provide full session JSONs or instrument _fallback_index if helpful.
Environment
nousresearch/hermes-agent:latest, sha256148f233e89d1)
openrouter
deepseek/deepseek-v4-flash:floor
deepseek/deepseek-v4 (invalid OpenRouter ID)
google/gemini-2.0-flash-001 (valid, confirmed via OpenRouter /api/v1/models)
fallback_providers: [] (only the legacy single-dict fallback_model is set)
Repro
config.yaml:
-pro):
/model deepseek/deepseek-v4
The switch is accepted (no upfront catalog rejection; see related Gateway /model command sends provider-prefixed slug as raw model name for non-curated models (HTTP 400 Unknown Model) #7922).
Why this is distinct from existing issues
model field): that one ships the wrong slug to a valid endpoint. Here the model value is a plausible-but-non-existent OpenRouter ID; the body.model is verbatim what /model stored.
. → - normalization in model names. Not the case here; the user-typed /model deepseek/deepseek-v4 is stored verbatim.
A further suggested fix: reject /model slugs that don't exist in the resolved provider catalog. The current acceptance with a warning (already noted in Gateway /model command sends provider-prefixed slug as raw model name for non-curated models (HTTP 400 Unknown Model) #7922) makes this class of typo a footgun. At minimum, when the override later produces 400 not a valid model ID, the gateway could automatically clear the override and revert to model.model from config.