Bug description
When the main chat model hits a rate/usage limit and Hermes switches the conversation to a configured fallback_providers entry, context compression can still fail. The compression summarizer is routed through auxiliary.compression, which in auto mode resolves back to the main model instead of the active fallback model or the configured fallback providers.
Observed warning:
⚠ Compression summary failed: Error code: 429 - {'error': {'type': 'usage_limit_reached', 'message': 'The usage limit has been reached', 'plan_type': 'plus', 'resets_at': 1777132814, 'eligible_promo': None, 'resets_in_seconds': 1657}}. Inserted a fallback context marker.
This degrades long-running sessions right when fallback is most important: the main agent can continue via fallback, but the compaction summary fails and Hermes inserts only a fallback context marker.
Environment
- Hermes Agent: v0.11.0 (2026.4.23)
- Main provider/model: openai-codex / gpt-5.5
- Configured fallback provider/model: custom deepseek-v4 / deepseek-v4-flash
- auxiliary.compression.provider: auto
Relevant config shape:
model:
  default: gpt-5.5
  provider: openai-codex
  base_url: https://chatgpt.com/backend-api/codex
fallback_providers:
  - provider: deepseek-v4
    model: deepseek-v4-flash
    reasoning_effort: xhigh
providers:
  deepseek-v4:
    name: DeepSeek V4 Official
    base_url: https://api.deepseek.com/v1
    key_env: DEEPSEEK_API_KEY
    transport: openai_chat
    default_model: deepseek-v4-flash
auxiliary:
  compression:
    provider: auto
    model: ""
    timeout: 120
Steps to reproduce
- Configure a main provider/model that can return 429/usage limit errors.
- Configure a working fallback_providers entry.
- Keep auxiliary.compression.provider: auto.
- Run a long session until context compression is triggered while the main provider is rate/usage-limited.
Expected behavior
If the main model is rate/usage-limited and the conversation is already able to continue through fallback_providers, compression should also use an available fallback-capable route. The compression summary should be generated by the fallback provider rather than failing with a 429 from the exhausted main provider.
Actual behavior
The main conversation can fall back, but compression still attempts the main provider/model through auxiliary auto-routing and fails with usage_limit_reached. Hermes then inserts a fallback context marker instead of a real summary.
Current workaround
Pin compression explicitly to the fallback provider:
auxiliary:
  compression:
    provider: deepseek-v4
    model: deepseek-v4-flash
    timeout: 120
This works, but it is static and does not generalize to users with multiple fallback providers or changing fallback preference.
Recommended fix
I recommend fixing this at the auxiliary routing layer, not only by documenting the workaround.
Specifically, when auxiliary.<task>.provider is set to auto, agent/auxiliary_client.py should include the configured fallback_providers in the auto/fallback resolution path for auxiliary LLM calls.
Suggested behavior:
- Try the active/main provider as today.
- If the call fails with rate/usage/quota exhaustion, try the configured fallback_providers in order.
- Then fall back to the existing auxiliary provider chain (openrouter, nous, local/custom, openai-codex, API-key providers).
This makes auxiliary tasks track the same resilience policy as the main agent and avoids requiring users to duplicate fallback config under every auxiliary.* task.
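The suggested order can be sketched roughly as follows. This is a minimal illustration, not Hermes code: ExhaustedError, build_candidate_order and call_with_fallback are hypothetical names, and the real resolution in agent/auxiliary_client.py may be structured quite differently.

```python
from typing import Callable, List, Optional, Sequence


class ExhaustedError(Exception):
    """Hypothetical stand-in for a rate/usage/quota exhaustion failure (e.g. HTTP 429)."""


def build_candidate_order(
    main_provider: Optional[str],
    fallback_providers: Sequence[str],
    auxiliary_chain: Sequence[str],
) -> List[str]:
    """Main provider first, then fallback_providers in order, then the
    existing auxiliary chain, skipping empty names and duplicates."""
    ordered: List[str] = []
    for name in [main_provider, *fallback_providers, *auxiliary_chain]:
        if name and name not in ordered:
            ordered.append(name)
    return ordered


def call_with_fallback(candidates: Sequence[str], call: Callable[[str], str]) -> str:
    """Try each provider in turn, moving on only when it reports exhaustion."""
    last_error: Optional[Exception] = None
    for name in candidates:
        try:
            return call(name)
        except ExhaustedError as exc:
            last_error = exc  # remember the failure and try the next provider
    if last_error is not None:
        raise last_error
    raise RuntimeError("no providers configured")


if __name__ == "__main__":
    order = build_candidate_order(
        "openai-codex",
        ["deepseek-v4"],
        ["openrouter", "nous", "openai-codex"],
    )

    def fake_summarize(provider: str) -> str:
        # Simulate the main provider being usage-limited.
        if provider == "openai-codex":
            raise ExhaustedError("usage_limit_reached")
        return f"summary from {provider}"

    print(order)
    print(call_with_fallback(order, fake_summarize))
```

Only exhaustion-style failures advance to the next candidate in this sketch; other errors propagate immediately, so a genuine misconfiguration is not masked by silently switching providers.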
It would also be useful to treat OpenAI/Codex-style usage-limit errors as fallback-worthy. Current payment/credit detection appears to cover billing/credit terms, but this observed error uses:
usage_limit_reached
The usage limit has been reached
resets_in_seconds
So the fallback classifier should likely include these usage-limit/rate-limit patterns as transient/exhaustion signals that can trigger provider fallback in auto mode.
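A classifier along these lines might look like the sketch below. The function name is_usage_limit_error and the pattern list are assumptions for illustration, built only from the strings seen in the observed warning. It does not reflect how _is_payment_error(...) is actually implemented.

```python
from typing import Tuple

# Substrings taken from the observed 429 payload; the list is illustrative, not exhaustive.
USAGE_LIMIT_PATTERNS: Tuple[str, ...] = (
    "usage_limit_reached",
    "usage limit has been reached",
    "resets_in_seconds",
)


def is_usage_limit_error(error_text: str) -> bool:
    """Return True when the error text looks like a usage-limit exhaustion.

    Matching is case-insensitive and substring-based, so it works on the
    stringified exception without parsing the provider's payload.
    """
    text = (error_text or "").lower()
    return any(pattern in text for pattern in USAGE_LIMIT_PATTERNS)


if __name__ == "__main__":
    observed = (
        "Error code: 429 - {'error': {'type': 'usage_limit_reached', "
        "'message': 'The usage limit has been reached', 'resets_in_seconds': 1657}}"
    )
    print(is_usage_limit_error(observed))
    print(is_usage_limit_error("Error code: 401 - invalid api key"))
```

Substring matching keeps the check independent of any one provider's error schema, at the cost of possible false positives if unrelated errors happen to contain the same phrases.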
Code pointers
Likely areas:
- agent/auxiliary_client.py
  - _resolve_auto(...)
  - _get_provider_chain()
  - _is_payment_error(...) or a broader auxiliary fallback classifier
  - call_llm(...) fallback handling
- agent/context_compressor.py
  - _generate_summary(...) calls call_llm(task="compression", main_runtime=...)
Why this matters
Context compression is critical for long sessions. If it fails during provider fallback, Hermes preserves much less continuity exactly when the session is already under pressure. Aligning auxiliary auto-routing with fallback_providers should make long conversations significantly more robust.