Auxiliary compression auto-routing ignores fallback_providers after main model usage limit #15714

@huacao59109

Bug description

When the main chat model hits a rate/usage limit and Hermes switches the conversation to a configured fallback_providers entry, context compression can still fail. The compression summarizer is routed through auxiliary.compression, which in auto mode resolves back to the main model instead of the active fallback model or the configured fallback providers.

Observed warning:

⚠ Compression summary failed: Error code: 429 - {'error': {'type': 'usage_limit_reached', 'message': 'The usage limit has been reached', 'plan_type': 'plus', 'resets_at': 1777132814, 'eligible_promo': None, 'resets_in_seconds': 1657}}. Inserted a fallback context marker.

This degrades long-running sessions right when fallback is most important: the main agent can continue via fallback, but the compaction summary fails and Hermes inserts only a fallback context marker.

Environment

  • Hermes Agent: v0.11.0 (2026.4.23)
  • Main provider/model: openai-codex / gpt-5.5
  • Configured fallback provider/model: custom deepseek-v4 / deepseek-v4-flash
  • auxiliary.compression.provider: auto

Relevant config shape:

model:
  default: gpt-5.5
  provider: openai-codex
  base_url: https://chatgpt.com/backend-api/codex

fallback_providers:
  - provider: deepseek-v4
    model: deepseek-v4-flash
    reasoning_effort: xhigh

providers:
  deepseek-v4:
    name: DeepSeek V4 Official
    base_url: https://api.deepseek.com/v1
    key_env: DEEPSEEK_API_KEY
    transport: openai_chat
    default_model: deepseek-v4-flash

auxiliary:
  compression:
    provider: auto
    model: ""
    timeout: 120

Steps to reproduce

  1. Configure a main provider/model that can return 429/usage limit errors.
  2. Configure a working fallback_providers entry.
  3. Keep auxiliary.compression.provider: auto.
  4. Run a long session until context compression is triggered while the main provider is rate/usage-limited.

Expected behavior

If the main model is rate/usage-limited and the conversation is already able to continue through fallback_providers, compression should also use an available fallback-capable route. The compression summary should be generated by the fallback provider rather than failing with a 429 from the exhausted main provider.

Actual behavior

The main conversation can fall back, but compression still attempts the main provider/model through auxiliary auto-routing and fails with usage_limit_reached. Hermes then inserts a fallback context marker instead of a real summary.

Current workaround

Pin compression explicitly to the fallback provider:

auxiliary:
  compression:
    provider: deepseek-v4
    model: deepseek-v4-flash
    timeout: 120

This works, but it is static and does not generalize to users with multiple fallback providers or changing fallback preferences.

Recommended fix

I recommend fixing this at the auxiliary routing layer, not only by documenting the workaround.

Specifically, when auxiliary.<task>.provider: auto, agent/auxiliary_client.py should include configured fallback_providers in the auto/fallback resolution path for auxiliary LLM calls.

Suggested behavior:

  1. Try the active/main provider as today.
  2. If the call fails with rate/usage/quota exhaustion, try the configured fallback_providers in order.
  3. Then fall back to the existing auxiliary provider chain (openrouter, nous, local/custom, openai-codex, API-key providers).

This makes auxiliary tasks track the same resilience policy as the main agent and avoids requiring users to duplicate fallback config under every auxiliary.* task.
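The suggested ordering can be sketched as follows. This is only an illustration of the proposed policy, not the current code in agent/auxiliary_client.py: build_candidates, call_with_fallback, and ProviderExhausted are hypothetical names, and the provider strings are taken from the config and chain described above.

```python
from typing import Callable, List, Optional, Sequence


class ProviderExhausted(Exception):
    """Hypothetical marker for rate/usage/quota exhaustion (e.g. HTTP 429)."""


def build_candidates(
    main: str,
    fallback_providers: Sequence[str],
    auxiliary_chain: Sequence[str],
) -> List[str]:
    """Return provider names in the proposed order, dropping duplicates.

    Order: main provider, then fallback_providers, then the existing
    auxiliary chain.
    """
    seen = set()
    ordered = []
    for name in [main, *fallback_providers, *auxiliary_chain]:
        if name not in seen:
            seen.add(name)
            ordered.append(name)
    return ordered


def call_with_fallback(candidates: Sequence[str], call: Callable[[str], str]) -> str:
    """Try each candidate in order; advance only on exhaustion errors."""
    last_error: Optional[Exception] = None
    for name in candidates:
        try:
            return call(name)
        except ProviderExhausted as exc:
            last_error = exc  # remember the failure and try the next provider
    raise RuntimeError("all providers exhausted") from last_error


def fake_call(provider: str) -> str:
    """Stand-in for a real LLM call: the main provider is usage-limited."""
    if provider == "openai-codex":
        raise ProviderExhausted("usage_limit_reached")
    return f"summary from {provider}"


candidates = build_candidates(
    "openai-codex",
    ["deepseek-v4"],
    ["openrouter", "nous", "openai-codex"],
)
result = call_with_fallback(candidates, fake_call)
```

With the config from this report, the compression summary would come from deepseek-v4 instead of failing on the exhausted main provider. Errors other than exhaustion still propagate immediately, so real failures are not masked by provider hopping.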

It would also be useful to treat OpenAI/Codex-style usage-limit errors as fallback-worthy. Current payment/credit detection appears to cover billing/credit terms, but this observed error uses:

usage_limit_reached
The usage limit has been reached
resets_in_seconds

So the fallback classifier should likely include these usage-limit/rate-limit patterns as transient/exhaustion signals that can trigger provider fallback in auto mode.
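A minimal sketch of such a classifier, assuming the error is available as a status code plus a message string. is_exhaustion_error and the pattern list are hypothetical and are not the existing _is_payment_error(...) logic; the patterns are the ones observed in the warning above, plus a generic rate-limit match.

```python
import re
from typing import Optional

# Patterns taken from the observed 429 payload, plus a generic rate-limit match.
_EXHAUSTION_PATTERNS = re.compile(
    r"usage_limit_reached"
    r"|usage limit has been reached"
    r"|resets_in_seconds"
    r"|rate[ _-]?limit",
    re.IGNORECASE,
)


def is_exhaustion_error(status_code: Optional[int], message: Optional[str]) -> bool:
    """Return True if the error looks like rate/usage/quota exhaustion."""
    if status_code == 429:
        return True
    return bool(_EXHAUSTION_PATTERNS.search(message or ""))


observed = (
    "Error code: 429 - {'error': {'type': 'usage_limit_reached', "
    "'message': 'The usage limit has been reached', 'resets_in_seconds': 1657}}"
)
matched_by_text = is_exhaustion_error(None, observed)
matched_by_status = is_exhaustion_error(429, "")
not_matched = is_exhaustion_error(500, "internal server error")
```

Checking the message text as well as the status code covers cases where only the stringified exception reaches the auxiliary layer.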

Code pointers

Likely areas:

  • agent/auxiliary_client.py
    • _resolve_auto(...)
    • _get_provider_chain()
    • _is_payment_error(...) or a broader auxiliary fallback classifier
    • call_llm(...) fallback handling
  • agent/context_compressor.py
    • _generate_summary(...) calls call_llm(task="compression", main_runtime=...)

Why this matters

Context compression is critical for long sessions. If it fails during provider fallback, Hermes preserves much less continuity exactly when the session is already under pressure. Aligning auxiliary auto-routing with fallback_providers should make long conversations significantly more robust.

Labels: P2 (Medium: degraded but workaround exists); comp/agent (Core agent loop, run_agent.py, prompt builder); type/bug (Something isn't working)
