Auxiliary compression auto-routing ignores fallback_providers after main model usage limit

## Bug description

When the main chat model hits a rate or usage limit, Hermes switches the conversation to a configured `fallback_providers` entry. Context compression can still fail, however: the compression summarizer is routed through `auxiliary.compression`, and in `auto` mode it resolves back to the main model rather than to the active fallback model or the configured fallback providers.

Observed warning:

```text
⚠ Compression summary failed: Error code: 429 - {'error': {'type': 'usage_limit_reached', 'message': 'The usage limit has been reached', 'plan_type': 'plus', 'resets_at': 1777132814, 'eligible_promo': None, 'resets_in_seconds': 1657}}. Inserted a fallback context marker.
```

This degrades long-running sessions right when fallback is most important: the main agent can continue via fallback, but the compaction summary fails and Hermes inserts only a fallback context marker.

## Environment

- Hermes Agent: `v0.11.0 (2026.4.23)`
- Main provider/model: `openai-codex` / `gpt-5.5`
- Configured fallback provider/model: custom `deepseek-v4` / `deepseek-v4-flash`
- `auxiliary.compression.provider`: `auto`

Relevant config shape:

```yaml
model:
  default: gpt-5.5
  provider: openai-codex
  base_url: https://chatgpt.com/backend-api/codex

fallback_providers:
  - provider: deepseek-v4
    model: deepseek-v4-flash
    reasoning_effort: xhigh

providers:
  deepseek-v4:
    name: DeepSeek V4 Official
    base_url: https://api.deepseek.com/v1
    key_env: DEEPSEEK_API_KEY
    transport: openai_chat
    default_model: deepseek-v4-flash

auxiliary:
  compression:
    provider: auto
    model: ""
    timeout: 120
```

## Steps to reproduce

1. Configure a main provider/model that can return 429/usage limit errors.
2. Configure a working `fallback_providers` entry.
3. Keep `auxiliary.compression.provider: auto`.
4. Run a long session until context compression is triggered while the main provider is rate/usage-limited.

## Expected behavior

If the main model is rate/usage-limited and the conversation is already able to continue through `fallback_providers`, compression should also use an available fallback-capable route. The compression summary should be generated by the fallback provider rather than failing with a 429 from the exhausted main provider.

## Actual behavior

The main conversation can fall back, but compression still attempts the main provider/model through auxiliary auto-routing and fails with `usage_limit_reached`. Hermes then inserts a fallback context marker instead of a real summary.

## Current workaround

Pin compression explicitly to the fallback provider:

```yaml
auxiliary:
  compression:
    provider: deepseek-v4
    model: deepseek-v4-flash
    timeout: 120
```

This works, but the pin is static: it does not generalize to users who have multiple fallback providers or whose fallback preference changes.

## Recommended fix

I recommend fixing this at the auxiliary routing layer, not only by documenting the workaround.

Specifically, when `auxiliary.<task>.provider: auto`, `agent/auxiliary_client.py` should include configured `fallback_providers` in the auto/fallback resolution path for auxiliary LLM calls.

Suggested behavior:

1. Try the active/main provider as today.
2. If the call fails with rate/usage/quota exhaustion, try the configured `fallback_providers` in order.
3. Then fall back to the existing auxiliary provider chain (`openrouter`, `nous`, `local/custom`, `openai-codex`, API-key providers).

This makes auxiliary tasks track the same resilience policy as the main agent and avoids requiring users to duplicate fallback config under every `auxiliary.*` task.
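The suggested order could be sketched roughly as follows. This is a minimal illustration, not the actual `agent/auxiliary_client.py` code: `ProviderExhausted`, `build_candidates`, and `call_with_fallback` are hypothetical names, and routes are modeled as plain `(provider, model)` tuples as an assumption.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Route = Tuple[str, str]  # (provider, model); a hypothetical shape for this sketch


class ProviderExhausted(Exception):
    """Hypothetical error: the provider reported rate/usage/quota exhaustion."""


def build_candidates(
    main: Route,
    fallback_providers: Sequence[Route],
    aux_chain: Sequence[Route],
) -> List[Route]:
    """Order routes as: main, configured fallbacks, existing auxiliary chain."""
    seen = set()
    ordered: List[Route] = []
    for route in [main, *fallback_providers, *aux_chain]:
        if route not in seen:  # skip routes that already appear earlier
            seen.add(route)
            ordered.append(route)
    return ordered


def call_with_fallback(candidates: Sequence[Route], call: Callable[[Route], str]) -> str:
    """Try each route in order, moving on only when a provider is exhausted."""
    last_error: Optional[Exception] = None
    for route in candidates:
        try:
            return call(route)
        except ProviderExhausted as exc:
            last_error = exc
    if last_error is None:
        raise RuntimeError("no auxiliary routes configured")
    raise last_error
```

Deduplicating the chain keeps an exhausted main provider from being retried when it also appears in the existing auxiliary chain (for example `openai-codex`).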

It would also be useful to treat OpenAI/Codex-style usage-limit errors as fallback-worthy. Current payment/credit detection appears to cover billing/credit terms, but this observed error uses:

```text
usage_limit_reached
The usage limit has been reached
resets_in_seconds
```

So the fallback classifier should likely include these usage-limit/rate-limit patterns as transient/exhaustion signals that can trigger provider fallback in auto mode.
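A classifier extension along these lines might look like the sketch below. The function name `looks_like_usage_limit` and the marker tuple are assumptions made for illustration; the existing `_is_payment_error(...)` signature is not reproduced here. The marker strings are taken from the observed error above.

```python
from typing import Optional

# Substrings taken from the observed 429 payload; matched case-insensitively.
USAGE_LIMIT_MARKERS = (
    "usage_limit_reached",
    "usage limit has been reached",
    "resets_in_seconds",
)


def looks_like_usage_limit(status_code: Optional[int], message: str) -> bool:
    """Return True if the error looks like rate/usage exhaustion.

    Treats any HTTP 429 as fallback-worthy, and also matches the usage-limit
    markers in the message text for cases where the status code is unavailable.
    """
    if status_code == 429:
        return True
    text = message.lower()
    return any(marker in text for marker in USAGE_LIMIT_MARKERS)
```

Matching on the message text as well as the status code is a deliberate choice in this sketch, since the warning surfaced to the user embeds the payload in a string.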

## Code pointers

Likely areas:

- `agent/auxiliary_client.py`
  - `_resolve_auto(...)`
  - `_get_provider_chain()`
  - `_is_payment_error(...)` or a broader auxiliary fallback classifier
  - `call_llm(...)` fallback handling
- `agent/context_compressor.py`
  - `_generate_summary(...)` calls `call_llm(task="compression", main_runtime=...)`

## Why this matters

Context compression is critical for long sessions. If it fails during provider fallback, Hermes preserves much less continuity exactly when the session is already under pressure. Aligning auxiliary auto-routing with `fallback_providers` should make long conversations significantly more robust.

