Auxiliary call_llm fallback doesn't trigger on provider rate limits (429 daily quota) #26803

## Problem

When the auxiliary LLM provider (used for context compression, memory flush, web extraction, etc.) returns an HTTP 429 with a daily-quota message such as `"Too many tokens per day"`, the fallback chain in `call_llm()` does not activate. As a result, context compaction fails silently and drops conversation history without writing a summary.

Two root causes:

### 1. Daily rate limits not classified as fallback-worthy errors

`_is_payment_error()` checks for keywords like "credits", "insufficient funds", "billing", "payment required" — but daily token quota exhaustion (common with Bedrock, Vertex AI, and other cloud providers) uses different language like "Too many tokens per day" or "quota exceeded". These are functionally identical to credit exhaustion but don't trigger fallback.

**Suggested fix:** Add quota-related keywords to `_is_payment_error()` or create a separate `_is_quota_error()`:
```python
def _is_quota_error(err: Exception) -> bool:
    """Treat daily token-quota exhaustion like credit exhaustion (fallback-worthy)."""
    err_lower = str(err).lower()
    return any(kw in err_lower for kw in ("quota", "too many tokens", "rate limit exceeded",
                                          "daily limit", "tokens per day"))
```

### 2. Fallback chain gated on `resolved_provider == "auto"` only

In `call_llm()` (~line 2293):
```python
is_auto = resolved_provider in ("auto", "", None)
if should_fallback and is_auto:
    ...  # try the fallback chain
```

When a task resolves to a specific provider (e.g., "custom" for a LiteLLM proxy, or "openrouter"), the fallback chain is completely disabled. If that provider fails with a retriable error, `call_llm` raises instead of trying alternatives.

This is overly conservative. The intent is to respect explicit provider choice, but when the error is clearly "this provider can't serve right now" (payment, quota, connection), trying alternatives is better than failing entirely — especially for background tasks like context compression where the user didn't explicitly choose a provider.
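
The gate described above can be reproduced with a toy model. This is a simplified, hypothetical stand-in for `call_llm()` (`call_llm_model`, `QuotaError`, `bedrock` and `anthropic` are illustrative stubs, not real hermes-agent functions); it only keeps the `is_auto` check to show that a quota error on a `"custom"` provider is raised even though a working fallback exists.

```python
class QuotaError(Exception):
    """Hypothetical stand-in for a 429 'Too many tokens per day' response."""


def call_llm_model(resolved_provider, primary, fallbacks, should_fallback=True):
    """Toy model of the current gate in call_llm (not the real code)."""
    try:
        return primary()
    except QuotaError:
        is_auto = resolved_provider in ("auto", "", None)
        if should_fallback and is_auto:
            for fallback in fallbacks:
                try:
                    return fallback()
                except QuotaError:
                    continue  # try the next provider in the chain
        raise  # re-raise the primary provider's error


def bedrock():
    """Stub primary provider that has hit its daily quota."""
    raise QuotaError("Too many tokens per day")


def anthropic():
    """Stub fallback provider that still works."""
    return "summary"


print(call_llm_model("auto", bedrock, [anthropic]))  # summary
try:
    call_llm_model("custom", bedrock, [anthropic])
except QuotaError as exc:
    print("raised:", exc)  # raised: Too many tokens per day
```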

**Suggested fix:** Allow fallback for quota, payment, and connection errors regardless of how the provider was resolved, or at minimum for tasks whose provider came from the auto-detection chain rather than explicit user config:
```python
# Allow fallback when the error is clearly about provider capacity
# (payment, quota, connection), not about the request itself
# (malformed input or other non-429 4xx client errors).
if should_fallback:
    ...  # try the fallback chain
```
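
Both variants of the suggested fix can be expressed as one decision function. This is a hedged sketch: `ErrorKind`, `ProviderSource`, `may_fall_back` and the `respect_explicit_choice` flag are hypothetical, and it assumes `call_llm()` can tell whether the provider came from auto-detection or from explicit user config.

```python
from enum import Enum


class ErrorKind(Enum):
    PAYMENT = "payment"
    QUOTA = "quota"
    CONNECTION = "connection"
    CLIENT = "client"  # malformed request, context too long, etc.


class ProviderSource(Enum):
    AUTO_DETECTED = "auto_detected"
    USER_EXPLICIT = "user_explicit"


# Errors that mean "this provider can't serve right now".
CAPACITY_ERRORS = {ErrorKind.PAYMENT, ErrorKind.QUOTA, ErrorKind.CONNECTION}


def may_fall_back(kind: ErrorKind, source: ProviderSource,
                  respect_explicit_choice: bool = False) -> bool:
    """Decide whether to try the fallback chain (hypothetical policy)."""
    if kind not in CAPACITY_ERRORS:
        return False  # the request itself is bad; another provider won't help
    if respect_explicit_choice and source is ProviderSource.USER_EXPLICIT:
        return False  # conservative variant: honour an explicit user pin
    return True
```

With `respect_explicit_choice=False` this is the "regardless of resolution source" option; setting it to `True` gives the "at minimum" option, which still lets background tasks like compression fall back when their provider was auto-detected.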

## Impact

- Context compaction fails silently when the primary provider hits daily limits
- Middle conversation turns are dropped without summary
- Agent loses task context and starts repeating work or acting confused
- Affects any deployment whose auxiliary provider enforces daily quotas or rate limits (Bedrock, Vertex AI, free-tier OpenRouter, etc.)

## Environment

- hermes-agent 0.8.0
- LiteLLM proxy routing to Bedrock (daily token limit) with Anthropic as fallback
- Context compressor calls `call_llm(task="compression", ...)` which resolves to the custom/litellm provider
