Problem
When the auxiliary LLM provider (used for context compression, memory flush, web extraction, etc.) returns a 429 rate-limit response carrying a daily quota message such as "Too many tokens per day", the fallback chain in call_llm() does not activate. As a result, context compaction fails silently and conversation history is dropped without a summary.
Two root causes:
1. Daily rate limits not classified as fallback-worthy errors
_is_payment_error() checks for keywords like "credits", "insufficient funds", "billing", "payment required" — but daily token quota exhaustion (common with Bedrock, Vertex AI, and other cloud providers) uses different language like "Too many tokens per day" or "quota exceeded". These are functionally identical to credit exhaustion but don't trigger fallback.
Suggested fix: Add quota-related keywords to _is_payment_error() or create a separate _is_quota_error():
# In _is_payment_error() or a new _is_quota_error() check:
if any(kw in err_lower for kw in ("quota", "too many tokens", "rate limit exceeded",
                                  "daily limit", "tokens per day")):
    return True
2. Fallback chain gated on resolved_provider == "auto" only
In call_llm() (~line 2293):
is_auto = resolved_provider in ("auto", "", None)
if should_fallback and is_auto:
When a task resolves to a specific provider (e.g., "custom" for a LiteLLM proxy, or "openrouter"), the fallback chain is completely disabled. If that provider fails with a retriable error, call_llm raises instead of trying alternatives.
This is overly conservative. The intent is to respect explicit provider choice, but when the error is clearly "this provider can't serve right now" (payment, quota, connection), trying alternatives is better than failing entirely — especially for background tasks like context compression where the user didn't explicitly choose a provider.
Suggested fix: Allow fallback for quota/payment/connection errors regardless of provider resolution source, or at minimum for tasks where the provider was resolved via auto-detection chain rather than explicit user config:
# Allow fallback when the error is clearly about provider capacity,
# not about the request itself (4xx client errors etc.)
if should_fallback:
    # ... try fallback chain
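A minimal sketch of the more conservative variant (fall back unless the user explicitly pinned the provider) might look like this. The function allow_fallback and the provider_pinned_by_user flag are hypothetical; call_llm() would need some way to know whether the provider came from explicit user config or from the auto-detection chain, and I don't know if that information is currently available there.

```python
from typing import Optional


def allow_fallback(
    should_fallback: bool,
    resolved_provider: Optional[str],
    provider_pinned_by_user: bool,
) -> bool:
    """Decide whether the fallback chain may run (hypothetical helper).

    should_fallback: the error was classified as payment/quota/connection.
    provider_pinned_by_user: assumed flag, True only when the provider
    came from explicit user configuration.
    """
    if not should_fallback:
        return False
    # Same check as the existing gate in call_llm().
    is_auto = resolved_provider in ("auto", "", None)
    # New: also fall back when the provider was auto-detected, not pinned.
    return is_auto or not provider_pinned_by_user
```

With this shape, a compression task that resolved to "custom" through auto-detection would fall back on a quota error, while a provider the user set explicitly would still raise.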
Impact
- Context compaction fails silently when the primary provider hits daily limits
- Middle conversation turns are dropped without summary
- Agent loses task context and starts repeating work or acting confused
- Affects any deployment using provider rate limits (Bedrock, Vertex AI, free-tier OpenRouter, etc.)
Environment
- hermes-agent 0.8.0
- LiteLLM proxy routing to Bedrock (daily token limit) with Anthropic as fallback
- Context compressor calls call_llm(task="compression", ...) which resolves to the custom/litellm provider