Skip to content

Auxiliary call_llm fallback doesn't trigger on provider rate limits (429 daily quota) #26803

@juniperbevensee

Description

@juniperbevensee

Problem

When the auxiliary LLM provider (used for context compression, memory flush, web extraction, etc.) returns a 429 rate limit with a daily quota message like "Too many tokens per day", the fallback chain in call_llm() does not activate. This causes context compaction to silently fail, dropping conversation history without a summary.

Two root causes:

1. Daily rate limits not classified as fallback-worthy errors

_is_payment_error() checks for keywords like "credits", "insufficient funds", "billing", "payment required" — but daily token quota exhaustion (common with Bedrock, Vertex AI, and other cloud providers) uses different language like "Too many tokens per day" or "quota exceeded". These are functionally identical to credit exhaustion but don't trigger fallback.

Suggested fix: Add quota-related keywords to _is_payment_error() or create a separate _is_quota_error():

# In _is_payment_error or a new _is_quota_exhaustion check:
if any(kw in err_lower for kw in ("quota", "too many tokens", "rate limit exceeded",
                                    "daily limit", "tokens per day")):
    return True

2. Fallback chain gated on resolved_provider == "auto" only

In call_llm() (~line 2293):

is_auto = resolved_provider in ("auto", "", None)
if should_fallback and is_auto:

When a task resolves to a specific provider (e.g., "custom" for a LiteLLM proxy, or "openrouter"), the fallback chain is completely disabled. If that provider fails with a retriable error, call_llm raises instead of trying alternatives.

This is overly conservative. The intent is to respect explicit provider choice, but when the error is clearly "this provider can't serve right now" (payment, quota, connection), trying alternatives is better than failing entirely — especially for background tasks like context compression where the user didn't explicitly choose a provider.

Suggested fix: Allow fallback for quota/payment/connection errors regardless of provider resolution source, or at minimum for tasks where the provider was resolved via auto-detection chain rather than explicit user config:

# Allow fallback when the error is clearly about provider capacity,
# not about the request itself (4xx client errors etc.)
if should_fallback:
    # ... try fallback chain

Impact

  • Context compaction fails silently when the primary provider hits daily limits
  • Middle conversation turns are dropped without summary
  • Agent loses task context and starts repeating work or acting confused
  • Affects any deployment using provider rate limits (Bedrock, Vertex AI, free-tier OpenRouter, etc.)

Environment

  • hermes-agent 0.8.0
  • LiteLLM proxy routing to Bedrock (daily token limit) with Anthropic as fallback
  • Context compressor calls call_llm(task="compression", ...) which resolves to the custom/litellm provider

Metadata

Metadata

Assignees

No one assigned

    Labels

    P2Medium — degraded but workaround existscomp/agentCore agent loop, run_agent.py, prompt buildertype/bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions