Skip to content

Provider cooldown / circuit breaker for persistent failures #5436

@kshitijk4poor

Description

@kshitijk4poor

Problem

When a provider returns 401/403 or persistent rate limits, Hermes retries max_retries times, possibly tries fallback, but on the next iteration loop pass it hits the same broken provider again. No memory of prior failures carries across iterations within a session.

This means a rate-limited provider gets hammered on every single tool-call iteration until the session ends or the user intervenes.

Proposed Solution

Add a lightweight process-scoped ProviderCooldownTracker that:

  • Records failure reasons per (provider, base_url) key
  • Implements escalating cooldown: 30s → 60s → 5min for transient errors, 5min → 10min → 30min for permanent errors
  • Is checked before each API call in run_conversation()
  • On cooldown: triggers fallback activation immediately
  • Resets on successful calls (circuit breaker close)

Thread-safe singleton — works across concurrent gateway sessions.

Benefits

  • Stops hammering broken providers across iterations
  • Graceful degradation with automatic recovery
  • Pairs naturally with the existing fallback chain

Metadata

Metadata

Assignees

No one assigned

    Labels

    P2Medium — degraded but workaround existscomp/agentCore agent loop, run_agent.py, prompt buildertype/featureNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions