feat: Intelligent Failover with Circuit Breaker for Permanent Errors #16668

@riyaazd29

### Problem

The current model failover logic is robust for transient errors but can be overly aggressive when faced with permanent, unrecoverable errors like or .

When a primary model's API key is invalid, the system currently attempts to fail over to the next model in the list. However, it retries the exact same prompt, which can be very large. This single large retry can be enough to exhaust the TPM (Tokens Per Minute) quota of the fallback model. Every subsequent request repeats this pattern, creating a self-inflicted denial-of-service where all available models quickly become rate-limited due to the primary model's permanent failure.

### Proposed Solution

Implement a more intelligent failover mechanism that incorporates a circuit breaker pattern, differentiating between transient and permanent errors.

1. Error Categorization:
   * Permanent Errors: Treat HTTP , , and potentially (for invalid model IDs) as permanent configuration issues.
   * Transient Errors: Treat HTTP , , , , as temporary service issues.

2. Circuit Breaker Logic:
   * If a model provider returns a permanent error, the system should immediately "trip the circuit" for that specific auth profile (e.g., ).
   * The unhealthy profile should be placed into a cooldown state for a configurable duration (e.g., 10-15 minutes) to prevent further requests.
   * The system should immediately attempt the request on the next model in the fallback list.
   * A high-priority system notification should be generated, informing the user that a provider has been disabled due to an authentication/configuration error (e.g., ).

3. Failover for Transient Errors:
   * If a model fails with a transient error (like a rate limit), the existing failover logic to the next model is appropriate.

### Benefits

- Preserves Resources: Prevents the system from wasting API calls and burning through the rate limits of healthy fallback models.
- Increases Resilience: Allows the system to gracefully degrade by automatically sidelining a misconfigured provider while continuing to function on others.
- Improves Diagnosability: Provides clear, immediate feedback about which part of the configuration is broken, allowing for faster resolution.
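
### Illustrative Sketch

The proposed behavior could look roughly like the following. This is a minimal sketch and not the project's actual implementation: `ProviderError`, `FailoverRouter`, `send`, and the profile names are hypothetical, and the set of status codes treated as permanent is passed in as configuration rather than fixed, because which codes belong in each category is a decision for maintainers.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, List, Optional


class ProviderError(Exception):
    """Hypothetical error type carrying the HTTP status a provider returned."""

    def __init__(self, status: int, message: str = "") -> None:
        super().__init__(message or f"provider returned HTTP {status}")
        self.status = status


@dataclass
class FailoverRouter:
    """Fails over across auth profiles and sidelines permanently broken ones."""

    # Which statuses count as permanent is configuration (an assumption here).
    permanent_statuses: FrozenSet[int]
    # 600 s = 10 minutes, the low end of the range suggested in the proposal.
    cooldown_seconds: float = 600.0
    # Injectable clock and notifier so the behavior can be tested.
    clock: Callable[[], float] = time.monotonic
    notify: Callable[[str], None] = print
    # profile name -> time at which its circuit closes again
    _open_until: Dict[str, float] = field(default_factory=dict, init=False)

    def is_available(self, profile: str) -> bool:
        """Return True if the profile is not in its cooldown window."""
        until = self._open_until.get(profile)
        return until is None or self.clock() >= until

    def call(self, profiles: List[str], send: Callable[[str], str]) -> str:
        """Try each profile in order, skipping any whose circuit is open."""
        last_error: Optional[ProviderError] = None
        for profile in profiles:
            if not self.is_available(profile):
                continue  # circuit is open: spend no request on this profile
            try:
                return send(profile)
            except ProviderError as err:
                last_error = err
                if err.status in self.permanent_statuses:
                    # Permanent error: trip the circuit and tell the user.
                    self._open_until[profile] = self.clock() + self.cooldown_seconds
                    self.notify(
                        f"{profile} disabled for {self.cooldown_seconds:.0f}s "
                        f"after HTTP {err.status}"
                    )
                # Permanent or transient, move on to the next profile.
        if last_error is not None:
            raise last_error
        raise RuntimeError("no auth profile is currently available")
```

With this shape, a call such as `router.call(["primary", "fallback"], send)` sends one request to a misconfigured `primary`, trips its circuit, and then uses `fallback`. Until the cooldown expires, later calls skip `primary` entirely, so the broken profile costs one failed request per cooldown window instead of one per prompt. Transient errors still fall through to the next profile without tripping anything, which matches the existing failover behavior described above.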

Labels: stale (marked as stale due to inactivity)
