feat: Intelligent Failover with Circuit Breaker for Permanent Errors

### Problem

The current model failover logic is robust for transient errors but can be overly aggressive when faced with permanent, unrecoverable errors like  or .

When a primary model's API key is invalid, the system currently attempts to fail over to the next model in the list. However, it retries the *exact same prompt*, which can be very large. This single large retry can be enough to exhaust the TPM (Tokens Per Minute) quota of the fallback model. Every subsequent request repeats this pattern, creating a self-inflicted denial-of-service where all available models quickly become rate-limited due to the primary model's permanent failure.

### Proposed Solution

Implement a more intelligent failover mechanism that incorporates a circuit breaker pattern, differentiating between transient and permanent errors.

1.  **Error Categorization:**
    *   **Permanent Errors:** Treat HTTP , , and potentially  (for invalid model IDs) as permanent configuration issues.
    *   **Transient Errors:** Treat HTTP , , , ,  as temporary service issues.

2.  **Circuit Breaker Logic:**
    *   If a model provider returns a **permanent error**, the system should immediately "trip the circuit" for that specific auth profile (e.g., ).
    *   The unhealthy profile should be placed into a **cooldown state** for a configurable duration (e.g., 10-15 minutes) to prevent further requests.
    *   The system should immediately attempt the request on the next model in the fallback list.
    *   A high-priority system notification should be generated, informing the user that a provider has been disabled due to an authentication/configuration error (e.g., ).

3.  **Failover for Transient Errors:**
    *   If a model fails with a transient error (like a rate limit), the existing failover logic to the next model is appropriate.

### Benefits

-   **Preserves Resources:** Prevents the system from wasting API calls and burning through the rate limits of healthy fallback models.
-   **Increases Resilience:** Allows the system to gracefully degrade by automatically sidelining a misconfigured provider while continuing to function on others.
-   **Improves Diagnosability:** Provides clear, immediate feedback about which part of the configuration is broken, allowing for faster resolution.
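
### Illustrative Sketch

The proposal above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `ProviderError`, `CircuitBreaker`, `run_with_failover`, and the `notify` callback are hypothetical names, and the status codes in `PERMANENT_STATUS` and `TRANSIENT_STATUS` are assumptions chosen for the example that should be adjusted to whatever categorization is finally agreed on.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Assumed status sets, for illustration only.
PERMANENT_STATUS = {401, 403, 404}
TRANSIENT_STATUS = {429, 500, 502, 503, 504}


class ProviderError(Exception):
    """Hypothetical error type carrying the provider's HTTP status."""

    def __init__(self, status: int) -> None:
        super().__init__(f"provider returned HTTP {status}")
        self.status = status


@dataclass
class CircuitBreaker:
    """Tracks which auth profiles are in cooldown after a permanent error."""

    cooldown_seconds: float = 600.0  # 10 minutes; meant to be configurable
    clock: Callable[[], float] = time.monotonic
    _open_until: Dict[str, float] = field(default_factory=dict)

    def is_open(self, profile: str) -> bool:
        until = self._open_until.get(profile)
        if until is None:
            return False
        if self.clock() >= until:
            # Cooldown elapsed: allow the profile to be tried again.
            del self._open_until[profile]
            return False
        return True

    def trip(self, profile: str) -> None:
        self._open_until[profile] = self.clock() + self.cooldown_seconds


def run_with_failover(
    profiles: List[str],
    call: Callable[[str], str],
    breaker: CircuitBreaker,
    notify: Callable[[str], None],
) -> str:
    """Try each profile in order, skipping any that are in cooldown."""
    last_error: Optional[ProviderError] = None
    for profile in profiles:
        if breaker.is_open(profile):
            continue  # sidelined: do not spend a request on it
        try:
            return call(profile)
        except ProviderError as err:
            last_error = err
            if err.status in PERMANENT_STATUS:
                # Permanent error: trip the circuit and tell the user.
                breaker.trip(profile)
                notify(
                    f"{profile} disabled for {breaker.cooldown_seconds:.0f}s "
                    f"after HTTP {err.status}"
                )
            elif err.status not in TRANSIENT_STATUS:
                raise  # unknown category: surface it rather than guess
            # Transient or permanent: move on to the next profile.
    raise RuntimeError("no healthy profile available") from last_error
```

With this shape, a profile that fails with a permanent error is attempted once and then skipped on every request until its cooldown expires, so the large prompt is no longer sent to the broken primary on each call. Injecting `clock` keeps the cooldown logic testable without real waiting.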

feat: Intelligent Failover with Circuit Breaker for Permanent Errors #16668
