bug: Incorrect Model Fallback and Retry Logic for 429 Quota Errors #9248

@gsquared94

What happened?

The current model fallback logic for 429 quota errors is unreliable and leads to a poor user experience. It relies on fragile string matching of error messages and a simple counter for consecutive errors. This causes incorrect behavior, such as downgrading the model for transient, short-term limits when a simple retry would suffice, or failing to downgrade when a hard daily limit is hit. The current implementation only handles one authentication type (LOGIN_WITH_GOOGLE) and does not cover all API endpoints that it should.

What did you expect to happen?

The model fallback and retry logic should be robust, predictable, and apply to all authentication types (Google Sign-In, API Keys for Gemini & Vertex) and relevant API calls. It should intelligently distinguish between short-term (retryable) and long-term (fallback-worthy) quota limits.

When a short-term, per-minute limit is hit, the CLI should retry after the delay specified by the API. The user should not be interrupted.

When a long-term, daily limit is hit, the CLI should immediately and clearly inform the user and suggest falling back to a different model (e.g., Flash), because retrying is unlikely to solve the problem in a reasonable timeframe.

Client information

Run gemini to enter the interactive CLI, then run the /about command.

> /about
│ About Gemini CLI                                                │
│                                                                 │
│ CLI Version           0.7.0-nightly.20250918.2722473a           │
│ Git Commit            f46e50b27                                 │
│ Model                 gemini-2.5-pro                            │
│ Sandbox               no sandbox                                │
│ OS                    linux                                     │
│ Auth Method           vertex-ai                                 │
│ GCP Project           gaghosh-project-1                         │
│ IDE Client            VS Code                                   │

Login information

This issue affects all login types, including Google Account (GCA), Gemini API Keys, and Vertex API authentication, as all these services return structured 429 errors that are not being properly utilized.

Anything else we need to know?

The backend APIs (GCA, Gemini, Vertex) generally return structured errors for 429 responses, in line with Google API error standards. We can write a parser for these errors, such as parseGoogleApiError in https://gist.github.com/gsquared94/375a7220f6ea0d32961e7ab1a1d63da5, and refactor the retryWithBackoff logic to use it.

An example of a structured error is:

{
  "error": {
    "message": "{\n \"error\": {\n \"code\": 429,\n \"message\": \"You exceeded your current quota, please check your plan and billing details. For more information on this error, head to: https://ai.google.dev/gemini-api/docs/rate-limits.\\n* Quota exceeded for metric: generativelanguage.googleapis.com/generate_content_free_tier_requests, limit: 50\\nPlease retry in 34.074824224s.\",\n \"status\": \"RESOURCE_EXHAUSTED\",\n \"details\": [\n {\n \"@type\": \"type.googleapis.com/google.rpc.QuotaFailure\",\n \"violations\": [\n {\n \"quotaMetric\": \"generativelanguage.googleapis.com/generate_content_free_tier_requests\",\n \"quotaId\": \"GenerateRequestsPerDayPerProjectPerModel-FreeTier\",\n \"quotaDimensions\": {\n \"location\": \"global\",\n \"model\": \"gemini-2.5-pro\"\n },\n \"quotaValue\": \"50\"\n }\n ]\n },\n {\n \"@type\": \"type.googleapis.com/google.rpc.Help\",\n \"links\": [\n {\n \"description\": \"Learn more about Gemini API quotas\",\n \"url\": \"https://ai.google.dev/gemini-api/docs/rate-limits\"\n }\n ]\n },\n {\n \"@type\": \"type.googleapis.com/google.rpc.RetryInfo\",\n \"retryDelay\": \"34s\"\n }\n ]\n }\n}",
    "code": 429,
    "status": "Too Many Requests"
  }
}

Note: The message field can itself contain stringified JSON, so the parser needs to run JSON.parse() on it to reach the inner error and its details.
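
To illustrate the nested parsing, here is a minimal sketch of what such a parser could look like. It is not the code from the gist: the interface shapes, the helper names (tryParseJson, isRecord, isDetail) and the maxDepth guard are assumptions made for this example, and the field names are taken from the example payload above.

```typescript
// Hypothetical shapes for illustration; field names follow the example payload.
interface GoogleApiErrorDetail {
  "@type": string;
  [key: string]: unknown;
}

interface GoogleApiError {
  code: number;
  message: string;
  status: string | undefined;
  details: GoogleApiErrorDetail[];
}

// JSON.parse that returns undefined instead of throwing.
function tryParseJson(text: string): unknown {
  try {
    return JSON.parse(text);
  } catch {
    return undefined;
  }
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

function isDetail(value: unknown): value is GoogleApiErrorDetail {
  return isRecord(value) && typeof value["@type"] === "string";
}

// Unwraps { error: { message: "<stringified JSON>" } } layers and returns the
// innermost error object found. maxDepth (an assumption) bounds the unwrapping.
function parseGoogleApiError(input: unknown, maxDepth = 5): GoogleApiError | undefined {
  // Caught Error instances are assumed to carry the payload in their message.
  const seed = input instanceof Error ? input.message : input;
  let current: unknown = typeof seed === "string" ? tryParseJson(seed) : seed;
  let innermost: GoogleApiError | undefined;

  for (let depth = 0; depth < maxDepth && isRecord(current); depth++) {
    const err = current["error"];
    if (!isRecord(err)) break;
    const code = err["code"];
    const message = err["message"];
    if (typeof code !== "number" || typeof message !== "string") break;
    const status = err["status"];
    const rawDetails = err["details"];
    innermost = {
      code,
      message,
      status: typeof status === "string" ? status : undefined,
      details: Array.isArray(rawDetails) ? rawDetails.filter(isDetail) : [],
    };
    // The message may itself be stringified JSON; if not, the loop ends here.
    current = tryParseJson(message);
  }
  return innermost;
}
```

With the example above, the outer layer yields code 429 with no details, and the second pass replaces it with the inner RESOURCE_EXHAUSTED error carrying QuotaFailure, Help and RetryInfo. Input that is not structured returns undefined, which leaves room for the legacy path.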

Implementation Proposal (without too much refactoring):

  1. In retryWithBackoff (and other relevant error handling locations), call parseGoogleApiError on any caught error.
  2. If a structured GoogleApiError is parsed, inspect its details:
    • Check for QuotaFailure:
      • Examine the quotaId in the violations. If it contains substrings like PerDay, Daily, or other long-term indicators, it's a terminal quota for the session. This should trigger the model fallback flow.
      • If the quotaId contains PerMinute, PerSecond, or indicates a short-term limit, it's a transient error.
    • Check for RetryInfo:
      • Use the retryDelay. A short delay (e.g., < 5 minutes) confirms a transient error, and the system should wait for this duration and retry.
      • A long delay (e.g., hours) indicates a long-term lockout, which should also trigger the model fallback flow.
  3. Decision Logic:
    • Trigger Fallback IF: (QuotaFailure exists AND quotaId indicates a daily/long-term limit) OR (RetryInfo exists AND retryDelay is long, e.g., > 5 minutes).
    • Retry Silently IF: (QuotaFailure exists AND quotaId indicates a minute/short-term limit) OR (RetryInfo exists AND retryDelay is short).
    • If no structured error is found, the system can revert to the existing (less reliable) backoff mechanism as a legacy fallback.
  4. This new logic should replace the current string-matching and consecutive error counting, and it should be applied to all auth types.
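
The decision logic in step 3 can be sketched roughly as follows. This is only an illustration: classifyQuotaError, parseRetryDelayMs, collectQuotaIds and the exact 5-minute constant are hypothetical names and values, and checking the fallback condition first when both conditions match (as in the example payload, which has a PerDay quotaId alongside a 34s retryDelay) is an assumption, since the proposal does not state a precedence.

```typescript
type QuotaDecision = "fallback" | "retry" | "legacy";

interface ErrorDetail {
  "@type": string;
  [key: string]: unknown;
}

interface Classification {
  decision: QuotaDecision;
  retryDelayMs: number | undefined;
}

const QUOTA_FAILURE = "type.googleapis.com/google.rpc.QuotaFailure";
const RETRY_INFO = "type.googleapis.com/google.rpc.RetryInfo";
// Assumed threshold, taken from the "e.g., 5 minutes" in the proposal.
const LONG_DELAY_MS = 5 * 60 * 1000;

// Parses a duration string such as "34s" or "34.074824224s" into milliseconds.
function parseRetryDelayMs(delay: unknown): number | undefined {
  if (typeof delay !== "string") return undefined;
  const match = /^(\d+(?:\.\d+)?)s$/.exec(delay.trim());
  return match ? Math.round(Number(match[1]) * 1000) : undefined;
}

// Collects every quotaId string from all QuotaFailure violations.
function collectQuotaIds(details: ErrorDetail[]): string[] {
  const ids: string[] = [];
  for (const detail of details) {
    if (detail["@type"] !== QUOTA_FAILURE) continue;
    const violations = detail["violations"];
    if (!Array.isArray(violations)) continue;
    for (const violation of violations) {
      if (typeof violation === "object" && violation !== null) {
        const id = (violation as Record<string, unknown>)["quotaId"];
        if (typeof id === "string") ids.push(id);
      }
    }
  }
  return ids;
}

function classifyQuotaError(details: ErrorDetail[]): Classification {
  const quotaIds = collectQuotaIds(details);
  const retryInfo = details.find((d) => d["@type"] === RETRY_INFO);
  const retryDelayMs = retryInfo ? parseRetryDelayMs(retryInfo["retryDelay"]) : undefined;

  const longTermQuota = quotaIds.some((id) => /PerDay|Daily/.test(id));
  const shortTermQuota = quotaIds.some((id) => /PerMinute|PerSecond/.test(id));
  const longDelay = retryDelayMs !== undefined && retryDelayMs > LONG_DELAY_MS;

  // Assumption: fallback wins when both conditions match.
  if (longTermQuota || longDelay) return { decision: "fallback", retryDelayMs };
  if (shortTermQuota || retryDelayMs !== undefined) return { decision: "retry", retryDelayMs };
  // No usable structured signal: defer to the existing backoff mechanism.
  return { decision: "legacy", retryDelayMs: undefined };
}
```

Matching on quotaId substrings is still string matching, but against a stable identifier rather than a human-readable message, which is the distinction the proposal relies on.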

This approach will make the retry/fallback behavior much more accurate and improve the user experience by correctly interpreting the specific reason for the rate limit.
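
As a rough illustration of how a retry loop could consume such a classification, the sketch below waits for the server-specified delay on transient errors, surfaces a dedicated error when fallback is warranted, and otherwise uses plain exponential backoff. All names here (retryWithQuotaAwareness, FallbackRequiredError, the Decision type, the injectable sleep) are hypothetical and do not describe the existing retryWithBackoff signature.

```typescript
type Decision =
  | { kind: "retry"; delayMs: number | undefined }
  | { kind: "fallback" }
  | { kind: "legacy" };

interface RetryOptions {
  maxAttempts: number;
  initialDelayMs: number;
  // Maps a caught error to a decision, e.g. based on a parsed structured error.
  classify: (error: unknown) => Decision;
  // Injectable for tests; defaults to a setTimeout-based sleep.
  sleep?: (ms: number) => Promise<void>;
}

// Thrown when retrying is pointless and the caller should offer model fallback.
class FallbackRequiredError extends Error {
  original: unknown;
  constructor(original: unknown) {
    super("Long-term quota exhausted; model fallback suggested");
    this.name = "FallbackRequiredError";
    this.original = original;
  }
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function retryWithQuotaAwareness<T>(
  fn: () => Promise<T>,
  options: RetryOptions,
): Promise<T> {
  const sleep = options.sleep ?? defaultSleep;
  let backoffMs = options.initialDelayMs;

  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      const decision = options.classify(error);
      // Long-term limit: stop immediately so the user can be informed.
      if (decision.kind === "fallback") throw new FallbackRequiredError(error);
      if (attempt >= options.maxAttempts) throw error;
      if (decision.kind === "retry" && decision.delayMs !== undefined) {
        // Short-term limit: wait exactly as long as the API asked.
        await sleep(decision.delayMs);
      } else {
        // No server-specified delay: legacy exponential backoff.
        await sleep(backoffMs);
        backoffMs *= 2;
      }
    }
  }
}
```

Throwing a distinct error type for the fallback case keeps the user-facing prompt out of the retry loop, so the same loop could in principle be shared across auth types and API calls.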

Labels: area/core (Issues related to User Interface, OS Support, Core Functionality)

Status: Closed