What happened?
The current model fallback logic for 429 quota errors is unreliable and leads to a poor user experience. It relies on fragile string matching of error messages and a simple counter for consecutive errors. This causes incorrect behavior, such as downgrading the model for transient, short-term limits when a simple retry would suffice, or failing to downgrade when a hard daily limit is hit. The current implementation only handles one authentication type (LOGIN_WITH_GOOGLE) and does not cover all API endpoints that it should.
What did you expect to happen?
The model fallback and retry logic should be robust, predictable, and apply to all authentication types (Google Sign-In, API Keys for Gemini & Vertex) and relevant API calls. It should intelligently distinguish between short-term (retryable) and long-term (fallback-worthy) quota limits.
When a short-term, per-minute limit is hit, the CLI should retry after the delay specified by the API. The user should not be interrupted.
When a long-term, daily limit is hit, the CLI should immediately and clearly inform the user and suggest falling back to a different model (e.g., Flash), because retrying is unlikely to solve the problem in a reasonable timeframe.
Client information
Run gemini to enter the interactive CLI, then run the /about command.
> /about
About Gemini CLI
CLI Version    0.7.0-nightly.20250918.2722473a
Git Commit     f46e50b27
Model          gemini-2.5-pro
Sandbox        no sandbox
OS             linux
Auth Method    vertex-ai
GCP Project    gaghosh-project-1
IDE Client     VS Code
Login information
This issue affects all login types, including Google Account (GCA), Gemini API Keys, and Vertex API authentication, as all these services return structured 429 errors that are not being properly utilized.
Anything else we need to know?
The backend APIs (GCA, Gemini, Vertex) generally return structured errors for 429 responses, compliant with Google API standards. We can create a parser for these errors, such as parseGoogleApiError in https://gist.github.com/gsquared94/375a7220f6ea0d32961e7ab1a1d63da5, and refactor the retryWithBackoff logic to use it.
An example of a structured error is:
{
"error": {
"message": "{\n \"error\": {\n \"code\": 429,\n \"message\": \"You exceeded your current quota, please check your plan and billing details. For more information on this error, head to: https://ai.google.dev/gemini-api/docs/rate-limits.\\n* Quota exceeded for metric: generativelanguage.googleapis.com/generate_content_free_tier_requests, limit: 50\nPlease retry in 34.074824224s.\",\n \"status\": \"RESOURCE_EXHAUSTED\",\n \"details\": [\n {\n \"@type\": \"type.googleapis.com/google.rpc.QuotaFailure\",\n \"violations\": [\n {\n \"quotaMetric\": \"generativelanguage.googleapis.com/generate_content_free_tier_requests\",\n \"quotaId\": \"GenerateRequestsPerDayPerProjectPerModel-FreeTier\",\n \"quotaDimensions\": {\n \"location\": \"global\",\n \"model\": \"gemini-2.5-pro\"\n },\n \"quotaValue\": \"50\"\n }\n ]\n },\n {\n \"@type\": \"type.googleapis.com/google.rpc.Help\",\n \"links\": [\n {\n \"description\": \"Learn more about Gemini API quotas\",\n \"url\": \"https://ai.google.dev/gemini-api/docs/rate-limits\"\n }
]\n },\n {\n \"@type\": \"type.googleapis.com/google.rpc.RetryInfo\",\n \"retryDelay\": \"34s\"\n }
]\n }\n}",
"code": 429,
"status": "Too Many Requests"
}
}
Note: The message field can itself contain stringified JSON, as in the example above, where the structured error is nested inside the outer message. The parser will need to handle this with JSON.parse().
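The unwrapping step described in the note could look roughly like the sketch below. This is not the parseGoogleApiError from the gist. The name unwrapGoogleApiError, the GoogleApiError and ErrorDetail shapes, and the MAX_DEPTH cap are assumptions made for illustration. The sketch also does not try to repair payloads whose nested string is not valid JSON. In that case it returns whatever structured error it has found so far.

```typescript
// Hypothetical shapes; field names follow the example payload in this issue.
interface ErrorDetail {
  "@type": string;
  [key: string]: unknown;
}

interface GoogleApiError {
  code: number;
  message: string;
  status: string | undefined;
  details: ErrorDetail[];
}

// Assumed cap on how many levels of stringified JSON we are willing to unwrap.
const MAX_DEPTH = 5;

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function isErrorDetail(value: unknown): value is ErrorDetail {
  return isRecord(value) && typeof value["@type"] === "string";
}

/**
 * Walks { error: { message: "<stringified JSON>" } } wrappers until it finds
 * an error that carries details. Returns the deepest structured error seen,
 * or undefined when the input is not a structured error at all.
 */
function unwrapGoogleApiError(input: unknown): GoogleApiError | undefined {
  let current: unknown = input;
  let best: GoogleApiError | undefined;

  for (let depth = 0; depth < MAX_DEPTH; depth++) {
    if (typeof current === "string") {
      try {
        current = JSON.parse(current);
      } catch {
        return best; // plain-text message, nothing more to unwrap
      }
    }
    if (!isRecord(current)) return best;
    const err = current["error"];
    if (!isRecord(err)) return best;

    const code = err["code"];
    const message = err["message"];
    if (typeof code !== "number" || typeof message !== "string") return best;

    const status = err["status"];
    const rawDetails = err["details"];
    const details = Array.isArray(rawDetails) ? rawDetails.filter(isErrorDetail) : [];

    best = {
      code,
      message,
      status: typeof status === "string" ? status : undefined,
      details,
    };
    if (details.length > 0) return best;

    // No details at this level: the message may hold another stringified error.
    current = message;
  }
  return best;
}
```

A caller would treat an undefined result, or a result with empty details, as "no structured information available" and fall through to the existing handling.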
Implementation Proposal (without too much refactoring):
- In retryWithBackoff (and other relevant error handling locations), call parseGoogleApiError on any caught error.
- If a structured GoogleApiError is parsed, inspect its details:
  - Check for QuotaFailure:
    - Examine the quotaId in the violations. If it contains substrings like PerDay, Daily, or other long-term indicators, it's a terminal quota for the session. This should trigger the model fallback flow.
    - If the quotaId contains PerMinute, PerSecond, or indicates a short-term limit, it's a transient error.
  - Check for RetryInfo:
    - Use the retryDelay. A short delay (e.g., < 5 minutes) confirms a transient error, and the system should wait for this duration and retry.
    - A long delay (e.g., hours) indicates a long-term lockout, which should also trigger the model fallback flow.
- Decision Logic:
  - Trigger Fallback IF: (QuotaFailure exists AND quotaId indicates a daily/long-term limit) OR (RetryInfo exists AND retryDelay is long, e.g., > 5 minutes).
  - Retry Silently IF: (QuotaFailure exists AND quotaId indicates a minute/short-term limit) OR (RetryInfo exists AND retryDelay is short).
- If no structured error is found, the system can revert to the existing (less reliable) backoff mechanism as a legacy fallback.
- This new logic should replace the current string-matching and consecutive error counting, and it should be applied to all auth types.
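The decision logic in the list above could be sketched as follows. The function and type names (classifyQuotaError, QuotaDecision, parseRetryDelayMs) are hypothetical, and the 5-minute threshold is the example value from the proposal. The proposal does not say what to do when the signals disagree, for example a PerDay quotaId combined with a short retryDelay, as in the sample error. This sketch assumes the fallback condition is checked first.

```typescript
// Hypothetical detail shape; field names follow the example payload in this issue.
interface ErrorDetail {
  "@type": string;
  [key: string]: unknown;
}

type QuotaDecision =
  | { action: "fallback" }
  | { action: "retry"; delayMs: number | undefined }
  | { action: "legacy" };

const QUOTA_FAILURE = "type.googleapis.com/google.rpc.QuotaFailure";
const RETRY_INFO = "type.googleapis.com/google.rpc.RetryInfo";
const LONG_DELAY_MS = 5 * 60 * 1000; // example threshold from the proposal

// Substring indicators named in the proposal; a real list may need more entries.
const LONG_TERM = /PerDay|Daily/i;
const SHORT_TERM = /PerMinute|PerSecond/i;

/** Converts a duration string such as "34s" or "34.5s" to milliseconds. */
function parseRetryDelayMs(value: unknown): number | undefined {
  if (typeof value !== "string") return undefined;
  const match = /^(\d+(?:\.\d+)?)s$/.exec(value);
  return match ? Math.round(Number(match[1]) * 1000) : undefined;
}

/** Collects every quotaId found in QuotaFailure violations. */
function collectQuotaIds(details: ErrorDetail[]): string[] {
  const ids: string[] = [];
  for (const detail of details) {
    if (detail["@type"] !== QUOTA_FAILURE) continue;
    const violations = detail["violations"];
    if (!Array.isArray(violations)) continue;
    for (const violation of violations) {
      const id: unknown = violation?.quotaId;
      if (typeof id === "string") ids.push(id);
    }
  }
  return ids;
}

function classifyQuotaError(details: ErrorDetail[]): QuotaDecision {
  const quotaIds = collectQuotaIds(details);
  const retryInfo = details.find((d) => d["@type"] === RETRY_INFO);
  const delayMs = retryInfo ? parseRetryDelayMs(retryInfo["retryDelay"]) : undefined;

  // Assumption: long-term signals win over short-term ones when both appear.
  const longTermQuota = quotaIds.some((id) => LONG_TERM.test(id));
  const longDelay = delayMs !== undefined && delayMs > LONG_DELAY_MS;
  if (longTermQuota || longDelay) return { action: "fallback" };

  const shortTermQuota = quotaIds.some((id) => SHORT_TERM.test(id));
  if (shortTermQuota || delayMs !== undefined) return { action: "retry", delayMs };

  // No usable structured signal: defer to the existing backoff mechanism.
  return { action: "legacy" };
}
```

With the sample error from this issue, the PerDay quotaId yields a fallback decision even though retryDelay is only 34s. Returning delayMs with the retry decision lets the caller wait for the API-specified duration instead of its own backoff interval.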
This approach will make the retry/fallback behavior much more accurate and improve the user experience by correctly interpreting the specific reason for the rate limit.