Description
Problem
resolveFailoverReasonFromError() in src/agents/failover-error.ts maps both HTTP 401 and 403 to "auth" with the same exponential cooldown/retry behavior. This means a permanently revoked API key retries indefinitely with backoff, instead of failing fast.
Impact: When an API key is revoked (e.g., during key rotation), agents retry forever instead of surfacing a clear "token revoked" error and stopping. The exponential backoff delays error visibility and wastes resources.
Current behavior
// src/agents/failover-error.ts:160
if (status === 401 || status === 403) {
return "auth";
}
Both status codes get the same FailoverReason, the same retry curve, and the same cooldown.
Desired behavior
Distinguish between:
- Transient auth errors (401 with expired token that can be refreshed) — retry with backoff ✅
- Permanent auth errors (403 with revoked key, or 401 with invalid credentials) — fail immediately with actionable error message
Possible approaches
- Add a new FailoverReason like "auth_permanent" for revoked/invalid credentials
- Check response body for provider-specific "revoked" / "invalid_api_key" signals
- Separate 401 (unauthorized — might be refreshable) from 403 (forbidden — likely permanent)
- Add a max retry count for auth failures specifically
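As a rough illustration, the approaches above could be combined along these lines. This is only a sketch, not the existing implementation: `AuthFailoverReason`, `classifyAuthFailure`, `shouldRetryAuth`, `PERMANENT_SIGNALS`, and `MAX_AUTH_RETRIES` are hypothetical names, the body markers are assumptions about what providers return, and treating every 403 as permanent is a simplification.

```typescript
// Hypothetical sketch; none of these names exist in src/agents/failover-error.ts.
type AuthFailoverReason = "auth" | "auth_permanent";

// Assumed provider-specific markers for revoked or invalid credentials.
const PERMANENT_SIGNALS: readonly string[] = ["revoked", "invalid_api_key"];

// Assumed cap on retries for transient auth failures.
const MAX_AUTH_RETRIES = 3;

// Returns null when the status is not an auth failure at all.
function classifyAuthFailure(status: number, body: string): AuthFailoverReason | null {
  if (status !== 401 && status !== 403) {
    return null;
  }
  const text = body.toLowerCase();
  const hasPermanentSignal = PERMANENT_SIGNALS.some((signal) => text.includes(signal));
  // Assumption: 403 is treated as permanent; 401 only when the body says so.
  if (status === 403 || hasPermanentSignal) {
    return "auth_permanent";
  }
  return "auth";
}

// Permanent failures never retry; transient ones retry up to the cap.
// `attempt` is the number of retries already made (0 before the first retry).
function shouldRetryAuth(reason: AuthFailoverReason, attempt: number): boolean {
  if (reason === "auth_permanent") {
    return false;
  }
  return attempt < MAX_AUTH_RETRIES;
}
```

With this shape, a revoked key stops on the first failure, while an expired token that might be refreshed still gets a bounded number of retries with backoff.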
Context
This was discovered during an API key rotation on Feb 20, 2025. Revoking the old Anthropic key caused a cascade: 403 mid-tool-call → agents retrying forever → combined with other gaps (gateway crash on FailoverError, session corruption from synthetic tool results) → full gateway death.
The other two gaps are being addressed in separate PRs:
- Gateway crash on FailoverError → catch FailoverError in unhandled rejection handler
- Session corruption from synthetic tool results → skip flush when stream interrupted