Auth failover: distinguish revoked tokens from transient auth errors #25689

@kevins88288

Problem

resolveFailoverReasonFromError() in src/agents/failover-error.ts maps both HTTP 401 and HTTP 403 to the "auth" failover reason, so both get the same exponential cooldown/retry behavior. As a result, requests made with a permanently revoked API key are retried indefinitely with backoff instead of failing fast.

Impact: When an API key is revoked (e.g., during key rotation), agents retry forever instead of surfacing a clear "token revoked" error and stopping. The exponential backoff delays error visibility and wastes resources.

Current behavior

// src/agents/failover-error.ts:160
if (status === 401 || status === 403) {
  return "auth";
}

Both status codes get the same FailoverReason, same retry curve, same cooldown.

Desired behavior

Distinguish between:

  1. Transient auth errors (401 with an expired token that can be refreshed): retry with backoff ✅
  2. Permanent auth errors (403 with a revoked key, or 401 with invalid credentials): fail immediately with an actionable error message
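
The split described above could look roughly like the following. This is a minimal sketch, not the code in failover-error.ts: the function name classifyAuthError, the "auth_permanent" reason, and the body markers are all assumptions for illustration, and real providers may signal revocation differently.

```typescript
// Hypothetical reasons: only "auth" exists today; "auth_permanent" is a proposal.
type AuthFailoverReason = "auth" | "auth_permanent";

// Assumed substrings that indicate a credential will never work again.
// Real provider error bodies may use different wording.
const PERMANENT_MARKERS: string[] = ["revoked", "invalid_api_key"];

// Returns null when the status is not an auth status at all.
function classifyAuthError(
  status: number,
  body: string = "",
): AuthFailoverReason | null {
  if (status !== 401 && status !== 403) {
    return null;
  }
  const text = body.toLowerCase();
  // An explicit provider signal wins regardless of the status code.
  if (PERMANENT_MARKERS.some((marker) => text.includes(marker))) {
    return "auth_permanent";
  }
  // Assumption: a 403 is treated as likely permanent, a bare 401 as refreshable.
  return status === 403 ? "auth_permanent" : "auth";
}
```

With this shape, a 401 whose body contains "invalid_api_key" fails fast, while a bare 401 keeps the existing retry-with-backoff path.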

Possible approaches

  • Add a new FailoverReason like "auth_permanent" for revoked/invalid credentials
  • Check response body for provider-specific "revoked" / "invalid_api_key" signals
  • Separate 401 (unauthorized — might be refreshable) from 403 (forbidden — likely permanent)
  • Add a max retry count for auth failures specifically
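
To show how the first and last approaches might combine, here is a hedged sketch of a retry policy. The names decideRetry, MAX_AUTH_RETRIES, and BASE_DELAY_MS are hypothetical, and the cap and delay values are placeholders rather than the project's actual settings.

```typescript
// "auth_permanent" is the proposed new reason; "other" stands in for
// every existing non-auth reason.
type FailoverReason = "auth" | "auth_permanent" | "other";

interface RetryDecision {
  retry: boolean;
  delayMs: number;
}

const MAX_AUTH_RETRIES = 3; // assumed cap, for illustration only
const BASE_DELAY_MS = 1000; // assumed base for exponential backoff

// `attempt` is the number of retries already made (0 for the first failure).
function decideRetry(reason: FailoverReason, attempt: number): RetryDecision {
  // Revoked or invalid credentials: stop immediately, never back off.
  if (reason === "auth_permanent") {
    return { retry: false, delayMs: 0 };
  }
  // Transient auth errors are retried, but only up to the cap.
  if (reason === "auth" && attempt >= MAX_AUTH_RETRIES) {
    return { retry: false, delayMs: 0 };
  }
  // Everything else keeps exponential backoff.
  return { retry: true, delayMs: BASE_DELAY_MS * Math.pow(2, attempt) };
}
```

The cap bounds the damage even when a permanent failure is misclassified as transient, which is why it is worth having alongside the new reason rather than instead of it.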

Context

This was discovered during an API key rotation on Feb 20, 2025. Revoking the old Anthropic key caused a cascade: 403 mid-tool-call → agents retrying forever → combined with other gaps (gateway crash on FailoverError, session corruption from synthetic tool results) → full gateway death.

The other two gaps are being addressed in separate PRs:

  • Gateway crash on FailoverError → catch FailoverError in unhandled rejection handler
  • Session corruption from synthetic tool results → skip flush when stream interrupted
