Skip to content

Add retry logic to OAuth token refresh #8673

@skyblue-will

Description

@skyblue-will

Problem

When the OAuth token refresh API call fails transiently (network blip, API timeout), the gateway immediately throws an error without retrying. This can cause agents to fail even when the underlying issue is temporary.

Observed behavior

2026-02-04T08:14:05.225Z [agents/auth-profiles] wrote refreshed credentials to claude cli file
2026-02-04T08:14:11.573Z [diagnostic] lane task error: lane=main durationMs=183 error="Error: OAuth token refresh failed for anthropic: Failed to refresh OAuth token for anthropic. Please try again or re-authenticate."

The token wasn't expired (8hrs remaining) - the refresh API call itself failed transiently. A gateway restart fixed it immediately.

Root Cause

In src/agents/auth-profiles/oauth.ts (~line 60-73), the getOAuthApiKey() call has no retry logic:

const result = await getOAuthApiKey(oauthProvider, oauthCreds);
if (!result) {
  return null;
}

The code has good fallback logic after failure (check if another process refreshed, try main agent credentials), but no retry on the actual API call.

Proposed Fix

Add a withRetry() helper with exponential backoff:

const MAX_REFRESH_RETRIES = 3;
const RETRY_DELAY_MS = 1000;

async function withRetry<T>(
  fn: () => Promise<T>,
  retries: number = MAX_REFRESH_RETRIES,
  delayMs: number = RETRY_DELAY_MS,
): Promise<T> {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === retries) throw error;
      log.warn(\`OAuth refresh attempt ${attempt}/${retries} failed, retrying in ${delayMs}ms\`, {
        error: error instanceof Error ? error.message : String(error),
      });
      await new Promise((resolve) => setTimeout(resolve, delayMs * attempt));
    }
  }
  throw new Error("Unexpected: retry loop exited without return or throw");
}

Then wrap the call:

return await withRetry(() => getOAuthApiKey(oauthProvider, oauthCreds));

This handles transient failures without masking actual auth errors.

Environment

  • Clawdbot version: 2026.1.24-1
  • Running on Ubuntu VPS
  • Using Claude CLI OAuth credentials

Metadata

Metadata

Assignees

No one assigned

    Labels

    P2Normal backlog priority with limited blast radius.clawsweeper:needs-maintainer-reviewClawSweeper marked this issue as needing maintainer review before automation.clawsweeper:needs-product-decisionClawSweeper marked this issue as needing a product or behavior decision.clawsweeper:no-new-fix-prClawSweeper does not recommend queueing a new automated fix PR for this issue.clawsweeper:source-reproClawSweeper found a high-confidence source-level issue reproduction.enhancementNew feature or requestimpact:auth-providerAuth, provider routing, model choice, or SecretRef resolution may break.issue-rating: 🦞 diamond lobsterVery strong issue quality with high-confidence source-level or clear reproduction.

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions