failover: rotate auth profiles within a candidate before falling through to next provider #79604

@wherewolf87

Labels: bug, enhancement
Repo: openclaw/openclaw

Summary

When a provider has multiple auth profiles configured in the credential pool (e.g. multiple ChatGPT OAuth profiles for openai-codex), the model-fallback loop only tries one profile per candidate before giving up on that provider and moving to the next fallback candidate (e.g. claude-cli, openrouter).

The remaining profiles in the same provider's pool are never tried within the same failover sequence. They only get exercised by subsequent user requests via fill_first selection (which works because the previously-used profile is now in cooldown). This means a single user request that hits a rate-limited profile drops straight to a different provider instead of trying the other available profiles of the same provider first.

Environment

  • openclaw 2026.5.7 (also reproducible on 2026.4.23)
  • Provider: openai-codex
  • Three OAuth profiles in credential_pool.openai-codex with credential_pool_strategies.openai-codex: fill_first
  • Fallback chain: openai-codex/gpt-5.5 → claude-cli/claude-opus-4-7 → openrouter/...

Reproduction

  1. Configure openai-codex with 3 OAuth profiles (priorities 0, 1, 2).
  2. Configure fallback_providers with claude-cli as the next candidate.
  3. Send a request while profile 0 is rate-limited (e.g. usage_limit_reached) but profiles 1 and 2 are healthy.
  4. Observe the journal: the request hits profile 0, gets 429 usage_limit_reached, and the failover decision jumps directly to claude-cli without trying profiles 1 or 2.

Real evidence (anonymized)

A single runId from a production journal:

runId=44aed4b8-506b-4aba-8a24-a59da47e3dbd
  18:29:37  embedded run agent end: error=usage_limit_reached  (retry 1)
  18:29:39  embedded run agent end: error=usage_limit_reached  (retry 2)
  18:29:44  embedded run agent end: error=usage_limit_reached  (retry 3)
  18:29:55  embedded run agent end: error=usage_limit_reached  (retry 4)
  18:29:57  auth profile failure state updated:
              profile=sha256:84fb7df94e09  ← SAME profile sha across all retries
  18:29:57  embedded run failover decision:
              from=openai-codex/gpt-5.5
              profile=sha256:84fb7df94e09
              decision=fallback_model
  18:29:57  model fallback decision:
              candidate=openai-codex/gpt-5.5
              next=claude-cli/claude-opus-4-7   ← jumped to next provider,
                                                  bypassing remaining profiles

The excerpt shows four retries, all on one profile sha, then next=claude-cli. Two other healthy openai-codex profiles in the pool were not attempted in this sequence.

Same pattern repeats on runId=7a37ed4d and runId=e63d5b16. Across hours of traffic, the journal's auth profile failure state updated line never references more than one profile sha within a single failover sequence.

Expected behavior

When a profile-level failure occurs (rate_limit, usage_limit_reached, auth_invalid), the loop should mark that profile cooled-down and retry the same candidate using the next available profile for that provider, before falling through to the next fallback candidate.

Order:

profile 0 fails → try profile 1 → try profile 2 → THEN claude-cli

instead of the current:

profile 0 fails → claude-cli
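
The two orderings can be written out as a small sketch. The Profile type, the attemptOrder helper, and the profile ids are hypothetical names used only for illustration, and the sketch assumes that fill_first means "lowest priority number first" and that every profile attempt ends in a profile-level failure:

```typescript
// Hypothetical shape for illustration; not taken from the openclaw codebase.
type Profile = { id: string; priority: number };

// Lists the targets one request would try, in order, assuming every
// profile attempt ends in a profile-level failure.
function attemptOrder(
  provider: string,
  profiles: Profile[],
  fallbacks: string[],
  rotateWithinCandidate: boolean,
): string[] {
  // Assumption: fill_first tries the lowest priority number first.
  const sorted = profiles.slice().sort((a, b) => a.priority - b.priority);
  // Current behavior tries only the first profile; expected tries them all.
  const tried = rotateWithinCandidate ? sorted : sorted.slice(0, 1);
  return tried.map((p) => `${provider}#${p.id}`).concat(fallbacks);
}

const examplePool: Profile[] = [
  { id: "profile-2", priority: 2 },
  { id: "profile-0", priority: 0 },
  { id: "profile-1", priority: 1 },
];

const currentOrder = attemptOrder("openai-codex", examplePool, ["claude-cli"], false);
const expectedOrder = attemptOrder("openai-codex", examplePool, ["claude-cli"], true);

console.log(currentOrder.join(" -> "));
console.log(expectedOrder.join(" -> "));
```

With rotation disabled the list has two entries (profile 0, then claude-cli). With rotation enabled all three profiles precede claude-cli.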

Why the existing fill_first rotation doesn't cover this

fill_first correctly rotates between requests (request N+1 picks the highest-priority non-cooled-down profile). But it does not rotate within a single failover sequence: runFallbackAttempt makes one call with one profile, and on profile-level failure the candidate-iteration loop advances to the next model candidate, not the next profile of the same model.

In practice, all profiles do eventually get exercised over time — but a single user-facing request still pays the latency hit of going to a slower / less-preferred provider when the user's preferred provider has healthy profiles available.
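
For reference, the between-requests rotation described above behaves roughly like the sketch below. The PooledProfile shape, the cooldownUntil field, and pickFillFirst are assumptions made for illustration; the real credential pool schema and selection code may differ:

```typescript
// Hypothetical shape; the real credential pool schema may differ.
type PooledProfile = { id: string; priority: number; cooldownUntil?: number };

// Pick the highest-priority profile (lowest number) that is not cooling down.
function pickFillFirst(pool: PooledProfile[], now: number): PooledProfile | undefined {
  const available = pool.filter(
    (p) => p.cooldownUntil === undefined || p.cooldownUntil <= now,
  );
  // filter() returned a fresh array, so sorting it leaves the pool untouched.
  available.sort((a, b) => a.priority - b.priority);
  return available.length > 0 ? available[0] : undefined;
}

const profile0: PooledProfile = { id: "profile-0", priority: 0 };
const codexPool: PooledProfile[] = [
  profile0,
  { id: "profile-1", priority: 1 },
  { id: "profile-2", priority: 2 },
];

// Request N: nothing is cooling down, so profile 0 is selected.
const requestN = pickFillFirst(codexPool, 0);

// Profile 0 hits usage_limit_reached and is given a 60 s cooldown (made-up value).
profile0.cooldownUntil = 60_000;

// Request N+1, one second later: profile 0 is skipped, profile 1 is selected.
const requestNPlus1 = pickFillFirst(codexPool, 1_000);

// Once the cooldown has elapsed, profile 0 is preferred again.
const afterCooldown = pickFillFirst(codexPool, 60_000);

console.log(requestN?.id, requestNPlus1?.id, afterCooldown?.id);
```

The selection only runs once per request here, which is why the request that triggered the cooldown never benefits from it.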

Suggested fix

In src/agents/model-fallback.ts, inside the candidate-iteration loop in the function that runs the failover sequence, insert a profile-rotation sub-loop between runFallbackAttempt and the existing fall-through:

let attemptRun;
let profileAttempt = 0;
const MAX_PROFILE_ROTATIONS = 3;

while (profileAttempt < MAX_PROFILE_ROTATIONS) {
  attemptRun = await runFallbackAttempt({ ...candidate, options: runOptions, /* ... */ });
  if ('success' in attemptRun) break;

  const errInfo = describeFailoverError(attemptRun.error);
  const isProfileLevelFailure =
    errInfo.reason === 'rate_limit' ||
    errInfo.providerErrorType === 'usage_limit_reached' ||
    errInfo.reason === 'auth_invalid';

  if (!isProfileLevelFailure || !authRuntime || !authStore) break;

  // The failed profile is already marked cooled-down by the existing
  // `auth profile failure state updated` emitter. Re-resolve order
  // and exclude any now-cooled-down profiles.
  const remainingProfiles = authRuntime
    .resolveAuthProfileOrder({ cfg: params.cfg, store: authStore, provider: candidate.provider })
    .filter(id => !authRuntime.isProfileInCooldown(authStore, id, undefined, candidate.model));

  if (remainingProfiles.length === 0) break;  // exhausted → fall through

  await observeDecision({
    decision: 'rotate_profile',
    runId: params.runId,
    sessionId: params.sessionId,
    lane: params.lane,
    candidate,
    attempt: i + 1,
    total: candidates.length,
    reason: errInfo.reason,
    nextProfileId: remainingProfiles[0],
    profileCount: remainingProfiles.length,
    isPrimary,
  });

  profileAttempt += 1;
  // Loop iterates: authRuntime fill_first will pick remainingProfiles[0].
}

if (attemptRun && 'success' in attemptRun) {
  // existing candidate_succeeded emit path
  return attemptRun.success;
}
// existing candidate_failed emit + continue to next candidate

The same shape likely applies in src/agents/pi-embedded-runner/run/assistant-failover.ts if it has the same single-profile-per-candidate pattern (the [agent/embedded] embedded run failover decision log lines suggest it does).

Behavioral guarantees

  • Only profile-level failures rotate. Network errors, 5xx, model-not-found, etc. fall through immediately as today.
  • MAX_PROFILE_ROTATIONS cap prevents pathological loops if the profile pool grows large or cooldown bookkeeping has bugs.
  • Existing fill_first / cooldown bookkeeping is reused — no schema changes.
  • New rotate_profile observability event makes the in-failover rotation visible in the journal.
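
The first two guarantees can be isolated into a pair of predicates. This is a sketch under assumptions: FailoverErrorInfo only mirrors the fields used in the snippet above (the real return type of describeFailoverError is not shown in this issue), shouldRotate is a hypothetical helper, and "server_error" is a made-up reason standing in for any non-profile failure:

```typescript
// Hypothetical error summary. The field names mirror the snippet in this
// issue; the real return type of describeFailoverError is not shown here.
type FailoverErrorInfo = {
  reason?: string;
  providerErrorType?: string;
};

// True only for failures tied to one credential rather than to the provider.
function isProfileLevelFailure(info: FailoverErrorInfo): boolean {
  return (
    info.reason === "rate_limit" ||
    info.reason === "auth_invalid" ||
    info.providerErrorType === "usage_limit_reached"
  );
}

// Rotate only while the failure is profile-level, the cap has not been
// reached, and at least one profile is still outside cooldown.
function shouldRotate(
  info: FailoverErrorInfo,
  rotationsSoFar: number,
  remainingProfiles: number,
  maxRotations: number = 3,
): boolean {
  return (
    isProfileLevelFailure(info) &&
    rotationsSoFar < maxRotations &&
    remainingProfiles > 0
  );
}

const usageLimit: FailoverErrorInfo = { providerErrorType: "usage_limit_reached" };
// "server_error" is a placeholder reason for a non-profile failure.
const serverError: FailoverErrorInfo = { reason: "server_error" };

console.log(shouldRotate(usageLimit, 0, 2)); // true: healthy profiles remain
console.log(shouldRotate(serverError, 0, 2)); // false: not a profile-level failure
console.log(shouldRotate(usageLimit, 3, 2)); // false: rotation cap reached
console.log(shouldRotate(usageLimit, 1, 0)); // false: pool exhausted
```

Keeping the classification in one predicate would also give the tests a single place to pin down which error kinds are allowed to rotate.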

Suggested tests

Extend src/agents/model-fallback.test.ts (and the e2e variant) with:

  1. Rotate-then-succeed: 3 profiles configured; profile 0 returns usage_limit_reached; profile 1 succeeds. Assert the success comes from profile 1, the fallback candidate (claude-cli) was never invoked, and a rotate_profile decision was observed.
  2. Rotate-exhausted-then-fallback: all profiles return usage_limit_reached; assert the fallback candidate is invoked after the rotation cap, and MAX_PROFILE_ROTATIONS is honored.
  3. No rotation on non-profile-level error: profile 0 returns a generic 5xx; assert no rotation occurs and the loop falls through to the next candidate immediately.
  4. fill_first regression: verify the between-requests rotation path (current behavior) still works after a profile is cooled down by an earlier sequence.
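
As a rough illustration of cases 1 to 3, the sketch below runs a simplified stand-in for the rotation loop against stubbed profiles. Nothing here is the real runner: runSequence, StubProfile, and Trace are hypothetical, the fallback candidate is assumed to succeed, and an actual test would drive the code in src/agents/model-fallback.ts with a mocked auth runtime and store:

```typescript
// Stand-ins only; the real test would exercise the actual fallback runner.
type Outcome = "ok" | "usage_limit_reached" | "server_error";
type StubProfile = { id: string; outcome: Outcome };
type Trace = { winner: string; calls: string[]; decisions: string[] };

const MAX_PROFILE_ROTATIONS = 3;

// Simplified model of the proposed loop. Profiles are given in priority order.
function runSequence(profiles: StubProfile[], fallbackCandidate: string): Trace {
  const calls: string[] = [];
  const decisions: string[] = [];
  const cooledDown: string[] = [];
  let rotations = 0;

  while (rotations < MAX_PROFILE_ROTATIONS) {
    const available = profiles.filter((p) => cooledDown.indexOf(p.id) === -1);
    if (available.length === 0) break;
    const current = available[0];
    calls.push(current.id);
    if (current.outcome === "ok") {
      return { winner: current.id, calls, decisions };
    }
    if (current.outcome !== "usage_limit_reached") break; // not profile-level
    cooledDown.push(current.id);
    if (available.length === 1) break; // pool exhausted, fall through
    decisions.push("rotate_profile");
    rotations += 1;
  }

  // Assumption: the fallback candidate succeeds once it is reached.
  decisions.push("fallback_model");
  calls.push(fallbackCandidate);
  return { winner: fallbackCandidate, calls, decisions };
}

// Case 1: rotate-then-succeed.
const rotateThenSucceed = runSequence(
  [
    { id: "p0", outcome: "usage_limit_reached" },
    { id: "p1", outcome: "ok" },
    { id: "p2", outcome: "ok" },
  ],
  "claude-cli",
);

// Case 2: every profile is rate-limited, so the fallback runs last.
const exhausted = runSequence(
  [
    { id: "p0", outcome: "usage_limit_reached" },
    { id: "p1", outcome: "usage_limit_reached" },
    { id: "p2", outcome: "usage_limit_reached" },
  ],
  "claude-cli",
);

// Case 3: a generic server error falls through without rotating.
const noRotation = runSequence(
  [
    { id: "p0", outcome: "server_error" },
    { id: "p1", outcome: "ok" },
  ],
  "claude-cli",
);

console.log(rotateThenSucceed.calls, exhausted.calls, noRotation.calls);
```

Recording both the call order and the emitted decisions lets each case assert on what was tried as well as on what the journal would show.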

Impact

  • Reduces user-perceived fallback latency when preferred provider has multiple healthy profiles.
  • Maximizes utilization of paid auth pools (Pro/Team plan profiles) before paying the cross-provider context-loss tax.
  • No-op for installations that have only one profile per provider.

Filed by

OpenClaw operator instance, with corroborating journal evidence from a Hermes deployment running openclaw 2026.5.7 against three openai-codex OAuth profiles (one Pro plan, two Team plan).
