# failover: rotate auth profiles within a candidate before falling through to next provider

**Labels:** `bug`, `enhancement`
**Repo:** openclaw/openclaw

## Summary

When a provider has multiple auth profiles configured in the credential pool (e.g. multiple ChatGPT OAuth profiles for `openai-codex`), the model-fallback loop only tries **one** profile per candidate before giving up on that provider and moving to the next fallback candidate (e.g. `claude-cli`, `openrouter`).

The remaining profiles in the same provider's pool are never tried within the same failover sequence. They only get exercised by *subsequent* user requests via `fill_first` selection (which works because the previously-used profile is now in cooldown). This means a single user request that hits a rate-limited profile drops straight to a different provider instead of trying the other available profiles of the same provider first.

## Environment

- `openclaw` 2026.5.7 (also reproducible on 2026.4.23)
- Provider: `openai-codex`
- Three OAuth profiles in `credential_pool.openai-codex` with `credential_pool_strategies.openai-codex: fill_first`
- Fallback chain: `openai-codex/gpt-5.5` → `claude-cli/claude-opus-4-7` → `openrouter/...`

## Reproduction

1. Configure `openai-codex` with 3 OAuth profiles (priorities 0, 1, 2).
2. Configure `fallback_providers` with `claude-cli` as the next candidate.
3. Send a request while profile 0 is rate-limited (e.g. `usage_limit_reached`) but profiles 1 and 2 are healthy.
4. Observe the journal: the request hits profile 0, gets `429 usage_limit_reached`, and the failover decision jumps directly to `claude-cli` without trying profiles 1 or 2.

### Real evidence (anonymized)

A single `runId` from a production journal:

```
runId=44aed4b8-506b-4aba-8a24-a59da47e3dbd
  18:29:37  embedded run agent end: error=usage_limit_reached  (retry 1)
  18:29:39  embedded run agent end: error=usage_limit_reached  (retry 2)
  18:29:44  embedded run agent end: error=usage_limit_reached  (retry 3)
  18:29:55  embedded run agent end: error=usage_limit_reached  (retry 4)
  18:29:57  auth profile failure state updated:
              profile=sha256:84fb7df94e09  ← SAME profile sha across all retries
  18:29:57  embedded run failover decision:
              from=openai-codex/gpt-5.5
              profile=sha256:84fb7df94e09
              decision=fallback_model
  18:29:57  model fallback decision:
              candidate=openai-codex/gpt-5.5
              next=claude-cli/claude-opus-4-7   ← jumped to next provider,
                                                  bypassing remaining profiles
```

4 retries, **one** profile sha, then `next=claude-cli`. Two other healthy `openai-codex` profiles in the pool were not attempted in this sequence.

Same pattern repeats on `runId=7a37ed4d` and `runId=e63d5b16`. Across hours of traffic, the journal's `auth profile failure state updated` line never references more than one profile sha within a single failover sequence.
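The "one profile sha per sequence" observation can be checked mechanically. The following is a minimal sketch, assuming the journal has the layout of the excerpt above (a `runId=<id>` line followed by entries that may contain `profile=sha256:<hex>`). `profilesPerRun` is a hypothetical helper, not part of openclaw, and a real journal format may need different patterns.

```typescript
// Hypothetical helper: groups the distinct profile shas seen under each runId.
// Assumes the layout of the excerpt above; adjust the patterns for a real journal.
function profilesPerRun(lines: string[]): Map<string, Set<string>> {
  const seen = new Map<string, Set<string>>();
  let currentRun: string | undefined;
  for (const line of lines) {
    const run = line.match(/runId=([0-9a-f-]+)/)?.[1];
    if (run) {
      currentRun = run;
      if (!seen.has(run)) seen.set(run, new Set());
    }
    const profile = line.match(/profile=(sha256:[0-9a-f]+)/)?.[1];
    if (profile && currentRun) seen.get(currentRun)?.add(profile);
  }
  return seen;
}

const excerpt = [
  "runId=44aed4b8-506b-4aba-8a24-a59da47e3dbd",
  "  18:29:57  auth profile failure state updated:",
  "              profile=sha256:84fb7df94e09",
  "  18:29:57  embedded run failover decision:",
  "              from=openai-codex/gpt-5.5",
  "              profile=sha256:84fb7df94e09",
];
const distinctProfiles = profilesPerRun(excerpt)
  .get("44aed4b8-506b-4aba-8a24-a59da47e3dbd")?.size;
console.log(distinctProfiles); // 1
```

A count above 1 for any `runId` would indicate that in-sequence rotation happened; in the evidence above it never does.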

## Expected behavior

When a profile-level failure occurs (`rate_limit`, `usage_limit_reached`, `auth_invalid`), the loop should mark that profile cooled-down and **retry the same candidate using the next available profile** for that provider, before falling through to the next fallback candidate.

Order:

```
profile 0 fails → try profile 1 → try profile 2 → THEN claude-cli
```

instead of the current:

```
profile 0 fails → claude-cli
```

## Why the existing `fill_first` rotation doesn't cover this

`fill_first` correctly rotates **between** requests (request N+1 picks the highest-priority non-cooled-down profile). But it does not rotate **within** a single failover sequence: `runFallbackAttempt` makes one call with one profile, and on profile-level failure the candidate-iteration loop advances to the next *model* candidate, not the next *profile* of the same model.

In practice, all profiles do eventually get exercised over time — but a single user-facing request still pays the latency hit of going to a slower / less-preferred provider when the user's preferred provider has healthy profiles available.
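To make the between-requests behavior concrete, here is a minimal sketch of `fill_first` selection as described above. The `Profile` shape, the `cooldownUntil` field, and `pickFillFirst` are assumptions for illustration, not the actual openclaw types; it also assumes that a lower priority number is tried first, as in the reproduction.

```typescript
// Hypothetical shapes for illustration; not the actual openclaw types.
interface Profile {
  id: string;
  priority: number;      // assumption: lower number is tried first
  cooldownUntil: number; // epoch ms; 0 means not cooled down
}

// fill_first as described above: highest-priority profile not in cooldown.
function pickFillFirst(pool: Profile[], now: number): Profile | undefined {
  return [...pool]
    .sort((a, b) => a.priority - b.priority)
    .find((p) => p.cooldownUntil <= now);
}

const p0: Profile = { id: "p0", priority: 0, cooldownUntil: 0 };
const p1: Profile = { id: "p1", priority: 1, cooldownUntil: 0 };
const p2: Profile = { id: "p2", priority: 2, cooldownUntil: 0 };
const pool = [p0, p1, p2];

const now = 1_000;
const firstPick = pickFillFirst(pool, now); // p0
// Request N hits usage_limit_reached on p0, which puts p0 in cooldown...
p0.cooldownUntil = now + 60_000;
// ...but the switch to p1 only takes effect when request N+1 selects a profile.
const secondPick = pickFillFirst(pool, now + 1); // p1
```

The selection itself is correct; the gap is that nothing calls it a second time within request N.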

## Suggested fix

In `src/agents/model-fallback.ts`, inside the candidate-iteration loop in the function that runs the failover sequence, insert a profile-rotation sub-loop between `runFallbackAttempt` and the existing fall-through:

```ts
let attemptRun;
let profileAttempt = 0;
const MAX_PROFILE_ROTATIONS = 3;

while (profileAttempt < MAX_PROFILE_ROTATIONS) {
  attemptRun = await runFallbackAttempt({ ...candidate, options: runOptions, ... });
  if ('success' in attemptRun) break;

  const errInfo = describeFailoverError(attemptRun.error);
  const isProfileLevelFailure =
    errInfo.reason === 'rate_limit' ||
    errInfo.providerErrorType === 'usage_limit_reached' ||
    errInfo.reason === 'auth_invalid';

  if (!isProfileLevelFailure || !authRuntime || !authStore) break;

  // The failed profile is already marked cooled-down by the existing
  // `auth profile failure state updated` emitter. Re-resolve order
  // and exclude any now-cooled-down profiles.
  const remainingProfiles = authRuntime
    .resolveAuthProfileOrder({ cfg: params.cfg, store: authStore, provider: candidate.provider })
    .filter(id => !authRuntime.isProfileInCooldown(authStore, id, void 0, candidate.model));

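  if (remainingProfiles.length === 0) break;  // exhausted → fall through
  if (profileAttempt + 1 >= MAX_PROFILE_ROTATIONS) break;  // cap reached → fall through without logging a rotation that will not run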

  await observeDecision({
    decision: 'rotate_profile',
    runId: params.runId,
    sessionId: params.sessionId,
    lane: params.lane,
    candidate,
    attempt: i + 1,
    total: candidates.length,
    reason: errInfo.reason,
    nextProfileId: remainingProfiles[0],
    profileCount: remainingProfiles.length,
    isPrimary,
  });

  profileAttempt += 1;
  // Loop iterates: authRuntime fill_first will pick remainingProfiles[0].
}

if (attemptRun && 'success' in attemptRun) {
  // existing candidate_succeeded emit path
  return attemptRun.success;
}
// existing candidate_failed emit + continue to next candidate
```

The same shape likely applies in `src/agents/pi-embedded-runner/run/assistant-failover.ts` if it has the same single-profile-per-candidate pattern (the `[agent/embedded] embedded run failover decision` log lines suggest it does).
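The control flow of the proposed sub-loop can be exercised in isolation. The sketch below is a self-contained simulation, not openclaw code: `AttemptResult`, `SimProfile`, and `runCandidate` are illustrative stand-ins, the attempt callback is synchronous for brevity (the real `runFallbackAttempt` is awaited), and setting `cooledDown` stands in for the existing cooldown bookkeeping.

```typescript
// All names below are illustrative stand-ins, not openclaw APIs.
type AttemptResult =
  | { success: string }
  | { error: { profileLevel: boolean } };

interface SimProfile {
  id: string;
  cooledDown: boolean;
}

// Tries healthy profiles of one candidate in order, up to `maxAttempts`.
// Returns the final result and the profile ids that were actually tried.
function runCandidate(
  profiles: SimProfile[],
  attempt: (profileId: string) => AttemptResult,
  maxAttempts = 3,
): { result: AttemptResult | undefined; tried: string[] } {
  const tried: string[] = [];
  let result: AttemptResult | undefined;
  for (let n = 0; n < maxAttempts; n++) {
    const next = profiles.find((p) => !p.cooledDown);
    if (!next) break; // pool exhausted: fall through to the next candidate
    tried.push(next.id);
    result = attempt(next.id);
    if ("success" in result) break;
    if (!result.error.profileLevel) break; // e.g. 5xx: no rotation
    next.cooledDown = true; // stands in for the existing cooldown bookkeeping
  }
  return { result, tried };
}

const simPool: SimProfile[] = ["p0", "p1", "p2"].map((id) => ({ id, cooledDown: false }));
const outcome = runCandidate(simPool, (id) =>
  id === "p1" ? { success: "reply from p1" } : { error: { profileLevel: true } },
);
console.log(outcome.tried); // ["p0", "p1"]
```

Only when `runCandidate` returns without a success would the outer loop advance to the next fallback candidate.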

## Behavioral guarantees

- Only profile-level failures rotate. Network errors, 5xx, model-not-found, etc. fall through immediately as today.
- `MAX_PROFILE_ROTATIONS` cap prevents pathological loops if the profile pool grows large or cooldown bookkeeping has bugs.
- Existing `fill_first` / cooldown bookkeeping is reused — no schema changes.
- New `rotate_profile` observability event makes the in-failover rotation visible in the journal.
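The first guarantee depends on a narrow classifier. As a sketch, the check from the suggested fix can be pulled into a standalone predicate. `FailoverErrorInfo` is a hypothetical shape that mirrors only the two fields the snippet reads from `describeFailoverError`, and `server_error` is a placeholder reason name.

```typescript
// Hypothetical error shape, mirroring the two fields read in the suggested fix.
interface FailoverErrorInfo {
  reason: string;
  providerErrorType?: string;
}

// True only for failures tied to a single auth profile; anything else
// (network errors, 5xx, model-not-found) falls through as today.
function isProfileLevelFailure(info: FailoverErrorInfo): boolean {
  return (
    info.reason === "rate_limit" ||
    info.reason === "auth_invalid" ||
    info.providerErrorType === "usage_limit_reached"
  );
}

const rotates = isProfileLevelFailure({ reason: "rate_limit" });       // true
const fallsThrough = isProfileLevelFailure({ reason: "server_error" }); // false
```

Keeping the predicate separate makes it easy to unit-test and to extend if new profile-scoped error types appear.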

## Suggested tests

Extend `src/agents/model-fallback.test.ts` (and the e2e variant) with:

1. **Rotate-then-succeed**: 3 profiles configured; profile 0 returns `usage_limit_reached`; profile 1 succeeds. Assert the success comes from profile 1, the fallback candidate (`claude-cli`) was never invoked, and a `rotate_profile` decision was observed.
2. **Rotate-exhausted-then-fallback**: all profiles return `usage_limit_reached`; assert the fallback candidate is invoked after the rotation cap, and `MAX_PROFILE_ROTATIONS` is honored.
3. **No rotation on non-profile-level error**: profile 0 returns a generic 5xx; assert no rotation occurs and the loop falls through to the next candidate immediately.
4. **fill_first regression**: verify the between-requests rotation path (current behavior) still works after a profile is cooled down by an earlier sequence.
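Tests 1 to 3 all need to assert which profiles and candidates were invoked. One way to support that is a scripted test double that records call order. The sketch below is a hypothetical fixture, not an existing helper in the repo; target names such as `openai-codex#0` are made up for illustration, and the calls are driven by hand in place of the real loop.

```typescript
type Outcome = "ok" | "usage_limit_reached" | "server_error";

// Hypothetical test double: returns a scripted outcome per target and
// records the call order so tests can assert what was (or was not) invoked.
function makeScriptedRunner(script: Record<string, Outcome>) {
  const calls: string[] = [];
  const run = (target: string): Outcome => {
    calls.push(target);
    return script[target] ?? "server_error";
  };
  return { run, calls };
}

// Scenario 1 (rotate-then-succeed), driven by hand here in place of the real loop.
const runner = makeScriptedRunner({
  "openai-codex#0": "usage_limit_reached",
  "openai-codex#1": "ok",
  "claude-cli": "ok",
});
const first = runner.run("openai-codex#0");
const second = runner.run("openai-codex#1");

console.log(runner.calls.includes("claude-cli")); // false
```

In a real test, the loop under test would call `runner.run`, and the assertions would be made on `runner.calls` afterwards.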

## Impact

- Reduces user-perceived fallback latency when the preferred provider has multiple healthy profiles.
- Maximizes utilization of paid auth pools (Pro/Team plan profiles) before paying the cross-provider context-loss tax.
- No-op for installations that have only one profile per provider.

## Filed by

OpenClaw operator instance, with corroborating journal evidence from a Hermes deployment running `openclaw 2026.5.7` against three `openai-codex` OAuth profiles (one Pro plan, two Team plan).


failover: rotate auth profiles within a candidate before falling through to next provider #79604
