failover: rotate auth profiles within a candidate before falling through to next provider
Labels: bug, enhancement
Repo: openclaw/openclaw
Summary
When a provider has multiple auth profiles configured in the credential pool (e.g. multiple ChatGPT OAuth profiles for openai-codex), the model-fallback loop only tries one profile per candidate before giving up on that provider and moving to the next fallback candidate (e.g. claude-cli, openrouter).
The remaining profiles in the same provider's pool are never tried within the same failover sequence. They only get exercised by subsequent user requests via fill_first selection (which works because the previously-used profile is now in cooldown). This means a single user request that hits a rate-limited profile drops straight to a different provider instead of trying the other available profiles of the same provider first.
Environment
- openclaw 2026.5.7 (also reproducible on 2026.4.23)
- Provider: openai-codex
- Three OAuth profiles in credential_pool.openai-codex with credential_pool_strategies.openai-codex: fill_first
- Fallback chain: openai-codex/gpt-5.5 → claude-cli/claude-opus-4-7 → openrouter/...
Reproduction
- Configure openai-codex with 3 OAuth profiles (priorities 0, 1, 2).
- Configure fallback_providers with claude-cli as the next candidate.
- Send a request while profile 0 is rate-limited (e.g. usage_limit_reached) but profiles 1 and 2 are healthy.
- Observe the journal: the request hits profile 0, gets 429 usage_limit_reached, and the failover decision jumps directly to claude-cli without trying profiles 1 or 2.
Real evidence (anonymized)
A single runId from a production journal:
runId=44aed4b8-506b-4aba-8a24-a59da47e3dbd
18:29:37 embedded run agent end: error=usage_limit_reached (retry 1)
18:29:39 embedded run agent end: error=usage_limit_reached (retry 2)
18:29:44 embedded run agent end: error=usage_limit_reached (retry 3)
18:29:55 embedded run agent end: error=usage_limit_reached (retry 4)
18:29:57 auth profile failure state updated:
         profile=sha256:84fb7df94e09   ← SAME profile sha across all retries
18:29:57 embedded run failover decision:
         from=openai-codex/gpt-5.5
         profile=sha256:84fb7df94e09
         decision=fallback_model
18:29:57 model fallback decision:
         candidate=openai-codex/gpt-5.5
         next=claude-cli/claude-opus-4-7   ← jumped to next provider, bypassing remaining profiles
The excerpt shows four retries against one profile sha, then next=claude-cli. Two other healthy openai-codex profiles in the pool were not attempted in this sequence.
Same pattern repeats on runId=7a37ed4d and runId=e63d5b16. Across hours of traffic, the journal's auth profile failure state updated line never references more than one profile sha within a single failover sequence.
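To make the "one profile sha per sequence" check repeatable, here is a minimal sketch that collects the distinct profile shas from a journal excerpt already filtered to a single runId. The helper name distinctProfiles is hypothetical, and only the profile=sha256:... token shape is taken from the excerpt above; the rest of the journal format is an assumption.

```typescript
// Hypothetical helper, not part of openclaw. It only relies on the
// `profile=sha256:<hex>` token that appears in the excerpt above.
const PROFILE_TOKEN = /profile=(sha256:[0-9a-f]+)/;

// Returns the distinct profile shas, in first-seen order, from lines that
// have already been filtered down to one runId.
function distinctProfiles(lines: string[]): string[] {
  const seen: string[] = [];
  for (const line of lines) {
    const match = PROFILE_TOKEN.exec(line);
    const sha = match ? match[1] : undefined;
    if (sha !== undefined && seen.indexOf(sha) === -1) {
      seen.push(sha);
    }
  }
  return seen;
}

const journalLines: string[] = [
  "18:29:57 auth profile failure state updated:",
  "profile=sha256:84fb7df94e09",
  "18:29:57 embedded run failover decision:",
  "from=openai-codex/gpt-5.5",
  "profile=sha256:84fb7df94e09",
  "decision=fallback_model",
];

// A single entry means no in-sequence rotation happened for this run.
const profilesInSequence = distinctProfiles(journalLines);
```

With in-failover rotation in place, a run that rotates would be expected to yield more than one entry here.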
Expected behavior
When a profile-level failure occurs (rate_limit, usage_limit_reached, auth_invalid), the loop should mark that profile cooled-down and retry the same candidate using the next available profile for that provider, before falling through to the next fallback candidate.
Order:
profile 0 fails → try profile 1 → try profile 2 → THEN claude-cli
instead of the current:
profile 0 fails → claude-cli
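The expected order can be sketched as a pure function that expands each candidate into one attempt per profile. The Candidate shape, the plannedAttempts name, and the provider#profile label format are all hypothetical and exist only to illustrate the ordering; they are not openclaw types.

```typescript
// Hypothetical shape for illustration; not openclaw's real candidate type.
interface Candidate {
  provider: string;
  profiles: string[]; // assumed to be sorted by priority, highest first
}

// Lists every profile of a provider before the next provider appears.
function plannedAttempts(chain: Candidate[]): string[] {
  const order: string[] = [];
  for (const candidate of chain) {
    if (candidate.profiles.length === 0) {
      order.push(candidate.provider); // provider without pooled profiles
      continue;
    }
    for (const profile of candidate.profiles) {
      order.push(candidate.provider + "#" + profile);
    }
  }
  return order;
}

const expectedOrder = plannedAttempts([
  { provider: "openai-codex", profiles: ["profile-0", "profile-1", "profile-2"] },
  { provider: "claude-cli", profiles: [] },
]);
```

In a real sequence each later entry would only be reached after a profile-level failure on the entry before it.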
Why the existing fill_first rotation doesn't cover this
fill_first correctly rotates between requests (request N+1 picks the highest-priority non-cooled-down profile). But it does not rotate within a single failover sequence: runFallbackAttempt makes one call with one profile, and on profile-level failure the candidate-iteration loop advances to the next model candidate, not the next profile of the same model.
In practice, all profiles do eventually get exercised over time — but a single user-facing request still pays the latency hit of going to a slower / less-preferred provider when the user's preferred provider has healthy profiles available.
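For reference, the between-requests behavior described above can be modeled roughly as follows. This is a sketch under assumptions: the PooledProfile fields and the pickFillFirst name are hypothetical, and it assumes a lower priority number means a more preferred profile, which matches profile 0 being hit first in the reproduction.

```typescript
// Hypothetical model of a pooled profile; field names are assumptions.
interface PooledProfile {
  id: string;
  priority: number;      // assumed: lower number = more preferred
  cooldownUntil: number; // epoch ms; 0 when the profile is not cooling down
}

// fill_first as described in this issue: the highest-priority profile
// that is not currently in cooldown, or undefined if none is available.
function pickFillFirst(pool: PooledProfile[], now: number): PooledProfile | undefined {
  const available = pool
    .filter((p) => p.cooldownUntil <= now) // filter copies, so sort is safe
    .sort((a, b) => a.priority - b.priority);
  return available[0];
}

const profilePool: PooledProfile[] = [
  { id: "profile-0", priority: 0, cooldownUntil: 5000 }, // cooled down by an earlier failure
  { id: "profile-1", priority: 1, cooldownUntil: 0 },
  { id: "profile-2", priority: 2, cooldownUntil: 0 },
];

// The next request (at t=1000) skips profile-0 and lands on profile-1.
const nextRequestPick = pickFillFirst(profilePool, 1000);
```

The selection itself already does the right thing; the gap is that it is only consulted once per candidate within a failover sequence.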
Suggested fix
In src/agents/model-fallback.ts, inside the candidate-iteration loop in the function that runs the failover sequence, insert a profile-rotation sub-loop between runFallbackAttempt and the existing fall-through:
let attemptRun;
let profileAttempt = 0;
// Caps attempts per candidate (the first attempt plus any rotations).
const MAX_PROFILE_ROTATIONS = 3;
while (profileAttempt < MAX_PROFILE_ROTATIONS) {
  attemptRun = await runFallbackAttempt({ ...candidate, options: runOptions, ... });
  if ('success' in attemptRun) break;

  const errInfo = describeFailoverError(attemptRun.error);
  const isProfileLevelFailure =
    errInfo.reason === 'rate_limit' ||
    errInfo.providerErrorType === 'usage_limit_reached' ||
    errInfo.reason === 'auth_invalid';
  if (!isProfileLevelFailure || !authRuntime || !authStore) break;

  // The failed profile is already marked cooled-down by the existing
  // `auth profile failure state updated` emitter. Re-resolve order
  // and exclude any now-cooled-down profiles.
  const remainingProfiles = authRuntime
    .resolveAuthProfileOrder({ cfg: params.cfg, store: authStore, provider: candidate.provider })
    .filter(id => !authRuntime.isProfileInCooldown(authStore, id, void 0, candidate.model));
  if (remainingProfiles.length === 0) break; // exhausted → fall through
  // Cap reached: stop here so no rotate_profile event is emitted for an
  // attempt that would never run.
  if (profileAttempt + 1 >= MAX_PROFILE_ROTATIONS) break;

  await observeDecision({
    decision: 'rotate_profile',
    runId: params.runId,
    sessionId: params.sessionId,
    lane: params.lane,
    candidate,
    attempt: i + 1,
    total: candidates.length,
    reason: errInfo.reason,
    nextProfileId: remainingProfiles[0],
    profileCount: remainingProfiles.length,
    isPrimary,
  });
  profileAttempt += 1;
  // Loop iterates: authRuntime fill_first will pick remainingProfiles[0].
}
if (attemptRun && 'success' in attemptRun) {
  // existing candidate_succeeded emit path
  return attemptRun.success;
}
// existing candidate_failed emit + continue to next candidate
The same shape likely applies in src/agents/pi-embedded-runner/run/assistant-failover.ts if it has the same single-profile-per-candidate pattern (the [agent/embedded] embedded run failover decision log lines suggest it does).
Behavioral guarantees
- Only profile-level failures rotate. Network errors, 5xx, model-not-found, etc. fall through immediately as today.
- MAX_PROFILE_ROTATIONS cap prevents pathological loops if the profile pool grows large or cooldown bookkeeping has bugs.
- Existing fill_first / cooldown bookkeeping is reused; no schema changes.
- New rotate_profile observability event makes the in-failover rotation visible in the journal.
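The first two guarantees can be condensed into a single decision function. This is a sketch only: the FailoverErrorInfo fields mirror what the snippet in "Suggested fix" reads from describeFailoverError, but the real return type is not shown in this issue, and shouldRotate, RotationInput, and the server_error reason used in the example are hypothetical.

```typescript
// Assumed shape; only the two fields used in "Suggested fix" are modeled.
interface FailoverErrorInfo {
  reason?: string;
  providerErrorType?: string;
}

const MAX_PROFILE_ROTATIONS = 3; // caps attempts per candidate

function isProfileLevelFailure(info: FailoverErrorInfo): boolean {
  return (
    info.reason === "rate_limit" ||
    info.reason === "auth_invalid" ||
    info.providerErrorType === "usage_limit_reached"
  );
}

interface RotationInput {
  info: FailoverErrorInfo;
  attemptsSoFar: number;       // attempts already made on this candidate
  remainingProfiles: string[]; // profiles not in cooldown
}

// Rotate only for profile-level failures, below the cap, with a profile left.
function shouldRotate(input: RotationInput): boolean {
  if (!isProfileLevelFailure(input.info)) return false;
  if (input.attemptsSoFar >= MAX_PROFILE_ROTATIONS) return false;
  return input.remainingProfiles.length > 0;
}

const rotateOnUsageLimit = shouldRotate({
  info: { providerErrorType: "usage_limit_reached" },
  attemptsSoFar: 1,
  remainingProfiles: ["profile-1", "profile-2"],
});

const rotateOnServerError = shouldRotate({
  info: { reason: "server_error" }, // hypothetical non-profile-level reason
  attemptsSoFar: 1,
  remainingProfiles: ["profile-1", "profile-2"],
});
```

Keeping the predicate separate from the loop would also make the "no rotation on non-profile-level error" case easy to unit test in isolation.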
Suggested tests
Extend src/agents/model-fallback.test.ts (and the e2e variant) with:
- Rotate-then-succeed: 3 profiles configured; profile 0 returns usage_limit_reached; profile 1 succeeds. Assert the success comes from profile 1, the fallback candidate (claude-cli) was never invoked, and a rotate_profile decision was observed.
- Rotate-exhausted-then-fallback: all profiles return usage_limit_reached; assert the fallback candidate is invoked after the rotation cap, and MAX_PROFILE_ROTATIONS is honored.
- No rotation on non-profile-level error: profile 0 returns a generic 5xx; assert no rotation occurs and the loop falls through to the next candidate immediately.
- fill_first regression: verify the between-requests rotation path (current behavior) still works after a profile is cooled down by an earlier sequence.
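The first three scenarios can be prototyped against a small synchronous stand-in before wiring them into the real test file. Everything in this sketch is hypothetical (the simulate function, the Outcome values, the provider#profile keys, and the "default" profile for claude-cli); it models the proposed loop's decisions and does not call any openclaw code.

```typescript
// Hypothetical stand-in for the proposed loop; not openclaw's implementation.
type Outcome = "ok" | "usage_limit_reached" | "server_error";

interface SimCandidate {
  provider: string;
  profiles: string[]; // priority order, highest first
}

interface SimResult {
  servedBy: string;    // "provider#profile" that succeeded, or "none"
  decisions: string[]; // decisions observed, in order
}

// `outcomes` maps "provider#profile" to what a fake provider returns for it.
function simulate(
  chain: SimCandidate[],
  outcomes: Record<string, Outcome>,
  maxAttemptsPerCandidate: number
): SimResult {
  const decisions: string[] = [];
  for (const candidate of chain) {
    const limit = Math.min(candidate.profiles.length, maxAttemptsPerCandidate);
    for (let i = 0; i < limit; i++) {
      const key = candidate.provider + "#" + candidate.profiles[i];
      const outcome = outcomes[key];
      if (outcome === "ok") return { servedBy: key, decisions };
      if (outcome !== "usage_limit_reached") break; // not profile-level: fall through
      if (i + 1 < limit) decisions.push("rotate_profile");
    }
    decisions.push("fallback_model");
  }
  return { servedBy: "none", decisions };
}

const simChain: SimCandidate[] = [
  { provider: "openai-codex", profiles: ["p0", "p1", "p2"] },
  { provider: "claude-cli", profiles: ["default"] },
];

// Rotate-then-succeed: profile 0 is rate-limited, profile 1 is healthy.
const rotated = simulate(simChain, {
  "openai-codex#p0": "usage_limit_reached",
  "openai-codex#p1": "ok",
  "openai-codex#p2": "ok",
  "claude-cli#default": "ok",
}, 3);

// Rotate-exhausted-then-fallback: every openai-codex profile is rate-limited.
const exhausted = simulate(simChain, {
  "openai-codex#p0": "usage_limit_reached",
  "openai-codex#p1": "usage_limit_reached",
  "openai-codex#p2": "usage_limit_reached",
  "claude-cli#default": "ok",
}, 3);

// No rotation on a generic 5xx from profile 0.
const serverError = simulate(simChain, {
  "openai-codex#p0": "server_error",
  "openai-codex#p1": "ok",
  "openai-codex#p2": "ok",
  "claude-cli#default": "ok",
}, 3);
```

The real tests would replace the outcome map with the mocked runFallbackAttempt and assert on the observed decisions in the same way.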
Impact
- Reduces user-perceived fallback latency when the preferred provider has multiple healthy profiles.
- Maximizes utilization of paid auth pools (Pro/Team plan profiles) before paying the cross-provider context-loss tax.
- No-op for installations that have only one profile per provider.
Filed by
OpenClaw operator instance, with corroborating journal evidence from a Hermes deployment running openclaw 2026.5.7 against three openai-codex OAuth profiles (one Pro plan, two Team plan).