Timeout-driven auth rotation causes premature model/provider fallback
Problem
In the embedded runner’s auth-profile failover loop (used by providers that support auth.profiles), a request timeout is currently treated as a strong signal to rotate/cool down the current profile. This can cascade into “no available auth profile” and then into model/provider fallback (when agents.defaults.model.fallbacks is configured), even when the underlying provider may be temporarily slow rather than rate-limited.
The concrete example below uses openai-codex, but the underlying mechanism is provider-agnostic: any provider that participates in the auth-profile loop can hit the same timeout → cooldown/rotate → fallback cascade.
When a request times out, OpenClaw immediately:
- Marks the current auth profile as failed (reason: "timeout")
- Applies exponential cooldown to that profile
- Attempts to rotate to the next account/profile
- If all profiles are in cooldown/unavailable, raises: No available auth profile for <provider> (all in cooldown or unavailable)
- If fallbacks are configured, proceeds to the next model/provider candidate
This is too aggressive for generic timeouts, which are often transient (network blip, slow streaming, SDK hiccup, temporary provider latency) rather than a reliable rate-limit signal.
Observed log sequence (example: openai-codex)
(Representative messages; wrapper/context may vary, but the key strings are stable.)
Profile openai-codex:default timed out (possible rate limit). Trying next account...
No available auth profile for openai-codex (all in cooldown or unavailable).
... provider=openai model=gpt-5.2 ... # run continues using agents.defaults.model.fallbacks
Root cause (code pointers)
src/agents/pi-embedded-runner/run.ts
- Timeout is treated as a rotation condition (no retry gate)
- On rotate: calls markAuthProfileFailure(reason: "timeout"), then advanceAuthProfile()
- Emits:
  - Profile ${lastProfileId} timed out (possible rate limit). Trying next account...
  - No available auth profile for ${provider} (all in cooldown or unavailable).
src/agents/auth-profiles/usage.ts
- calculateAuthProfileCooldownMs(errorCount) applies the same exponential schedule for timeout and explicit rate-limit failures alike (~1m → 5m → 25m → 1h cap).
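To make the cost of a single timeout concrete, here is a minimal sketch of a schedule that reproduces the ~1m → 5m → 25m → 1h progression. The constants, the growth factor of 5, and the function name sketchCooldownMs are assumptions inferred from that progression; the actual calculateAuthProfileCooldownMs may use different constants or rounding.

```typescript
// Hypothetical reconstruction of the schedule described above.
// The real calculateAuthProfileCooldownMs may differ in constants and rounding.
const SKETCH_BASE_COOLDOWN_MS = 60_000;     // ~1 minute after the first failure
const SKETCH_COOLDOWN_FACTOR = 5;           // assumed growth factor (1m -> 5m -> 25m)
const SKETCH_MAX_COOLDOWN_MS = 60 * 60_000; // 1 hour cap

function sketchCooldownMs(errorCount: number): number {
  // Zero or negative counts mean the profile has no recorded failures.
  if (errorCount <= 0) return 0;
  const raw = SKETCH_BASE_COOLDOWN_MS * SKETCH_COOLDOWN_FACTOR ** (errorCount - 1);
  return Math.min(raw, SKETCH_MAX_COOLDOWN_MS);
}

// Cooldown in minutes for the first five consecutive failures.
const sketchScheduleMinutes = [1, 2, 3, 4, 5].map((n) => sketchCooldownMs(n) / 60_000);
console.log(sketchScheduleMinutes); // [ 1, 5, 25, 60, 60 ]
```

Under this reading, a profile that hits three transient timeouts in a row is benched for 25 minutes, the same penalty an explicit 429 would earn.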
Expected behavior
On a timeout, OpenClaw should retry the same profile at least once with jittered backoff before applying cooldown or rotating. Aggressive cooldown and rotation should be reserved for strong rate-limit signals (HTTP 429 or provider-specific rate-limit codes).
Proposed fix
Minimal (recommended first step)
Add a per-reason retry gate before calling markAuthProfileFailure(reason="timeout") and rotating:
- If failureReason === "timeout" and consecutiveTimeouts < retrySameProfileOnTimeout:
  - wait jitter(retryBackoffMs)
  - re-issue the request on the same profile
  - do not write cooldown state yet
- Only after retries are exhausted: apply cooldown + rotate as today
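The gate above could be sketched as a pure decision function that the failover loop consults before touching cooldown state. This is an illustration under stated assumptions, not the runner's actual code: the type names, decideOnFailure, and the injectable rng parameter are hypothetical, and retryBackoffMs is assumed to be a [min, max] range in milliseconds.

```typescript
// Hypothetical types; only "timeout" and the two option names come from the proposal.
type FailureReason = "timeout" | "rate_limit" | "other";

interface TimeoutRetryOptions {
  retrySameProfileOnTimeout: number; // max same-profile retries per request
  retryBackoffMs: [number, number];  // assumed to be a [min, max] delay range
}

type GateDecision =
  | { action: "retry_same_profile"; delayMs: number }
  | { action: "cooldown_and_rotate" };

// Uniform jitter within [min, max]; rng is injectable so tests are deterministic.
function jitter([min, max]: [number, number], rng: () => number = Math.random): number {
  return Math.round(min + rng() * (max - min));
}

function decideOnFailure(
  failureReason: FailureReason,
  consecutiveTimeouts: number,
  options: TimeoutRetryOptions,
  rng: () => number = Math.random,
): GateDecision {
  // Only timeouts are gated, and only while same-profile retries remain.
  if (failureReason === "timeout" && consecutiveTimeouts < options.retrySameProfileOnTimeout) {
    return { action: "retry_same_profile", delayMs: jitter(options.retryBackoffMs, rng) };
  }
  // Everything else falls through to today's behavior: cooldown, then rotate.
  return { action: "cooldown_and_rotate" };
}

const gateOptions: TimeoutRetryOptions = { retrySameProfileOnTimeout: 1, retryBackoffMs: [300, 1200] };
console.log(decideOnFailure("timeout", 0, gateOptions));    // retry after a jittered delay
console.log(decideOnFailure("timeout", 1, gateOptions));    // retries exhausted
console.log(decideOnFailure("rate_limit", 0, gateOptions)); // never gated
```

Keeping the decision separate from the sleep and the request makes the "no cooldown written on first timeout" property easy to assert in a unit test, since the caller only writes cooldown state when the action is cooldown_and_rotate.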
Suggested config knob (additive / backward-compatible):
agents: {
  defaults: {
    modelFailover: {
      retrySameProfileOnTimeout: 1,
      retryBackoffMs: [300, 1200]
    }
  }
}
Optional extensions
- Per-reason cooldown schedules (lighter backoff for timeouts)
- Minimum consecutive failures before cooldown (don’t cooldown on first timeout)
- Separate “rotate on” criteria per failure reason
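As one possible shape for the first two extensions, the sketch below keys the cooldown schedule on the failure reason and adds a minimum-failures threshold. All names and numbers here are hypothetical placeholders: the rate_limit row mirrors the existing ~1m → 5m → 25m → 1h schedule, while the timeout values are illustrative only.

```typescript
// Hypothetical per-reason policy table; none of these names exist in the codebase.
type CooldownReason = "timeout" | "rate_limit";

interface CooldownPolicy {
  baseMs: number;                 // cooldown at the first penalized failure
  factor: number;                 // multiplier per additional failure
  maxMs: number;                  // upper bound on the cooldown
  minConsecutiveFailures: number; // failures required before any cooldown applies
}

const COOLDOWN_POLICIES: Record<CooldownReason, CooldownPolicy> = {
  // Mirrors the existing schedule: 1m, 5m, 25m, capped at 1h.
  rate_limit: { baseMs: 60_000, factor: 5, maxMs: 3_600_000, minConsecutiveFailures: 1 },
  // Illustrative lighter schedule: first timeout is free, then 15s doubling up to 5m.
  timeout: { baseMs: 15_000, factor: 2, maxMs: 300_000, minConsecutiveFailures: 2 },
};

function cooldownMsFor(reason: CooldownReason, consecutiveFailures: number): number {
  const policy = COOLDOWN_POLICIES[reason];
  // Below the threshold, no cooldown entry should be written at all.
  if (consecutiveFailures < policy.minConsecutiveFailures) return 0;
  const step = consecutiveFailures - policy.minConsecutiveFailures;
  return Math.min(policy.baseMs * policy.factor ** step, policy.maxMs);
}

console.log(cooldownMsFor("timeout", 1));    // 0
console.log(cooldownMsFor("timeout", 2));    // 15000
console.log(cooldownMsFor("rate_limit", 1)); // 60000
```

A table like this keeps explicit rate-limit handling unchanged while letting timeouts be tuned independently.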
Acceptance criteria
- A single timeout retries the same auth profile with jittered delay; no cooldown entry is written.
- Cooldown for timeout only applies after configured retries are exhausted.
- Explicit rate-limit failures (429 / provider error codes) still trigger immediate cooldown + rotation.
- Logs/telemetry show: retry attempt #, delay, and whether cooldown was applied.
Test ideas
- Unit: timeout → retry same profile → success; assert no cooldown written, no rotation.
- Unit: timeout → timeout (retries exhausted) → cooldown+rotate.
- Mocked e2e: multiple profiles + intermittent timeouts should not immediately exhaust all profiles.
Workarounds (today)
- Increase agents.defaults.timeoutSeconds to reduce timeout frequency.
- Disable or narrow agents.defaults.model.fallbacks if provider switching is undesirable.