Timeout-driven auth rotation/cooldown on timeouts causes premature provider fallback (proposal: retry/backoff same profile) #23317

@phenomenoner

Timeout-driven auth rotation causes premature model/provider fallback

Problem

In the embedded runner’s auth-profile failover loop (used by providers that support auth.profiles), a request timeout is currently treated as a strong signal to cool down the current profile and rotate away from it. This can cascade into “No available auth profile” and then into model/provider fallback (when agents.defaults.model.fallbacks is configured), even when the provider is only temporarily slow rather than rate-limited.

Concrete example below uses openai-codex, but the underlying mechanism is provider-agnostic: any provider that participates in the auth-profile loop can hit the same timeout → cooldown/rotate → fallback cascade.

When a request times out, OpenClaw immediately:

  1. Marks the current auth profile as failed (reason: "timeout")
  2. Applies exponential cooldown to that profile
  3. Attempts to rotate to the next account/profile
  4. If all profiles are in cooldown/unavailable, raises No available auth profile for <provider> (all in cooldown or unavailable)
  5. If fallbacks are configured, proceeds to the next model/provider candidate

This is too aggressive for generic timeouts, which are often transient (network blip, slow streaming, SDK hiccup, temporary provider latency) rather than a reliable rate-limit signal.


Observed log sequence (example: openai-codex)

(Representative messages; wrapper/context may vary, but the key strings are stable.)

Profile openai-codex:default timed out (possible rate limit). Trying next account...
No available auth profile for openai-codex (all in cooldown or unavailable).
... provider=openai model=gpt-5.2 ...   # run continues using agents.defaults.model.fallbacks

Root cause (code pointers)

src/agents/pi-embedded-runner/run.ts

  • Timeout is treated as a rotation condition (no retry gate)
  • On rotate: calls markAuthProfileFailure(reason: "timeout"), then advanceAuthProfile()
  • Emits:
    • Profile ${lastProfileId} timed out (possible rate limit). Trying next account...
    • No available auth profile for ${provider} (all in cooldown or unavailable).

src/agents/auth-profiles/usage.ts

  • calculateAuthProfileCooldownMs(errorCount) applies the same exponential schedule for timeout and explicit rate-limit failures alike (~1m → 5m → 25m → 1h cap).
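
For illustration, the schedule quoted above can be reproduced with a few lines of TypeScript. This is a hypothetical reconstruction, not the actual body of calculateAuthProfileCooldownMs: the base of 1 minute, the factor of 5, and the name sketchCooldownMs are assumptions inferred only from the "~1m → 5m → 25m → 1h cap" sequence.

```typescript
// Hypothetical reconstruction of the quoted schedule (~1m -> 5m -> 25m -> 1h cap).
// Assumptions: 1 minute base, x5 per consecutive failure, hard cap at 1 hour.
const BASE_COOLDOWN_MS = 60_000;
const MAX_COOLDOWN_MS = 60 * 60_000;

function sketchCooldownMs(errorCount: number): number {
  // Treat anything below 1 as the first failure.
  const n = Math.max(1, Math.floor(errorCount));
  // Clamp the exponent so very large counts cannot overflow; 5^3 minutes already exceeds the cap.
  const exponent = Math.min(n - 1, 3);
  return Math.min(MAX_COOLDOWN_MS, BASE_COOLDOWN_MS * Math.pow(5, exponent));
}

// The same value is returned regardless of failure reason, which is the
// behavior this issue describes: a timeout costs as much as a rate limit.
```

Under these assumptions, a profile that times out three times in a row is already sidelined for 25 minutes, which is why a handful of slow responses can exhaust every profile.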

Expected behavior

On a timeout, OpenClaw should retry the same profile at least once with jittered backoff before applying cooldown or rotating. Aggressive cooldown and rotation should be reserved for strong rate-limit signals (HTTP 429 or provider-specific rate-limit codes).


Proposed fix

Minimal (recommended first step)

Add a per-reason retry gate before calling markAuthProfileFailure(reason: "timeout") and rotating:

  1. If failureReason === "timeout" and consecutiveTimeouts < retrySameProfileOnTimeout:
    • wait jitter(retryBackoffMs)
    • re-issue the request on the same profile
    • do not write cooldown state yet
  2. Only after retries are exhausted: apply cooldown + rotate as today
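
The gate above could be expressed as a small pure decision function that the failover loop consults before touching cooldown state. This is a minimal sketch, not code from run.ts: the names decideOnFailure, FailoverDecision, and the "rate_limit" / "other" reason strings are hypothetical, and it assumes retryBackoffMs is a [min, max] range from which a uniformly random delay is drawn.

```typescript
// Hypothetical types; only the "timeout" reason string appears in the issue.
type FailureReason = "timeout" | "rate_limit" | "other";

interface TimeoutRetryPolicy {
  retrySameProfileOnTimeout: number; // max retries on the same profile
  retryBackoffMs: [number, number];  // assumed to be a [min, max] jitter range
}

type FailoverDecision =
  | { action: "retry_same_profile"; delayMs: number }
  | { action: "cooldown_and_rotate" };

function decideOnFailure(
  reason: FailureReason,
  consecutiveTimeouts: number, // timeouts already retried on this profile (0 on the first one)
  policy: TimeoutRetryPolicy,
  random: () => number = Math.random, // injectable for deterministic tests
): FailoverDecision {
  if (reason === "timeout" && consecutiveTimeouts < policy.retrySameProfileOnTimeout) {
    const [minMs, maxMs] = policy.retryBackoffMs;
    // Uniform jitter inside the configured range.
    const delayMs = Math.round(minMs + random() * (maxMs - minMs));
    // The caller waits delayMs and re-issues the request; no cooldown is written.
    return { action: "retry_same_profile", delayMs };
  }
  // Retries exhausted, or a non-timeout failure: keep today's behavior.
  return { action: "cooldown_and_rotate" };
}
```

Keeping the decision separate from the I/O means the caller only invokes markAuthProfileFailure and advanceAuthProfile when the result is "cooldown_and_rotate", and the gate itself can be unit tested without mocking a provider.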

Suggested config knob (additive / backward-compatible):

agents: {
  defaults: {
    modelFailover: {
      retrySameProfileOnTimeout: 1,
      retryBackoffMs: [300, 1200]
    }
  }
}
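
To keep the knob additive, the config could be resolved with defaults that preserve current behavior when the block is absent. The sketch below is an assumption about how that might look: ModelFailoverConfig and resolveModelFailover are hypothetical names, and defaulting retrySameProfileOnTimeout to 0 (no retry, same as today) is one possible choice rather than something the proposal specifies.

```typescript
// Hypothetical shape of agents.defaults.modelFailover as proposed above.
interface ModelFailoverConfig {
  retrySameProfileOnTimeout?: number;
  retryBackoffMs?: [number, number];
}

interface ResolvedModelFailover {
  retrySameProfileOnTimeout: number;
  retryBackoffMs: [number, number];
}

function resolveModelFailover(cfg?: ModelFailoverConfig): ResolvedModelFailover {
  // Assumed default of 0 retries: an unset config behaves exactly like the current code.
  const retries = Math.max(0, Math.floor(cfg?.retrySameProfileOnTimeout ?? 0));
  const [a, b] = cfg?.retryBackoffMs ?? [300, 1200];
  // Normalize so the range is non-negative and ordered [min, max].
  const minMs = Math.max(0, Math.min(a, b));
  const maxMs = Math.max(0, Math.max(a, b));
  return { retrySameProfileOnTimeout: retries, retryBackoffMs: [minMs, maxMs] };
}
```

Normalizing the range up front means downstream jitter code does not need to guard against a reversed or negative pair.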

Optional extensions

  • Per-reason cooldown schedules (lighter backoff for timeouts)
  • Minimum consecutive failures before cooldown (don’t cool down on the first timeout)
  • Separate “rotate on” criteria per failure reason

Acceptance criteria

  • A single timeout retries the same auth profile with jittered delay; no cooldown entry is written.
  • Cooldown for timeout only applies after configured retries are exhausted.
  • Explicit rate-limit failures (429 / provider error codes) still trigger immediate cooldown + rotation.
  • Logs/telemetry show: retry attempt #, delay, and whether cooldown was applied.

Test ideas

  • Unit: timeout → retry same profile → success; assert no cooldown written, no rotation.
  • Unit: timeout → timeout (retries exhausted) → cooldown+rotate.
  • Mocked e2e: multiple profiles + intermittent timeouts should not immediately exhaust all profiles.

Workarounds (today)

  • Increase agents.defaults.timeoutSeconds to reduce timeout frequency.
  • Disable or narrow agents.defaults.model.fallbacks if provider switching is undesirable.
