Timeout-driven auth rotation/cooldown on timeouts causes premature provider fallback (proposal: retry/backoff same profile)

## Timeout-driven auth rotation causes premature model/provider fallback

### Problem

In the embedded runner’s **auth-profile failover loop** (used by providers that support `auth.profiles`), a request timeout is currently treated as a strong signal to rotate/cool down the current profile. This can cascade into **“no available auth profile”** and then into **model/provider fallback** (when `agents.defaults.model.fallbacks` is configured), even when the underlying provider may be temporarily slow rather than rate-limited.

> Concrete example below uses `openai-codex`, but the underlying mechanism is provider-agnostic: any provider that participates in the auth-profile loop can hit the same timeout → cooldown/rotate → fallback cascade.

When a request times out, OpenClaw immediately:

1. Marks the current auth profile as failed (`reason: "timeout"`)
2. Applies exponential cooldown to that profile
3. Attempts to rotate to the next account/profile
4. If all profiles are in cooldown/unavailable, raises `No available auth profile for <provider> (all in cooldown or unavailable)`
5. If fallbacks are configured, proceeds to the next model/provider candidate

This is too aggressive for **generic timeouts**, which are often transient (network blip, slow streaming, SDK hiccup, temporary provider latency) rather than a reliable rate-limit signal.
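To make the cascade concrete, here is a deliberately simplified model of the five steps above. It is not OpenClaw's code: `Profile`, `Outcome`, and `onTimeout` are hypothetical names, and the real loop in `run.ts` does more. It only shows how a single timeout per profile is enough to reach fallback.

```ts
// Simplified model of the cascade; all names here are hypothetical.
type Profile = { id: string; cooldownUntil: number };

type Outcome =
  | { kind: "rotated"; nextProfileId: string }
  | { kind: "fallback"; message: string };

function onTimeout(
  provider: string,
  profiles: Profile[],
  currentId: string,
  now: number,
  cooldownMs: number,
): Outcome {
  // Steps 1-2: mark the current profile as failed and start its cooldown.
  const current = profiles.find((p) => p.id === currentId);
  if (current) current.cooldownUntil = now + cooldownMs;

  // Step 3: rotate to the first profile that is not cooling down.
  const next = profiles.find((p) => p.cooldownUntil <= now);
  if (next) return { kind: "rotated", nextProfileId: next.id };

  // Steps 4-5: nothing is left, so the caller proceeds to model/provider fallback.
  return {
    kind: "fallback",
    message: `No available auth profile for ${provider} (all in cooldown or unavailable).`,
  };
}
```

With a single configured profile, the very first timeout returns `fallback`; with N profiles, N transient timeouts in a row do the same.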

---

### Observed log sequence (example: `openai-codex`)

(Representative messages; wrapper/context may vary, but the key strings are stable.)

```
Profile openai-codex:default timed out (possible rate limit). Trying next account...
No available auth profile for openai-codex (all in cooldown or unavailable).
... provider=openai model=gpt-5.2 ...   # run continues using agents.defaults.model.fallbacks
```

---

### Root cause (code pointers)

**`src/agents/pi-embedded-runner/run.ts`**
- Timeout is treated as a rotation condition (no retry gate)
- On rotate: calls `markAuthProfileFailure(reason: "timeout")`, then `advanceAuthProfile()`
- Emits:
  - `Profile ${lastProfileId} timed out (possible rate limit). Trying next account...`
  - `No available auth profile for ${provider} (all in cooldown or unavailable).`

**`src/agents/auth-profiles/usage.ts`**
- `calculateAuthProfileCooldownMs(errorCount)` applies the same exponential schedule for `timeout` and explicit rate-limit failures alike (`~1m → 5m → 25m → 1h cap`).
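
For reference, a schedule of roughly `1m → 5m → 25m → 1h cap` is consistent with a base of one minute multiplied by 5 per failure and clamped at one hour. The sketch below is a reconstruction from those quoted values only, not the actual body of `calculateAuthProfileCooldownMs`; the function name `sketchCooldownMs` and the ×5 multiplier are assumptions.

```ts
const MINUTE_MS = 60_000;
const COOLDOWN_CAP_MS = 60 * MINUTE_MS;

// Hypothetical reconstruction: 1m, 5m, 25m, then capped at 1h.
function sketchCooldownMs(errorCount: number): number {
  const n = Math.max(1, Math.floor(errorCount));
  // Clamp the exponent so very large error counts cannot overflow.
  const exponent = Math.min(n - 1, 3);
  return Math.min(COOLDOWN_CAP_MS, MINUTE_MS * 5 ** exponent);
}
```

Under this schedule a profile that times out twice in a row is already unavailable for about five minutes, which is why a handful of transient timeouts can take every profile out of rotation.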

---

### Expected behavior

On a timeout, OpenClaw should **retry the same profile at least once with jittered backoff** before applying cooldown or rotating. Aggressive cooldown and rotation should be reserved for strong rate-limit signals (HTTP 429 or provider-specific rate-limit codes).

---

### Proposed fix

#### Minimal (recommended first step)

Add a per-reason retry gate before calling `markAuthProfileFailure(reason: "timeout")` and rotating:

1. If `failureReason === "timeout"` and `consecutiveTimeouts < retrySameProfileOnTimeout`:
   - wait `jitter(retryBackoffMs)`
   - re-issue the request on the **same** profile
   - do **not** write cooldown state yet
2. Only after retries are exhausted: apply cooldown + rotate as today

Suggested config knob (additive / backward-compatible):

```json5
agents: {
  defaults: {
    modelFailover: {
      retrySameProfileOnTimeout: 1,
      retryBackoffMs: [300, 1200]
    }
  }
}
```
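
The retry gate and the config knob could fit together roughly as follows. This is a minimal sketch under stated assumptions, not a patch against `run.ts`: `decideOnFailure`, `GateDecision`, and the `"rate_limit"` / `"other"` reason values are hypothetical, `retryBackoffMs` is read as a `[min, max]` range in milliseconds, and `consecutiveTimeouts` is taken to count timeouts already seen on this profile before the current one.

```ts
type FailureReason = "timeout" | "rate_limit" | "other"; // only "timeout" appears in the issue

interface ModelFailoverConfig {
  retrySameProfileOnTimeout: number;
  retryBackoffMs: [number, number]; // assumed to mean [min, max] delay in ms
}

type GateDecision =
  | { action: "retry-same-profile"; delayMs: number }
  | { action: "cooldown-and-rotate" };

// Uniform jitter within [minMs, maxMs]; `random` is injectable for tests.
function jitter(
  [minMs, maxMs]: [number, number],
  random: () => number = Math.random,
): number {
  return Math.round(minMs + random() * (maxMs - minMs));
}

function decideOnFailure(
  reason: FailureReason,
  consecutiveTimeouts: number,
  config: ModelFailoverConfig,
  random: () => number = Math.random,
): GateDecision {
  // Gate: a timeout with retries remaining stays on the same profile
  // and writes no cooldown state.
  if (reason === "timeout" && consecutiveTimeouts < config.retrySameProfileOnTimeout) {
    return { action: "retry-same-profile", delayMs: jitter(config.retryBackoffMs, random) };
  }
  // Everything else keeps today's behavior.
  return { action: "cooldown-and-rotate" };
}
```

Keeping the decision pure (no sleeping, no state writes) means the caller owns the delay and the cooldown bookkeeping, and the gate itself can be unit-tested without timers.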

#### Optional extensions

- Per-reason cooldown schedules (lighter backoff for timeouts)
- Minimum consecutive failures before cooldown (don’t cool down on the first timeout)
- Separate “rotate on” criteria per failure reason
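
One possible shape for these extensions is a per-reason policy table. Everything in this sketch is hypothetical, including the type names, the field names, and the numeric values, which are placeholders chosen only to show a lighter schedule for timeouts than for rate limits.

```ts
type Reason = "timeout" | "rate_limit";

interface ReasonPolicy {
  minFailuresBeforeCooldown: number; // failures tolerated before any cooldown
  rotateAfterFailures: number;       // failures tolerated before rotating
  cooldownScheduleMs: number[];      // last entry acts as the cap
}

// Illustrative values only; not proposed defaults.
const POLICIES: Record<Reason, ReasonPolicy> = {
  timeout: {
    minFailuresBeforeCooldown: 2,
    rotateAfterFailures: 2,
    cooldownScheduleMs: [15_000, 60_000, 300_000],
  },
  rate_limit: {
    minFailuresBeforeCooldown: 1,
    rotateAfterFailures: 1,
    cooldownScheduleMs: [60_000, 300_000, 1_500_000, 3_600_000],
  },
};

function cooldownFor(reason: Reason, consecutiveFailures: number): number {
  const policy = POLICIES[reason];
  if (consecutiveFailures < policy.minFailuresBeforeCooldown) return 0;
  const step = consecutiveFailures - policy.minFailuresBeforeCooldown;
  const schedule = policy.cooldownScheduleMs;
  return schedule[Math.min(step, schedule.length - 1)] ?? 0;
}

function shouldRotate(reason: Reason, consecutiveFailures: number): boolean {
  return consecutiveFailures >= POLICIES[reason].rotateAfterFailures;
}
```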

---

### Acceptance criteria

- [ ] A single timeout retries the **same** auth profile with jittered delay; no cooldown entry is written.
- [ ] Cooldown for `timeout` only applies after configured retries are exhausted.
- [ ] Explicit rate-limit failures (429 / provider error codes) still trigger immediate cooldown + rotation.
- [ ] Logs/telemetry show: retry attempt #, delay, and whether cooldown was applied.

---

### Test ideas

- Unit: timeout → retry same profile → success; assert no cooldown written, no rotation.
- Unit: timeout → timeout (retries exhausted) → cooldown+rotate.
- Mocked e2e: multiple profiles + intermittent timeouts should not immediately exhaust all profiles.

---

### Workarounds (today)

- Increase `agents.defaults.timeoutSeconds` to reduce timeout frequency.
- Disable or narrow `agents.defaults.model.fallbacks` if provider switching is undesirable.


Timeout-driven auth rotation/cooldown on timeouts causes premature provider fallback (proposal: retry/backoff same profile) #23317
