Bug: Auth profile cooldowns cascade across sessions and providers, causing total agent failure #23815

## Summary

Auth profile cooldowns contaminate across **providers**, **models**, and **sessions**. A single sub-agent timeout on one provider can cascade into total system failure — killing the main agent, all sub-agents, and blocking healthy providers that were never involved in the original failure.

Related to #13623 (model-level cooldown within a single provider), but this covers the broader **cross-session** and **cross-provider** contamination.

## The Cascade Chain

1. Sub-agent hits Anthropic timeout (e.g. Opus + `thinking=high` + large context)
2. OpenClaw records failure → sets `cooldownUntil` on the shared auth profile
3. Main agent tries Opus → profile in cooldown → skips to fallback
4. Fallback provider (e.g. Gemini) → 429 (quota exhausted from prior sub-agent usage)
5. All models on all providers fail → user gets "Agent failed before reply"
6. The agent stays completely dead until the cooldown is cleared manually

## Three Contamination Vectors

### 1. Cross-Provider Contamination

A failure on one provider blocks unrelated providers sharing the same auth profile:

```
Provider A hits 429
  → cooldown set on auth profile usageStats
  → Provider B: blocked ❌ (was working fine, different provider entirely)
```

### 2. Cross-Session Contamination

Cooldown state is stored in `auth-profiles.json`, shared across:
- The main agent session
- All sub-agent sessions spawned by that agent

A sub-agent timeout poisons the shared auth profile → the main agent can't use any model → total failure.

### 3. Cross-Model Contamination

Within the same provider, cooldowns are not model-scoped:

```
Opus times out (thinking=high, slow response)
  → cooldown set on "anthropic" provider
  → Sonnet: blocked ❌ (fast model, would have worked fine as fallback)
```
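All three vectors come down to the key under which a cooldown is stored. As a minimal sketch of the alternative (not OpenClaw's actual code; `ScopedCooldowns`, its method names, and the session/provider/model strings are hypothetical), a cooldown keyed by `(session, provider, model)` only blocks the exact combination that failed:

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Hypothetical key shape: (session_id, provider, model).
CooldownKey = Tuple[str, str, str]


@dataclass
class ScopedCooldowns:
    """Tracks cooldowns per (session, provider, model) instead of per auth profile."""

    _until: Dict[CooldownKey, float] = field(default_factory=dict)

    def record_failure(self, session_id: str, provider: str, model: str,
                       cooldown_s: float, now: Optional[float] = None) -> None:
        # Only the exact combination that failed is put in cooldown.
        now = time.time() if now is None else now
        self._until[(session_id, provider, model)] = now + cooldown_s

    def is_blocked(self, session_id: str, provider: str, model: str,
                   now: Optional[float] = None) -> bool:
        # A key with no recorded failure is never blocked.
        now = time.time() if now is None else now
        return self._until.get((session_id, provider, model), 0.0) > now


cooldowns = ScopedCooldowns()
# A sub-agent's Opus call times out at t=1000 and gets a 60 s cooldown.
cooldowns.record_failure("subagent-1", "anthropic", "opus", cooldown_s=60, now=1000.0)

same_key = cooldowns.is_blocked("subagent-1", "anthropic", "opus", now=1010.0)
other_model = cooldowns.is_blocked("subagent-1", "anthropic", "sonnet", now=1010.0)
other_session = cooldowns.is_blocked("main", "anthropic", "opus", now=1010.0)
other_provider = cooldowns.is_blocked("subagent-1", "gemini", "some-model", now=1010.0)
```

Here only `same_key` is `True`. One design caveat: a genuine 429 is usually account-wide, so a real implementation might share rate-limit cooldowns per provider while keeping timeouts scoped this narrowly.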

## Dead Fallback Compounding

If fallback providers are unavailable (expired tokens, quota exhausted), every primary timeout guarantees total failure and maximizes cooldown contamination:

```
1. Primary model times out
2. Fallback #1 tried → 429 (dead provider)
3. Fallback #2 tried → 429 (dead provider)
4. All models failed → cooldown set across ALL providers
5. Next request: healthy providers blocked by cooldown from dead ones
```

## Reproduction

1. Configure multiple agents sharing the same auth profile
2. Configure primary model + fallbacks across different providers
3. Run sub-agents with `thinking=high` on complex tasks (increases timeout risk)
4. Wait for a timeout on the primary model

**Expected:** Graceful fallback to another model; sub-agent failures don't affect parent.

**Actual:** Provider-wide cooldown blocks all models across all sessions. Total agent death.

## Workarounds

### Reduce cooldown duration
```json
{
  "auth": {
    "cooldowns": {
      "billingBackoffHours": 0.005,
      "billingMaxHours": 0.01,
      "failureWindowHours": 0.005
    }
  }
}
```

These values are in hours: `0.005` is 18 seconds and `0.01` is 36 seconds.

### Periodic cooldown clearing
Run a launchd/cron timer that clears `usageStats` from every `auth-profiles.json` file every 60 seconds.
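
A rough sketch of what such a timer could run. It assumes `usageStats` is a top-level key in each `auth-profiles.json`, which this issue does not confirm, and the root directory to scan is left to the caller; `clear_usage_stats` is a hypothetical helper, not part of OpenClaw:

```python
import json
from pathlib import Path


def clear_usage_stats(root: Path) -> int:
    """Remove the top-level "usageStats" key from every auth-profiles.json under root.

    Returns the number of files that were rewritten.
    """
    cleared = 0
    for path in root.rglob("auth-profiles.json"):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or half-written files
        if isinstance(data, dict) and "usageStats" in data:
            del data["usageStats"]
            # Write to a sibling temp file, then rename over the original,
            # so a reader never sees a partially written file.
            tmp = path.with_name(path.name + ".tmp")
            tmp.write_text(json.dumps(data, indent=2), encoding="utf-8")
            tmp.replace(path)
            cleared += 1
    return cleared
```

Calling `clear_usage_stats(Path("/path/to/agents"))` from a 60-second timer would approximate the workaround. The agent may rewrite the file between the read and the rename, so this remains a stopgap rather than a fix.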

### Remove all dead providers
Strip non-working providers from auth profiles and fallback config entirely.

## Suggested Fixes

1. **Isolate cooldowns per provider** — A failure on provider A should never block provider B _(critical)_
2. **Isolate cooldowns per model** — An Opus timeout should not block Sonnet (#13623)
3. **Session-scoped cooldowns** — Sub-agent failures should not contaminate the parent agent's auth profile
4. **Configurable cooldown behavior** — Allow disabling cooldowns entirely, or per-provider overrides
5. **Smarter fallback logic** — Skip persistently-failing fallbacks adaptively
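
To make fix 5 concrete, here is one possible shape for adaptive fallback, sketched as a small per-candidate circuit breaker. `AdaptiveFallback`, its thresholds, and the `call(provider, model)` callback are all hypothetical and say nothing about how OpenClaw structures its fallback chain:

```python
from typing import Callable, Dict, List, Optional, Tuple

Candidate = Tuple[str, str]  # (provider, model); hypothetical shape


class AdaptiveFallback:
    """Tries candidates in order and temporarily skips ones that keep failing."""

    def __init__(self, candidates: List[Candidate],
                 max_failures: int = 3, skip_s: float = 300.0) -> None:
        self.candidates = list(candidates)
        self.max_failures = max_failures
        self.skip_s = skip_s
        self._failures: Dict[Candidate, int] = {}
        self._skip_until: Dict[Candidate, float] = {}

    def run(self, call: Callable[[str, str], str], now: float) -> Optional[str]:
        for cand in self.candidates:
            if self._skip_until.get(cand, 0.0) > now:
                continue  # persistently failing: do not waste a request on it
            try:
                result = call(*cand)
            except Exception:  # a real client would catch specific error types
                count = self._failures.get(cand, 0) + 1
                self._failures[cand] = count
                if count >= self.max_failures:
                    # Only this candidate is skipped; others are unaffected.
                    self._skip_until[cand] = now + self.skip_s
                continue
            self._failures[cand] = 0  # success resets the failure streak
            return result
        return None  # every candidate failed or is currently skipped
```

Because the failure count is kept after the skip window ends, an expired candidate gets a single probe request: one more failure skips it again, and one success restores it fully.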

## Environment

- OpenClaw 2026.2.17
- macOS (arm64)
- Multi-agent setup with shared auth profiles
- Anthropic as primary provider
