Bug: Auth profile cooldowns cascade across sessions and providers, causing total agent failure #23815

@0xSatoriSync


Summary

Auth profile cooldowns contaminate across providers, models, and sessions. A single sub-agent timeout on one provider can cascade into total system failure — killing the main agent, all sub-agents, and blocking healthy providers that were never involved in the original failure.

Related to #13623 (model-level cooldown within a single provider), but this covers the broader cross-session and cross-provider contamination.

The Cascade Chain

  1. Sub-agent hits Anthropic timeout (e.g. Opus + thinking=high + large context)
  2. OpenClaw records failure → sets cooldownUntil on the shared auth profile
  3. Main agent tries Opus → profile in cooldown → skips to fallback
  4. Fallback provider (e.g. Gemini) → 429 (quota exhausted from prior sub-agent usage)
  5. All models on all providers fail → user gets "Agent failed before reply"
  6. Agent is completely dead until the cooldown is manually cleared

Three Contamination Vectors

1. Cross-Provider Contamination

A failure on one provider blocks unrelated providers sharing the same auth profile:

Provider A hits 429
  → cooldown set on auth profile usageStats
  → Provider B: blocked ❌ (was working fine, different provider entirely)

2. Cross-Session Contamination

Cooldown state is stored in auth-profiles.json, shared across:

  • The main agent session
  • All sub-agent sessions spawned by that agent

A sub-agent timeout poisons the shared auth profile → the main agent can't use any model → total failure.

3. Cross-Model Contamination

Within the same provider, cooldowns are not model-scoped:

Opus times out (thinking=high, slow response)
  → cooldown set on "anthropic" provider
  → Sonnet: blocked ❌ (fast model, would have worked fine as fallback)
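
All three vectors come down to how coarse the cooldown key is. The toy model below is not OpenClaw's code; `Attempt`, `CooldownStore`, `profile_key` and `scoped_key` are hypothetical names, and the profile-wide key is an assumption that mirrors the behavior described in this report. It contrasts that key with one scoped to session, provider and model:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Attempt:
    profile: str   # auth profile id, shared by the main agent and its sub-agents
    session: str
    provider: str
    model: str


def profile_key(a: Attempt) -> Tuple[str, ...]:
    # Coarse key (assumed to match the reported behavior):
    # one cooldown for everything that uses the profile.
    return (a.profile,)


def scoped_key(a: Attempt) -> Tuple[str, ...]:
    # Fine key: a failure only cools down the exact session/provider/model.
    return (a.profile, a.session, a.provider, a.model)


@dataclass
class CooldownStore:
    key_fn: Callable[[Attempt], Tuple[str, ...]]
    cooldown_until: Dict[Tuple[str, ...], float] = field(default_factory=dict)

    def record_failure(self, a: Attempt, now: float, cooldown_s: float) -> None:
        self.cooldown_until[self.key_fn(a)] = now + cooldown_s

    def is_blocked(self, a: Attempt, now: float) -> bool:
        return now < self.cooldown_until.get(self.key_fn(a), 0.0)


sub_opus = Attempt("default", "sub-agent-1", "anthropic", "opus")
main_sonnet = Attempt("default", "main", "anthropic", "sonnet")
main_gemini = Attempt("default", "main", "google", "gemini")

# Same failure recorded under both keying schemes.
shared = CooldownStore(profile_key)
shared.record_failure(sub_opus, now=0.0, cooldown_s=60.0)

scoped = CooldownStore(scoped_key)
scoped.record_failure(sub_opus, now=0.0, cooldown_s=60.0)
```

With the profile-wide key, `shared.is_blocked(main_gemini, now=1.0)` and `shared.is_blocked(main_sonnet, now=1.0)` are both true even though neither was involved in the failure. With the scoped key only `sub_opus` is blocked. A fully session-scoped key is the most permissive choice; a real fix might keep provider-level cooldowns for genuine rate limits while scoping timeouts more narrowly.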

Dead Fallback Compounding

If fallback providers are unavailable (expired tokens, quota exhausted), every primary timeout guarantees total failure and maximizes cooldown contamination:

1. Primary model times out
2. Fallback #1 tried → 429 (dead provider)
3. Fallback #2 tried → 429 (dead provider)
4. All models failed → cooldown set across ALL providers
5. Next request: healthy providers blocked by cooldown from dead ones

Reproduction

  1. Configure multiple agents sharing the same auth profile
  2. Configure primary model + fallbacks across different providers
  3. Run sub-agents with thinking=high on complex tasks (increases timeout risk)
  4. Wait for a timeout on the primary model

Expected: Graceful fallback to another model; sub-agent failures don't affect parent.
Actual: Provider-wide cooldown blocks all models across all sessions. Total agent death.

Workarounds

Reduce cooldown duration

{
  "auth": {
    "cooldowns": {
      "billingBackoffHours": 0.005,
      "billingMaxHours": 0.01,
      "failureWindowHours": 0.005
    }
  }
}
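
For scale, the values above are fractions of an hour. The conversion below is plain arithmetic and uses no OpenClaw API; the dictionary just repeats the three settings from the config:

```python
SECONDS_PER_HOUR = 3600

# Values copied from the workaround config above.
workaround = {
    "billingBackoffHours": 0.005,
    "billingMaxHours": 0.01,
    "failureWindowHours": 0.005,
}


def hours_to_seconds(hours: float) -> float:
    return hours * SECONDS_PER_HOUR


for name, hours in workaround.items():
    print(f"{name}: {hours_to_seconds(hours):g} s")
```

This gives 18 s, 36 s and 18 s respectively, so the workaround shrinks these windows to well under a minute.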

Periodic cooldown clearing

A launchd/cron timer that clears usageStats from all auth-profiles.json files every 60 seconds.
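
A minimal sketch of such a clearing job, under two assumptions that this report does not confirm: that `usageStats` is a top-level key in each `auth-profiles.json`, and that all such files live under one root directory passed in by the caller. `clear_usage_stats` is a hypothetical helper, not part of OpenClaw:

```python
import json
import os
import tempfile
from pathlib import Path


def clear_usage_stats(root: Path) -> int:
    """Drop the top-level "usageStats" key (assumed layout) from every
    auth-profiles.json under root. Returns the number of files rewritten."""
    changed = 0
    # Materialize the listing first so rewriting files cannot disturb iteration.
    for path in list(root.rglob("auth-profiles.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # unreadable or partially written file: leave it alone
        if not isinstance(data, dict) or "usageStats" not in data:
            continue
        del data["usageStats"]
        # Write to a temp file in the same directory, then swap it in atomically
        # so a concurrent reader never sees a half-written file.
        fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(data, fh, indent=2)
        os.replace(tmp, path)
        changed += 1
    return changed
```

A launchd or cron entry would call this once per interval with the directory that holds the agent state. The atomic replace avoids torn reads, but a write by the agent between the read and the replace can still be lost, so this remains a stopgap rather than a fix.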

Remove all dead providers

Strip non-working providers from auth profiles and fallback config entirely.

Suggested Fixes

  1. Isolate cooldowns per provider: a failure on provider A should never block provider B (critical)
  2. Isolate cooldowns per model: an Opus timeout should not block Sonnet (see #13623)
  3. Session-scoped cooldowns: sub-agent failures should not contaminate the parent agent's auth profile
  4. Configurable cooldown behavior: allow disabling cooldowns entirely, or per-provider overrides
  5. Smarter fallback logic: skip persistently failing fallbacks adaptively
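
One possible shape for item 5 is a simple circuit breaker kept per (provider, model) candidate. This is a sketch of that general technique, not a description of OpenClaw internals; `FallbackHealth`, `run_with_fallbacks`, the failure threshold and the probe interval are all hypothetical:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Sequence, Tuple

Candidate = Tuple[str, str]  # (provider, model)


@dataclass
class FallbackHealth:
    """Tracks consecutive failures per candidate (circuit-breaker style)."""
    max_consecutive_failures: int = 3   # hypothetical threshold
    probe_interval_s: float = 300.0     # hypothetical wait before re-probing
    failures: Dict[Candidate, int] = field(default_factory=dict)
    last_failure: Dict[Candidate, float] = field(default_factory=dict)

    def should_try(self, c: Candidate, now: float) -> bool:
        if self.failures.get(c, 0) < self.max_consecutive_failures:
            return True
        # Circuit is open: allow a single probe once the interval has passed.
        return now - self.last_failure[c] >= self.probe_interval_s

    def record(self, c: Candidate, ok: bool, now: float) -> None:
        if ok:
            # Any success resets the candidate's history.
            self.failures.pop(c, None)
            self.last_failure.pop(c, None)
        else:
            self.failures[c] = self.failures.get(c, 0) + 1
            self.last_failure[c] = now


def run_with_fallbacks(
    candidates: Sequence[Candidate],
    call: Callable[[Candidate], bool],
    health: FallbackHealth,
    now: float,
) -> Optional[Candidate]:
    """Try candidates in order, skipping those whose circuit is open.
    Returns the first candidate that succeeds, or None."""
    for c in candidates:
        if not health.should_try(c, now):
            continue
        ok = call(c)
        health.record(c, ok, now)
        if ok:
            return c
    return None
```

Failures are recorded only against the candidate that actually failed, so a dead fallback stops being retried on every request and never marks healthy candidates as unavailable.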

Environment

  • OpenClaw 2026.2.17
  • macOS (arm64)
  • Multi-agent setup with shared auth profiles
  • Anthropic as primary provider

Labels: stale (marked as stale due to inactivity)
