Description
Summary
Auth profile cooldowns contaminate across providers, models, and sessions. A single sub-agent timeout on one provider can cascade into total system failure — killing the main agent, all sub-agents, and blocking healthy providers that were never involved in the original failure.
Related to #13623 (model-level cooldown within a single provider), but this covers the broader cross-session and cross-provider contamination.
The Cascade Chain
- Sub-agent hits Anthropic timeout (e.g. Opus + thinking=high + large context)
- OpenClaw records failure → sets cooldownUntil on the shared auth profile
- Main agent tries Opus → profile in cooldown → skips to fallback
- Fallback provider (e.g. Gemini) → 429 (quota exhausted from prior sub-agent usage)
- All models on all providers fail → user gets "Agent failed before reply"
- Agent is completely dead until the cooldown is manually cleared
Three Contamination Vectors
1. Cross-Provider Contamination
A failure on one provider blocks unrelated providers sharing the same auth profile:
Provider A hits 429
→ cooldown set on auth profile usageStats
→ Provider B: blocked ❌ (was working fine, different provider entirely)
2. Cross-Session Contamination
Cooldown state is stored in auth-profiles.json, shared across:
- The main agent session
- All sub-agent sessions spawned by that agent
A sub-agent timeout poisons the shared auth profile → the main agent can't use any model → total failure.
3. Cross-Model Contamination
Within the same provider, cooldowns are not model-scoped:
Opus times out (thinking=high, slow response)
→ cooldown set on "anthropic" provider
→ Sonnet: blocked ❌ (fast model, would have worked fine as fallback)
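All three vectors follow from one property: the cooldown is looked up by auth profile alone. The toy model below illustrates that property as described in this report. It is not OpenClaw's code, and `SharedCooldowns`, `record_failure`, `is_blocked`, the profile id, and the model labels are all hypothetical names chosen for the example.

```python
import time


class SharedCooldowns:
    """Toy model (hypothetical): one cooldown timestamp per auth profile."""

    def __init__(self):
        self._cooldown_until = {}  # profile_id -> unix timestamp

    def record_failure(self, profile_id, provider, model, session,
                       seconds=60.0, now=None):
        # provider, model, and session are accepted but never used as part
        # of the key. That omission is what the three vectors have in common.
        now = time.time() if now is None else now
        self._cooldown_until[profile_id] = now + seconds

    def is_blocked(self, profile_id, provider, model, session, now=None):
        now = time.time() if now is None else now
        return now < self._cooldown_until.get(profile_id, 0.0)


cooldowns = SharedCooldowns()

# A sub-agent's Opus request times out at t=1000.
cooldowns.record_failure("default", "anthropic", "opus", "sub-agent-1", now=1000.0)

# One second later, every lookup through the same profile is blocked.
cross_model = cooldowns.is_blocked("default", "anthropic", "sonnet", "sub-agent-1", now=1001.0)
cross_provider = cooldowns.is_blocked("default", "gemini", "gemini-model", "sub-agent-1", now=1001.0)
cross_session = cooldowns.is_blocked("default", "anthropic", "opus", "main", now=1001.0)

print(cross_model, cross_provider, cross_session)  # True True True
```

Because the key carries no provider, model, or session, nothing in the lookup can distinguish the request that failed from requests that would have succeeded.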
Dead Fallback Compounding
If fallback providers are unavailable (expired tokens, quota exhausted), every primary timeout guarantees total failure and maximizes cooldown contamination:
1. Primary model times out
2. Fallback #1 tried → 429 (dead provider)
3. Fallback #2 tried → 429 (dead provider)
4. All models failed → cooldown set across ALL providers
5. Next request: healthy providers blocked by cooldown from dead ones
Reproduction
- Configure multiple agents sharing the same auth profile
- Configure primary model + fallbacks across different providers
- Run sub-agents with thinking=high on complex tasks (increases timeout risk)
- Wait for a timeout on the primary model
Expected: Graceful fallback to another model; sub-agent failures don't affect parent.
Actual: Provider-wide cooldown blocks all models across all sessions. Total agent death.
Workarounds
Reduce cooldown duration
{
  "auth": {
    "cooldowns": {
      "billingBackoffHours": 0.005,
      "billingMaxHours": 0.01,
      "failureWindowHours": 0.005
    }
  }
}
Periodic cooldown clearing
A launchd/cron timer that clears usageStats from all auth-profiles.json files every 60 seconds.
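A minimal sketch of what such a timer could run is shown below. The exact layout of auth-profiles.json is not documented in this report, so the script makes no assumption about where usageStats sits and simply removes every key with that name at any depth. The `root` directory to search, and the function names `strip_usage_stats` and `clear_cooldowns`, are assumptions for illustration.

```python
import json
from pathlib import Path


def strip_usage_stats(node):
    """Recursively delete every 'usageStats' key; return how many were removed."""
    removed = 0
    if isinstance(node, dict):
        if "usageStats" in node:
            del node["usageStats"]
            removed += 1
        for value in node.values():
            removed += strip_usage_stats(value)
    elif isinstance(node, list):
        for item in node:
            removed += strip_usage_stats(item)
    return removed


def clear_cooldowns(root):
    """Rewrite each auth-profiles.json under root that contained usageStats.

    Returns the list of files that were modified.
    """
    cleared = []
    for path in Path(root).rglob("auth-profiles.json"):
        data = json.loads(path.read_text(encoding="utf-8"))
        if strip_usage_stats(data):
            path.write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8")
            cleared.append(path)
    return cleared
```

The write is not atomic and takes no lock, so a running agent could overwrite the file between the read and the write. That is tolerable for a stopgap that runs every 60 seconds, but it is one more reason to treat this as a workaround rather than a fix.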
Remove all dead providers
Strip non-working providers from auth profiles and fallback config entirely.
Suggested Fixes
- Isolate cooldowns per provider: A failure on provider A should never block provider B (critical)
- Isolate cooldowns per model: An Opus timeout should not block Sonnet (#13623)
- Session-scoped cooldowns: Sub-agent failures should not contaminate the parent agent's auth profile
- Configurable cooldown behavior: Allow disabling cooldowns entirely, or per-provider overrides
- Smarter fallback logic: Skip persistently-failing fallbacks adaptively
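The first three suggestions amount to widening the cooldown key. The sketch below shows one possible shape, assuming a cooldown keyed by (session, provider, model). It is an illustration of the idea, not a proposal for OpenClaw's actual data model, and `ScopedCooldowns`, `first_available`, and the session and model labels are hypothetical.

```python
import time


class ScopedCooldowns:
    """Hypothetical store: one cooldown per (session, provider, model)."""

    def __init__(self):
        self._until = {}  # (session, provider, model) -> unix timestamp

    def record_failure(self, session, provider, model, seconds, now=None):
        now = time.time() if now is None else now
        self._until[(session, provider, model)] = now + seconds

    def is_blocked(self, session, provider, model, now=None):
        now = time.time() if now is None else now
        return now < self._until.get((session, provider, model), 0.0)

    def first_available(self, session, candidates, now=None):
        """Return the first (provider, model) not in cooldown, else None."""
        for provider, model in candidates:
            if not self.is_blocked(session, provider, model, now):
                return provider, model
        return None


cooldowns = ScopedCooldowns()
chain = [("anthropic", "opus"), ("anthropic", "sonnet"), ("gemini", "gemini-model")]

# The sub-agent's Opus request times out at t=1000.
cooldowns.record_failure("sub-agent-1", "anthropic", "opus", seconds=60, now=1000.0)

# The sub-agent falls back to Sonnet; the main agent is unaffected.
sub_choice = cooldowns.first_available("sub-agent-1", chain, now=1001.0)
main_choice = cooldowns.first_available("main", chain, now=1001.0)

print(sub_choice)   # ('anthropic', 'sonnet')
print(main_choice)  # ('anthropic', 'opus')
```

One design caveat: a timeout is plausibly specific to a model and request, but a 429 from quota exhaustion may genuinely apply to the whole provider account. A real implementation might therefore choose the scope per failure type instead of using the narrowest key for everything.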
Environment
- OpenClaw 2026.2.17
- macOS (arm64)
- Multi-agent setup with shared auth profiles
- Anthropic as primary provider