feat: provider cooldown / circuit breaker #5442

Open: kshitijk4poor wants to merge 1 commit into NousResearch:main from kshitijk4poor:feat/provider-cooldown-circuit-breaker

kshitijk4poor (Collaborator) commented Apr 6, 2026

Fixes #5436. Also addresses #5451 (provider runtime health observability).

Summary

Adds a process-scoped provider cooldown tracker (circuit breaker) and runtime health observability.

1. Provider cooldown / circuit breaker

Problem: When a provider returns 401/403/429, Hermes retries and may then fall back to another provider, but on the next iteration it calls the same failing provider again because it keeps no memory of prior failures.

Introduces agent/provider_cooldown.py with:

  • CooldownEntry dataclass tracking error count, cooldown expiry, failure reason
  • ProviderCooldownTracker (thread-safe singleton) with escalating backoff:
    • Transient errors (rate_limit, overloaded, auth): 30s → 60s → 5min
    • Permanent errors (auth_permanent, billing): 5min → 10min → 30min
  • Circuit breaker semantics: record_failure() opens/escalates, record_success() closes
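
For illustration, the escalation described above could be sketched roughly as follows. This is not the code in agent/provider_cooldown.py: the field names, the `remaining()` helper, and the injectable `clock` parameter are assumptions, and the singleton accessor is left out. The sketch also assumes the error count is kept until a success; how the real module treats expired entries is not shown here.

```python
import threading
import time
from dataclasses import dataclass
from typing import Callable, Dict

# Backoff schedules in seconds, taken from the PR description.
TRANSIENT_BACKOFF = (30, 60, 300)      # 30s -> 60s -> 5min
PERMANENT_BACKOFF = (300, 600, 1800)   # 5min -> 10min -> 30min
PERMANENT_REASONS = {"auth_permanent", "billing"}


@dataclass
class CooldownEntry:
    # Field names are hypothetical; the PR only lists what is tracked.
    error_count: int
    cooldown_until: float
    reason: str


class ProviderCooldownTracker:
    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock              # injectable so tests can fake time
        self._lock = threading.Lock()    # guards _entries across threads
        self._entries: Dict[str, CooldownEntry] = {}

    def record_failure(self, provider: str, reason: str) -> float:
        """Open or escalate the breaker; return the cooldown in seconds."""
        permanent = reason in PERMANENT_REASONS
        schedule = PERMANENT_BACKOFF if permanent else TRANSIENT_BACKOFF
        with self._lock:
            prev = self._entries.get(provider)
            count = (prev.error_count if prev else 0) + 1
            # Clamp to the last step once the schedule is exhausted.
            delay = schedule[min(count, len(schedule)) - 1]
            self._entries[provider] = CooldownEntry(
                count, self._clock() + delay, reason
            )
            return delay

    def record_success(self, provider: str) -> None:
        """Close the breaker by forgetting the provider's failures."""
        with self._lock:
            self._entries.pop(provider, None)

    def remaining(self, provider: str) -> float:
        """Seconds left in the cooldown, or 0.0 if the provider is usable."""
        with self._lock:
            entry = self._entries.get(provider)
            if entry is None:
                return 0.0
            return max(0.0, entry.cooldown_until - self._clock())
```

Passing the clock in rather than calling `time.monotonic()` directly keeps the escalation logic testable without sleeping.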

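Integration: the tracker is checked before each API call. If the provider is cooling down, the agent either falls back to another provider or waits. A successful call resets the provider's entry.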
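
A minimal sketch of the pre-call check, assuming the caller has an ordered list of provider names and some way to ask how many seconds of cooldown remain for each. The `pick_provider` function and the `remaining` callable are hypothetical and are not taken from the PR.

```python
from typing import Callable, Sequence, Tuple


def pick_provider(
    providers: Sequence[str],
    remaining: Callable[[str], float],
) -> Tuple[str, float]:
    """Return (provider, wait_seconds) for the next API call.

    Prefers the first provider that is not cooling down (wait is 0.0).
    If every provider is cooling down, returns the one whose cooldown
    ends soonest, together with how long the caller should wait.
    """
    if not providers:
        raise ValueError("at least one provider is required")
    best, best_wait = providers[0], remaining(providers[0])
    for name in providers:
        wait = remaining(name)
        if wait <= 0:
            return name, 0.0          # fallback: usable right now
        if wait < best_wait:
            best, best_wait = name, wait
    return best, best_wait            # wait: everything is cooling down
```

The caller would sleep for `wait_seconds` when it is non-zero, make the request, and then report the outcome back to the tracker so the breaker either closes or escalates.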

2. Provider health observability

Extends the tracker with ProviderHealthStats:

  • Tracks per-provider success count, error count, average latency, last error reason
  • get_health_summary() returns a compact dict suitable for /status display
  • Health stats survive cooldown expiry (session-scoped)
  • Latency captured from actual API call timing
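
As a rough sketch, the stats listed above could be accumulated like this. The `HealthRegistry` class, the field names, and the keys of the summary dict are assumptions made for illustration; the PR only names ProviderHealthStats and get_health_summary(). The sketch also assumes latency is averaged over successful calls only.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ProviderHealthStats:
    success_count: int = 0
    error_count: int = 0
    total_latency_s: float = 0.0
    last_error_reason: Optional[str] = None

    @property
    def avg_latency_s(self) -> float:
        # Assumption: only successful calls contribute to the average.
        if self.success_count == 0:
            return 0.0
        return self.total_latency_s / self.success_count


class HealthRegistry:
    """Hypothetical holder for per-provider stats (no locking shown)."""

    def __init__(self) -> None:
        self._stats: Dict[str, ProviderHealthStats] = {}

    def _get(self, provider: str) -> ProviderHealthStats:
        return self._stats.setdefault(provider, ProviderHealthStats())

    def record_success(self, provider: str, latency_s: float) -> None:
        stats = self._get(provider)
        stats.success_count += 1
        stats.total_latency_s += latency_s

    def record_failure(self, provider: str, reason: str) -> None:
        stats = self._get(provider)
        stats.error_count += 1
        stats.last_error_reason = reason

    def get_health_summary(self) -> Dict[str, dict]:
        """Compact per-provider dict, e.g. for a /status display."""
        return {
            name: {
                "successes": s.success_count,
                "errors": s.error_count,
                "avg_latency_ms": round(s.avg_latency_s * 1000, 1),
                "last_error": s.last_error_reason,
            }
            for name, s in self._stats.items()
        }
```

Keeping these counters separate from the cooldown entries is what lets them outlive a cooldown: closing the breaker removes the cooldown entry but leaves the stats untouched.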

Independent of #5441 (P0 error classification) — uses plain string reason codes.

Tests

63 tests covering escalating backoff, circuit close on success, thread safety, expired auto-clear, singleton pattern, health stats accumulation/persistence/filtering, and diagnostics.
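
To give a feel for the singleton and thread-safety cases, here is one way such a test could look. It is not copied from the PR's test suite: `Tracker` is a stand-in class, and `get_tracker()` is a hypothetical accessor using double-checked locking.

```python
import threading
from typing import List, Optional


class Tracker:
    """Stand-in for the real tracker; only object identity matters here."""


_instance: Optional[Tracker] = None
_instance_lock = threading.Lock()


def get_tracker() -> Tracker:
    """Return the process-wide tracker, creating it on first use."""
    global _instance
    if _instance is None:              # fast path, no lock taken
        with _instance_lock:
            if _instance is None:      # re-check while holding the lock
                _instance = Tracker()
    return _instance


def test_singleton_is_shared_across_threads() -> None:
    seen: List[Tracker] = []
    seen_lock = threading.Lock()

    def worker() -> None:
        tracker = get_tracker()
        with seen_lock:
            seen.append(tracker)

    threads = [threading.Thread(target=worker) for _ in range(16)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    # Every thread must have received the same instance.
    assert len(seen) == 16
    assert all(tracker is seen[0] for tracker in seen)
```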

kshitijk4poor force-pushed the feat/provider-cooldown-circuit-breaker branch from 45e1ec5 to b2ac277 on April 6, 2026 07:49

alt-glitch added labels on May 1, 2026: comp/agent (Core agent loop, run_agent.py, prompt builder), P3 (Low: cosmetic, nice to have), type/feature (New feature or request)

alt-glitch (Collaborator) commented

Related to merged #15134 (consolidated retry fix with cooldown logic). Verify overlap before merging.


Development

Successfully merging this pull request may close these issues.

Provider cooldown / circuit breaker for persistent failures
