feat: structured provider error classification #5441

Open
kshitijk4poor wants to merge 1 commit into NousResearch:main from kshitijk4poor:feat/structured-error-classification

kshitijk4poor commented Apr 6, 2026

Collaborator

Fixes #5435. Also addresses #5449 (rate limit header pre-emption) and adds retry jitter.

Summary

Extracts the fragile inline string-matching error classification from run_agent.py's retry loop into a typed, testable module. Also adds proactive rate limit tracking and retry jitter.

1. Structured error classification

Current state: Error classification is ~90 lines of inline any(phrase in error_msg for phrase in [...]) scattered across the retry loop. New providers with different wording can slip through, causing retryable errors to be treated as permanent or vice versa.

Introduces agent/provider_errors.py with:

  • ProviderErrorReason enum (13 typed reasons: AUTH, AUTH_PERMANENT, RATE_LIMIT, OVERLOADED, BILLING, MODEL_NOT_FOUND, CONTEXT_OVERFLOW, PAYLOAD_TOO_LARGE, FORMAT_ERROR, TIMEOUT, SERVER_ERROR, STREAM_DROP, UNKNOWN)
  • ProviderError dataclass with reason, status_code, retryable flag
  • classify_provider_error() function preserving ALL existing classification logic
  • is_retryable() and suggested_action() helpers

~87 lines of inline matching are removed and replaced by a single classify_provider_error() call. Classification behavior is unchanged.
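For readers who have not opened the diff, here is a minimal sketch of what a typed classifier along these lines could look like. The enum members, the dataclass fields, and the function name come from the description above. Everything else is an assumption made for illustration: the function signature, the phrase table, the status-code table, and the set of retryable reasons. None of it is the actual content of agent/provider_errors.py.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ProviderErrorReason(Enum):
    AUTH = "auth"
    AUTH_PERMANENT = "auth_permanent"
    RATE_LIMIT = "rate_limit"
    OVERLOADED = "overloaded"
    BILLING = "billing"
    MODEL_NOT_FOUND = "model_not_found"
    CONTEXT_OVERFLOW = "context_overflow"
    PAYLOAD_TOO_LARGE = "payload_too_large"
    FORMAT_ERROR = "format_error"
    TIMEOUT = "timeout"
    SERVER_ERROR = "server_error"
    STREAM_DROP = "stream_drop"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class ProviderError:
    reason: ProviderErrorReason
    status_code: Optional[int]
    retryable: bool


# Assumption: which reasons are worth retrying. The real table may differ.
_RETRYABLE = frozenset({
    ProviderErrorReason.RATE_LIMIT,
    ProviderErrorReason.OVERLOADED,
    ProviderErrorReason.TIMEOUT,
    ProviderErrorReason.SERVER_ERROR,
    ProviderErrorReason.STREAM_DROP,
})

# Assumption: illustrative phrases only, checked in order (first match wins).
_PHRASE_REASONS = (
    ("context length", ProviderErrorReason.CONTEXT_OVERFLOW),
    ("rate limit", ProviderErrorReason.RATE_LIMIT),
    ("overloaded", ProviderErrorReason.OVERLOADED),
    ("timed out", ProviderErrorReason.TIMEOUT),
)

# Assumption: illustrative HTTP status fallbacks when no phrase matches.
_STATUS_REASONS = {
    401: ProviderErrorReason.AUTH,
    402: ProviderErrorReason.BILLING,
    404: ProviderErrorReason.MODEL_NOT_FOUND,
    413: ProviderErrorReason.PAYLOAD_TOO_LARGE,
    429: ProviderErrorReason.RATE_LIMIT,
}


def _reason_for(msg: str, status_code: Optional[int]) -> ProviderErrorReason:
    for phrase, reason in _PHRASE_REASONS:
        if phrase in msg:
            return reason
    if status_code in _STATUS_REASONS:
        return _STATUS_REASONS[status_code]
    if status_code is not None and 500 <= status_code < 600:
        return ProviderErrorReason.SERVER_ERROR
    return ProviderErrorReason.UNKNOWN


def classify_provider_error(error_msg: str,
                            status_code: Optional[int] = None) -> ProviderError:
    """Map a raw error message and optional HTTP status to a typed error."""
    reason = _reason_for(error_msg.lower(), status_code)
    return ProviderError(reason, status_code, reason in _RETRYABLE)
```

With this shape, the retry loop branches on `err.reason` or `err.retryable` instead of re-scanning the message, and each phrase or status rule becomes a table entry that can be unit-tested on its own.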

2. Rate limit header pre-emption

Problem: Hermes only reacts to rate limits after receiving a 429. Response headers like X-RateLimit-Remaining are ignored.

Adds RateLimitState tracker and parse_rate_limit_headers():

  • Parses X-RateLimit-Remaining, X-RateLimit-Limit, X-RateLimit-Reset from successful responses
  • When remaining quota drops below 10%, inserts a small adaptive delay before the next API call
  • Prevents hard 429s by smoothing out request rate proactively
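A rough sketch of how the header parsing and the throttle decision might fit together. RateLimitState, parse_rate_limit_headers(), the three header names, and the 10% threshold come from the description above. The field names, the throttle_delay() helper, the 5-second cap, the 0.5-second fallback, and the reading of X-RateLimit-Reset as "seconds until reset" are assumptions for illustration; providers differ on that header's format.

```python
import math
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class RateLimitState:
    remaining: Optional[int] = None
    limit: Optional[int] = None
    reset_seconds: Optional[float] = None  # assumed: seconds until the window resets


def _to_int(value) -> Optional[int]:
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def _to_seconds(value) -> Optional[float]:
    try:
        seconds = float(value)
    except (TypeError, ValueError):
        return None
    return seconds if math.isfinite(seconds) else None


def parse_rate_limit_headers(headers: Mapping[str, str]) -> RateLimitState:
    """Read the X-RateLimit-* headers case-insensitively; bad values become None."""
    lowered = {key.lower(): value for key, value in headers.items()}
    return RateLimitState(
        remaining=_to_int(lowered.get("x-ratelimit-remaining")),
        limit=_to_int(lowered.get("x-ratelimit-limit")),
        reset_seconds=_to_seconds(lowered.get("x-ratelimit-reset")),
    )


def throttle_delay(state: RateLimitState,
                   threshold: float = 0.10,
                   max_delay: float = 5.0) -> float:
    """Return seconds to sleep before the next call (0.0 means no throttling)."""
    if state.remaining is None or state.limit is None or state.limit <= 0:
        return 0.0  # not enough information to decide
    if state.remaining / state.limit >= threshold:
        return 0.0  # plenty of quota left
    if state.reset_seconds is None or state.reset_seconds <= 0:
        return min(0.5, max_delay)  # assumed fallback when reset time is unknown
    # Spread the remaining requests evenly over the time left in the window.
    per_request = state.reset_seconds / max(state.remaining, 1)
    return min(per_request, max_delay)
```

For example, with 5 of 100 requests remaining and 10 seconds until reset, this sketch suggests a 2-second pause; with 50 of 100 remaining it returns 0.0 and the call proceeds immediately.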

3. Retry jitter

Adds 0–25% random jitter to both retry backoff paths to prevent thundering-herd retries when multiple sessions hit the same provider limit.
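To make the jitter concrete, here is a small sketch of additive 0–25% jitter on top of exponential backoff. The function name, the base and cap values, and the exponential schedule are assumptions for illustration; only the 0–25% range comes from the description above.

```python
import random
from typing import Optional


def backoff_with_jitter(attempt: int,
                        base: float = 1.0,
                        cap: float = 60.0,
                        jitter: float = 0.25,
                        rng: Optional[random.Random] = None) -> float:
    """Exponential backoff, capped, then stretched by a random 0..jitter fraction."""
    source = rng if rng is not None else random
    delay = min(cap, base * (2 ** attempt))
    # Jitter is applied after the cap, so the result can exceed cap by up to 25%.
    return delay * (1.0 + source.uniform(0.0, jitter))
```

Two sessions that fail at the same moment on attempt 3 would each wait somewhere between 8 and 10 seconds rather than exactly 8, which spreads their retries apart instead of letting them hit the provider in lockstep.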

Tests

129 tests covering every classification path, rate limit header parsing, throttle decisions, edge cases, priority ordering, retryability, and suggested actions. All 226 existing test_run_agent.py tests pass.

kshitijk4poor force-pushed the feat/structured-error-classification branch from b86a04c to 23e69cc on April 6, 2026 07:49
alt-glitch added the labels comp/agent (Core agent loop, run_agent.py, prompt builder), P3 (Low: cosmetic, nice to have), and type/refactor (Code restructuring, no behavior change) on May 1, 2026
@alt-glitch

Collaborator

Note: #6514 (structured API error classification for smart failover) was already merged. Verify whether this PR adds value beyond what #6514 delivered, or is a duplicate.

Development

Successfully merging this pull request may close these issues.

Structured provider error classification

2 participants