feat: structured provider error classification#5441
Open
kshitijk4poor wants to merge 1 commit into
This was referenced Apr 6, 2026
Fixes #5435. Also addresses #5449 (rate limit header pre-emption) and adds retry jitter.
Summary
Extracts the fragile inline string-matching error classification from `run_agent.py`'s retry loop into a typed, testable module. Also adds proactive rate limit tracking and retry jitter.
1. Structured error classification
Current state: Error classification is ~90 lines of inline `any(phrase in error_msg for phrase in [...])` checks scattered across the retry loop. New providers with different wording can slip through, causing retryable errors to be treated as permanent or vice versa.
Introduces `agent/provider_errors.py` with:
- `ProviderErrorReason` enum (13 typed reasons: AUTH, AUTH_PERMANENT, RATE_LIMIT, OVERLOADED, BILLING, MODEL_NOT_FOUND, CONTEXT_OVERFLOW, PAYLOAD_TOO_LARGE, FORMAT_ERROR, TIMEOUT, SERVER_ERROR, STREAM_DROP, UNKNOWN)
- `ProviderError` dataclass with reason, status_code, retryable flag
- `classify_provider_error()` function preserving ALL existing classification logic
- `is_retryable()` and `suggested_action()` helpers
~87 lines of inline matching removed, replaced by a single classify call. Zero behavior change for classification.
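
To make the shape of the new module concrete, here is a minimal sketch of how a typed classifier along these lines could look. It is not the code from this PR: the enum string values, the `classify_provider_error(status_code, message)` signature, the matched phrases, the status-code mapping, and the set of retryable reasons are all assumptions chosen for illustration, and only a few of the 13 reasons are mapped.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ProviderErrorReason(Enum):
    # The member names come from the PR description; the values are assumed.
    AUTH = "auth"
    AUTH_PERMANENT = "auth_permanent"
    RATE_LIMIT = "rate_limit"
    OVERLOADED = "overloaded"
    BILLING = "billing"
    MODEL_NOT_FOUND = "model_not_found"
    CONTEXT_OVERFLOW = "context_overflow"
    PAYLOAD_TOO_LARGE = "payload_too_large"
    FORMAT_ERROR = "format_error"
    TIMEOUT = "timeout"
    SERVER_ERROR = "server_error"
    STREAM_DROP = "stream_drop"
    UNKNOWN = "unknown"


# Assumption: this retryable set is illustrative, not the PR's actual policy.
_RETRYABLE = frozenset({
    ProviderErrorReason.RATE_LIMIT,
    ProviderErrorReason.OVERLOADED,
    ProviderErrorReason.TIMEOUT,
    ProviderErrorReason.SERVER_ERROR,
    ProviderErrorReason.STREAM_DROP,
})


@dataclass(frozen=True)
class ProviderError:
    reason: ProviderErrorReason
    status_code: Optional[int]
    retryable: bool


def classify_provider_error(status_code: Optional[int], message: str) -> ProviderError:
    """Map a status code and error message to a typed reason (hypothetical rules)."""
    R = ProviderErrorReason
    text = message.lower()
    if "context length" in text or "maximum context" in text:
        reason = R.CONTEXT_OVERFLOW
    elif status_code == 401:
        reason = R.AUTH
    elif status_code == 402:
        reason = R.BILLING
    elif status_code == 404:
        reason = R.MODEL_NOT_FOUND
    elif status_code == 413:
        reason = R.PAYLOAD_TOO_LARGE
    elif status_code == 429:
        reason = R.RATE_LIMIT
    elif status_code == 529 or "overloaded" in text:
        # Checked before the generic 5xx branch so it takes priority.
        reason = R.OVERLOADED
    elif "timed out" in text or "timeout" in text:
        reason = R.TIMEOUT
    elif status_code is not None and 500 <= status_code < 600:
        reason = R.SERVER_ERROR
    else:
        reason = R.UNKNOWN
    return ProviderError(reason, status_code, reason in _RETRYABLE)


def is_retryable(error: ProviderError) -> bool:
    return error.retryable
```

With a structure like this, the retry loop can branch on `error.reason` or `is_retryable(error)` instead of re-scanning the message text, and the branch ordering becomes something a unit test can pin down.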
2. Rate limit header pre-emption
Problem: Hermes only reacts to rate limits after receiving a 429. Response headers like `X-RateLimit-Remaining` are ignored.
Adds a `RateLimitState` tracker and `parse_rate_limit_headers()`:
- Reads `X-RateLimit-Remaining`, `X-RateLimit-Limit`, `X-RateLimit-Reset` from successful responses
3. Retry jitter
Adds 0–25% random jitter to both retry backoff paths to prevent a thundering herd when multiple sessions hit the same provider limit.
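
As a rough illustration of the jitter described above, the sketch below adds up to 25% on top of a base delay. The function name `backoff_with_jitter`, the exponential base schedule, the 60-second cap, and the injectable `rng` parameter are assumptions made for the example; the PR only states that 0–25% random jitter is added to the existing backoff paths.

```python
import random
from typing import Callable


def backoff_with_jitter(
    attempt: int,
    base: float = 1.0,
    cap: float = 60.0,
    jitter_fraction: float = 0.25,
    rng: Callable[[], float] = random.random,
) -> float:
    """Return a delay in seconds: a capped exponential backoff plus 0-25% jitter.

    The exponential schedule and the cap are assumed; only the jitter range
    comes from the PR description.
    """
    delay = min(cap, base * (2 ** attempt))
    # rng() is in [0, 1), so the extra delay is in [0, jitter_fraction * delay).
    return delay * (1.0 + jitter_fraction * rng())
```

Because each session draws its own random factor, sessions that hit the same limit at the same moment no longer retry in lockstep. Passing `rng` explicitly keeps the function deterministic under test.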
Tests
129 tests covering every classification path, rate limit header parsing, throttle decisions, edge cases, priority ordering, retryability, and suggested actions. All 226 existing `test_run_agent.py` tests pass.
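
To give a feel for the header-parsing and throttle-decision cases mentioned above, here is a small sketch of a tracker and the kind of checks a test might make against it. The field names, the `should_throttle()` method and its threshold, the case-insensitive lookup, and the treatment of malformed values are all assumptions for illustration; the PR's actual `RateLimitState` and `parse_rate_limit_headers()` may differ.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class RateLimitState:
    remaining: Optional[int] = None
    limit: Optional[int] = None
    # Raw numeric header value; whether it is seconds-until-reset or an
    # epoch timestamp varies by provider, so it is not interpreted here.
    reset: Optional[float] = None

    def should_throttle(self, threshold: int = 0) -> bool:
        """Assumed policy: throttle once remaining requests drop to the threshold."""
        return self.remaining is not None and self.remaining <= threshold


def _to_float(value: Optional[str]) -> Optional[float]:
    # Missing or malformed headers are treated as unknown rather than raising.
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def _to_int(value: Optional[str]) -> Optional[int]:
    number = _to_float(value)
    return None if number is None else int(number)


def parse_rate_limit_headers(headers: Mapping[str, str]) -> RateLimitState:
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    return RateLimitState(
        remaining=_to_int(lowered.get("x-ratelimit-remaining")),
        limit=_to_int(lowered.get("x-ratelimit-limit")),
        reset=_to_float(lowered.get("x-ratelimit-reset")),
    )


# Example checks in the spirit of the tests described above.
state = parse_rate_limit_headers(
    {"X-RateLimit-Remaining": "0", "X-RateLimit-Limit": "60", "X-RateLimit-Reset": "30"}
)
assert state.should_throttle()
assert not parse_rate_limit_headers({}).should_throttle()
```

Treating absent or unparsable headers as "unknown" means providers that do not send these headers are never throttled pre-emptively, which keeps the feature opt-in by provider behaviour.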