Skip to content

feat: structured API error classification for smart failover#6514

Merged
teknium1 merged 1 commit into
mainfrom
hermes/hermes-0429963a
Apr 9, 2026
Merged

feat: structured API error classification for smart failover#6514
teknium1 merged 1 commit into
mainfrom
hermes/hermes-0429963a

Conversation

@teknium1

@teknium1 teknium1 commented Apr 9, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds agent/error_classifier.py — a structured error classification system inspired by OpenClaw's failover error taxonomy. Replaces scattered inline string-matching in run_agent.py's retry loop with a centralized, priority-ordered classification pipeline.

What changed

New: agent/error_classifier.py (690 lines)

  • FailoverReason enum with 14 error categories
  • ClassifiedError dataclass with recovery action hints
  • classify_api_error() priority-ordered pipeline:
    1. Provider-specific patterns (Anthropic thinking sig, long-context tier)
    2. HTTP status code + message-aware refinement
    3. Structured error codes from body
    4. Message pattern matching (billing vs rate_limit vs context overflow)
    5. Server disconnect + large session → context overflow heuristic
    6. Transport error heuristics
    7. Fallback: unknown (retryable)

Modified: run_agent.py (67 insertions, 122 deletions = net -55 lines)

  • Classification call after error context extraction
  • 6 inline detection blocks replaced with classifier checks
  • All recovery actions (pool rotation, fallback, compression) unchanged

New: tests/agent/test_error_classifier.py (65 tests)

Key improvements

Error Old behavior New behavior
402 'usage limit, try again' Billing (non-retryable) Rate limit (retryable, backoff)
403 'key limit exceeded' (OpenRouter) Auth error Billing (rotate credential)
402 'insufficient credits' Billing Billing (unchanged, correct)
RemoteProtocolError + large session Timeout (retry) Context overflow (compress)
Error with cause chain wrapping Status code missed Walks cause chain up to 5 levels
SDK body message patterns Invisible to matcher Enriched error_msg includes body

Test results

  • 65 new unit tests: ✔
  • 1670 existing run_agent/agent tests: ✔ (1 pre-existing failure from missing anthropic package)
  • 10 E2E integration tests: ✔
  • Live tests with real OpenAI SDK error objects (AuthenticationError, RateLimitError, NotFoundError, APIStatusError, APIConnectionError) + real httpx transport errors: ✔

Live testing caught 3 issues unit tests missed:

  1. SDK str(error) doesn't include body message → enriched error_msg
  2. Transport type check ordering shadowed disconnect+large-session heuristic
  3. APIConnectionError isn't a Python ConnectionError subclass

@teknium1 teknium1 force-pushed the hermes/hermes-0429963a branch 3 times, most recently from 835d514 to d328f91 Compare April 9, 2026 10:21
Add agent/error_classifier.py with a priority-ordered classification
pipeline that replaces scattered inline string-matching in the retry
loop with structured error taxonomy and recovery hints.

FailoverReason enum (14 categories): auth, auth_permanent, billing,
rate_limit, overloaded, server_error, timeout, context_overflow,
payload_too_large, model_not_found, format_error, thinking_signature,
long_context_tier, unknown.

ClassifiedError dataclass carries reason + recovery action hints
(retryable, should_compress, should_rotate_credential, should_fallback).

Key improvements over inline matching:
- 402 disambiguation: 'insufficient credits' = billing (immediate rotate),
  'usage limit, try again' = rate_limit (backoff first)
- OpenRouter 403 'key limit exceeded' correctly classified as billing
- Error cause chain walking (walks __cause__/__context__ up to 5 levels)
- Body message included in pattern matching (SDK str() misses it)
- Server disconnect + large session check ordered before generic transport
  catch so RemoteProtocolError triggers compression when appropriate
- Chinese error message support for context overflow

run_agent.py: replaced 6 inline detection blocks with classifier calls,
net -55 lines. All recovery actions (pool rotation, fallback activation,
compression, transport recovery) unchanged.

65 new unit tests + 10 E2E tests + live tests with real SDK error objects.
Inspired by OpenClaw's failover error classification system.
@teknium1 teknium1 force-pushed the hermes/hermes-0429963a branch from d328f91 to cab3340 Compare April 9, 2026 10:46
@teknium1 teknium1 merged commit 1a3ae6a into main Apr 9, 2026
4 of 6 checks passed
Tommyeds pushed a commit to Tommyeds/hermes-agent that referenced this pull request Apr 12, 2026
…earch#6514)

Add agent/error_classifier.py with a priority-ordered classification
pipeline that replaces scattered inline string-matching in the retry
loop with structured error taxonomy and recovery hints.

FailoverReason enum (14 categories): auth, auth_permanent, billing,
rate_limit, overloaded, server_error, timeout, context_overflow,
payload_too_large, model_not_found, format_error, thinking_signature,
long_context_tier, unknown.

ClassifiedError dataclass carries reason + recovery action hints
(retryable, should_compress, should_rotate_credential, should_fallback).

Key improvements over inline matching:
- 402 disambiguation: 'insufficient credits' = billing (immediate rotate),
  'usage limit, try again' = rate_limit (backoff first)
- OpenRouter 403 'key limit exceeded' correctly classified as billing
- Error cause chain walking (walks __cause__/__context__ up to 5 levels)
- Body message included in pattern matching (SDK str() misses it)
- Server disconnect + large session check ordered before generic transport
  catch so RemoteProtocolError triggers compression when appropriate
- Chinese error message support for context overflow

run_agent.py: replaced 6 inline detection blocks with classifier calls,
net -55 lines. All recovery actions (pool rotation, fallback activation,
compression, transport recovery) unchanged.

65 new unit tests + 10 E2E tests + live tests with real SDK error objects.
Inspired by OpenClaw's failover error classification system.
angelburgosrosado pushed a commit to angelburgosrosado/hermes-agent that referenced this pull request Apr 27, 2026
…earch#6514)

Add agent/error_classifier.py with a priority-ordered classification
pipeline that replaces scattered inline string-matching in the retry
loop with structured error taxonomy and recovery hints.

FailoverReason enum (14 categories): auth, auth_permanent, billing,
rate_limit, overloaded, server_error, timeout, context_overflow,
payload_too_large, model_not_found, format_error, thinking_signature,
long_context_tier, unknown.

ClassifiedError dataclass carries reason + recovery action hints
(retryable, should_compress, should_rotate_credential, should_fallback).

Key improvements over inline matching:
- 402 disambiguation: 'insufficient credits' = billing (immediate rotate),
  'usage limit, try again' = rate_limit (backoff first)
- OpenRouter 403 'key limit exceeded' correctly classified as billing
- Error cause chain walking (walks __cause__/__context__ up to 5 levels)
- Body message included in pattern matching (SDK str() misses it)
- Server disconnect + large session check ordered before generic transport
  catch so RemoteProtocolError triggers compression when appropriate
- Chinese error message support for context overflow

run_agent.py: replaced 6 inline detection blocks with classifier calls,
net -55 lines. All recovery actions (pool rotation, fallback activation,
compression, transport recovery) unchanged.

65 new unit tests + 10 E2E tests + live tests with real SDK error objects.
Inspired by OpenClaw's failover error classification system.
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
…earch#6514)

Add agent/error_classifier.py with a priority-ordered classification
pipeline that replaces scattered inline string-matching in the retry
loop with structured error taxonomy and recovery hints.

FailoverReason enum (14 categories): auth, auth_permanent, billing,
rate_limit, overloaded, server_error, timeout, context_overflow,
payload_too_large, model_not_found, format_error, thinking_signature,
long_context_tier, unknown.

ClassifiedError dataclass carries reason + recovery action hints
(retryable, should_compress, should_rotate_credential, should_fallback).

Key improvements over inline matching:
- 402 disambiguation: 'insufficient credits' = billing (immediate rotate),
  'usage limit, try again' = rate_limit (backoff first)
- OpenRouter 403 'key limit exceeded' correctly classified as billing
- Error cause chain walking (walks __cause__/__context__ up to 5 levels)
- Body message included in pattern matching (SDK str() misses it)
- Server disconnect + large session check ordered before generic transport
  catch so RemoteProtocolError triggers compression when appropriate
- Chinese error message support for context overflow

run_agent.py: replaced 6 inline detection blocks with classifier calls,
net -55 lines. All recovery actions (pool rotation, fallback activation,
compression, transport recovery) unchanged.

65 new unit tests + 10 E2E tests + live tests with real SDK error objects.
Inspired by OpenClaw's failover error classification system.
olympus-terminal pushed a commit to olympus-terminal/hermes-agent that referenced this pull request May 16, 2026
…earch#6514)

Add agent/error_classifier.py with a priority-ordered classification
pipeline that replaces scattered inline string-matching in the retry
loop with structured error taxonomy and recovery hints.

FailoverReason enum (14 categories): auth, auth_permanent, billing,
rate_limit, overloaded, server_error, timeout, context_overflow,
payload_too_large, model_not_found, format_error, thinking_signature,
long_context_tier, unknown.

ClassifiedError dataclass carries reason + recovery action hints
(retryable, should_compress, should_rotate_credential, should_fallback).

Key improvements over inline matching:
- 402 disambiguation: 'insufficient credits' = billing (immediate rotate),
  'usage limit, try again' = rate_limit (backoff first)
- OpenRouter 403 'key limit exceeded' correctly classified as billing
- Error cause chain walking (walks __cause__/__context__ up to 5 levels)
- Body message included in pattern matching (SDK str() misses it)
- Server disconnect + large session check ordered before generic transport
  catch so RemoteProtocolError triggers compression when appropriate
- Chinese error message support for context overflow

run_agent.py: replaced 6 inline detection blocks with classifier calls,
net -55 lines. All recovery actions (pool rotation, fallback activation,
compression, transport recovery) unchanged.

65 new unit tests + 10 E2E tests + live tests with real SDK error objects.
Inspired by OpenClaw's failover error classification system.
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
…earch#6514)

Add agent/error_classifier.py with a priority-ordered classification
pipeline that replaces scattered inline string-matching in the retry
loop with structured error taxonomy and recovery hints.

FailoverReason enum (14 categories): auth, auth_permanent, billing,
rate_limit, overloaded, server_error, timeout, context_overflow,
payload_too_large, model_not_found, format_error, thinking_signature,
long_context_tier, unknown.

ClassifiedError dataclass carries reason + recovery action hints
(retryable, should_compress, should_rotate_credential, should_fallback).

Key improvements over inline matching:
- 402 disambiguation: 'insufficient credits' = billing (immediate rotate),
  'usage limit, try again' = rate_limit (backoff first)
- OpenRouter 403 'key limit exceeded' correctly classified as billing
- Error cause chain walking (walks __cause__/__context__ up to 5 levels)
- Body message included in pattern matching (SDK str() misses it)
- Server disconnect + large session check ordered before generic transport
  catch so RemoteProtocolError triggers compression when appropriate
- Chinese error message support for context overflow

run_agent.py: replaced 6 inline detection blocks with classifier calls,
net -55 lines. All recovery actions (pool rotation, fallback activation,
compression, transport recovery) unchanged.

65 new unit tests + 10 E2E tests + live tests with real SDK error objects.
Inspired by OpenClaw's failover error classification system.
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
…earch#6514)

Add agent/error_classifier.py with a priority-ordered classification
pipeline that replaces scattered inline string-matching in the retry
loop with structured error taxonomy and recovery hints.

FailoverReason enum (14 categories): auth, auth_permanent, billing,
rate_limit, overloaded, server_error, timeout, context_overflow,
payload_too_large, model_not_found, format_error, thinking_signature,
long_context_tier, unknown.

ClassifiedError dataclass carries reason + recovery action hints
(retryable, should_compress, should_rotate_credential, should_fallback).

Key improvements over inline matching:
- 402 disambiguation: 'insufficient credits' = billing (immediate rotate),
  'usage limit, try again' = rate_limit (backoff first)
- OpenRouter 403 'key limit exceeded' correctly classified as billing
- Error cause chain walking (walks __cause__/__context__ up to 5 levels)
- Body message included in pattern matching (SDK str() misses it)
- Server disconnect + large session check ordered before generic transport
  catch so RemoteProtocolError triggers compression when appropriate
- Chinese error message support for context overflow

run_agent.py: replaced 6 inline detection blocks with classifier calls,
net -55 lines. All recovery actions (pool rotation, fallback activation,
compression, transport recovery) unchanged.

65 new unit tests + 10 E2E tests + live tests with real SDK error objects.
Inspired by OpenClaw's failover error classification system.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant