
fix(anthropic): retry 429/529 errors and surface error details to users #1585

Merged
teknium1 merged 1 commit into NousResearch:main from 0xbyt4:fix/anthropic-error-handling
Mar 16, 2026


0xbyt4 commented Mar 16, 2026

Contributor

Summary

  • 429 rate limit treated as non-retryable: run_agent.py classified all 4xx (except 413) as non-retryable client errors. As a result, 429 (rate limit) and 529 (overloaded) failed immediately instead of being retried with exponential backoff. Fixed by adding _RETRYABLE_STATUS_CODES = {413, 429, 529}.
  • Generic error message hid the real error: gateway/run.py caught all exceptions and returned "Sorry, I encountered an unexpected error" with no detail. The message now includes the error type, the error message (truncated to 300 chars), and status-specific hints (auth, rate limit, overloaded).
  • Failed agent returned an empty response: when run_conversation() returned {"failed": True, "final_response": None}, the gateway sent nothing back to the user. It now surfaces the actual error message.
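
A minimal sketch of the classification described in the first bullet, not the actual run_agent.py code. Only the name _RETRYABLE_STATUS_CODES and its contents come from this PR; the helper names is_non_retryable_client_error and backoff_delay, and the base and cap values, are hypothetical and chosen for illustration.

```python
# Set named in the PR summary: codes excluded from the fail-fast check.
_RETRYABLE_STATUS_CODES = {413, 429, 529}


def is_non_retryable_client_error(status_code):
    """Return True for 4xx codes that should fail immediately (hypothetical helper)."""
    if status_code is None:
        return False
    if status_code in _RETRYABLE_STATUS_CODES:
        return False
    return 400 <= status_code < 500


def backoff_delay(attempt, base=1.0, cap=30.0):
    """Exponential backoff in seconds; base and cap are assumed values."""
    return min(cap, base * (2 ** attempt))
```

With this shape, 400 still fails fast, while 429, 529, and 5xx codes such as 500 fall through to the retry path.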

Changes

  • run_agent.py (+5 lines): Exclude 429/529 from non-retryable client error classification
  • gateway/run.py (+25 lines): Informative error messages + handle None final_response
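
The gateway change could look roughly like the sketch below. This is an illustration under assumptions, not the code in gateway/run.py: the function names format_user_error and response_text, the hint wording, the status_code attribute on the exception, and the "error" key in the result dict are all hypothetical. Only the 300-character truncation, the three hint categories, and the {"failed": True, "final_response": None} shape come from the PR description.

```python
# Hypothetical hint texts; the PR only states that auth, rate limit,
# and overloaded cases get status-specific hints.
_STATUS_HINTS = {
    401: "Check that your API key or credentials are valid.",
    429: "You are being rate limited. Please wait a moment and try again.",
    529: "The provider is overloaded. Please try again shortly.",
}


def format_user_error(exc, max_len=300):
    """Build a user-facing message with error type, truncated detail, and a hint."""
    status = getattr(exc, "status_code", None)  # assumed attribute name
    detail = str(exc)[:max_len]
    message = f"Sorry, I encountered an error ({type(exc).__name__}): {detail}"
    hint = _STATUS_HINTS.get(status)
    if hint:
        message += f"\n{hint}"
    return message


def response_text(result):
    """Return the text to send back, never silently empty on failure."""
    final = result.get("final_response")
    if final is not None:
        return final
    if result.get("failed"):
        # "error" is an assumed key holding the failure description.
        return f"The request failed: {result.get('error') or 'unknown error'}"
    return ""
```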
  • tests/test_anthropic_error_handling.py (new, 8 tests): Full coverage of Anthropic error paths

Test plan

All 8 tests run through the real agent retry loop via _run_agent:

  • 429 rate limit → retried with backoff → recovers
  • 529 overloaded → retried with backoff → recovers
  • 429 always fails → exhausts all 3 retries before raising (not immediate fail)
  • 400 bad request → non-retryable, immediate fail with 1 API call (regression guard)
  • 500 server error → retried with backoff → recovers
  • 401 + credential refresh succeeds → recovers
  • 401 + credential refresh fails → non-retryable
  • "prompt is too long" → triggers context compression → recovers
  • Existing test suite passes (246 tests, 0 regressions)
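
The first few cases in the test plan can be illustrated with a self-contained sketch like the one below. It does not use the real _run_agent harness or the Anthropic client: FakeAPIError, make_flaky, and call_with_retries are hypothetical stand-ins, and the loop is a simplified model of a retry loop with an injectable sleep so that tests do not wait.

```python
class FakeAPIError(Exception):
    """Stand-in for a provider error carrying an HTTP status code."""

    def __init__(self, status_code):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def make_flaky(statuses):
    """Return a callable that raises each status in turn, then succeeds."""
    calls = {"count": 0}
    remaining = list(statuses)

    def call():
        calls["count"] += 1
        if remaining:
            raise FakeAPIError(remaining.pop(0))
        return "ok"

    return call, calls


def call_with_retries(call, max_retries=3, sleep=lambda seconds: None):
    """Simplified retry loop: fail fast on plain 4xx, back off otherwise."""
    for attempt in range(max_retries):
        try:
            return call()
        except FakeAPIError as err:
            code = err.status_code
            fail_fast = 400 <= code < 500 and code not in {413, 429, 529}
            if fail_fast or attempt == max_retries - 1:
                raise
            sleep(2 ** attempt)  # exponential backoff; no-op by default
```

A test in this style asserts on the call count as well as the outcome: one 429 followed by success means two calls, a 400 means exactly one call, and a persistent 429 means three calls before the error propagates.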

- 429 rate limit and 529 overloaded were incorrectly treated as
  non-retryable client errors, causing immediate failure instead of
  exponential backoff retry. Users hitting Anthropic rate limits got
  silent failures or no response at all.
- Generic "Sorry, I encountered an unexpected error" now includes
  error type, details, and status-specific hints (auth, rate limit,
  overloaded).
- Failed agent with final_response=None now surfaces the actual
  error message instead of returning an empty response.
@teknium1 teknium1 merged commit e6cf1c9 into NousResearch:main Mar 16, 2026
1 check passed
