Amindadgar PR#8486 retry ci fix #13519

Open
amindadgar wants to merge 4 commits into
NousResearch:main from
amindadgar:amindadgar-pr-8486-retry-ci-fix

@amindadgar

What does this PR do?

Finishes the retry/backoff work intended for PR #8486 by wiring configurable retry settings all the way from config into AIAgent, and by fixing the actual retry behavior in run_agent.py.

This change adds:

  • agent.max_api_retries for full API-call retries
  • agent.max_stream_retries for transient streaming reconnect retries

It also fixes the outer retry loop: it no longer uses a hardcoded retry count, it honors Retry-After headers (capped), it applies separate backoff schedules for rate limits and for other retryable errors, and it keeps streaming reconnect recovery separate from full-request retries. The change stays small and idiomatic by extending the existing retry paths instead of introducing a new abstraction.
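
To illustrate what honoring a capped Retry-After header can look like, here is a minimal sketch. It is not the code in run_agent.py: the function name `parse_retry_after` and the constant `MAX_RETRY_AFTER_SECONDS` are hypothetical, and the 300-second cap is taken from the "capped at 5 min" note in the commit message. It accepts both forms the header can take (delta-seconds and an HTTP date).

```python
import math
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

# Assumption: 5-minute cap, as described in the commit notes.
MAX_RETRY_AFTER_SECONDS = 300.0


def parse_retry_after(value: Optional[str],
                      now: Optional[datetime] = None) -> Optional[float]:
    """Return a capped wait in seconds, or None if the header is unusable."""
    if not value:
        return None
    value = value.strip()
    try:
        # Delta-seconds form, e.g. "120".
        seconds = float(value)
    except ValueError:
        try:
            # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
            target = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            return None
        if target.tzinfo is None:
            # Treat a naive timestamp as UTC.
            target = target.replace(tzinfo=timezone.utc)
        now = now or datetime.now(timezone.utc)
        seconds = (target - now).total_seconds()
    if math.isnan(seconds):
        return None
    # Clamp into [0, cap] so a hostile or buggy server cannot stall the agent.
    return min(max(seconds, 0.0), MAX_RETRY_AFTER_SECONDS)
```

Returning None when the header is missing or malformed lets the caller fall back to its own computed backoff.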

Related Issue

Fixes #5570

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)

Changes Made

  • Added agent.max_api_retries and agent.max_stream_retries defaults and normalization in hermes_cli/config.py
  • Added CLI-side defaults and runtime plumbing in cli.py
  • Bridged gateway config into env vars in gateway/run.py
  • Updated AIAgent retry config handling and retry behavior in run_agent.py
  • Fixed outer retry loop to honor configured retries instead of a hardcoded value
  • Added capped Retry-After handling and smarter backoff logic for rate limits vs other retryable errors
  • Kept non-streaming requests on the outer retry loop only, and streaming reconnects on the inner loop
  • Ensured stream retry recovery rebuilds the primary OpenAI client when needed
  • Updated retry/config regression tests in:
    • tests/test_api_retry_config.py
    • tests/hermes_cli/test_config.py
    • tests/cli/test_cli_init.py
    • tests/run_agent/test_run_agent.py
  • Updated config/docs examples in:
    • cli-config.yaml.example
    • docs/acp-setup.md
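
The normalization step listed above could look roughly like the following sketch. It is an illustration under assumptions, not the contents of hermes_cli/config.py: the helper `normalize_retry_config` is hypothetical, the defaults (3 and 2) come from the commit messages, and falling back to the default on unparsable values and clamping negatives to 0 are guesses at reasonable behavior.

```python
from typing import Any, Dict, Mapping

# Defaults as stated in the commit messages for this PR.
DEFAULT_RETRIES: Dict[str, int] = {
    "max_api_retries": 3,
    "max_stream_retries": 2,
}


def normalize_retry_config(agent_cfg: Mapping[str, Any]) -> Dict[str, int]:
    """Coerce the retry keys of the `agent:` config section to non-negative ints."""
    normalized: Dict[str, int] = {}
    for key, default in DEFAULT_RETRIES.items():
        raw = agent_cfg.get(key, default)
        try:
            value = int(raw)  # accepts 5, "5", 5.0
        except (TypeError, ValueError):
            value = default   # assumption: bad values fall back to the default
        normalized[key] = max(value, 0)  # assumption: negatives mean "no retries"
    return normalized
```

For example, `normalize_retry_config({"max_api_retries": "10"})` keeps the user's 10 for API retries and fills in the default for stream retries.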

How to Test

  1. Set retry config in ~/.hermes/config.yaml, for example:
    agent:
      max_api_retries: 5
      max_stream_retries: 5
  2. Run the focused regression suite:
    HERMES_HOME=/tmp/hermes-ci-home python -m pytest \
      tests/run_agent/test_run_agent.py \
      tests/run_agent/test_provider_fallback.py \
      tests/run_agent/test_streaming.py \
      tests/run_agent/test_openai_client_lifecycle.py \
      tests/test_api_retry_config.py \
      tests/hermes_cli/test_config.py \
      tests/cli/test_cli_init.py \
      -q -o addopts=''
  3. Verify the suite passes and confirm retry behavior:
    • outer retries respect max_api_retries
    • stream reconnects respect max_stream_retries
    • Retry-After is honored and capped
    • rate-limit and generic retryable errors use different backoff behavior

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Screenshots / Logs

Focused verification passed locally:

361 passed in 17.93s

Command used:

HERMES_HOME=/tmp/hermes-ci-home python -m pytest \
  tests/run_agent/test_run_agent.py \
  tests/run_agent/test_provider_fallback.py \
  tests/run_agent/test_streaming.py \
  tests/run_agent/test_openai_client_lifecycle.py \
  tests/test_api_retry_config.py \
  tests/hermes_cli/test_config.py \
  tests/cli/test_cli_init.py \
  -q -o addopts=''

iRonin and others added 4 commits April 12, 2026 11:11
agent.max_api_retries in config.yaml (default 3, user set to 10).

Backoff improvements:
- Respects Retry-After header from API response (capped at 5 min)
- Rate limits: exponential 5s*2^n with ±20% jitter, cap 5 min
- Other errors: exponential 2^n, cap 60s
- Was: fixed min(2**n, 60) for all cases, ignored Retry-After
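
The schedule described above can be sketched as follows. This is an illustration, not the code from run_agent.py: `compute_backoff` and its parameters are hypothetical names, the attempt counter is assumed to start at 0, and applying the cap both before and after jitter is an assumption.

```python
import random
from typing import Callable, Optional

RATE_LIMIT_CAP = 300.0  # 5 min
GENERIC_CAP = 60.0      # 60 s


def compute_backoff(attempt: int,
                    *,
                    rate_limited: bool,
                    retry_after: Optional[float] = None,
                    rng: Callable[[], float] = random.random) -> float:
    """Return the number of seconds to wait before retry number `attempt`."""
    # A server-provided Retry-After wins, capped at 5 minutes.
    if retry_after is not None:
        return min(max(retry_after, 0.0), RATE_LIMIT_CAP)

    # Bound the exponent so huge attempt counts cannot overflow a float.
    exponent = min(max(attempt, 0), 16)

    if rate_limited:
        base = min(5.0 * 2 ** exponent, RATE_LIMIT_CAP)
        # Jitter factor uniformly distributed in [0.8, 1.2).
        jitter = 1.0 + 0.2 * (2.0 * rng() - 1.0)
        return min(base * jitter, RATE_LIMIT_CAP)

    # Other retryable errors: plain exponential, capped at 60 s.
    return min(2.0 ** exponent, GENERIC_CAP)
```

Passing `rng` as a parameter keeps the jitter deterministic in tests while defaulting to `random.random` in normal use.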

Usage:
  agent:
    max_api_retries: 10  # in ~/.hermes/config.yaml
…ection errors

agent.max_stream_retries in config.yaml (default 2, means 3 attempts).
Controls inner stream retry loop for ReadTimeout/connection drops.
Works alongside max_api_retries (outer loop) for two-layer retry strategy.

Usage:
  agent:
    max_api_retries: 10     # outer: full API call retries
    max_stream_retries: 5   # inner: stream/connection retries
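
The two-layer strategy can be pictured with the sketch below. It is a simplified model, not the AIAgent implementation: the exception classes, `call_with_retries`, and `open_stream` are hypothetical, the outer setting is assumed to count retries after the first attempt (as the inner one does), and letting an exhausted inner loop escalate to a full outer retry is also an assumption.

```python
import time
from typing import Callable, Iterable, List


class StreamDropped(Exception):
    """Hypothetical stand-in for ReadTimeout or a dropped connection."""


class RetryableAPIError(Exception):
    """Hypothetical stand-in for any other retryable API failure."""


def _consume_stream(open_stream: Callable[[], Iterable[str]],
                    max_stream_retries: int) -> List[str]:
    # Inner loop: reconnect the stream without counting a full API retry.
    for stream_attempt in range(max_stream_retries + 1):
        chunks: List[str] = []
        try:
            for chunk in open_stream():
                chunks.append(chunk)
            return chunks
        except StreamDropped:
            if stream_attempt == max_stream_retries:
                raise  # inner budget exhausted; let the outer loop decide
    raise AssertionError("unreachable")


def call_with_retries(open_stream: Callable[[], Iterable[str]],
                      max_api_retries: int = 3,
                      max_stream_retries: int = 2,
                      sleep: Callable[[float], None] = time.sleep) -> List[str]:
    # Outer loop: full API-call retries with backoff between attempts.
    for api_attempt in range(max_api_retries + 1):
        try:
            return _consume_stream(open_stream, max_stream_retries)
        except (RetryableAPIError, StreamDropped):
            if api_attempt == max_api_retries:
                raise
            sleep(min(2.0 ** api_attempt, 60.0))
    raise AssertionError("unreachable")
```

With `max_stream_retries: 2`, a stream that drops twice and then succeeds is recovered entirely by the inner loop, so the outer loop never sleeps.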
@alt-glitch added the type/bug, comp/agent, comp/cli, comp/gateway, and area/config labels on Apr 22, 2026

Development

Successfully merging this pull request may close these issues.

feat: configurable max API retries + stream retries with smarter backoff