fix(agent): add exponential backoff to inner streaming retry loop #7431

Open
Tranquil-Flow wants to merge 2 commits into NousResearch:main from Tranquil-Flow:fix/7069-stream-retry-backoff

@Tranquil-Flow commented Apr 10, 2026

Contributor

What does this PR do?

The inner streaming retry loop retried immediately after timeout/connection errors, causing an infinite fast-retry loop when a local LLM's prefill time exceeds the stale-stream timeout (default 180s). This PR adds jittered exponential backoff (5s, 10s, 20s, capped at 30s) between attempts so the backend gets time to complete prompt processing.

The backoff uses a chunked sleep with interrupt checking to stay responsive, matching the pattern used by the outer retry loop.
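
For readers who want to see the idea in isolation, here is a minimal sketch of the two pieces described above: a capped, jittered exponential delay and a sleep that checks for interrupts between small chunks. It is not the code from this PR. The names `backoff_delay` and `interruptible_sleep`, the 25% jitter fraction, and the 0.2s chunk size are all assumptions made for illustration; only the 5s base and 30s cap come from the description.

```python
import random
import time
from typing import Callable


def backoff_delay(attempt: int, base: float = 5.0, cap: float = 30.0,
                  jitter: float = 0.25,
                  rng: Callable[[], float] = random.random) -> float:
    """Return the wait in seconds before retry number `attempt` (1-based)."""
    # Un-jittered schedule: 5, 10, 20, then held at the 30s cap.
    raw = min(cap, base * 2 ** (attempt - 1))
    # Assumed jitter scheme: add up to +25% of the raw delay, then re-apply the cap.
    return min(cap, raw + raw * jitter * rng())


def interruptible_sleep(delay: float, is_interrupted: Callable[[], bool],
                        chunk: float = 0.2,
                        sleep: Callable[[float], None] = time.sleep) -> bool:
    """Sleep for `delay` seconds in chunks; return False if interrupted early."""
    remaining = delay
    while remaining > 0:
        # Check the (hypothetical) interrupt flag before every chunk.
        if is_interrupted():
            return False
        step = min(chunk, remaining)
        sleep(step)
        remaining -= step
    return True
```

Sleeping in short chunks means a user interrupt is noticed within roughly one chunk rather than after the full delay, which is the responsiveness property the description refers to.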

Related Issue

Closes #7069

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • run_agent.py — added interruptible exponential backoff before continue in the inner streaming retry loop
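
The placement of the wait, just before `continue` in the retry loop, can be sketched roughly as follows. This is a simplified stand-in and not the actual loop in run_agent.py: `stream_with_retry`, `open_stream`, `StreamInterrupted`, `max_retries`, and the chunk size are hypothetical names and values, jitter is omitted for brevity, and the log line only mirrors the message quoted under "How to Test".

```python
import logging
import time
from typing import Callable, Iterator, List

logger = logging.getLogger(__name__)


class StreamInterrupted(Exception):
    """Hypothetical signal that the user interrupted the wait."""


def stream_with_retry(open_stream: Callable[[], Iterator[str]],
                      max_retries: int = 3,
                      base: float = 5.0, cap: float = 30.0, chunk: float = 0.2,
                      is_interrupted: Callable[[], bool] = lambda: False,
                      sleep: Callable[[float], None] = time.sleep) -> List[str]:
    attempt = 0
    while True:
        try:
            return list(open_stream())
        except (TimeoutError, ConnectionError):
            attempt += 1
            if attempt > max_retries:
                raise
            # Capped exponential delay (jitter omitted in this sketch).
            delay = min(cap, base * 2 ** (attempt - 1))
            logger.info("Backing off %.1fs before stream retry", delay)
            # Chunked wait so an interrupt is noticed quickly.
            waited = 0.0
            while waited < delay:
                if is_interrupted():
                    raise StreamInterrupted()
                step = min(chunk, delay - waited)
                sleep(step)
                waited += step
            continue  # retry only after the backoff has elapsed
```

Without the wait, the `except` branch would fall straight through to `continue`, which is the immediate-retry behavior the PR description identifies as the bug.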

How to Test

  1. Set up Hermes with a large local LLM whose prefill exceeds 180s
  2. Send a complex prompt
  3. You should see "Backing off Xs before stream retry" in the logs instead of immediate retry spam

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS 15 (Darwin 24.6.0)

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Screenshots / Logs

N/A — see commit description and PR diff.

@alt-glitch added the type/bug (Something isn't working), P2 (Medium; degraded but workaround exists), and comp/agent (Core agent loop, run_agent.py, prompt builder) labels on Apr 29, 2026
@Tranquil-Flow force-pushed the fix/7069-stream-retry-backoff branch from 8c7ff68 to fd39696 on May 18, 2026 21:30
The inner streaming retry loop retried immediately after timeout/
connection errors, causing an infinite fast-retry loop when a local
LLM's prefill time exceeds the stale-stream timeout. Add jittered
exponential backoff (5s, 10s, 20s, capped at 30s) between attempts
so the backend gets time to complete prompt processing.

Uses chunked sleep with interrupt checking to stay responsive,
matching the pattern used by the outer retry loop.

Closes NousResearch#7069
@Tranquil-Flow force-pushed the fix/7069-stream-retry-backoff branch from fd39696 to ffd36ff on May 25, 2026 11:08
Labels

  • comp/agent (Core agent loop, run_agent.py, prompt builder)
  • P2 (Medium; degraded but workaround exists)
  • type/bug (Something isn't working)

Development

Successfully merging this pull request may close these issues.

[Bug]: Infinite retry loop caused by stream stale timeout during local LLM prefill phase

2 participants