feat(agent): make primary-provider API retry count configurable#12013
Closed
alexzhu0 wants to merge 1 commit into
The per-call API retry loop used a hardcoded ``max_retries = 3``. Users with a configured ``fallback_model`` who would rather fail over sooner after an unresponsive primary had no way to shorten the wait: three attempts with backoff can stretch to 15+ minutes on a flapping upstream before the fallback kicks in (see issue #11616).

Read the retry budget from ``HERMES_API_MAX_RETRIES`` (env var, default 3, clamped to non-negative). ``0`` disables retries entirely, so one failed call routes directly to the fallback provider.

The env-var-based knob matches the existing ``HERMES_API_TIMEOUT`` / ``HERMES_API_CALL_STALE_TIMEOUT`` pattern and avoids opening a config.yaml schema discussion for a single integer. A config.yaml knob under ``agent:`` (as the issue reporter proposed) is a reasonable follow-up if self-hosters ask for it.

Also document the new var alongside ``HERMES_API_TIMEOUT`` in the environment-variables reference.

Closes #11616
Collaborator
Superseded by #14730 (merged). The config key already implements this, so this env-var approach is no longer needed.
What & why
The per-call API retry loop in `run_agent.py` uses a hardcoded `max_retries = 3`. Users with a configured `fallback_model` who would rather fail over sooner after an unresponsive primary have no way to shorten the wait: three attempts with exponential backoff can stretch to 15+ minutes on a flapping upstream before the fallback kicks in.
Issue #11616 logs exactly this scenario on Qwen via OpenRouter:
```
[20:12] ⚠️ No response from provider for 180s (model: qwen/qwen3-coder-480b-…). Reconnecting…
[20:13] ⏳ Retrying in 2.0s (attempt 1/3)…
[20:17] ⏳ Retrying in 4.5s (attempt 2/3)…
[20:20] ⚠️ No response from provider for 180s. Reconnecting… ← full retry budget burned
```
Change
Read the retry budget from `HERMES_API_MAX_RETRIES` (default `3`, clamped to non-negative, falls back to `3` on malformed values). `0` disables retries entirely, so one failed call routes directly to the fallback provider.
```python
# before
max_retries = 3

# after
try:
    max_retries = max(0, int(os.getenv("HERMES_API_MAX_RETRIES", "3")))
except (TypeError, ValueError):
    max_retries = 3
```
The env-var-based knob matches the existing `HERMES_API_TIMEOUT` / `HERMES_API_CALL_STALE_TIMEOUT` pattern in the same file and avoids opening a `config.yaml` schema discussion for a single integer. A nested `agent.max_api_retries` config knob (as the issue reporter proposed in ex-1) is a reasonable follow-up if self-hosters ask for it.
Also document the new variable alongside `HERMES_API_TIMEOUT` in `website/docs/reference/environment-variables.md`.
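To illustrate why `0` sends a single failed call straight to the fallback, here is a minimal sketch of a retry loop driven by such a budget. It is not the loop in `run_agent.py`: `call_with_failover`, `primary`, `fallback`, and `base_delay` are hypothetical names, and the sketch assumes the budget counts retries after the first call, which may differ from how the real loop counts attempts.

```python
import time
from typing import Callable


def call_with_failover(
    primary: Callable[[], str],
    fallback: Callable[[], str],
    max_retries: int,
    base_delay: float = 0.0,
) -> str:
    """Call primary once, retry up to max_retries times, then use fallback.

    Hypothetical helper for illustration only; assumes max_retries counts
    retries *after* the initial call.
    """
    for attempt in range(max_retries + 1):
        try:
            return primary()
        except Exception:
            # Back off only if another primary attempt is still allowed.
            if attempt < max_retries:
                time.sleep(base_delay * (2 ** attempt))
    # Retry budget exhausted: hand the request to the fallback provider.
    return fallback()
```

With `max_retries=0` the loop body runs exactly once, so the first failure falls through to `fallback()` with no backoff sleep at all.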
How to test
```bash
# Default behaviour unchanged
unset HERMES_API_MAX_RETRIES
python -c "import os; print(int(os.getenv('HERMES_API_MAX_RETRIES', '3')))" # 3

# Lower for fast failover
export HERMES_API_MAX_RETRIES=1
python -c "import os; print(int(os.getenv('HERMES_API_MAX_RETRIES', '3')))" # 1

# Zero disables retries entirely
export HERMES_API_MAX_RETRIES=0
python -c "import os; print(int(os.getenv('HERMES_API_MAX_RETRIES', '3')))" # 0

# Malformed values fall back to 3
export HERMES_API_MAX_RETRIES=not-a-number
```
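The one-liners above call `int()` directly, so they exercise neither the clamp nor the malformed-value fallback (a bare `int("not-a-number")` raises `ValueError`). A rough way to check those paths is to wrap the same `try`/`except` in a function that accepts an explicit mapping. `read_max_retries` and `DEFAULT_MAX_RETRIES` are hypothetical names used only for this sketch; the PR itself inlines the logic.

```python
import os
from typing import Mapping, Optional

DEFAULT_MAX_RETRIES = 3  # mirrors the default described in this PR


def read_max_retries(env: Optional[Mapping[str, str]] = None) -> int:
    """Parse HERMES_API_MAX_RETRIES using the same rules as the patch.

    Hypothetical helper: takes a mapping so it can be tested without
    mutating the real process environment.
    """
    env = os.environ if env is None else env
    try:
        raw = env.get("HERMES_API_MAX_RETRIES", str(DEFAULT_MAX_RETRIES))
        # Clamp negatives to 0 so the budget is never below "no retries".
        return max(0, int(raw))
    except (TypeError, ValueError):
        # Malformed values fall back to the default.
        return DEFAULT_MAX_RETRIES


if __name__ == "__main__":
    for raw in (None, "1", "0", "-5", "not-a-number"):
        env = {} if raw is None else {"HERMES_API_MAX_RETRIES": raw}
        print(repr(raw), "->", read_max_retries(env))
```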
No existing tests pin the `max_retries = 3` literal; the default preserves today's behaviour byte-for-byte.
Platforms tested
Related
Closes #11616.