fix(gateway): keep retryable platform reconnects queued by denhubr · Pull Request #13197 · NousResearch/hermes-agent

denhubr · 2026-04-20T21:34:14Z

The reconnect watcher previously gave up after 20 failed attempts, even when the adapter marked the failure as retryable. That caused transient infrastructure issues like DNS failures to become permanent until the gateway was restarted.

Keep retryable failures in the reconnect queue and continue applying backoff until the platform recovers or returns a non-retryable error. Add a regression test covering retryable failures past the previous 20-attempt limit.

What does this PR do?

Fixes a gateway reconnect bug where retryable platform failures were eventually dropped after repeated attempts instead of staying in the reconnect queue. This allows transient outages such as prolonged network loss or DNS failures to recover automatically once connectivity returns.

Related Issue

Fixes #12607

Related: #11241, which keeps retryable fatal failures in-process. This PR fixes the remaining reconnect path where retryable failures were still dropped after 20 attempts.

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
🔒 Security fix
📝 Documentation update
✅ Tests (adding or improving test coverage)
♻️ Refactor (no behavior change)
🎯 New skill (bundled or hub)

Changes Made

Updated gateway/run.py so retryable reconnect failures remain queued instead of being dropped after the previous 20-attempt limit.
Preserved reconnect backoff behavior until the platform recovers or reports a non-retryable error.
Tightened reconnect logging so periodic warnings start at the 10th attempt while normal retry logs include the retry delay.
Added a regression test in tests/gateway/test_platform_reconnect.py covering retryable failures beyond the old limit.
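For reviewers who want the intended behavior at a glance, here is a minimal sketch of the retry decision described above. It is not the code in gateway/run.py: `ReconnectEntry`, `backoff_delay`, `on_reconnect_failure`, and the 30s/300s backoff values are all hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ReconnectEntry:
    """Hypothetical queue entry for one platform awaiting reconnect."""
    platform: str
    attempts: int = 0


def backoff_delay(attempt: int, base: float = 30.0, cap: float = 300.0) -> float:
    """Exponential backoff with a ceiling (base and cap are assumed values)."""
    # Clamp the exponent so an unbounded attempt counter cannot overflow.
    exponent = min(attempt - 1, 16)
    return min(base * (2 ** exponent), cap)


def on_reconnect_failure(
    entry: ReconnectEntry, retryable: bool
) -> Tuple[bool, Optional[float]]:
    """Return (keep_queued, next_delay_seconds) after a failed attempt."""
    entry.attempts += 1
    if not retryable:
        # Non-retryable errors still remove the platform from the queue.
        return False, None
    # Retryable errors stay queued no matter how many attempts have failed.
    return True, backoff_delay(entry.attempts)
```

The point of the sketch is that the attempt counter only feeds the backoff calculation; it no longer acts as a cutoff for retryable failures.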

How to Test

Run scripts/run_tests.sh tests/gateway/test_platform_reconnect.py.
Confirm the reconnect watcher tests pass, including the retryable failure case beyond 20 attempts.
Optionally reproduce manually by forcing repeated retryable reconnect failures (more than 20), then restoring connectivity and confirming the gateway reconnects without a restart.
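The manual reproduction can also be approximated in isolation. The sketch below simulates an adapter that fails 25 times with a retryable error and then connects. `FlakyAdapter` and `reconnect_until_connected` are hypothetical stand-ins, not the gateway's adapter interface or the test added in this PR, and `ConnectionError` is assumed to represent a retryable failure.

```python
import asyncio


class FlakyAdapter:
    """Hypothetical adapter: fails N times with a retryable error, then connects."""

    def __init__(self, failures_before_success: int) -> None:
        self.remaining = failures_before_success
        self.connected = False

    async def connect(self) -> None:
        if self.remaining > 0:
            self.remaining -= 1
            # Assumption: ConnectionError stands in for a retryable failure.
            raise ConnectionError("simulated DNS failure")
        self.connected = True


async def reconnect_until_connected(adapter: FlakyAdapter, delay: float = 0.0) -> int:
    """Retry until the adapter connects; return the number of attempts made."""
    attempts = 0
    while True:
        attempts += 1
        try:
            await adapter.connect()
            return attempts
        except ConnectionError:
            # No attempt cap: retryable failures simply wait and try again.
            await asyncio.sleep(delay)


def run_case(failures: int = 25) -> tuple:
    """Run the simulation and report (connected, attempts)."""
    adapter = FlakyAdapter(failures)
    attempts = asyncio.run(reconnect_until_connected(adapter))
    return adapter.connected, attempts
```

With 25 simulated failures, the loop passes the old 20-attempt limit and connects on attempt 26.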

Checklist

Code

I've read the Contributing Guide
My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
I searched for existing PRs to make sure this isn't a duplicate
My PR contains only changes related to this fix/feature (no unrelated commits)
I've run pytest tests/ -q and all tests pass
I've added tests for my changes (required for bug fixes, strongly encouraged for features)
I've tested on my platform: Debian 13 (x86_64)

Documentation & Housekeeping

I've updated relevant documentation (README, docs/, docstrings) — or N/A
I've updated cli-config.yaml.example if I added/changed config keys — or N/A
I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Screenshots / Logs

Targeted regression test run:

14 passed in 0.74s

Note: upstream main is currently red in GitHub Actions. At least one of the failing tests (tests/test_mcp_serve.py::TestEventBridgePollE2E::test_poll_detects_new_message_after_db_write) is unrelated to this PR and reproduces on a clean upstream/main.

denhubr · 2026-04-21T20:03:10Z

GitHub Actions for this fork PR are currently in action_required, so no checks have started yet. Could a maintainer please approve workflow runs for this PR?

denhubr force-pushed the fix/gateway-infinite-retryable-reconnects branch from 091d1d2 to 784edfe · April 21, 2026 18:47

alt-glitch added the type/bug (Something isn't working) and comp/gateway (Gateway runner, session dispatch, delivery) labels · Apr 21, 2026

This was referenced Apr 23, 2026

fix(gateway): retryable reconnects stop after prolonged network loss #12607

Open

Gateway reconnect watcher permanently stops retryable platforms after 20 failed attempts #17063

Closed

teknium1 mentioned this pull request May 15, 2026

fix(gateway): keep running when platforms fail; add per-platform circuit breaker + /platform #26600

Merged

denhubr wants to merge 1 commit into NousResearch:main from denhubr:fix/gateway-infinite-retryable-reconnects
