fix(gateway): keep retryable platform reconnects queued #13197

Open
denhubr wants to merge 1 commit into NousResearch:main from denhubr:fix/gateway-infinite-retryable-reconnects

denhubr commented Apr 20, 2026

The reconnect watcher previously gave up after 20 failed attempts, even when the adapter marked the failure as retryable. That caused transient infrastructure issues like DNS failures to become permanent until the gateway was restarted.

Keep retryable failures in the reconnect queue and continue applying backoff until the platform recovers or returns a non-retryable error. Add a regression test covering retryable failures past the previous 20-attempt limit.

What does this PR do?

Fixes a gateway reconnect bug where retryable platform failures were eventually dropped after repeated attempts instead of staying in the reconnect queue. This allows transient outages such as prolonged network loss or DNS failures to recover automatically once connectivity returns.

Related Issue

Fixes #12607

Related: #11241, which keeps retryable fatal failures in-process. This PR fixes the remaining reconnect path where retryable failures were still dropped after 20 attempts.

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • Updated gateway/run.py so retryable reconnect failures remain queued instead of being dropped after the previous 20-attempt limit.
  • Preserved reconnect backoff behavior until the platform recovers or reports a non-retryable error.
  • Tightened reconnect logging so periodic warnings start at the 10th attempt while normal retry logs include the retry delay.
  • Added a regression test in tests/gateway/test_platform_reconnect.py covering retryable failures beyond the old limit.
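
The policy described above can be sketched roughly as follows. This is a minimal illustration, not the code in gateway/run.py: the names `ReconnectState`, `ReconnectError`, `handle_failure`, the delay constants, and the choice to warn on every 10th attempt are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical values; the real delays in gateway/run.py are not shown in this PR.
BASE_DELAY = 30.0
MAX_DELAY = 300.0
WARN_EVERY = 10  # assumption: escalate to a warning on every 10th attempt


@dataclass
class ReconnectState:
    attempts: int = 0
    next_delay: float = BASE_DELAY


class ReconnectError(Exception):
    """Hypothetical error type carrying the adapter's retryable flag."""

    def __init__(self, message: str, retryable: bool):
        super().__init__(message)
        self.retryable = retryable


def handle_failure(queue: dict, platform: str, error: ReconnectError) -> str:
    """Update the reconnect queue after a failed attempt.

    Returns the log level the caller would use: "info", "warning", or "error".
    """
    state = queue[platform]
    state.attempts += 1
    if not error.retryable:
        # Non-retryable failures leave the queue for good.
        del queue[platform]
        return "error"
    # Retryable failures stay queued with no attempt cap; only the delay is capped.
    state.next_delay = min(state.next_delay * 2, MAX_DELAY)
    return "warning" if state.attempts % WARN_EVERY == 0 else "info"
```

In this sketch the only way out of the queue is a successful connect (not shown) or a non-retryable error, which mirrors the behavior the PR describes.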

How to Test

  1. Run scripts/run_tests.sh tests/gateway/test_platform_reconnect.py.
  2. Confirm the reconnect watcher tests pass, including the retryable failure case beyond 20 attempts.
  3. Optionally reproduce manually by forcing repeated retryable reconnect failures, then restoring connectivity and confirming the gateway reconnects without a restart.
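
For readers without the repository at hand, the shape of such a regression check might look like the sketch below. It is an assumption-laden stand-in, not the test in tests/gateway/test_platform_reconnect.py: `FlakyAdapter` and `run_watcher` are hypothetical helpers, and the loop is synchronous with no real backoff sleeps so it runs instantly.

```python
class FlakyAdapter:
    """Hypothetical platform adapter that fails a fixed number of times, then connects."""

    def __init__(self, failures_before_success: int):
        self.remaining = failures_before_success
        self.connected = False

    def connect(self) -> None:
        if self.remaining > 0:
            self.remaining -= 1
            # Stands in for a retryable failure such as a DNS error.
            raise ConnectionError("simulated DNS failure")
        self.connected = True


def run_watcher(adapter: FlakyAdapter, max_ticks: int) -> int:
    """Retry until the adapter connects or the tick budget runs out.

    Returns the number of connection attempts made.
    """
    attempts = 0
    for _ in range(max_ticks):
        attempts += 1
        try:
            adapter.connect()
        except ConnectionError:
            continue  # retryable: keep the platform queued and try again
        break
    return attempts


def test_recovers_after_more_than_20_failures():
    adapter = FlakyAdapter(failures_before_success=25)
    attempts = run_watcher(adapter, max_ticks=100)
    # 25 failures plus the one successful attempt.
    assert adapter.connected
    assert attempts == 26
```

The point of the check is that recovery still happens after the failure count passes the old 20-attempt limit.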

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: Debian 13 (x86_64)

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Screenshots / Logs

Targeted regression test run:

14 passed in 0.74s

Note: upstream main is currently red in GitHub Actions, including at least one unrelated failing test (tests/test_mcp_serve.py::TestEventBridgePollE2E::test_poll_detects_new_message_after_db_write) reproduced on clean upstream/main outside this PR.

denhubr force-pushed the fix/gateway-infinite-retryable-reconnects branch from 091d1d2 to 784edfe on April 21, 2026 18:47

alt-glitch added the type/bug and comp/gateway labels on Apr 21, 2026

denhubr commented Apr 21, 2026

Author

GitHub Actions for this fork PR are currently in action_required, so no checks have started yet. Could a maintainer please approve workflow runs for this PR?

Labels

comp/gateway: Gateway runner, session dispatch, delivery
type/bug: Something isn't working

Development

Successfully merging this pull request may close these issues.

fix(gateway): retryable reconnects stop after prolonged network loss

2 participants