fix(gateway): keep retryable platform reconnects queued #13197
Open
denhubr wants to merge 1 commit into
The reconnect watcher previously gave up after 20 failed attempts, even when the adapter marked the failure as retryable. That caused transient infrastructure issues like DNS failures to become permanent until the gateway was restarted. Keep retryable failures in the reconnect queue and continue applying backoff until the platform recovers or returns a non-retryable error. Add a regression test covering retryable failures past the previous 20-attempt limit.
Author
|
GitHub Actions for this fork PR are currently in |
This was referenced Apr 23, 2026
What does this PR do?
Fixes a gateway reconnect bug where retryable platform failures were eventually dropped after repeated attempts instead of staying in the reconnect queue. This allows transient outages such as prolonged network loss or DNS failures to recover automatically once connectivity returns.
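The intended behavior can be illustrated with a minimal sketch. It is not the code in gateway/run.py: `ReconnectEntry`, `backoff_delay`, `next_delay`, and the 30-second base and 300-second cap are hypothetical names and values chosen for illustration. The point is only that a retryable failure always yields another delay and never hits an attempt ceiling, while a non-retryable failure drops the entry.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReconnectEntry:
    """Hypothetical queue entry for one platform awaiting reconnect."""
    platform: str
    attempts: int = 0


def backoff_delay(attempts: int, base: float = 30.0, cap: float = 300.0) -> float:
    """Exponential backoff capped at `cap` seconds (base and cap are assumed values).

    The exponent is clamped so an unbounded attempt counter cannot
    overflow the float conversion.
    """
    exponent = min(max(attempts - 1, 0), 16)
    return min(cap, base * (2 ** exponent))


def next_delay(entry: ReconnectEntry, retryable: bool) -> Optional[float]:
    """Return seconds to wait before the next attempt, or None to drop the entry.

    There is deliberately no attempt limit for retryable failures.
    """
    if not retryable:
        return None  # non-retryable error: remove from the reconnect queue
    entry.attempts += 1
    return backoff_delay(entry.attempts)
```

With these assumed values the delay grows 30, 60, 120, 240 and then stays at 300 seconds, so a long outage keeps the platform queued at a bounded polling rate rather than dropping it.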
Related Issue
Fixes #12607
Related: #11241, which keeps retryable fatal failures in-process. This PR fixes the remaining reconnect path where retryable failures were still dropped after 20 attempts.
Type of Change
Changes Made
- gateway/run.py: retryable reconnect failures remain queued instead of being dropped after the previous 20-attempt limit.
- tests/gateway/test_platform_reconnect.py: regression test covering retryable failures beyond the old limit.
How to Test
- scripts/run_tests.sh tests/gateway/test_platform_reconnect.py
Checklist
Code
- fix(scope):, feat(scope):, etc.)
- pytest tests/ -q and all tests pass
Documentation & Housekeeping
- docs/, docstrings), or N/A
- cli-config.yaml.example if I added/changed config keys, or N/A
- CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows, or N/A
Screenshots / Logs
Targeted regression test run:
14 passed in 0.74s
Note: upstream main is currently red in GitHub Actions, including at least one unrelated failing test (tests/test_mcp_serve.py::TestEventBridgePollE2E::test_poll_detects_new_message_after_db_write) reproduced on clean upstream/main outside this PR.
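For reviewers who want the shape of the regression scenario without opening the test file, here is a rough, self-contained sketch. It is not the test added in tests/gateway/test_platform_reconnect.py: `FlakyAdapter`, `run_reconnect_passes`, and the platform name are hypothetical stand-ins, and the real watcher also applies backoff between passes, which is omitted here.

```python
class FlakyAdapter:
    """Hypothetical adapter: fails retryably a fixed number of times, then connects."""

    def __init__(self, failures_before_success: int) -> None:
        self.remaining = failures_before_success
        self.calls = 0
        self.retryable = True  # how the adapter classifies its failures

    def connect(self) -> bool:
        self.calls += 1
        if self.remaining > 0:
            self.remaining -= 1
            return False
        return True


def run_reconnect_passes(queue: dict, max_passes: int) -> list:
    """Run up to max_passes watcher passes; return platforms that connected."""
    connected = []
    for _ in range(max_passes):
        if not queue:
            break
        for name in list(queue):
            adapter = queue[name]
            if adapter.connect():
                connected.append(name)
                del queue[name]
            elif not adapter.retryable:
                del queue[name]  # non-retryable: give up on this platform
            # retryable failure: the entry stays queued, with no attempt limit
    return connected


def test_retryable_failure_survives_past_old_limit() -> None:
    queue = {"example-platform": FlakyAdapter(failures_before_success=25)}
    # After 20 failed passes (the old limit) the platform must still be queued.
    assert run_reconnect_passes(queue, max_passes=20) == []
    assert "example-platform" in queue
    # Once the transient failure clears, the platform reconnects and leaves the queue.
    assert run_reconnect_passes(queue, max_passes=10) == ["example-platform"]
    assert queue == {}
```

The sketch checks two things: the entry is still queued after the 20th retryable failure, and it reconnects once the simulated outage ends.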