
fix(gateway): auto-recover auto-paused platforms after transient failures #35290

Open
liuhao1024 wants to merge 1 commit into NousResearch:main from liuhao1024:fix/telegram-auto-pause-recovery

Conversation

@liuhao1024

Contributor

What does this PR do?

Adds auto-recovery for auto-paused platforms (circuit breaker). When a platform is paused after 10 consecutive failures (e.g., due to a transient DNS outage), the reconnect watcher now attempts a single reconnection every 5 minutes. If the underlying issue has resolved, the platform reconnects automatically without manual intervention.
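The timing rule above can be sketched roughly as follows. This is a minimal illustration, not the code in gateway/run.py: the `PausedPlatform` record, the `due_for_recovery` helper, and the `AUTO_RECOVERY_INTERVAL_S` constant are hypothetical names chosen for the example; only the 5-minute interval and the auto/manual distinction come from this PR.

```python
import time
from dataclasses import dataclass
from typing import Optional

# 5 minutes, per the PR description. The constant name is hypothetical.
AUTO_RECOVERY_INTERVAL_S = 5 * 60


@dataclass
class PausedPlatform:
    # Hypothetical record; the real gateway may store this state differently.
    name: str
    pause_mode: str         # "auto" (circuit breaker) or "manual" (/platform pause)
    last_attempt_at: float  # monotonic time of the pause or the last recovery attempt


def due_for_recovery(p: PausedPlatform, now: Optional[float] = None) -> bool:
    """Return True if a paused platform should get one reconnect attempt."""
    if p.pause_mode != "auto":
        # Manual pauses wait for an explicit /platform resume.
        return False
    current = time.monotonic() if now is None else now
    return current - p.last_attempt_at >= AUTO_RECOVERY_INTERVAL_S
```

Passing `now` explicitly keeps the check deterministic in tests; a watcher loop would normally omit it and rely on the monotonic clock.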

Related Issue

Fixes #35284

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)

Changes Made

  • gateway/run.py: Add pause_mode tracking (auto vs manual) to _pause_failed_platform(), auto-recovery logic in _platform_reconnect_watcher(), and cleanup in _resume_paused_platform(). Manual pauses (/platform pause) are not affected.
  • tests/gateway/test_platform_reconnect.py: Add 4 tests for auto-recovery behavior (auto-paused recovers after interval, manual stays paused, not recovered before interval, failure re-queues). Update existing paused-platform test to set pause_mode="manual".
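For reviewers, the pause_mode bookkeeping described in the list above might look something like this. It is a sketch under assumptions: `PauseState` and its attributes are hypothetical stand-ins for whatever state the gateway runner actually keeps, and the methods only mirror the roles of _pause_failed_platform() and _resume_paused_platform().

```python
from typing import Dict, Set


class PauseState:
    """Hypothetical container; not the gateway's actual data structure."""

    def __init__(self) -> None:
        self.paused: Set[str] = set()
        self.pause_mode: Dict[str, str] = {}

    def pause(self, platform: str, mode: str) -> None:
        # Record why the platform was paused so the watcher can tell
        # circuit-breaker pauses ("auto") from user pauses ("manual").
        if mode not in ("auto", "manual"):
            raise ValueError(f"unknown pause mode: {mode!r}")
        self.paused.add(platform)
        self.pause_mode[platform] = mode

    def resume(self, platform: str) -> None:
        # Cleanup on resume: drop both the paused flag and the mode entry
        # so a later pause starts from a clean slate.
        self.paused.discard(platform)
        self.pause_mode.pop(platform, None)

    def is_auto_paused(self, platform: str) -> bool:
        return platform in self.paused and self.pause_mode.get(platform) == "auto"
```

Clearing the mode on resume matters because a stale "auto" entry could otherwise make a later manual pause look eligible for recovery.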

How to Test

  1. Run python3 -m pytest tests/gateway/test_platform_reconnect.py -q — all 35 tests should pass
  2. Simulate a transient DNS failure in a gateway instance and verify the Telegram adapter auto-recovers after ~5 minutes when DNS resolves again
  3. Manually pause a platform with /platform pause telegram and verify it does NOT auto-recover

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Code Intelligence

  • Analyzed: gateway/run.py:_platform_reconnect_watcher, gateway/run.py:_pause_failed_platform, gateway/run.py:_resume_paused_platform
  • Blast radius: LOW — changes are scoped to the reconnect watcher's pause/resume logic; no new API surface
  • Related patterns: Circuit breaker auto-recovery, exponential backoff with pause/resume lifecycle

…ures

The circuit breaker pauses platforms after 10 consecutive failures, but
never attempts to reconnect them. When the underlying issue (e.g., DNS
outage) is transient, the adapter stays permanently paused until the
user manually runs /platform resume or restarts the gateway.

Add auto-recovery logic: auto-paused platforms (circuit breaker) get a
single reconnection attempt every 5 minutes. If the attempt succeeds,
the platform reconnects normally. If it fails, it stays in the retry
queue with normal backoff. Manually paused platforms (/platform pause)
are not affected — they still require explicit /platform resume.
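A rough sketch of the success and failure paths described above, under stated assumptions: `attempt_recovery`, `backoff_delay`, the retry-queue layout, and the base/cap values are all hypothetical and chosen for illustration; the actual backoff parameters live in gateway/run.py and are not restated here.

```python
from typing import Callable, Dict


def backoff_delay(attempts: int, base: float = 30.0, cap: float = 300.0) -> float:
    # Exponential backoff with a cap. base and cap are illustrative values,
    # not the gateway's real settings.
    return min(base * (2 ** attempts), cap)


def attempt_recovery(
    platform: str,
    connect: Callable[[str], bool],
    retry_queue: Dict[str, Dict[str, float]],
    now: float,
) -> bool:
    """Make a single reconnect attempt for an auto-paused platform."""
    try:
        ok = connect(platform)
    except Exception:
        # Treat any connection error as a failed attempt.
        ok = False
    if ok:
        # Success: the platform leaves the retry queue.
        retry_queue.pop(platform, None)
        return True
    # Failure: keep the platform queued and push its next retry out.
    entry = retry_queue.setdefault(platform, {"attempts": 0, "next_retry": now})
    entry["attempts"] += 1
    entry["next_retry"] = now + backoff_delay(int(entry["attempts"]))
    return False
```

Only one call to `connect` is made per invocation, which matches the "single reconnection attempt" wording; scheduling the next attempt is left to the caller.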

Fixes NousResearch#35284
@alt-glitch added the labels type/bug (Something isn't working), comp/gateway (Gateway runner, session dispatch, delivery), and P2 (Medium: degraded but workaround exists) on May 30, 2026


Development

Successfully merging this pull request may close these issues.

[Bug] Telegram adapter auto-pause never auto-recovers after transient DNS failure

2 participants