[Bug] Telegram adapter auto-pause never auto-recovers after transient DNS failure #35284

@q3874758

Bug Description

Gateway uses DoH (DNS-over-HTTPS) fallback to resolve api.telegram.org, which works well when DoH providers (Google/Cloudflare) are reachable. However, when both system DNS and DoH providers are blocked/unreachable, the fallback chain degrades as follows:

  1. System DNS (socket.getaddrinfo) fails → getaddrinfo failed reported in logs
  2. DoH to dns.google and cloudflare-dns.com also fails (network-level block)
  3. Falls back to hardcoded seed IP 149.154.167.220
  4. TCP connection to seed IP also fails
  5. After 10 consecutive failures, Telegram adapter is auto-paused (gateway/run.py:_PAUSE_AFTER_FAILURES=10)
  6. Paused adapter stops all retry attempts permanently; requires manual /platform resume telegram to recover
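The resolution part of this chain (steps 1 to 3) can be sketched as follows. This is only an illustration of the ordering described above, not the gateway's actual code: `system_dns`, `resolve_with_fallback`, and the `(name, resolver)` pair convention are hypothetical names, and the DoH resolvers are left as injectable callables rather than implemented.

```python
import socket
from typing import Callable, Iterable, List, Tuple

SEED_IPS = ["149.154.167.220"]  # seed fallback IP quoted in the issue


def system_dns(host: str) -> List[str]:
    """Resolve via the OS resolver; return [] on failure instead of raising."""
    try:
        infos = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
    except OSError:  # socket.gaierror is a subclass of OSError
        return []
    return sorted({info[4][0] for info in infos})


def resolve_with_fallback(
    host: str,
    resolvers: Iterable[Tuple[str, Callable[[str], List[str]]]],
    seed_ips: List[str] = SEED_IPS,
) -> Tuple[str, List[str]]:
    """Try each (name, resolver) pair in order; use seed IPs if none yields an address."""
    for name, resolver in resolvers:
        try:
            ips = resolver(host)
        except Exception:
            ips = []  # a failing resolver is treated the same as an empty answer
        if ips:
            return name, ips
    return "seed", list(seed_ips)
```

When every resolver fails, the caller is handed the seed IP, so a network-level block only becomes visible at the TCP connect stage (step 4).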

The core problem: When the network recovers (DNS resolves again), the adapter remains permanently paused and never auto-recovers. The user must manually run /platform resume telegram or restart the gateway — which is not obvious from the error message.

Steps to Reproduce

  1. Run hermes gateway with Telegram adapter configured
  2. Simulate DNS failure (e.g., block port 53, or use a network that has no DNS resolution for api.telegram.org)
  3. Observe logs:
     DoH discovery yielded no usable IPs (system DNS: unknown); using seed fallback IPs 149.154.167.220
     Primary api.telegram.org connection failed ([Errno 11001] getaddrinfo failed); trying fallback IPs 149.154.167.220
     Fallback IP 149.154.167.220 failed: All connection attempts failed
  4. After 10 attempts: telegram paused after 10 consecutive failures (telegram connect timed out after 30s) — fix the underlying issue then run /platform resume telegram to retry
  5. When network recovers (DNS resolves again), the adapter stays paused — no auto-recovery
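For a test harness, the system DNS failure in step 2 can be approximated without touching the network configuration. The sketch below is an assumption about how one might do that, not part of Hermes: `dns_blackhole` is a hypothetical helper that patches `socket.getaddrinfo` for a single host. It only simulates the system DNS failure; DoH and the TCP connection to the seed IP would have to be blocked separately.

```python
import socket
from contextlib import contextmanager
from unittest import mock


@contextmanager
def dns_blackhole(blocked_host: str):
    """Make socket.getaddrinfo fail for one host, leaving other lookups untouched."""
    real_getaddrinfo = socket.getaddrinfo

    def fake_getaddrinfo(host, *args, **kwargs):
        if host == blocked_host:
            # Same errno and message as the Windows log line in the issue
            raise socket.gaierror(11001, "getaddrinfo failed")
        return real_getaddrinfo(host, *args, **kwargs)

    with mock.patch("socket.getaddrinfo", fake_getaddrinfo):
        yield
```

Leaving the `with dns_blackhole("api.telegram.org"):` block restores the real resolver, which stands in for the "network recovers" step.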

Expected Behavior

When network connectivity recovers (system DNS can resolve api.telegram.org again), the Telegram adapter should automatically reconnect without manual intervention. The circuit breaker (pause after 10 failures) should only stop hammering a permanently failed endpoint, not a temporarily unreachable one that has since recovered.

Actual Behavior

  • Telegram adapter goes into paused state after 10 consecutive failures
  • It stays paused even after network recovers
  • User receives misleading error message: "fix the underlying issue then run /platform resume telegram" even when the underlying issue (DNS failure) has already been resolved
  • Requires manual /platform resume telegram or gateway restart to recover

Root Cause Analysis

File: gateway/run.py lines 5500-5501 and 2603-2638

_BACKOFF_CAP = 300  # 5 minutes max between retries
_PAUSE_AFTER_FAILURES = 10  # circuit-breaker threshold

The _pause_failed_platform() method sets info["paused"] = True and pushes next_retry to now + 300s. The reconnect watcher (_platform_reconnect_watcher) skips platforms that are paused:

# gateway/run.py — reconnect watcher loop
if info.get("paused"):
    # circuit breaker: don't hammer a known-bad platform
    continue

The logic gap: the circuit breaker correctly stops hammering a failed endpoint, but it never detects when the endpoint becomes reachable again. On a machine behind the GFW (Great Firewall), the network may be flaky: DNS fails for minutes and then recovers, but the adapter never wakes up.
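The gap can be reproduced in isolation with a small model of the breaker. This is a sketch under assumptions: the two constants come from the issue, but `record_failure`, `due_platforms`, the `attempts` key, and the exponential backoff formula are hypothetical stand-ins for whatever gateway/run.py actually does.

```python
_BACKOFF_CAP = 300  # 5 minutes max between retries
_PAUSE_AFTER_FAILURES = 10  # circuit-breaker threshold


def record_failure(info: dict, now: float) -> None:
    """Count one failed connect and pause the platform at the threshold."""
    info["attempts"] = info.get("attempts", 0) + 1
    if info["attempts"] >= _PAUSE_AFTER_FAILURES:
        info["paused"] = True
        info["next_retry"] = now + _BACKOFF_CAP
    else:
        # Assumed backoff shape: exponential, capped at _BACKOFF_CAP
        info["next_retry"] = now + min(2 ** info["attempts"], _BACKOFF_CAP)


def due_platforms(failed: dict, now: float) -> list:
    """Return platforms the watcher would retry at time `now`."""
    due = []
    for name, info in failed.items():
        if info.get("paused"):
            # circuit breaker: paused platforms are skipped, however old next_retry is
            continue
        if now >= info.get("next_retry", 0.0):
            due.append(name)
    return due
```

In this model, once `paused` is set nothing inside the loop can clear it, so `next_retry` expiring has no effect. That matches the reported behavior, where only /platform resume telegram or a restart brings the adapter back.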

Proposed Fix

When a platform is in paused state, the reconnect watcher should still periodically poll system DNS to detect if the endpoint has become reachable again. Specifically:

  1. Add a DNS probe phase for paused platforms (e.g., every 5 minutes) that checks if the platform's host can be resolved
  2. If system DNS resolves successfully, auto-resume the platform (reset attempt counter, schedule immediate reconnect)
  3. This is a targeted fix — the circuit breaker still protects against hammering a permanently unreachable endpoint, but recovered endpoints auto-heal

Affected file: gateway/run.py, _platform_reconnect_watcher() method
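A minimal sketch of the proposed probe, assuming the per-platform state is a dict like the `info` seen in the watcher snippet. `maybe_auto_resume`, `host_resolves`, `_DNS_PROBE_INTERVAL`, and the `next_probe` and `attempts` keys are hypothetical names introduced for illustration; only `paused` and `next_retry` appear in the issue.

```python
import socket
from typing import Callable

_DNS_PROBE_INTERVAL = 300  # hypothetical: probe paused platforms every 5 minutes


def host_resolves(host: str) -> bool:
    """True if the system resolver returns at least one address for host."""
    try:
        socket.getaddrinfo(host, 443)
    except OSError:  # includes socket.gaierror
        return False
    return True


def maybe_auto_resume(
    info: dict,
    host: str,
    now: float,
    probe: Callable[[str], bool] = host_resolves,
) -> bool:
    """Resume a paused platform if its host resolves again; return True if resumed."""
    if not info.get("paused"):
        return False
    if now < info.get("next_probe", 0.0):
        return False  # probed recently; do not hammer DNS either
    info["next_probe"] = now + _DNS_PROBE_INTERVAL
    if not probe(host):
        return False
    info["paused"] = False
    info["attempts"] = 0  # reset the failure counter
    info["next_retry"] = now  # schedule an immediate reconnect
    return True
```

The watcher would call this in place of the bare `continue` for paused platforms. Two caveats: `socket.getaddrinfo` blocks, so inside an asyncio watcher the probe would more likely go through `loop.getaddrinfo()` or a worker thread, and a successful DNS lookup does not guarantee the TCP connection works, so the breaker may simply trip again after another 10 failures.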

OS / Environment

  • OS: Windows 10 (native, Git Bash / MSYS shell)
  • Hermes version: latest (as of May 30, 2026)
  • Telegram adapter with no proxy configured
  • Network: ISP-level DNS occasionally fails for api.telegram.org

Metadata

    Labels

    P2 (Medium: degraded but workaround exists), comp/gateway (Gateway runner, session dispatch, delivery), platform/telegram (Telegram bot adapter), type/bug (Something isn't working)
