Bug Description
Gateway uses DoH (DNS-over-HTTPS) fallback to resolve api.telegram.org, which works well when DoH providers (Google/Cloudflare) are reachable. However, when both system DNS and DoH providers are blocked/unreachable, the fallback chain degrades as follows:
- System DNS (socket.getaddrinfo) fails → "getaddrinfo failed" reported in logs
- DoH to dns.google and cloudflare-dns.com also fails (network-level block)
- Falls back to hardcoded seed IP 149.154.167.220
- TCP connection to seed IP also fails
- After 10 consecutive failures, Telegram adapter is auto-paused (gateway/run.py: _PAUSE_AFTER_FAILURES=10)
- Paused adapter stops all retry attempts permanently; requires manual /platform resume telegram to recover
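The resolution order described in the list above can be sketched roughly as follows. This is an illustration only, not the gateway's actual code: the names system_dns, resolve_with_fallback, and SEED_IPS are hypothetical, and the DoH step is represented by whatever resolver callables the caller passes in rather than by a real HTTPS lookup.

```python
import socket
from typing import Callable, List, Sequence, Tuple

# Seed fallback IP quoted in the logs; the constant name is hypothetical.
SEED_IPS = ["149.154.167.220"]

Resolver = Callable[[str], List[str]]


def system_dns(host: str) -> List[str]:
    """Resolve via the OS resolver; return an empty list on failure."""
    try:
        infos = socket.getaddrinfo(host, 443, type=socket.SOCK_STREAM)
    except OSError:  # socket.gaierror is a subclass of OSError
        return []
    return sorted({info[4][0] for info in infos})


def resolve_with_fallback(
    host: str,
    resolvers: Sequence[Resolver],
    seeds: Sequence[str] = SEED_IPS,
) -> Tuple[List[str], str]:
    """Try each resolver in order; use the seed IPs if all of them yield nothing."""
    for resolver in resolvers:
        try:
            ips = resolver(host)
        except Exception:
            ips = []  # a failing resolver is treated the same as an empty answer
        if ips:
            return ips, getattr(resolver, "__name__", "resolver")
    return list(seeds), "seed"
```

When every resolver in the list fails, the only candidates left are the seed IPs, so a network that also blocks those addresses leaves nothing to connect to.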
The core problem: When the network recovers (DNS resolves again), the adapter remains permanently paused and never auto-recovers. The user must manually run /platform resume telegram or restart the gateway — which is not obvious from the error message.
Steps to Reproduce
- Run hermes gateway with Telegram adapter configured
- Simulate DNS failure (e.g., block port 53, or use a network that has no DNS resolution for api.telegram.org)
- Observe logs:
DoH discovery yielded no usable IPs (system DNS: unknown); using seed fallback IPs 149.154.167.220
Primary api.telegram.org connection failed ([Errno 11001] getaddrinfo failed); trying fallback IPs 149.154.167.220
Fallback IP 149.154.167.220 failed: All connection attempts failed
- After 10 attempts:
telegram paused after 10 consecutive failures (telegram connect timed out after 30s) — fix the underlying issue then run /platform resume telegram to retry
- When network recovers (DNS resolves again), the adapter stays paused — no auto-recovery
Expected Behavior
When network connectivity recovers (system DNS can resolve api.telegram.org again), the Telegram adapter should automatically reconnect without manual intervention. The circuit breaker (pause after 10 failures) should only stop hammering a permanently failed endpoint, not a temporarily unreachable one that has since recovered.
Actual Behavior
- Telegram adapter goes into paused state after 10 consecutive failures
- It stays paused even after network recovers
- User receives misleading error message: "fix the underlying issue then run /platform resume telegram" even when the underlying issue (DNS failure) has already been resolved
- Requires manual /platform resume telegram or gateway restart to recover
Root Cause Analysis
File: gateway/run.py lines 5500-5501 and 2603-2638
_BACKOFF_CAP = 300 # 5 minutes max between retries
_PAUSE_AFTER_FAILURES = 10 # circuit-breaker threshold
The _pause_failed_platform() method sets info["paused"] = True and pushes next_retry to now + 300s. The reconnect watcher (_platform_reconnect_watcher) skips platforms that are paused:
# gateway/run.py — reconnect watcher loop
if info.get("paused"):
    # circuit breaker: don't hammer a known-bad platform
    continue
The logic gap: The circuit breaker correctly stops hammering a failed endpoint, but it never checks whether the endpoint has become reachable again. On a machine behind the Great Firewall (GFW), the network may be flaky: DNS fails for minutes, then recovers, but because the watcher skips paused platforms entirely, the adapter never wakes up.
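A minimal model of this behavior, for illustration only. The two constants and the "paused" / "next_retry" keys come from the description above; the function names, the "attempts" key, and the exponential backoff formula are assumptions, since the actual code in gateway/run.py is not quoted here.

```python
_BACKOFF_CAP = 300          # 5 minutes max between retries
_PAUSE_AFTER_FAILURES = 10  # circuit-breaker threshold


def record_failure(info: dict, now: float) -> None:
    """Count one failed connect and either schedule a retry or pause the platform."""
    info["attempts"] = info.get("attempts", 0) + 1
    if info["attempts"] >= _PAUSE_AFTER_FAILURES:
        info["paused"] = True
        info["next_retry"] = now + _BACKOFF_CAP
        return
    # Assumed backoff: doubling delay, capped at _BACKOFF_CAP seconds.
    info["next_retry"] = now + min(_BACKOFF_CAP, 2 ** info["attempts"])


def due_for_retry(failed: dict, now: float) -> list:
    """Return the platforms the watcher would try to reconnect on this tick."""
    due = []
    for name, info in failed.items():
        if info.get("paused"):
            # Circuit breaker: skipped regardless of how much time has passed.
            continue
        if now >= info.get("next_retry", 0.0):
            due.append(name)
    return due
```

In this model a paused entry is never returned by due_for_retry, no matter how far now advances past next_retry, which matches the symptom reported here.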
Proposed Fix
When a platform is in paused state, the reconnect watcher should still periodically poll system DNS to detect if the endpoint has become reachable again. Specifically:
- Add a DNS probe phase for paused platforms (e.g., every 5 minutes) that checks if the platform's host can be resolved
- If system DNS resolves successfully, auto-resume the platform (reset attempt counter, schedule immediate reconnect)
- This is a targeted fix — the circuit breaker still protects against hammering a permanently unreachable endpoint, but recovered endpoints auto-heal
Affected file: gateway/run.py — _platform_reconnect_watcher() method
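One possible shape for the probe, as a sketch under stated assumptions rather than a patch against the real code. The helper names (host_resolves, maybe_auto_resume), the "attempts" and "next_probe" keys, and the _PROBE_INTERVAL constant are hypothetical; only info["paused"], next_retry, and the 5-minute interval come from the text above.

```python
import socket

_PROBE_INTERVAL = 300  # hypothetical: probe paused platforms every 5 minutes


def host_resolves(host: str, port: int = 443) -> bool:
    """Return True if system DNS can currently resolve the host."""
    try:
        return bool(socket.getaddrinfo(host, port, type=socket.SOCK_STREAM))
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def maybe_auto_resume(info: dict, host: str, now: float, probe=host_resolves) -> bool:
    """Probe DNS for a paused platform; resume it if the host resolves again.

    Returns True only when the platform was resumed on this call.
    """
    if not info.get("paused"):
        return False
    if now < info.get("next_probe", 0.0):
        return False  # rate limit: at most one probe per _PROBE_INTERVAL
    info["next_probe"] = now + _PROBE_INTERVAL
    if not probe(host):
        return False  # still unresolvable, stay paused
    info["paused"] = False
    info["attempts"] = 0       # reset the failure counter
    info["next_retry"] = now   # schedule an immediate reconnect
    return True
```

The watcher would call maybe_auto_resume(info, "api.telegram.org", now) in the branch that currently just skips paused platforms. Note that socket.getaddrinfo blocks; if the watcher runs on an asyncio event loop, the probe would presumably need to go through loop.getaddrinfo or an executor instead. A successful DNS lookup also does not guarantee the TCP connection will succeed, so the normal failure counting would still apply after resume.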
OS / Environment
- OS: Windows 10 (native, Git Bash / MSYS shell)
- Hermes version: latest (as of May 30, 2026)
- Telegram adapter with no proxy configured
- Network: ISP-level DNS occasionally fails for api.telegram.org