Telegram resolver failure caching + missing degraded-state detection after cold-boot resolver exhaustion #5729

@iws17

PR: Telegram platform fails silently on cold boot due to insufficient retry budget

Problem

On systems where the Hermes gateway starts at boot time (e.g. macOS launchd, systemd with default ordering), the Telegram platform's connect() routine can fail before the OS network stack is ready:

ERROR telegram.ext.Updater: telegram.error.NetworkError:
  httpx.ConnectError: All connection attempts failed
WARNING gateway.platforms.telegram_network:
  Primary api.telegram.org connection failed
  ([Errno 8] nodename nor servname provided, or not known)

Errno 8 is EAI_NONAME from getaddrinfo() on macOS: the resolver itself cannot resolve the hostname, which happens briefly at cold boot before the network is fully up. Later attempts inside the already-running Python process keep failing as well, as if the failure had been cached; the root-cause analysis below traces this to retries reusing state that was built during the first failed attempt.
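To make the error class concrete, here is a minimal sketch of how a caller could tell this kind of name-resolution failure apart from other socket errors. The helper names `is_name_resolution_failure` and `try_resolve` are hypothetical and not part of Hermes. The numeric value of `EAI_NONAME` is platform-specific (8 on macOS, as in the log above), so the sketch compares against the `socket` module constant rather than a literal; other transient codes such as `EAI_AGAIN` are deliberately left unhandled here.

```python
import socket


def is_name_resolution_failure(exc: BaseException) -> bool:
    """Return True if exc is a getaddrinfo() 'name not known' failure.

    Hypothetical helper for illustration; not part of the Hermes codebase.
    """
    # socket.gaierror is raised by getaddrinfo(); its errno holds an EAI_* code.
    return isinstance(exc, socket.gaierror) and exc.errno == socket.EAI_NONAME


def try_resolve(host: str, port: int = 443) -> bool:
    """Attempt one resolution; report whether the resolver answered."""
    try:
        socket.getaddrinfo(host, port)
        return True
    except socket.gaierror as exc:
        if is_name_resolution_failure(exc):
            return False  # the resolver cannot resolve the name (yet)
        raise  # other resolver errors are not handled in this sketch
```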

The gateway does not exit — it logs the error and keeps running in a degraded state where the Telegram platform is silently dead until a manual restart. This is especially bad because:

  1. Gateway looks healthy to launchctl / systemctl
  2. No Telegram messages are received
  3. KeepAlive on non-zero exit does not help because the process stays alive
  4. Users don't notice until they try to DM the bot
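One possible way to close the detection gap in the list above is to turn "a platform never connected" into a non-zero exit, so that launchd KeepAlive or a systemd restart policy can react. The sketch below only illustrates that idea under stated assumptions: `PlatformStatus`, `degraded_platforms`, and `exit_code_for` are hypothetical names, not Hermes APIs, and the policy of exiting when any platform is down is one choice among several.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PlatformStatus:
    """Hypothetical per-platform connection state (not a Hermes type)."""
    name: str
    connected: bool


def degraded_platforms(statuses: List[PlatformStatus]) -> List[str]:
    """Names of platforms that are configured but not connected."""
    return [s.name for s in statuses if not s.connected]


def exit_code_for(statuses: List[PlatformStatus]) -> int:
    """Map platform states to a process exit code.

    0: every platform is connected, keep running.
    1: at least one platform is dead; exiting lets the service manager
       restart the gateway instead of leaving it silently degraded.
    """
    return 1 if degraded_platforms(statuses) else 0
```

A caller could evaluate `exit_code_for(...)` once the connect retries are exhausted and pass a non-zero result to `sys.exit()`.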

Root cause

Two issues in gateway/platforms/telegram.py connect():

Issue 1: Retry budget is too small for cold boot

# Line ~558-572
_max_connect = 3
for _attempt in range(_max_connect):
    try:
        await self._app.initialize()
        break
    except (NetworkError, TimedOut, OSError) as init_err:
        if _attempt < _max_connect - 1:
            wait = 2 ** _attempt   # 1s, 2s
            await asyncio.sleep(wait)
        else:
            raise

Three attempts with 1 s and 2 s of backoff give up after roughly 3 seconds of waiting. At cold boot the network stack can take 10-30 seconds to come up on macOS, and even longer on systems with VPN/proxy startup.

Issue 2: discover_fallback_ips() called once, before the retry loop

# Line ~513-514
if not fallback_ips:
    fallback_ips = await discover_fallback_ips()

This call runs once, before the retry loop. At cold boot, the DoH queries to dns.google and cloudflare-dns.com fail too, so discover_fallback_ips() falls through to the hardcoded seed 149.154.167.220. The TelegramFallbackTransport is then built with only this seed, and every retry inside the loop tries the same unreachable endpoints without re-running discovery. By the time the network comes up, the gateway has usually exhausted its retry budget.
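The difference between discovering once and discovering per attempt can be sketched independently of the Telegram library. In the following illustration, `connect_with_rediscovery`, `discover`, `connect`, and the injectable `sleep` are hypothetical stand-ins (the real code uses discover_fallback_ips() and self._app.initialize()), and only OSError is caught for brevity.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def connect_with_rediscovery(
    discover: Callable[[], Awaitable[Sequence[str]]],
    connect: Callable[[Sequence[str]], Awaitable[None]],
    max_attempts: int = 8,
    backoff_cap: float = 15.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> int:
    """Retry connect(), calling discover() again before every attempt.

    Returns the 1-based number of the attempt that succeeded and re-raises
    the last OSError if every attempt fails.
    """
    for attempt in range(max_attempts):
        ips = await discover()  # fresh discovery each time, not once up front
        try:
            await connect(ips)
            return attempt + 1
        except OSError:
            if attempt == max_attempts - 1:
                raise
            await sleep(min(2 ** attempt, backoff_cap))
    # Only reachable when max_attempts is zero or negative.
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter is a testing convenience: a fake that returns immediately lets the loop be exercised without real delays.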

Proposed fix

Two small changes in connect():

  1. Increase retry budget from 3 attempts (~3s) to 8 attempts (~60s) with capped exponential backoff
  2. Move discover_fallback_ips() inside the retry loop so each retry gets a fresh discovery attempt when network finally comes up

Patch

# Replace existing block around lines 510-572:

            # Build the application (base builder without transport; we'll add it per-attempt)
            base_token = self.config.token

            # Start polling — retry initialize() for cold-boot DNS races + transient TLS resets
            try:
                from telegram.error import NetworkError, TimedOut
            except ImportError:
                NetworkError = TimedOut = OSError  # type: ignore[misc,assignment]

            _max_connect = 8  # up from 3 — covers cold-boot network stack delay
            _backoff_cap = 15  # seconds

            last_err: Exception | None = None
            for _attempt in range(_max_connect):
                try:
                    # Re-discover fallback IPs on each attempt — DoH may be down at
                    # cold boot but recover before we exhaust our retry budget.
                    fallback_ips = self._fallback_ips()
                    if not fallback_ips:
                        fallback_ips = await discover_fallback_ips()
                        if _attempt == 0:
                            logger.info(
                                "[%s] Auto-discovered Telegram fallback IPs: %s",
                                self.name,
                                ", ".join(fallback_ips),
                            )

                    builder = Application.builder().token(base_token)
                    if fallback_ips:
                        if _attempt == 0:
                            logger.warning(
                                "[%s] Telegram fallback IPs active: %s",
                                self.name,
                                ", ".join(fallback_ips),
                            )
                        transport = TelegramFallbackTransport(fallback_ips)
                        request = HTTPXRequest(httpx_kwargs={"transport": transport})
                        get_updates_request = HTTPXRequest(httpx_kwargs={"transport": transport})
                        builder = builder.request(request).get_updates_request(get_updates_request)

                    self._app = builder.build()
                    self._bot = self._app.bot

                    # (move handler registration to a helper and call here, or keep inline)
                    self._register_handlers()

                    await self._app.initialize()
                    break
                except (NetworkError, TimedOut, OSError) as init_err:
                    last_err = init_err
                    if _attempt < _max_connect - 1:
                        wait = min(2 ** _attempt, _backoff_cap)  # 1,2,4,8,15,15,15
                        logger.warning(
                            "[%s] Connect attempt %d/%d failed: %s — retrying in %ds",
                            self.name, _attempt + 1, _max_connect, init_err, wait,
                        )
                        await asyncio.sleep(wait)
                    else:
                        total_wait = sum(
                            min(2 ** i, _backoff_cap) for i in range(_max_connect - 1)
                        )
                        logger.error(
                            "[%s] Exhausted %d connect attempts over ~%ds, giving up: %s",
                            self.name, _max_connect, total_wait, init_err,
                        )
                        raise

The total backoff across the seven sleeps becomes 1 + 2 + 4 + 8 + 15 + 15 + 15 = 60 seconds (plus whatever time each attempt itself takes), which covers the 10-30 second cold-boot delays described above.
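The arithmetic can be checked with a few lines of Python. This is a sketch only: `backoff_schedule` and `first_successful_attempt` are hypothetical helper names, and the simulation assumes each attempt is instantaneous, so only the sleeps advance the clock.

```python
from typing import List, Optional


def backoff_schedule(max_attempts: int, cap: Optional[int] = None) -> List[int]:
    """Sleeps between attempts: 2**i for each retry, optionally capped."""
    waits = [2 ** i for i in range(max_attempts - 1)]
    return waits if cap is None else [min(w, cap) for w in waits]


def first_successful_attempt(
    ready_at: int, max_attempts: int, cap: Optional[int] = None
) -> Optional[int]:
    """1-based attempt that starts at or after ready_at seconds, else None."""
    elapsed = 0
    waits = backoff_schedule(max_attempts, cap)
    for attempt in range(max_attempts):
        if elapsed >= ready_at:
            return attempt + 1
        if attempt < len(waits):
            elapsed += waits[attempt]
    return None


print(backoff_schedule(3), sum(backoff_schedule(3)))
# [1, 2] 3
print(backoff_schedule(8, cap=15), sum(backoff_schedule(8, cap=15)))
# [1, 2, 4, 8, 15, 15, 15] 60
print(first_successful_attempt(20, 3))          # None
print(first_successful_attempt(20, 8, cap=15))  # 6
```

With a network that becomes usable 20 seconds after start, the current three-attempt budget never reaches a point where an attempt can succeed, while the proposed schedule gets there on its sixth attempt (at about 30 seconds).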

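Handler registration currently lives between builder.build() and initialize(). Because the patch builds a new Application on every attempt, extracting that code to _register_handlers(self) keeps the retry loop readable and ensures handlers are always attached to the current app instance rather than to a discarded one.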

Testing

  1. Cold boot simulation: stop networking (sudo ifconfig en0 down), start hermes, wait 20s, re-enable networking. Should see retries in logs, eventual success.
  2. DoH block simulation: block dns.google and cloudflare-dns.com in /etc/hosts, start hermes. Should fall through to seed IPs as before.
  3. Normal boot: should behave identically (first attempt succeeds).

Impact

  • Affects: every Hermes user with ai.hermes.gateway running at boot via launchd/systemd
  • Severity: silent Telegram outage until manual restart — users often don't notice for hours
  • Risk: low — changes only the retry budget and moves one function call inside the loop

Local workaround (already deployed)

As an interim fix while the PR is in review, the gateway can be wrapped in a DNS-readiness shim. Example for macOS launchd:

#!/bin/bash
VENV_PY="/Users/dg/.hermes/hermes-agent/venv/bin/python"
MAX_WAIT=90
WAITED=0
while ! "$VENV_PY" -c "import socket; socket.getaddrinfo('api.telegram.org', 443)" 2>/dev/null; do
    [ "$WAITED" -ge "$MAX_WAIT" ] && break
    sleep 2
    WAITED=$((WAITED + 2))
done
exec "$VENV_PY" -m hermes_cli.main gateway run --replace
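For comparison, the same polling logic can be written in Python, for example to run at the top of an entry point instead of in a shell wrapper. This is a sketch under assumptions: `wait_for_dns` and its parameters are hypothetical names, and the `sleep` argument exists only so the loop can be tested without real delays.

```python
import socket
import time
from typing import Callable


def wait_for_dns(
    host: str = "api.telegram.org",
    port: int = 443,
    max_wait: float = 90.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll getaddrinfo() until it succeeds or max_wait seconds were slept.

    Returns True if the name resolved and False on timeout. As with the
    shell shim, the caller is expected to start the gateway either way.
    """
    waited = 0.0
    while True:
        try:
            socket.getaddrinfo(host, port)
            return True
        except socket.gaierror:
            if waited >= max_wait:
                return False
            sleep(interval)
            waited += interval
```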

This is a workaround only. The upstream fix is the better long-term solution because it makes Hermes self-healing without requiring users to write launchd wrappers.

Labels: P2 (Medium: degraded but workaround exists), comp/gateway (Gateway runner, session dispatch, delivery), platform/telegram (Telegram bot adapter), type/bug (Something isn't working)
