PR: Telegram platform fails silently on cold boot due to insufficient retry budget
Problem
On systems where the Hermes gateway starts at boot time (e.g. macOS launchd, systemd with default ordering), the Telegram platform's connect() routine can fail before the OS network stack is ready:
ERROR telegram.ext.Updater: telegram.error.NetworkError:
httpx.ConnectError: All connection attempts failed
WARNING gateway.platforms.telegram_network:
Primary api.telegram.org connection failed
([Errno 8] nodename nor servname provided, or not known)
Errno 8 is EAI_NONAME from getaddrinfo() — the resolver itself can't resolve the hostname, which happens briefly at cold boot before network is fully up. Once the Python process is running, subsequent resolution attempts also fail because the resolver has cached the failure state.
The gateway does not exit — it logs the error and keeps running in a degraded state where the Telegram platform is silently dead until a manual restart. This is especially bad because:
- Gateway looks healthy to launchctl / systemctl
- No Telegram messages are received
- KeepAlive on non-zero exit does not help because the process stays alive
- Users don't notice until they try to DM the bot
Root cause
Two issues in gateway/platforms/telegram.py connect():
Issue 1: Retry budget is too small for cold boot
# Line ~558-572
_max_connect = 3
for _attempt in range(_max_connect):
    try:
        await self._app.initialize()
        break
    except (NetworkError, TimedOut, OSError) as init_err:
        if _attempt < _max_connect - 1:
            wait = 2 ** _attempt  # 1s, 2s
            await asyncio.sleep(wait)
        else:
            raise
3 attempts with 1s/2s backoff = ~3 seconds total before giving up. At cold boot the network stack can take 10-30 seconds to come up on macOS, and even longer on systems with VPN/proxy startup.
Issue 2: discover_fallback_ips() called once, before the retry loop
# Line ~513-514
if not fallback_ips:
    fallback_ips = await discover_fallback_ips()
discover_fallback_ips() runs once, before the retry loop. At cold boot, DoH queries to dns.google and cloudflare-dns.com also fail, so discover_fallback_ips() falls through to the hardcoded seed 149.154.167.220. The TelegramFallbackTransport is then built with only this stale seed, and every retry inside the loop tries the same dead endpoints. Even if the network comes up shortly afterwards, the gateway has already exhausted its retry budget.
Proposed fix
Two small changes in connect():
- Increase retry budget from 3 attempts (~3s) to 8 attempts (~60s) with capped exponential backoff
- Move discover_fallback_ips() inside the retry loop so each retry gets a fresh discovery attempt when network finally comes up
Patch
# Replace existing block around lines 510-572:
# Build the application (base builder without transport; we'll add it per-attempt)
base_token = self.config.token

# Start polling: retry initialize() for cold-boot DNS races + transient TLS resets
try:
    from telegram.error import NetworkError, TimedOut
except ImportError:
    NetworkError = TimedOut = OSError  # type: ignore[misc,assignment]

_max_connect = 8   # up from 3; covers cold-boot network stack delay
_backoff_cap = 15  # seconds
last_err: Exception | None = None
for _attempt in range(_max_connect):
    try:
        # Re-discover fallback IPs on each attempt: DoH may be down at
        # cold boot but recover before we exhaust our retry budget.
        fallback_ips = self._fallback_ips()
        if not fallback_ips:
            fallback_ips = await discover_fallback_ips()
            if _attempt == 0:
                logger.info(
                    "[%s] Auto-discovered Telegram fallback IPs: %s",
                    self.name,
                    ", ".join(fallback_ips),
                )
        builder = Application.builder().token(base_token)
        if fallback_ips:
            if _attempt == 0:
                logger.warning(
                    "[%s] Telegram fallback IPs active: %s",
                    self.name,
                    ", ".join(fallback_ips),
                )
            transport = TelegramFallbackTransport(fallback_ips)
            request = HTTPXRequest(httpx_kwargs={"transport": transport})
            get_updates_request = HTTPXRequest(httpx_kwargs={"transport": transport})
            builder = builder.request(request).get_updates_request(get_updates_request)
        self._app = builder.build()
        self._bot = self._app.bot
        # (move handler registration to a helper and call here, or keep inline)
        self._register_handlers()
        await self._app.initialize()
        break
    except (NetworkError, TimedOut, OSError) as init_err:
        last_err = init_err
        if _attempt < _max_connect - 1:
            wait = min(2 ** _attempt, _backoff_cap)  # 1,2,4,8,15,15,15
            logger.warning(
                "[%s] Connect attempt %d/%d failed: %s; retrying in %ds",
                self.name, _attempt + 1, _max_connect, init_err, wait,
            )
            await asyncio.sleep(wait)
        else:
            logger.error(
                "[%s] Exhausted %d connect attempts over ~%ds, giving up: %s",
                self.name,
                _max_connect,
                sum(min(2 ** i, _backoff_cap) for i in range(_max_connect - 1)),
                init_err,
            )
            raise
Total retry budget becomes: 1 + 2 + 4 + 8 + 15 + 15 + 15 = 60 seconds, covering typical cold-boot delays.
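The arithmetic can be checked with a few lines of Python. This is only an illustrative sketch: backoff_schedule is a hypothetical helper, not part of the gateway, and it counts sleep time only (each failed initialize() call adds its own latency on top).

```python
from typing import List, Optional


def backoff_schedule(max_attempts: int, cap: Optional[int] = None) -> List[int]:
    """Sleeps taken between attempts: 2**i seconds, optionally capped.

    No sleep follows the final attempt, so the list has
    max_attempts - 1 entries.
    """
    waits = [2 ** i for i in range(max_attempts - 1)]
    if cap is not None:
        waits = [min(w, cap) for w in waits]
    return waits


old = backoff_schedule(3)          # current behaviour
new = backoff_schedule(8, cap=15)  # proposed behaviour

print(old, sum(old))  # [1, 2] 3
print(new, sum(new))  # [1, 2, 4, 8, 15, 15, 15] 60
```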
Handler registration currently lives between builder.build() and initialize() — extracting to _register_handlers(self) keeps the retry loop clean and avoids re-registering handlers on a stale app instance.
Testing
- Cold boot simulation: stop networking (sudo ifconfig en0 down), start hermes, wait 20s, re-enable networking. Should see retries in logs, eventual success.
- DoH block simulation: block dns.google and cloudflare-dns.com in /etc/hosts, start hermes. Should fall through to seed IPs as before.
- Normal boot: should behave identically (first attempt succeeds).
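The cold-boot case could also be approximated without touching the network interface. The sketch below is a simplified model of the retry loop, not the gateway code: connect_with_retry, make_flaky_initialize and simulate are hypothetical names, the fake initialize() raises the same EAI_NONAME error a fixed number of times, and sleeps are recorded instead of awaited so the run is instant.

```python
import asyncio
import socket
from typing import Awaitable, Callable, List


async def connect_with_retry(
    attempt_once: Callable[[], Awaitable[None]],
    sleep: Callable[[float], Awaitable[None]],
    max_connect: int = 8,
    backoff_cap: int = 15,
) -> int:
    """Simplified model of the retry loop; returns the number of attempts used."""
    for attempt in range(max_connect):
        try:
            await attempt_once()
            return attempt + 1
        except OSError:  # socket.gaierror is a subclass of OSError
            if attempt == max_connect - 1:
                raise
            await sleep(min(2 ** attempt, backoff_cap))
    raise AssertionError("max_connect must be at least 1")


def make_flaky_initialize(failures: int) -> Callable[[], Awaitable[None]]:
    """Fake initialize(): raises EAI_NONAME `failures` times, then succeeds."""
    remaining = failures

    async def initialize() -> None:
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            raise socket.gaierror(
                socket.EAI_NONAME, "nodename nor servname provided, or not known"
            )

    return initialize


def simulate(failures: int, max_connect: int) -> List[float]:
    """Run the loop against the fake and return the sleeps it requested."""
    sleeps: List[float] = []

    async def fake_sleep(seconds: float) -> None:
        sleeps.append(seconds)  # record instead of actually waiting

    asyncio.run(
        connect_with_retry(make_flaky_initialize(failures), fake_sleep, max_connect)
    )
    return sleeps
```

With five simulated failures, simulate(5, 8) returns [1, 2, 4, 8, 15] and the sixth attempt succeeds, whereas simulate(5, 3) re-raises socket.gaierror after the third attempt, which mirrors the current behaviour.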
Impact
- Affects: every Hermes user with ai.hermes.gateway running at boot via launchd/systemd
- Severity: silent Telegram outage until manual restart — users often don't notice for hours
- Risk: low — changes only the retry budget and moves one function call inside the loop
Local workaround (already deployed)
As an interim fix while the PR is in review, the gateway can be wrapped in a DNS-readiness shim. Example for macOS launchd:
#!/bin/bash
VENV_PY="/Users/dg/.hermes/hermes-agent/venv/bin/python"
MAX_WAIT=90
WAITED=0
while ! "$VENV_PY" -c "import socket; socket.getaddrinfo('api.telegram.org', 443)" 2>/dev/null; do
    [ "$WAITED" -ge "$MAX_WAIT" ] && break
    sleep 2
    WAITED=$((WAITED + 2))
done
exec "$VENV_PY" -m hermes_cli.main gateway run --replace
This is a workaround only — the upstream fix is correct because it makes Hermes self-healing without requiring users to write launchd wrappers.
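For setups where a shell wrapper is inconvenient, the same readiness check might be written in Python. This is a rough equivalent of the shim above rather than code from the Hermes tree: wait_for_dns and its resolve/sleep parameters are hypothetical and exist mainly so the loop can be exercised without a real network.

```python
import socket
import time
from typing import Callable


def wait_for_dns(
    host: str = "api.telegram.org",
    port: int = 443,
    max_wait: float = 90.0,
    interval: float = 2.0,
    resolve: Callable[..., object] = socket.getaddrinfo,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the resolver until `host` resolves or `max_wait` seconds were slept.

    Returns True once resolution succeeds and False on timeout. As in the
    shell shim, the caller is expected to start the gateway either way.
    """
    waited = 0.0
    while True:
        try:
            resolve(host, port)
            return True
        except socket.gaierror:
            if waited >= max_wait:
                return False
            sleep(interval)
            waited += interval
```

A wrapper script would call wait_for_dns() once and then launch the gateway regardless of the result, so that the in-process retry loop remains the final safety net.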