Telegram resolver failure caching + missing degraded-state detection after cold-boot resolver exhaustion


# PR: Telegram platform fails silently on cold boot due to insufficient retry budget

## Problem

On systems where the Hermes gateway starts at boot time (e.g. macOS launchd, systemd with default ordering), the Telegram platform's `connect()` routine can fail before the OS network stack is ready:

```
ERROR telegram.ext.Updater: telegram.error.NetworkError:
  httpx.ConnectError: All connection attempts failed
WARNING gateway.platforms.telegram_network:
  Primary api.telegram.org connection failed
  ([Errno 8] nodename nor servname provided, or not known)
```

On macOS, `Errno 8` is `EAI_NONAME` from `getaddrinfo()`: the resolver itself cannot resolve the hostname, which happens briefly at cold boot before the network is fully up. Once the Python process is running, subsequent resolution attempts **also** fail because the resolver has cached the failure state.
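As a rough illustration of how this failure surfaces in Python, the sketch below classifies a lookup result by checking `socket.gaierror.errno` against `socket.EAI_NONAME`. The helper name `resolution_status` and its injectable `resolver` parameter are assumptions made for this example; they are not part of Hermes.

```python
import socket


def resolution_status(host, port=443, resolver=socket.getaddrinfo):
    """Return 'ok', 'unresolvable', or 'other-error' for one hostname lookup.

    `resolver` is injectable so the behaviour can be exercised without a
    network; by default it is the real socket.getaddrinfo.
    """
    try:
        resolver(host, port)
    except socket.gaierror as exc:
        # EAI_NONAME is the "name not known" code; compare against the
        # symbolic constant because its numeric value is platform-specific.
        if exc.errno == socket.EAI_NONAME:
            return "unresolvable"
        return "other-error"
    return "ok"
```

Calling `resolution_status("api.telegram.org")` during the cold-boot window would be expected to return `"unresolvable"` for the error shown in the log above.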

The gateway does not exit — it logs the error and keeps running in a degraded state where the Telegram platform is silently dead until a manual restart. This is especially bad because:

1. Gateway looks healthy to `launchctl` / `systemctl`
2. No Telegram messages are received
3. `KeepAlive` on non-zero exit does not help because the process stays alive
4. Users don't notice until they try to DM the bot
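
To make the supervisor problem concrete, here is a minimal sketch of the contract that `KeepAlive` relies on: a permanent connect failure only becomes visible to `launchctl` / `systemctl` if it turns into a non-zero exit code. The function `start_platform_or_fail` is hypothetical and is not how the gateway is currently structured; it also assumes the failure surfaces as an `OSError`.

```python
import asyncio


async def start_platform_or_fail(connect, name="telegram"):
    """Return 0 if connect() succeeds, 1 if it fails with a network-level error.

    `connect` is any zero-argument coroutine function (hypothetical stand-in
    for a platform adapter's connect routine).
    """
    try:
        await connect()
    except OSError as exc:
        # Surfacing the failure as an exit code lets the supervisor restart us.
        print(f"[{name}] connect failed permanently: {exc}")
        return 1
    return 0
```

A process entry point could then pass the result to `sys.exit(...)`, so that a dead platform no longer looks like a healthy process.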

## Root cause

Two issues in [`gateway/platforms/telegram.py`](https://github.com/NousResearch/hermes-agent/blob/main/gateway/platforms/telegram.py) `connect()`:

### Issue 1: Retry budget is too small for cold boot

```python
# Line ~558-572
_max_connect = 3
for _attempt in range(_max_connect):
    try:
        await self._app.initialize()
        break
    except (NetworkError, TimedOut, OSError) as init_err:
        if _attempt < _max_connect - 1:
            wait = 2 ** _attempt   # 1s, 2s
            await asyncio.sleep(wait)
        else:
            raise
```

3 attempts with 1s/2s backoff = **~3 seconds total** before giving up. At cold boot the network stack can take 10-30 seconds to come up on macOS, and even longer on systems with VPN/proxy startup.

### Issue 2: `discover_fallback_ips()` called once, before the retry loop

```python
# Line ~513-514
if not fallback_ips:
    fallback_ips = await discover_fallback_ips()
```

Called **before** the retry loop. At cold boot, DoH queries to `dns.google` and `cloudflare-dns.com` also fail, so `discover_fallback_ips()` falls through to the hardcoded seed `149.154.167.220`. The `TelegramFallbackTransport` is then built with only this stale seed, and every retry inside the loop tries the same dead endpoints. By the time the network finally comes up, the gateway has already exhausted its budget.

## Proposed fix

Two small changes in `connect()`:

1. **Increase retry budget** from 3 attempts (~3s) to 8 attempts (~60s) with capped exponential backoff
2. **Move `discover_fallback_ips()` inside the retry loop** so each retry gets a fresh discovery attempt when network finally comes up
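
The shape of these two changes can be sketched independently of the Telegram library. In the stand-alone example below, `connect_with_rediscovery`, `discover`, and `initialize` are illustrative names (assumptions, not Hermes APIs), and `sleep` is injectable only so the loop can be tested without real delays.

```python
import asyncio


async def connect_with_rediscovery(discover, initialize, max_attempts=8,
                                   backoff_cap=15, sleep=asyncio.sleep):
    """Retry initialize(ips), calling discover() afresh before every attempt.

    Returns the number of attempts used; re-raises the last error when the
    budget is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            ips = await discover()      # fresh discovery on each attempt
            await initialize(ips)
            return attempt + 1
        except OSError:
            if attempt == max_attempts - 1:
                raise                   # budget exhausted, propagate
            # Capped exponential backoff: 1, 2, 4, 8, then backoff_cap.
            await sleep(min(2 ** attempt, backoff_cap))
```

Because discovery happens inside the loop, an attempt made after the network recovers can pick up fresh IPs instead of reusing whatever was found on the first try.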

### Patch

```python
# Replace existing block around lines 510-572:

            # Build the application (base builder without transport; we'll add it per-attempt)
            base_token = self.config.token

            # Start polling — retry initialize() for cold-boot DNS races + transient TLS resets
            try:
                from telegram.error import NetworkError, TimedOut
            except ImportError:
                NetworkError = TimedOut = OSError  # type: ignore[misc,assignment]

            _max_connect = 8  # up from 3 — covers cold-boot network stack delay
            _backoff_cap = 15  # seconds

            last_err: Exception | None = None
            for _attempt in range(_max_connect):
                try:
                    # Re-discover fallback IPs on each attempt — DoH may be down at
                    # cold boot but recover before we exhaust our retry budget.
                    fallback_ips = self._fallback_ips()
                    if not fallback_ips:
                        fallback_ips = await discover_fallback_ips()
                        if _attempt == 0:
                            logger.info(
                                "[%s] Auto-discovered Telegram fallback IPs: %s",
                                self.name,
                                ", ".join(fallback_ips),
                            )

                    builder = Application.builder().token(base_token)
                    if fallback_ips:
                        if _attempt == 0:
                            logger.warning(
                                "[%s] Telegram fallback IPs active: %s",
                                self.name,
                                ", ".join(fallback_ips),
                            )
                        transport = TelegramFallbackTransport(fallback_ips)
                        request = HTTPXRequest(httpx_kwargs={"transport": transport})
                        get_updates_request = HTTPXRequest(httpx_kwargs={"transport": transport})
                        builder = builder.request(request).get_updates_request(get_updates_request)

                    self._app = builder.build()
                    self._bot = self._app.bot

                    # (move handler registration to a helper and call here, or keep inline)
                    self._register_handlers()

                    await self._app.initialize()
                    break
                except (NetworkError, TimedOut, OSError) as init_err:
                    last_err = init_err
                    if _attempt < _max_connect - 1:
                        wait = min(2 ** _attempt, _backoff_cap)  # 1,2,4,8,15,15,15
                        logger.warning(
                            "[%s] Connect attempt %d/%d failed: %s — retrying in %ds",
                            self.name, _attempt + 1, _max_connect, init_err, wait,
                        )
                        await asyncio.sleep(wait)
                    else:
                        logger.error(
                            "[%s] Exhausted %d connect attempts over ~%ds, giving up: %s",
                            self.name, _max_connect, sum(min(2**i, _backoff_cap) for i in range(_max_connect-1)), init_err,
                        )
                        raise
```

Total retry budget becomes: 1 + 2 + 4 + 8 + 15 + 15 + 15 = **60 seconds**, covering typical cold-boot delays.
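
The arithmetic for both budgets can be checked with a few lines of Python. The helper `backoff_schedule` is a name introduced here for illustration; it mirrors the `2 ** _attempt` and `min(2 ** _attempt, _backoff_cap)` expressions used in the existing code and in the patch.

```python
def backoff_schedule(max_attempts, cap=None):
    """Sleep durations (seconds) between attempts: 2**i, optionally capped."""
    waits = [2 ** i for i in range(max_attempts - 1)]
    if cap is not None:
        waits = [min(w, cap) for w in waits]
    return waits


print(backoff_schedule(3), sum(backoff_schedule(3)))
# prints: [1, 2] 3
print(backoff_schedule(8, cap=15), sum(backoff_schedule(8, cap=15)))
# prints: [1, 2, 4, 8, 15, 15, 15] 60
```

These totals count only the sleep time between attempts; the time each failed attempt spends before raising adds to the overall window.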

Handler registration currently lives between `builder.build()` and `initialize()`; extracting it to `_register_handlers(self)` keeps the retry loop clean and avoids re-registering handlers on a stale app instance.

## Testing

1. **Cold boot simulation**: stop networking (`sudo ifconfig en0 down`), start hermes, wait 20s, re-enable networking. Should see retries in logs, eventual success.
2. **DoH block simulation**: block `dns.google` and `cloudflare-dns.com` in `/etc/hosts`, start hermes. Should fall through to seed IPs as before.
3. **Normal boot**: should behave identically (first attempt succeeds).

## Impact

- **Affects**: every Hermes user with `ai.hermes.gateway` running at boot via launchd/systemd
- **Severity**: silent Telegram outage until manual restart — users often don't notice for hours
- **Risk**: low — changes only the retry budget and moves one function call inside the loop

## Local workaround (already deployed)

As an interim fix while the PR is in review, the gateway can be wrapped in a DNS-readiness shim. Example for macOS launchd:

```bash
#!/bin/bash
VENV_PY="/Users/dg/.hermes/hermes-agent/venv/bin/python"
MAX_WAIT=90
WAITED=0
while ! "$VENV_PY" -c "import socket; socket.getaddrinfo('api.telegram.org', 443)" 2>/dev/null; do
    [ "$WAITED" -ge "$MAX_WAIT" ] && break
    sleep 2
    WAITED=$((WAITED + 2))
done
exec "$VENV_PY" -m hermes_cli.main gateway run --replace
```

This is a workaround only — the upstream fix is correct because it makes Hermes self-healing without requiring users to write launchd wrappers.

