fix(telegram): increase cold-boot retry budget and refresh fallback IPs #5770

Closed
Bartok9 wants to merge 1 commit into NousResearch:main from Bartok9:fix/telegram-coldboot-retry-budget

Bartok9 commented Apr 7, 2026

Problem

On systems where Hermes starts at boot time (macOS launchd, systemd), the Telegram platform's connect() can fail before the OS network stack is ready:

ERROR telegram.ext.Updater: telegram.error.NetworkError:
  httpx.ConnectError: All connection attempts failed

The gateway stays alive but Telegram is silently dead until a manual restart. This is especially bad because:

  • Gateway looks healthy to launchctl / systemctl
  • No Telegram messages are received
  • KeepAlive doesn't help because the process stays alive
  • Users often don't notice for hours

Root Causes

  1. Retry budget too small: 3 attempts with 1s and 2s backoff (~3s total), versus the 10-30+ seconds a cold-boot network stack can take to come up
  2. discover_fallback_ips() is called once, before the retry loop: at cold boot the DoH queries fail too, and that failed result is then reused by every subsequent attempt
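
The arithmetic behind the first root cause can be checked with a short sketch. The helper name `backoff_delays` is hypothetical and not part of Hermes; it assumes a doubling delay starting at 1 second, with one sleep between each pair of attempts and none after the last attempt.

```python
from typing import List, Optional


def backoff_delays(attempts: int, cap: Optional[int] = None) -> List[int]:
    """Return the delays (in seconds) slept between attempts.

    Hypothetical helper: assumes delays double from 1 second and that
    nothing is slept after the final attempt, so there are attempts - 1 delays.
    """
    delays = [2 ** i for i in range(attempts - 1)]
    if cap is not None:
        delays = [min(d, cap) for d in delays]
    return delays


old = backoff_delays(3)           # the budget described as too small
new = backoff_delays(8, cap=15)   # the budget proposed in this PR

print(old, sum(old))  # [1, 2] 3
print(new, sum(new))  # [1, 2, 4, 8, 15, 15, 15] 60
```

Under these assumptions the old budget gives up after about 3 seconds of waiting, well short of the 10-30+ second cold-boot window, while the capped 8-attempt schedule waits 60 seconds in total.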

Changes

  • Increase retry budget to 8 attempts with capped exponential backoff: delays of 1, 2, 4, 8, 15, 15, 15 seconds between attempts (~60s total)
  • Move fallback IP discovery and app building inside the retry loop so each attempt gets fresh network state
  • Log exhaustion clearly before raising
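
The three changes above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `connect_with_retry` and `attempt_connect` are hypothetical names, `OSError` stands in for the real network error type, and the injectable `sleep` parameter exists only so the sketch can be exercised without waiting. The idea is that `attempt_connect` does all per-attempt work (fallback IP discovery, app building, connecting), so nothing from a failed attempt is reused.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Optional

logger = logging.getLogger("telegram_connect_sketch")


async def connect_with_retry(
    attempt_connect: Callable[[], Awaitable[None]],
    max_attempts: int = 8,
    cap_s: int = 15,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> int:
    """Run attempt_connect until it succeeds; return the attempt that worked.

    attempt_connect is a hypothetical callable that should redo fallback IP
    discovery and app building on every call, so each attempt sees fresh
    network state.
    """
    last_error: Optional[BaseException] = None
    for attempt in range(1, max_attempts + 1):
        try:
            await attempt_connect()
            return attempt
        except OSError as exc:  # stand-in for the real network error type
            last_error = exc
            if attempt < max_attempts:
                # Exponential backoff, capped: 1, 2, 4, 8, 15, 15, 15
                delay = min(2 ** (attempt - 1), cap_s)
                logger.warning(
                    "Telegram connect attempt %d/%d failed (%s); retrying in %ss",
                    attempt, max_attempts, exc, delay,
                )
                await sleep(delay)

    # Log exhaustion clearly before raising, so the dead platform is visible.
    logger.error("Telegram connect failed after %d attempts; giving up", max_attempts)
    raise ConnectionError(
        f"Telegram connect failed after {max_attempts} attempts"
    ) from last_error
```

A caller would pass a coroutine function that performs one full connection attempt, for example `await connect_with_retry(my_attempt)`, where `my_attempt` is whatever wraps discovery, app building, and connecting in the real adapter.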

Testing

  1. Cold boot simulation: run sudo ifconfig en0 down, start hermes, wait 20s, then re-enable the interface (sudo ifconfig en0 up). Expect several failed retries followed by a successful connect.
  2. Normal boot: First attempt succeeds, no change in behavior.
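
The two manual scenarios above can also be approximated offline with simulated time. This is only a sketch under simplifying assumptions: `first_successful_attempt` is a hypothetical helper, each attempt is treated as instantaneous, and the only time that passes is the backoff between attempts.

```python
from typing import Optional


def first_successful_attempt(
    network_ready_at_s: int, attempts: int, cap_s: Optional[int] = None
) -> Optional[int]:
    """Return the 1-based attempt that succeeds, or None if the budget runs out.

    Simulated clock only: an attempt succeeds if the network is already up
    when it starts; otherwise the doubling (optionally capped) delay elapses.
    """
    now = 0
    for attempt in range(1, attempts + 1):
        if now >= network_ready_at_s:
            return attempt
        delay = 2 ** (attempt - 1)
        if cap_s is not None:
            delay = min(delay, cap_s)
        now += delay
    return None


# Cold boot: network comes up 20 seconds after start.
print(first_successful_attempt(20, attempts=3))            # None
print(first_successful_attempt(20, attempts=8, cap_s=15))  # 6

# Normal boot: network is already up, so the first attempt succeeds.
print(first_successful_attempt(0, attempts=8, cap_s=15))   # 1
```

In this model the attempts start at 0, 1, 3, 7, 15, 30, 45, and 60 seconds, so a network that appears at 20 seconds is picked up by the sixth attempt, while the 3-attempt budget is exhausted before then.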

Impact

  • Low risk: only changes the retry budget and moves fallback IP discovery and app building inside the retry loop
  • Affects every user who starts Hermes at boot via launchd/systemd

Fixes #5729

On systems where Hermes starts at boot time (launchd, systemd), the
Telegram platform's connect() can fail before the OS network stack is
ready. The gateway stays alive but Telegram is silently dead.

Root causes:
1. Retry budget too small (3 attempts / ~3s) for cold boot (10-30s+)
2. discover_fallback_ips() called once before retry loop, caching the
   failure state for all subsequent attempts

Changes:
- Increase retry budget to 8 attempts (~60s total) with capped backoff
- Move fallback IP discovery and app building inside the retry loop
  so each attempt gets fresh network state
- Log exhaustion clearly before raising

Total backoff: 1+2+4+8+15+15+15 = ~60 seconds, covering typical delays.

Fixes NousResearch#5729
teknium1 pushed a commit that referenced this pull request Apr 16, 2026
Bump connect retry attempts from 3 to 8 and cap exponential backoff at
15 seconds. Old budget: 3 attempts, 1+2+4=7s total — insufficient for
cold boot on slow networks or embedded devices. New budget: 8 attempts,
1+2+4+8+15+15+15=~60s total.

Inspired by PR #5770 by @Bartok9 (re-implemented against current main
since original was 913 commits stale with conflicts).
@teknium1

Merged via #10947. Your core ideas (more retries + backoff cap) were re-implemented against current main with your authorship preserved. Thanks @Bartok9!

teknium1 closed this Apr 16, 2026
lauchiwa pushed a commit to lauchiwa/hermes-agent that referenced this pull request Apr 17, 2026
(cherry picked from commit f055907)
ulasbilgen pushed a commit to ulasbilgen/hermes-adhd-agent that referenced this pull request May 1, 2026
aj-nt pushed a commit to aj-nt/hermes-agent that referenced this pull request May 1, 2026
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026

Development

Successfully merging this pull request may close these issues.

Telegram resolver failure caching + missing degraded-state detection after cold-boot resolver exhaustion
