[Bug/Design]: Telegram fetch stickyAttemptIndex is monotonic — gateway never recovers from transient network failures without restart #77088

@zhengsx

Description

Bug type

Design flaw / stability

Summary

In extensions/telegram/src/fetch.ts, the stickyAttemptIndex closure variable is monotonically non-decreasing — once promoted to a fallback transport (IPv4-only, then pinned fallback IP 149.154.167.220), the fetch stack never returns to the default transport even after the upstream network fully recovers. Combined with connections=10 per origin and keepAliveMaxTimeout=600000ms, a transient network blip reliably degrades the Telegram fetch stack into a stuck state until the whole gateway process is restarted.

On my box (macOS 26.2, Node 25.5.0, openclaw 2026.4.29, behind the GFW with an occasional DC4 blackhole) the gateway saturates roughly once every 12–24 hours with this exact pattern:

  • [telegram] sendChatAction failed: Network request for 'sendChatAction' failed! (20+ repeats)
  • [telegram] fetch fallback: enabling sticky IPv4-only dispatcher (once)
  • [telegram] fetch fallback: DNS-resolved IP unreachable; trying alternative Telegram API IP (once)
  • eventLoopDelayMaxMs=19981.8 eventLoopUtilization=1 cpuCoreRatio=1.004
  • [ws] handshake timeout — in-process WebSocket clients can't connect to the gateway anymore
  • lsof -p <pid> shows 9+ ESTABLISHED sockets to api.telegram.org:443, all presumably stale because the origin pool keepalive is 10 minutes

Once in this state, the process never recovers even when upstream Telegram connectivity is restored — only launchctl kickstart -k fixes it.

Root cause (as I read the code)

File: extensions/telegram/src/fetch.ts (reading from the compiled dist/extensions/telegram/fetch-*.js in 2026.4.29 — original TS file on disk unavailable).

Three compounding design choices:

1. stickyAttemptIndex is monotonic. From resolveTelegramTransport:

let stickyAttemptIndex = 0;
const promoteStickyAttempt = (nextIndex, err, reason) => {
  if (nextIndex <= stickyAttemptIndex || nextIndex >= transportAttempts.length) return false;
  // ...
  stickyAttemptIndex = nextIndex;  // only goes UP, never down
  return true;
};

const resolvedFetch = (async (input, init) => {
  const startIndex = Math.min(stickyAttemptIndex, transportAttempts.length - 1);
  // ... tries startIndex first, on failure walks forward through the list
  for (let nextIndex = startIndex + 1; nextIndex < transportAttempts.length; nextIndex += 1) {
    promoteStickyAttempt(nextIndex, err);
    // ...
  }
});

There is no path back to stickyAttemptIndex = 0 — no success counter, no time-based reset, no periodic probe of the primary transport. Once a single transient failure walks the index to 2 (pinned fallback IP), every subsequent request for the lifetime of the process uses only the fallback IP. If that IP later goes soft-bad (still answering TLS handshake but slow) or the pinned dispatcher's keepalive pool fills with dead sockets, the stack has nowhere to escape to.
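
To make the latch concrete, here is a distilled, self-contained model of that control flow (my reconstruction from the compiled output, not the shipped code; the attempt names are illustrative):

type Attempt = "default" | "ipv4-only" | "pinned-fallback-ip";
const transportAttempts: Attempt[] = ["default", "ipv4-only", "pinned-fallback-ip"];

let stickyAttemptIndex = 0;

async function fetchWithFallback<T>(doFetch: (attempt: Attempt) => Promise<T>): Promise<T> {
  const startIndex = Math.min(stickyAttemptIndex, transportAttempts.length - 1);
  let lastErr: unknown;
  for (let i = startIndex; i < transportAttempts.length; i += 1) {
    try {
      // A success returns immediately and never touches stickyAttemptIndex;
      // that omission is the latch.
      return await doFetch(transportAttempts[i]);
    } catch (err) {
      lastErr = err;
      // A failure promotes the sticky index, and it only ever increases.
      if (i + 1 < transportAttempts.length) stickyAttemptIndex = i + 1;
    }
  }
  throw lastErr;
}

// One ~30s outage walks stickyAttemptIndex to 2; from then on every call starts
// (and ends) at "pinned-fallback-ip", even after "default" is healthy again.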

2. Connection pool too wide, keepalive too long.

const TELEGRAM_DISPATCHER_KEEP_ALIVE_MAX_TIMEOUT_MS = 6e5;    // 10 minutes
const TELEGRAM_DISPATCHER_CONNECTIONS_PER_ORIGIN = 10;

With 10 connections per origin, an upstream flap commonly leaves several sockets in an "ESTABLISHED but dead" state (the remote silently dropped them and the kernel hasn't noticed). Those sockets then occupy slots in the origin pool for up to 10 minutes, during which sendChatAction requests queued on that agent either block on socket acquisition or stall mid-await. Across multiple concurrent sessions this drives eventLoopUtilization to 1.0 and produces the multi-second eventLoopDelayMaxMs I quoted above.
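
For scale, this is roughly what a tighter pool would look like, assuming the dispatcher is an undici Agent (which is what Node's built-in fetch uses under the hood; the option names are real undici options, but I have only seen the compiled constants, and the values are the ones I propose under Expected behavior below):

import { Agent, fetch } from "undici";

// Assumed values: 2-4 connections and a 30-60s keepalive cap shrink the
// dead-socket window to a fraction of the current 10 minutes.
const telegramDispatcher = new Agent({
  connections: 4,              // currently 10 per origin
  keepAliveTimeout: 30_000,    // per-socket idle timeout
  keepAliveMaxTimeout: 60_000, // hard cap; currently 600_000
});

// undici's fetch accepts a per-request dispatcher, so a silently-dropped
// socket can only hold a pool slot for at most the keepalive cap.
async function telegramGet(url: string) {
  return fetch(url, { dispatcher: telegramDispatcher });
}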

3. TELEGRAM_FALLBACK_IPS = ["149.154.167.220"] is a single-point-of-failure fallback. Once the stack has promoted to the fallback-IP attempt, if that single pinned IP also degrades there is no further option in the list — the code returns to the top of the loop with stickyAttemptIndex = 2 and repeats the same broken path forever.

Steps to reproduce

Hard to reproduce deterministically on a clean network, but it reliably happens on a host behind flaky egress (e.g. behind the GFW, or any ISP where the DC4 range 149.154.166.0/23 intermittently blackholes). After ~12h of uptime:

  1. Let the gateway run overnight with a Telegram bot channel configured and at least one active embedded-agent session using it.
  2. During a window where the IP that api.telegram.org resolves to is unreachable for ~30s (a typical GFW flutter), observe the two fetch fallback log lines fire.
  3. Upstream connectivity restores within a minute.
  4. The gateway stays on the fallback path and gradually accumulates [telegram] sendChatAction failed spam and eventLoopDelayMaxMs > 5s until [ws] handshake timeout starts appearing and the gateway becomes unresponsive.

Expected behavior

  • After N consecutive successful fetches (e.g. 5), stickyAttemptIndex should decay back toward 0 so that the cheapest/primary transport is re-probed when the network recovers.
  • Or: a periodic background probe of the primary dispatcher (every 60–120s while sticky > 0) that resets the index on success (sketched after this list).
  • Additionally, connections per origin should be lower (2–4 seems plenty for a Telegram bot) and keepAliveMaxTimeout should be much shorter (30–60s) to bound the dead-socket problem.
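
A minimal sketch of that probe variant, assuming it sits inside resolveTelegramTransport next to the existing stickyAttemptIndex (PROBE_INTERVAL_MS, startPrimaryProbe, and probePrimaryTransport are names of mine, not existing code):

declare let stickyAttemptIndex: number; // the existing closure variable

const PROBE_INTERVAL_MS = 90_000; // anywhere in the 60-120s band suggested above

let probeTimer: NodeJS.Timeout | undefined;

const stopPrimaryProbe = () => {
  if (probeTimer) clearInterval(probeTimer);
  probeTimer = undefined;
};

// probePrimaryTransport is a placeholder: e.g. a cheap getMe forced through
// transportAttempts[0], resolving true on a clean response.
const startPrimaryProbe = (probePrimaryTransport: () => Promise<boolean>) => {
  if (probeTimer) return;
  probeTimer = setInterval(async () => {
    if (stickyAttemptIndex === 0) return stopPrimaryProbe();
    try {
      if (await probePrimaryTransport()) {
        stickyAttemptIndex = 0; // primary is healthy again; undo the promotion
        stopPrimaryProbe();
      }
    } catch {
      // primary still unreachable; stay on the fallback and retry next tick
    }
  }, PROBE_INTERVAL_MS);
  probeTimer.unref(); // don't let the probe keep the process alive
};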

Actual behavior

Once promoted, the stack stays promoted forever; event loop saturates; gateway requires a manual launchctl kickstart -k.

Suggested fix (willing to PR if direction is agreed)

Minimal invasive change in resolveTelegramTransport:

let stickyAttemptIndex = 0;
let consecutiveSuccessOnSticky = 0;
const STICKY_RESET_THRESHOLD = 5;  // or make this configurable

const demoteStickyAttempt = () => {
  if (stickyAttemptIndex === 0) return;
  consecutiveSuccessOnSticky += 1;
  if (consecutiveSuccessOnSticky >= STICKY_RESET_THRESHOLD) {
    log.info(`telegram fetch stack: resetting sticky index ${stickyAttemptIndex} -> 0 after ${consecutiveSuccessOnSticky} consecutive successes`);
    stickyAttemptIndex = 0;
    consecutiveSuccessOnSticky = 0;
  }
};

// in the success branch of resolvedFetch, after a clean response on the start attempt:
demoteStickyAttempt();

// in promoteStickyAttempt, right after stickyAttemptIndex = nextIndex:
consecutiveSuccessOnSticky = 0;

// and in the failure path of resolvedFetch, when an attempt fails without being
// able to promote (already at the last index), also reset the counter so that
// only truly consecutive successes trigger the demotion:
consecutiveSuccessOnSticky = 0;

Orthogonally:

  • expose connections / keepAliveMaxTimeout / the reset threshold / the fallback-IP list as channels.telegram.network.* config knobs (or environment vars following the existing OPENCLAW_TELEGRAM_* pattern, as sketched after this list), so users behind hostile networks can tune without patching dist/.
  • consider adding a second fallback IP in the DC5 range (the current list contains only 149.154.167.220) so that when DC4 is blackholed there is still a second option if .220 degrades.
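
For the environment-variable route, a minimal sketch of what I have in mind (all of these OPENCLAW_TELEGRAM_* names are hypothetical; only the naming pattern exists today, and the defaults mirror the values currently hardcoded in fetch.ts):

// All of these knobs are proposals, not existing configuration.
const envInt = (name: string, fallback: number): number => {
  const raw = process.env[name];
  const parsed = raw === undefined ? NaN : Number.parseInt(raw, 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
};

const connectionsPerOrigin = envInt("OPENCLAW_TELEGRAM_CONNECTIONS", 10);
const keepAliveMaxTimeoutMs = envInt("OPENCLAW_TELEGRAM_KEEPALIVE_MAX_TIMEOUT_MS", 600_000);
const stickyResetThreshold = envInt("OPENCLAW_TELEGRAM_STICKY_RESET_THRESHOLD", 5);
const fallbackIps = (process.env.OPENCLAW_TELEGRAM_FALLBACK_IPS ?? "149.154.167.220")
  .split(",")
  .map((s) => s.trim())
  .filter(Boolean);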

Related issues (different angle, same blast radius)

The common root across these related issues is that the Telegram subsystem has no feedback loop from "upstream is healthy again" back into its internal state: every failure mode is latched.

Environment

  • openclaw 2026.4.29
  • Node v25.5.0
  • macOS 26.2 (arm64)
  • Behind a network that occasionally blackholes the 149.154.166.0/23 DC4 range
