[Bug/Design]: Telegram fetch stickyAttemptIndex is monotonic — gateway never recovers from transient network failures without restart #77088

@zhengsx

Description

Bug type

Design flaw / stability

Summary

In extensions/telegram/src/fetch.ts, the stickyAttemptIndex closure variable is monotonically non-decreasing — once promoted to a fallback transport (IPv4-only, then pinned fallback IP 149.154.167.220), the fetch stack never returns to the default transport even after the upstream network fully recovers. Combined with connections=10 per origin and keepAliveMaxTimeout=600000ms, a transient network blip reliably degrades the Telegram fetch stack into a stuck state until the whole gateway process is restarted.

On my box (macOS 26.2, Node 25.5.0, openclaw 2026.4.29, behind the GFW with an occasional DC4 blackhole) the gateway saturates roughly once every 12–24 hours with this exact pattern:

  • [telegram] sendChatAction failed: Network request for 'sendChatAction' failed! (20+ repeats)
  • [telegram] fetch fallback: enabling sticky IPv4-only dispatcher (once)
  • [telegram] fetch fallback: DNS-resolved IP unreachable; trying alternative Telegram API IP (once)
  • eventLoopDelayMaxMs=19981.8 eventLoopUtilization=1 cpuCoreRatio=1.004
  • [ws] handshake timeout — in-process WebSocket clients can't connect to the gateway anymore
  • lsof -p <pid> shows 9+ ESTABLISHED sockets to api.telegram.org:443, all presumably stale because the origin pool keepalive is 10 minutes

Once in this state, the process never recovers even when upstream Telegram connectivity is restored — only launchctl kickstart -k fixes it.

Root cause (as I read the code)

File: extensions/telegram/src/fetch.ts (reading from the compiled dist/extensions/telegram/fetch-*.js in 2026.4.29 — original TS file on disk unavailable).

Three compounding design choices:

1. stickyAttemptIndex is monotonic. From resolveTelegramTransport:

let stickyAttemptIndex = 0;
const promoteStickyAttempt = (nextIndex, err, reason) => {
  if (nextIndex <= stickyAttemptIndex || nextIndex >= transportAttempts.length) return false;
  // ...
  stickyAttemptIndex = nextIndex;  // only goes UP, never down
  return true;
};

const resolvedFetch = (async (input, init) => {
  const startIndex = Math.min(stickyAttemptIndex, transportAttempts.length - 1);
  // ... tries startIndex first, on failure walks forward through the list
  for (let nextIndex = startIndex + 1; nextIndex < transportAttempts.length; nextIndex += 1) {
    promoteStickyAttempt(nextIndex, err);
    // ...
  }
});

There is no path back to stickyAttemptIndex = 0 — no success counter, no time-based reset, no periodic probe of the primary transport. Once a single transient failure walks the index to 2 (pinned fallback IP), every subsequent request for the lifetime of the process uses only the fallback IP. If that IP later goes soft-bad (still answering TLS handshake but slow) or the pinned dispatcher's keepalive pool fills with dead sockets, the stack has nowhere to escape to.
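
To make the latch concrete, here is a distilled, self-contained model of that control flow (my reconstruction from the compiled output, not the shipped code; the attempt names are illustrative):

type Attempt = "default" | "ipv4-only" | "pinned-fallback-ip";
const transportAttempts: Attempt[] = ["default", "ipv4-only", "pinned-fallback-ip"];

let stickyAttemptIndex = 0;

async function fetchWithFallback<T>(doFetch: (attempt: Attempt) => Promise<T>): Promise<T> {
  const startIndex = Math.min(stickyAttemptIndex, transportAttempts.length - 1);
  let lastErr: unknown;
  for (let i = startIndex; i < transportAttempts.length; i += 1) {
    try {
      // A success returns immediately and never touches stickyAttemptIndex;
      // that omission is the latch.
      return await doFetch(transportAttempts[i]);
    } catch (err) {
      lastErr = err;
      // A failure promotes the sticky index, and it only ever increases.
      if (i + 1 < transportAttempts.length) stickyAttemptIndex = i + 1;
    }
  }
  throw lastErr;
}

// One ~30s outage walks stickyAttemptIndex to 2; from then on every call starts
// (and ends) at "pinned-fallback-ip", even after "default" is healthy again.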

2. Connection pool too wide, keepalive too long.

const TELEGRAM_DISPATCHER_KEEP_ALIVE_MAX_TIMEOUT_MS = 6e5;    // 10 minutes
const TELEGRAM_DISPATCHER_CONNECTIONS_PER_ORIGIN = 10;

With 10 connections per origin, an upstream flap commonly leaves several sockets in an "ESTABLISHED but dead" state (the remote silently dropped them and the kernel hasn't noticed). Those sockets then occupy slots in the origin pool for up to 10 minutes, during which sendChatAction requests queued on that agent either block on socket acquisition or stall mid-await. Across multiple concurrent sessions this drives eventLoopUtilization to 1.0 and produces the multi-second eventLoopDelayMaxMs I quoted above.
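
For scale, this is roughly what a tighter pool would look like, assuming the dispatcher is an undici Agent (which is what Node's built-in fetch uses under the hood; the option names are real undici options, but I have only seen the compiled constants, and the values are the ones I propose under Expected behavior below):

import { Agent, fetch } from "undici";

// Assumed values: 2-4 connections and a 30-60s keepalive cap shrink the
// dead-socket window to a fraction of the current 10 minutes.
const telegramDispatcher = new Agent({
  connections: 4,              // currently 10 per origin
  keepAliveTimeout: 30_000,    // per-socket idle timeout
  keepAliveMaxTimeout: 60_000, // hard cap; currently 600_000
});

// undici's fetch accepts a per-request dispatcher, so a silently-dropped
// socket can only hold a pool slot for at most the keepalive cap.
async function telegramGet(url: string) {
  return fetch(url, { dispatcher: telegramDispatcher });
}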

3. TELEGRAM_FALLBACK_IPS = ["149.154.167.220"] is a single-point-of-failure fallback. Once the stack has promoted to the fallback-IP attempt, if that single pinned IP also degrades there is no further option in the list — the code returns to the top of the loop with stickyAttemptIndex = 2 and repeats the same broken path forever.

Steps to reproduce

Hard to reproduce deterministically on a clean network, but it reliably happens on a host behind flaky egress (e.g. behind the GFW, or any ISP where the DC4 range 149.154.166.0/23 intermittently blackholes). After ~12h of uptime:

  1. Let the gateway run overnight with a Telegram bot channel configured and at least one active embedded-agent session using it.
  2. During a window where the IP that api.telegram.org resolves to is unreachable for ~30s (a typical GFW flutter), observe the two fetch fallback log lines fire.
  3. Upstream connectivity restores within a minute.
  4. The gateway stays on the fallback path and gradually accumulates [telegram] sendChatAction failed spam and eventLoopDelayMaxMs > 5s until [ws] handshake timeout starts appearing and the gateway becomes unresponsive.

Expected behavior

  • After N consecutive successful fetches (e.g. 5), stickyAttemptIndex should decay back toward 0 so that the cheapest/primary transport is re-probed when the network recovers.
  • Or: a periodic background probe of the primary dispatcher (every 60–120s while sticky > 0) that resets the index on success (sketched after this list).
  • Additionally, connections per origin should be lower (2–4 seems plenty for a Telegram bot) and keepAliveMaxTimeout should be much shorter (30–60s) to bound the dead-socket problem.
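
A minimal sketch of that probe variant, assuming it sits inside resolveTelegramTransport next to the existing stickyAttemptIndex (PROBE_INTERVAL_MS, startPrimaryProbe, and probePrimaryTransport are names of mine, not existing code):

declare let stickyAttemptIndex: number; // the existing closure variable

const PROBE_INTERVAL_MS = 90_000; // anywhere in the 60-120s band suggested above

let probeTimer: NodeJS.Timeout | undefined;

const stopPrimaryProbe = () => {
  if (probeTimer) clearInterval(probeTimer);
  probeTimer = undefined;
};

// probePrimaryTransport is a placeholder: e.g. a cheap getMe forced through
// transportAttempts[0], resolving true on a clean response.
const startPrimaryProbe = (probePrimaryTransport: () => Promise<boolean>) => {
  if (probeTimer) return;
  probeTimer = setInterval(async () => {
    if (stickyAttemptIndex === 0) return stopPrimaryProbe();
    try {
      if (await probePrimaryTransport()) {
        stickyAttemptIndex = 0; // primary is healthy again; undo the promotion
        stopPrimaryProbe();
      }
    } catch {
      // primary still unreachable; stay on the fallback and retry next tick
    }
  }, PROBE_INTERVAL_MS);
  probeTimer.unref(); // don't let the probe keep the process alive
};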

Actual behavior

Once promoted, the stack stays promoted forever; event loop saturates; gateway requires a manual launchctl kickstart -k.

Suggested fix (willing to PR if direction is agreed)

Minimal invasive change in resolveTelegramTransport:

let stickyAttemptIndex = 0;
let consecutiveSuccessOnSticky = 0;
const STICKY_RESET_THRESHOLD = 5;  // or make this configurable

const demoteStickyAttempt = () => {
  if (stickyAttemptIndex === 0) return;
  consecutiveSuccessOnSticky += 1;
  if (consecutiveSuccessOnSticky >= STICKY_RESET_THRESHOLD) {
    log.info(`telegram fetch stack: resetting sticky index ${stickyAttemptIndex} -> 0 after ${consecutiveSuccessOnSticky} consecutive successes`);
    stickyAttemptIndex = 0;
    consecutiveSuccessOnSticky = 0;
  }
};

// in the success branch of resolvedFetch, after a clean response on the start attempt:
demoteStickyAttempt();

// in promoteStickyAttempt, right after stickyAttemptIndex = nextIndex:
consecutiveSuccessOnSticky = 0;

// and in the failure path of resolvedFetch, when an attempt fails without being
// able to promote (already at the last index), also reset the counter so that
// only truly consecutive successes trigger the demotion:
consecutiveSuccessOnSticky = 0;

Orthogonally:

  • expose connections / keepAliveMaxTimeout / the reset threshold / the fallback-IP list as channels.telegram.network.* config knobs (or environment vars following the existing OPENCLAW_TELEGRAM_* pattern, as sketched after this list), so users behind hostile networks can tune without patching dist/.
  • consider adding a second fallback IP in the DC5 range (the current list contains only 149.154.167.220) so that when DC4 is blackholed there is still a second option if .220 degrades.
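
For the environment-variable route, a minimal sketch of what I have in mind (all of these OPENCLAW_TELEGRAM_* names are hypothetical; only the naming pattern exists today, and the defaults mirror the values currently hardcoded in fetch.ts):

// All of these knobs are proposals, not existing configuration.
const envInt = (name: string, fallback: number): number => {
  const raw = process.env[name];
  const parsed = raw === undefined ? NaN : Number.parseInt(raw, 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
};

const connectionsPerOrigin = envInt("OPENCLAW_TELEGRAM_CONNECTIONS", 10);
const keepAliveMaxTimeoutMs = envInt("OPENCLAW_TELEGRAM_KEEPALIVE_MAX_TIMEOUT_MS", 600_000);
const stickyResetThreshold = envInt("OPENCLAW_TELEGRAM_STICKY_RESET_THRESHOLD", 5);
const fallbackIps = (process.env.OPENCLAW_TELEGRAM_FALLBACK_IPS ?? "149.154.167.220")
  .split(",")
  .map((s) => s.trim())
  .filter(Boolean);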

Related issues (different angle, same blast radius)

The common root across these related issues is that the Telegram subsystem has no feedback loop from "upstream is healthy again" back into its internal state: every failure mode is latched.

Environment

  • openclaw 2026.4.29
  • Node v25.5.0
  • macOS 26.2 (arm64)
  • Behind a network that occasionally blackholes the 149.154.166.0/23 DC4 range
