## Bug type

Design flaw / stability

## Summary
In `extensions/telegram/src/fetch.ts`, the `stickyAttemptIndex` closure variable is monotonically non-decreasing: once promoted to a fallback transport (IPv4-only, then the pinned fallback IP `149.154.167.220`), the fetch stack never returns to the default transport, even after the upstream network fully recovers. Combined with `connections=10` per origin and `keepAliveMaxTimeout=600000` ms, a transient network blip reliably degrades the Telegram fetch stack into a stuck state until the whole gateway process is restarted.
On my box (macOS 26.2, Node 25.5.0, openclaw 2026.4.29, behind the GFW with an occasional DC4 blackhole) the gateway saturates roughly once every 12–24 hours with this exact pattern:
- `[telegram] sendChatAction failed: Network request for 'sendChatAction' failed!` (20+ repeats)
- `[telegram] fetch fallback: enabling sticky IPv4-only dispatcher` (once)
- `[telegram] fetch fallback: DNS-resolved IP unreachable; trying alternative Telegram API IP` (once)
- `eventLoopDelayMaxMs=19981.8 eventLoopUtilization=1 cpuCoreRatio=1.004`
- `[ws] handshake timeout` — in-process WebSocket clients can't connect to the gateway anymore
- `lsof -p <pid>` shows 9+ ESTABLISHED sockets to `api.telegram.org:443`, all presumably stale because the origin-pool keepalive is 10 minutes

Once in this state, the process never recovers even when upstream Telegram connectivity is restored — only `launchctl kickstart -k` fixes it.
## Root cause (as I read the code)
File: `extensions/telegram/src/fetch.ts` (reading from the compiled `dist/extensions/telegram/fetch-*.js` in 2026.4.29 — original TS file on disk unavailable).
Three compounding design choices:
1. `stickyAttemptIndex` is monotonic. From `resolveTelegramTransport`:

```ts
let stickyAttemptIndex = 0;

const promoteStickyAttempt = (nextIndex, err, reason) => {
  if (nextIndex <= stickyAttemptIndex || nextIndex >= transportAttempts.length) return false;
  // ...
  stickyAttemptIndex = nextIndex; // only goes UP, never down
  return true;
};

const resolvedFetch = (async (input, init) => {
  const startIndex = Math.min(stickyAttemptIndex, transportAttempts.length - 1);
  // ... tries startIndex first; on failure, walks forward through the list
  for (let nextIndex = startIndex + 1; nextIndex < transportAttempts.length; nextIndex += 1) {
    promoteStickyAttempt(nextIndex, err);
    // ...
  }
});
```
There is no path back to `stickyAttemptIndex = 0` — no success counter, no time-based reset, no periodic probe of the primary transport. Once a single transient failure walks the index to 2 (pinned fallback IP), every subsequent request for the lifetime of the process uses only the fallback IP. If that IP later goes soft-bad (still answering the TLS handshake but slow), or the pinned dispatcher's keepalive pool fills with dead sockets, the stack has nowhere to escape to.
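A stripped-down reduction of the quoted logic makes the latch visible. The names mirror the excerpt above, but this is an illustrative sketch, not the actual code:

```ts
// Hypothetical reduction of the sticky-index logic: three transport attempts,
// an index that only ever moves forward.
const transportAttempts: string[] = ["default", "ipv4-only", "pinned-fallback-ip"];
let stickyAttemptIndex = 0;

// Mirrors promoteStickyAttempt: rejects any index at or below the current one.
const promoteStickyAttempt = (nextIndex: number): boolean => {
  if (nextIndex <= stickyAttemptIndex || nextIndex >= transportAttempts.length) return false;
  stickyAttemptIndex = nextIndex;
  return true;
};

// One transient blip promotes all the way to the pinned fallback IP...
promoteStickyAttempt(1);
promoteStickyAttempt(2);

// ...and nothing in the request path ever lowers the index again, so every
// later request starts from the fallback attempt.
console.log(transportAttempts[stickyAttemptIndex]); // "pinned-fallback-ip"
```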
2. Connection pool too wide, keepalive too long.

```ts
const TELEGRAM_DISPATCHER_KEEP_ALIVE_MAX_TIMEOUT_MS = 6e5; // 10 minutes
const TELEGRAM_DISPATCHER_CONNECTIONS_PER_ORIGIN = 10;
```
With 10 connections per origin, when the upstream flaps it's common for several sockets to go into an "ESTABLISHED but dead" state (the remote silently dropped them and the kernel hasn't noticed). They then occupy slots in the origin pool for up to 10 minutes, during which `sendChatAction` requests routed through that agent block on socket acquisition or stall inside `await`. Across multiple concurrent sessions this drives `eventLoopUtilization` to 1.0 and produces the multi-second `eventLoopDelayMaxMs` quoted above.
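If the dispatcher is an undici `Agent` (which the constant names suggest, though I haven't confirmed it against the TS source), the tighter bounds argued for below would look roughly like this. The option names are undici's; the values are suggestions, not the current ones:

```ts
import { Agent } from "undici";

// Hypothetical tightened dispatcher config. A narrower pool plus a shorter
// keepalive ceiling bounds how long dead ESTABLISHED sockets can occupy
// origin-pool slots after an upstream flap.
const telegramDispatcher = new Agent({
  connections: 4,              // was 10 per origin
  keepAliveTimeout: 30_000,    // start reaping idle sockets after 30s
  keepAliveMaxTimeout: 60_000, // was 600_000 (10 minutes)
});
```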
3. `TELEGRAM_FALLBACK_IPS = ["149.154.167.220"]` is a single-point-of-failure fallback. Once the stack has promoted to the fallback-IP attempt, if that single pinned IP also degrades there is no further option in the list — the code returns to the top of the loop with `stickyAttemptIndex = 2` and repeats the same broken path forever.
## Steps to reproduce
Hard to reproduce deterministically on a clean network, but reliably happens on a host behind a flaky egress (e.g. behind the GFW, or any ISP where the DC4 range `149.154.166.0/24` intermittently blackholes). After ~12h of uptime:
- Let the gateway run overnight with a Telegram bot channel configured and at least one active embedded-agent session using it.
- During a period where `api.telegram.org`'s DNS result is unreachable for ~30s (a typical GFW flutter), observe the two `fetch fallback` log lines fire.
- Upstream connectivity restores within a minute.
- The gateway stays on the fallback path and gradually accumulates `[telegram] sendChatAction failed` spam and `eventLoopDelayMaxMs` > 5s until `[ws] handshake timeout` starts appearing and the gateway becomes unresponsive.
## Expected behavior
- After N consecutive successful fetches (e.g. 5), `stickyAttemptIndex` should decay back toward 0 so that the cheapest/primary transport is re-probed when the network recovers.
- Or: a periodic background probe of the primary dispatcher (every 60–120s while the sticky index is > 0) that resets the index on success.
- Additionally, `connections` per origin should be lower (2–4 seems plenty for a Telegram bot) and `keepAliveMaxTimeout` should be much shorter (30–60s) to bound the dead-socket problem.
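The periodic-probe variant could be sketched like this. Every name here is an illustrative stub; nothing below exists in the codebase:

```ts
let stickyAttemptIndex = 2; // latched on the pinned-fallback attempt

// Stand-in for a cheap request (e.g. getMe) forced through the default
// dispatcher; in real code this would throw while the primary path is down.
async function probePrimaryTransport(): Promise<void> {}

// One probe tick: if we are latched on a fallback and the primary transport
// answers, un-latch; otherwise stay put and try again next interval.
async function probeTick(): Promise<void> {
  if (stickyAttemptIndex === 0) return; // nothing to recover from
  try {
    await probePrimaryTransport();
    stickyAttemptIndex = 0; // primary healthy again: un-latch
  } catch {
    // still unreachable; stay on the fallback transport
  }
}

// In the real code this would be wired up as, e.g.:
//   setInterval(probeTick, 90_000).unref();
```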
## Actual behavior
Once promoted, the stack stays promoted forever; the event loop saturates; the gateway requires a manual `launchctl kickstart -k`.
## Suggested fix (willing to PR if direction is agreed)
A minimally invasive change in `resolveTelegramTransport`:
```ts
let stickyAttemptIndex = 0;
let consecutiveSuccessOnSticky = 0;
const STICKY_RESET_THRESHOLD = 5; // or make this configurable

const demoteStickyAttempt = () => {
  if (stickyAttemptIndex === 0) return;
  consecutiveSuccessOnSticky += 1;
  if (consecutiveSuccessOnSticky >= STICKY_RESET_THRESHOLD) {
    log.info(
      `telegram fetch stack: resetting sticky index ${stickyAttemptIndex} -> 0 after ${consecutiveSuccessOnSticky} consecutive successes`,
    );
    stickyAttemptIndex = 0;
    consecutiveSuccessOnSticky = 0;
  }
};

// In the success branch of resolvedFetch, after a clean response on the start attempt:
demoteStickyAttempt();

// In promoteStickyAttempt, reset the counter on every promotion:
stickyAttemptIndex = nextIndex;
consecutiveSuccessOnSticky = 0;
```
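For illustration, the proposed counter can be exercised standalone (surrounding closure stubbed out, `log.info` dropped):

```ts
// Standalone sketch of the decay counter: start latched on attempt 2 and
// feed it clean responses until it resets.
let stickyAttemptIndex = 2; // latched on the pinned-fallback attempt
let consecutiveSuccessOnSticky = 0;
const STICKY_RESET_THRESHOLD = 5;

const demoteStickyAttempt = (): void => {
  if (stickyAttemptIndex === 0) return;
  consecutiveSuccessOnSticky += 1;
  if (consecutiveSuccessOnSticky >= STICKY_RESET_THRESHOLD) {
    stickyAttemptIndex = 0;
    consecutiveSuccessOnSticky = 0;
  }
};

// Five clean responses in a row walk the stack back to the primary transport.
for (let i = 0; i < 5; i += 1) demoteStickyAttempt();
console.log(stickyAttemptIndex); // 0
```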
Orthogonally:

- Expose `connections` / `keepAliveMaxTimeout` / the reset threshold / the fallback-IP list as `channels.telegram.network.*` config knobs (or environment variables following the existing `OPENCLAW_TELEGRAM_*` pattern), so users behind hostile networks can tune them without patching `dist/`.
- Consider adding a second fallback IP in the DC5 range (the current list has only one entry) so that when DC4 is blackholed there is still a second option if `.220` degrades.
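A sketch of what the env-var plumbing could look like. The `OPENCLAW_TELEGRAM_*` prefix follows the existing pattern, but these exact variable names are hypothetical:

```ts
// Hypothetical env-var knobs for the Telegram fetch stack; none of these
// variable names exist yet.
const intFromEnv = (name: string, fallback: number): number => {
  const raw = process.env[name];
  const parsed = raw === undefined ? NaN : Number.parseInt(raw, 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
};

const connectionsPerOrigin = intFromEnv("OPENCLAW_TELEGRAM_CONNECTIONS", 4);
const keepAliveMaxTimeoutMs = intFromEnv("OPENCLAW_TELEGRAM_KEEPALIVE_MAX_MS", 60_000);
const stickyResetThreshold = intFromEnv("OPENCLAW_TELEGRAM_STICKY_RESET_THRESHOLD", 5);

// Comma-separated override for the pinned fallback-IP list.
const fallbackIps = (process.env.OPENCLAW_TELEGRAM_FALLBACK_IPS ?? "149.154.167.220")
  .split(",")
  .map((ip) => ip.trim())
  .filter(Boolean);
```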
## Related issues (different angle, same blast radius)
The common root across several of these is that the Telegram subsystem has no feedback loop from "upstream is healthy again" back into its internal state — every failure mode is latched.
## Environment
- openclaw 2026.4.29
- Node v25.5.0
- macOS 26.2 (arm64)
- Behind a network that occasionally blackholes the `149.154.166.0/23` DC4 range