Description
Summary
The Telegram long-polling loop can silently stop while the gateway process remains alive. The grammY runner's task() promise resolves without error when maxRetryTime is exceeded, causing the polling while loop to exit via return. No restart logic exists, so the bot becomes permanently unresponsive until manual restart.
A related issue: when a channel's startAccount task crashes or exits, the .catch() logs the error and .finally() sets running: false — but never restarts the channel.
Root Cause
Bug 1: src/telegram/monitor.ts — polling exits on silent resolve
// BEFORE (broken): runner.task() resolves → return exits the while loop forever
await runner.task();
return; // ← this kills the polling loop silently
When grammY's internal retry window (maxRetryTime: 5min) is exhausted (e.g., Telegram API unreachable for >5 min), runner.task() resolves cleanly (no error). The return statement exits monitorTelegramProvider(), and the channel marks running: false. The bot never polls again.
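The failure mode can be reproduced in isolation without grammY. The following is a minimal sketch, not code from the repository: RunnerLike and pollUntilStopped are hypothetical names, and the stub runner simply resolves immediately to stand in for a runner whose retry window has been exhausted.

```typescript
// Hypothetical stand-in for the runner handle: task() resolves without
// throwing, as described above for an exhausted retry window.
interface RunnerLike {
  task(): Promise<void>;
}

// Mirrors the shape of the broken loop. Returns how many runners were
// created before the function exited.
async function pollUntilStopped(
  createRunner: () => RunnerLike,
  signal: AbortSignal,
): Promise<number> {
  let runnersCreated = 0;
  while (!signal.aborted) {
    const runner = createRunner();
    runnersCreated += 1;
    await runner.task();
    // A clean resolve falls through to this return, so the loop never
    // gets a second iteration even though nothing asked it to stop.
    return runnersCreated;
  }
  return runnersCreated;
}

// Usage: the abort signal is never triggered, yet polling ends after one runner.
const neverAborted = new AbortController();
pollUntilStopped(() => ({ task: async () => {} }), neverAborted.signal).then(
  (count) => console.log(`runners created: ${count}`),
);
```

A regression test along these lines could assert that more than one runner is created when task() resolves while the signal is still live.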
Bug 2: src/gateway/server-channels.ts — no channel restart on crash
When startAccount() throws or resolves unexpectedly, the promise chain logs the error but never restarts the channel. The channel is permanently dead until gateway restart.
Observed Behavior
- Gateway process alive for 9+ hours (PID visible, port listening)
- Zero outbound TCP connections to Telegram API (lsof -i shows no 149.154.x.x)
- Last log entry: "[default] channel exited: Request to 'getUpdates' timed out after 500 seconds"
- Bot completely unresponsive to Telegram messages
- Only fix: manual launchctl stop/start
Proposed Fix (implemented locally, tested)
Fix 1: Auto-restart polling on unexpected stop
Replace return with backoff + continue in the polling loop. Only exit when abortSignal is aborted:
// AFTER: treat silent resolve as recoverable, restart with backoff
await runner.task();
if (opts.abortSignal?.aborted) return; // intentional stop — exit cleanly
restartAttempts += 1;
const delayMs = computeBackoff(TELEGRAM_POLL_RESTART_POLICY, restartAttempts);
log(`Telegram polling stopped unexpectedly; restarting in ${formatDurationMs(delayMs)}.`);
await sleepWithAbort(delayMs, opts.abortSignal);
// continue → loops back to create a new runner
Fix 2: Auto-restart channels with exponential backoff
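Both restart loops call computeBackoff, whose implementation is not shown in this issue. To make the resulting delays concrete, here is a minimal sketch of how such a helper might behave. The policy field names (initialMs, maxMs, factor, jitter), the additive-only jitter, and the names computeBackoffSketch and ASSUMED_CHANNEL_RESTART_POLICY are assumptions; only the numeric values (3s initial, 60s max, factor 2, jitter 0.2) come from the description of this fix.

```typescript
// Hypothetical policy shape; the real type is not shown in this issue.
interface BackoffPolicy {
  initialMs: number; // delay for the first restart attempt
  maxMs: number;     // upper bound applied after jitter
  factor: number;    // multiplier per attempt
  jitter: number;    // fraction of the base delay added at random (assumed additive)
}

// Values quoted for channel restarts in this fix.
const ASSUMED_CHANNEL_RESTART_POLICY: BackoffPolicy = {
  initialMs: 3_000,
  maxMs: 60_000,
  factor: 2,
  jitter: 0.2,
};

// attempt is 1-based, matching `restartAttempts += 1` before the call.
// `random` is injectable so the result can be made deterministic in tests.
function computeBackoffSketch(
  policy: BackoffPolicy,
  attempt: number,
  random: () => number = Math.random,
): number {
  const exponent = Math.max(attempt - 1, 0);
  const base = policy.initialMs * Math.pow(policy.factor, exponent);
  const jittered = base + base * policy.jitter * random();
  return Math.min(policy.maxMs, Math.round(jittered));
}

// Usage: print the jitter-free schedule for the first seven attempts.
for (let attempt = 1; attempt <= 7; attempt += 1) {
  console.log(attempt, computeBackoffSketch(ASSUMED_CHANNEL_RESTART_POLICY, attempt, () => 0));
}
```

Under these assumptions and with jitter disabled, attempts 1 through 6 wait 3s, 6s, 12s, 24s, 48s, and 60s, and every later attempt stays at the 60s cap. Twenty attempts would then add up to about 993 seconds of waiting, so a channel would give up after roughly 16 to 17 minutes of continuous failure, plus whatever time each attempt itself takes.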
Wrap startAccount() in runAccountWithRestart() — a while loop with exponential backoff (3s initial, 60s max, factor 2, jitter 0.2) that auto-restarts crashed channels up to 20 times:
const runAccountWithRestart = async () => {
  let restartAttempts = 0;
  while (!abort.signal.aborted) {
    try {
      await startAccount({ ... });
      if (abort.signal.aborted) return;
      restartAttempts += 1;
      if (restartAttempts > MAX_CHANNEL_RESTART_ATTEMPTS) { /* give up */ return; }
      const delayMs = computeBackoff(CHANNEL_RESTART_POLICY, restartAttempts);
      log(`channel stopped unexpectedly; restarting in ${formatDurationMs(delayMs)}`);
      await sleepWithAbort(delayMs, abort.signal);
    } catch (err) {
      // same pattern: backoff + restart, give up after MAX attempts
    }
  }
};
Fix 3: Active Telegram health check (new)
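The health check for this fix is only shown at its call site. As a rough illustration of what a helper of that shape might do internally, here is a minimal sketch. It is not the code from the patch: startHealthCheckSketch, createFailureCounter, and withTimeout are hypothetical names, and a generic probe callback stands in for the bot option so the sketch does not depend on grammY.

```typescript
// Hypothetical option shape, loosely following the call site in this section.
interface HealthCheckOptions {
  probe: () => Promise<unknown>; // assumed stand-in for a call such as bot.api.getMe()
  intervalMs: number;
  maxFailures: number;
  timeoutMs: number;
  onUnhealthy: () => void;
  signal?: AbortSignal;
}

// Tracks consecutive failures; returns true when the threshold is reached,
// then starts counting again from zero.
function createFailureCounter(maxFailures: number): (ok: boolean) => boolean {
  let consecutive = 0;
  return (ok: boolean): boolean => {
    consecutive = ok ? 0 : consecutive + 1;
    if (consecutive >= maxFailures) {
      consecutive = 0;
      return true;
    }
    return false;
  };
}

// Rejects if the wrapped promise does not settle within timeoutMs.
function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("health check timed out")), timeoutMs);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

function startHealthCheckSketch(opts: HealthCheckOptions): () => void {
  const record = createFailureCounter(opts.maxFailures);
  let busy = false; // skip a tick if the previous probe is still in flight
  const timer = setInterval(async () => {
    if (busy) return;
    busy = true;
    let ok = true;
    try {
      await withTimeout(opts.probe(), opts.timeoutMs);
    } catch {
      ok = false;
    }
    busy = false;
    if (record(ok)) opts.onUnhealthy();
  }, opts.intervalMs);
  const stop = () => clearInterval(timer);
  if (opts.signal?.aborted) stop();
  opts.signal?.addEventListener("abort", stop, { once: true });
  return stop;
}
```

Keeping the failure counting separate from the timer makes the threshold logic testable without real network calls. A single successful probe resets the count, so only an unbroken run of failures triggers onUnhealthy.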
Periodically call bot.api.getMe() (every 60s) to verify Telegram API connectivity. After 3 consecutive failures, force-stop the runner — which triggers the existing restart loop:
const stopHealthCheck = startPollingHealthCheck({
  bot,
  intervalMs: 60_000,
  maxFailures: 3,
  timeoutMs: 10_000,
  onUnhealthy: () => void runner.stop(),
  signal: opts.abortSignal,
});
Fix 4: Process-level self-watchdog (new)
Detect event loop hangs via setInterval drift detection. If the event loop is unresponsive for >30s, force process.exit(1) so launchd (KeepAlive: true) auto-restarts the gateway:
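export function startWatchdog(opts?: WatchdogOptions, signal?: AbortSignal): () => void {
  // intervalMs and thresholdMs are presumably resolved from opts; that part is not shown in this excerpt.
  let lastTick = Date.now();
  const timer = setInterval(() => {
    // A large gap between ticks means the event loop was blocked.
    const delta = Date.now() - lastTick;
    if (delta > thresholdMs) process.exit(1);
    lastTick = Date.now();
  }, intervalMs);
  timer.unref(); // do not keep the process alive just for the watchdog
  return () => clearInterval(timer);
}
Environment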
- macOS (Mac mini, LaunchAgent with KeepAlive)
- Node.js 22
- grammY + @grammyjs/runner
- Long-polling mode (not webhook)
Related Issues
- #4248: Gateway crashes on unhandled fetch rejection (TypeError: fetch failed)
- #3815: [Bug]: Clawdbot Gateway Crashes Repeatedly
- #3646: Multiple Critical Channel Bugs in v2026.1.24-3 (Telegram, Nostr, AbortError)
- #2935: Heartbeat stops firing after context compression
- #1964: Telegram channel plugin periodically goes silent
Files Changed
- src/telegram/monitor.ts: polling restart loop + health check integration
- src/gateway/server-channels.ts: runAccountWithRestart() with backoff
- src/infra/watchdog.ts: new process self-watchdog
- src/gateway/server.impl.ts: watchdog lifecycle integration
- Tests: monitor.test.ts (9 tests), watchdog.test.ts (4 tests)
All tests pass. Build clean. Lint clean.