Telegram polling dies silently — runner.task() resolves without error, no restart #4302

@reallygood83

Summary

The Telegram long-polling loop can silently stop while the gateway process remains alive. The grammY runner's task() promise resolves without error when maxRetryTime is exceeded, causing the polling while loop to exit via return. No restart logic exists, so the bot becomes permanently unresponsive until manual restart.

A related issue: when a channel's startAccount task crashes or exits, the .catch() logs the error and .finally() sets running: false — but never restarts the channel.

Root Cause

Bug 1: src/telegram/monitor.ts — polling exits on silent resolve

// BEFORE (broken): runner.task() resolves → return exits the while loop forever
await runner.task();
return; // ← this kills the polling loop silently

When grammY's internal retry window (maxRetryTime: 5min) is exhausted (e.g., Telegram API unreachable for >5 min), runner.task() resolves cleanly (no error). The return statement exits monitorTelegramProvider(), and the channel marks running: false. The bot never polls again.

Bug 2: src/gateway/server-channels.ts — no channel restart on crash

When startAccount() throws or resolves unexpectedly, the promise chain logs the error but never restarts the channel. The channel is permanently dead until gateway restart.

Observed Behavior

  • Gateway process alive for 9+ hours (PID visible, port listening)
  • Zero outbound TCP connections to Telegram API (lsof -i shows no 149.154.x.x)
  • Last log entry: "[default] channel exited: Request to 'getUpdates' timed out after 500 seconds"
  • Bot completely unresponsive to Telegram messages
  • Only fix: manual launchctl stop/start

Proposed Fix (implemented locally, tested)

Fix 1: Auto-restart polling on unexpected stop

Replace return with backoff + continue in the polling loop. Only exit when abortSignal is aborted:

// AFTER: treat silent resolve as recoverable, restart with backoff
await runner.task();
if (opts.abortSignal?.aborted) return; // intentional stop — exit cleanly
restartAttempts += 1;
const delayMs = computeBackoff(TELEGRAM_POLL_RESTART_POLICY, restartAttempts);
log(`Telegram polling stopped unexpectedly; restarting in ${formatDurationMs(delayMs)}.`);
await sleepWithAbort(delayMs, opts.abortSignal);
// continue → loops back to create a new runner
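
The snippet above calls sleepWithAbort() without showing it. Below is a minimal sketch of what an abort-aware sleep and the surrounding restart loop could look like. The helper's signature is inferred from the call site, the choice to resolve (rather than reject) on abort is an assumption, and pollWithRestart, runOnce, and delayForAttempt are hypothetical names used only for illustration, not the project's real code.

```typescript
// Hypothetical abort-aware sleep: resolves after `ms`, or as soon as the signal aborts.
// Assumption: abort resolves quietly, so the caller re-checks the signal itself.
function sleepWithAbort(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise<void>((resolve) => {
    if (signal?.aborted) {
      resolve();
      return;
    }
    const onAbort = () => {
      clearTimeout(timer);
      resolve();
    };
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}

// Hypothetical restart loop. `runOnce` stands in for "create a runner and
// await runner.task()"; it may resolve without error, which is treated as recoverable.
async function pollWithRestart(
  runOnce: () => Promise<void>,
  delayForAttempt: (attempt: number) => number,
  signal?: AbortSignal,
): Promise<number> {
  let restartAttempts = 0;
  while (!signal?.aborted) {
    await runOnce();
    if (signal?.aborted) break; // intentional stop: exit cleanly
    restartAttempts += 1;
    await sleepWithAbort(delayForAttempt(restartAttempts), signal);
  }
  return restartAttempts; // number of restarts performed before the abort
}
```

The loop's only exit is the abort signal, so a silent resolve can no longer end polling for good.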

Fix 2: Auto-restart channels with exponential backoff

Wrap startAccount() in runAccountWithRestart() — a while loop with exponential backoff (3s initial, 60s max, factor 2, jitter 0.2) that auto-restarts crashed channels up to 20 times:

const runAccountWithRestart = async () => {
  let restartAttempts = 0;
  while (!abort.signal.aborted) {
    try {
      await startAccount({ ... });
      if (abort.signal.aborted) return;
      restartAttempts += 1;
      if (restartAttempts > MAX_CHANNEL_RESTART_ATTEMPTS) { /* give up */ return; }
      const delayMs = computeBackoff(CHANNEL_RESTART_POLICY, restartAttempts);
      log(`channel stopped unexpectedly; restarting in ${formatDurationMs(delayMs)}`);
      await sleepWithAbort(delayMs, abort.signal);
    } catch (err) {
      // same pattern: backoff + restart, give up after MAX attempts
    }
  }
};
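
To make the backoff numbers concrete, here is a hedged sketch of how computeBackoff() might turn the stated policy (3s initial, 60s max, factor 2, jitter 0.2) into delays. The BackoffPolicy field names, the additive one-sided jitter, and the injectable random parameter are all assumptions made for illustration; the project's actual helper may differ.

```typescript
// Hypothetical policy shape; field names are assumed, values come from the text above.
interface BackoffPolicy {
  initialMs: number;
  maxMs: number;
  factor: number;
  jitter: number; // fraction of the base delay added as random extra (assumed semantics)
}

const CHANNEL_RESTART_POLICY: BackoffPolicy = {
  initialMs: 3_000,
  maxMs: 60_000,
  factor: 2,
  jitter: 0.2,
};

// Exponential backoff capped at maxMs. `attempt` is 1-based.
// `random` is injectable so the result can be made deterministic in tests.
function computeBackoff(
  policy: BackoffPolicy,
  attempt: number,
  random: () => number = Math.random,
): number {
  const exponent = Math.max(attempt - 1, 0);
  const base = policy.initialMs * Math.pow(policy.factor, exponent);
  const jittered = base + base * policy.jitter * random();
  return Math.min(policy.maxMs, Math.round(jittered));
}

// With jitter disabled, the first six delays are 3s, 6s, 12s, 24s, 48s, 60s.
const delays = [1, 2, 3, 4, 5, 6].map((n) =>
  computeBackoff(CHANNEL_RESTART_POLICY, n, () => 0),
);
console.log(delays.join(", "));
```

Under these assumptions and ignoring jitter, 20 restart attempts spend about 993 seconds (roughly 16.5 minutes) sleeping before the channel gives up, so an outage longer than that would still need a gateway-level recovery.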

Fix 3: Active Telegram health check (new)

Periodically call bot.api.getMe() (every 60s) to verify Telegram API connectivity. After 3 consecutive failures, force-stop the runner — which triggers the existing restart loop:

const stopHealthCheck = startPollingHealthCheck({
  bot,
  intervalMs: 60_000,
  maxFailures: 3,
  timeoutMs: 10_000,
  onUnhealthy: () => void runner.stop(),
  signal: opts.abortSignal,
});
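
The body of startPollingHealthCheck() is not shown above. The following is one possible shape for it, written against a generic probe callback instead of the bot object so that it stays self-contained; in the real integration the probe would presumably be () => bot.api.getMe(). The names startProbeHealthCheck, HealthCheckOptions, and withTimeout are hypothetical.

```typescript
// Hypothetical options; mirrors the call above but takes a generic `probe`.
interface HealthCheckOptions {
  probe: () => Promise<unknown>;
  intervalMs: number;
  maxFailures: number;
  timeoutMs: number;
  onUnhealthy: () => void;
  signal?: AbortSignal;
}

// Reject if `p` has not settled within `ms` milliseconds.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const t = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    p.then(
      (v) => { clearTimeout(t); resolve(v); },
      (e) => { clearTimeout(t); reject(e); },
    );
  });
}

// Runs the probe on an interval and calls onUnhealthy after
// `maxFailures` consecutive failures. Returns a stop function.
function startProbeHealthCheck(opts: HealthCheckOptions): () => void {
  let failures = 0;
  let inFlight = false;

  const tick = async () => {
    if (inFlight) return; // previous probe still pending; skip this tick
    inFlight = true;
    try {
      await withTimeout(opts.probe(), opts.timeoutMs);
      failures = 0; // any success resets the streak
    } catch {
      failures += 1;
      if (failures >= opts.maxFailures) {
        failures = 0; // start a fresh streak after reporting
        opts.onUnhealthy();
      }
    } finally {
      inFlight = false;
    }
  };

  const timer = setInterval(() => void tick(), opts.intervalMs);
  const stop = () => clearInterval(timer);
  opts.signal?.addEventListener("abort", stop, { once: true });
  return stop;
}
```

Counting only consecutive failures keeps a single slow response from tearing down a healthy runner, and the per-probe timeout prevents a hung request from stalling the check itself.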

Fix 4: Process-level self-watchdog (new)

Detect event loop hangs via setInterval drift detection. If the event loop is unresponsive for >30s, force process.exit(1) so launchd (KeepAlive: true) auto-restarts the gateway:

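export function startWatchdog(opts?: WatchdogOptions, signal?: AbortSignal): () => void {
  // thresholdMs matches the 30s limit described above; the intervalMs default is an assumed value.
  const { intervalMs = 5_000, thresholdMs = 30_000 } = opts ?? {};
  let lastTick = Date.now();
  const timer = setInterval(() => {
    // A delta far above intervalMs means the event loop was blocked between ticks.
    const delta = Date.now() - lastTick;
    if (delta > thresholdMs) process.exit(1);
    lastTick = Date.now();
  }, intervalMs);
  timer.unref(); // do not keep the process alive just for the watchdog
  const stop = () => clearInterval(timer);
  signal?.addEventListener("abort", stop, { once: true });
  return stop;
}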
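
Because the watchdog ends in process.exit(1), its drift logic is awkward to unit-test directly. One way to make it testable, sketched here with hypothetical names (createDriftDetector, startDriftWatchdog) and an assumed 5s tick interval, is to separate the drift calculation from the timer and inject the clock:

```typescript
// Hypothetical helper: returns a tick function that reports whether the gap
// since the previous tick exceeded thresholdMs. `now` is injectable for tests.
function createDriftDetector(
  thresholdMs: number,
  now: () => number = Date.now,
): () => boolean {
  let lastTick = now();
  return function tick(): boolean {
    const current = now();
    const hung = current - lastTick > thresholdMs;
    lastTick = current;
    return hung;
  };
}

// Wires the detector to a timer. In the gateway, onHang would be
// something like () => process.exit(1) so that launchd restarts the process.
function startDriftWatchdog(
  onHang: () => void,
  intervalMs = 5_000, // assumed value, not taken from the issue
  thresholdMs = 30_000, // the 30s limit described above
): () => void {
  const tick = createDriftDetector(thresholdMs);
  const timer = setInterval(() => {
    if (tick()) onHang();
  }, intervalMs);
  return () => clearInterval(timer);
}
```

With a fake clock, a test can advance time by 35 seconds between two ticks and assert that the detector reports a hang, without blocking the real event loop.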

Environment

  • macOS (Mac mini, LaunchAgent with KeepAlive)
  • Node.js 22
  • grammY + @grammyjs/runner
  • Long-polling mode (not webhook)

Files Changed

  1. src/telegram/monitor.ts — polling restart loop + health check integration
  2. src/gateway/server-channels.ts — runAccountWithRestart() with backoff
  3. src/infra/watchdog.ts — new process self-watchdog
  4. src/gateway/server.impl.ts — watchdog lifecycle integration
  5. Tests: monitor.test.ts (9 tests), watchdog.test.ts (4 tests)

All tests pass. Build clean. Lint clean.
