Telegram typing keepalive loop lacks circuit breaker, causes gateway crash on network failure #45759

@assistantheinrich-prog

Summary

When the Telegram API becomes unreachable (network blip, DNS timeout, etc.), the typing indicator keepalive loop (createTypingKeepaliveLoop in src/channels/typing-lifecycle.ts) continues firing sendChatAction calls every 6 seconds indefinitely. Each failed call triggers up to 3 retries with exponential backoff (up to 30s). Multiple concurrent typing contexts compound this, saturating the event loop and causing the gateway to become unresponsive.

This results in:

  • lane wait exceeded: waitedMs=317017 (5+ minutes blocked)
  • launchd sends SIGTERM due to unresponsiveness
  • Gateway crash loop (observed 3 crashes in one morning)

Reproduction

  1. Start the gateway with Telegram enabled
  2. Trigger a message that starts a typing indicator loop
  3. Interrupt network connectivity (e.g., disable WiFi, block api.telegram.org)
  4. Observe "sendChatAction failed: Network request failed!" every 6 seconds in gateway.err.log
  5. Gateway eventually becomes unresponsive and receives SIGTERM

Root Cause

In src/channels/typing-lifecycle.ts, createTypingKeepaliveLoop uses setInterval with no max-consecutive-error check:

const tick = async () => {
    if (tickInFlight) return;
    tickInFlight = true;
    try {
        await params.onTick();
    } finally {
        tickInFlight = false;
    }
};
const start = () => {
    if (params.intervalMs <= 0 || timer) return;
    timer = setInterval(() => { tick(); }, params.intervalMs);
};

The typingTtlMs timeout (2 min) should in theory stop the loop, but its timer runs on the same event loop, so when network errors stall the loop, the TTL callback does not fire on time.

Meanwhile, withTelegramApiErrorLogging in src/telegram/send.ts logs the error and then re-throws it. The caller (sendTypingTelegram) uses createTelegramRequestWithDiag, which retries recoverable network errors up to 3 times with a 30s max backoff, so each tick can block for ~90s on a dead network.

Suggested Fix

Add a consecutive-error circuit breaker to createTypingKeepaliveLoop:

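let consecutiveErrors = 0;
const MAX_CONSECUTIVE_ERRORS = 3;

const tick = async () => {
    // Skip this tick if the previous one is still awaiting onTick().
    if (tickInFlight) return;
    tickInFlight = true;
    try {
        await params.onTick();
        // Any success resets the failure streak.
        consecutiveErrors = 0;
    } catch {
        consecutiveErrors++;
        // Trip the breaker: stop the interval and notify the caller.
        // onCircuitBreak would be a new optional callback on params.
        if (consecutiveErrors >= MAX_CONSECUTIVE_ERRORS) {
            stop();
            params.onCircuitBreak?.();
        }
    } finally {
        tickInFlight = false;
    }
};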
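
For reference, here is a minimal, self-contained sketch of how the breaker could sit inside a complete keepalive loop. It is not the actual createTypingKeepaliveLoop implementation: the function name createKeepaliveLoopSketch, the maxConsecutiveErrors option, and the isRunning helper are assumptions made for illustration, and the real params shape in src/channels/typing-lifecycle.ts may differ.

```typescript
// Hypothetical, simplified stand-in for createTypingKeepaliveLoop.
// Names marked "assumed" do not come from the real codebase.
interface KeepaliveParams {
    intervalMs: number;
    onTick: () => Promise<void>;
    maxConsecutiveErrors?: number; // assumed option, defaults to 3
    onCircuitBreak?: () => void;   // assumed callback from the suggested fix
}

function createKeepaliveLoopSketch(params: KeepaliveParams) {
    const maxErrors = params.maxConsecutiveErrors ?? 3;
    let timer: ReturnType<typeof setInterval> | undefined;
    let tickInFlight = false;
    let consecutiveErrors = 0;

    const stop = () => {
        if (timer === undefined) return;
        clearInterval(timer);
        timer = undefined;
    };

    const tick = async () => {
        // Never overlap ticks: a slow onTick() must finish first.
        if (tickInFlight) return;
        tickInFlight = true;
        try {
            await params.onTick();
            consecutiveErrors = 0;
        } catch {
            consecutiveErrors++;
            if (consecutiveErrors >= maxErrors) {
                // Breaker trips: no further interval ticks are scheduled.
                stop();
                params.onCircuitBreak?.();
            }
        } finally {
            tickInFlight = false;
        }
    };

    const start = () => {
        if (params.intervalMs <= 0 || timer !== undefined) return;
        consecutiveErrors = 0; // a fresh start gets a fresh failure budget
        timer = setInterval(() => { void tick(); }, params.intervalMs);
    };

    return { start, stop, tick, isRunning: () => timer !== undefined };
}
```

With this shape, three failed ticks in a row clear the interval, so a dead network costs at most three onTick() attempts per typing context instead of an unbounded number.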

Alternatively, typing indicator calls specifically could use attempts: 1 and a short timeout (5s). They are purely cosmetic, so a failed typing indicator should never block real message processing.
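
A rough sketch of that alternative, assuming a generic wrapper rather than any existing helper in the codebase: bestEffort is a hypothetical name, and how the abort signal would be passed through to the Telegram request layer is left open.

```typescript
// Hypothetical helper: exactly one attempt, a hard timeout, and no throw.
// Returns true if the action completed in time, false otherwise.
async function bestEffort(
    action: (signal: AbortSignal) => Promise<unknown>,
    timeoutMs = 5000,
): Promise<boolean> {
    const controller = new AbortController();
    let timer: ReturnType<typeof setTimeout> | undefined;

    // Rejects after timeoutMs and signals the action to abort.
    const timeout = new Promise<never>((_, reject) => {
        timer = setTimeout(() => {
            controller.abort();
            reject(new Error(`timed out after ${timeoutMs}ms`));
        }, timeoutMs);
    });

    try {
        await Promise.race([action(controller.signal), timeout]);
        return true;
    } catch {
        // Cosmetic call: swallow the failure so callers are never blocked.
        return false;
    } finally {
        if (timer !== undefined) clearTimeout(timer);
    }
}
```

A typing tick built on this would wait at most about 5s per attempt rather than ~90s, and the boolean result could feed a consecutive-error counter like the one above.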

Workaround

Setting agents.defaults.typingMode: "never" in openclaw.json eliminates the crash vector entirely. Additionally, reducing channels.telegram.retry.attempts to 1 and timeoutSeconds to 5 limits the blast radius.

Environment

  • OpenClaw v2026.3.11
  • macOS (launchd managed gateway)
  • Telegram channel with multiple group topics
