Cron Telegram live-adapter delivery can silently drop messages after reconnect storms #31165

@Percy2Live

Bug Description

Scheduled cron jobs with deliver: telegram:CHAT_ID can stop arriving in Telegram after the gateway has been running through sustained Telegram reconnect storms (Bad Gateway / TimedOut). The scheduler still records the delivery as successful:

  • jobs.json shows last_status: ok and last_delivery_error: null
  • the cron output file contains a full, non-empty response
  • scheduler logs say the job was delivered to telegram:CHAT_ID via live adapter
  • but the Telegram message never reaches the user

Restarting the gateway consistently restores delivery. This points to the long-running TelegramAdapter / python-telegram-bot client entering a bad state after reconnect loops, while cron's live-adapter branch continues to treat sends as successful.

Affected Components

  • cron/scheduler.py: _deliver_result, live-adapter branch using runtime_adapter.send(...) via asyncio.run_coroutine_threadsafe
  • gateway/platforms/telegram.py: TelegramAdapter.send
  • python-telegram-bot 22.7

Observed Behavior

Multiple cron jobs configured with deliver: telegram:... were affected at once, so this does not appear to be job-specific.

Typical evidence from the broken state:

jobs.json: last_status = ok
jobs.json: last_delivery_error = null
~/.hermes/cron/output/{job_id}/...md contains non-empty output
INFO cron.scheduler: Job '{job_id}': delivered to telegram:CHAT_ID via live adapter

Actual result: no Telegram message arrives.

The condition appears after reconnect bursts like:

[Telegram] Telegram network error, scheduling reconnect: Bad Gateway
[Telegram] Telegram network error (attempt 1/10), reconnecting in 5s. Error: Bad Gateway
telegram.error.TimedOut: Timed out

gateway_state.json continues to report platforms.telegram.state == "connected"; its updated_at can remain frozen at the last successful state transition, usually gateway startup.

Expected Behavior

If TelegramAdapter.send() returns SendResult(success=True, message_id="1234"), the message should actually be delivered to the configured chat.

If the live adapter is unhealthy or Telegram refuses/drops the send, the adapter should surface a failure so cron can either:

  1. fall through to the standalone delivery path, or
  2. record a delivery error / retryable delivery failure instead of marking the run as successfully delivered.
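The two options above could be sketched roughly as follows. This is a minimal illustration, not the scheduler's actual code: `deliver_with_fallback`, the `Sender` callable type, and the simplified `SendResult` stand-in are hypothetical names introduced here, and the real `SendResult` carries more fields.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class SendResult:
    # Simplified stand-in for the gateway's SendResult (assumption: only
    # the fields needed for this sketch are modelled).
    success: bool
    message_id: Optional[str] = None
    error: Optional[str] = None


# Hypothetical signature: (chat_id, text) -> SendResult
Sender = Callable[[str, str], Awaitable[SendResult]]


async def deliver_with_fallback(
    live_send: Optional[Sender],
    standalone_send: Sender,
    chat_id: str,
    text: str,
    timeout: float = 30.0,
) -> tuple[str, SendResult]:
    """Try the live adapter first; fall through to the standalone path."""
    if live_send is not None:
        try:
            result = await asyncio.wait_for(live_send(chat_id, text), timeout)
            if result.success:
                return "live", result
        except Exception:
            # Timeouts and adapter errors are treated as "live path failed".
            pass
    # Either no live adapter, or it reported/raised a failure.
    result = await standalone_send(chat_id, text)
    return "standalone", result
```

The caller would record `result.error` as the delivery error when the returned result is unsuccessful. A wrapper like this only helps once the adapter surfaces a failure; on its own it cannot detect the false-success case reported here.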

Diagnostic Evidence

After a fresh gateway restart, the same manually triggered cron job delivered successfully and diagnostic logging around the live-adapter call showed:

WARNING cron.scheduler: DIAG cron-deliver job=JOB_ID plat=telegram chat=CHAT_ID
  adapter='TelegramAdapter' loop_running=True text_len=890 skip_live=False
WARNING cron.scheduler: DIAG live-adapter-result job=JOB_ID type=SendResult
  repr=SendResult(success=True, message_id='1245', error=None,
                  raw_response={'message_ids': ['1245']},
                  retryable=False, continuation_message_ids=())
  success_attr=True
INFO cron.scheduler: Job 'JOB_ID': delivered to telegram:CHAT_ID via live adapter

That message arrived. In the broken state before restart, the same job and configuration had repeatedly reported last_status: ok with no delivery.

A standalone send in a separate process using the same bot token, chat id, and platform config succeeded while the cron/live-adapter path was the suspected failure point:

import asyncio

from gateway.config import Platform, load_gateway_config
from tools.send_message_tool import _send_to_platform

async def main():
    cfg = load_gateway_config()
    pconfig = cfg.platforms.get(Platform.TELEGRAM)
    # Standalone path: does not go through the gateway's live TelegramAdapter.
    result = await _send_to_platform(Platform.TELEGRAM, pconfig, "CHAT_ID", "ping")
    print(result)
    # {'success': True, 'platform': 'telegram', 'chat_id': '...', 'message_id': '1243'}

asyncio.run(main())

The standalone message arrived. The later live-adapter cron delivery reported a nearby message_id, confirming the same bot/chat backend was being used.

Workaround

Locally, cron delivery was patched to skip the live-adapter branch for Telegram and always use the standalone path. Since standalone delivery is already used by send_message tool calls, this restored cron Telegram delivery in the affected stack.

This is not a complete upstream fix because some platforms may need a live adapter (for example E2EE-only Matrix/Signal paths), but it suggests Telegram cron delivery should not blindly trust a long-lived adapter that has survived repeated reconnect errors.

Suspected Root Cause

After sustained Bad Gateway / TimedOut reconnect loops, the python-telegram-bot Bot instance held by TelegramAdapter._bot may enter a wedged state where bot.send_message() returns a Message object (so TelegramAdapter.send returns SendResult(success=True, message_id=...)), but the message is not transmitted in a way that reaches the recipient.

The gateway's own state machine still reports Telegram as connected because polling/reconnect state and send-path health are not independently verified.
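A send-path check that is independent of the polling state could look something like the sketch below. `send_path_healthy` and the `probe` parameter are hypothetical names, not part of the gateway; the probe is any zero-argument coroutine function, for example the bot's `get_me` method in python-telegram-bot.

```python
import asyncio
from typing import Awaitable, Callable


async def send_path_healthy(
    probe: Callable[[], Awaitable[object]],
    timeout: float = 10.0,
) -> bool:
    """Return True if the probe coroutine completes within the timeout.

    Any exception (including a timeout) counts as unhealthy.
    """
    try:
        await asyncio.wait_for(probe(), timeout)
    except Exception:
        return False
    return True
```

A passing `getMe`-style probe only shows that the HTTP client can complete a request; it does not prove that `sendMessage` reaches the recipient, so a self-send to a debug chat may be a stronger signal.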

Possible mechanisms:

  1. PTB/httpx client is wedged on a stale connection and incorrectly reports success.
  2. Polling/getUpdates recovers but sendMessage is not healthy.
  3. The request is accepted against an unexpected chat/topic context, though a standalone probe with the same chat id worked.

Suggested Fix Directions

In order of increasing intrusiveness:

  1. Add periodic Telegram adapter health checks (getMe() or a configured debug-channel self-send) and force a full adapter reconnect/rebuild if checks fail.
  2. Count consecutive Bad Gateway / TimedOut reconnect errors. After a threshold, discard and recreate the PTB Bot and Application objects rather than reusing the same client.
  3. In cron delivery, prefer the standalone Telegram path over the live adapter unless the platform explicitly requires live-adapter semantics.
  4. At minimum, fall through if SendResult lacks raw_response, message_id, or other strong delivery evidence. This will not catch the observed real-looking message_id case, but it is still defensive.
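Direction 2 above can be illustrated with a small counter. This is only a sketch under assumptions: `ReconnectTracker` and its default threshold are hypothetical, and actually rebuilding the PTB `Bot` and `Application` objects is left to the adapter.

```python
from dataclasses import dataclass


@dataclass
class ReconnectTracker:
    # Hypothetical helper: counts consecutive reconnect errors.
    threshold: int = 5
    consecutive_errors: int = 0

    def record_error(self) -> bool:
        """Record one reconnect error.

        Returns True when the threshold is reached, meaning the caller
        should discard and recreate the client. The counter then resets.
        """
        self.consecutive_errors += 1
        if self.consecutive_errors >= self.threshold:
            self.consecutive_errors = 0
            return True
        return False

    def record_success(self) -> None:
        """Reset the counter after a healthy round trip."""
        self.consecutive_errors = 0
```

The adapter's reconnect handler would call `record_error()` on each Bad Gateway / TimedOut error and `record_success()` after a verified healthy request, rebuilding the client whenever `record_error()` returns True.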

Steps to Reproduce

This is non-deterministic and depends on Telegram/network instability:

  1. Run hermes-gateway continuously for several days with Telegram enabled.
  2. Let Telegram encounter repeated Bad Gateway / timeout reconnect bursts, or simulate intermittent outbound HTTPS failures to api.telegram.org.
  3. Trigger any cron job with deliver: telegram:CHAT_ID.
  4. Observe that cron reports successful live-adapter delivery but the message does not arrive.
  5. Restart the gateway.
  6. Trigger the same cron job again; delivery resumes.

Environment

  • hermes-agent commit b833d8501 / tag v2026.5.7
  • Python 3.11.2
  • python-telegram-bot 22.7
  • Debian 12, Linux 6.1.0-44 amd64
  • Gateway managed by systemd as a non-root user
  • Telegram was the only configured messaging platform in the affected stack

Related / Not Duplicates

Existing related issues cover nearby symptoms but not this exact false-success live-adapter failure mode:

Labels

  • P1: High (major feature broken, no workaround)
  • comp/cron: Cron scheduler and job management
  • comp/gateway: Gateway runner, session dispatch, delivery
  • platform/telegram: Telegram bot adapter
  • type/bug: Something isn't working
