Cron Telegram live-adapter delivery can silently drop messages after reconnect storms #31165

## Bug Description

Scheduled cron jobs with `deliver: telegram:CHAT_ID` can stop arriving in Telegram after the gateway has been running through sustained Telegram reconnect storms (`Bad Gateway` / `TimedOut`). The scheduler still records the delivery as successful:

- `jobs.json` shows `last_status: ok` and `last_delivery_error: null`
- the cron output file contains a full, non-empty response
- scheduler logs say the job was `delivered to telegram:CHAT_ID via live adapter`
- but the Telegram message never reaches the user

Restarting the gateway consistently restores delivery. This points at the long-running `TelegramAdapter` / `python-telegram-bot` client entering a bad state after reconnect loops, while cron's live-adapter branch still treats sends as successful.

## Affected Components

- `cron/scheduler.py` — `_deliver_result`, live-adapter branch using `runtime_adapter.send(...)` via `asyncio.run_coroutine_threadsafe`
- `gateway/platforms/telegram.py` — `TelegramAdapter.send`
- `python-telegram-bot` 22.7
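
For readers unfamiliar with the call pattern named above, here is a minimal sketch of handing `send(...)` from a scheduler thread to an event loop running in another thread. It is not the code in `cron/scheduler.py`: `FakeAdapter`, `deliver_via_live_adapter`, `start_background_loop`, the reduced `SendResult`, and the 30-second timeout are all hypothetical stand-ins.

```python
import asyncio
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass
class SendResult:
    # Reduced stand-in for the gateway's SendResult (assumption: fields trimmed).
    success: bool
    message_id: Optional[str] = None
    error: Optional[str] = None


class FakeAdapter:
    # Hypothetical adapter; the real TelegramAdapter.send talks to Telegram.
    async def send(self, chat_id: str, text: str) -> SendResult:
        await asyncio.sleep(0)
        return SendResult(success=True, message_id="1")


def deliver_via_live_adapter(adapter, loop, chat_id, text, timeout=30.0):
    """Run adapter.send on `loop` from another thread and wait for its result."""
    future = asyncio.run_coroutine_threadsafe(adapter.send(chat_id, text), loop)
    try:
        return future.result(timeout=timeout)
    except Exception as exc:  # a timeout, or an error raised inside send()
        future.cancel()  # no effect if the coroutine already finished
        return SendResult(success=False, error=f"{type(exc).__name__}: {exc}")


def start_background_loop():
    """Start an event loop in a daemon thread, standing in for the gateway loop."""
    loop = asyncio.new_event_loop()
    thread = threading.Thread(target=loop.run_forever, daemon=True)
    thread.start()
    return loop, thread


if __name__ == "__main__":
    loop, thread = start_background_loop()
    print(deliver_via_live_adapter(FakeAdapter(), loop, "CHAT_ID", "ping"))
    loop.call_soon_threadsafe(loop.stop)
    thread.join(timeout=1)
```

A wrapper like this only surfaces exceptions and timeouts. The failure reported here returns `success=True`, so it would pass through unchanged.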

## Observed Behavior

Multiple cron jobs configured with `deliver: telegram:...` were affected at once, so this does not appear to be job-specific.

Typical evidence from the broken state:

```text
jobs.json: last_status = ok
jobs.json: last_delivery_error = null
~/.hermes/cron/output/{job_id}/...md contains non-empty output
INFO cron.scheduler: Job '{job_id}': delivered to telegram:CHAT_ID via live adapter
```

Actual result: no Telegram message arrives.

The condition appears after reconnect bursts like:

```text
[Telegram] Telegram network error, scheduling reconnect: Bad Gateway
[Telegram] Telegram network error (attempt 1/10), reconnecting in 5s. Error: Bad Gateway
telegram.error.TimedOut: Timed out
```

`gateway_state.json` continues to report `platforms.telegram.state == "connected"`; its `updated_at` can remain frozen at the last successful state transition, usually gateway startup.

## Expected Behavior

If `TelegramAdapter.send()` returns `SendResult(success=True, message_id="1234")`, the message should actually be delivered to the configured chat.

If the live adapter is unhealthy or Telegram refuses/drops the send, the adapter should surface a failure so cron can either:

1. fall through to the standalone delivery path, or
2. record a delivery error / retryable delivery failure instead of marking the run as successfully delivered.
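
The two outcomes listed above could be sketched as follows. This is an illustration under stated assumptions, not the logic in `_deliver_result`: `deliver`, `DeliveryOutcome`, the `Sender` alias, and the reduced `SendResult` are hypothetical names.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class SendResult:
    # Reduced stand-in for the gateway's SendResult (assumption: fields trimmed).
    success: bool
    message_id: Optional[str] = None
    error: Optional[str] = None


@dataclass
class DeliveryOutcome:
    status: str                           # "ok" or "error"
    path: Optional[str] = None            # "live" or "standalone" when ok
    delivery_error: Optional[str] = None  # joined error text when not ok


Sender = Callable[[str], Awaitable[SendResult]]


async def deliver(text: str, live: Optional[Sender], standalone: Sender) -> DeliveryOutcome:
    """Try the live adapter first, fall through to standalone, else record an error."""
    errors = []
    for path, send in (("live", live), ("standalone", standalone)):
        if send is None:
            continue
        try:
            result = await send(text)
        except Exception as exc:
            errors.append(f"{path}: {type(exc).__name__}: {exc}")
            continue
        if result.success:
            return DeliveryOutcome(status="ok", path=path)
        errors.append(f"{path}: {result.error or 'send reported failure'}")
    return DeliveryOutcome(status="error", delivery_error="; ".join(errors))
```

Either outcome depends on the adapter surfacing the failure in the first place. In the state reported here the live send returns `success=True`, so nothing would fall through.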

## Diagnostic Evidence

After a fresh gateway restart, the same manually triggered cron job delivered successfully and diagnostic logging around the live-adapter call showed:

```text
WARNING cron.scheduler: DIAG cron-deliver job=JOB_ID plat=telegram chat=CHAT_ID
  adapter='TelegramAdapter' loop_running=True text_len=890 skip_live=False
WARNING cron.scheduler: DIAG live-adapter-result job=JOB_ID type=SendResult
  repr=SendResult(success=True, message_id='1245', error=None,
                  raw_response={'message_ids': ['1245']},
                  retryable=False, continuation_message_ids=())
  success_attr=True
INFO cron.scheduler: Job 'JOB_ID': delivered to telegram:CHAT_ID via live adapter
```

That message arrived. In the broken state before restart, the same job and configuration had repeatedly reported `last_status: ok` with no delivery.

A standalone send in a separate process using the same bot token, chat id, and platform config succeeded while the cron/live-adapter path was the suspected failure point:

```python
import asyncio

from gateway.config import Platform, load_gateway_config
from tools.send_message_tool import _send_to_platform

async def main() -> None:
    # Standalone send, run in a separate process from the gateway.
    cfg = load_gateway_config()
    pconfig = cfg.platforms.get(Platform.TELEGRAM)
    result = await _send_to_platform(Platform.TELEGRAM, pconfig, "CHAT_ID", "ping")
    print(result)
    # {'success': True, 'platform': 'telegram', 'chat_id': '...', 'message_id': '1243'}

asyncio.run(main())
```

The standalone message arrived. The later live-adapter cron delivery reported a nearby `message_id` (`1245` versus `1243`), which is consistent with both paths using the same bot and chat.

## Workaround

Locally, cron delivery was patched to skip the live-adapter branch for Telegram and always use the standalone path. Since standalone delivery is already used by `send_message` tool calls, this restored cron Telegram delivery in the affected stack.

This is not a complete upstream fix because some platforms may need a live adapter (for example E2EE-only Matrix/Signal paths), but it suggests Telegram cron delivery should not blindly trust a long-lived adapter that has survived repeated reconnect errors.
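
The local patch itself is not reproduced in this report. As a rough sketch of the kind of per-platform policy it implies, assuming platforms are identified by plain strings: `should_skip_live_adapter`, `LIVE_ONLY_PLATFORMS`, and `PREFER_STANDALONE` are hypothetical names, and the platform sets are illustrative only.

```python
# Hypothetical policy tables; membership is illustrative, not taken from the codebase.
LIVE_ONLY_PLATFORMS = frozenset({"matrix", "signal"})  # e.g. E2EE-only paths
PREFER_STANDALONE = frozenset({"telegram"})


def should_skip_live_adapter(platform: str, adapter_available: bool) -> bool:
    """Return True when cron should bypass the live adapter for this platform."""
    name = platform.lower()
    if name in LIVE_ONLY_PLATFORMS:
        return False  # assumed to have no standalone alternative
    if not adapter_available:
        return True   # nothing to send through, so use the standalone path
    return name in PREFER_STANDALONE
```

Keeping the decision in one small function makes it easy to narrow later, for example to skip the live adapter only after it has seen repeated reconnect errors.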

## Suspected Root Cause

After sustained `Bad Gateway` / `TimedOut` reconnect loops, the `python-telegram-bot` `Bot` instance held by `TelegramAdapter._bot` may enter a wedged state where `bot.send_message()` returns a `Message` object (so `TelegramAdapter.send` returns `SendResult(success=True, message_id=...)`), but the message is not transmitted in a way that reaches the recipient.

The gateway's own state machine still reports Telegram as connected because polling/reconnect state and send-path health are not independently verified.

Possible mechanisms:

1. PTB/httpx client is wedged on a stale connection and incorrectly reports success.
2. Polling/getUpdates recovers but `sendMessage` is not healthy.
3. The request is accepted against an unexpected chat/topic context, though a standalone probe with the same chat id worked.

## Suggested Fix Directions

In order of increasing intrusiveness:

1. Add periodic Telegram adapter health checks (`getMe()` or a configured debug-channel self-send) and force a full adapter reconnect/rebuild if checks fail.
2. Count consecutive `Bad Gateway` / `TimedOut` reconnect errors. After a threshold, discard and recreate the PTB `Bot` and `Application` objects rather than reusing the same client.
3. In cron delivery, prefer the standalone Telegram path over the live adapter unless the platform explicitly requires live-adapter semantics.
4. At minimum, fall through if `SendResult` lacks `raw_response`, `message_id`, or other strong delivery evidence. This will not catch the observed real-looking `message_id` case, but it is still defensive.
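
Directions 1 and 2 could be combined roughly as below. This is a sketch under assumptions: `ReconnectTracker`, `probe_healthy`, the threshold of 5, and the `rebuild` callback are hypothetical, and how the PTB `Bot` and `Application` are torn down and recreated is left to that callback. The `probe` argument is any zero-argument coroutine function, for example the `get_me` method of a `python-telegram-bot` `Bot`.

```python
import asyncio
from typing import Awaitable, Callable


class ReconnectTracker:
    """Counts consecutive network errors and triggers a client rebuild."""

    def __init__(self, rebuild: Callable[[], None], threshold: int = 5) -> None:
        if threshold < 1:
            raise ValueError("threshold must be >= 1")
        self._rebuild = rebuild
        self._threshold = threshold
        self.consecutive_errors = 0

    def record_error(self) -> bool:
        """Record one reconnect error; return True if a rebuild was triggered."""
        self.consecutive_errors += 1
        if self.consecutive_errors >= self._threshold:
            self.consecutive_errors = 0
            self._rebuild()
            return True
        return False

    def record_success(self) -> None:
        """Reset the streak after a call that is known to have succeeded."""
        self.consecutive_errors = 0


async def probe_healthy(probe: Callable[[], Awaitable[object]], timeout: float = 10.0) -> bool:
    """Run a health probe with a deadline; any error or timeout counts as unhealthy."""
    try:
        await asyncio.wait_for(probe(), timeout=timeout)
        return True
    except Exception:
        return False
```

A periodic task could call `probe_healthy` and feed failures into `record_error()`. Note that a `getMe`-style probe only exercises the request path; if the wedged client also answers probes with apparent success, only an end-to-end check such as the debug-channel self-send from direction 1 would notice.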

## Steps to Reproduce

This is non-deterministic and depends on Telegram/network instability:

1. Run `hermes-gateway` continuously for several days with Telegram enabled.
2. Let Telegram encounter repeated `Bad Gateway` / timeout reconnect bursts, or simulate intermittent outbound HTTPS failures to `api.telegram.org`.
3. Trigger any cron job with `deliver: telegram:CHAT_ID`.
4. Observe that cron reports successful live-adapter delivery but the message does not arrive.
5. Restart the gateway.
6. Trigger the same cron job again; delivery resumes.

## Environment

- hermes-agent commit `b833d8501` / tag `v2026.5.7`
- Python 3.11.2
- python-telegram-bot 22.7
- Debian 12, Linux 6.1.0-44 amd64
- Gateway managed by systemd as a non-root user
- Telegram was the only configured messaging platform in the affected stack

## Related / Not Duplicates

Existing related issues cover nearby symptoms but not this exact false-success live-adapter failure mode:

- #3173 — Telegram `Bad Gateway` reconnect loop can make the gateway unresponsive. This issue differs because the gateway and cron keep running, but live-adapter cron delivery silently drops while reporting success.
- #13566 / #8846 — delivery retry/status separation for transient failures. Useful follow-up, but this issue is specifically that cron never sees a failure.
- #20915 — standalone `send_message_tool._send_telegram` lacking Telegram fallback transport on blocked networks. This report is the inverse: standalone send works, live adapter is suspected broken.
- #22773 / #17139 — Telegram cron routing/target-resolution issues. Here the target resolves and scheduler logs successful live-adapter delivery.

