fix(gateway): notify on reconnect exhaustion + surface telegram retrying state (#29005) by xxxigm · Pull Request #29007 · NousResearch/hermes-agent

xxxigm · 2026-05-20T01:01:03Z

What does this PR do?

Fixes #29005 — both halves of the reported regression:

P0 — QQBot: the three MAX_RECONNECT_ATTEMPTS exit paths in QQAdapter._listen_loop ended with a bare _mark_disconnected() and a return. That writes platform_state=disconnected to the runtime status file and kills the listener task — but does not fire the fatal_error_handler that GatewayRunner._platform_reconnect_watcher is wired through. With no fatal-error event, the watcher never adds the platform to _failed_platforms and the adapter stays permanently dead while the gateway process keeps running. Reporter's log shows exactly this: QQBot exhausts its 100-attempt budget at 07:33 and from then on nothing tries to bring it back. Fix replaces each _mark_disconnected() with _set_fatal_error(..., retryable=True) + await _notify_fatal_error(), so the same gateway watcher that already handles Telegram / WhatsApp / Slack picks the platform up with exponential back-off (cap 5 min, circuit-broken after 10 consecutive failures).

P1 — Telegram: during the polling reconnect ladder (_handle_polling_network_error, default 10 attempts with exponential back-off capped at 60 s ≈ 55 min total), platform_state stays at connected because _mark_connected() was called at startup. Anything reading platforms.telegram.state (Prometheus exporter, kanban dashboard, status MCP) has no way to tell the bot has been unreachable for several minutes. The fatal-state flip only happens at attempt 11. Fix writes platform_state=retrying before each back-off sleep with the attempt counter + last error in error_message, and writes connected back on successful reconnect.

Both halves reuse the existing _set_fatal_error / _write_runtime_status_safe helpers and the existing retrying state string the gateway's fatal-error handler already emits — no new schema, no new env vars, no public-API change.

Related Issue

Fixes #29005.

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
🔒 Security fix
📝 Documentation update
✅ Tests (adding or improving test coverage)
♻️ Refactor (no behavior change)
🎯 New skill (bundled or hub)

Changes Made

gateway/platforms/qqbot/adapter.py (+36/-3) — _listen_loop:
- Replace _mark_disconnected() with _set_fatal_error("qq_reconnect_exhausted", ..., retryable=True) + await _notify_fatal_error() at all three MAX_RECONNECT_ATTEMPTS exits (rate-limit branch, generic QQCloseError branch, generic Exception branch).
- Single shared error code; messages differentiate the root cause so log greps still tell the three sites apart.
gateway/platforms/telegram.py (+27) — _handle_polling_network_error:
- Before sleeping: _write_runtime_status_safe(platform_state="retrying", error_code="telegram_network_error", error_message="Network error retry N/MAX (next in Xs): <err>").
- After successful start_polling(): write platform_state="connected" to clear the retrying state.
tests/gateway/test_platform_reconnect_fatal_notify.py (+288, new) — 7 regression tests across 2 classes:
- TestQQBotReconnectExhaustionNotifies (3 cases) — handler awaited with retryable=True; static guard over _listen_loop source proving at least 3 _set_fatal_error + 3 _notify_fatal_error escalations and the qq_reconnect_exhausted code remain; handler-less adapters don't crash.
- TestTelegramPollingRetryStatus (4 cases) — first-attempt retrying write happens; the write carries the N/MAX counter + telegram_network_error code; successful reconnect emits retrying then connected in order; final-attempt fatal escalation still fires (no regression for Gateway crashes on Telegram Bad Gateway (502) — reconnect loop fails #3173).

No production code outside the two adapter files modified.

How to Test

Check out this branch and ensure .venv is set up: python3 -m venv .venv && source .venv/bin/activate && pip install -e \".[all,dev]\"

Run the new tests on their own:

scripts/run_tests.sh tests/gateway/test_platform_reconnect_fatal_notify.py -v

Expected: 7 passed.

Run the surrounding suite to confirm no cross-file regressions:

scripts/run_tests.sh tests/gateway/test_platform_reconnect_fatal_notify.py \
  tests/gateway/test_telegram_network_reconnect.py \
  tests/gateway/test_qqbot.py

Expected: 166 passed.

Checklist

Code

I've read the Contributing Guide
My commit messages follow Conventional Commits (fix(qqbot): ..., fix(telegram): ..., test(gateway): ...)
I searched for existing PRs to make sure this isn't a duplicate
My PR contains only changes related to this fix (no unrelated commits)
I've run scripts/run_tests.sh tests/gateway/test_platform_reconnect_fatal_notify.py and all tests pass
I've added tests for my changes
I've tested on my platform: macOS 15.2 (Darwin 24.6.0), Python 3.12

Documentation & Housekeeping

I've updated relevant documentation (README, docs/, docstrings) — N/A (no public-API change; inline comments call out the gateway-watcher contract + QQBot adapter does not notify gateway on reconnect exhaustion; Telegram retry state not reflected in runtime status #29005)
I've updated cli-config.yaml.example if I added/changed config keys — N/A (no new config)
I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — N/A
I've considered cross-platform impact (Windows, macOS) per the compatibility guide — both adapters' status-write helpers are platform-agnostic; fix is reachable on every OS that runs the gateway
I've updated tool descriptions/schemas if I changed tool behavior — N/A

Screenshots / Logs

$ scripts/run_tests.sh tests/gateway/test_platform_reconnect_fatal_notify.py -v
4 workers [7 items]
============================== 7 passed in 1.17s ===============================

$ scripts/run_tests.sh tests/gateway/test_platform_reconnect_fatal_notify.py \
    tests/gateway/test_telegram_network_reconnect.py \
    tests/gateway/test_qqbot.py
4 workers [166 items]
======================== 166 passed, 1 warning in 2.20s ========================

Made with Cursor

…earch#29005) All three ``MAX_RECONNECT_ATTEMPTS`` exit paths in ``QQAdapter._listener_loop`` ended with a bare ``_mark_disconnected()`` and a ``return``. That writes ``platform_state=disconnected`` to the runtime status file and kills the listener task — but does **not** fire the ``fatal_error_handler`` that ``GatewayRunner._platform_reconnect_watcher`` is wired through. With no fatal-error event, the watcher never adds the platform to ``_failed_platforms`` and the adapter stays permanently dead while the gateway process happily keeps running. The reporter's log captures the symptom exactly: QQBot exhausts its 100-attempt budget at 07:33, ``platform_state`` flips to ``disconnected``, and from then on **nothing** in the gateway tries to bring it back. Replace each ``_mark_disconnected()`` exhaustion with the documented escalation: ``_set_fatal_error(..., retryable=True)`` followed by ``await _notify_fatal_error()``. The handler then: * writes ``platform_state=retrying`` so external monitors see the degraded state immediately, * queues the platform in ``_failed_platforms`` with a 30 s next-retry timer, and * lets ``_platform_reconnect_watcher`` keep retrying with the same exponential back-off as every other platform — capped at 5 min and circuit-broken after 10 consecutive failures, just like Telegram / WhatsApp / Slack. A single ``qq_reconnect_exhausted`` error code is reused across the three call sites; the human-readable message differentiates the root cause (rate-limited vs WS close vs unhandled exception) so log greps still tell them apart.

…usResearch#29005) While ``_handle_polling_network_error`` walks the reconnect ladder (default 10 attempts, exponential back-off capped at 60 s ≈ 55 min total), the runtime status file keeps reporting ``platforms.telegram.state = connected``. Anything watching that file — Prometheus exporter, kanban dashboard, status MCP — has no way to tell that the bot has actually been unreachable for the last several minutes. The fatal-state flip only happens after attempt 11, at which point the gateway is already on its way to a process restart. Fill the gap by writing ``platform_state=retrying`` BEFORE the back-off sleep on every attempt, with the attempt counter and last error string in ``error_message`` for diagnostics. On a successful reconnect, write ``connected`` back so monitors stop alerting. Both writes go through ``_write_runtime_status_safe`` (same helper the existing ``_mark_connected`` / ``_set_fatal_error`` use), so a failing status-dir doesn't break the reconnect ladder itself. Reuses the same ``retrying`` value the gateway's ``_handle_adapter_fatal_error`` already writes for retryable failures — no new schema, no new enum.

…sResearch#29005) Seven tests across two classes, covering both halves of the fix. ``TestQQBotReconnectExhaustionNotifies`` * ``test_set_fatal_error_runs_handler_with_retryable`` — drives the end-state the three ``MAX_RECONNECT_ATTEMPTS`` exits reach and verifies the registered ``fatal_error_handler`` is awaited with ``adapter.fatal_error_retryable=True`` (the flag the gateway watcher branches on to keep the platform in its retry queue). * ``test_exhaustion_paths_use_set_fatal_error_not_mark_disconnected`` — static guard over ``QQAdapter._listen_loop`` source: at least three ``_set_fatal_error`` calls and three ``_notify_fatal_error`` awaits, plus the ``qq_reconnect_exhausted`` error code. Reverting any exit to ``_mark_disconnected()`` (the pre-fix behaviour) trips this test. * ``test_notify_fatal_error_noop_without_handler`` — adapters spawned outside a ``GatewayRunner`` (CLI smoke tests, local debug) must not crash when no handler is registered. ``TestTelegramPollingRetryStatus`` * ``test_retry_window_writes_retrying_state`` — first call to ``_handle_polling_network_error`` must emit at least one ``platform_state=retrying`` write before sleeping. * ``test_retrying_write_carries_diagnostic_message`` — the write must include the ``N/MAX`` attempt counter and the ``telegram_network_error`` code so monitors can diagnose without cross-referencing logs. * ``test_successful_reconnect_flips_state_back_to_connected`` — ``retrying`` is followed by ``connected`` (in that order) on a successful reconnect, so external monitors stop alerting. * ``test_final_attempt_still_escalates_to_fatal`` — guards against the new ``retrying`` writes accidentally short-circuiting the existing attempt-11 fatal escalation (no regression for NousResearch#3173). Telegram tests stub ``_write_runtime_status_safe`` and PTB's updater to keep the suite hermetic (no real status file, no real network); the qqbot inspection test uses ``inspect.getsource`` so it can't drift if the listen-loop is refactored as long as the escalation calls still land.

xxxigm added 3 commits May 20, 2026 07:56

alt-glitch added type/bug Something isn't working comp/gateway Gateway runner, session dispatch, delivery platform/qqbot QQ Bot adapter platform/telegram Telegram bot adapter P2 Medium — degraded but workaround exists labels May 20, 2026

This was referenced May 20, 2026

QQBot adapter does not notify gateway on reconnect exhaustion; Telegram retry state not reflected in runtime status #29005

Open

fix(qqbot): notify gateway on reconnect exhaustion + expose retrying state (#29005) #29268

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(gateway): notify on reconnect exhaustion + surface telegram retrying state (#29005)#29007

fix(gateway): notify on reconnect exhaustion + surface telegram retrying state (#29005)#29007
xxxigm wants to merge 3 commits into
NousResearch:mainfrom
xxxigm:fix/29005-platform-reconnect-fatal-notify

xxxigm commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

xxxigm commented May 20, 2026

What does this PR do?

Related Issue

Type of Change

Changes Made

How to Test

Checklist

Code

Documentation & Housekeeping

Screenshots / Logs

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants