Skip to content

fix(gateway): notify on reconnect exhaustion + surface telegram retrying state (#29005)#29007

Open
xxxigm wants to merge 3 commits into
NousResearch:mainfrom
xxxigm:fix/29005-platform-reconnect-fatal-notify
Open

fix(gateway): notify on reconnect exhaustion + surface telegram retrying state (#29005)#29007
xxxigm wants to merge 3 commits into
NousResearch:mainfrom
xxxigm:fix/29005-platform-reconnect-fatal-notify

Conversation

@xxxigm

@xxxigm xxxigm commented May 20, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Fixes #29005 — both halves of the reported regression:

P0 — QQBot: the three MAX_RECONNECT_ATTEMPTS exit paths in QQAdapter._listen_loop ended with a bare _mark_disconnected() and a return. That writes platform_state=disconnected to the runtime status file and kills the listener task — but does not fire the fatal_error_handler that GatewayRunner._platform_reconnect_watcher is wired through. With no fatal-error event, the watcher never adds the platform to _failed_platforms and the adapter stays permanently dead while the gateway process keeps running. Reporter's log shows exactly this: QQBot exhausts its 100-attempt budget at 07:33 and from then on nothing tries to bring it back. Fix replaces each _mark_disconnected() with _set_fatal_error(..., retryable=True) + await _notify_fatal_error(), so the same gateway watcher that already handles Telegram / WhatsApp / Slack picks the platform up with exponential back-off (cap 5 min, circuit-broken after 10 consecutive failures).

P1 — Telegram: during the polling reconnect ladder (_handle_polling_network_error, default 10 attempts with exponential back-off capped at 60 s ≈ 55 min total), platform_state stays at connected because _mark_connected() was called at startup. Anything reading platforms.telegram.state (Prometheus exporter, kanban dashboard, status MCP) has no way to tell the bot has been unreachable for several minutes. The fatal-state flip only happens at attempt 11. Fix writes platform_state=retrying before each back-off sleep with the attempt counter + last error in error_message, and writes connected back on successful reconnect.

Both halves reuse the existing _set_fatal_error / _write_runtime_status_safe helpers and the existing retrying state string the gateway's fatal-error handler already emits — no new schema, no new env vars, no public-API change.

Related Issue

Fixes #29005.

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • gateway/platforms/qqbot/adapter.py (+36/-3) — _listen_loop:
    • Replace _mark_disconnected() with _set_fatal_error("qq_reconnect_exhausted", ..., retryable=True) + await _notify_fatal_error() at all three MAX_RECONNECT_ATTEMPTS exits (rate-limit branch, generic QQCloseError branch, generic Exception branch).
    • Single shared error code; messages differentiate the root cause so log greps still tell the three sites apart.
  • gateway/platforms/telegram.py (+27) — _handle_polling_network_error:
    • Before sleeping: _write_runtime_status_safe(platform_state="retrying", error_code="telegram_network_error", error_message="Network error retry N/MAX (next in Xs): <err>").
    • After successful start_polling(): write platform_state="connected" to clear the retrying state.
  • tests/gateway/test_platform_reconnect_fatal_notify.py (+288, new) — 7 regression tests across 2 classes:
    • TestQQBotReconnectExhaustionNotifies (3 cases) — handler awaited with retryable=True; static guard over _listen_loop source proving at least 3 _set_fatal_error + 3 _notify_fatal_error escalations and the qq_reconnect_exhausted code remain; handler-less adapters don't crash.
    • TestTelegramPollingRetryStatus (4 cases) — first-attempt retrying write happens; the write carries the N/MAX counter + telegram_network_error code; successful reconnect emits retrying then connected in order; final-attempt fatal escalation still fires (no regression for Gateway crashes on Telegram Bad Gateway (502) — reconnect loop fails #3173).

No production code outside the two adapter files modified.

How to Test

  1. Check out this branch and ensure .venv is set up: python3 -m venv .venv && source .venv/bin/activate && pip install -e \".[all,dev]\"
  2. Run the new tests on their own:
    scripts/run_tests.sh tests/gateway/test_platform_reconnect_fatal_notify.py -v
    
    Expected: 7 passed.
  3. Run the surrounding suite to confirm no cross-file regressions:
    scripts/run_tests.sh tests/gateway/test_platform_reconnect_fatal_notify.py \
      tests/gateway/test_telegram_network_reconnect.py \
      tests/gateway/test_qqbot.py
    
    Expected: 166 passed.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(qqbot): ..., fix(telegram): ..., test(gateway): ...)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix (no unrelated commits)
  • I've run scripts/run_tests.sh tests/gateway/test_platform_reconnect_fatal_notify.py and all tests pass
  • I've added tests for my changes
  • I've tested on my platform: macOS 15.2 (Darwin 24.6.0), Python 3.12

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — N/A (no public-API change; inline comments call out the gateway-watcher contract + QQBot adapter does not notify gateway on reconnect exhaustion; Telegram retry state not reflected in runtime status #29005)
  • I've updated cli-config.yaml.example if I added/changed config keys — N/A (no new config)
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — both adapters' status-write helpers are platform-agnostic; fix is reachable on every OS that runs the gateway
  • I've updated tool descriptions/schemas if I changed tool behavior — N/A

Screenshots / Logs

$ scripts/run_tests.sh tests/gateway/test_platform_reconnect_fatal_notify.py -v
4 workers [7 items]
============================== 7 passed in 1.17s ===============================

$ scripts/run_tests.sh tests/gateway/test_platform_reconnect_fatal_notify.py \
    tests/gateway/test_telegram_network_reconnect.py \
    tests/gateway/test_qqbot.py
4 workers [166 items]
======================== 166 passed, 1 warning in 2.20s ========================

Made with Cursor

xxxigm added 3 commits May 20, 2026 07:56
…earch#29005)

All three ``MAX_RECONNECT_ATTEMPTS`` exit paths in
``QQAdapter._listener_loop`` ended with a bare ``_mark_disconnected()``
and a ``return``.  That writes ``platform_state=disconnected`` to
the runtime status file and kills the listener task — but does **not**
fire the ``fatal_error_handler`` that
``GatewayRunner._platform_reconnect_watcher`` is wired through.  With
no fatal-error event, the watcher never adds the platform to
``_failed_platforms`` and the adapter stays permanently dead while
the gateway process happily keeps running.

The reporter's log captures the symptom exactly: QQBot exhausts its
100-attempt budget at 07:33, ``platform_state`` flips to
``disconnected``, and from then on **nothing** in the gateway tries
to bring it back.

Replace each ``_mark_disconnected()`` exhaustion with the documented
escalation: ``_set_fatal_error(..., retryable=True)`` followed by
``await _notify_fatal_error()``.  The handler then:

* writes ``platform_state=retrying`` so external monitors see the
  degraded state immediately,
* queues the platform in ``_failed_platforms`` with a 30 s next-retry
  timer, and
* lets ``_platform_reconnect_watcher`` keep retrying with the same
  exponential back-off as every other platform — capped at 5 min
  and circuit-broken after 10 consecutive failures, just like
  Telegram / WhatsApp / Slack.

A single ``qq_reconnect_exhausted`` error code is reused across the
three call sites; the human-readable message differentiates the
root cause (rate-limited vs WS close vs unhandled exception) so log
greps still tell them apart.
…usResearch#29005)

While ``_handle_polling_network_error`` walks the reconnect ladder
(default 10 attempts, exponential back-off capped at 60 s ≈ 55 min
total), the runtime status file keeps reporting
``platforms.telegram.state = connected``.  Anything watching that
file — Prometheus exporter, kanban dashboard, status MCP — has no
way to tell that the bot has actually been unreachable for the last
several minutes.  The fatal-state flip only happens after attempt
11, at which point the gateway is already on its way to a process
restart.

Fill the gap by writing ``platform_state=retrying`` BEFORE the
back-off sleep on every attempt, with the attempt counter and last
error string in ``error_message`` for diagnostics.  On a successful
reconnect, write ``connected`` back so monitors stop alerting.

Both writes go through ``_write_runtime_status_safe`` (same helper
the existing ``_mark_connected`` / ``_set_fatal_error`` use), so a
failing status-dir doesn't break the reconnect ladder itself.

Reuses the same ``retrying`` value the gateway's
``_handle_adapter_fatal_error`` already writes for retryable
failures — no new schema, no new enum.
…sResearch#29005)

Seven tests across two classes, covering both halves of the fix.

``TestQQBotReconnectExhaustionNotifies``
* ``test_set_fatal_error_runs_handler_with_retryable`` — drives the
  end-state the three ``MAX_RECONNECT_ATTEMPTS`` exits reach and
  verifies the registered ``fatal_error_handler`` is awaited with
  ``adapter.fatal_error_retryable=True`` (the flag the gateway
  watcher branches on to keep the platform in its retry queue).
* ``test_exhaustion_paths_use_set_fatal_error_not_mark_disconnected``
  — static guard over ``QQAdapter._listen_loop`` source: at least
  three ``_set_fatal_error`` calls and three ``_notify_fatal_error``
  awaits, plus the ``qq_reconnect_exhausted`` error code.  Reverting
  any exit to ``_mark_disconnected()`` (the pre-fix behaviour) trips
  this test.
* ``test_notify_fatal_error_noop_without_handler`` — adapters
  spawned outside a ``GatewayRunner`` (CLI smoke tests, local
  debug) must not crash when no handler is registered.

``TestTelegramPollingRetryStatus``
* ``test_retry_window_writes_retrying_state`` — first call to
  ``_handle_polling_network_error`` must emit at least one
  ``platform_state=retrying`` write before sleeping.
* ``test_retrying_write_carries_diagnostic_message`` — the write
  must include the ``N/MAX`` attempt counter and the
  ``telegram_network_error`` code so monitors can diagnose without
  cross-referencing logs.
* ``test_successful_reconnect_flips_state_back_to_connected`` —
  ``retrying`` is followed by ``connected`` (in that order) on a
  successful reconnect, so external monitors stop alerting.
* ``test_final_attempt_still_escalates_to_fatal`` — guards against
  the new ``retrying`` writes accidentally short-circuiting the
  existing attempt-11 fatal escalation (no regression for NousResearch#3173).

Telegram tests stub ``_write_runtime_status_safe`` and PTB's updater
to keep the suite hermetic (no real status file, no real network);
the qqbot inspection test uses ``inspect.getsource`` so it can't
drift if the listen-loop is refactored as long as the escalation
calls still land.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists platform/qqbot QQ Bot adapter platform/telegram Telegram bot adapter type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

QQBot adapter does not notify gateway on reconnect exhaustion; Telegram retry state not reflected in runtime status

2 participants