fix(gateway): notify on reconnect exhaustion + surface telegram retrying state (#29005)#29007
Open
xxxigm wants to merge 3 commits into
Open
fix(gateway): notify on reconnect exhaustion + surface telegram retrying state (#29005)#29007xxxigm wants to merge 3 commits into
xxxigm wants to merge 3 commits into
Conversation
…earch#29005) All three ``MAX_RECONNECT_ATTEMPTS`` exit paths in ``QQAdapter._listener_loop`` ended with a bare ``_mark_disconnected()`` and a ``return``. That writes ``platform_state=disconnected`` to the runtime status file and kills the listener task — but does **not** fire the ``fatal_error_handler`` that ``GatewayRunner._platform_reconnect_watcher`` is wired through. With no fatal-error event, the watcher never adds the platform to ``_failed_platforms`` and the adapter stays permanently dead while the gateway process happily keeps running. The reporter's log captures the symptom exactly: QQBot exhausts its 100-attempt budget at 07:33, ``platform_state`` flips to ``disconnected``, and from then on **nothing** in the gateway tries to bring it back. Replace each ``_mark_disconnected()`` exhaustion with the documented escalation: ``_set_fatal_error(..., retryable=True)`` followed by ``await _notify_fatal_error()``. The handler then: * writes ``platform_state=retrying`` so external monitors see the degraded state immediately, * queues the platform in ``_failed_platforms`` with a 30 s next-retry timer, and * lets ``_platform_reconnect_watcher`` keep retrying with the same exponential back-off as every other platform — capped at 5 min and circuit-broken after 10 consecutive failures, just like Telegram / WhatsApp / Slack. A single ``qq_reconnect_exhausted`` error code is reused across the three call sites; the human-readable message differentiates the root cause (rate-limited vs WS close vs unhandled exception) so log greps still tell them apart.
…usResearch#29005) While ``_handle_polling_network_error`` walks the reconnect ladder (default 10 attempts, exponential back-off capped at 60 s ≈ 55 min total), the runtime status file keeps reporting ``platforms.telegram.state = connected``. Anything watching that file — Prometheus exporter, kanban dashboard, status MCP — has no way to tell that the bot has actually been unreachable for the last several minutes. The fatal-state flip only happens after attempt 11, at which point the gateway is already on its way to a process restart. Fill the gap by writing ``platform_state=retrying`` BEFORE the back-off sleep on every attempt, with the attempt counter and last error string in ``error_message`` for diagnostics. On a successful reconnect, write ``connected`` back so monitors stop alerting. Both writes go through ``_write_runtime_status_safe`` (same helper the existing ``_mark_connected`` / ``_set_fatal_error`` use), so a failing status-dir doesn't break the reconnect ladder itself. Reuses the same ``retrying`` value the gateway's ``_handle_adapter_fatal_error`` already writes for retryable failures — no new schema, no new enum.
…sResearch#29005) Seven tests across two classes, covering both halves of the fix. ``TestQQBotReconnectExhaustionNotifies`` * ``test_set_fatal_error_runs_handler_with_retryable`` — drives the end-state the three ``MAX_RECONNECT_ATTEMPTS`` exits reach and verifies the registered ``fatal_error_handler`` is awaited with ``adapter.fatal_error_retryable=True`` (the flag the gateway watcher branches on to keep the platform in its retry queue). * ``test_exhaustion_paths_use_set_fatal_error_not_mark_disconnected`` — static guard over ``QQAdapter._listen_loop`` source: at least three ``_set_fatal_error`` calls and three ``_notify_fatal_error`` awaits, plus the ``qq_reconnect_exhausted`` error code. Reverting any exit to ``_mark_disconnected()`` (the pre-fix behaviour) trips this test. * ``test_notify_fatal_error_noop_without_handler`` — adapters spawned outside a ``GatewayRunner`` (CLI smoke tests, local debug) must not crash when no handler is registered. ``TestTelegramPollingRetryStatus`` * ``test_retry_window_writes_retrying_state`` — first call to ``_handle_polling_network_error`` must emit at least one ``platform_state=retrying`` write before sleeping. * ``test_retrying_write_carries_diagnostic_message`` — the write must include the ``N/MAX`` attempt counter and the ``telegram_network_error`` code so monitors can diagnose without cross-referencing logs. * ``test_successful_reconnect_flips_state_back_to_connected`` — ``retrying`` is followed by ``connected`` (in that order) on a successful reconnect, so external monitors stop alerting. * ``test_final_attempt_still_escalates_to_fatal`` — guards against the new ``retrying`` writes accidentally short-circuiting the existing attempt-11 fatal escalation (no regression for NousResearch#3173). Telegram tests stub ``_write_runtime_status_safe`` and PTB's updater to keep the suite hermetic (no real status file, no real network); the qqbot inspection test uses ``inspect.getsource`` so it can't drift if the listen-loop is refactored as long as the escalation calls still land.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
Fixes #29005 — both halves of the reported regression:
P0 — QQBot: the three
MAX_RECONNECT_ATTEMPTSexit paths inQQAdapter._listen_loopended with a bare_mark_disconnected()and areturn. That writesplatform_state=disconnectedto the runtime status file and kills the listener task — but does not fire thefatal_error_handlerthatGatewayRunner._platform_reconnect_watcheris wired through. With no fatal-error event, the watcher never adds the platform to_failed_platformsand the adapter stays permanently dead while the gateway process keeps running. Reporter's log shows exactly this: QQBot exhausts its 100-attempt budget at 07:33 and from then on nothing tries to bring it back. Fix replaces each_mark_disconnected()with_set_fatal_error(..., retryable=True)+await _notify_fatal_error(), so the same gateway watcher that already handles Telegram / WhatsApp / Slack picks the platform up with exponential back-off (cap 5 min, circuit-broken after 10 consecutive failures).P1 — Telegram: during the polling reconnect ladder (
_handle_polling_network_error, default 10 attempts with exponential back-off capped at 60 s ≈ 55 min total),platform_statestays atconnectedbecause_mark_connected()was called at startup. Anything readingplatforms.telegram.state(Prometheus exporter, kanban dashboard, status MCP) has no way to tell the bot has been unreachable for several minutes. The fatal-state flip only happens at attempt 11. Fix writesplatform_state=retryingbefore each back-off sleep with the attempt counter + last error inerror_message, and writesconnectedback on successful reconnect.Both halves reuse the existing
_set_fatal_error/_write_runtime_status_safehelpers and the existingretryingstate string the gateway's fatal-error handler already emits — no new schema, no new env vars, no public-API change.Related Issue
Fixes #29005.
Type of Change
Changes Made
gateway/platforms/qqbot/adapter.py(+36/-3) —_listen_loop:_mark_disconnected()with_set_fatal_error("qq_reconnect_exhausted", ..., retryable=True)+await _notify_fatal_error()at all threeMAX_RECONNECT_ATTEMPTSexits (rate-limit branch, genericQQCloseErrorbranch, genericExceptionbranch).gateway/platforms/telegram.py(+27) —_handle_polling_network_error:_write_runtime_status_safe(platform_state="retrying", error_code="telegram_network_error", error_message="Network error retry N/MAX (next in Xs): <err>").start_polling(): writeplatform_state="connected"to clear the retrying state.tests/gateway/test_platform_reconnect_fatal_notify.py(+288, new) — 7 regression tests across 2 classes:TestQQBotReconnectExhaustionNotifies(3 cases) — handler awaited withretryable=True; static guard over_listen_loopsource proving at least 3_set_fatal_error+ 3_notify_fatal_errorescalations and theqq_reconnect_exhaustedcode remain; handler-less adapters don't crash.TestTelegramPollingRetryStatus(4 cases) — first-attemptretryingwrite happens; the write carries theN/MAXcounter +telegram_network_errorcode; successful reconnect emitsretryingthenconnectedin order; final-attempt fatal escalation still fires (no regression for Gateway crashes on Telegram Bad Gateway (502) — reconnect loop fails #3173).No production code outside the two adapter files modified.
How to Test
.venvis set up:python3 -m venv .venv && source .venv/bin/activate && pip install -e \".[all,dev]\"Checklist
Code
fix(qqbot): ...,fix(telegram): ...,test(gateway): ...)scripts/run_tests.sh tests/gateway/test_platform_reconnect_fatal_notify.pyand all tests passDocumentation & Housekeeping
docs/, docstrings) — N/A (no public-API change; inline comments call out the gateway-watcher contract + QQBot adapter does not notify gateway on reconnect exhaustion; Telegram retry state not reflected in runtime status #29005)cli-config.yaml.exampleif I added/changed config keys — N/A (no new config)CONTRIBUTING.mdorAGENTS.mdif I changed architecture or workflows — N/AScreenshots / Logs
Made with Cursor