fix(qqbot): detect quick disconnect loop in exception path and notify… #15051
Open
Satone7 wants to merge 8 commits into
… gateway

Adds quick-disconnect detection (matching the `QQCloseError` handler) to the `except Exception` branch of `_listen_loop()`. When reconnection succeeds but the WebSocket immediately closes, the adapter now bounds retries by `MAX_QUICK_DISCONNECT_COUNT` instead of resetting the counter on every successful reconnect, preventing an infinite retry loop. Also calls `_set_fatal_error()` when `MAX_RECONNECT_ATTEMPTS` is exhausted, so the gateway runner is notified instead of the adapter dying silently.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…uick-disconnect-loop
- Add receive timeout (3x heartbeat interval) in `_read_events` to detect stale connections where the server has closed but the client is unaware (CLOSE-WAIT)
- Add heartbeat failure counting in `_heartbeat_loop`; force a disconnect after 3 consecutive failures to trigger reconnection
- Prevents QQ Bot from appearing online but not receiving messages

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
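A minimal sketch of the receive-timeout idea from the first bullet, assuming an asyncio-style connection whose `recv()` coroutine returns the next frame. `FakeStaleWS`, `read_events`, and the `HEARTBEAT_INTERVAL` value are hypothetical; this is not the adapter's `_read_events`. `asyncio.wait_for` turns a silent CLOSE-WAIT socket into an exception that the listen loop's reconnect path can handle.

```python
import asyncio

# Hypothetical demo value; the real adapter uses the interval negotiated with the QQ gateway.
HEARTBEAT_INTERVAL = 0.05  # seconds
RECV_TIMEOUT = HEARTBEAT_INTERVAL * 3  # "3x heartbeat interval", as in the commit message


class FakeStaleWS:
    """Hypothetical connection stuck in CLOSE-WAIT: recv() never returns."""

    async def recv(self) -> str:
        await asyncio.Event().wait()  # nothing ever arrives
        return ""


async def read_events(ws, handle) -> None:
    """Dispatch frames until the socket stays silent for longer than RECV_TIMEOUT."""
    while True:
        try:
            frame = await asyncio.wait_for(ws.recv(), timeout=RECV_TIMEOUT)
        except asyncio.TimeoutError:
            # No traffic at all (not even heartbeat ACKs) for 3 intervals: assume the
            # peer is gone and raise so the caller's reconnect logic takes over.
            raise RuntimeError("WebSocket receive timeout") from None
        handle(frame)


async def main() -> str:
    try:
        await read_events(FakeStaleWS(), print)
    except RuntimeError as exc:
        return str(exc)
    return "no error"


if __name__ == "__main__":
    print(asyncio.run(main()))  # prints: WebSocket receive timeout
```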
Log heartbeat success every 5th heartbeat (~5 min interval) to enable patrol health monitoring. Previously, successful heartbeats were silent, causing patrol to incorrectly detect "heartbeat silent" and trigger unnecessary restarts.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
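The heartbeat-side changes from the two commits above (consecutive-failure counting and periodic success logging) could look roughly like this. It is a sketch, not the adapter's `_heartbeat_loop`: `send_heartbeat`, `force_disconnect`, the logger name, and the `max_beats` cap are hypothetical, added so the example terminates and is testable.

```python
import asyncio
import logging

logger = logging.getLogger("qqbot.heartbeat")  # hypothetical logger name

MAX_HEARTBEAT_FAILURES = 3  # consecutive failures before forcing a reconnect
LOG_EVERY_N = 5  # per the commit, every 5th heartbeat is roughly every 5 minutes


async def heartbeat_loop(send_heartbeat, force_disconnect, interval: float, max_beats: int) -> dict:
    """Send heartbeats; force a disconnect after MAX_HEARTBEAT_FAILURES failures in a row.

    max_beats exists only so the sketch terminates; a real loop runs until cancelled.
    """
    failures = successes = logged = 0
    for _ in range(max_beats):
        await asyncio.sleep(interval)
        try:
            await send_heartbeat()
        except Exception as exc:
            failures += 1
            logger.warning("Heartbeat failed (%d/%d): %s", failures, MAX_HEARTBEAT_FAILURES, exc)
            if failures >= MAX_HEARTBEAT_FAILURES:
                await force_disconnect()  # closing the socket hands control to the reconnect path
                break
            continue
        failures = 0  # a success breaks the streak: only consecutive failures count
        successes += 1
        if successes % LOG_EVERY_N == 0:
            logged += 1
            logger.info("Heartbeat OK (#%d)", successes)
    return {"successes": successes, "failures": failures, "logged": logged}


async def demo():
    calls = 0
    disconnects = []

    async def send():
        nonlocal calls
        calls += 1
        if calls > 10:  # the 11th heartbeat onward fails
            raise ConnectionError("socket closed")

    async def force_disconnect():
        disconnects.append("forced")

    stats = await heartbeat_loop(send, force_disconnect, interval=0, max_beats=20)
    return stats, disconnects


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    print(asyncio.run(demo()))
```

Resetting `failures` on every success is what limits the trigger to consecutive failures; a single dropped heartbeat among healthy ones never forces a reconnect.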
This was referenced Apr 30, 2026
After `_reconnect()` opened a new WebSocket, the heartbeat task was not recreated. This caused heartbeats to stop being sent after any reconnect, leading to server timeouts (~60s) and continuous disconnect/reconnect cycles. The fix creates a new heartbeat task after `_open_ws()` if the previous task is done or `None`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
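A sketch of the heartbeat-task fix, assuming the adapter runs on asyncio and keeps the task in an attribute such as `_heartbeat_task` (a hypothetical name). `HeartbeatOwner` is a stand-in that models only the task bookkeeping; `_open_ws()` is stubbed out.

```python
import asyncio
from typing import Optional


class HeartbeatOwner:
    """Hypothetical stand-in for the adapter; only the heartbeat-task bookkeeping is modelled."""

    def __init__(self) -> None:
        self._heartbeat_task: Optional[asyncio.Task] = None
        self.beats = 0

    async def _heartbeat_loop(self) -> None:
        while True:
            self.beats += 1
            await asyncio.sleep(0.01)

    async def _open_ws(self) -> None:
        pass  # the real adapter opens the WebSocket here

    async def _reconnect(self) -> bool:
        await self._open_ws()
        # The fix: start a fresh heartbeat if the previous task is missing or finished.
        if self._heartbeat_task is None or self._heartbeat_task.done():
            self._heartbeat_task = asyncio.create_task(self._heartbeat_loop())
        return True


async def _stop(task: asyncio.Task) -> None:
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass


async def demo() -> tuple:
    owner = HeartbeatOwner()
    await owner._reconnect()
    first = owner._heartbeat_task
    await _stop(first)  # the old heartbeat dies along with the old socket
    await owner._reconnect()
    second = owner._heartbeat_task
    beats_before = owner.beats
    await asyncio.sleep(0.05)
    still_beating = owner.beats > beats_before and not second.done()
    await _stop(second)
    return second is not first, still_beating


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Checking `done()` rather than only `None` matters because a heartbeat task that exited with the old socket is still a non-`None` object; without that check, a reconnect would keep the dead task and no heartbeats would be sent.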
…uick-disconnect-loop (Conflicts: gateway/platforms/qqbot/adapter.py)
Description
Adds quick-disconnect detection (matching the `QQCloseError` handler) to the `except Exception` branch of `_listen_loop()`. When reconnection succeeds but the WebSocket immediately closes, the adapter now bounds retries by `MAX_QUICK_DISCONNECT_COUNT` instead of resetting the counter on every successful reconnect, preventing an infinite retry loop. It also calls `_set_fatal_error()` when `MAX_RECONNECT_ATTEMPTS` is exhausted, so the gateway runner is notified instead of the adapter dying silently.
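The bounding logic can be sketched as a self-contained asyncio loop. This is an illustration under stated assumptions, not the adapter's code: `SketchAdapter`, the message passed to `_set_fatal_error()`, `MAX_RECONNECT_ATTEMPTS = 5`, and the zero backoff delays are hypothetical; only the 3-disconnect limit and the 5s threshold come from this PR. The fake `_read_events()` always raises `RuntimeError("WebSocket closed")`, reproducing the reconnect-then-immediately-close cycle.

```python
import asyncio
import time

# Values marked "PR" come from the description; the rest are hypothetical demo values.
MAX_QUICK_DISCONNECT_COUNT = 3    # PR
QUICK_DISCONNECT_SECONDS = 5.0    # PR: "connection lasts < 5s"
MAX_RECONNECT_ATTEMPTS = 5        # hypothetical
BACKOFF_DELAYS = [0.0, 0.0, 0.0]  # hypothetical; zero so the demo finishes instantly


class SketchAdapter:
    """Hypothetical adapter whose WebSocket closes right after every reconnect."""

    def __init__(self, reconnect_ok: bool = True) -> None:
        self.reconnect_ok = reconnect_ok
        self.reconnects = 0
        self.fatal_error = None

    async def _reconnect(self) -> bool:
        self.reconnects += 1
        return self.reconnect_ok  # the TCP/WS handshake "succeeds" (or not)

    async def _read_events(self) -> None:
        raise RuntimeError("WebSocket closed")  # ...but the socket dies at once

    def _set_fatal_error(self, message: str) -> None:
        self.fatal_error = message  # the real adapter notifies the gateway runner

    async def _listen_loop(self) -> None:
        backoff_idx = 0
        quick_disconnect_count = 0
        while True:
            connected_at = time.monotonic()
            try:
                await self._read_events()
            except Exception:
                if time.monotonic() - connected_at < QUICK_DISCONNECT_SECONDS:
                    quick_disconnect_count += 1
                    if quick_disconnect_count >= MAX_QUICK_DISCONNECT_COUNT:
                        self._set_fatal_error(f"{quick_disconnect_count} quick disconnects")
                        return
                else:
                    quick_disconnect_count = 0  # only a stable connection clears the streak
                while True:  # retry until a reconnect succeeds or attempts run out
                    if backoff_idx >= MAX_RECONNECT_ATTEMPTS:
                        self._set_fatal_error("reconnect attempts exhausted")
                        return
                    await asyncio.sleep(BACKOFF_DELAYS[min(backoff_idx, len(BACKOFF_DELAYS) - 1)])
                    backoff_idx += 1
                    if await self._reconnect():
                        backoff_idx = 0  # still reset, but quick_disconnect_count is not
                        break


if __name__ == "__main__":
    adapter = SketchAdapter()
    asyncio.run(adapter._listen_loop())
    print(adapter.fatal_error, adapter.reconnects)  # prints: 3 quick disconnects 2
```

The key detail is that a successful reconnect resets `backoff_idx` but leaves `quick_disconnect_count` alone; that is what lets the streak reach the limit instead of looping forever.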
What does this PR do?
Fixes an infinite retry loop in the QQ Bot adapter's `_listen_loop()` when the WebSocket enters a reconnect-succeed-immediately-close cycle (Phase 2 degradation after repeated code 4009 session timeouts).

Root cause: in the `except Exception` branch, `_reconnect()` returns `True` as soon as the TCP/WS connection is established, resetting both `backoff_idx` and `quick_disconnect_count` to 0. The WS then closes immediately: `_read_events()` raises `RuntimeError("WebSocket closed")`, which lands back in `except Exception`. The `MAX_RECONNECT_ATTEMPTS` bound is never reached because `backoff_idx` keeps getting reset. Unlike the `QQCloseError` handler, this branch had no quick-disconnect detection, so the loop ran unbounded.

The fix adds three things to the `except Exception` handler:

- Quick-disconnect detection (matching `QQCloseError`): when a connection lasts < 5s, `quick_disconnect_count` accumulates; at `MAX_QUICK_DISCONNECT_COUNT` (3), `_set_fatal_error()` is called and the loop exits
- `quick_disconnect_count` is no longer reset on reconnect success; it only resets when a connection stays alive ≥ 5s, so repeated quick disconnects correctly accumulate to the limit
- `_set_fatal_error()` is called on `MAX_RECONNECT_ATTEMPTS` exhaustion, so the gateway runner is notified (fixes the same concern as #14539, "QQ Bot adapter silently stops reconnecting without notifying gateway")

Related Issue
Fixes the root cause behind #12395 (infinite retry loop consuming tokens). Complements #14539 / #14565 (silent return on exhaustion) by also handling the infinite-loop case that prevents exhaustion from ever being reached. Related to #14341 (QQCloseError backoff bound).
Type of Change
Changes Made
- `gateway/platforms/qqbot/adapter.py:542-596`: added quick-disconnect detection (duration < 5s) to the `except Exception` handler, stopped resetting `quick_disconnect_count` on reconnect success, and added `_set_fatal_error()` on `MAX_RECONNECT_ATTEMPTS` exhaustion.

How to Test
- … `RuntimeError("WebSocket closed")` in `_read_events()`)
- … `_set_fatal_error()` and exits the listen loop
- `gateway_state.json` reflects the fatal error state

Alternatively:
- `pytest tests/gateway/test_qqbot.py -q` (all 71 pass)
- … `quick_disconnect_count` in logs

Checklist
Code
- … (`fix(qqbot): ...`)
- … `pytest tests/ -q` and all tests pass

Documentation & Housekeeping
- … (`docs/`, docstrings), or N/A
- … `cli-config.yaml.example` if I added/changed config keys, or N/A

Screenshots / Logs
Before the fix: the `except Exception` handler looped indefinitely with "WebSocket error: WebSocket closed" every ~62s, resetting `backoff_idx` on each successful reconnect. After the fix, 3 quick disconnects trigger `_set_fatal_error()` and the adapter properly signals failure to the gateway runner.
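To make the before/after contrast concrete, the streak counter alone can be modelled as a pure function over connection uptimes. `fatal_at` and the uptime lists are hypothetical illustrations, not data from the logs; `reset_on_reconnect=True` stands in for the old behaviour of wiping the counter on every successful reconnect.

```python
MAX_QUICK_DISCONNECT_COUNT = 3
QUICK_DISCONNECT_SECONDS = 5.0


def fatal_at(uptimes, reset_on_reconnect: bool):
    """Return the 1-based disconnect that trips the fatal error, or None if it never does.

    uptimes: how long each connection stayed open before it dropped (seconds).
    reset_on_reconnect=True models the old behaviour, where every successful
    reconnect wiped the streak before it could grow.
    """
    count = 0
    for i, uptime in enumerate(uptimes, start=1):
        if uptime < QUICK_DISCONNECT_SECONDS:
            count += 1
            if count >= MAX_QUICK_DISCONNECT_COUNT:
                return i
        else:
            count = 0  # a connection that stayed up >= 5s ends the streak
        if reset_on_reconnect:
            count = 0
    return None


storm = [0.2] * 20  # hypothetical reconnect-then-immediately-close cycle
print(fatal_at(storm, reset_on_reconnect=True))   # None: retries forever
print(fatal_at(storm, reset_on_reconnect=False))  # 3

recovering = [0.2, 0.2, 600.0, 0.2, 0.2, 0.2]
print(fatal_at(recovering, reset_on_reconnect=False))  # 6
```

The `recovering` case shows why the counter still resets after a long-lived connection: two quick drops followed by a healthy session should not count toward the next storm.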