fix(qqbot): fix 5 reconnect bugs (zombie state, lost close codes, no cooldown, heartbeat timing, missing heartbeat task) #19414
Open
Allonz wants to merge 11 commits into
- Add media file upload support to _send_qqbot function
- Support chat_type detection from target format (c2c:/group:/guild:)
- Upload media via QQ Bot v2 API (/v2/users/{openid}/files, /v2/groups/{group_openid}/files)
- Map file extensions to QQ Bot file_type (1=image, 2=video, 3=voice, 4=file)
- Include media in message payload via 'file_info' field
- Update error messages to include qqbot in supported platforms
- Update schema description with qqbot target format examples
… attempts

Fix 4 issues in QQBot adapter reconnect logic:

1. [P0] _listen_loop exit now calls _set_fatal_error() to notify Gateway. When reconnect attempts are exhausted, the adapter now sets a retryable fatal error so _platform_reconnect_watcher can take over. Previously the listen loop would silently die, leaving QQBot in a zombie state where the process is alive but no messages are received.
2. [P1] CLOSED/ERROR WS events now raise QQCloseError instead of plain RuntimeError, preserving close code/reason for proper error classification.
3. [P1] Added 15s cooldown after reconnect failure to give QQ server time to clean up the old session before the next attempt.
4. [P2] Moved _heartbeat_interval reset inside _reconnect() try block so it only resets after a successful connection, not on failure.
Author
Thanks, Siddharth. Yes, I wanted to make sure this fix was comprehensive: preserving the close code for proper error classification and adding the inter-reconnect cooldown were both necessary to avoid the same issue recurring. Glad it covers all the bases. Let me know if any further adjustments are needed.
…a support)

Keep both QQBot and Feishu native media delivery blocks. Preserve QQBot's full chat_type routing and base64 file upload logic from HEAD, while incorporating Feishu's media support and thread_id parameter from upstream/main. Update error/warning strings to list both platforms.
Author
Hi, I've resolved the merge conflict flagged on this PR. Here's the summary:
Conflict Location
All conflicts were in
What was kept from upstream
Tests
Added 50 new tests across 2 files:
Test Results
Full regression suite:
Diff
Ready for re-review.
…ping/pong

Bug 5: _reconnect() opens a new WebSocket but never recreates _heartbeat_task. The old heartbeat task is orphaned on the dead connection, so no heartbeat is sent on the new one. QQ server closes the connection after 60s with code=None (no close code because the server drops it cleanly). Reconnect succeeds, but the death loop repeats every 60s: connect → 60s silence → disconnect → reconnect → repeat.

Changes:
- _open_ws(): add heartbeat=20 to aiohttp ClientWebSocketResponse for WS-level ping/pong, preventing idle disconnects at the transport layer
- _reconnect(): cancel old _heartbeat_task before opening new WebSocket
- _reconnect(): create_task(_heartbeat_loop()) after successful reconnect
Problem
QQBot adapter enters a zombie state after repeated WebSocket disconnections — the adapter process remains alive but stops receiving messages. This was observed in production logs where the connection dropped (code 4009: Session timed out) and then went silent for 93+ minutes.
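To make the failure mode concrete, here is a minimal sketch of a listen loop that gives up silently versus one that reports a retryable fatal error. The class, the always-failing `_reconnect`, the value of `MAX_RECONNECT_ATTEMPTS`, and the `watcher_should_restart` helper are all hypothetical stand-ins for illustration; only the names `_listen_loop`, `_set_fatal_error`, and the `"qq_reconnect_exhausted"` code come from this PR.

```python
import asyncio

MAX_RECONNECT_ATTEMPTS = 3  # hypothetical value chosen for the sketch


class SketchAdapter:
    """Toy adapter; method names mirror the PR, bodies are illustrative only."""

    def __init__(self, signal_on_exhaustion: bool) -> None:
        self.signal_on_exhaustion = signal_on_exhaustion
        self.fatal_error = None  # (code, message, retryable) once set

    def _set_fatal_error(self, code: str, message: str, retryable: bool) -> None:
        self.fatal_error = (code, message, retryable)

    async def _reconnect(self) -> bool:
        await asyncio.sleep(0)  # stand-in for a real connection attempt
        return False  # every attempt fails in this sketch

    async def _listen_loop(self) -> None:
        for _ in range(MAX_RECONNECT_ATTEMPTS):
            if await self._reconnect():
                return
        if self.signal_on_exhaustion:
            # Post-fix behavior: tell the outside world that we gave up.
            self._set_fatal_error(
                "qq_reconnect_exhausted",
                "reconnect attempts exhausted",
                retryable=True,
            )
        # Pre-fix behavior: fall through and return with no signal (zombie).


def watcher_should_restart(adapter: SketchAdapter) -> bool:
    """Hypothetical watcher check: act only on a visible retryable error."""
    return adapter.fatal_error is not None and adapter.fatal_error[2]


def run(signal_on_exhaustion: bool) -> bool:
    adapter = SketchAdapter(signal_on_exhaustion)
    asyncio.run(adapter._listen_loop())
    return watcher_should_restart(adapter)
```

In this toy model, `run(False)` leaves the watcher with nothing to act on, while `run(True)` gives it a retryable error to react to. The real watcher logic in the Gateway is not shown in this PR, so treat the check above as an assumption.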
Root cause: when _listen_loop exhausts its reconnect attempts, it simply returns without notifying the Gateway's _platform_reconnect_watcher. The watcher never knows the adapter has given up, so no higher-level reconnection is attempted.
Fixes by Severity
P0 (Zombie state): _listen_loop exit silently dies without notifying Gateway
Symptom: After MAX_RECONNECT_ATTEMPTS reconnect failures, _listen_loop returns but the adapter process stays alive. No messages are received. The _platform_reconnect_watcher in gateway/run.py is unaware the adapter has given up.
Fix: All three exit paths in _listen_loop now call _set_fatal_error("qq_reconnect_exhausted", ..., retryable=True) before returning. This signals the Gateway's reconnect watcher to take over reconnection at the platform management level.
Affected paths:
- QQCloseError (CLOSE/CLOSED events) exhaustion
- Exception exhaustion
P1: CLOSED/ERROR WebSocket events lose close code and reason
Symptom: WSMsgType.CLOSED and WSMsgType.ERROR events raised a plain RuntimeError("WebSocket closed") with no close code or reason. This prevented proper error classification (e.g., distinguishing a server-side 4009 Session Timeout from a network error).
Fix: Changed to raise QQCloseError(msg.data, msg.extra) instead, preserving the close code and reason for downstream error classification logic.
P1: No cooldown period between reconnect failures
Symptom: After a failed reconnect attempt, the adapter immediately tries again. The QQ server may not have finished cleaning up the old session, causing the new attempt to fail with a session conflict.
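The retry behavior this PR is after (a 15-second pause after a failed attempt) can be sketched as below. This is not the adapter's code: `reconnect_with_cooldown`, its parameters, and the use of `ConnectionError` are assumptions made for the example, and the injectable `sleep` exists only so the pause can be observed without waiting.

```python
import asyncio

RECONNECT_COOLDOWN_SECONDS = 15  # cooldown length stated in this PR


async def reconnect_with_cooldown(
    connect,
    max_attempts,
    cooldown=RECONNECT_COOLDOWN_SECONDS,
    sleep=asyncio.sleep,
):
    """Call connect() up to max_attempts times, pausing after each failure.

    Returns the 1-based attempt number that succeeded, or None if every
    attempt failed. Skipping the pause after the final failure is a choice
    made for this sketch, not something the PR specifies.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            await connect()
            return attempt
        except ConnectionError:
            if attempt < max_attempts:
                # Give the server time to clean up the old session.
                await sleep(cooldown)
    return None
```

Passing a recording coroutine as `sleep` makes it easy to check that exactly one cooldown happens per failed attempt.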
Fix: Added a 15-second await asyncio.sleep(15) in _reconnect() after a failed connection attempt, giving the server time to clean up the old session before the next retry.
P2: Heartbeat interval reset on reconnect failure
Symptom: _heartbeat_interval was reset to 30.0 at the beginning of _reconnect(), before the connection attempt. If reconnect failed, the interval was already reset, potentially causing incorrect heartbeat timing on the next attempt.
Fix: Moved _heartbeat_interval = 30.0 inside the try block, after await self._open_ws(gateway_url) succeeds. Now it only resets after a confirmed successful connection.
P1: Heartbeat task not recreated on reconnect (60s code=None death loop)
Symptom: After _reconnect() opens a new WebSocket, the old _heartbeat_task from the dead connection is orphaned, so no heartbeat task runs on the new connection. QQ server receives no heartbeat for 60s, then drops the connection with code=None (no close code because the server terminates cleanly without negotiation). Reconnect succeeds via the Gateway watcher, but the same death loop repeats every ~60s.
Production log pattern:
Root cause: _reconnect() calls _open_ws() to establish a new WebSocket but never calls asyncio.create_task(self._heartbeat_loop()). The _heartbeat_task attribute still references the old task (which died with the old connection's event loop), so no heartbeat is ever sent on the new connection.
Fix (two-part):
1. WS-level ping/pong via aiohttp heartbeat parameter: Added heartbeat=20 to the aiohttp.ClientSession.ws_connect() call in _open_ws(). This enables aiohttp's built-in ping/pong at the transport layer, sending a WebSocket PING every 20s. If no PONG arrives within the timeout, aiohttp closes the connection with a proper error, preventing silent idle disconnects.
2. Recreate heartbeat task on reconnect: In _reconnect():
   - Cancel the old _heartbeat_task with cancel() and await it (catches CancelledError)
   - After _open_ws() succeeds: create a fresh _heartbeat_task via asyncio.create_task(self._heartbeat_loop())
These two changes together ensure that every reconnected WebSocket has its own active heartbeat, and the transport layer itself has a safety net for idle connections.
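The second part of the fix (cancel the old task, then start a new one after the socket opens) can be sketched with plain asyncio. `HeartbeatSketch`, `connection_id`, and `beats` are hypothetical and exist only to make the behavior observable; `_open_ws` here is a stand-in that opens nothing, and the aiohttp `heartbeat=20` part is not modeled.

```python
import asyncio
import contextlib


class HeartbeatSketch:
    """Toy model of the reconnect path; not the adapter's real code."""

    def __init__(self) -> None:
        self._heartbeat_task = None
        self._heartbeat_interval = 30.0
        self.connection_id = 0  # hypothetical: counts opened connections
        self.beats = []  # connection ids on which a heartbeat was sent

    async def _open_ws(self) -> None:
        await asyncio.sleep(0)  # stand-in for opening the WebSocket
        self.connection_id += 1

    async def _heartbeat_loop(self) -> None:
        conn = self.connection_id  # bind this loop to the current connection
        while True:
            self.beats.append(conn)
            await asyncio.sleep(self._heartbeat_interval)

    async def _reconnect(self) -> None:
        # 1. Cancel the task tied to the dead connection and wait for it.
        old = self._heartbeat_task
        if old is not None:
            old.cancel()
            with contextlib.suppress(asyncio.CancelledError):
                await old
        # 2. Open the new connection; reset the interval only on success.
        await self._open_ws()
        self._heartbeat_interval = 30.0
        # 3. Start a fresh heartbeat task for the new connection.
        self._heartbeat_task = asyncio.create_task(self._heartbeat_loop())
```

Calling `_reconnect()` twice inside a running event loop should leave the first task cancelled and a second, still-running task sending heartbeats for the new connection. Note that suppressing `CancelledError` this way also swallows a cancellation aimed at the caller itself, which a production version may want to handle separately.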
Changes
gateway/platforms/qqbot/adapter.py: _listen_loop fatal error signaling, _reconnect cooldown & heartbeat task recreation, CLOSED/ERROR → QQCloseError, WS-level ping/pong
Test Results
Reproduction
Bug 1-4 (zombie state): Production logs showed:
After this fix, the adapter properly signals the Gateway, which triggers _platform_reconnect_watcher to reinitialize the adapter.
Bug 5 (60s death loop): Reproduced in staging by:
_reconnect() triggers WebSocket closed: code=None reason= → reconnect → repeat
After fix: heartbeat task recreated on every reconnect, WS-level ping/pong prevents idle timeout. Connection stays alive across reconnects.