fix(qqbot): fix 5 reconnect bugs — zombie state, lost close codes, no cooldown, heartbeat timing, missing heartbeat task#19414

Open
Allonz wants to merge 11 commits into
NousResearch:mainfrom
Allonz:feat/qqbot-reconnect-fix

Allonz commented May 3, 2026

Problem

The QQBot adapter enters a zombie state after repeated WebSocket disconnections: the adapter process remains alive but stops receiving messages. This was observed in production logs, where the connection dropped (code 4009: Session timed out) and then went silent for 93+ minutes.

Root cause: when _listen_loop exhausts its reconnect attempts, it simply returns without notifying the Gateway's _platform_reconnect_watcher. The watcher never knows the adapter has given up, so no higher-level reconnection is attempted.


Fixes by Severity

P0 — Zombie state: _listen_loop exit silently dies without notifying Gateway

Symptom: After MAX_RECONNECT_ATTEMPTS reconnect failures, _listen_loop returns but the adapter process stays alive. No messages are received. The _platform_reconnect_watcher in gateway/run.py is unaware the adapter has given up.

Fix: All three exit paths in _listen_loop now call _set_fatal_error("qq_reconnect_exhausted", ..., retryable=True) before returning. This signals the Gateway's reconnect watcher to take over reconnection at the platform management level.

Affected paths:

  • Rate limit 4008 backoff exhaustion
  • QQCloseError (CLOSE/CLOSED events) exhaustion
  • Generic Exception exhaustion
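
For reviewers, a minimal sketch of the exit-path pattern described above. It is not the adapter's actual code: `SketchAdapter`, the injected `connect` coroutine, the `fatal_error` dict, and the value of `MAX_RECONNECT_ATTEMPTS` are assumptions for illustration. Only the `_set_fatal_error("qq_reconnect_exhausted", ..., retryable=True)` call shape comes from this PR.

```python
import asyncio

MAX_RECONNECT_ATTEMPTS = 3  # assumed value; the PR does not state the real limit


class SketchAdapter:
    """Stand-in for the QQBot adapter; only the exit-path logic is modelled."""

    def __init__(self, connect):
        self._connect = connect   # hypothetical coroutine that raises on failure
        self.fatal_error = None   # what a higher-level watcher would inspect

    def _set_fatal_error(self, code, message, retryable):
        # Record the failure so a watcher can take over reconnection.
        self.fatal_error = {"code": code, "message": message, "retryable": retryable}

    async def _listen_loop(self):
        attempts = 0
        while attempts < MAX_RECONNECT_ATTEMPTS:
            try:
                await self._connect()
                return  # connected; real code would read events here
            except Exception:
                attempts += 1
        # Attempts exhausted: signal the failure instead of returning silently.
        self._set_fatal_error(
            "qq_reconnect_exhausted",
            f"gave up after {attempts} attempts",
            retryable=True,
        )
```

The point of the pattern is that every return after exhaustion leaves an observable, retryable error behind, so a process that is alive but deaf can be detected from outside the loop.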

P1 — CLOSED/ERROR WebSocket events lose close code and reason

Symptom: WSMsgType.CLOSED and WSMsgType.ERROR events raised a plain RuntimeError("WebSocket closed") with no close code or reason. This prevented proper error classification (e.g., distinguishing a server-side 4009 Session Timeout from a network error).

Fix: Changed to raise QQCloseError(msg.data, msg.extra) instead, preserving the close code and reason for downstream error classification logic.
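
A rough sketch of why carrying the close code matters. `MsgType` and `Msg` are stdlib stand-ins for aiohttp's `WSMsgType` and `WSMessage` so the example runs without aiohttp; the `QQCloseError` constructor shape and the `classify` helper are assumptions, not the adapter's real definitions. The meanings of 4008 (rate limit) and 4009 (session timed out) are taken from this PR.

```python
from dataclasses import dataclass
from enum import Enum, auto


class MsgType(Enum):  # stand-in for aiohttp.WSMsgType
    TEXT = auto()
    CLOSED = auto()
    ERROR = auto()


@dataclass
class Msg:  # stand-in for aiohttp.WSMessage (type, data, extra)
    type: MsgType
    data: object = None
    extra: object = None


class QQCloseError(Exception):
    """Hypothetical shape: keeps the close code and reason as attributes."""

    def __init__(self, code, reason):
        super().__init__(f"WebSocket closed: code={code} reason={reason}")
        self.code = code
        self.reason = reason


def handle(msg: Msg):
    # CLOSED/ERROR keep their code and reason instead of a bare RuntimeError.
    if msg.type in (MsgType.CLOSED, MsgType.ERROR):
        raise QQCloseError(msg.data, msg.extra)
    return msg.data


def classify(err: QQCloseError) -> str:
    # Hypothetical downstream classification keyed on the preserved code.
    if err.code == 4009:
        return "session_timeout"
    if err.code == 4008:
        return "rate_limited"
    return "unknown"
```

With a plain `RuntimeError("WebSocket closed")`, `classify` would have nothing to branch on and every close would fall into the same bucket.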

P1 — No cooldown period between reconnect failures

Symptom: After a failed reconnect attempt, the adapter immediately tries again. The QQ server may not have finished cleaning up the old session, causing the new attempt to fail with a session conflict.

Fix: Added a 15-second cooldown (await asyncio.sleep(15)) in _reconnect() after a failed connection attempt, giving the server time to clean up the old session before the next retry.
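
A small sketch of the cooldown, assuming a free-standing helper rather than the real `_reconnect()` method. `reconnect_once`, the injected `open_ws` coroutine, and the injectable `sleep` parameter (there only so the wait can be replaced in tests) are hypothetical; the 15-second figure is from this PR.

```python
import asyncio

RECONNECT_COOLDOWN = 15.0  # seconds, per the fix description


async def reconnect_once(open_ws, cooldown=RECONNECT_COOLDOWN, sleep=asyncio.sleep):
    """Try one reconnect; on failure, wait out the cooldown and report False."""
    try:
        await open_ws()
        return True
    except Exception:
        # Give the server time to clean up the old session before the next try.
        await sleep(cooldown)
        return False
```

The cooldown sits on the failure branch only, so a successful reconnect is never delayed.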

P2 — Heartbeat interval reset on reconnect failure

Symptom: _heartbeat_interval was reset to 30.0 at the beginning of _reconnect(), before the connection attempt. If reconnect failed, the interval was already reset, potentially causing incorrect heartbeat timing on the next attempt.

Fix: Moved _heartbeat_interval = 30.0 inside the try block, after await self._open_ws(gateway_url) succeeds. Now it only resets after a confirmed successful connection.
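
The ordering can be sketched as follows. `IntervalSketch`, the injected `open_ws` coroutine, and the starting value 41.25 (standing in for an interval learned from the previous session) are illustrative assumptions; the 30.0 default and the "reset only after `_open_ws()` succeeds" rule come from this PR.

```python
import asyncio

DEFAULT_HEARTBEAT_INTERVAL = 30.0


class IntervalSketch:
    def __init__(self, open_ws):
        self._open_ws = open_ws            # hypothetical coroutine; raises on failure
        self._heartbeat_interval = 41.25   # illustrative value from an earlier session

    async def _reconnect(self, gateway_url):
        try:
            await self._open_ws(gateway_url)
            # Reset only once the new connection is confirmed open.
            self._heartbeat_interval = DEFAULT_HEARTBEAT_INTERVAL
            return True
        except Exception:
            # Failure path leaves the previous interval untouched.
            return False
```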

P1 — Heartbeat task not recreated on reconnect (60s code=None death loop)

Symptom: After _reconnect() opens a new WebSocket, the old _heartbeat_task from the dead connection is orphaned and no heartbeat task runs on the new connection. The QQ server receives no heartbeat for 60s, then drops the connection with code=None (no close code, because the server terminates the connection without a close handshake). Reconnect succeeds via the Gateway watcher, but the same death loop repeats every ~60s.

Production log pattern:

INFO: [QQBot:xxx] Reconnected
... (exactly ~60s later)
WARNING: [QQBot:xxx] WebSocket closed: code=None reason=
INFO: [QQBot:xxx] Reconnecting in 2s (attempt 1)...
INFO: [QQBot:xxx] Reconnected
... (cycle repeats indefinitely)

Root cause: _reconnect() calls _open_ws() to establish a new WebSocket but never calls asyncio.create_task(self._heartbeat_loop()). The _heartbeat_task attribute still references the old task (which died with the old connection), so no heartbeat is ever sent on the new connection.

Fix (two-part):

  1. WS-level ping/pong via aiohttp heartbeat parameter: Added heartbeat=20 to the aiohttp.ClientSession.ws_connect() call in _open_ws(). This enables aiohttp's built-in ping/pong at the transport layer, sending a WebSocket PING every 20s. If no PONG arrives within the timeout, aiohttp closes the connection with a proper error — preventing silent idle disconnects.

  2. Recreate heartbeat task on reconnect: In _reconnect():

    • Before opening the new WebSocket: cancel the old _heartbeat_task with cancel() and await it (catches CancelledError)
    • After _open_ws() succeeds: create a fresh _heartbeat_task via asyncio.create_task(self._heartbeat_loop())

These two changes together ensure that every reconnected WebSocket has its own active heartbeat, and the transport layer itself has a safety net for idle connections.
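
The task-lifecycle half of the fix can be sketched with plain asyncio. `HeartbeatSketch`, the `beats` counter, the `_cancel_heartbeat` helper, and the injected `open_ws` coroutine are assumptions for illustration; the real `_heartbeat_loop()` sends a protocol heartbeat rather than incrementing a counter.

```python
import asyncio
import contextlib


class HeartbeatSketch:
    def __init__(self, open_ws, interval=30.0):
        self._open_ws = open_ws        # hypothetical coroutine that opens the socket
        self._interval = interval
        self._heartbeat_task = None
        self.beats = 0

    async def _heartbeat_loop(self):
        while True:
            await asyncio.sleep(self._interval)
            self.beats += 1  # real code would send a heartbeat payload here

    async def _cancel_heartbeat(self):
        task, self._heartbeat_task = self._heartbeat_task, None
        if task is not None:
            task.cancel()
            # Wait for the cancellation to finish before moving on.
            with contextlib.suppress(asyncio.CancelledError):
                await task

    async def _reconnect(self):
        await self._cancel_heartbeat()   # before opening the new WebSocket
        await self._open_ws()
        # After a successful open, the new connection gets its own heartbeat.
        self._heartbeat_task = asyncio.create_task(self._heartbeat_loop())
```

The transport-level half needs no extra code beyond the `heartbeat=20` keyword on the `ws_connect()` call described in part 1.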


Changes

| File | Lines | Description |
| --- | --- | --- |
| gateway/platforms/qqbot/adapter.py | +33 / -2 | _listen_loop fatal error signaling, _reconnect cooldown & heartbeat task recreation, CLOSED/ERROR → QQCloseError, WS-level ping/pong |

Test Results

  • 71/71 QQBot adapter tests passed (including new heartbeat task lifecycle tests)
  • 182/182 total gateway tests passed (base adapter, reconnect watcher, runner fatal adapter tests)
  • 0 regressions in adapter base class tests

Reproduction

Bug 1-4 (zombie state): Production logs showed:

WARNING: WebSocket closed: code=4009 reason=Session timed out
WARNING: WebSocket closed: code=4009 reason=Session timed out
... (then silence for 93+ minutes)

After this fix, the adapter properly signals the Gateway, which triggers _platform_reconnect_watcher to reinitialize the adapter.

Bug 5 (60s death loop): Reproduced in staging by:

  1. Connect adapter → heartbeat active on connection
  2. Simulate WebSocket failure (close underlying TCP) → _reconnect() triggers
  3. Observe: new WebSocket opens, "Reconnected" logged, but no heartbeat task created
  4. After 60s: WebSocket closed: code=None reason= → reconnect → repeat

After fix: heartbeat task recreated on every reconnect, WS-level ping/pong prevents idle timeout. Connection stays alive across reconnects.

Allonz added 7 commits May 1, 2026 16:51
- Add media file upload support to _send_qqbot function
- Support chat_type detection from target format (c2c:/group:/guild:)
- Upload media via QQ Bot v2 API (/v2/users/{openid}/files, /v2/groups/{group_openid}/files)
- Map file extensions to QQ Bot file_type (1=image, 2=video, 3=voice, 4=file)
- Include media in message payload via 'file_info' field
- Update error messages to include qqbot in supported platforms
- Update schema description with qqbot target format examples
… attempts

Fix 4 issues in QQBot adapter reconnect logic:

1. [P0] _listen_loop exit now calls _set_fatal_error() to notify Gateway
   When reconnect attempts are exhausted, the adapter now sets a retryable
   fatal error so _platform_reconnect_watcher can take over. Previously
   the listen loop would silently die, leaving QQBot in a zombie state
   where the process is alive but no messages are received.

2. [P1] CLOSED/ERROR WS events now raise QQCloseError instead of plain
   RuntimeError, preserving close code/reason for proper error classification.

3. [P1] Added 15s cooldown after reconnect failure to give QQ server time
   to clean up the old session before the next attempt.

4. [P2] Moved _heartbeat_interval reset inside _reconnect() try block
   so it only resets after a successful connection, not on failure.
alt-glitch added the type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), and platform/qqbot (QQ Bot adapter) labels May 3, 2026
@alt-glitch

Collaborator

Supersedes #17814 and #14565 (same zombie-state root cause: _listen_loop exits without _set_fatal_error). This PR is more comprehensive — also fixes close code loss and adds inter-reconnect cooldown. Related: #14539.

Allonz commented May 3, 2026

Author

Thanks, Siddharth. Yes, I wanted to make sure this fix was comprehensive — preserving the close code for proper error classification and adding the inter-reconnect cooldown were both necessary to avoid the same issue recurring. Glad it covers all the bases. Let me know if any further adjustments are needed.

…a support)

Keep both QQBot and Feishu native media delivery blocks. Preserve
QQBot's full chat_type routing and base64 file upload logic from
HEAD, while incorporating Feishu's media support and thread_id
parameter from upstream/main. Update error/warning strings to
list both platforms.
Allonz commented May 5, 2026

Author

Hi, I've resolved the merge conflict flagged on this PR. Here's the summary:

Conflict Location

All conflicts were in tools/send_message_tool.py — 6 conflict blocks:

| # | Area | Resolution |
| --- | --- | --- |
| 1 | QQBot / Feishu media delivery blocks | Kept both. QQBot's native media upload (upload → REST send) and Feishu's media support coexist side by side. |
| 2 | Error string (media-only message) | Updated to "...qqbot and feishu" to list both platforms. |
| 3 | Warning string (omitted media) | Same: both platforms are now mentioned. |
| 4 | _send_qqbot docstring | Merged: describes C2C/group/guild routing + multimedia support. |
| 5 | Comment block | Merged comments about QQ Bot API endpoints from both sides. |
| 6 | _send_qqbot core logic | Kept the PR's approach: explicit chat_type routing (c2c:, group:, guild: prefix parsing) + base64 file upload via QQ Bot v2 API. Discarded the upstream triple-fallback approach, since our prefix-based routing is more precise. |
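
To make the prefix-based routing in row 6 concrete, here is a minimal sketch. `parse_target` and `upload_path` are hypothetical helpers, not the functions in tools/send_message_tool.py, and rejecting an unprefixed target is an assumption. The two upload paths are the ones named in the commit message; no guild upload endpoint is given in this PR, so the sketch raises for it rather than guessing.

```python
CHAT_TYPES = ("c2c", "group", "guild")


def parse_target(target: str):
    """Split 'c2c:<id>', 'group:<id>' or 'guild:<id>' into (chat_type, id)."""
    chat_type, sep, ident = target.partition(":")
    if not sep or chat_type not in CHAT_TYPES or not ident:
        raise ValueError(f"unsupported qqbot target: {target!r}")
    return chat_type, ident


def upload_path(chat_type: str, ident: str) -> str:
    """Return the file-upload path for the chat types named in this PR."""
    if chat_type == "c2c":
        return f"/v2/users/{ident}/files"
    if chat_type == "group":
        return f"/v2/groups/{ident}/files"
    # Assumption: no upload endpoint is documented here for other chat types.
    raise ValueError(f"no upload endpoint known for chat_type {chat_type!r}")
```

An explicit prefix means the route is decided once, up front, instead of being discovered by trying endpoints in turn.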

What was kept from upstream

  • Feishu media delivery block with thread_id parameter support (fully intact).
  • All other upstream commits in send_message_tool.py.

Tests

Added 50 new tests across 2 files:

  • tests/tools/test_send_message_qqbot.py — 31 tests: chat_type prefix parsing, file type detection (image/video/voice/file), endpoint URL construction per chat type, payload building, base64 encoding, Feishu thread_id parameter verification, and error/warning string checks.
  • tests/gateway/test_qqbot_zombie_fix.py — 19 tests: QQCloseError usage in _read_events, _set_fatal_error signaling on reconnect exhaustion, reconnect cooldown (asyncio.sleep(15)), heartbeat interval reset ordering, and _listen_loop exit path coverage.
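
As an illustration of the file type detection these tests exercise, a minimal sketch follows. The numeric codes (1=image, 2=video, 3=voice, 4=file) come from the commit message; `detect_file_type` and the specific extension sets are assumptions and may not match what the tool actually recognises.

```python
from pathlib import Path

FILE_TYPE_IMAGE = 1
FILE_TYPE_VIDEO = 2
FILE_TYPE_VOICE = 3
FILE_TYPE_FILE = 4

# Assumed extension sets, for illustration only.
_EXTENSION_TO_TYPE = {
    ".png": FILE_TYPE_IMAGE,
    ".jpg": FILE_TYPE_IMAGE,
    ".jpeg": FILE_TYPE_IMAGE,
    ".gif": FILE_TYPE_IMAGE,
    ".mp4": FILE_TYPE_VIDEO,
    ".mov": FILE_TYPE_VIDEO,
    ".mp3": FILE_TYPE_VOICE,
    ".wav": FILE_TYPE_VOICE,
}


def detect_file_type(path: str) -> int:
    """Map a file extension to a QQ Bot file_type, defaulting to generic file."""
    return _EXTENSION_TO_TYPE.get(Path(path).suffix.lower(), FILE_TYPE_FILE)
```

Falling back to 4 keeps unknown extensions deliverable as plain files instead of failing the send.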

Test Results

409 passed, 0 failed, 0 skipped (including all existing gateway/send-message tests)

Full regression suite: tests/gateway/test_qqbot.py, test_feishu.py, test_discord_send.py, test_send_retry.py, test_send_image_file.py, test_send_multiple_images.py — all green.

Diff

 gateway/platforms/qqbot/adapter.py     |  23 +-
 tests/gateway/test_qqbot_zombie_fix.py | 190 +++++++
 tests/tools/test_send_message_qqbot.py | 408 +++++++++++++++
 tools/send_message_tool.py             | 150 ++++---
 4 files changed, 738 insertions(+), 33 deletions(-)

Ready for re-review.

Allonz added 3 commits May 6, 2026 21:43
…ping/pong

Bug 5: _reconnect() opens a new WebSocket but never recreates _heartbeat_task.
The old heartbeat task is orphaned on the dead connection — no heartbeat is
sent on the new one. QQ server closes the connection after 60s with code=None
(no close code because the server drops it cleanly). Reconnect succeeds,
but the death loop repeats every 60s: connect → 60s silence → disconnect
→ reconnect → repeat.

Changes:
- _open_ws(): add heartbeat=20 to the aiohttp ws_connect() call for
  WS-level ping/pong, preventing idle disconnects at the transport layer
- _reconnect(): cancel old _heartbeat_task before opening new WebSocket
- _reconnect(): create_task(_heartbeat_loop()) after successful reconnect
Allonz changed the title from "fix(qqbot): prevent zombie state when _listen_loop exhausts reconnect attempts" to the current title on May 8, 2026