fix(qqbot): detect quick disconnect loop in exception path and notify… #15051

Open
Satone7 wants to merge 8 commits into NousResearch:main from Satone7:fix/qqbot-reconnect-quick-disconnect-loop

Satone7 commented Apr 24, 2026

Description

Adds quick-disconnect detection (matching QQCloseError handler) to the except Exception branch of _listen_loop(). When reconnection succeeds but the WebSocket immediately closes, the adapter now bounds retries by MAX_QUICK_DISCONNECT_COUNT instead of resetting the counter on every successful reconnect — preventing an infinite retry loop.

Also calls _set_fatal_error() when MAX_RECONNECT_ATTEMPTS is exhausted, so the gateway runner is notified instead of the adapter dying silently.

What does this PR do?

Fixes an infinite retry loop in the QQ Bot adapter's _listen_loop() when the WebSocket enters a reconnect-succeed-immediately-close cycle (Phase 2 degradation after repeated code 4009 session timeouts).

Root cause: In the except Exception branch, _reconnect() returns True when TCP/WS establishes, resetting both backoff_idx and quick_disconnect_count to 0. But the WS immediately closes — _read_events() raises RuntimeError("WebSocket closed"), which lands back in except Exception. The MAX_RECONNECT_ATTEMPTS bound is never reached because backoff_idx keeps getting reset. Unlike the QQCloseError handler, this branch had no quick-disconnect detection, so the loop runs unbounded.
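The root cause can be reproduced in isolation. The sketch below is a simplified model, not the adapter's code: `simulate_buggy_loop` and the value chosen for `MAX_RECONNECT_ATTEMPTS` are assumptions made for illustration, and the exact order of increments in the real handler may differ.

```python
MAX_RECONNECT_ATTEMPTS = 10  # hypothetical value, for illustration only


def simulate_buggy_loop(cycles: int) -> int:
    """Model the pre-fix handler and return the highest backoff_idx observed."""
    backoff_idx = 0
    highest = 0
    for _ in range(cycles):
        # _read_events() raises RuntimeError("WebSocket closed"),
        # which lands in the `except Exception` branch.
        backoff_idx += 1
        highest = max(highest, backoff_idx)
        if backoff_idx >= MAX_RECONNECT_ATTEMPTS:
            return highest  # the bound that is supposed to stop the loop
        # _reconnect() returns True because TCP/WS establishes,
        # so the counter is reset before it can ever grow.
        backoff_idx = 0
    return highest
```

However many cycles are simulated, the counter never climbs past its first increment, so the exhaustion check cannot fire.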

The fix adds three things to the except Exception handler:

  1. Quick-disconnect detection (matching QQCloseError): when connection lasts < 5s, quick_disconnect_count accumulates; at MAX_QUICK_DISCONNECT_COUNT (3), _set_fatal_error() is called and the loop exits
  2. quick_disconnect_count is no longer reset on reconnect success — it only resets when a connection stays alive ≥ 5s, so repeated quick disconnects correctly accumulate to the limit
  3. _set_fatal_error() on MAX_RECONNECT_ATTEMPTS exhaustion, so the gateway runner is notified (addresses the same concern as #14539, "QQ Bot adapter silently stops reconnecting without notifying gateway")
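The three changes above can be sketched as a small state object. This is a minimal illustration under stated assumptions, not the code in `adapter.py`: `ReconnectState`, its method names, the `fatal_error` strings, and the value of `MAX_RECONNECT_ATTEMPTS` are hypothetical, while the 5-second threshold and the limit of 3 come from the description.

```python
from dataclasses import dataclass
from typing import Optional

QUICK_DISCONNECT_THRESHOLD = 5.0  # seconds, per the PR description
MAX_QUICK_DISCONNECT_COUNT = 3    # per the PR description
MAX_RECONNECT_ATTEMPTS = 10       # hypothetical value, for illustration only


@dataclass
class ReconnectState:
    backoff_idx: int = 0
    quick_disconnect_count: int = 0
    fatal_error: Optional[str] = None  # stands in for _set_fatal_error()

    def on_disconnect(self, connected_for: float) -> bool:
        """Record a dropped connection; return True if the loop may retry."""
        if connected_for < QUICK_DISCONNECT_THRESHOLD:
            self.quick_disconnect_count += 1
        else:
            # Only a connection that stayed alive long enough clears the count.
            self.quick_disconnect_count = 0
        if self.quick_disconnect_count >= MAX_QUICK_DISCONNECT_COUNT:
            self.fatal_error = "quick disconnect loop"
            return False
        self.backoff_idx += 1
        if self.backoff_idx >= MAX_RECONNECT_ATTEMPTS:
            self.fatal_error = "reconnect attempts exhausted"
            return False
        return True

    def on_reconnect_success(self) -> None:
        """Reset backoff only; quick_disconnect_count is deliberately kept."""
        self.backoff_idx = 0
```

Keeping the two counters independent is the point of the design: a successful reconnect still earns a fresh backoff schedule, but it no longer erases the evidence that the previous connections were short-lived.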

Related Issue

Fixes the root cause behind #12395 (infinite retry loop consuming tokens). Complements #14539 / #14565 (silent return on exhaustion) by also handling the infinite-loop case that prevents exhaustion from ever being reached. Related to #14341 (QQCloseError backoff bound).

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)

Changes Made

  • gateway/platforms/qqbot/adapter.py:542-596 — Added quick-disconnect detection (duration < 5s) to the except Exception handler, stopped resetting quick_disconnect_count on reconnect success, and added _set_fatal_error() on MAX_RECONNECT_ATTEMPTS exhaustion.

How to Test

  1. Run Hermes gateway with QQ Bot enabled
  2. Simulate WebSocket closure without a clean close code (e.g., network interruption that causes RuntimeError("WebSocket closed") in _read_events())
  3. Observe that after 3 quick reconnects (< 5s each), the adapter calls _set_fatal_error() and exits the listen loop
  4. Verify gateway_state.json reflects the fatal error state

Alternatively:

  1. Run existing tests: pytest tests/gateway/test_qqbot.py -q (all 71 pass)
  2. Verify the infinite loop terminates by tracing quick_disconnect_count in logs

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(qqbot): ...)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: Linux, systemd user service

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've considered cross-platform impact (Windows, macOS) — or N/A
  • N/A — bug fix only, no config or tool changes

Screenshots / Logs

Before the fix: the except Exception handler looped indefinitely with "WebSocket error: WebSocket closed" every ~62s, resetting backoff_idx on each successful reconnect. After the fix, 3 quick disconnects trigger _set_fatal_error() and the adapter properly signals failure to the gateway runner.

… gateway

Adds quick-disconnect detection (matching QQCloseError handler) to the
`except Exception` branch of `_listen_loop()`. When reconnection succeeds
but the WebSocket immediately closes, the adapter now bounds retries by
MAX_QUICK_DISCONNECT_COUNT instead of resetting the counter on every
successful reconnect — preventing an infinite retry loop.

Also calls _set_fatal_error() when MAX_RECONNECT_ATTEMPTS is exhausted,
so the gateway runner is notified instead of the adapter dying silently.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
alt-glitch added labels on Apr 24, 2026: type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), platform/qqbot (QQ Bot adapter), comp/gateway (Gateway runner, session dispatch, delivery)
Satone7 and others added 2 commits April 25, 2026 12:25
- Add receive timeout (3x heartbeat interval) in _read_events to detect
  stale connections where server closed but client is unaware (CLOSE-WAIT)
- Add heartbeat failure counting in _heartbeat_loop; force disconnect after
  3 consecutive failures to trigger reconnection
- Prevents QQ Bot from appearing online but not receiving messages

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
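
The two safeguards in this commit message can be sketched with plain asyncio. This is an illustration under assumptions, not the adapter's implementation: `read_one`, `heartbeat_loop`, and the callables passed to them are hypothetical stand-ins for the adapter's WebSocket calls; only the 3x multiplier and the limit of 3 failures come from the commit message.

```python
import asyncio

MAX_HEARTBEAT_FAILURES = 3  # per the commit message


async def read_one(recv, heartbeat_interval: float):
    """Wait for one frame; treat 3x the heartbeat interval of silence as a dead link."""
    try:
        return await asyncio.wait_for(recv(), timeout=3 * heartbeat_interval)
    except asyncio.TimeoutError:
        # Surface a stale connection the same way as a closed one,
        # so the listen loop's reconnect path handles it.
        raise RuntimeError("WebSocket receive timed out") from None


async def heartbeat_loop(send_heartbeat, close, heartbeat_interval: float) -> int:
    """Send heartbeats; force a disconnect after consecutive failures."""
    failures = 0
    while True:
        await asyncio.sleep(heartbeat_interval)
        try:
            await send_heartbeat()
            failures = 0  # any success clears the streak
        except Exception:
            failures += 1
            if failures >= MAX_HEARTBEAT_FAILURES:
                await close()  # closing the socket triggers reconnection
                return failures
```

Both paths end by making the failure visible to the reader side, which is what lets a half-open connection turn into an ordinary reconnect instead of a silent stall.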
Log heartbeat success every 5th heartbeat (~5 min interval) to enable
patrol health monitoring. Previously successful heartbeats were silent,
causing patrol to incorrectly detect "heartbeat silent" and trigger
unnecessary restarts.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
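
The every-fifth-heartbeat logging described in this commit message amounts to a modulo check on a running counter. A minimal sketch, assuming a hypothetical helper `should_log_heartbeat` and logger name (neither is taken from the adapter):

```python
import logging

logger = logging.getLogger("qqbot.heartbeat.sketch")  # hypothetical logger name
HEARTBEAT_LOG_EVERY = 5  # per the commit message


def should_log_heartbeat(success_count: int) -> bool:
    """Return True on every 5th successful heartbeat, logging it as a side effect."""
    if success_count > 0 and success_count % HEARTBEAT_LOG_EVERY == 0:
        logger.info("Heartbeat OK (%d successful so far)", success_count)
        return True
    return False
```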
Satone7 and others added 2 commits May 9, 2026 14:53
After `_reconnect()` opens a new WebSocket, the heartbeat task was not
recreated. This caused heartbeats to stop being sent after any reconnect,
leading to server timeouts (~60s) and continuous disconnect/reconnect cycles.

The fix creates a new heartbeat task after `_open_ws()` if the previous
task is done or None.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
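
The heartbeat-task fix in this commit message can be sketched as follows. `AdapterSketch` is a hypothetical stand-in with stubbed-out `_open_ws()` and `_heartbeat_loop()`; only the "recreate if done or None" condition is taken from the commit message.

```python
import asyncio
from typing import Optional


class AdapterSketch:
    def __init__(self) -> None:
        self._heartbeat_task: Optional[asyncio.Task] = None

    async def _open_ws(self) -> None:
        """Stub: the real adapter opens a new WebSocket here."""

    async def _heartbeat_loop(self) -> None:
        """Stub: the real loop sends heartbeats until cancelled."""
        while True:
            await asyncio.sleep(0.01)

    async def _reconnect(self) -> bool:
        await self._open_ws()
        # Recreate the heartbeat task only if the old one is gone or finished,
        # so a still-running task is never duplicated.
        if self._heartbeat_task is None or self._heartbeat_task.done():
            self._heartbeat_task = asyncio.create_task(self._heartbeat_loop())
        return True
```

Checking `done()` rather than unconditionally creating a task avoids two heartbeat loops running against the same connection.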