fix(qqbot): add backoff upper-bound for QQCloseError reconnect path #13074

Closed
fengtianyu88 wants to merge 1 commit into NousResearch:main from fengtianyu88:fix/qqbot-reconnect-backoff

fengtianyu88 commented Apr 20, 2026

Contributor

Summary

The QQCloseError (non-4008) reconnect path in _listen_loop is missing the MAX_RECONNECT_ATTEMPTS upper-bound check that exists in both the Exception handler and the 4008 rate-limit handler. As a result, after a permanent network failure the bot keeps retrying silently and indefinitely instead of giving up cleanly.

Bug: Silent Hang After Network Failure

Observed behavior (from real logs, 2026-04-20):

08:12:53  WebSocket error: WebSocket closed
08:12:53  Reconnecting in 2s (attempt 1)...
08:12:55  Reconnect failed: Cannot connect to host api.sgroup.qq.com:443 [Network is unreachable]
08:12:55 ~ 08:51:59  ← ZERO logs — completely silent for ~39 minutes
08:51:59           ← WSL Ubuntu rebooted (old gateway process killed)
12:01:33           ← New gateway started, reconnected successfully

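After the initial reconnect failure at 08:12:55, the bot produced no logs for about 39 minutes, until the WSL restart at 08:51:59 killed the process. QQ connectivity did not return until the new gateway started at 12:01:33, nearly 4 hours after the failure. During the silent window the bot was alive (the gateway was processing other events) but completely unresponsive on QQ, so users would perceive it as dead.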

Root Cause

In _listen_loop, after _reconnect() returns False:

# Line 536-537 (QQCloseError non-4008 path):
else:
    backoff_idx += 1  # ← no upper-bound check
# → continue → while self._running loop → sleep(backoff) → _reconnect() again

With no upper-bound check, backoff_idx grows indefinitely. The RECONNECT_BACKOFF table lookup clamps the effective index at 4, so the retry interval settles at a constant 60s and the loop never gives up. Critically, no log is written between retry attempts (neither "Reconnecting in Xs (attempt N)..." nor "Reconnect failed"), so the bot hangs silently until it is killed externally.
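
To make the clamping behaviour concrete, here is a minimal sketch of a clamped table lookup. It is not the adapter's actual code: the helper name backoff_delay is hypothetical, and only the first (2s) and last (60s) table entries are implied by the logs and description above; the middle entries are assumed for illustration.

```python
# Hypothetical table: 2s (first retry) and 60s (index 4) come from the PR text,
# the values in between are assumptions made for this sketch.
RECONNECT_BACKOFF = [2, 5, 10, 30, 60]


def backoff_delay(backoff_idx: int) -> int:
    """Return the retry delay, clamping the index to the last table entry."""
    return RECONNECT_BACKOFF[min(backoff_idx, len(RECONNECT_BACKOFF) - 1)]


# backoff_idx keeps growing, but the delay stops changing once it passes index 4.
delays = [backoff_delay(i) for i in range(8)]
print(delays)
```

Under these assumptions every retry past the fifth waits the same 60 seconds, which is why an unbounded backoff_idx turns into an endless fixed-interval retry rather than an ever-growing delay.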

Fix

Add the same MAX_RECONNECT_ATTEMPTS guard that already exists in the except Exception path (line 546) and the 4008 path (line 486):

# Before (line 536-537):
else:
    backoff_idx += 1

# After:
else:
    backoff_idx += 1
    if backoff_idx >= MAX_RECONNECT_ATTEMPTS:
        logger.error("[%s] Max reconnect attempts reached (QQCloseError)", self._log_tag)
        return

This ensures that after 100 consecutive reconnect failures, the bot logs a clear error message and exits the listen loop cleanly, rather than hanging silently forever.
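
The effect of the guard can be checked in isolation with a small simulation. This is a sketch under stated assumptions, not the real _listen_loop: run_reconnect_loop, its reconnect and sleep parameters, and the middle backoff values are all hypothetical, and MAX_RECONNECT_ATTEMPTS = 100 is taken from the description above.

```python
import logging

logger = logging.getLogger("qqbot-sketch")

MAX_RECONNECT_ATTEMPTS = 100            # value implied by the PR description
RECONNECT_BACKOFF = [2, 5, 10, 30, 60]  # middle entries are assumed


def run_reconnect_loop(reconnect, sleep, log_tag="qqbot"):
    """Retry reconnect() until it succeeds or the attempt cap is reached.

    Returns the number of failed attempts made before the loop exited.
    """
    backoff_idx = 0
    while True:
        # Clamp the lookup so the delay settles at the last table entry.
        delay = RECONNECT_BACKOFF[min(backoff_idx, len(RECONNECT_BACKOFF) - 1)]
        sleep(delay)
        if reconnect():
            return backoff_idx
        backoff_idx += 1
        # The guard added by this PR: give up with a clear error message.
        if backoff_idx >= MAX_RECONNECT_ATTEMPTS:
            logger.error("[%s] Max reconnect attempts reached (QQCloseError)", log_tag)
            return backoff_idx


# Simulate a permanent network failure: reconnect always fails and the
# sleep calls are recorded instead of actually waiting.
slept = []
failures = run_reconnect_loop(reconnect=lambda: False, sleep=slept.append)
print(failures, len(slept))
```

Injecting sleep as a parameter keeps the simulation instant; without the guard, the same call with a reconnect that always fails would never return.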

Trigger Condition

This bug triggers whenever _reconnect() permanently fails for any non-4008 close code (e.g., network unreachable, DNS failure, SSL error). In the observed case, WSL lost network connectivity for ~4 hours.

…path

The QQCloseError (non-4008) reconnect path in _listen_loop was
missing the MAX_RECONNECT_ATTEMPTS upper-bound check that exists
in both the Exception handler (line 546) and the 4008 rate-limit
handler (line 486). Without this check, if _reconnect() fails
permanently for any non-4008 close code, backoff_idx grows
indefinitely and the bot retries forever at 60-second intervals
instead of giving up cleanly.

Fix: add the same guard after backoff_idx += 1 in the general
QQCloseError branch, consistent with the existing Exception path.
jkiwen pushed a commit to jkiwen/hermes-agent that referenced this pull request Apr 21, 2026
Fixes silent hang after permanent network failure.

Cherry-picked from PR NousResearch#13074
alt-glitch added labels Apr 22, 2026: type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), comp/gateway (Gateway runner, session dispatch, delivery), platform/qqbot (QQ Bot adapter), duplicate (This issue or pull request already exists)
@alt-glitch

Collaborator

Likely duplicate of #13461 — identical fix: add MAX_RECONNECT_ATTEMPTS guard to QQCloseError reconnect path in _listen_loop.

@teknium1

Contributor

Thanks @fengtianyu88! Your fix was cherry-picked and merged via #14341 with your authorship preserved.
