fix(qqbot): add backoff upper-bound for QQCloseError reconnect path #13074
Closed
fengtianyu88 wants to merge 1 commit into
…path
The QQCloseError (non-4008) reconnect path in _listen_loop was missing the MAX_RECONNECT_ATTEMPTS upper-bound check that exists in both the Exception handler (line 546) and the 4008 rate-limit handler (line 486). Without this check, if _reconnect() fails permanently for any non-4008 close code, backoff_idx grows indefinitely and the bot retries forever at 60-second intervals instead of giving up cleanly.
Fix: add the same guard after backoff_idx += 1 in the general QQCloseError branch, consistent with the existing Exception path.
jkiwen pushed a commit to jkiwen/hermes-agent that referenced this pull request Apr 21, 2026
Fixes silent hang after permanent network failure. Cherry-picked from PR NousResearch#13074
Collaborator
Likely duplicate of #13461: identical fix that adds the MAX_RECONNECT_ATTEMPTS guard to the QQCloseError reconnect path in _listen_loop.
Contributor
Thanks @fengtianyu88! Your fix was cherry-picked and merged via #14341 with your authorship preserved.
Summary
The QQCloseError (non-4008) reconnect path in _listen_loop is missing the MAX_RECONNECT_ATTEMPTS upper-bound check that exists in both the Exception handler and the 4008 rate-limit handler. This causes the bot to hang silently for hours after a permanent network failure instead of giving up cleanly.
Bug: Silent Hang After Network Failure
Observed behavior (from real logs, 2026-04-20):
After the initial reconnect failure at 08:12:55, the bot produced no logs for nearly 4 hours until the WSL restart killed the process. The bot was alive (gateway was processing other events) but completely unresponsive on QQ — users would perceive it as dead.
Root Cause
In _listen_loop, after _reconnect() returns False, there is no upper-bound check, so backoff_idx grows indefinitely. The RECONNECT_BACKOFF table lookup is capped at index 4, so the retry interval stays constant at 60s. Critically, no log is written between retry attempts (no "Reconnecting in Xs (attempt N)..." and no "Reconnect failed"), so the bot silently hangs until externally killed.
Fix
Add the same MAX_RECONNECT_ATTEMPTS guard that already exists in the except Exception path (line 546) and the 4008 path (line 486). This ensures that after 100 consecutive reconnect failures, the bot logs a clear error message and exits the listen loop cleanly, rather than hanging silently forever.
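As a rough illustration of the guard described here, the following standalone sketch shows a bounded retry loop with a clamped backoff lookup. It is not the adapter's actual code. The first four table values, the helper names backoff_delay and retry_reconnect, the injected sleep and log parameters, and the log wording are all assumptions made for the example. Only the 60s ceiling, the cap at index 4, and the limit of 100 attempts come from the description above.

```python
import asyncio

# Assumed table: only the final 60s entry (index 4) is implied by the PR text.
RECONNECT_BACKOFF = [2, 5, 10, 30, 60]
# The PR text states that the bot gives up after 100 consecutive failures.
MAX_RECONNECT_ATTEMPTS = 100


def backoff_delay(backoff_idx: int) -> int:
    """Clamp the index so the delay plateaus at the last table entry."""
    return RECONNECT_BACKOFF[min(backoff_idx, len(RECONNECT_BACKOFF) - 1)]


async def retry_reconnect(reconnect, sleep=asyncio.sleep, log=print) -> bool:
    """Call `reconnect` until it succeeds or the attempt budget is spent.

    `reconnect` is a hypothetical stand-in for the adapter's _reconnect():
    an async callable that returns True on success and False on failure.
    """
    backoff_idx = 0
    while True:
        if await reconnect():
            return True
        backoff_idx += 1
        # The upper-bound guard: without it this loop never terminates.
        if backoff_idx >= MAX_RECONNECT_ATTEMPTS:
            log(f"Max reconnect attempts ({MAX_RECONNECT_ATTEMPTS}) reached, giving up")
            return False
        delay = backoff_delay(backoff_idx)
        # Logging every attempt avoids the silent hang described above.
        log(f"Reconnect failed, retrying in {delay}s (attempt {backoff_idx})")
        await sleep(delay)
```

Passing sleep and log as parameters is only a convenience for exercising the loop without real delays; the real handler may structure this differently.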
Trigger Condition
This bug triggers whenever _reconnect() permanently fails for any non-4008 close code (e.g., network unreachable, DNS failure, SSL error). In the observed case, WSL lost network connectivity for ~4 hours.