fix(qqbot): add gateway URL cache, retry, and rate-limit handling #18172

Open
cxgreat2014 wants to merge 1 commit into NousResearch:main from cxgreat2014:fix/qqbot-reconnect-enhance

cxgreat2014 commented May 1, 2026

What does this PR do?

Hardens the QQ Bot WebSocket reconnect path by adding gateway URL caching, internal retry, rate-limit detection, and stale HTTP client recovery — preventing the adapter from entering a death loop after transient WebSocket disconnects.

Problem

QQ WebSocket disconnects (especially code 4009 Session timed out) trigger reconnect storms. Each reconnect calls api.sgroup.qq.com/gateway via _get_gateway_url(), which is rate-limited to ~2 calls per time window (confirmed via live API testing). After the first 2 reconnects, all subsequent attempts hit HTTP 400 "接口调用超过频率限制" (frequency limit exceeded).

This creates a death loop:

  1. WebSocket disconnects → _reconnect() → _get_gateway_url() → HTTP 400 rate-limit
  2. backoff_idx increments → retry with backoff → same rate-limited endpoint → same 400
  3. After 3 quick disconnects (MAX_QUICK_DISCONNECT_COUNT), _set_fatal_error() kills the adapter
  4. Gateway shuts down → Docker restarts → fresh reconnect → hits rate limit again → loop

Related issues: #17703, #14539, #15490

Changes

gateway/platforms/qqbot/constants.py

  • Add GATEWAY_URL_RETRY_DELAYS = [0.5, 1.5, 3.0] for bounded internal retry
  • Raise MAX_QUICK_DISCONNECT_COUNT from 3 to 6 (reduce false positives during reconnect storms)
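
A minimal sketch of how a delay list like this could drive a bounded retry. The helper name `fetch_with_retry` and the injected `fetch`/`sleep` callables are hypothetical and not taken from the PR. The sketch also assumes one initial attempt followed by one retry per delay entry, which the description does not spell out.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

# Values as listed in the PR description.
GATEWAY_URL_RETRY_DELAYS = [0.5, 1.5, 3.0]
MAX_QUICK_DISCONNECT_COUNT = 6


async def fetch_with_retry(
    fetch: Callable[[], Awaitable[str]],
    delays: Sequence[float] = GATEWAY_URL_RETRY_DELAYS,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> str:
    """Hypothetical helper: one initial attempt, then one retry per delay."""
    try:
        return await fetch()
    except Exception as exc:  # broad on purpose: any transient failure retries
        last_error = exc
    for delay in delays:
        await sleep(delay)  # wait before the next attempt
        try:
            return await fetch()
        except Exception as exc:
            last_error = exc
    # The retry budget is bounded, so the caller always gets an answer or an error.
    raise RuntimeError(
        f"gateway URL lookup failed after {len(delays) + 1} attempts"
    ) from last_error
```

Injecting `sleep` keeps the helper testable without real waiting; the worst-case added latency with the listed delays is 5 seconds.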

gateway/platforms/qqbot/adapter.py

Five targeted fixes, all within existing methods:

  1. Gateway URL cache: _last_gateway_url caches the last successfully resolved URL. On subsequent reconnects, the cached URL is returned immediately, with zero API calls to /gateway.

  2. Internal retry: _get_gateway_url() retries up to 3 times (GATEWAY_URL_RETRY_DELAYS) before giving up. Transient network errors no longer cause immediate failure.

  3. Rate-limit detection: If /gateway returns HTTP 400 with "频率限制" in the body, the adapter enters a cooldown (RATE_LIMIT_DELAY) and falls back to the cached URL. If no cache is available, a clear error is raised.

  4. Fresh HTTP client: _ensure_fresh_client() rebuilds self._http_client on each reconnect, preventing stale connection-pool exceptions that produce empty str() and unhelpful log messages.

  5. Safe error messages: _safe_str() helper ensures exception messages in logs are never empty, falling back to repr() when str() produces an empty string.
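
The cache, rate-limit fallback, and safe-message behaviour above can be sketched roughly as follows. This is not the adapter's code: `GatewayResolver`, its injected `fetch` transport, the `force_refresh` flag, the `"url"` response key, and the `RATE_LIMIT_DELAY` value are all assumptions made for illustration, and the sketch is synchronous for brevity.

```python
import json
import time
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

RATE_LIMIT_MARKER = "频率限制"  # substring the PR description looks for in the body
RATE_LIMIT_DELAY = 5.0  # placeholder value; the PR text does not state it


def safe_str(exc: BaseException) -> str:
    """Never return an empty message: fall back to repr() when str() is empty."""
    return str(exc) or repr(exc)


@dataclass
class GatewayResolver:
    # Hypothetical transport: returns (http_status, response_body).
    fetch: Callable[[], Tuple[int, str]]
    sleep: Callable[[float], None] = time.sleep
    last_gateway_url: Optional[str] = None
    api_calls: int = 0

    def resolve(self, force_refresh: bool = False) -> str:
        # Cache hit: no request to /gateway at all.
        if self.last_gateway_url and not force_refresh:
            return self.last_gateway_url
        self.api_calls += 1
        status, body = self.fetch()
        if status == 400 and RATE_LIMIT_MARKER in body:
            self.sleep(RATE_LIMIT_DELAY)  # cooldown before continuing
            if self.last_gateway_url:
                return self.last_gateway_url  # fall back to the cached URL
            raise RuntimeError("gateway lookup rate-limited and no cached URL available")
        try:
            url = json.loads(body)["url"]  # assumed response shape
        except (ValueError, KeyError, TypeError) as exc:
            if self.last_gateway_url:
                return self.last_gateway_url
            raise RuntimeError(f"bad gateway response: {safe_str(exc)}") from exc
        self.last_gateway_url = url
        return url
```

For example, `safe_str(RuntimeError())` yields `"RuntimeError()"` instead of an empty string, which is the logging behaviour item 5 describes.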

How to Test

pytest tests/gateway/test_qqbot.py -q

Additionally verified with a standalone mock test suite simulating 6 scenarios:

  • Normal flow (cache + reuse)
  • Rate limit (cache hit under RL, fail gracefully without cache)
  • Reconnect storm (rate_limit=2, 6 reconnects: 1 API call, 5 cache hits)
  • Cache fallback (all retries fail → cached URL used)
  • Empty body response (JSON parse error → cache fallback)
  • _safe_str() (empty RuntimeError → meaningful message)
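
As an illustration of the reconnect-storm scenario, here is a small self-contained mock in the same spirit. It is not the PR's actual test suite: `FakeGateway` and `run_storm` are hypothetical names, and the mock treats the rate limit as a simple call counter with no time window.

```python
from typing import List, Optional, Tuple


class FakeGateway:
    """Mock /gateway endpoint: allows `rate_limit` calls, then answers HTTP 400."""

    def __init__(self, rate_limit: int) -> None:
        self.rate_limit = rate_limit
        self.calls = 0

    def get(self) -> Tuple[int, str]:
        self.calls += 1
        if self.calls > self.rate_limit:
            return 400, "接口调用超过频率限制"
        return 200, "wss://gateway.example.invalid/ws"


def run_storm(gateway: FakeGateway, reconnects: int, use_cache: bool) -> List[str]:
    """Simulate `reconnects` reconnect attempts and record how each resolved."""
    cached: Optional[str] = None
    outcomes: List[str] = []
    for _ in range(reconnects):
        if use_cache and cached is not None:
            outcomes.append("cache")  # no API call needed
            continue
        status, body = gateway.get()
        if status == 200:
            cached = body
            outcomes.append("api")
        else:
            outcomes.append("rate-limited")
    return outcomes


def test_storm_with_cache() -> None:
    gw = FakeGateway(rate_limit=2)
    assert run_storm(gw, 6, use_cache=True) == ["api"] + ["cache"] * 5
    assert gw.calls == 1


def test_storm_without_cache() -> None:
    gw = FakeGateway(rate_limit=2)
    assert run_storm(gw, 6, use_cache=False) == ["api", "api"] + ["rate-limited"] * 4
    assert gw.calls == 6
```

The second test reproduces the failure mode from the Problem section: without a cache, every reconnect after the second hits the rate limit.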

Type of Change

  • Bug fix
  • Tests

Notes

Builds on and extends PR #17256 by adding:

  • Rate-limit detection (HTTP 400 + frequency-limit message in body)
  • Fresh HTTP client on each reconnect
  • Safe error messages for logging

The gateway URL cache TTL is intentionally omitted — QQ gateway URLs are long-lived, and a simple _last_gateway_url field is sufficient.

Labels

  • P2: Medium, degraded but workaround exists
  • platform/qqbot: QQ Bot adapter
  • type/bug: Something isn't working

2 participants