
QQBot WebSocket _open_ws() hangs indefinitely on stale CLOSE-WAIT connections, freezing entire gateway #18221

@Hx-zh

Description

The QQBot adapter's _open_ws() method can permanently block when cleaning up a stale WebSocket connection in CLOSE-WAIT state. This freezes the entire gateway process — it remains alive but stops processing events, writing logs, or responding to health checks.

This is distinct from but related to #17703, #14539, and #15490.

Environment

hermes-agent version: v0.11.0 (commit 454d883)

Python: 3.12

OS: Ubuntu 22.04 (WSL2)

aiohttp: 3.11.x

Symptoms

Process state: S (sleeping) (not dead, not spinning)

TCP connections stuck in CLOSE-WAIT:

CLOSE-WAIT 25 0 172.24.40.238:35638 101.91.34.226:443
CLOSE-WAIT 32 0 172.24.40.238:47022 101.91.19.174:443

Logs stop updating entirely

Occurs reliably after ~10 minutes of stable operation

Process does NOT exit; must be killed manually
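To make the CLOSE-WAIT symptom easier to check from a script, here is a minimal sketch that counts CLOSE-WAIT sockets per remote endpoint. It assumes the five-column layout shown above (state, Recv-Q, Send-Q, local address, peer address). The function name `count_close_wait` and the ESTAB line in the sample are made up for illustration.

```python
from collections import Counter


def count_close_wait(socket_listing: str) -> Counter:
    """Count CLOSE-WAIT sockets per peer address in a five-column socket listing."""
    counts: Counter = Counter()
    for line in socket_listing.splitlines():
        fields = line.split()
        # Assumed columns: State Recv-Q Send-Q Local:Port Peer:Port
        if len(fields) >= 5 and fields[0] == "CLOSE-WAIT":
            counts[fields[4]] += 1
    return counts


# The two CLOSE-WAIT lines are from this report; the ESTAB line is invented.
sample = """\
CLOSE-WAIT 25 0 172.24.40.238:35638 101.91.34.226:443
CLOSE-WAIT 32 0 172.24.40.238:47022 101.91.19.174:443
ESTAB 0 0 172.24.40.238:50000 101.91.34.226:443
"""

stale = count_close_wait(sample)
total_stale = sum(stale.values())
print(total_stale, dict(stale))
```

A non-zero count that persists across several checks, together with a silent log, would match the freeze described here.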

Root Cause

In _open_ws() (gateway/platforms/qqbot/adapter.py:387-393):

async def _open_ws(self, gateway_url: str) -> None:
    if self._ws and not self._ws.closed:
        await self._ws.close()  # ← BLOCKS HERE
    self._ws = None
    if self._session and not self._session.closed:
        await self._session.close()  # ← ALSO BLOCKS
    self._session = None

When the QQ server sends a FIN (normal connection teardown), the TCP socket under the old WebSocket enters the CLOSE-WAIT state. At this point:

self._ws.closed returns False (close handshake incomplete)

await self._ws.close() attempts a graceful close

But _read_events() has already exited, so no one consumes the close frame from the read buffer

close() waits indefinitely for the close handshake to complete

_open_ws() never returns, so the reconnect path stalls and the gateway freezes
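The sequence above can be illustrated without a real QQ connection. The sketch below uses a hypothetical `StaleWebSocket` stub whose `close()` waits for an acknowledgement that never arrives. It is not aiohttp and not the adapter's code; it only shows that a coroutine awaiting such a close stays pending indefinitely.

```python
import asyncio


class StaleWebSocket:
    """Hypothetical stand-in for a connection whose close handshake never completes."""

    closed = False

    async def close(self) -> None:
        # Simulates waiting for a close acknowledgement that never arrives.
        await asyncio.Event().wait()


async def open_ws_unbounded(ws: StaleWebSocket) -> str:
    # Same shape as the cleanup in _open_ws(): no timeout around close().
    if ws and not ws.closed:
        await ws.close()
    return "reconnected"


async def demo() -> bool:
    task = asyncio.create_task(open_ws_unbounded(StaleWebSocket()))
    # Give the cleanup a short window; it should still be pending afterwards.
    _, pending = await asyncio.wait({task}, timeout=0.2)
    still_hung = task in pending
    task.cancel()  # clean up the stuck task so the demo can exit
    try:
        await task
    except asyncio.CancelledError:
        pass
    return still_hung


hung = asyncio.run(demo())
print(hung)
```

In the stub the hang is artificial, but the control flow matches the report: nothing after the awaited `close()` runs until something external cancels it.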

Reproduction

Start hermes gateway run with QQBot enabled

Wait for the QQ server to send a connection-level FIN (~10 minutes in our environment)

Observe that the process enters S state with CLOSE-WAIT connections

Suggested Fix

Option A: Add timeout to close operations

async def _open_ws(self, gateway_url: str) -> None:
    if self._ws:
        try:
            await asyncio.wait_for(self._ws.close(), timeout=5.0)
        except asyncio.TimeoutError:
            pass  # Force abandon
        self._ws = None
    if self._session:
        try:
            await asyncio.wait_for(self._session.close(), timeout=5.0)
        except asyncio.TimeoutError:
            pass
        self._session = None
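Option A repeats the same try/except for both resources, so it could be factored into one helper. This is a sketch only: `close_with_timeout`, the logger name, and the stub coroutines are hypothetical and do not exist in hermes-agent. Unlike Option A as written, it also logs the timeout and swallows other exceptions raised during close.

```python
import asyncio
import logging
from typing import Awaitable, Callable

log = logging.getLogger("qqbot.cleanup")  # hypothetical logger name


async def close_with_timeout(
    close: Callable[[], Awaitable[object]], name: str, timeout: float = 5.0
) -> bool:
    """Await close() for at most `timeout` seconds; return True if it finished cleanly."""
    try:
        await asyncio.wait_for(close(), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        log.warning("%s close timed out after %.1fs; abandoning", name, timeout)
    except Exception:
        log.exception("%s close failed; abandoning", name)
    return False


async def _never_finishes() -> None:
    # Stub for a close() stuck on a handshake that never completes.
    await asyncio.Event().wait()


async def _finishes_quickly() -> None:
    # Stub for a healthy close().
    await asyncio.sleep(0)


async def _demo() -> tuple:
    stuck = await close_with_timeout(_never_finishes, "ws", timeout=0.1)
    healthy = await close_with_timeout(_finishes_quickly, "session", timeout=0.1)
    return stuck, healthy


results = asyncio.run(_demo())
print(results)
```

Inside `_open_ws()` the calls would look like `await close_with_timeout(self._ws.close, "ws")` followed by `self._ws = None`. One caveat: `asyncio.wait_for` cancels the inner awaitable on timeout and waits for that cancellation to finish, so a `close()` that suppresses cancellation could still exceed the timeout.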

Option B: Force close without handshake for dead connections

if self._ws:
    # _closing is a private attribute, not a public API
    self._ws._closing = True
    self._ws = None

Related Issues

#17703 — QQBot stops reconnecting after failed reconnect leaves websocket closed

#14539 — QQ Bot adapter silently stops reconnecting without notifying gateway

#15490 — qqbot adapter silently dies on network outage during reconnect

PR #18172 — fix(qqbot): add gateway URL cache, retry, and rate-limit handling (addresses reconnect storms but not this cleanup hang)

Impact

In our production environment, this caused 50+ freeze-restart cycles within 10 hours. The issue was mitigated by an external watchdog that kills and restarts the gateway process.

This issue was created with the assistance of AI analysis. The root cause, diagnostic steps, and suggested fixes were identified through automated log analysis and code review.

After being troubled by the repeated restarts, I had my agent write an external monitoring script to mitigate the issue and help diagnose it.
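For anyone needing a similar stopgap, the following is a rough sketch of a log-staleness watchdog. It is not the script mentioned above. It relies on the symptom that logs stop updating during a freeze, and it assumes a supervisor (for example a service manager) restarts the gateway after it is killed. The function names and the 300-second threshold are placeholders.

```python
import os
import signal
import tempfile
import time
from typing import Optional


def log_is_stale(log_path: str, max_age_s: float, now: Optional[float] = None) -> bool:
    """Return True if the log file has not been modified for more than max_age_s seconds."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(log_path)
    except OSError:
        # No log yet (e.g. during startup): do not treat this as a freeze.
        return False
    return (now - mtime) > max_age_s


def kill_if_stale(pid: int, log_path: str, max_age_s: float) -> bool:
    """Send SIGKILL to pid when the log looks frozen; restarting is left to a supervisor."""
    if not log_is_stale(log_path, max_age_s):
        return False
    os.kill(pid, signal.SIGKILL)
    return True


# Self-check against a temporary file, using a fake clock instead of sleeping.
with tempfile.NamedTemporaryFile() as f:
    mtime = os.path.getmtime(f.name)
    fresh = log_is_stale(f.name, 300, now=mtime + 10)
    stale_flag = log_is_stale(f.name, 300, now=mtime + 600)

print(fresh, stale_flag)
```

Because the gateway normally logs regularly, a threshold well above its quietest expected interval should avoid killing a healthy but idle process; the right value depends on the deployment.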

Labels

P2 (Medium: degraded but workaround exists); comp/gateway (Gateway runner, session dispatch, delivery); platform/qqbot (QQ Bot adapter); type/bug (Something isn't working)
