QQBot WebSocket _open_ws() hangs indefinitely on stale CLOSE-WAIT connections, freezing entire gateway

Description

The QQBot adapter's _open_ws() method can permanently block when cleaning up a stale WebSocket connection in CLOSE-WAIT state. This freezes the entire gateway process — it remains alive but stops processing events, writing logs, or responding to health checks.

This is distinct from but related to #17703, #14539, and #15490.



Environment


hermes-agent version: v0.11.0 (commit 454d883e)

Python: 3.12

OS: Ubuntu 22.04 (WSL2)

aiohttp: 3.11.x



Symptoms


Process state: S (sleeping) (not dead, not spinning)

TCP connections stuck in CLOSE-WAIT:



      

CLOSE-WAIT 25  0  172.24.40.238:35638  101.91.34.226:443
CLOSE-WAIT 32  0  172.24.40.238:47022  101.91.19.174:443


      



Logs stop updating entirely

Occurs reliably after ~10 minutes of stable operation

Process does NOT exit; must be killed manually
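
To confirm the CLOSE-WAIT symptom from a script rather than by eye, something like the sketch below could count the stuck sockets. It assumes the column layout of the two lines shown above (state first, peer address in the fifth column). The helper name `count_close_wait` is hypothetical and not part of hermes-agent.

```python
from collections import Counter

def count_close_wait(socket_listing: str) -> Counter:
    """Count CLOSE-WAIT sockets per remote endpoint in socket-listing text."""
    counts = Counter()
    for line in socket_listing.splitlines():
        fields = line.split()
        # Assumed layout: state, recv-q, send-q, local address, peer address
        if len(fields) >= 5 and fields[0] == "CLOSE-WAIT":
            counts[fields[4]] += 1
    return counts

# The two lines captured in this report
sample = """\
CLOSE-WAIT 25  0  172.24.40.238:35638  101.91.34.226:443
CLOSE-WAIT 32  0  172.24.40.238:47022  101.91.19.174:443
"""
stale = count_close_wait(sample)
print(sum(stale.values()), "connection(s) in CLOSE-WAIT")
```

A non-zero count that persists across several polls, combined with a silent log, would match the freeze described here.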



Root Cause

In _open_ws() (gateway/platforms/qqbot/adapter.py:387-393):




      

async def _open_ws(self, gateway_url: str) -> None:
    if self._ws and not self._ws.closed:
        await self._ws.close()      # ← BLOCKS HERE
    self._ws = None
    if self._session and not self._session.closed:
        await self._session.close()  # ← ALSO BLOCKS
    self._session = None


      

When the QQ server sends a FIN (normal connection teardown), the old WebSocket enters CLOSE-WAIT state. At this point:

self._ws.closed returns False (close handshake incomplete)

await self._ws.close() attempts a graceful close

But _read_events() has already exited, so no one consumes the close frame from the read buffer

close() waits indefinitely for the close handshake to complete

The asyncio event loop is blocked; gateway freezes



Reproduction


Start hermes gateway run with QQBot enabled

Wait for QQ server to send a connection-level FIN (~10 minutes in our environment)

Observe process enters S state with CLOSE-WAIT connections



Suggested Fix

Option A: Add timeout to close operations


      

async def _open_ws(self, gateway_url: str) -> None:
    if self._ws:
        try:
            await asyncio.wait_for(self._ws.close(), timeout=5.0)
        except asyncio.TimeoutError:
            pass  # Force abandon
        self._ws = None
    if self._session:
        try:
            await asyncio.wait_for(self._session.close(), timeout=5.0)
        except asyncio.TimeoutError:
            pass
        self._session = None
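
The timeout behaviour in Option A can be checked in isolation with a stub, without a live QQ connection. This is a minimal sketch: `HangingWS` and `close_with_timeout` are hypothetical names invented for illustration, and the stub only imitates a close() that never returns; it does not model aiohttp internals.

```python
import asyncio

class HangingWS:
    """Hypothetical stand-in for a WebSocket whose close() never returns."""
    closed = False

    async def close(self) -> None:
        # The event is never set, so this await never completes on its own.
        await asyncio.Event().wait()

async def close_with_timeout(resource, timeout: float = 5.0) -> bool:
    """Return True if close() finished, False if it was abandoned on timeout."""
    try:
        await asyncio.wait_for(resource.close(), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        # wait_for has already cancelled the pending close() at this point.
        return False

async def main() -> bool:
    # Short timeout so the demonstration finishes quickly.
    return await close_with_timeout(HangingWS(), timeout=0.05)

result = asyncio.run(main())
print("closed cleanly:", result)
```

Because `asyncio.wait_for` cancels the inner coroutine when the timeout expires, the caller regains control and can go on to open a fresh connection.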


      

Option B: Force close without handshake for dead connections


      

if self._ws:
    self._ws._closing = True
    self._ws = None


      



Related Issues


#17703 — QQBot stops reconnecting after failed reconnect leaves websocket closed

#14539 — QQ Bot adapter silently stops reconnecting without notifying gateway

#15490 — qqbot adapter silently dies on network outage during reconnect

PR #18172 — fix(qqbot): add gateway URL cache, retry, and rate-limit handling (addresses reconnect storms but not this cleanup hang)

Impact

In our production environment, this caused 50+ freeze-restart cycles within 10 hours. The issue was mitigated by an external watchdog that kills and restarts the gateway process.

*This issue was created with the assistance of AI analysis. The root cause, diagnostic steps, and suggested fixes were identified through automated log analysis and code review.*

Troubled by the repeated restarts, I had my agent write an external monitoring script to mitigate and help diagnose this issue.
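
For anyone who needs a similar stopgap, one possible freeze detector is a log-staleness check, since the logs stop updating when the gateway hangs. This is only a sketch and not the script used above. The function name `log_is_stale`, the log path, and the threshold are all assumptions.

```python
import os
import time
from typing import Optional

def log_is_stale(log_path: str, max_age_s: float, now: Optional[float] = None) -> bool:
    """Return True if the log is missing or was not modified within max_age_s seconds."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(log_path)
    except OSError:
        # A missing or unreadable log is treated as stale.
        return True
    return (now - mtime) > max_age_s

# Hypothetical usage: the path and the 120-second threshold are placeholders.
if log_is_stale("/path/to/gateway.log", max_age_s=120.0):
    print("gateway log looks stale; a watchdog could restart the process here")
```

A threshold well below the ~10 minute freeze interval keeps downtime short, but it should stay above the longest quiet period the gateway normally has, to avoid false restarts.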

QQBot WebSocket _open_ws() hangs indefinitely on stale CLOSE-WAIT connections, freezing entire gateway #18221
