Description
The QQBot adapter's _open_ws() method can permanently block when cleaning up a stale WebSocket connection in CLOSE-WAIT state. This freezes the entire gateway process — it remains alive but stops processing events, writing logs, or responding to health checks.
This is distinct from but related to #17703, #14539, and #15490.
Environment
hermes-agent version: v0.11.0 (commit 454d883)
Python: 3.12
OS: Ubuntu 22.04 (WSL2)
aiohttp: 3.11.x
Symptoms
Process state: S (sleeping) (not dead, not spinning)
TCP connections stuck in CLOSE-WAIT:
CLOSE-WAIT 25 0 172.24.40.238:35638 101.91.34.226:443
CLOSE-WAIT 32 0 172.24.40.238:47022 101.91.19.174:443
Logs stop updating entirely
Occurs reliably after ~10 minutes of stable operation
Process does NOT exit; must be killed manually
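To check for the CLOSE-WAIT symptom without reading socket listings by eye, the output can be filtered by state. This is a minimal sketch that assumes the five-column layout shown above (state, Recv-Q, Send-Q, local address, peer address); `close_wait_peers` is a hypothetical helper name, not part of hermes-agent.

```python
def close_wait_peers(ss_output: str) -> list[str]:
    """Return the peer address of every CLOSE-WAIT line in ss-style output."""
    peers = []
    for line in ss_output.splitlines():
        fields = line.split()
        # Assumed columns: State Recv-Q Send-Q Local:Port Peer:Port
        if len(fields) >= 5 and fields[0] == "CLOSE-WAIT":
            peers.append(fields[4])
    return peers


# The first two lines are copied from this report; the ESTAB line is a
# made-up entry added only to show that other states are ignored.
sample = """\
CLOSE-WAIT 25 0 172.24.40.238:35638 101.91.34.226:443
CLOSE-WAIT 32 0 172.24.40.238:47022 101.91.19.174:443
ESTAB 0 0 172.24.40.238:51000 101.91.34.226:443
"""
print(close_wait_peers(sample))
```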
Root Cause
In _open_ws() (gateway/platforms/qqbot/adapter.py:387-393):
async def _open_ws(self, gateway_url: str) -> None:
    if self._ws and not self._ws.closed:
        await self._ws.close()  # ← BLOCKS HERE
    self._ws = None
    if self._session and not self._session.closed:
        await self._session.close()  # ← ALSO BLOCKS
    self._session = None
When the QQ server sends a FIN (normal connection teardown), the old WebSocket enters CLOSE-WAIT state. At this point:
self._ws.closed returns False (close handshake incomplete)
await self._ws.close() attempts a graceful close
But _read_events() has already exited, so no one consumes the close frame from the read buffer
close() waits indefinitely for the close handshake to complete
The asyncio event loop is blocked; gateway freezes
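The sequence above can be illustrated with a toy model. This is only a sketch of the suspected mechanism, not aiohttp's implementation: `ToySocket` and `close_hangs` are hypothetical names, and the "handshake" is reduced to an event that only a running reader would set.

```python
import asyncio


class ToySocket:
    """Toy stand-in for a WebSocket; NOT aiohttp's implementation."""

    def __init__(self) -> None:
        self.closed = False
        # Set only when a reader consumes the peer's close frame.
        self._peer_close_seen = asyncio.Event()

    def reader_consumes_close_frame(self) -> None:
        self._peer_close_seen.set()

    async def close(self) -> None:
        # Graceful close: wait for the handshake to finish, with no deadline.
        await self._peer_close_seen.wait()
        self.closed = True


async def close_hangs(reader_running: bool) -> bool:
    """Return True if close() is still pending after 0.1 seconds."""
    ws = ToySocket()
    if reader_running:
        ws.reader_consumes_close_frame()
    try:
        await asyncio.wait_for(ws.close(), timeout=0.1)
    except asyncio.TimeoutError:
        return True
    return False


print(asyncio.run(close_hangs(reader_running=True)))   # reader still alive
print(asyncio.run(close_hangs(reader_running=False)))  # reader already exited
```

With the reader gone, nothing ever completes the wait inside close(), which matches the behavior described in this report.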
Reproduction
Start hermes gateway run with QQBot enabled
Wait for QQ server to send a connection-level FIN (~10 minutes in our environment)
Observe process enters S state with CLOSE-WAIT connections
Suggested Fix
Option A: Add timeout to close operations
async def _open_ws(self, gateway_url: str) -> None:
    if self._ws:
        try:
            await asyncio.wait_for(self._ws.close(), timeout=5.0)
        except asyncio.TimeoutError:
            pass  # Force abandon
    self._ws = None
    if self._session:
        try:
            await asyncio.wait_for(self._session.close(), timeout=5.0)
        except asyncio.TimeoutError:
            pass
    self._session = None
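Since Option A repeats the same pattern for the WebSocket and the session, it could be factored into one helper. The following is a sketch under assumptions: `close_with_timeout` is a hypothetical name, and `NeverCloses` and `ClosesAtOnce` are stand-ins used only to exercise it, not aiohttp objects.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def close_with_timeout(resource, timeout: float = 5.0) -> bool:
    """Await resource.close(), giving up after `timeout` seconds.

    Returns True if close() finished, False if it timed out or raised.
    """
    try:
        await asyncio.wait_for(resource.close(), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        logger.warning("close() did not finish within %.1fs; abandoning", timeout)
    except Exception:
        logger.exception("close() raised; abandoning")
    return False


class NeverCloses:
    """Stand-in whose close() never completes, like the hang described above."""

    async def close(self) -> None:
        await asyncio.Event().wait()


class ClosesAtOnce:
    """Stand-in whose close() returns immediately."""

    async def close(self) -> None:
        return None


async def demo() -> tuple[bool, bool]:
    stuck = await close_with_timeout(NeverCloses(), timeout=0.1)
    clean = await close_with_timeout(ClosesAtOnce(), timeout=0.1)
    return stuck, clean


print(asyncio.run(demo()))
```

Note that asyncio.wait_for can only fire its timeout while the event loop is still running. It bounds an await that never completes, but it cannot interrupt a call that blocks the loop synchronously.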
Option B: Force close without handshake for dead connections
if self._ws:
    self._ws._closing = True
    self._ws = None
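Option B writes to a private attribute, which may change between aiohttp releases and may not release the underlying socket by itself. At the asyncio level, the public transport.abort() method closes a connection immediately without waiting for pending operations. The sketch below shows that call on plain asyncio streams over a local socket pair, which stands in for the TCP connection; how to reach the transport from aiohttp's objects is not shown here and would need to be checked against aiohttp itself. `abort_half_closed` is a hypothetical name.

```python
import asyncio
import socket


async def abort_half_closed() -> bool:
    """Abort a connection whose peer has already closed; return is_closing()."""
    local, peer = socket.socketpair()
    reader, writer = await asyncio.open_connection(sock=local)

    peer.close()  # the peer goes away first, as the remote server does here

    # abort() drops the connection immediately: no handshake, no flush.
    writer.transport.abort()
    await asyncio.sleep(0)  # let the loop run the connection_lost callback
    return writer.transport.is_closing()


print(asyncio.run(abort_half_closed()))
```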
Related Issues
#17703 — QQBot stops reconnecting after failed reconnect leaves websocket closed
#14539 — QQ Bot adapter silently stops reconnecting without notifying gateway
#15490 — qqbot adapter silently dies on network outage during reconnect
PR #18172 — fix(qqbot): add gateway URL cache, retry, and rate-limit handling (addresses reconnect storms but not this cleanup hang)
Impact
In our production environment, this caused 50+ freeze-restart cycles within 10 hours. The issue was mitigated by an external watchdog that kills and restarts the gateway process.
This issue was created with the assistance of AI analysis. The root cause, diagnostic steps, and suggested fixes were identified through automated log analysis and code review.
Because of the repeated freezes and restarts, I had my agent write an external monitoring script to mitigate and help diagnose this issue.
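The monitoring script itself is not included in this report. As a rough illustration of the idea, a watchdog can treat a log file that has stopped updating as a sign of the freeze. The sketch below covers only that staleness check; the function name `log_is_stale`, the 120-second threshold, and the temporary file are all hypothetical, and the kill-and-restart step is left out.

```python
import os
import tempfile
import time
from typing import Optional


def log_is_stale(log_path: str, max_silence_s: float,
                 now: Optional[float] = None) -> bool:
    """Return True if the file was last modified more than max_silence_s ago."""
    current = time.time() if now is None else now
    return current - os.path.getmtime(log_path) > max_silence_s


# Demonstration on a temporary file standing in for the gateway log.
with tempfile.NamedTemporaryFile() as log:
    fresh = log_is_stale(log.name, max_silence_s=120.0)
    # Pretend 10 minutes have passed without a write.
    stale = log_is_stale(log.name, max_silence_s=120.0, now=time.time() + 600)

print(fresh, stale)
```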