qqbot adapter silently dies on network outage during reconnect; gateway has no task watchdog #15490

## Summary

When the host's network briefly goes down, the **QQ bot platform adapter** silently dies during a reconnect attempt. The Gateway parent task neither detects the failure nor restarts the adapter, so QQ stays offline until the container is manually restarted. Telegram, running in the same container and subjected to the same network event, recovers automatically.

## Environment

- Hermes Agent **v0.11.0** (`nousresearch/hermes-agent:latest`, image sha256 `550ae16a17b3`)
- Docker on a NAS (China region), behind a clash HTTP/HTTPS proxy at `http://<local-clash-proxy>` set via `HTTP_PROXY` / `HTTPS_PROXY` env vars
- Platforms enabled: `telegram` + `qqbot` (both reach their endpoints through the same proxy)

## What happened (production observation)

1. Host network started degrading; clash proxy began dropping idle WebSocket connections.
2. QQ bot adapter lost its WS to `wss://api.sgroup.qq.com/websocket` **every ~60 s**. Each cycle the adapter logged `WebSocket error: WebSocket closed`, reconnected, sent Resume, and succeeded.
3. After ~5 such cycles, the host network dropped fully for a short window.
4. The 6th reconnect attempt triggered an exception at the **httpx/httpcore** layer (TCP/TLS handshake through proxy), which appears **not to be caught** by the qqbot adapter's reconnect coroutine.
5. The qqbot task quietly exited — **no traceback in `agent.log`, no `ERROR` entry, no further `qqbot` log lines for over an hour**.
6. Meanwhile Telegram experienced the same network event but its retry loop survived and reconnected automatically once network was back.
7. `hermes gateway status` continued to report `Gateway is running` (PID alive) and Telegram kept serving. QQ remained offline until a manual `docker restart`.
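
For context on how steps 4 and 5 can happen without any log output, here is a minimal, stdlib-only sketch. It is not the gateway's code: `flaky_adapter` and `gateway_parent` are hypothetical names, and the assumption is that the parent creates the adapter as an `asyncio` task and never awaits or inspects it. In that situation the exception stays stored on the task object and nothing is logged while the parent keeps a reference to it.

```python
import asyncio


async def flaky_adapter() -> None:
    """Hypothetical stand-in for an adapter loop hitting an uncaught error."""
    await asyncio.sleep(0)
    raise ConnectionError("proxy CONNECT failed")


async def gateway_parent() -> dict:
    # The parent holds a reference to the task but never awaits it.
    task = asyncio.create_task(flaky_adapter(), name="qqbot")

    # The parent carries on as if nothing happened; no traceback is printed.
    await asyncio.sleep(0.05)

    # The failure only becomes visible when someone inspects the task.
    return {"done": task.done(), "exception": repr(task.exception())}


if __name__ == "__main__":
    print(asyncio.run(gateway_parent()))
```

Whether this is what actually happens inside the gateway is unconfirmed; it only shows that a dead task plus a live process is a plausible combination.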

## Excerpted log (timestamps UTC)

```
2026-04-24 23:50:19 WARNING [QQBot:xxx] WebSocket error: WebSocket closed
2026-04-24 23:50:21 INFO    [QQBot:xxx] Reconnected
2026-04-24 23:50:21 INFO    [QQBot:xxx] Session resumed
2026-04-24 23:51:21 WARNING [QQBot:xxx] WebSocket error: WebSocket closed   # exact 60s cycle
2026-04-24 23:51:24 INFO    [QQBot:xxx] Reconnected
... [3 more cycles, all succeeding] ...
2026-04-24 23:54:31 INFO    [QQBot:xxx] Session resumed (seq=232)           # last qqbot log
                                                                             # ~1h 6m with no qqbot log lines at all
2026-04-25 01:00:30 WARNING [Telegram] network error, scheduling reconnect: httpx.ConnectError
... [Telegram retry loop runs to completion and recovers] ...
2026-04-25 01:57:29 INFO    [Telegram] Connected to Telegram (polling mode)
                                                                             # qqbot still silent — no reconnect attempted
```

## Expected behavior

Either or both of the following:

- The Gateway should wrap each platform adapter's main loop in a supervisor that restarts the adapter on an unhandled exception (or at minimum logs the traceback at `ERROR` level so that a silent death becomes visible).
- The qqbot adapter's reconnect coroutine should catch transport-layer exceptions (`httpx.ConnectError`, `httpcore.ConnectError`, `OSError`, TLS handshake failures, proxy `CONNECT` failures) the same way it currently handles `WebSocket closed`.
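
A rough sketch of the supervisor idea, assuming each adapter exposes its main loop as a zero-argument coroutine function. The names `supervise` and `run_adapter` and the backoff parameters are hypothetical and not taken from the Hermes codebase.

```python
import asyncio
import logging
from typing import Awaitable, Callable

log = logging.getLogger("gateway.supervisor")


async def supervise(
    name: str,
    run_adapter: Callable[[], Awaitable[None]],
    *,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
) -> None:
    """Run one adapter forever, restarting it whenever it returns or raises."""
    loop = asyncio.get_running_loop()
    delay = base_delay
    while True:
        started = loop.time()
        try:
            await run_adapter()
            log.error("%s adapter returned unexpectedly", name)
        except asyncio.CancelledError:
            raise  # let a normal gateway shutdown propagate
        except Exception:
            # Logs the full traceback at ERROR level, so the death is visible.
            log.exception("%s adapter crashed", name)

        if loop.time() - started > max_delay:
            delay = base_delay  # it ran for a while; start the backoff over
        log.warning("restarting %s adapter in %.1fs", name, delay)
        await asyncio.sleep(delay)
        delay = min(delay * 2, max_delay)
```

With something like this, the gateway would start `asyncio.create_task(supervise("qqbot", adapter.run))` instead of the bare adapter task, where `adapter.run` is again a placeholder for whatever the real entry point is.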

## Suspected fix locations

- `gateway/platforms/qqbot/adapter.py` — broaden the `except` around the reconnect / Resume path to include `httpx.ConnectError`, `httpcore.ConnectError`, `OSError`, `ssl.SSLError`, etc.
- `gateway/run.py` — add a per-platform task supervisor that restarts a dead adapter task, or at least emits a high-severity log + alert when a platform task exits unexpectedly.
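
For the adapter-side fix, a minimal sketch of a reconnect loop that treats transport failures as retryable. It uses only the standard library so that it runs anywhere; `reconnect_with_retry`, `connect` and `TRANSIENT_ERRORS` are hypothetical names, and in the real adapter the tuple would presumably also include `httpx.TransportError` (the base class of `httpx.ConnectError`) and `httpcore.ConnectError`.

```python
import asyncio
import logging
import ssl
from typing import Awaitable, Callable, Optional, Tuple, Type

log = logging.getLogger("qqbot.reconnect")

# Stdlib stand-ins for transport-layer failures. ssl.SSLError is already an
# OSError subclass; it is listed only to make the intent explicit.
TRANSIENT_ERRORS: Tuple[Type[BaseException], ...] = (
    OSError,
    ssl.SSLError,
    asyncio.TimeoutError,
)


async def reconnect_with_retry(
    connect: Callable[[], Awaitable[None]],
    *,
    transient: Tuple[Type[BaseException], ...] = TRANSIENT_ERRORS,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    max_attempts: Optional[int] = None,
) -> int:
    """Call connect() until it succeeds; return the number of attempts used."""
    attempt = 0
    delay = base_delay
    while True:
        attempt += 1
        try:
            await connect()
            return attempt
        except transient as exc:
            if max_attempts is not None and attempt >= max_attempts:
                log.error("giving up after %d reconnect attempts", attempt)
                raise
            log.warning(
                "reconnect attempt %d failed (%r); retrying in %.1fs",
                attempt, exc, delay,
            )
            await asyncio.sleep(delay)
            delay = min(delay * 2, max_delay)
```

Anything outside `transient` still propagates, which is why the supervisor in `gateway/run.py` remains useful as a second line of defense.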

The 60-second WebSocket cycle is likely a clash idle-connection timeout (client-side proxy issue, not your bug), but the silent death after that is the actual bug — a healthy adapter should not be killable by a transient network event.

Happy to provide more logs / the full silent-death window if useful.

