Discord: health monitor restart loop — post-connection zombie sessions evade circuit breakers

## Problem

Discord connections successfully complete the handshake (HELLO → READY/RESUMED), run for a period, then become zombie/unstable. The health monitor detects this ("stuck" / "stale-socket") and restarts the provider, but the cycle repeats indefinitely — creating an effective infinite restart loop where **messages sent during unstable periods are silently dropped**.

This is distinct from #13688 (HELLO never received / infinite resume loop), which appears fixed in v2026.3.2.

## Environment

- **OpenClaw:** v2026.3.2 (stable)
- **OS:** Ubuntu 24.04
- **Discord:** Single private server, one user
- **Other channels:** Telegram and WhatsApp remain operational throughout

## Production evidence (Mar 7, 2026)

Health monitor restarted Discord provider **every 5–15 minutes** for 4+ hours:

```
02:15 [health-monitor] [discord:default] restarting (reason: stale-socket)
02:30 [health-monitor] [discord:default] restarting (reason: stuck)
02:45 [health-monitor] [discord:default] restarting (reason: stuck)
03:00 [health-monitor] [discord:default] restarting (reason: stuck)
03:10 [health-monitor] [discord:default] restarting (reason: stuck)
03:20 [health-monitor] [discord:default] restarting (reason: stuck)
03:25 [gateway/channels] restarting discord channel
03:44 [gateway/channels] restarting discord channel (×4 in 8 seconds)
04:50 [health-monitor] [discord:default] restarting (reason: stale-socket)
05:00 [health-monitor] [discord:default] restarting (reason: stuck)
```

Key observations:
- **Zero `connection stalled: no HELLO` messages** — the HELLO stall circuit breaker never fires because HELLO IS received
- WebSocket code 1005 disconnects appear between restarts: `gateway: Attempting resume with backoff: 1000ms` / `WebSocket connection closed with code 1005`
- After each health monitor restart, Discord reconnects successfully (`channels resolved`, `logged in to discord`) but eventually goes unstable again

## Root cause analysis

The health monitor restart creates a **new provider lifecycle**, which resets two protective mechanisms:

### 1. `consecutiveHelloStalls` counter (local variable)

```javascript
// Inside runDiscordGatewayLifecycle — recreated on every restart
let consecutiveHelloStalls = 0;
```

Each health monitor restart creates a new lifecycle with a fresh counter at 0. Even if HELLO stalls did occur, the counter can never accumulate across restarts.

### 2. `reconnectStallWatchdog` (disarmed on socket open)

The watchdog is armed on disconnect/hello-timeout but **disarmed on every `WebSocket connection opened` event**. In a rapid open→close→open cycle, the watchdog timeout (5 min) is never reached because each new `open` resets it.

### Result

Neither circuit breaker can protect against the pattern: "connection succeeds → runs briefly → dies → health monitor restarts → repeat." The health monitor's `DEFAULT_MAX_RESTARTS_PER_HOUR = 10` limits restart frequency but doesn't break the cycle.

## Impact

Messages sent to the bot during "stuck" periods are silently lost — no error, no retry, no notification to the user. The bot appears online (Discord presence) but doesn't respond.

## Suggested fixes

1. **Persist restart context across provider lifecycles** — track "number of health monitor restarts in last N minutes" at the monitor level (not inside the lifecycle). After a threshold (e.g., 5 restarts in 30 min), take more drastic action:
   - Force a fresh IDENTIFY (clear session state before restart)
   - Apply a longer cooldown before reconnecting
   - Emit a warning event for monitoring/alerting

2. **Don't disarm `reconnectStallWatchdog` on socket `open`** — disarm it on `READY`/`RESUMED` instead, so rapid open→close cycles are caught.

3. **Track Discord-specific health separately from the generic channel monitor** — the WebSocket code 1005/1006 pattern and zombie heartbeat detection could inform a Discord-specific circuit breaker.

## Relationship to other issues

- **#13688** — HELLO stall infinite resume loop: appears fixed in v2026.3.2 (circuit breaker + Carbon `reconnectAttempts` fix)
- **#14758** — Health monitor addition: working as designed, but masks rather than fixes this issue
- **PR #15762** — Circuit breaker PR (closed): the fix it proposed is now shipped, but addresses the HELLO stall case, not this post-connection instability

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Discord: health monitor restart loop — post-connection zombie sessions evade circuit breakers #38596

Problem

Environment

Production evidence (Mar 7, 2026)

Root cause analysis

1. `consecutiveHelloStalls` counter (local variable)

2. `reconnectStallWatchdog` (disarmed on socket open)

Result

Impact

Suggested fixes

Relationship to other issues

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Discord: health monitor restart loop — post-connection zombie sessions evade circuit breakers #38596

Description

Problem

Environment

Production evidence (Mar 7, 2026)

Root cause analysis

1. consecutiveHelloStalls counter (local variable)

2. reconnectStallWatchdog (disarmed on socket open)

Result

Impact

Suggested fixes

Relationship to other issues

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

1. `consecutiveHelloStalls` counter (local variable)

2. `reconnectStallWatchdog` (disarmed on socket open)