Supervisor never restarts an alive-but-wedged worker: dead worker DB pool + no-op no_recent_completions warn = silent 15h processing halt #1801

**Version:** gbrain 0.42.10.0 (Postgres engine, transaction-mode pooler port 6543, `prepare:false`; minion `jobs supervisor` + 1 child `jobs work`, concurrency 3, `--max-rss 16384`)
**Severity:** High — brain processing silently halted for ~15h; no crash, no alert, no self-heal.

## TL;DR

A child worker's DB pool died (`write CONNECTION_ENDED` on every background-interval tick) and **never recovered**, but the worker process **stayed alive the entire time**. Because the supervisor's restart logic only fires on child *exit*, an alive-but-wedged worker is invisible to it. The supervisor's own health check (separate connection) kept succeeding, so it emitted `health_warn reason=no_recent_completions` **once a minute for 914 consecutive minutes**, and took **zero corrective action**. Jobs piled up to 57 waiting / 0 active, and every long-running `autopilot-cycle` was dead-lettered with `max stalled count exceeded`. Manual fix: kill the supervisor tree so the respawn path rebuilds a fresh worker.

This is **distinct from** the existing connection bugs:
- #1720 / #1745 = crash-**loop**: worker exits `code=1` and respawns repeatedly (in-process reconnect missing).
- #1678 = RSS watchdog SIGTERM-kills the worker mid-cycle.
- **This one = no crash at all.** The worker never exits, so no respawn path is ever entered. The dead pool just sits there.

## Observed (sanitized)

`/tmp/gbrain-worker.log`, the same lines every ~minute for 15h:

```
[supervisor 05:01:11] health_warn reason=no_recent_completions waiting_count=57 minutes_since_completion=913 queue=default
Stall detection error: write CONNECTION_ENDED aws-1-...pooler.supabase.com:6543
Timeout detection error: write CONNECTION_ENDED aws-1-...pooler.supabase.com:6543
Wall-clock timeout detection error: write CONNECTION_ENDED aws-1-...pooler.supabase.com:6543
```

`jobs stats` during the wedge reported `Queue health: 57 waiting, 0 active, 0 stalled`, and every entry in the dead-letter pile read `Error: max stalled count exceeded`. `minutes_since_completion` climbed monotonically 0 → 914. Both supervisor and worker processes were alive the whole time (they passed `ps`/`kill -0`), which is exactly why every process-count or liveness-based watchdog (ours included) failed to act.
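To make the liveness-versus-progress distinction concrete, here is a minimal sketch. `isAlive` uses Node's `process.kill(pid, 0)`, which is the programmatic equivalent of `kill -0`. `QueueSnapshot` and `isMakingProgress` are hypothetical names for illustration and are not part of gbrain; the fields simply mirror the log line above.

```typescript
import process from "node:process";

// Liveness probe equivalent to `kill -0 <pid>`: signal 0 delivers nothing,
// it only checks whether the process exists.
function isAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but we lack permission to signal it.
    return (err as { code?: string }).code === "EPERM";
  }
}

// Hypothetical snapshot shape; fields mirror the health_warn log line.
interface QueueSnapshot {
  waiting: number;
  active: number;
  minutesSinceCompletion: number | null;
}

// Progress probe: asks whether work is moving, not whether a PID exists.
function isMakingProgress(s: QueueSnapshot, maxIdleMinutes: number): boolean {
  if (s.waiting === 0) return true; // nothing queued: idle, not stalled
  if (s.active > 0) return true; // a job is currently running
  // null = no completion recorded yet; like the supervisor check, do not flag it
  if (s.minutesSinceCompletion === null) return true;
  return s.minutesSinceCompletion <= maxIdleMinutes;
}

// The incident's numbers: the process is alive, yet nothing is progressing.
const wedged: QueueSnapshot = { waiting: 57, active: 0, minutesSinceCompletion: 913 };
const alive = isAlive(process.pid);
const progressing = isMakingProgress(wedged, 30);
```

With the values from the log, the liveness probe passes while the progress probe fails, which is the combination no PID-based watchdog can see.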

## Root cause

Two compounding gaps:

**1. The worker's background interval loops (`promoteDelayed`, stall-detection, timeout-detection, wall-clock detection) spam the dead pool forever and never reconnect or escalate.** Same module-level singleton-pool failure mode as #1720, but here it manifests *without* a process exit — the detection loops just catch-and-log `CONNECTION_ENDED` indefinitely. A worker whose pool is permanently dead but whose event loop is still spinning is, functionally, a zombie that no liveness check catches.
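For contrast with catch-and-log-forever, the following is a rough sketch of a bounded failure policy for such a tick loop. `ConnectionFailureBudget`, `TickOutcome`, and `isConnectionError` are hypothetical names, not gbrain code, and matching on the error message is an assumption based only on the log lines above.

```typescript
// Hypothetical policy object: decides what a background tick should do
// after each DB call, instead of logging the same error indefinitely.
type TickOutcome = "ok" | "retry" | "reconnect" | "exit";

class ConnectionFailureBudget {
  private consecutiveFailures = 0;
  private reconnectAttempts = 0;
  private readonly failuresBeforeReconnect: number;
  private readonly maxReconnects: number;

  constructor(failuresBeforeReconnect: number, maxReconnects: number) {
    this.failuresBeforeReconnect = failuresBeforeReconnect;
    this.maxReconnects = maxReconnects;
  }

  // A successful query proves the pool works again: clear all counters.
  recordSuccess(): TickOutcome {
    this.consecutiveFailures = 0;
    this.reconnectAttempts = 0;
    return "ok";
  }

  // A connection failure either waits, asks for a reconnect, or gives up.
  recordFailure(): TickOutcome {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures < this.failuresBeforeReconnect) return "retry";
    this.consecutiveFailures = 0;
    if (this.reconnectAttempts >= this.maxReconnects) return "exit";
    this.reconnectAttempts += 1;
    return "reconnect";
  }
}

// Assumption: connection loss is recognizable from the error message,
// as in the `write CONNECTION_ENDED ...` lines in the log.
function isConnectionError(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err);
  return /CONNECTION_(ENDED|CLOSED)/.test(message);
}

// 3 failed ticks trigger a reconnect; after 2 failed reconnects, exit.
const budget = new ConnectionFailureBudget(3, 2);
const outcomes: TickOutcome[] = [];
for (let tick = 0; tick < 9; tick++) outcomes.push(budget.recordFailure());
```

On `"exit"` the worker would terminate non-zero so that the supervisor's existing exit-driven respawn takes over; what `"reconnect"` does in practice depends on the pool work tracked in #1720.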

**2. The supervisor detects the symptom and does nothing with it.** `src/core/minions/supervisor.ts:582-589`:

```ts
if (waitingCount > 0 && minutesSinceCompletion !== null && minutesSinceCompletion > 30) {
  this.emit('health_warn', {
    reason: 'no_recent_completions',
    waiting_count: waitingCount,
    minutes_since_completion: minutesSinceCompletion,
    queue: this.opts.queue,
  });
}
```

`no_recent_completions` is emit-and-forget. There is **no escalation path** that converts "waiting>0 AND active==0 AND no completion in N minutes AND child claims to be alive" into a forced child restart. The supervisor's own `db_connection_degraded` → `reconnect()` path (`supervisor.ts:~605`) only triggers when the *supervisor's* own health query fails ≥3×. Because the supervisor queries over a different connection than the worker pool, its health query keeps succeeding: the supervisor thinks the DB is fine while the worker it's supposed to babysit is wedged against the same pooler.

Net: the one signal that actually caught the incident (`no_recent_completions`, 914×) is wired to a log line instead of a recovery action.

## Proposed fix

1. **Escalate `no_recent_completions` into a forced child restart (the real fix).** When `waiting > 0 && active == 0 && minutesSinceCompletion > threshold` persists across K consecutive health checks while the child is nominally alive, treat the child as wedged: `killChild('SIGTERM')` (then `SIGKILL` after grace) so the spawn-and-respawn loop rebuilds a fresh pool. This is the "progress watchdog" the supervisor is missing — it currently watches *liveness*, not *forward progress*. Make threshold/K configurable; default conservative (e.g. >15 min stall + zero active jobs ⇒ restart).

2. **Give the worker's background interval loops a reconnect-or-die policy.** On repeated `CONNECTION_ENDED`/`CONNECTION_CLOSED` from the `promoteDelayed`/stall/timeout/wall-clock detectors, attempt in-process pool reconnect (shared with the #1720/#1669 work); if reconnect fails N times, exit non-zero so the supervisor's existing crash-respawn path takes over instead of spinning forever on a dead pool. A worker that cannot reach the DB should crash loudly, not idle quietly.

3. **Make the stall obvious in `jobs stats`/`doctor`.** `0 active + >0 waiting + last_completed age > 15m` is an unambiguous wedged-queue signature — surface it as a health *error* (not just a buried log warn) so operators and the daily doctor catch it in minutes, not 15 hours.
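The wedged-queue signature from items 1 and 3 can be sketched as a small state machine. This is an illustration under stated assumptions, not the supervisor's actual API: `HealthSample`, `WatchdogOptions`, and `ProgressWatchdog` are hypothetical names, and the defaults shown are only the example values from item 1.

```typescript
// Hypothetical inputs: one sample per supervisor health check.
interface HealthSample {
  waiting: number;
  active: number;
  minutesSinceCompletion: number | null;
  childAlive: boolean;
}

interface WatchdogOptions {
  stallMinutes: number; // "threshold" in item 1
  consecutiveChecks: number; // "K" in item 1
}

class ProgressWatchdog {
  private strikes = 0;
  private readonly opts: WatchdogOptions;

  constructor(opts: WatchdogOptions) {
    this.opts = opts;
  }

  // The signature: work is queued, nothing is running, nothing has
  // completed recently, and the child still reports as alive.
  isWedged(s: HealthSample): boolean {
    return (
      s.childAlive &&
      s.waiting > 0 &&
      s.active === 0 &&
      s.minutesSinceCompletion !== null &&
      s.minutesSinceCompletion > this.opts.stallMinutes
    );
  }

  // Returns true when the caller should force a child restart.
  observe(s: HealthSample): boolean {
    if (!this.isWedged(s)) {
      this.strikes = 0; // any healthy check clears the streak
      return false;
    }
    this.strikes += 1;
    if (this.strikes < this.opts.consecutiveChecks) return false;
    this.strikes = 0; // give the fresh child a full window
    return true;
  }
}

const watchdog = new ProgressWatchdog({ stallMinutes: 15, consecutiveChecks: 3 });
const sample: HealthSample = { waiting: 57, active: 0, minutesSinceCompletion: 913, childAlive: true };
const decisions = [watchdog.observe(sample), watchdog.observe(sample), watchdog.observe(sample)];
```

Where `observe` returns `true`, the supervisor would issue the `killChild('SIGTERM')` described in item 1, and `isWedged` alone could back the health error in item 3. One design caveat: the completion age stays high until the new child finishes its first job, so a real implementation would probably want a cooldown after each forced restart to avoid a restart loop.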

## Impact

Silent 15h processing halt with no crash and no alert. Any deployment behind a Supabase pooler that periodically drops sockets can hit this; the alive-but-wedged worker defeats every process-liveness watchdog (gbrain's supervisor, our external `minion-watchdog.sh`, container health). Proposed fix 1 alone closes the incident; fixes 2 and 3 are defense-in-depth.

---
_Filed from production incident 2026-06-03. Manual remediation: killed supervisor PID tree → fresh supervisor+worker respawned (clean pool) → `jobs retry` on the 40 dead-lettered jobs → queue drained normally._

