Supervisor never restarts an alive-but-wedged worker: dead worker DB pool + no-op no_recent_completions warn = silent 15h processing halt #1801

@garrytan-agents

Version: gbrain 0.42.10.0 (Postgres engine, transaction-mode pooler port 6543, prepare:false; minion jobs supervisor + 1 child jobs work, concurrency 3, --max-rss 16384)
Severity: High — brain processing silently halted for ~15h; no crash, no alert, no self-heal.

TL;DR

A child worker's DB pool died (write CONNECTION_ENDED on every background-interval tick) and never recovered, but the worker process stayed alive the entire time. Because the supervisor's restart logic only fires on child exit, an alive-but-wedged worker is invisible to it. The supervisor's own health check (separate connection) kept succeeding, so it emitted health_warn reason=no_recent_completions once a minute for 914 consecutive minutes — and took zero corrective action. Jobs piled to 57 waiting / 0 active, and every long-running autopilot-cycle dead-lettered with max stalled count exceeded. Manual fix: kill the supervisor tree so the respawn path rebuilds a fresh worker.

This is distinct from the existing connection bugs (#1720, #1669): here the worker never exits, so the crash-respawn path is never triggered.

Observed (sanitized)

/tmp/gbrain-worker.log, the same lines every ~minute for 15h:

[supervisor 05:01:11] health_warn reason=no_recent_completions waiting_count=57 minutes_since_completion=913 queue=default
Stall detection error: write CONNECTION_ENDED aws-1-...pooler.supabase.com:6543
Timeout detection error: write CONNECTION_ENDED aws-1-...pooler.supabase.com:6543
Wall-clock timeout detection error: write CONNECTION_ENDED aws-1-...pooler.supabase.com:6543

jobs stats during the wedge: Queue health: 57 waiting, 0 active, 0 stalled — and the dead-letter pile all reading Error: max stalled count exceeded. minutes_since_completion climbed monotonically 0 → 914. Both supervisor and worker processes were alive the whole time (passed ps/kill -0), which is exactly why every count/liveness-based watchdog (ours included) failed to act.

Root cause

Two compounding gaps:

1. The worker's background interval loops (promoteDelayed, stall-detection, timeout-detection, wall-clock detection) spam the dead pool forever and never reconnect or escalate. Same module-level singleton-pool failure mode as #1720, but here it manifests without a process exit — the detection loops just catch-and-log CONNECTION_ENDED indefinitely. A worker whose pool is permanently dead but whose event loop is still spinning is, functionally, a zombie that no liveness check catches.

2. The supervisor detects the symptom and does nothing with it. src/core/minions/supervisor.ts:582-589:

if (waitingCount > 0 && minutesSinceCompletion !== null && minutesSinceCompletion > 30) {
  this.emit('health_warn', {
    reason: 'no_recent_completions',
    waiting_count: waitingCount,
    minutes_since_completion: minutesSinceCompletion,
    queue: this.opts.queue,
  });
}

no_recent_completions is emit-and-forget. There is no escalation path that converts "waiting>0 AND active==0 AND no completion in N minutes AND child claims to be alive" into a forced child restart. The supervisor's own db_connection_degraded → reconnect() path (supervisor.ts:~605) only triggers when the supervisor's own health query fails ≥3×. But the supervisor uses a different connection than the worker pool, so its health query keeps succeeding: the supervisor thinks the DB is fine while the worker it's supposed to babysit is wedged against the same pooler.

Net: the one signal that actually caught the incident (no_recent_completions, 914×) is wired to a log line instead of a recovery action.
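To make the quoted condition concrete, here is a minimal sketch of it as a pure predicate. The QueueSnapshot shape, the isWedged name, and the field names are hypothetical and chosen for illustration; only the numbers in the example come from the incident log above, and the 30-minute default mirrors the existing health_warn check.

```typescript
// Hypothetical snapshot shape; field names are illustrative, not gbrain's.
interface QueueSnapshot {
  waitingCount: number;
  activeCount: number;
  minutesSinceCompletion: number | null; // null = no completion recorded yet
  childAlive: boolean; // result of a process-liveness probe such as kill -0
}

// True when the child looks alive but the queue is making no forward progress.
function isWedged(s: QueueSnapshot, stallMinutes: number = 30): boolean {
  return (
    s.childAlive &&
    s.waitingCount > 0 &&
    s.activeCount === 0 &&
    s.minutesSinceCompletion !== null &&
    s.minutesSinceCompletion > stallMinutes
  );
}

// The incident's numbers: alive child, 57 waiting, 0 active, 913 minutes stalled.
const incident: QueueSnapshot = {
  waitingCount: 57,
  activeCount: 0,
  minutesSinceCompletion: 913,
  childAlive: true,
};

console.log(isWedged(incident)); // true
```

A liveness-only check corresponds to looking at childAlive alone, which is why it reported a healthy worker throughout.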

Proposed fix

  1. Escalate no_recent_completions into a forced child restart (the real fix). When waiting > 0 && active == 0 && minutesSinceCompletion > threshold persists across K consecutive health checks while the child is nominally alive, treat the child as wedged: killChild('SIGTERM') (then SIGKILL after grace) so the spawn-and-respawn loop rebuilds a fresh pool. This is the "progress watchdog" the supervisor is missing — it currently watches liveness, not forward progress. Make threshold/K configurable; default conservative (e.g. >15 min stall + zero active jobs ⇒ restart).

  2. Give the worker's background interval loops a reconnect-or-die policy. On repeated CONNECTION_ENDED/CONNECTION_CLOSED from the promoteDelayed/stall/timeout/wall-clock detectors, attempt in-process pool reconnect (shared with the work in #1720, "autopilot worker crash-loops on pooler CONNECTION_CLOSED: background queue loops have no in-process reconnect", and #1669, "fix(db-lock): self-heal cycle-lock refresh on pooler-reaped CONNECTION_ENDED"); if reconnect fails N times, exit non-zero so the supervisor's existing crash-respawn path takes over instead of spinning forever on a dead pool. A worker that cannot reach the DB should crash loudly, not idle quietly.

  3. Make the stall obvious in jobs stats/doctor. 0 active + >0 waiting + last_completed age > 15m is an unambiguous wedged-queue signature — surface it as a health error (not just a buried log warn) so operators and the daily doctor catch it in minutes, not 15 hours.
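The progress watchdog from fix 1 could be sketched roughly as below. This is an illustration under assumptions, not the supervisor's actual code: ProgressWatchdog, observe, and WatchdogAction are hypothetical names, and the caller is assumed to pass in whether the current health check matched the wedged condition (waiting > 0, active == 0, completion age over threshold, child alive).

```typescript
type WatchdogAction = "none" | "restart";

// Hypothetical helper: counts consecutive health checks that look wedged
// and asks for a restart once the streak reaches K.
class ProgressWatchdog {
  private readonly k: number;
  private streak = 0;

  constructor(k: number) {
    if (!Number.isInteger(k) || k < 1) {
      throw new Error("k must be a positive integer");
    }
    this.k = k;
  }

  // Call once per health check with the result of the wedged test.
  observe(wedged: boolean): WatchdogAction {
    if (!wedged) {
      this.streak = 0; // any sign of progress clears the streak
      return "none";
    }
    this.streak += 1;
    if (this.streak >= this.k) {
      this.streak = 0; // give the respawned child a full window
      return "restart";
    }
    return "none";
  }
}

// K = 3: one healthy check in the middle resets the count.
const watchdog = new ProgressWatchdog(3);
const actions = [true, true, false, true, true, true].map((w) =>
  watchdog.observe(w),
);
console.log(actions.join(",")); // none,none,none,none,none,restart
```

On "restart", the supervisor would call killChild('SIGTERM') and escalate to SIGKILL after the grace period, as described in fix 1. Requiring K consecutive matches keeps a single slow health check from killing a healthy worker.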

Impact

Silent 15h processing halt with no crash and no alert. Any deployment behind a Supabase pooler that periodically drops sockets can hit this; the alive-but-wedged worker defeats every process-liveness watchdog (gbrain's supervisor, our external minion-watchdog.sh, container health). Fix #1 alone closes the incident; #2 and #3 are defense-in-depth.
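For the defense-in-depth fix 2, a reconnect-or-die policy for the background interval loops might look like the sketch below. Everything here is hypothetical (ReconnectOrDie, LoopDecision, isConnectionError, both thresholds); the only grounded detail is that the logged errors carry CONNECTION_ENDED or CONNECTION_CLOSED in their message. The sketch treats a reconnect as failed when errors keep arriving afterwards with no successful tick in between.

```typescript
type LoopDecision = "continue" | "reconnect" | "exit";

// Matches the error text seen in the worker log, e.g. "write CONNECTION_ENDED ...".
function isConnectionError(err: unknown): boolean {
  return err instanceof Error && /CONNECTION_(ENDED|CLOSED)/.test(err.message);
}

// Hypothetical policy: after `failuresBeforeReconnect` consecutive connection
// errors ask for a reconnect; after `maxReconnects` reconnects with no
// successful tick in between, ask the worker to exit non-zero.
class ReconnectOrDie {
  private readonly failuresBeforeReconnect: number;
  private readonly maxReconnects: number;
  private failures = 0;
  private reconnects = 0;

  constructor(failuresBeforeReconnect: number, maxReconnects: number) {
    this.failuresBeforeReconnect = failuresBeforeReconnect;
    this.maxReconnects = maxReconnects;
  }

  // A successful tick proves the pool works again, so clear all counters.
  onSuccess(): void {
    this.failures = 0;
    this.reconnects = 0;
  }

  onConnectionError(): LoopDecision {
    this.failures += 1;
    if (this.failures < this.failuresBeforeReconnect) return "continue";
    this.failures = 0;
    if (this.reconnects >= this.maxReconnects) return "exit";
    this.reconnects += 1;
    return "reconnect";
  }
}

// Two errors per reconnect attempt, at most two reconnects.
const policy = new ReconnectOrDie(2, 2);
const decisions: LoopDecision[] = [];
for (let i = 0; i < 6; i++) decisions.push(policy.onConnectionError());
console.log(decisions.join(",")); // continue,reconnect,continue,reconnect,continue,exit
```

A detector loop would call onConnectionError() from its catch block when isConnectionError(err) is true, rebuild the pool on "reconnect", and terminate the process with a non-zero code on "exit" so the supervisor's existing crash-respawn path takes over.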


Filed from production incident 2026-06-03. Manual remediation: killed supervisor PID tree → fresh supervisor+worker respawned (clean pool) → jobs retry on the 40 dead-lettered jobs → queue drained normally.
