Version: gbrain 0.42.10.0 (Postgres engine, transaction-mode pooler port 6543, prepare:false; minion jobs supervisor + 1 child jobs work, concurrency 3, --max-rss 16384)
Severity: High. Brain processing silently halted for ~15h; no crash, no alert, no self-heal.
TL;DR
A child worker's DB pool died (write CONNECTION_ENDED on every background-interval tick) and never recovered, but the worker process stayed alive the entire time. Because the supervisor's restart logic only fires on child exit, an alive-but-wedged worker is invisible to it. The supervisor's own health check (separate connection) kept succeeding, so it emitted health_warn reason=no_recent_completions once a minute for 914 consecutive minutes and took zero corrective action. Jobs piled to 57 waiting / 0 active, and every long-running autopilot-cycle dead-lettered with max stalled count exceeded. Manual fix: kill the supervisor tree so the respawn path rebuilds a fresh worker.
This is distinct from the existing connection bugs:
code=1 and respawns repeatedly (in-process reconnect missing).
Observed (sanitized)
/tmp/gbrain-worker.log, the same lines every ~minute for 15h:
jobs stats during the wedge: Queue health: 57 waiting, 0 active, 0 stalled, with the dead-letter pile all reading Error: max stalled count exceeded. minutes_since_completion climbed monotonically 0 → 914. Both supervisor and worker processes were alive the whole time (passed ps / kill -0), which is exactly why every count/liveness-based watchdog (ours included) failed to act.
Root cause
Two compounding gaps:
1. The worker's background interval loops (promoteDelayed, stall-detection, timeout-detection, wall-clock detection) spam the dead pool forever and never reconnect or escalate. Same module-level singleton-pool failure mode as #1720, but here it manifests without a process exit: the detection loops just catch-and-log CONNECTION_ENDED indefinitely. A worker whose pool is permanently dead but whose event loop is still spinning is, functionally, a zombie that no liveness check catches.
2. The supervisor detects the symptom and does nothing with it. In src/core/minions/supervisor.ts:582-589, no_recent_completions is emit-and-forget. There is no escalation path that converts "waiting>0 AND active==0 AND no completion in N minutes AND child claims to be alive" into a forced child restart. The supervisor's own db_connection_degraded → reconnect() path (supervisor.ts:~605) only triggers when the supervisor's own health query fails ≥3×. But the supervisor uses a different connection than the worker pool, so it keeps succeeding: the supervisor thinks the DB is fine while the worker it's supposed to babysit is wedged against the same pooler.
Net: the one signal that actually caught the incident (no_recent_completions, 914×) is wired to a log line instead of a recovery action.
Proposed fix
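As a rough illustration of the missing escalation path, the wedged-queue condition quoted in the root cause (waiting > 0, active == 0, no completion in N minutes, child nominally alive) can be tracked as a small strike counter. This is a minimal sketch, not the supervisor's actual code: QueueSnapshot, WatchdogConfig, and ProgressWatchdog are hypothetical names, and the real health-check types in supervisor.ts are not shown in this report.

```typescript
// Hypothetical shapes: the real supervisor types are not shown in this report.
interface QueueSnapshot {
  waiting: number;
  active: number;
  minutesSinceCompletion: number;
  childAlive: boolean; // result of the existing liveness check
}

interface WatchdogConfig {
  stallThresholdMinutes: number; // "threshold" in the proposal
  consecutiveChecks: number; // "K" in the proposal
}

class ProgressWatchdog {
  private strikes = 0;
  private readonly cfg: WatchdogConfig;

  constructor(cfg: WatchdogConfig) {
    this.cfg = cfg;
  }

  // Feed one health-check sample. Returns true when the wedged signature has
  // held for K consecutive samples, i.e. the child should be force-restarted.
  observe(s: QueueSnapshot): boolean {
    const wedged =
      s.childAlive &&
      s.waiting > 0 &&
      s.active === 0 &&
      s.minutesSinceCompletion > this.cfg.stallThresholdMinutes;

    // Any sample that does not match the signature clears the streak.
    this.strikes = wedged ? this.strikes + 1 : 0;

    if (this.strikes >= this.cfg.consecutiveChecks) {
      this.strikes = 0; // start a fresh streak for the respawned child
      return true;
    }
    return false;
  }
}
```

In the once-a-minute health check, a true return from observe() would be the point to call the restart path (the killChild('SIGTERM') then SIGKILL sequence described in this section). Requiring K consecutive matches keeps a single slow sample from triggering a restart.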
1. Escalate no_recent_completions into a forced child restart (the real fix). When waiting > 0 && active == 0 && minutesSinceCompletion > threshold persists across K consecutive health checks while the child is nominally alive, treat the child as wedged: killChild('SIGTERM') (then SIGKILL after grace) so the spawn-and-respawn loop rebuilds a fresh pool. This is the "progress watchdog" the supervisor is missing: it currently watches liveness, not forward progress. Make threshold/K configurable; default conservative (e.g. >15 min stall + zero active jobs ⇒ restart).
2. Give the worker's background interval loops a reconnect-or-die policy. On repeated CONNECTION_ENDED/CONNECTION_CLOSED from the promoteDelayed/stall/timeout/wall-clock detectors, attempt in-process pool reconnect (shared with the work in #1720, "autopilot worker crash-loops on pooler CONNECTION_CLOSED: background queue loops have no in-process reconnect", and #1669, "fix(db-lock): self-heal cycle-lock refresh on pooler-reaped CONNECTION_ENDED"); if reconnect fails N times, exit non-zero so the supervisor's existing crash-respawn path takes over instead of spinning forever on a dead pool. A worker that cannot reach the DB should crash loudly, not idle quietly.
3. Make the stall obvious in jobs stats/doctor. 0 active + >0 waiting + last_completed age > 15m is an unambiguous wedged-queue signature. Surface it as a health error (not just a buried log warn) so operators and the daily doctor catch it in minutes, not 15 hours.
Impact
Silent 15h processing halt with no crash and no alert. Any deployment behind a Supabase pooler that periodically drops sockets can hit this; the alive-but-wedged worker defeats every process-liveness watchdog (gbrain's supervisor, our external minion-watchdog.sh, container health). Fix #1 alone closes the incident; #2 and #3 are defense-in-depth.
Filed from production incident 2026-06-03. Manual remediation: killed supervisor PID tree → fresh supervisor+worker respawned (clean pool) → jobs retry on the 40 dead-lettered jobs → queue drained normally.
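For the reconnect-or-die policy in fix #2, one possible shape is a small counter that every background loop consults when a tick throws. This is a sketch under stated assumptions, not gbrain code: ReconnectOrDiePolicy, guardedTick, and the injected reconnect/exit callbacks are hypothetical, and it assumes the driver reports CONNECTION_ENDED / CONNECTION_CLOSED in a string code property on the thrown error. It also counts consecutive failing ticks rather than failed reconnect attempts, which is a simplification of "reconnect fails N times".

```typescript
type PolicyAction = "rethrow" | "reconnect" | "exit";

// Error codes named in this report; treated here as values of err.code (assumption).
const CONNECTION_CODES = new Set(["CONNECTION_ENDED", "CONNECTION_CLOSED"]);

function isConnectionError(err: unknown): boolean {
  const code = (err as { code?: unknown } | null)?.code;
  return typeof code === "string" && CONNECTION_CODES.has(code);
}

class ReconnectOrDiePolicy {
  private consecutiveFailures = 0;
  private readonly maxFailures: number;

  constructor(maxFailures: number) {
    this.maxFailures = maxFailures;
  }

  // A tick that reached the DB clears the failure streak.
  onSuccess(): void {
    this.consecutiveFailures = 0;
  }

  // Decide what to do with an error thrown by a background tick.
  onError(err: unknown): PolicyAction {
    if (!isConnectionError(err)) return "rethrow"; // unrelated errors keep their existing handling
    this.consecutiveFailures += 1;
    return this.consecutiveFailures >= this.maxFailures ? "exit" : "reconnect";
  }
}

// Wraps one background tick. `reconnect` and `exit` are injected so the policy
// stays independent of any particular pool or process API (both hypothetical here).
async function guardedTick(
  policy: ReconnectOrDiePolicy,
  tick: () => Promise<void>,
  reconnect: () => Promise<void>,
  exit: (code: number) => void,
): Promise<void> {
  try {
    await tick();
    policy.onSuccess();
  } catch (err) {
    const action = policy.onError(err);
    if (action === "rethrow") throw err;
    if (action === "exit") {
      exit(1); // non-zero exit hands control to the supervisor's crash-respawn path
      return;
    }
    try {
      await reconnect();
    } catch {
      // A failed reconnect shows up as another failing tick and is counted there.
    }
  }
}
```

Sharing one policy instance across the promoteDelayed, stall, timeout, and wall-clock loops would make the worker exit after a bounded number of dead-pool ticks instead of logging CONNECTION_ENDED indefinitely.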