v0.42.22.0 fix(minions): supervisor progress watchdog + worker DB self-defense — alive-but-wedged worker self-heals (#1801) by garrytan · Pull Request #1824 · garrytan/gbrain

garrytan · 2026-06-03T16:07:00Z

v0.42.22.0 — supervisor progress watchdog + worker DB self-defense (closes #1801)

The bug (production incident)

Behind a Supabase transaction pooler, the DB pool of a child jobs work process died (write CONNECTION_ENDED on every background tick) and never recovered, yet the process stayed alive the whole time. Jobs piled up to 57 waiting / 0 active for ~15h, and every long autopilot-cycle job dead-lettered with max stalled count exceeded. No crash, no alert, no self-heal. The supervisor logged no_recent_completions once a minute for 914 minutes and took zero action. Manual fix: kill the supervisor tree so the respawn rebuilds a fresh pool.

Distinct from #1720/#1745 (crash-loop) and #1678 (RSS watchdog SIGKILL). Here the worker never exits, so no respawn path is ever entered. Every process-liveness watchdog is defeated — the process passes ps/kill -0. The gap: everything watched liveness, nothing watched forward progress.

The fix (two independent layers + surfacing)

Layer 1 — worker DB self-defense under supervision (worker.ts): the worker's SELECT 1 liveness probe was disabled whenever a supervisor was watching. It now runs under supervision too, so a worker whose own pool is dead self-exits db_dead (~3 min) and the supervisor respawns it with a fresh pool. Stall detection stays supervised-off (the supervisor owns that now).
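To illustrate the Layer 1 idea, here is a minimal sketch of a liveness probe that counts consecutive failures and signals a self-exit once a threshold is reached. All names (`DbProbe`, `ProbeState`, `recordProbe`, `probeOnce`) and the threshold value are hypothetical and chosen for illustration; they are not taken from worker.ts.

```typescript
// Hypothetical names throughout; this is not the worker.ts API.

// A probe is any async call that throws when the pool is unusable,
// for example a function that runs `SELECT 1` on the worker's own pool.
type DbProbe = () => Promise<void>;

interface ProbeState {
  consecutiveFailures: number;
}

interface ProbeResult {
  state: ProbeState;
  exit: boolean; // true once the failure streak reaches the threshold
}

/** Pure step: fold one probe outcome into the state. A success resets the streak. */
function recordProbe(state: ProbeState, ok: boolean, maxFailures: number): ProbeResult {
  const consecutiveFailures = ok ? 0 : state.consecutiveFailures + 1;
  return { state: { consecutiveFailures }, exit: consecutiveFailures >= maxFailures };
}

/** Run one probe tick. Any thrown error (e.g. CONNECTION_ENDED) counts as a failure. */
async function probeOnce(
  probe: DbProbe,
  state: ProbeState,
  maxFailures: number,
): Promise<ProbeResult> {
  try {
    await probe();
    return recordProbe(state, true, maxFailures);
  } catch {
    return recordProbe(state, false, maxFailures);
  }
}
```

In this sketch the caller would run `probeOnce` on a timer regardless of whether a supervisor is present, and exit with a db_dead reason when `exit` is true so the supervisor can respawn the worker with a fresh pool.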

Layer 2 — supervisor progress watchdog (supervisor.ts + child-worker-supervisor.ts): when a queue has claimable work, zero live-lock active jobs, and stale completions across K health checks while the child is alive, the supervisor restarts it. Catches every wedge cause, not just dead pools. Name+queue-scoped (no false-positive on unhandled job types), active_healthy-gated (a worker that died mid-job leaving an expired-lock row does NOT mask the wedge), startup-grace + loop-budget bounded, restart accounted as an intentional self-heal so it never trips max_crashes.
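The Layer 2 decision can be sketched as a pure function over per-tick signals. The types and field names below (`WedgeSignals`, `WedgeConfig`, `looksWedged`, `stepWatchdog`) are assumptions made for illustration, not the actual supervisor.ts shapes, and the loop budget is left out to keep the sketch short.

```typescript
// Hypothetical shapes; the real supervisor types are not shown in this PR text.

interface WedgeSignals {
  claimable: number;         // waiting jobs this worker can handle (name+queue-scoped)
  activeHealthy: number;     // active jobs whose lock has not expired
  completionsStale: boolean; // no completion inside the staleness window
  childAlive: boolean;
  childUptimeMs: number;
}

interface WedgeConfig {
  consecutiveChecks: number; // K health checks in a row
  startupGraceMs: number;    // ignore a freshly started child
}

/** One health check: does this tick look like an alive-but-wedged worker? */
function looksWedged(s: WedgeSignals, cfg: WedgeConfig): boolean {
  return (
    s.childAlive &&
    s.childUptimeMs >= cfg.startupGraceMs &&
    s.claimable > 0 &&
    s.activeHealthy === 0 && // expired-lock rows are not counted, so they cannot mask a wedge
    s.completionsStale
  );
}

/** Fold one tick into the streak; request a restart after K wedged ticks in a row. */
function stepWatchdog(
  streak: number,
  s: WedgeSignals,
  cfg: WedgeConfig,
): { streak: number; restart: boolean } {
  if (!looksWedged(s, cfg)) return { streak: 0, restart: false };
  const next = streak + 1;
  return next >= cfg.consecutiveChecks
    ? { streak: 0, restart: true }
    : { streak: next, restart: false };
}
```

Keeping the decision pure makes it easy to unit test every edge case (startup grace, a healthy active job, an empty queue) without spawning a child process.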

Surfacing (doctor.ts + queue.ts + jobs.ts): gbrain jobs stats prints a WEDGED QUEUE line; gbrain doctor reports a per-queue wedged_queue health error. Also fixed a latent bug: the remote queue_health doctor check queried a non-existent state column (errored every run, silently returned "No queue activity") — corrected to status.

Pre-existing bug fixed in passing: killChild gated on .killed ("signal sent") instead of liveness, so the follow-up SIGKILL after an ignored SIGTERM was a silent no-op — in the new escalation AND the existing shutdown() drain. Now gates on exitCode/signalCode === null.
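A short sketch of why gating on `.killed` fails. In Node.js, `killed` becomes true as soon as a signal has been sent successfully, while `exitCode` and `signalCode` stay null until the process has actually terminated. The `ChildLike` interface and the function names here are hypothetical; `ChildLike` is a structural subset of Node's `ChildProcess` so the sketch stays self-contained.

```typescript
// Hypothetical helper names; ChildLike mirrors the ChildProcess fields used here.
interface ChildLike {
  exitCode: number | null;
  signalCode: string | null;
  killed: boolean;
  kill(signal?: string): boolean;
  once(event: "exit", listener: () => void): unknown;
}

/** True while the process has neither exited nor been terminated by a signal. */
function isAlive(child: Pick<ChildLike, "exitCode" | "signalCode">): boolean {
  return child.exitCode === null && child.signalCode === null;
}

/** Send SIGTERM, then SIGKILL after graceMs if the child is still running. */
function killWithEscalation(child: ChildLike, graceMs: number): void {
  if (!isAlive(child)) return;
  child.kill("SIGTERM");
  const timer = setTimeout(() => {
    // child.killed is already true at this point because SIGTERM was sent,
    // so a `!child.killed` gate would skip the SIGKILL. Gate on liveness.
    if (isAlive(child)) child.kill("SIGKILL");
  }, graceMs);
  child.once("exit", () => clearTimeout(timer));
}
```

A child that ignores SIGTERM still reports `killed === true`, which is exactly the case where the follow-up SIGKILL is needed.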

What's in this PR (4 commits)

fix(minions): supervisor progress watchdog + worker DB self-defense (#1801) — the feature.
Merge origin/master, reconciling with #1802 (v0.42.16.0 doctor self-heal: worker_oom_loop, pool_reap_health, extract-atoms-drain). Both check sets coexist; they cover different failure modes.
fix(minions): wedge_restart_loop one-shot + … — adversarial-review findings.
chore: bump version and changelog (v0.42.22.0).

Tests

New: supervisor-wedge (decision logic via seam + PGLite SQL semantics for every reviewed edge case), worker-supervised-db-probe, doctor-wedged-queue, queue-getstats-wedge, plus child-worker-supervisor additions (behavioral restart coverage + structural regression guards for the killChild fix). Full suite + bun run verify (30 checks) green.

Review

Plan went through /plan-eng-review + a /codex outside-voice pass (16 findings, all folded in — including the killChild .killed no-op, the expired-lock-suppresses-wedge bug, and the SIGTERM-counts-as-crash bug). A pre-landing adversarial pass on the diff caught the wedge_restart_loop per-tick spam (fixed in commit 3).

🤖 Generated with Claude Code

Documentation

docs/guides/queue-operations-runbook.md: new "The worker is alive but wedged (dead pool)" section documenting the #1801 case, the automatic self-heal (worker self-exit on dead pool + supervisor progress-watchdog restart), the --wedge-restart-* tuning flags, and the new signals (gbrain jobs stats WEDGED line, gbrain doctor wedged_queue check) + manual fix.
docs/architecture/KEY_FILES.md: current-state updates to the worker.ts (DB probe runs under supervision), supervisor.ts (progress watchdog + queryWedgeSignals), and child-worker-supervisor.ts (killChild liveness fix, restartCurrentChild, wedge_restart cause) entries.
CHANGELOG.md: v0.42.22.0 entry (ELI10 lead + how-to + before/after table + review findings).
CLAUDE.md: not touched — the repo keeps key-files detail in docs/architecture/KEY_FILES.md (anti-bloat guard bans version markers in CLAUDE.md). llms.txt/llms-full.txt regenerated, no diff.

Coverage: all shipped surface (supervisor flags, wedged_queue check, jobs stats WEDGED line, env knobs) has reference + how-to coverage. No diagram drift.

…nder supervision (#1801)

Alive-but-wedged worker (dead DB pool, process still up) now self-heals in minutes instead of a silent 15h halt.

- supervisor: progress watchdog restarts a child that makes no forward progress on claimable work (name+queue-scoped, active_healthy/due-delayed aware, startup-grace + loop-budget bounded); runtime handler-name derivation.
- child-worker-supervisor: killChild gates on liveness not .killed (also fixes the existing shutdown SIGKILL no-op); restartCurrentChild kills the captured child ref; intentional restart doesn't count toward max_crashes.
- worker: DB-liveness probe runs under supervision (db_dead self-exit), stall detection stays supervised-off.
- doctor: standalone per-queue wedged_queue check + state->status fix in the remote queue_health check.
- jobs/queue: queue-scoped getStats wedge fields + jobs stats WEDGED line.

…+ jobs-stats threshold (review)

Pre-landing adversarial review findings:

- wedge_restart_loop warn now fires once per exhausted window via a re-arming flag, not every health tick (was flooding the audit log for the full window).
- Correct the stale GBRAIN_SUPERVISED comment: the DB probe runs under supervision now; only stall detection is skipped.
- jobs stats WEDGED line reads GBRAIN_WEDGED_QUEUE_WARN_MINUTES so it agrees with the doctor wedged_queue threshold.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
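The "once per exhausted window via a re-arming flag" behaviour can be sketched as a small latch. The class name `OneShotWarn` and its method are hypothetical; the sketch only shows the general pattern, not the code in this PR.

```typescript
// Hypothetical helper; illustrates a re-arming one-shot, not the supervisor's code.
class OneShotWarn {
  private armed = true;

  /**
   * Call on every health tick. Returns true only on the first tick of each
   * stretch where the condition holds; re-arms when the condition clears.
   */
  shouldWarn(conditionActive: boolean): boolean {
    if (!conditionActive) {
      this.armed = true; // window ended, allow the next occurrence to warn
      return false;
    }
    if (!this.armed) return false; // already warned for this window
    this.armed = false;
    return true;
  }
}
```

With a one-minute health tick, a latch like this turns a warning per tick into a single warning per window while still reporting the next occurrence.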

* upstream/master:
  v0.42.23.0 feat(jobs): --nice scheduling-priority flag for jobs work/supervisor (garrytan#1815) (garrytan#1820)
  v0.42.22.0 fix(minions): supervisor progress watchdog + worker DB self-defense — alive-but-wedged worker self-heals (garrytan#1801) (garrytan#1824)
  v0.42.21.0 fix(postgres): module-singleton ownership — canonical landing for the dream-cycle "connect() has not been called" class (garrytan#1404/garrytan#1471/garrytan#1619) (garrytan#1805)
  v0.42.20.0 fix: reliability wave — PGLite capture lock-pin + Postgres reconnect race + search embed-hang (garrytan#1762 garrytan#1745 garrytan#1775) (garrytan#1810)
  v0.42.19.0 fix(skillopt): close the last gap in the AI SDK v6 tool-loop fix (write-capture mapper + regression test) (garrytan#1809)
  v0.42.18.0 fix: sync orphan-pileup watchdog (garrytan#1633) + links-lag µs stamp (garrytan#1768) (garrytan#1807)
  v0.42.17.0 fix(sync): resumable incremental sync — killed mid-import no longer loses progress (garrytan#1794) (garrytan#1808)
  v0.42.16.0 feat(doctor): brain health as a solved problem — cause-ranked doctor + OOM-loop line + auto-drain + pool-reap (garrytan#1685) (garrytan#1802)
  v0.42.15.0 fix: decouple CLI primary output from process.stdout.isTTY (garrytan#1784) (garrytan#1806)
  v0.42.14.0 fix(zero-config): code-* readiness signal + init embedding-key validation + lock self-heal (garrytan#1780) (garrytan#1804)
  v0.42.13.0 fix(search): archive/ content findable by default, demoted not hard-excluded (garrytan#1777) (garrytan#1797)
  v0.42.12.0 feat: self-upgrading gbrain — invocation-riding update check + opt-in auto-upgrade (garrytan#1798)
  v0.42.11.0 feat(skillopt): held-out eval gate, honest receipts, ENFORCE + ablation opts (garrytan#1759)
  v0.42.10.0 feat(extract): opt-in global-basename wikilink resolution (closes garrytan#972) (garrytan#1388)

garrytan and others added 6 commits June 3, 2026 08:01

Merge remote-tracking branch 'origin/master' into garrytan/fix-issue-…

62ac7fe

…1801 # Conflicts: # src/core/doctor-categories.ts

chore: bump version and changelog (v0.42.22.0)

f3eadf3

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs: queue-ops runbook + KEY_FILES for the #1801 wedge watchdog (v0.…

f812a4c

…42.22.0) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/master' into garrytan/fix-issue-…

fc090df

…1801 # Conflicts: # CHANGELOG.md # VERSION # package.json

garrytan merged commit f495934 into master Jun 3, 2026
21 checks passed

garrytan-agents mentioned this pull request Jun 4, 2026

jobs supervisor singleton is pidfile-path-keyed (HOME-relative default) → two supervisors run on the same queue with conflicting --max-rss #1849

Closed

garrytan merged 6 commits into master from garrytan/fix-issue-1801