v0.42.22.0 fix(minions): supervisor progress watchdog + worker DB self-defense — alive-but-wedged worker self-heals (#1801)#1824
Merged
…nder supervision (#1801)
Alive-but-wedged worker (dead DB pool, process still up) now self-heals in minutes instead of a silent 15h halt.
- supervisor: progress watchdog restarts a child that makes no forward progress on claimable work (name+queue-scoped, active_healthy/due-delayed aware, startup-grace + loop-budget bounded); runtime handler-name derivation.
- child-worker-supervisor: killChild gates on liveness not .killed (also fixes the existing shutdown SIGKILL no-op); restartCurrentChild kills the captured child ref; intentional restart doesn't count toward max_crashes.
- worker: DB-liveness probe runs under supervision (db_dead self-exit), stall detection stays supervised-off.
- doctor: standalone per-queue wedged_queue check + state->status fix in the remote queue_health check.
- jobs/queue: queue-scoped getStats wedge fields + jobs stats WEDGED line.
…1801 # Conflicts: # src/core/doctor-categories.ts
…+ jobs-stats threshold (review)
Pre-landing adversarial review findings:
- wedge_restart_loop warn now fires once per exhausted window via a re-arming flag, not every health tick (was flooding the audit log for the full window).
- Correct the stale GBRAIN_SUPERVISED comment: the DB probe runs under supervision now; only stall detection is skipped.
- jobs stats WEDGED line reads GBRAIN_WEDGED_QUEUE_WARN_MINUTES so it agrees with the doctor wedged_queue threshold.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…42.22.0) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…1801 # Conflicts: # CHANGELOG.md # VERSION # package.json
mgunnin added a commit to mgunnin/gbrain that referenced this pull request on Jun 3, 2026
* upstream/master: v0.42.23.0 feat(jobs): --nice scheduling-priority flag for jobs work/supervisor (garrytan#1815) (garrytan#1820) v0.42.22.0 fix(minions): supervisor progress watchdog + worker DB self-defense — alive-but-wedged worker self-heals (garrytan#1801) (garrytan#1824) v0.42.21.0 fix(postgres): module-singleton ownership — canonical landing for the dream-cycle "connect() has not been called" class (garrytan#1404/garrytan#1471/garrytan#1619) (garrytan#1805) v0.42.20.0 fix: reliability wave — PGLite capture lock-pin + Postgres reconnect race + search embed-hang (garrytan#1762 garrytan#1745 garrytan#1775) (garrytan#1810) v0.42.19.0 fix(skillopt): close the last gap in the AI SDK v6 tool-loop fix (write-capture mapper + regression test) (garrytan#1809) v0.42.18.0 fix: sync orphan-pileup watchdog (garrytan#1633) + links-lag µs stamp (garrytan#1768) (garrytan#1807) v0.42.17.0 fix(sync): resumable incremental sync — killed mid-import no longer loses progress (garrytan#1794) (garrytan#1808) v0.42.16.0 feat(doctor): brain health as a solved problem — cause-ranked doctor + OOM-loop line + auto-drain + pool-reap (garrytan#1685) (garrytan#1802) v0.42.15.0 fix: decouple CLI primary output from process.stdout.isTTY (garrytan#1784) (garrytan#1806) v0.42.14.0 fix(zero-config): code-* readiness signal + init embedding-key validation + lock self-heal (garrytan#1780) (garrytan#1804) v0.42.13.0 fix(search): archive/ content findable by default, demoted not hard-excluded (garrytan#1777) (garrytan#1797) v0.42.12.0 feat: self-upgrading gbrain — invocation-riding update check + opt-in auto-upgrade (garrytan#1798) v0.42.11.0 feat(skillopt): held-out eval gate, honest receipts, ENFORCE + ablation opts (garrytan#1759) v0.42.10.0 feat(extract): opt-in global-basename wikilink resolution (closes garrytan#972) (garrytan#1388)
v0.42.22.0 — supervisor progress watchdog + worker DB self-defense (closes #1801)
The bug (production incident)
Behind a Supabase transaction pooler, a child `jobs work` process's DB pool died (`write CONNECTION_ENDED` on every background tick) and never recovered, but the process stayed alive the whole time. Jobs piled to 57 waiting / 0 active for ~15h; every long `autopilot-cycle` dead-lettered with `max stalled count exceeded`. No crash, no alert, no self-heal. The supervisor logged `no_recent_completions` once a minute for 914 minutes and took zero action. Manual fix: kill the supervisor tree so the respawn rebuilds a fresh pool.

Distinct from #1720/#1745 (crash-loop) and #1678 (RSS watchdog SIGKILL). Here the worker never exits, so no respawn path is ever entered. Every process-liveness watchdog is defeated: the process passes `ps`/`kill -0`. The gap: everything watched liveness, nothing watched forward progress.

The fix (two independent layers + surfacing)
Layer 1 - worker DB self-defense under supervision (`worker.ts`): the worker's `SELECT 1` liveness probe was disabled whenever a supervisor was watching. It now runs under supervision too, so a worker whose own pool is dead self-exits `db_dead` (~3 min) and the supervisor respawns it with a fresh pool. Stall detection stays supervised-off (the supervisor owns that now).

Layer 2 - supervisor progress watchdog (`supervisor.ts` + `child-worker-supervisor.ts`): when a queue has claimable work, zero live-lock active jobs, and stale completions across K health checks while the child is alive, the supervisor restarts it. Catches every wedge cause, not just dead pools. Name+queue-scoped (no false-positive on unhandled job types), `active_healthy`-gated (a worker that died mid-job leaving an expired-lock row does NOT mask the wedge), startup-grace + loop-budget bounded, restart accounted as an intentional self-heal so it never trips `max_crashes`.

Surfacing (`doctor.ts` + `queue.ts` + `jobs.ts`): `gbrain jobs stats` prints a `WEDGED QUEUE` line; `gbrain doctor` reports a per-queue `wedged_queue` health error. Also fixed a latent bug: the remote `queue_health` doctor check queried a non-existent `state` column (errored every run, silently returned "No queue activity"); corrected to `status`.

Pre-existing bug fixed in passing: `killChild` gated on `.killed` ("signal sent") instead of liveness, so the follow-up `SIGKILL` after an ignored `SIGTERM` was a silent no-op, both in the new escalation AND in the existing `shutdown()` drain. Now gates on `exitCode`/`signalCode === null`.

What's in this PR (4 commits)
1. `fix(minions): supervisor progress watchdog + worker DB self-defense (#1801)`: the feature.
2. `Merge origin/master`: reconciles with #1802 (v0.42.16.0 doctor self-heal: `worker_oom_loop`, `pool_reap_health`, `extract-atoms-drain`). Both check sets coexist; they cover different failure modes.
3. `fix(minions): wedge_restart_loop one-shot + …`: adversarial-review findings.
4. `chore: bump version and changelog (v0.42.22.0)`.

Tests
New: `supervisor-wedge` (decision logic via seam + PGLite SQL semantics for every reviewed edge case), `worker-supervised-db-probe`, `doctor-wedged-queue`, `queue-getstats-wedge`, plus `child-worker-supervisor` additions (behavioral restart coverage + structural regression guards for the `killChild` fix). Full suite + `bun run verify` (30 checks) green.

Review
Plan went through `/plan-eng-review` + a `/codex` outside-voice pass (16 findings, all folded in, including the `killChild` `.killed` no-op, the expired-lock-suppresses-wedge bug, and the SIGTERM-counts-as-crash bug). A pre-landing adversarial pass on the diff caught the `wedge_restart_loop` per-tick spam (fixed in commit 3).

🤖 Generated with Claude Code
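The `killChild` `.killed` no-op mentioned above is easier to see in isolation. The following is a minimal TypeScript sketch, not the code from `child-worker-supervisor.ts`: `ChildLike`, `isAlive`, `killChildBuggy`, `killChildFixed`, and `StubbornChild` are hypothetical names introduced for illustration. It relies only on the documented Node.js behavior that `killed` becomes true once a signal has been sent, while `exitCode` and `signalCode` stay `null` until the process has actually exited.

```typescript
// Hypothetical minimal shape of a child handle; it mirrors the exitCode,
// signalCode, killed, and kill() members of Node's ChildProcess.
interface ChildLike {
  exitCode: number | null;
  signalCode: string | null;
  killed: boolean;
  kill(signal: string): boolean;
}

// A child is still running only while neither exit field has been set.
function isAlive(child: ChildLike): boolean {
  return child.exitCode === null && child.signalCode === null;
}

// Buggy gate: `killed` means "a signal was sent", not "the process is gone",
// so after an ignored SIGTERM this never sends the follow-up SIGKILL.
function killChildBuggy(child: ChildLike, signal: string): boolean {
  if (child.killed) return false;
  return child.kill(signal);
}

// Liveness gate: send the signal whenever the process has not exited yet.
function killChildFixed(child: ChildLike, signal: string): boolean {
  if (!isAlive(child)) return false;
  return child.kill(signal);
}

// Test double (assumption): ignores SIGTERM, terminates on SIGKILL.
class StubbornChild implements ChildLike {
  exitCode: number | null = null;
  signalCode: string | null = null;
  killed = false;
  received: string[] = [];

  kill(signal: string): boolean {
    this.received.push(signal);
    this.killed = true; // signal delivered; the process may still be running
    if (signal === "SIGKILL") this.signalCode = "SIGKILL";
    return true;
  }
}
```

With a child that ignores `SIGTERM`, `killChildBuggy` declines to send the follow-up `SIGKILL` because `killed` is already true, whereas `killChildFixed` sends it because the process has not yet exited.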
Documentation
- `docs/guides/queue-operations-runbook.md`: new "The worker is alive but wedged (dead pool)" section. It documents the #1801 case ("Supervisor never restarts an alive-but-wedged worker: dead worker DB pool + no-op no_recent_completions warn = silent 15h processing halt"), the automatic self-heal (worker self-exit on dead pool + supervisor progress-watchdog restart), the `--wedge-restart-*` tuning flags, and the new signals (`gbrain jobs stats` WEDGED line, `gbrain doctor` `wedged_queue` check) + manual fix.
- `docs/architecture/KEY_FILES.md`: current-state updates to the `worker.ts` (DB probe runs under supervision), `supervisor.ts` (progress watchdog + `queryWedgeSignals`), and `child-worker-supervisor.ts` (`killChild` liveness fix, `restartCurrentChild`, `wedge_restart` cause) entries.
- `CHANGELOG.md`: v0.42.22.0 entry (ELI10 lead + how-to + before/after table + review findings).
- `docs/architecture/KEY_FILES.md` (anti-bloat guard bans version markers in `CLAUDE.md`).
- `llms.txt` / `llms-full.txt` regenerated, no diff.

Coverage: all shipped surface (supervisor flags, `wedged_queue` check, `jobs stats` WEDGED line, env knobs) has reference + how-to coverage. No diagram drift.
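The Layer 2 progress-watchdog rule (claimable work, zero healthy active jobs, stale completions across K health checks, child alive, bounded by startup grace and a restart budget) can be sketched as a pure decision function. This is an illustrative sketch under stated assumptions, not the logic in `supervisor.ts`: every type, field, function name, and decision label below is hypothetical, and how the real supervisor gathers these signals or resets its budget window is not shown in this PR description.

```typescript
// All names here are hypothetical; they illustrate the rule, not gbrain's code.
interface WedgeSignals {
  claimable: number;         // due jobs in this queue that this worker can handle
  activeHealthy: number;     // active jobs whose lock has NOT expired
  completionsStale: boolean; // no completion inside the staleness window
  childAlive: boolean;
  msSinceChildStart: number;
}

interface WedgeConfig {
  consecutiveChecks: number;    // K health checks required before acting
  startupGraceMs: number;       // never judge a freshly started child
  maxRestartsPerWindow: number; // loop budget
}

interface WedgeState {
  streak: number;           // consecutive wedged-looking health checks
  restartsInWindow: number; // caller is assumed to reset this per window
}

type WedgeDecision = "ok" | "suspect" | "restart" | "budget_exhausted";

// One health check looks wedged when work is waiting, nothing healthy is
// running, completions are stale, and the child is alive past its grace period.
function looksWedged(s: WedgeSignals, cfg: WedgeConfig): boolean {
  return (
    s.childAlive &&
    s.msSinceChildStart >= cfg.startupGraceMs &&
    s.claimable > 0 &&
    s.activeHealthy === 0 &&
    s.completionsStale
  );
}

function evaluateWedge(
  s: WedgeSignals,
  cfg: WedgeConfig,
  state: WedgeState,
): { decision: WedgeDecision; state: WedgeState } {
  if (!looksWedged(s, cfg)) {
    // Any sign of progress (or an ineligible child) clears the streak.
    return { decision: "ok", state: { streak: 0, restartsInWindow: state.restartsInWindow } };
  }
  const streak = state.streak + 1;
  if (streak < cfg.consecutiveChecks) {
    return { decision: "suspect", state: { streak, restartsInWindow: state.restartsInWindow } };
  }
  if (state.restartsInWindow >= cfg.maxRestartsPerWindow) {
    // Budget spent: report instead of restarting again.
    return { decision: "budget_exhausted", state: { streak, restartsInWindow: state.restartsInWindow } };
  }
  return { decision: "restart", state: { streak: 0, restartsInWindow: state.restartsInWindow + 1 } };
}
```

Counting only `activeHealthy` jobs is what keeps an expired-lock row from a worker that died mid-job from hiding the wedge: such a row is active in the table but contributes nothing to this count.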