v0.42.22.0 fix(minions): supervisor progress watchdog + worker DB self-defense — alive-but-wedged worker self-heals (#1801) #1824

Merged
garrytan merged 6 commits into master from garrytan/fix-issue-1801 on Jun 3, 2026

@garrytan garrytan commented Jun 3, 2026

v0.42.22.0 — supervisor progress watchdog + worker DB self-defense (closes #1801)

The bug (production incident)

Behind a Supabase transaction pooler, a child jobs work process's DB pool died (write CONNECTION_ENDED on every background tick) and never recovered — but the process stayed alive the whole time. Jobs piled to 57 waiting / 0 active for ~15h; every long autopilot-cycle dead-lettered with max stalled count exceeded. No crash, no alert, no self-heal. The supervisor logged no_recent_completions once a minute for 914 minutes and took zero action. Manual fix: kill the supervisor tree so the respawn rebuilds a fresh pool.

Distinct from #1720/#1745 (crash-loop) and #1678 (RSS watchdog SIGKILL). Here the worker never exits, so no respawn path is ever entered. Every process-liveness watchdog is defeated — the process passes ps/kill -0. The gap: everything watched liveness, nothing watched forward progress.

The fix (two independent layers + surfacing)

Layer 1 — worker DB self-defense under supervision (worker.ts): the worker's SELECT 1 liveness probe was disabled whenever a supervisor was watching. It now runs under supervision too, so a worker whose own pool is dead self-exits db_dead (~3 min) and the supervisor respawns it with a fresh pool. Stall detection stays supervised-off (the supervisor owns that now).
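
A minimal sketch of the Layer 1 idea: count consecutive failed `SELECT 1` probes and return `db_dead` once a threshold is reached. `DbProbeTracker`, `record`, and the threshold of 3 are hypothetical names and values chosen for illustration. The PR states only that the probe now runs under supervision and that the worker self-exits `db_dead` after roughly 3 minutes.

```typescript
// Hypothetical sketch: class name, method names, and the threshold are
// assumptions for illustration, not the actual worker.ts code.
type ProbeVerdict = "ok" | "db_dead";

class DbProbeTracker {
  private consecutiveFailures = 0;
  private readonly maxConsecutiveFailures: number;

  // maxConsecutiveFailures (assumed knob): failed probes in a row before giving up.
  constructor(maxConsecutiveFailures: number = 3) {
    this.maxConsecutiveFailures = maxConsecutiveFailures;
  }

  // Record the outcome of one `SELECT 1` probe and return the verdict.
  record(probeSucceeded: boolean): ProbeVerdict {
    if (probeSucceeded) {
      this.consecutiveFailures = 0; // any success resets the streak
      return "ok";
    }
    this.consecutiveFailures += 1;
    return this.consecutiveFailures >= this.maxConsecutiveFailures
      ? "db_dead"
      : "ok";
  }
}
```

In this sketch the worker's background tick would call `record` with the probe result and exit the process on `db_dead`, whether or not a supervisor is watching, so that the respawn builds a fresh pool.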

Layer 2 — supervisor progress watchdog (supervisor.ts + child-worker-supervisor.ts): when a queue has claimable work, zero live-lock active jobs, and stale completions across K health checks while the child is alive, the supervisor restarts it. Catches every wedge cause, not just dead pools. Name+queue-scoped (no false-positive on unhandled job types), active_healthy-gated (a worker that died mid-job leaving an expired-lock row does NOT mask the wedge), startup-grace + loop-budget bounded, restart accounted as an intentional self-heal so it never trips max_crashes.
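
The Layer 2 conditions can be sketched as a small decision function. Every identifier here (`WedgeSignals`, `WedgeConfig`, `WedgeDetector`, and their fields) is hypothetical. The sketch only encodes the rules stated above: claimable work, zero live-lock active jobs, stale completions across K consecutive health checks, an alive child past its startup grace, and a bounded restart budget.

```typescript
// Hypothetical sketch of the wedge decision; all names are assumptions.
interface WedgeSignals {
  claimable: number;         // waiting jobs this worker could claim (name+queue-scoped)
  activeHealthy: number;     // active jobs whose lock has NOT expired
  completionsStale: boolean; // no recent completions
  childAlive: boolean;
  childUptimeMs: number;
}

interface WedgeConfig {
  consecutiveChecks: number;    // K
  startupGraceMs: number;
  maxRestartsPerWindow: number; // loop budget
}

type WedgeDecision = "none" | "restart" | "budget_exhausted";

class WedgeDetector {
  private streak = 0;
  private restartsInWindow = 0;
  private readonly cfg: WedgeConfig;

  constructor(cfg: WedgeConfig) {
    this.cfg = cfg;
  }

  // Call once per supervisor health check.
  check(s: WedgeSignals): WedgeDecision {
    const suspicious =
      s.childAlive &&
      s.childUptimeMs >= this.cfg.startupGraceMs &&
      s.claimable > 0 &&
      s.activeHealthy === 0 && // expired-lock rows are not counted, so they cannot mask a wedge
      s.completionsStale;

    if (!suspicious) {
      this.streak = 0;
      return "none";
    }
    this.streak += 1;
    if (this.streak < this.cfg.consecutiveChecks) return "none";

    this.streak = 0;
    if (this.restartsInWindow >= this.cfg.maxRestartsPerWindow) {
      return "budget_exhausted";
    }
    this.restartsInWindow += 1;
    return "restart";
  }

  // Call when the budget window rolls over.
  resetWindow(): void {
    this.restartsInWindow = 0;
  }
}
```

Counting only `activeHealthy` jobs is what keeps a worker that died mid-job from hiding the wedge: its expired-lock row never counts as progress.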

Surfacing (doctor.ts + queue.ts + jobs.ts): gbrain jobs stats prints a WEDGED QUEUE line; gbrain doctor reports a per-queue wedged_queue health error. Also fixed a latent bug: the remote queue_health doctor check queried a non-existent state column (errored every run, silently returned "No queue activity") — corrected to status.
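
As an illustration of how the stats line and the doctor check could agree on one threshold, here is a hedged sketch. The environment variable `GBRAIN_WEDGED_QUEUE_WARN_MINUTES` comes from this PR's commit notes. The function names, the `QueueStatsLike` shape, and the 30-minute fallback are assumptions, not the actual queue.ts or doctor.ts code.

```typescript
// Hypothetical sketch: shapes, names, and the fallback value are assumptions.
interface QueueStatsLike {
  waiting: number;
  active: number;
  minutesSinceLastCompletion: number;
}

// Read the shared threshold; fall back when the variable is unset or invalid.
function wedgeWarnMinutes(
  env: Record<string, string | undefined>,
  fallback: number = 30, // assumed default, not taken from the PR
): number {
  const parsed = Number(env["GBRAIN_WEDGED_QUEUE_WARN_MINUTES"]);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
}

// A queue looks wedged when work is waiting, nothing is active,
// and the last completion is older than the threshold.
function isQueueWedged(stats: QueueStatsLike, thresholdMinutes: number): boolean {
  return (
    stats.waiting > 0 &&
    stats.active === 0 &&
    stats.minutesSinceLastCompletion >= thresholdMinutes
  );
}
```

With the incident's numbers (57 waiting, 0 active, about 15 hours since the last completion), `isQueueWedged` returns `true` for any threshold up to 900 minutes.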

Pre-existing bug fixed in passing: killChild gated on .killed ("signal sent") instead of liveness, so the follow-up SIGKILL after an ignored SIGTERM was a silent no-op — in the new escalation AND the existing shutdown() drain. Now gates on exitCode/signalCode === null.
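
The difference between the two gates can be shown with a small sketch. In Node.js, a child process's `killed` flag records only that a signal was successfully sent, while `exitCode` and `signalCode` stay `null` until the process has actually exited. `ChildLike`, `isStillRunning`, and `escalateKill` are hypothetical names; the interface mirrors just the `ChildProcess` fields the PR mentions.

```typescript
// Hypothetical sketch: a structural stand-in for the ChildProcess fields in question.
interface ChildLike {
  exitCode: number | null;   // null until the process exits normally
  signalCode: string | null; // null until the process is terminated by a signal
  killed: boolean;           // true once a signal has been SENT, not once it died
  kill(signal?: string): boolean;
}

// Liveness gate: the process is running until one of the two codes is set.
function isStillRunning(child: ChildLike): boolean {
  return child.exitCode === null && child.signalCode === null;
}

// Follow-up after an ignored SIGTERM. Gating on `!child.killed` here would
// skip the SIGKILL, because SIGTERM already flipped `killed` to true.
function escalateKill(child: ChildLike): boolean {
  if (!isStillRunning(child)) return false; // already gone, nothing to do
  return child.kill("SIGKILL");
}
```

A supervisor would typically call `escalateKill` from a timer after sending SIGTERM; the timer and grace period are omitted here to keep the sketch synchronous.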

What's in this PR (4 commits)

  1. fix(minions): supervisor progress watchdog + worker DB self-defense (#1801) — the feature.
  2. Merge origin/master: reconciles with #1802 (v0.42.16.0 doctor self-heal: worker_oom_loop, pool_reap_health, extract-atoms-drain). Both check sets coexist; they cover different failure modes.
  3. fix(minions): wedge_restart_loop one-shot + … — adversarial-review findings.
  4. chore: bump version and changelog (v0.42.22.0).

Tests

New: supervisor-wedge (decision logic via seam + PGLite SQL semantics for every reviewed edge case), worker-supervised-db-probe, doctor-wedged-queue, queue-getstats-wedge, plus child-worker-supervisor additions (behavioral restart coverage + structural regression guards for the killChild fix). Full suite + bun run verify (30 checks) green.

Review

Plan went through /plan-eng-review + a /codex outside-voice pass (16 findings, all folded in — including the killChild .killed no-op, the expired-lock-suppresses-wedge bug, and the SIGTERM-counts-as-crash bug). A pre-landing adversarial pass on the diff caught the wedge_restart_loop per-tick spam (fixed in commit 3).
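
The per-tick spam fix is described in the commit notes as a re-arming flag that lets the wedge_restart_loop warning fire once per exhausted window. A minimal sketch of that pattern follows; `OneShotWarn` and `onTick` are hypothetical names, not the actual supervisor.ts code.

```typescript
// Hypothetical sketch of a one-shot, re-arming warning gate.
class OneShotWarn {
  private armed = true;

  // Returns true only on the first exhausted tick since the last re-arm.
  onTick(budgetExhausted: boolean): boolean {
    if (!budgetExhausted) {
      this.armed = true; // condition cleared: re-arm for the next window
      return false;
    }
    if (!this.armed) return false; // already warned for this window
    this.armed = false;
    return true;
  }
}
```

The caller logs the warning only when `onTick` returns `true`, so a window that stays exhausted for many health ticks produces a single audit-log entry.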

🤖 Generated with Claude Code

Documentation

  • docs/guides/queue-operations-runbook.md: new "The worker is alive but wedged (dead pool)" section. It documents the #1801 case, the automatic self-heal (worker self-exit on dead pool + supervisor progress-watchdog restart), the --wedge-restart-* tuning flags, the new signals (gbrain jobs stats WEDGED line, gbrain doctor wedged_queue check), and the manual fix.
  • docs/architecture/KEY_FILES.md: current-state updates to the worker.ts (DB probe runs under supervision), supervisor.ts (progress watchdog + queryWedgeSignals), and child-worker-supervisor.ts (killChild liveness fix, restartCurrentChild, wedge_restart cause) entries.
  • CHANGELOG.md: v0.42.22.0 entry (ELI10 lead + how-to + before/after table + review findings).
  • CLAUDE.md: not touched — the repo keeps key-files detail in docs/architecture/KEY_FILES.md (anti-bloat guard bans version markers in CLAUDE.md). llms.txt/llms-full.txt regenerated, no diff.

Coverage: all shipped surface (supervisor flags, wedged_queue check, jobs stats WEDGED line, env knobs) has reference + how-to coverage. No diagram drift.

garrytan and others added 6 commits June 3, 2026 08:01
…nder supervision (#1801)

Alive-but-wedged worker (dead DB pool, process still up) now self-heals in
minutes instead of a silent 15h halt.

- supervisor: progress watchdog restarts a child that makes no forward progress
  on claimable work (name+queue-scoped, active_healthy/due-delayed aware,
  startup-grace + loop-budget bounded); runtime handler-name derivation.
- child-worker-supervisor: killChild gates on liveness not .killed (also fixes
  the existing shutdown SIGKILL no-op); restartCurrentChild kills the captured
  child ref; intentional restart doesn't count toward max_crashes.
- worker: DB-liveness probe runs under supervision (db_dead self-exit), stall
  detection stays supervised-off.
- doctor: standalone per-queue wedged_queue check + state->status fix in the
  remote queue_health check.
- jobs/queue: queue-scoped getStats wedge fields + jobs stats WEDGED line.
…1801

# Conflicts:
#	src/core/doctor-categories.ts
…+ jobs-stats threshold (review)

Pre-landing adversarial review findings:
- wedge_restart_loop warn now fires once per exhausted window via a re-arming
  flag, not every health tick (was flooding the audit log for the full window).
- Correct the stale GBRAIN_SUPERVISED comment: the DB probe runs under
  supervision now; only stall detection is skipped.
- jobs stats WEDGED line reads GBRAIN_WEDGED_QUEUE_WARN_MINUTES so it agrees
  with the doctor wedged_queue threshold.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…42.22.0)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…1801

# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
@garrytan garrytan merged commit f495934 into master Jun 3, 2026
21 checks passed
mgunnin added a commit to mgunnin/gbrain that referenced this pull request Jun 3, 2026
* upstream/master:
  v0.42.23.0 feat(jobs): --nice scheduling-priority flag for jobs work/supervisor (garrytan#1815) (garrytan#1820)
  v0.42.22.0 fix(minions): supervisor progress watchdog + worker DB self-defense — alive-but-wedged worker self-heals (garrytan#1801) (garrytan#1824)
  v0.42.21.0 fix(postgres): module-singleton ownership — canonical landing for the dream-cycle "connect() has not been called" class (garrytan#1404/garrytan#1471/garrytan#1619) (garrytan#1805)
  v0.42.20.0 fix: reliability wave — PGLite capture lock-pin + Postgres reconnect race + search embed-hang (garrytan#1762 garrytan#1745 garrytan#1775) (garrytan#1810)
  v0.42.19.0 fix(skillopt): close the last gap in the AI SDK v6 tool-loop fix (write-capture mapper + regression test) (garrytan#1809)
  v0.42.18.0 fix: sync orphan-pileup watchdog (garrytan#1633) + links-lag µs stamp (garrytan#1768) (garrytan#1807)
  v0.42.17.0 fix(sync): resumable incremental sync — killed mid-import no longer loses progress (garrytan#1794) (garrytan#1808)
  v0.42.16.0 feat(doctor): brain health as a solved problem — cause-ranked doctor + OOM-loop line + auto-drain + pool-reap (garrytan#1685) (garrytan#1802)
  v0.42.15.0 fix: decouple CLI primary output from process.stdout.isTTY (garrytan#1784) (garrytan#1806)
  v0.42.14.0 fix(zero-config): code-* readiness signal + init embedding-key validation + lock self-heal (garrytan#1780) (garrytan#1804)
  v0.42.13.0 fix(search): archive/ content findable by default, demoted not hard-excluded (garrytan#1777) (garrytan#1797)
  v0.42.12.0 feat: self-upgrading gbrain — invocation-riding update check + opt-in auto-upgrade (garrytan#1798)
  v0.42.11.0 feat(skillopt): held-out eval gate, honest receipts, ENFORCE + ablation opts (garrytan#1759)
  v0.42.10.0 feat(extract): opt-in global-basename wikilink resolution (closes garrytan#972) (garrytan#1388)

Development

Successfully merging this pull request may close these issues.

Supervisor never restarts an alive-but-wedged worker: dead worker DB pool + no-op no_recent_completions warn = silent 15h processing halt
