v0.42.16.0 feat(doctor): brain health as a solved problem — cause-ranked doctor + OOM-loop line + auto-drain + pool-reap (#1685) by garrytan · Pull Request #1802 · garrytan/gbrain

garrytan · 2026-06-03T05:31:19Z

Closes #1685. Layers the posture work on the mechanical foundation from #1678/#1735 (already in master at v0.42.5.0).

What this delivers

#1685 is the design invariant: an operator (human or agent) should never grep a worker log or read source to learn the brain is unhealthy or why. #1735 fixed the three concrete #1678 bugs (self-identifying RSS watchdog exit, cgroup-aware cap, pooler-reap self-heal, silent-backlog doctor warn). This PR closes the four remaining #1685-distinctive gaps.

A — `worker_oom_loop` doctor check (one line names the cause)

A worker that is OOM-looping now shows up as a single `gbrain doctor` line: `Worker OOM-looping: cap=8192MB, N watchdog kills/24h → raise --max-rss`. The check unions both worker modes: supervised workers (the supervisor `rss_watchdog` crash bucket, read across the current and previous ISO week so a Monday window can't lose Sunday) and bare `gbrain jobs work` workers (the `minion_jobs` `aborted: watchdog` count, the same source `queue_health` reads). The cap comes from the breaker alert when present, and otherwise falls back to the auto-sized default. No grep, no source-reading.
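To make the union and the cap fallback concrete, here is a minimal sketch. It is not the PR's `computeWorkerOomLoopCheck`: the `OomSignals` shape, the helper names, the choice to sum the two counts, and the `YYYY-Www` week-key format are all assumptions made for illustration.

```typescript
// Hypothetical input shape: counts gathered from the two worker modes.
interface OomSignals {
  supervisorWatchdogKills24h: number; // supervised workers (rss_watchdog crash bucket)
  bareWorkerWatchdogAborts24h: number; // bare workers (minion_jobs "aborted: watchdog")
  breakerAlertCapMb?: number; // cap reported by the breaker alert, when present
}

// Build the one-line doctor message, or null when there is nothing to report.
// Summing the two counts is an assumption about how the union is computed.
function workerOomLoopLine(s: OomSignals, defaultCapMb: number): string | null {
  const kills = s.supervisorWatchdogKills24h + s.bareWorkerWatchdogAborts24h;
  if (kills === 0) return null;
  const cap = s.breakerAlertCapMb ?? defaultCapMb; // fall back to the auto-sized default
  return `Worker OOM-looping: cap=${cap}MB, ${kills} watchdog kills/24h → raise --max-rss`;
}

// ISO 8601 week key (weeks start on Monday; week 1 contains the first Thursday).
function isoWeekKey(d: Date): string {
  const t = new Date(Date.UTC(d.getUTCFullYear(), d.getUTCMonth(), d.getUTCDate()));
  const day = t.getUTCDay() || 7; // Mon=1 .. Sun=7
  t.setUTCDate(t.getUTCDate() + 4 - day); // move to the Thursday of this ISO week
  const yearStart = Date.UTC(t.getUTCFullYear(), 0, 1);
  const week = Math.ceil(((t.getTime() - yearStart) / 86400000 + 1) / 7);
  return `${t.getUTCFullYear()}-W${week < 10 ? "0" : ""}${week}`;
}

// Read the current and the previous ISO week so a 24h window that starts on
// Monday still sees Sunday's events.
function isoWeeksToRead(now: Date): [string, string] {
  const prev = new Date(now.getTime() - 7 * 86400000);
  return [isoWeekKey(now), isoWeekKey(prev)];
}
```

For example, `isoWeeksToRead(new Date("2026-06-01T00:30:00Z"))` (a Monday) returns `["2026-W23", "2026-W22"]`, so events from the preceding Sunday remain in scope.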

C — cause-ranked doctor (single source of truth)

`gbrain doctor` prints a "Top issues (ranked by cause)" header, with root causes listed above symptoms, and the JSON envelope gains `top_issues` so agents act on the root, not the cascade. Ranking is tier-only with a grounded `downstream_of`: a symptom is tagged as downstream of a root cause only when a small map of known same-source edges says so, never from a co-occurrence guess.
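The ranking rule can be sketched as follows. This is an illustration rather than the PR's `rankIssues`: the `Issue` shape, the numeric tiers, and the single edge in `KNOWN_EDGES` are assumptions chosen to show the idea of tagging only from a fixed edge map.

```typescript
// Hypothetical issue shape: a lower tier means closer to a root cause.
interface Issue {
  check: string;
  tier: number;
}

interface RankedIssue extends Issue {
  downstream_of?: string;
}

// Small, explicit map of symptom -> root cause. The single edge here is an
// illustrative assumption, not the PR's actual edge list.
const KNOWN_EDGES: ReadonlyMap<string, string> = new Map([
  ["queue_health", "worker_oom_loop"],
]);

function rankIssues(issues: Issue[]): RankedIssue[] {
  const present = new Set(issues.map((i) => i.check));
  return issues
    .map((issue, index) => ({ issue, index }))
    // Tier-only ordering; ties keep their original order.
    .sort((a, b) => a.issue.tier - b.issue.tier || a.index - b.index)
    .map(({ issue }) => {
      const root = KNOWN_EDGES.get(issue.check);
      // Tag only when the edge is known AND the root cause is actually firing.
      return root !== undefined && present.has(root)
        ? { ...issue, downstream_of: root }
        : { ...issue };
    });
}
```

Two issues that merely fail at the same time are never linked: without an entry in the edge map, a symptom is reported on its own.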

D — per-source auto-drain (self-heal the silent backlog)

When a pack doesn't declare `extract_atoms` but eligible pages pile up, autopilot now submits a bounded, PROTECTED `extract-atoms-drain` job per source, gated on `autopilot.auto_drain.enabled` (default on), a per-source backlog threshold, and a daily spend cap. The idempotency key is time-slotted by UTC day, so a source reopens each day instead of being blocked forever.
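A minimal sketch of the gate and the day-slotted key, under stated assumptions: the `DrainGateInput` field names, the spend units, and the key format are hypothetical, and only the four gate conditions and the UTC-day slotting come from the description above.

```typescript
// Hypothetical gate inputs for one source.
interface DrainGateInput {
  enabled: boolean; // autopilot.auto_drain.enabled (default on)
  packDeclaresExtractAtoms: boolean;
  backlog: number; // eligible pages waiting for this source
  backlogThreshold: number;
  spentToday: number; // spend so far today (units are an assumption)
  dailyCap: number;
}

// All four conditions must hold before a drain job is submitted.
function shouldAutoDrain(g: DrainGateInput): boolean {
  return (
    g.enabled &&
    !g.packDeclaresExtractAtoms &&
    g.backlog > g.backlogThreshold &&
    g.spentToday < g.dailyCap
  );
}

// Idempotency key slotted by UTC day: the same source maps to the same key
// within a day (so it is submitted once) and to a new key the next day.
// The key format is illustrative only.
function drainIdempotencyKey(sourceId: string, now: Date): string {
  const utcDay = now.toISOString().slice(0, 10); // "YYYY-MM-DD" in UTC
  return `extract-atoms-drain:${sourceId}:${utcDay}`;
}
```

Because `toISOString()` always renders in UTC, two submissions for the same source on `2026-06-03` share a key regardless of the host's local timezone, while a submission on `2026-06-04` gets a fresh one.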

B — `pool_reap_health` (reaped, and not recovering)

A new pool-recovery audit and a `pool_reap_health` check surface DB pool reaps and, critically, distinguish a reconnect that is failing (the actionable signal, reported as a failure) from thrashing (a warning at ≥10 reaps/hr).
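The fail-versus-warn split might look like the sketch below. The `PoolRecoveryWindow` shape, the result type, and the message wording are assumptions; the decision rule (fail on reconnect failures, warn at 10 or more reaps per hour) follows the description and the review fix noted in this PR.

```typescript
type CheckStatus = "ok" | "warn" | "fail";

// Hypothetical summary of recent pool-recovery audit events.
interface PoolRecoveryWindow {
  reapsLastHour: number; // reaps detected in the last hour
  reconnectFailures: number; // reconnect attempts that did not recover
}

interface CheckResult {
  status: CheckStatus;
  message: string;
}

function poolReapHealth(w: PoolRecoveryWindow, thrashReapsPerHour = 10): CheckResult {
  // Reconnect failures are the actionable signal, independent of the reap count,
  // so a recovered reap plus an unrelated failure is not misread as causal.
  if (w.reconnectFailures > 0) {
    return { status: "fail", message: `${w.reconnectFailures} reconnect failure(s): pool is not recovering` };
  }
  // Reaps that recover are fine in small numbers; a high rate is thrash.
  if (w.reapsLastHour >= thrashReapsPerHour) {
    return { status: "warn", message: `${w.reapsLastHour} pool reaps in the last hour (thrashing)` };
  }
  return { status: "ok", message: "pool reaps are recovering" };
}
```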

Tests

New: doctor-worker-oom-loop, doctor-pool-reap-health, doctor-cause-rank, audit/pool-recovery-audit, extract-atoms-drain-handler, autopilot-auto-drain-wiring. Extended: extract-atoms-drain (shared-helper lock contract), connection-resilience (reconnect signature), dream-cli-flags (drain routes through the shared helper), worker-lock-renewal (reconnect error-threading). Typecheck clean; the merge-overlap surface (doctor.ts, jobs.ts, config.ts, doctor-categories.ts, postgres-engine.ts, retry.ts — all touched by both this branch and master's v0.42.6–10 wave) is green.

Reviews

The plan went through /plan-eng-review plus a codex outside-voice plan challenge (10 findings, all folded in: protected drain job, time-slotted key, source enumeration, union OOM signal, cap fallback, cross-week read, reason-threaded reconnect, evidence-gated downstream, config plumbing). A second codex pass on the implementation diff caught 4 more, 3 fixed before merge: multi-source auto-drain was broken by maxWaiting coalescing across sources; pool_reap_health could false-alarm when a recovered reap co-occurred with an unrelated reconnect failure; and the common lock-renewal reap path wasn't labeled as a reap. Plan and decisions are persisted under ~/.claude/plans/.

Deferred (filed in TODOS.md): GAP E secondary-error cause_ref log tagging, and the worker_oom_loop thin-client/remote doctor path.

🤖 Generated with Claude Code

…red drain helper (#1685 GAP B, 5A)
- pool-recovery-audit.ts: reap_detected (CONNECTION_ENDED) vs reconnect_other; recovered/failed split
- postgres-engine reconnect(ctx?) classifies the triggering error so only true pooler reaps are tagged (CODEX #8)
- retry.ts reconnect callback widened to thread the error; retry-matcher isConnectionEndedError
- runExtractAtomsDrainForSource shared helper (cycleLockIdFor + withRefreshingLock) — one drain path (5A)
- supervisor-audit readRecentSupervisorEvents (current+prev ISO week, CODEX #7)
- extract-atoms-drain PROTECTED; autopilot.auto_drain.* config keys

…d top_issues (#1685 GAP A/B/C)
- computeWorkerOomLoopCheck: unions supervisor rss_watchdog + minion_jobs watchdog-abort (CODEX #5), cap fallback to resolveDefaultMaxRssMb (CODEX #6)
- computePoolReapHealthCheck: reaps-not-recovering fail, thrash warn
- doctor-cause-rank rankIssues: tier ordering + grounded downstream_of (CODEX #9) + drift guard (4A)
- supervisor causeStr + queue_health cross-reference worker_oom_loop (DRY 1C)
- register both checks in doctor-categories ops

…m --drain refactor (#1685 GAP D)
- autopilot per-source gate: enabled + !packDeclares + backlog>threshold + daily cap; time-sloted idempotency key (CODEX #2)
- extract-atoms-drain Minion handler (thin wrapper, LockUnavailableError -> deferred)
- dream --drain routes through the shared helper (5A)

#1685 brain-health-as-solved-problem: cause-ranked doctor, worker_oom_loop line, per-source auto-drain, pool-reap health. Layers on #1678/#1735. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…f-heal # Conflicts: # CHANGELOG.md # VERSION # package.json

…-reap signal, lock-renewal reap labeling
- autopilot: drop maxWaiting (coalesces by name+queue not source → only one source drained + cap over-count); pre-check idempotency key so only genuinely-new sources submit+count
- pool_reap_health: fail on reconnect FAILURES (the real signal), not reaps>0&&failures>0 (false causality when a recovered reap + unrelated failure co-occur)
- lock-renewal-tick threads its triggering error to reconnect() so a CONNECTION_ENDED pooler reap is labeled reap_detected not reconnect_other (pool_reap_health now fires for the #1678 incident path)

Slot collision avoidance per queue. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…f-heal # Conflicts: # CHANGELOG.md # CLAUDE.md # TODOS.md # VERSION # llms-full.txt # package.json # src/core/config.ts # src/core/doctor-categories.ts

…x check:doc-history) The master merge wrongly kept the pre-restructure 577KB CLAUDE.md; the check:doc-history guard caps it at 60KB. Take master's slim CLAUDE.md and record the #1685 files (doctor-cause-rank, pool-recovery-audit, worker_oom_loop + pool_reap_health checks, auto-drain, 5A helper) as current-state prose in docs/architecture/KEY_FILES.md (no release markers). llms regenerated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* upstream/master:
  v0.42.23.0 feat(jobs): --nice scheduling-priority flag for jobs work/supervisor (garrytan#1815) (garrytan#1820)
  v0.42.22.0 fix(minions): supervisor progress watchdog + worker DB self-defense — alive-but-wedged worker self-heals (garrytan#1801) (garrytan#1824)
  v0.42.21.0 fix(postgres): module-singleton ownership — canonical landing for the dream-cycle "connect() has not been called" class (garrytan#1404/garrytan#1471/garrytan#1619) (garrytan#1805)
  v0.42.20.0 fix: reliability wave — PGLite capture lock-pin + Postgres reconnect race + search embed-hang (garrytan#1762 garrytan#1745 garrytan#1775) (garrytan#1810)
  v0.42.19.0 fix(skillopt): close the last gap in the AI SDK v6 tool-loop fix (write-capture mapper + regression test) (garrytan#1809)
  v0.42.18.0 fix: sync orphan-pileup watchdog (garrytan#1633) + links-lag µs stamp (garrytan#1768) (garrytan#1807)
  v0.42.17.0 fix(sync): resumable incremental sync — killed mid-import no longer loses progress (garrytan#1794) (garrytan#1808)
  v0.42.16.0 feat(doctor): brain health as a solved problem — cause-ranked doctor + OOM-loop line + auto-drain + pool-reap (garrytan#1685) (garrytan#1802)
  v0.42.15.0 fix: decouple CLI primary output from process.stdout.isTTY (garrytan#1784) (garrytan#1806)
  v0.42.14.0 fix(zero-config): code-* readiness signal + init embedding-key validation + lock self-heal (garrytan#1780) (garrytan#1804)
  v0.42.13.0 fix(search): archive/ content findable by default, demoted not hard-excluded (garrytan#1777) (garrytan#1797)
  v0.42.12.0 feat: self-upgrading gbrain — invocation-riding update check + opt-in auto-upgrade (garrytan#1798)
  v0.42.11.0 feat(skillopt): held-out eval gate, honest receipts, ENFORCE + ablation opts (garrytan#1759)
  v0.42.10.0 feat(extract): opt-in global-basename wikilink resolution (closes garrytan#972) (garrytan#1388)

garrytan and others added 8 commits June 2, 2026 21:28

chore: bump version and changelog (v0.42.12.0)

81df044

#1685 brain-health-as-solved-problem: cause-ranked doctor, worker_oom_loop line, per-source auto-drain, pool-reap health. Layers on #1678/#1735. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/master' into garrytan/doctor-sel…

9dd40d2

…f-heal # Conflicts: # CHANGELOG.md # VERSION # package.json

docs(todos): file #1685 GAP E + remote-path follow-ups (v0.42.12.0)

e1729ab

chore: re-version v0.42.12.0 → v0.42.16.0 (#1685)

6dd5e94

Slot collision avoidance per queue. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

garrytan and others added 2 commits June 3, 2026 07:10

Merge remote-tracking branch 'origin/master' into garrytan/doctor-sel…

e08b634

…f-heal # Conflicts: # CHANGELOG.md # CLAUDE.md # TODOS.md # VERSION # llms-full.txt # package.json # src/core/config.ts # src/core/doctor-categories.ts

garrytan merged commit 3fe4493 into master Jun 3, 2026
21 checks passed

garrytan mentioned this pull request Jun 3, 2026

v0.42.22.0 fix(minions): supervisor progress watchdog + worker DB self-defense — alive-but-wedged worker self-heals (#1801) #1824

Merged

garrytan merged 10 commits into master from garrytan/doctor-self-heal
