
v0.42.16.0 feat(doctor): brain health as a solved problem — cause-ranked doctor + OOM-loop line + auto-drain + pool-reap (#1685) #1802

Merged
garrytan merged 10 commits into master from garrytan/doctor-self-heal
Jun 3, 2026

Conversation

@garrytan garrytan commented Jun 3, 2026

Owner

Closes #1685. Layers the posture work on the mechanical foundation from #1678/#1735 (already in master at v0.42.5.0).

What this delivers

#1685 is the design invariant: an operator (human or agent) should never grep a worker log or read source to learn the brain is unhealthy or why. #1735 fixed the three concrete #1678 bugs (self-identifying RSS watchdog exit, cgroup-aware cap, pooler-reap self-heal, silent-backlog doctor warn). This PR closes the four remaining #1685-distinctive gaps.

A — worker_oom_loop doctor check (one line names the cause)

A worker OOM-looping now shows as a single gbrain doctor line: Worker OOM-looping: cap=8192MB, N watchdog kills/24h → raise --max-rss. It unions both worker modes — supervised workers (supervisor rss_watchdog crash bucket, read across current+prev ISO week so a Monday window can't lose Sunday) AND bare gbrain jobs work workers (the minion_jobs aborted: watchdog count, the same source queue_health reads). Cap comes from the breaker alert when present, else falls back to the auto-sized default. No grep, no source-reading.
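A minimal sketch of how the union and the cross-week read described above could fit together. This is not the actual `computeWorkerOomLoopCheck`: the input shape, the `warnThreshold` default, the `status` values, and the `YYYY-Www` week-key format are all assumptions made for illustration, and "union" is modelled as a simple sum of the two counts.

```typescript
// Hypothetical input shape; the real check reads these from the supervisor
// audit and from minion_jobs, which are not reproduced here.
interface OomLoopInputs {
  supervisorWatchdogKills: number;  // rss_watchdog crashes in the last 24h
  bareWorkerWatchdogAborts: number; // "aborted: watchdog" jobs in the last 24h
  breakerCapMb?: number;            // cap from the breaker alert, when present
  defaultCapMb: number;             // auto-sized default used as the fallback
}

interface DoctorCheck {
  name: string;
  status: "ok" | "warn"; // assumed status vocabulary
  message: string;
}

// ISO-8601 week key (UTC) for a date, e.g. "2026-W23". The key format is an assumption.
function isoWeekKey(d: Date): string {
  const t = new Date(Date.UTC(d.getUTCFullYear(), d.getUTCMonth(), d.getUTCDate()));
  const day = t.getUTCDay() || 7;         // Monday=1 ... Sunday=7
  t.setUTCDate(t.getUTCDate() + 4 - day); // move to the Thursday of this ISO week
  const yearStart = Date.UTC(t.getUTCFullYear(), 0, 1);
  const week = Math.ceil(((t.getTime() - yearStart) / 86_400_000 + 1) / 7);
  return `${t.getUTCFullYear()}-W${week < 10 ? "0" : ""}${week}`;
}

// Read the current and the previous ISO week so a 24h window that starts on
// Monday still sees Sunday's events.
function isoWeekKeysToRead(now: Date): [string, string] {
  const weekAgo = new Date(now.getTime() - 7 * 86_400_000);
  return [isoWeekKey(now), isoWeekKey(weekAgo)];
}

function sketchWorkerOomLoopCheck(input: OomLoopInputs, warnThreshold = 1): DoctorCheck {
  // Both worker modes contribute to one count.
  const kills = input.supervisorWatchdogKills + input.bareWorkerWatchdogAborts;
  // Prefer the cap named by the breaker alert, else the auto-sized default.
  const capMb = input.breakerCapMb ?? input.defaultCapMb;
  if (kills < warnThreshold) {
    return { name: "worker_oom_loop", status: "ok", message: "No watchdog kills in the last 24h" };
  }
  return {
    name: "worker_oom_loop",
    status: "warn",
    message: `Worker OOM-looping: cap=${capMb}MB, ${kills} watchdog kills/24h → raise --max-rss`,
  };
}
```

The point of the shape is that the operator-facing line is built from one merged count and one resolved cap, so neither worker mode can hide the loop.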

C — cause-ranked doctor (single source of truth)

gbrain doctor prints a Top issues (ranked by cause) header — root causes above symptoms — and the JSON envelope gains top_issues so agents act on the root, not the cascade. Ranking is tier-only with grounded downstream_of: a symptom is tagged downstream of a root cause ONLY from a small map of known same-source edges, never a co-occurrence guess.
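A rough sketch of tier-only ranking with grounded `downstream_of`, under stated assumptions. The `Issue` shape, the tier names, and the contents of `KNOWN_EDGES` are hypothetical; the single `queue_health` edge is only a plausible example based on the cross-reference mentioned in the commits, not the shipped map, and `sync_lag` in the usage note is a made-up check name.

```typescript
type Tier = "root_cause" | "symptom"; // assumed tier vocabulary

interface Issue {
  check: string;
  tier: Tier;
  message: string;
  downstream_of?: string;
}

// Hypothetical map of known same-source edges: symptom check -> possible roots.
const KNOWN_EDGES: Record<string, string[]> = {
  queue_health: ["worker_oom_loop"],
};

function sketchRankIssues(issues: Issue[]): Issue[] {
  // Only root causes that are actually firing can be named as a cause.
  const firingRoots = new Set(
    issues.filter((i) => i.tier === "root_cause").map((i) => i.check),
  );

  const tagged = issues.map((issue) => {
    if (issue.tier !== "symptom") return issue;
    // Tag only from the edge map, never from mere co-occurrence.
    const root = (KNOWN_EDGES[issue.check] ?? []).find((r) => firingRoots.has(r));
    return root ? { ...issue, downstream_of: root } : issue;
  });

  // Tier-only ordering; ties keep their original relative order.
  const tierOrder: Record<Tier, number> = { root_cause: 0, symptom: 1 };
  return tagged
    .map((issue, idx) => ({ issue, idx }))
    .sort((a, b) => tierOrder[a.issue.tier] - tierOrder[b.issue.tier] || a.idx - b.idx)
    .map((entry) => entry.issue);
}
```

With a `queue_health` symptom, an unrelated `sync_lag` symptom, and a firing `worker_oom_loop` root, the root sorts first, `queue_health` is tagged `downstream_of: "worker_oom_loop"`, and `sync_lag` stays untagged because no edge grounds it.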

D — per-source auto-drain (self-heal the silent backlog)

When a pack doesn't declare extract_atoms but eligible pages pile up, autopilot now submits a bounded, PROTECTED extract-atoms-drain job per source. Submission is gated on autopilot.auto_drain.enabled (default on), a per-source backlog threshold, and a daily spend cap. The idempotency key is time-slotted by UTC day, so a source reopens each day instead of being blocked forever.
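The gate and the day-slotted key can be sketched as below. This is an illustration, not the autopilot code: the config field names other than `enabled`, the key format, and the reading of the daily cap as a count of submissions per UTC day are all assumptions. The key pre-check follows the behaviour described in the later fix commit (only genuinely new sources submit and count).

```typescript
interface AutoDrainConfig {
  enabled: boolean;         // autopilot.auto_drain.enabled
  backlogThreshold: number; // hypothetical field name
  dailyCap: number;         // hypothetical: max submissions per UTC day
}

interface SourceBacklog {
  sourceId: string;
  packDeclaresExtractAtoms: boolean;
  eligiblePages: number;
}

// One key per source per UTC day; toISOString() is always UTC.
// The "extract-atoms-drain:<source>:<day>" layout is an assumed format.
function drainKey(sourceId: string, now: Date): string {
  const utcDay = now.toISOString().slice(0, 10); // "YYYY-MM-DD"
  return `extract-atoms-drain:${sourceId}:${utcDay}`;
}

// Returns the idempotency keys that would be submitted on this tick.
function planAutoDrain(
  sources: SourceBacklog[],
  cfg: AutoDrainConfig,
  now: Date,
  existingKeys: ReadonlySet<string>,
  submittedToday: number,
): string[] {
  if (!cfg.enabled) return [];
  const toSubmit: string[] = [];
  let used = submittedToday;
  for (const source of sources) {
    if (source.packDeclaresExtractAtoms) continue;              // pack handles it itself
    if (source.eligiblePages <= cfg.backlogThreshold) continue; // backlog too small
    const key = drainKey(source.sourceId, now);
    if (existingKeys.has(key)) continue; // already drained today: skip, don't count
    if (used >= cfg.dailyCap) break;     // daily cap reached
    toSubmit.push(key);
    used += 1;
  }
  return toSubmit;
}
```

Because the source id is part of the key, two sources with a backlog on the same day produce two distinct jobs, and the same source gets a fresh key once the UTC date rolls over.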

B — pool_reap_health (reaped, and not recovering)

A new pool-recovery audit + pool_reap_health check surface DB pool reaps and, critically, whether reconnect is failing (the actionable signal) vs thrashing (warn at ≥10 reaps/hr).
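One possible reading of the fail/warn split, sketched under assumptions. The event shape borrows the `reap_detected` / `reconnect_other` labels from the commit notes, but the field names, the one-hour window for both conditions, and the choice to fail only when a reap-triggered reconnect failed are guesses, not the shipped `computePoolReapHealthCheck`.

```typescript
type PoolEventKind = "reap_detected" | "reconnect_other";

// Hypothetical audit event; the real pool-recovery audit schema is not shown.
interface PoolRecoveryEvent {
  kind: PoolEventKind;
  outcome: "recovered" | "failed";
  atMs: number; // epoch milliseconds
}

interface PoolReapHealth {
  status: "ok" | "warn" | "fail";
  message: string;
}

function sketchPoolReapHealth(
  events: PoolRecoveryEvent[],
  nowMs: number,
  thrashPerHour = 10, // warn threshold from the PR text: >=10 reaps/hr
): PoolReapHealth {
  const hourAgo = nowMs - 60 * 60 * 1000;
  const recentReaps = events.filter(
    (e) => e.kind === "reap_detected" && e.atMs >= hourAgo,
  );

  // Assumed reading: the actionable signal is a reap whose own reconnect
  // failed, so an unrelated reconnect failure next to a recovered reap
  // does not trip the check.
  const failedReaps = recentReaps.filter((e) => e.outcome === "failed").length;
  if (failedReaps > 0) {
    return { status: "fail", message: `${failedReaps} pool reap(s) not recovering: reconnect is failing` };
  }
  if (recentReaps.length >= thrashPerHour) {
    return { status: "warn", message: `${recentReaps.length} pool reaps in the last hour (thrashing)` };
  }
  return { status: "ok", message: "No unrecovered pool reaps" };
}
```

Evaluating each reap with its own outcome, rather than comparing two aggregate counters, is what avoids the false-causality case called out in the review notes.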

Tests

New: doctor-worker-oom-loop, doctor-pool-reap-health, doctor-cause-rank, audit/pool-recovery-audit, extract-atoms-drain-handler, autopilot-auto-drain-wiring. Extended: extract-atoms-drain (shared-helper lock contract), connection-resilience (reconnect signature), dream-cli-flags (drain routes through the shared helper), worker-lock-renewal (reconnect error-threading). Typecheck clean; the merge-overlap surface (doctor.ts, jobs.ts, config.ts, doctor-categories.ts, postgres-engine.ts, retry.ts — all touched by both this branch and master's v0.42.6–10 wave) is green.

Reviews

Plan went through /plan-eng-review + a codex outside-voice plan challenge (10 findings, all folded: protected drain job, time-slotted key, source enumeration, union OOM signal, cap fallback, cross-week read, reason-threaded reconnect, evidence-gated downstream, config plumbing). A second codex pass on the implementation diff caught 4 more, 3 fixed before merge: multi-source auto-drain was broken by maxWaiting coalescing across sources; pool_reap_health could false-alarm on a recovered reap + an unrelated reconnect failure; the common lock-renewal reap path wasn't labeled as a reap. Plan + decisions persisted under ~/.claude/plans/.

Deferred (filed in TODOS.md): GAP E secondary-error cause_ref log tagging, and the worker_oom_loop thin-client/remote doctor path.

🤖 Generated with Claude Code

garrytan and others added 8 commits June 2, 2026 21:28
…red drain helper (#1685 GAP B, 5A)

- pool-recovery-audit.ts: reap_detected (CONNECTION_ENDED) vs reconnect_other; recovered/failed split
- postgres-engine reconnect(ctx?) classifies the triggering error so only true pooler reaps are tagged (CODEX #8)
- retry.ts reconnect callback widened to thread the error; retry-matcher isConnectionEndedError
- runExtractAtomsDrainForSource shared helper (cycleLockIdFor + withRefreshingLock) — one drain path (5A)
- supervisor-audit readRecentSupervisorEvents (current+prev ISO week, CODEX #7)
- extract-atoms-drain PROTECTED; autopilot.auto_drain.* config keys
…d top_issues (#1685 GAP A/B/C)

- computeWorkerOomLoopCheck: unions supervisor rss_watchdog + minion_jobs watchdog-abort (CODEX #5), cap fallback to resolveDefaultMaxRssMb (CODEX #6)
- computePoolReapHealthCheck: reaps-not-recovering fail, thrash warn
- doctor-cause-rank rankIssues: tier ordering + grounded downstream_of (CODEX #9) + drift guard (4A)
- supervisor causeStr + queue_health cross-reference worker_oom_loop (DRY 1C)
- register both checks in doctor-categories ops
…m --drain refactor (#1685 GAP D)

- autopilot per-source gate: enabled + !packDeclares + backlog>threshold + daily cap; time-sloted idempotency key (CODEX #2)
- extract-atoms-drain Minion handler (thin wrapper, LockUnavailableError -> deferred)
- dream --drain routes through the shared helper (5A)
#1685 brain-health-as-solved-problem: cause-ranked doctor, worker_oom_loop
line, per-source auto-drain, pool-reap health. Layers on #1678/#1735.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…f-heal

# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
…-reap signal, lock-renewal reap labeling

- autopilot: drop maxWaiting (coalesces by name+queue not source → only one source drained + cap over-count); pre-check idempotency key so only genuinely-new sources submit+count
- pool_reap_health: fail on reconnect FAILURES (the real signal), not reaps>0&&failures>0 (false causality when a recovered reap + unrelated failure co-occur)
- lock-renewal-tick threads its triggering error to reconnect() so a CONNECTION_ENDED pooler reap is labeled reap_detected not reconnect_other (pool_reap_health now fires for the #1678 incident path)
Slot collision avoidance per queue.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@garrytan garrytan changed the title from "v0.42.12.0 feat(doctor): brain health as a solved problem — cause-ranked doctor + OOM-loop line + auto-drain + pool-reap (#1685)" to "v0.42.16.0 feat(doctor): brain health as a solved problem — cause-ranked doctor + OOM-loop line + auto-drain + pool-reap (#1685)" Jun 3, 2026
garrytan and others added 2 commits June 3, 2026 07:10
…f-heal

# Conflicts:
#	CHANGELOG.md
#	CLAUDE.md
#	TODOS.md
#	VERSION
#	llms-full.txt
#	package.json
#	src/core/config.ts
#	src/core/doctor-categories.ts
…x check:doc-history)

The master merge wrongly kept the pre-restructure 577KB CLAUDE.md; the
check:doc-history guard caps it at 60KB. Take master's slim CLAUDE.md and
record the #1685 files (doctor-cause-rank, pool-recovery-audit, worker_oom_loop
+ pool_reap_health checks, auto-drain, 5A helper) as current-state prose in
docs/architecture/KEY_FILES.md (no release markers). llms regenerated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@garrytan garrytan merged commit 3fe4493 into master Jun 3, 2026
21 checks passed
mgunnin added a commit to mgunnin/gbrain that referenced this pull request Jun 3, 2026
* upstream/master:
  v0.42.23.0 feat(jobs): --nice scheduling-priority flag for jobs work/supervisor (garrytan#1815) (garrytan#1820)
  v0.42.22.0 fix(minions): supervisor progress watchdog + worker DB self-defense — alive-but-wedged worker self-heals (garrytan#1801) (garrytan#1824)
  v0.42.21.0 fix(postgres): module-singleton ownership — canonical landing for the dream-cycle "connect() has not been called" class (garrytan#1404/garrytan#1471/garrytan#1619) (garrytan#1805)
  v0.42.20.0 fix: reliability wave — PGLite capture lock-pin + Postgres reconnect race + search embed-hang (garrytan#1762 garrytan#1745 garrytan#1775) (garrytan#1810)
  v0.42.19.0 fix(skillopt): close the last gap in the AI SDK v6 tool-loop fix (write-capture mapper + regression test) (garrytan#1809)
  v0.42.18.0 fix: sync orphan-pileup watchdog (garrytan#1633) + links-lag µs stamp (garrytan#1768) (garrytan#1807)
  v0.42.17.0 fix(sync): resumable incremental sync — killed mid-import no longer loses progress (garrytan#1794) (garrytan#1808)
  v0.42.16.0 feat(doctor): brain health as a solved problem — cause-ranked doctor + OOM-loop line + auto-drain + pool-reap (garrytan#1685) (garrytan#1802)
  v0.42.15.0 fix: decouple CLI primary output from process.stdout.isTTY (garrytan#1784) (garrytan#1806)
  v0.42.14.0 fix(zero-config): code-* readiness signal + init embedding-key validation + lock self-heal (garrytan#1780) (garrytan#1804)
  v0.42.13.0 fix(search): archive/ content findable by default, demoted not hard-excluded (garrytan#1777) (garrytan#1797)
  v0.42.12.0 feat: self-upgrading gbrain — invocation-riding update check + opt-in auto-upgrade (garrytan#1798)
  v0.42.11.0 feat(skillopt): held-out eval gate, honest receipts, ENFORCE + ablation opts (garrytan#1759)
  v0.42.10.0 feat(extract): opt-in global-basename wikilink resolution (closes garrytan#972) (garrytan#1388)

Development

Successfully merging this pull request may close these issues.

[Meta/Posture] Brain health should be a solved problem — zero forensics, self-diagnosing, self-healing, loud-but-precise

1 participant