v0.42.16.0 feat(doctor): brain health as a solved problem — cause-ranked doctor + OOM-loop line + auto-drain + pool-reap (#1685)#1802
Merged
Conversation
…red drain helper (#1685 GAP B, 5A) - pool-recovery-audit.ts: reap_detected (CONNECTION_ENDED) vs reconnect_other; recovered/failed split - postgres-engine reconnect(ctx?) classifies the triggering error so only true pooler reaps are tagged (CODEX #8) - retry.ts reconnect callback widened to thread the error; retry-matcher isConnectionEndedError - runExtractAtomsDrainForSource shared helper (cycleLockIdFor + withRefreshingLock) — one drain path (5A) - supervisor-audit readRecentSupervisorEvents (current+prev ISO week, CODEX #7) - extract-atoms-drain PROTECTED; autopilot.auto_drain.* config keys
…d top_issues (#1685 GAP A/B/C) - computeWorkerOomLoopCheck: unions supervisor rss_watchdog + minion_jobs watchdog-abort (CODEX #5), cap fallback to resolveDefaultMaxRssMb (CODEX #6) - computePoolReapHealthCheck: reaps-not-recovering fail, thrash warn - doctor-cause-rank rankIssues: tier ordering + grounded downstream_of (CODEX #9) + drift guard (4A) - supervisor causeStr + queue_health cross-reference worker_oom_loop (DRY 1C) - register both checks in doctor-categories ops
…f-heal # Conflicts: # CHANGELOG.md # VERSION # package.json
…-reap signal, lock-renewal reap labeling - autopilot: drop maxWaiting (coalesces by name+queue not source → only one source drained + cap over-count); pre-check idempotency key so only genuinely-new sources submit+count - pool_reap_health: fail on reconnect FAILURES (the real signal), not reaps>0&&failures>0 (false causality when a recovered reap + unrelated failure co-occur) - lock-renewal-tick threads its triggering error to reconnect() so a CONNECTION_ENDED pooler reap is labeled reap_detected not reconnect_other (pool_reap_health now fires for the #1678 incident path)
Slot collision avoidance per queue. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…f-heal # Conflicts: # CHANGELOG.md # CLAUDE.md # TODOS.md # VERSION # llms-full.txt # package.json # src/core/config.ts # src/core/doctor-categories.ts
…x check:doc-history) The master merge wrongly kept the pre-restructure 577KB CLAUDE.md; the check:doc-history guard caps it at 60KB. Take master's slim CLAUDE.md and record the #1685 files (doctor-cause-rank, pool-recovery-audit, worker_oom_loop + pool_reap_health checks, auto-drain, 5A helper) as current-state prose in docs/architecture/KEY_FILES.md (no release markers). llms regenerated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mgunnin added a commit to mgunnin/gbrain that referenced this pull request Jun 3, 2026
* upstream/master: v0.42.23.0 feat(jobs): --nice scheduling-priority flag for jobs work/supervisor (garrytan#1815) (garrytan#1820) v0.42.22.0 fix(minions): supervisor progress watchdog + worker DB self-defense — alive-but-wedged worker self-heals (garrytan#1801) (garrytan#1824) v0.42.21.0 fix(postgres): module-singleton ownership — canonical landing for the dream-cycle "connect() has not been called" class (garrytan#1404/garrytan#1471/garrytan#1619) (garrytan#1805) v0.42.20.0 fix: reliability wave — PGLite capture lock-pin + Postgres reconnect race + search embed-hang (garrytan#1762 garrytan#1745 garrytan#1775) (garrytan#1810) v0.42.19.0 fix(skillopt): close the last gap in the AI SDK v6 tool-loop fix (write-capture mapper + regression test) (garrytan#1809) v0.42.18.0 fix: sync orphan-pileup watchdog (garrytan#1633) + links-lag µs stamp (garrytan#1768) (garrytan#1807) v0.42.17.0 fix(sync): resumable incremental sync — killed mid-import no longer loses progress (garrytan#1794) (garrytan#1808) v0.42.16.0 feat(doctor): brain health as a solved problem — cause-ranked doctor + OOM-loop line + auto-drain + pool-reap (garrytan#1685) (garrytan#1802) v0.42.15.0 fix: decouple CLI primary output from process.stdout.isTTY (garrytan#1784) (garrytan#1806) v0.42.14.0 fix(zero-config): code-* readiness signal + init embedding-key validation + lock self-heal (garrytan#1780) (garrytan#1804) v0.42.13.0 fix(search): archive/ content findable by default, demoted not hard-excluded (garrytan#1777) (garrytan#1797) v0.42.12.0 feat: self-upgrading gbrain — invocation-riding update check + opt-in auto-upgrade (garrytan#1798) v0.42.11.0 feat(skillopt): held-out eval gate, honest receipts, ENFORCE + ablation opts (garrytan#1759) v0.42.10.0 feat(extract): opt-in global-basename wikilink resolution (closes garrytan#972) (garrytan#1388)
This was referenced Jun 8, 2026
Closed
This was referenced Jun 8, 2026
fix(code-graph): TS/TSX call edges now resolve (const-arrow naming + source_id on edge writes)
#1723
Closed
Closed
Closes #1685. Layers the posture work on the mechanical foundation from #1678/#1735 (already in master at v0.42.5.0).
What this delivers
#1685 is the design invariant: an operator (human or agent) should never grep a worker log or read source to learn the brain is unhealthy or why. #1735 fixed the three concrete #1678 bugs (self-identifying RSS watchdog exit, cgroup-aware cap, pooler-reap self-heal, silent-backlog doctor warn). This PR closes the four remaining #1685-distinctive gaps.
A: `worker_oom_loop` doctor check (one line names the cause)
A worker that is OOM-looping now shows up as a single `gbrain doctor` line: `Worker OOM-looping: cap=8192MB, N watchdog kills/24h → raise --max-rss`. The check unions both worker modes: supervised workers (the supervisor `rss_watchdog` crash bucket, read across the current and previous ISO week so a Monday window can't lose Sunday) and bare `gbrain jobs work` workers (the `minion_jobs` `aborted: watchdog` count, the same source `queue_health` reads). The cap comes from the breaker alert when present, else falls back to the auto-sized default. No grep, no source-reading.
C: cause-ranked doctor (single source of truth)
`gbrain doctor` prints a `Top issues (ranked by cause)` header, with root causes above symptoms, and the JSON envelope gains `top_issues` so agents act on the root, not the cascade. Ranking is tier-only with grounded `downstream_of`: a symptom is tagged as downstream of a root cause only from a small map of known same-source edges, never from a co-occurrence guess.
D: per-source auto-drain (self-heal the silent backlog)
When a pack doesn't declare `extract_atoms` but eligible pages pile up, autopilot now submits a bounded, PROTECTED `extract-atoms-drain` job per source, gated on `autopilot.auto_drain.enabled` (default on), a per-source backlog threshold, and a daily spend cap. The idempotency key is time-slotted by UTC day, so a source reopens each day instead of being blocked forever.
B: `pool_reap_health` (reaped, and not recovering)
A new pool-recovery audit plus the `pool_reap_health` check surface DB pool reaps and, critically, whether reconnect is failing (the actionable signal) or merely thrashing (warn at ≥10 reaps/hr).
Tests
New: `doctor-worker-oom-loop`, `doctor-pool-reap-health`, `doctor-cause-rank`, `audit/pool-recovery-audit`, `extract-atoms-drain-handler`, `autopilot-auto-drain-wiring`.
Extended: `extract-atoms-drain` (shared-helper lock contract), `connection-resilience` (reconnect signature), `dream-cli-flags` (drain routes through the shared helper), `worker-lock-renewal` (reconnect error-threading).
Typecheck clean; the merge-overlap surface (doctor.ts, jobs.ts, config.ts, doctor-categories.ts, postgres-engine.ts, retry.ts, all touched by both this branch and master's v0.42.6–10 wave) is green.
Reviews
Plan went through `/plan-eng-review` plus a codex outside-voice plan challenge (10 findings, all folded: protected drain job, time-slotted key, source enumeration, union OOM signal, cap fallback, cross-week read, reason-threaded reconnect, evidence-gated downstream, config plumbing). A second codex pass on the implementation diff caught 4 more, 3 fixed before merge: multi-source auto-drain was broken by `maxWaiting` coalescing across sources; `pool_reap_health` could false-alarm on a recovered reap plus an unrelated reconnect failure; the common lock-renewal reap path wasn't labeled as a reap. Plan and decisions persisted under `~/.claude/plans/`.
Deferred (filed in TODOS.md): GAP E secondary-error `cause_ref` log tagging, and the `worker_oom_loop` thin-client/remote doctor path.
🤖 Generated with Claude Code
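Reviewer aid for the cause-ranked doctor (C): a minimal sketch of tier-only ranking with grounded `downstream_of`, not the code in doctor-cause-rank. The `Issue` shape, the tier names, the `rankIssuesSketch` name, and the single `queue_health` → `worker_oom_loop` edge are all assumptions for illustration; the sketch also assumes a symptom is tagged only when its root cause is itself firing.

```typescript
// Hypothetical shapes: the real doctor-cause-rank types are not shown in this PR.
type Tier = "root_cause" | "symptom";

interface Issue {
  check: string; // check name, e.g. "worker_oom_loop"
  tier: Tier;
  message: string;
}

interface RankedIssue extends Issue {
  downstream_of?: string;
}

// Assumed edge map (symptom check -> root-cause check). The single edge is
// illustrative, based on queue_health cross-referencing worker_oom_loop.
const KNOWN_EDGES: ReadonlyMap<string, string> = new Map([
  ["queue_health", "worker_oom_loop"],
]);

const TIER_ORDER: Record<Tier, number> = { root_cause: 0, symptom: 1 };

function rankIssuesSketch(issues: Issue[]): RankedIssue[] {
  const firing = new Set(issues.map((i) => i.check));
  const tagged = issues.map((issue): RankedIssue => {
    const root = KNOWN_EDGES.get(issue.check);
    // Tag only when the edge is known AND the root cause is itself firing;
    // two unrelated checks failing together never produce a tag.
    return root !== undefined && firing.has(root)
      ? { ...issue, downstream_of: root }
      : { ...issue };
  });
  // Tier-only ordering; Array.prototype.sort is stable, so ties keep input order.
  return tagged.sort((a, b) => TIER_ORDER[a.tier] - TIER_ORDER[b.tier]);
}
```

The point of the explicit map is that co-occurrence alone is never evidence: `pool_reap_health` firing next to `worker_oom_loop` stays untagged unless someone adds that edge deliberately.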
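Reviewer aid for the per-source auto-drain (D): a rough sketch of a UTC-day time-slotted idempotency key with a pre-check before submitting. The key layout, the `SourceBacklog` shape, the `>=` threshold comparison, and both function names are hypothetical; the enable flag and the daily spend cap are left out to keep the sketch short.

```typescript
// Hypothetical shape: one backlog reading per source.
interface SourceBacklog {
  sourceId: string;
  eligiblePages: number;
}

// Assumed key layout. toISOString() is always UTC, so the first 10 characters
// are the UTC calendar day (YYYY-MM-DD) regardless of the host time zone.
function drainKey(sourceId: string, now: Date): string {
  const utcDay = now.toISOString().slice(0, 10);
  return `extract-atoms-drain:${sourceId}:${utcDay}`;
}

// Returns the keys that would actually be submitted: sources at or over the
// backlog threshold whose key for today has not been submitted yet.
function keysToSubmit(
  sources: SourceBacklog[],
  threshold: number,
  alreadySubmitted: ReadonlySet<string>,
  now: Date,
): string[] {
  return sources
    .filter((s) => s.eligiblePages >= threshold)
    .map((s) => drainKey(s.sourceId, now))
    .filter((key) => !alreadySubmitted.has(key));
}
```

Because the day is part of the key, yesterday's submission never matches today's pre-check, which is what lets a source reopen each day rather than stay blocked.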
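Reviewer aid for `pool_reap_health` (B): a sketch of the status rule as described here, where any reconnect failure fails the check and a high reap rate only warns. The event shape, the `poolReapHealthSketch` name, and the choice to count failures across all supplied events are assumptions; only the `reap_detected` / `reconnect_other` labels, the recovered/failed split, and the 10 reaps/hr threshold come from this PR.

```typescript
type ReconnectKind = "reap_detected" | "reconnect_other";
type ReconnectOutcome = "recovered" | "failed";

// Hypothetical audit record; the real pool-recovery-audit schema is not shown.
interface PoolRecoveryEvent {
  kind: ReconnectKind;
  outcome: ReconnectOutcome;
  atMs: number; // epoch milliseconds
}

type CheckStatus = "ok" | "warn" | "fail";

const THRASH_REAPS_PER_HOUR = 10;
const HOUR_MS = 60 * 60 * 1000;

function poolReapHealthSketch(
  events: PoolRecoveryEvent[],
  nowMs: number,
): { status: CheckStatus; reapsLastHour: number; failures: number } {
  // A failed reconnect is the actionable signal, whatever triggered it.
  const failures = events.filter((e) => e.outcome === "failed").length;
  const reapsLastHour = events.filter(
    (e) => e.kind === "reap_detected" && e.atMs <= nowMs && nowMs - e.atMs <= HOUR_MS,
  ).length;

  let status: CheckStatus = "ok";
  if (failures > 0) {
    status = "fail";
  } else if (reapsLastHour >= THRASH_REAPS_PER_HOUR) {
    status = "warn"; // reaps are recovering, but the pool is thrashing
  }
  return { status, reapsLastHour, failures };
}
```

Keying the failure on reconnect outcome alone avoids the false causality the second codex pass flagged, where a recovered reap and an unrelated failure in the same window looked like one incident.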