v0.42.5.0 fix(minions): RSS watchdog opacity + pooler-reap self-heal + silent lens backlog + cycle lint DB-disconnect (#1678) by garrytan · Pull Request #1735 · garrytan/gbrain

garrytan · 2026-06-01T15:22:35Z

v0.42.2.0 — fixes issue #1678 (worker crash-loop + pooler-reap cascade + silent lens backlog) + a real cycle DB-disconnect bug

A production worker crash-looped ~400×/24h. Every visible log blamed the database (CONNECTION_ENDED, connect() has not been called, lock-renewal-failed) but none were the cause — the RSS watchdog was draining the worker mid-cycle on a too-low default cap, and every DB error was downstream of that. This wave fixes all three systemic problems in #1678, plus a fourth real production bug found while validating against real Postgres.

Problem 1 — RSS watchdog: opaque + footgun default

New WORKER_EXIT_RSS_WATCHDOG exit code: a watchdog drain is no longer indistinguishable from a clean queue-drain. Supervisor classifies it as likely_cause=rss_watchdog with its own sliding-window breaker (independent of the stable-run reset that hid the 400×/day loop) + a loud rss_watchdog_loop alert naming the cap.
resolveDefaultMaxRssMb() replaces the flat 2048 MB default at every spawn site (jobs work, jobs supervisor, autopilot, MinionSupervisor): clamp(0.5 × min(cgroupLimit, totalRAM), 4096, 16384), in MB. It is cgroup-aware, so a 4GB container doesn't get a 16GB cap and get SIGKILLed anyway.
80% soft-warn (peak RSS + in-flight job kind) before the kill.
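
A minimal sketch of the sizing rule and the 80% soft-warn threshold described above, written as pure functions. The function and constant names are hypothetical and are not the project's code. How the real implementation reads the cgroup limit is not shown here, so the sketch simply takes it as an optional argument.

```typescript
// Hypothetical sketch of the quoted rule:
//   clamp(0.5 x min(cgroupLimit, totalRAM), 4096, 16384)
// All names below are illustrative assumptions, not the project's API.
const MIN_CAP_MB = 4096;
const MAX_CAP_MB = 16384;

function clamp(value: number, lo: number, hi: number): number {
  return Math.min(Math.max(value, lo), hi);
}

// cgroupLimitMb is undefined when no cgroup memory limit applies (assumption).
function sketchDefaultMaxRssMb(totalRamMb: number, cgroupLimitMb?: number): number {
  const effectiveMb =
    cgroupLimitMb === undefined ? totalRamMb : Math.min(cgroupLimitMb, totalRamMb);
  return clamp(Math.floor(0.5 * effectiveMb), MIN_CAP_MB, MAX_CAP_MB);
}

// Soft-warn fires at 80% of the cap, before the watchdog kills the worker.
function softWarnThresholdMb(capMb: number): number {
  return Math.floor(0.8 * capMb);
}
```

Under these assumptions a 64 GB host with no cgroup limit lands on the 16384 ceiling, a 16 GB host gets 8192, and a 4 GB container is held at the 4096 floor rather than inheriting a cap sized for the host.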

Problem 2 — pooler-reap cascade had no recovery

CONNECTION_ENDED now classified retryable (code + message).
PostgresEngine.sql getter throws a tailored retryable error on a reaped instance pool instead of the misleading module-singleton fallthrough — and transaction() / withReservedConnection() now route through that same getter (pre-landing review caught they bypassed it).
promoteDelayed reconnect-retries; claim recovers on the next poll tick (no double-claim); the lock-renewal tick gets a bounded reconnect-once dep (no background retry racing the renewal deadline).
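
The "retryable by code + message" classification and the bounded reconnect-once behaviour can be sketched as follows. All names are hypothetical, and the helper is synchronous only to keep the example short; real database calls would be asynchronous.

```typescript
// Hypothetical sketch; not the project's retry matcher.
interface ErrorLike {
  code?: unknown;
  message?: unknown;
}

// Treat an error as a pooler reap if either its code or its message
// carries the CONNECTION_ENDED marker.
function looksLikeConnectionEnded(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const { code, message } = err as ErrorLike;
  if (code === "CONNECTION_ENDED") return true;
  return typeof message === "string" && message.indexOf("CONNECTION_ENDED") !== -1;
}

// Reconnect once, then retry once. Any other error is rethrown untouched,
// and a second failure propagates to the caller (no background retry loop).
function retryOnceAfterReconnect<T>(op: () => T, reconnect: () => void): T {
  try {
    return op();
  } catch (err) {
    if (!looksLikeConnectionEnded(err)) throw err;
    reconnect();
    return op();
  }
}
```

Keeping the retry bounded to a single reconnect matches the stated goal for the lock-renewal tick: a retry that cannot outlive the renewal deadline.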

Problem 3 — extract_atoms backlog ran silently forever

New extract_atoms_backlog doctor check (page-backlog-only label) warns with the exact --drain command when the active pack doesn't run the phase; pack-gated skips now carry a greppable pack_gated marker.
New gbrain dream --phase extract_atoms --drain [--window N]: single-hold bounded drain that takes the same cycle lock id, rediscovers eligibility each batch, reports {extracted, skipped, remaining}, and exits non-zero while work remains (a failed count also counts as incomplete).
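
The drain contract above (rediscover each batch, report {extracted, skipped, remaining}, exit non-zero while work remains or the count failed) can be sketched like this. Every name here is hypothetical: `maxBatches` is only an illustrative bound, since how `--window N` maps onto the loop is not specified here, and the exit-code value is assumed. The callbacks are synchronous for brevity.

```typescript
// Hypothetical sketch of a bounded drain; not the project's drain code.
interface BatchResult {
  extracted: number;
  skipped: number;
}

interface DrainReport {
  extracted: number;
  skipped: number;
  remaining: number | null; // null models a failed backlog count query
}

const EXIT_OK = 0;
const EXIT_INCOMPLETE = 1; // assumed value

function sketchDrain(
  runBatch: () => BatchResult, // rediscovers eligible pages on every call
  countRemaining: () => number | null,
  maxBatches: number,
): DrainReport {
  let extracted = 0;
  let skipped = 0;
  for (let i = 0; i < maxBatches; i++) {
    const batch = runBatch();
    extracted += batch.extracted;
    skipped += batch.skipped;
    if (batch.extracted + batch.skipped === 0) break; // nothing eligible left
  }
  return { extracted, skipped, remaining: countRemaining() };
}

// Only a verified zero counts as done; a null count is treated as incomplete.
function drainExitCode(report: DrainReport): number {
  return report.remaining === 0 ? EXIT_OK : EXIT_INCOMPLETE;
}
```

Treating a null count as incomplete is the point of the design: automation that keys off the exit code never concludes that an unverified backlog drained.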

Bonus — the dream cycle could kill its own DB connection (found via E2E)

The lint phase's resolveLintContentSanity created + disconnected a module-style engine to read 4 config values, which cascaded to db.disconnect() and nulled the shared singleton mid-cycle — every later phase then threw connect() has not been called. Triggers whenever loadConfig() reports a connection string (i.e. every production Postgres gbrain dream). Lint now reuses the caller's live engine; standalone gbrain lint keeps the create-own path.
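
The fix pattern described above, reusing the caller's live engine and only disconnecting an engine the function created itself, can be sketched as below. The `Engine` interface and the function name are hypothetical stand-ins, not the project's types.

```typescript
// Hypothetical sketch of "only disconnect what you created".
interface Engine {
  getConfig(key: string): string | undefined;
  disconnect(): void;
}

function readLintConfig(
  keys: string[],
  createEngine: () => Engine, // standalone path: create-own engine
  sharedEngine?: Engine, // cycle path: the caller's live engine
): Record<string, string | undefined> {
  const ownsEngine = sharedEngine === undefined;
  const engine = sharedEngine === undefined ? createEngine() : sharedEngine;
  try {
    const out: Record<string, string | undefined> = {};
    for (const key of keys) out[key] = engine.getConfig(key);
    return out;
  } finally {
    // Never tear down a connection that belongs to the caller.
    if (ownsEngine) engine.disconnect();
  }
}
```

In this shape the cycle passes its engine in and keeps its connection, while a standalone invocation still cleans up after itself.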

Review + verification

Planned + /plan-eng-review (CLEAN) + Codex outside-voice (12 findings, all absorbed) before implementation; a second Codex adversarial pass on the final diff caught 4 more (2 fixed here: getter-bypass + drain-null-count; 2 documented follow-ups: claim idempotent-recovery, PGLite drain lock-parity — both already mitigated, see TODOS).
bun run verify 29/29 · tsc clean · 12,333 unit pass (1 pre-existing isolation flake fixed) · real-Postgres E2E green incl. cycle/dream (was 3 pass/6 fail → 9/0) and db-singleton-shared-recovery.
~70 new test cases across watchdog, rss-default, retry-matcher, lock-renewal, queue, getter-selfheal, extract-atoms-backlog, drain, and lint-shared-engine.

Merged master's v0.42.1.0 (skillopt) cleanly; re-versioned 0.41.39.0 → 0.42.2.0. Plan: ~/.claude/plans/sorry-it-s-an-issue-tranquil-russell.md.

🤖 Generated with Claude Code

Documentation

CHANGELOG.md — v0.42.2.0 entry (written at ship time, ELI10-lead-first): the watchdog crash-loop story + all three #1678 fixes + the cycle lint DB-disconnect bonus + the "To take advantage of v0.42.2.0" upgrade block.
CLAUDE.md — Key Files entries added for worker-exit-codes.ts, rss-default.ts, extract-atoms-drain.ts; v0.42.2.0 annotations on worker.ts (watchdog exit + claim-catch), child-worker-supervisor.ts (rss_watchdog breaker), lock-renewal-tick.ts (reconnect dep), and dream.ts (--drain). llms-full.txt regenerated to match (test/build-llms.test.ts gate).
TODOS.md — section re-versioned to v0.42.2.0; documented two Codex pre-landing follow-ups (claim idempotent-recovery, PGLite drain lock-path parity).

Coverage

All new public surface is documented: gbrain dream --phase extract_atoms --drain [--window N] (CHANGELOG + CLAUDE.md), extract_atoms_backlog doctor check (CHANGELOG + CLAUDE.md), auto-sized --max-rss default (CHANGELOG). No architecture diagrams drifted. No documentation debt.

…pooler-reap self-heal (#1678)

Problem 1: distinct WORKER_EXIT_RSS_WATCHDOG exit code + cause-keyed supervisor breaker (bypasses the stable-run reset that hid the 400x/24h loop) + rss_watchdog audit bucket + 80% soft-warn; cgroup-aware resolveDefaultMaxRssMb replaces the flat 2048 default at every spawn site.

Problem 2: CONNECTION_ENDED classified retryable; postgres-engine sql getter throws a retryable error on a reaped instance pool instead of the misleading module-singleton fallthrough; promoteDelayed reconnect-retry; claim recovers on the next poll tick (no double-claim); lock-renewal tick reconnect-once dep.

… fix lint clobbering the shared DB connection (#1678)

Problem 3: extract_atoms_backlog doctor check + pack_gated skip marker + shared countExtractAtomsBacklog; `gbrain dream --phase extract_atoms --drain [--window N]` single-hold bounded drain (same cycleLockIdFor, rediscover each batch, reports remaining, exits non-zero while work remains).

Also fixes a real production bug found via E2E: the cycle lint phase's resolveLintContentSanity created + disconnected a module-style engine that nulled the shared db singleton mid-cycle, breaking every later phase with "connect() has not been called". Lint now reuses the caller's live engine (cycle + Minion handlers thread it; standalone CLI keeps the create-own path).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…tion through the sql getter + drain treats failed count as incomplete

Codex adversarial review findings:
- #2: transaction(), withReservedConnection(), and one other site bypassed the v0.42.2.0 sql-getter self-heal via `this._sql || db.getConnection()`, so a reaped instance pool fell through to the module singleton there. Route all three through `this.sql` so they throw the retryable instance-pool error and recover consistently (MinionQueue.transaction hits this).
- #4: `gbrain dream --drain` treated a null backlog count (query failure) as success via `remaining ?? 0`; now null exits EXIT_DRAIN_INCOMPLETE so automation never believes an unverified backlog drained.
- #1 (claim orphan) + #3 (PGLite drain lock) documented as follow-ups in TODOS.
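
The getter-bypass finding can be illustrated with a small sketch: every code path reads the connection through one getter, and that getter throws a retryable error when the instance pool is gone instead of quietly falling back to another connection. The class, the `retryable` flag, and the way a reap is detected are all hypothetical stand-ins for illustration.

```typescript
// Hypothetical stand-ins; none of these are the project's real types.
type SqlClient = (query: string) => string;

function retryableReapedPoolError(): Error & { retryable: boolean } {
  const err = new Error("instance pool was reaped; reconnect and retry") as Error & {
    retryable: boolean;
  };
  err.retryable = true;
  return err;
}

class SketchEngine {
  private _sql: SqlClient | null;

  constructor(sql: SqlClient) {
    this._sql = sql;
  }

  // Single choke point: a reaped pool throws instead of falling through
  // to some other connection.
  get sql(): SqlClient {
    if (this._sql === null) throw retryableReapedPoolError();
    return this._sql;
  }

  // Simulates the pooler reaping this instance's connections (assumption).
  markReaped(): void {
    this._sql = null;
  }

  reconnect(sql: SqlClient): void {
    this._sql = sql;
  }

  // Routes through the getter rather than reading _sql directly, so it
  // fails the same retryable way as every other caller.
  transaction<T>(fn: (sql: SqlClient) => T): T {
    return fn(this.sql);
  }
}
```

The design choice is that a helper written as `this._sql || fallback()` would skip the check entirely, which is why routing every site through the getter matters.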

Adds Key Files entries for worker-exit-codes.ts, rss-default.ts, and extract-atoms-drain.ts, plus v0.42.2.0 annotations on worker.ts, child-worker-supervisor.ts, lock-renewal-tick.ts, and dream.ts. Regenerated llms-full.txt to match (test/build-llms.test.ts gate). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ked doctor + OOM-loop line + auto-drain + pool-reap (#1685) (#1802)

* feat(minions): pool-recovery audit + reconnect reason-threading + shared drain helper (#1685 GAP B, 5A)
  - pool-recovery-audit.ts: reap_detected (CONNECTION_ENDED) vs reconnect_other; recovered/failed split
  - postgres-engine reconnect(ctx?) classifies the triggering error so only true pooler reaps are tagged (CODEX #8)
  - retry.ts reconnect callback widened to thread the error; retry-matcher isConnectionEndedError
  - runExtractAtomsDrainForSource shared helper (cycleLockIdFor + withRefreshingLock) — one drain path (5A)
  - supervisor-audit readRecentSupervisorEvents (current+prev ISO week, CODEX #7)
  - extract-atoms-drain PROTECTED; autopilot.auto_drain.* config keys
* feat(doctor): worker_oom_loop + pool_reap_health checks + cause-ranked top_issues (#1685 GAP A/B/C)
  - computeWorkerOomLoopCheck: unions supervisor rss_watchdog + minion_jobs watchdog-abort (CODEX #5), cap fallback to resolveDefaultMaxRssMb (CODEX #6)
  - computePoolReapHealthCheck: reaps-not-recovering fail, thrash warn
  - doctor-cause-rank rankIssues: tier ordering + grounded downstream_of (CODEX #9) + drift guard (4A)
  - supervisor causeStr + queue_health cross-reference worker_oom_loop (DRY 1C)
  - register both checks in doctor-categories ops
* feat(autopilot): per-source extract_atoms auto-drain + handler + dream --drain refactor (#1685 GAP D)
  - autopilot per-source gate: enabled + !packDeclares + backlog>threshold + daily cap; time-slotted idempotency key (CODEX #2)
  - extract-atoms-drain Minion handler (thin wrapper, LockUnavailableError -> deferred)
  - dream --drain routes through the shared helper (5A)
* chore: bump version and changelog (v0.42.12.0)
  #1685 brain-health-as-solved-problem: cause-ranked doctor, worker_oom_loop line, per-source auto-drain, pool-reap health. Layers on #1678/#1735.
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* docs(todos): file #1685 GAP E + remote-path follow-ups (v0.42.12.0)
* fix(#1685): pre-landing review — multi-source auto-drain, honest pool-reap signal, lock-renewal reap labeling
  - autopilot: drop maxWaiting (coalesces by name+queue not source → only one source drained + cap over-count); pre-check idempotency key so only genuinely-new sources submit+count
  - pool_reap_health: fail on reconnect FAILURES (the real signal), not reaps>0&&failures>0 (false causality when a recovered reap + unrelated failure co-occur)
  - lock-renewal-tick threads its triggering error to reconnect() so a CONNECTION_ENDED pooler reap is labeled reap_detected not reconnect_other (pool_reap_health now fires for the #1678 incident path)
* chore: re-version v0.42.12.0 → v0.42.16.0 (#1685)
  Slot collision avoidance per queue.
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* docs: restore slim CLAUDE.md + move #1685 entries to KEY_FILES.md (fix check:doc-history)
  The master merge wrongly kept the pre-restructure 577KB CLAUDE.md; the check:doc-history guard caps it at 60KB. Take master's slim CLAUDE.md and record the #1685 files (doctor-cause-rank, pool-recovery-audit, worker_oom_loop + pool_reap_health checks, auto-drain, 5A helper) as current-state prose in docs/architecture/KEY_FILES.md (no release markers). llms regenerated.
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* upstream/master:
  v0.42.8.0 feat: content-quality gate on sync — quarantine junk + flag boilerplate (garrytan#1699) (garrytan#1756)
  v0.42.7.0 feat(extract): link/timeline extraction freshness watermark — gbrain extract --stale + doctor lag check (garrytan#1696) (garrytan#1755)
  v0.42.6.0 feat(enrich): gbrain enrich --thin — brain-internal grounded synthesis for stub pages (garrytan#1700) (garrytan#1757)
  v0.42.5.0 fix(minions): RSS watchdog opacity + pooler-reap self-heal + silent lens backlog + cycle lint DB-disconnect (garrytan#1678) (garrytan#1735)
  v0.42.4.0 fix: think --model fails loud — slash-form ids + never persist empty synthesis (garrytan#1698) (garrytan#1736)
  v0.42.3.0 feat(search): autocut — score-discontinuity result-sizing (garrytan#1663 wave 1) (garrytan#1682)
  v0.42.2.0 feat: gbrain connect — one-command Claude Code onboarding from a bearer token (garrytan#1683)

garrytan and others added 7 commits June 1, 2026 08:11

chore: bump version and changelog (v0.41.39.0)

43dd109

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Merge origin/master (v0.42.1.0 skillopt) into #1678 wave; re-version …

ee4f6ad

…to v0.42.2.0 Resolved VERSION/package.json/CHANGELOG to 0.42.2.0 (fix wave on top of master's 0.42.1.0). jobs.ts + cycle.ts auto-merged cleanly. Renumbered #1678 wave version refs 0.41.39.0 -> 0.42.2.0.

chore: re-version v0.42.2.0 → v0.42.5.0 across VERSION/package.json/C…

8a33353

…HANGELOG/docs/comments Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

garrytan changed the title ~~v0.42.2.0 fix(minions): RSS watchdog opacity + pooler-reap self-heal + silent lens backlog + cycle lint DB-disconnect (#1678)~~ v0.42.5.0 fix(minions): RSS watchdog opacity + pooler-reap self-heal + silent lens backlog + cycle lint DB-disconnect (#1678) Jun 1, 2026

garrytan added 2 commits June 1, 2026 19:11

Merge remote-tracking branch 'origin/master' into garrytan/merge-pr-1678

3499d71

# Conflicts: # CHANGELOG.md # TODOS.md # VERSION # package.json # test/audit/batch-retry-audit.test.ts

Merge remote-tracking branch 'origin/master' into garrytan/merge-pr-1678

018babb

# Conflicts: # CHANGELOG.md # VERSION # package.json

garrytan merged commit 766604d into master Jun 2, 2026
21 checks passed

This was referenced Jun 2, 2026

dream synthesize writes synthesized pages to 'default' source, ignoring the resolved brain source #1586

Open

resolveLintContentSanity disconnects shared module-level db singleton, killing the cycle's main engine connection #1471

Open

garrytan mentioned this pull request Jun 3, 2026

v0.42.16.0 feat(doctor): brain health as a solved problem — cause-ranked doctor + OOM-loop line + auto-drain + pool-reap (#1685) #1802

Merged

garrytan merged 9 commits into master from garrytan/merge-pr-1678
