v0.42.5.0 fix(minions): RSS watchdog opacity + pooler-reap self-heal + silent lens backlog + cycle lint DB-disconnect (#1678)#1735

Merged
garrytan merged 9 commits into master from garrytan/merge-pr-1678 on Jun 2, 2026

@garrytan garrytan commented Jun 1, 2026

v0.42.2.0 — fixes issue #1678 (worker crash-loop + pooler-reap cascade + silent lens backlog) + a real cycle DB-disconnect bug

A production worker crash-looped ~400×/24h. Every visible log blamed the database (CONNECTION_ENDED, connect() has not been called, lock-renewal-failed) but none were the cause — the RSS watchdog was draining the worker mid-cycle on a too-low default cap, and every DB error was downstream of that. This wave fixes all three systemic problems in #1678, plus a fourth real production bug found while validating against real Postgres.

Problem 1 — RSS watchdog: opaque + footgun default

  • New WORKER_EXIT_RSS_WATCHDOG exit code: a watchdog drain is no longer indistinguishable from a clean queue-drain. Supervisor classifies it as likely_cause=rss_watchdog with its own sliding-window breaker (independent of the stable-run reset that hid the 400×/day loop) + a loud rss_watchdog_loop alert naming the cap.
  • resolveDefaultMaxRssMb() replaces the flat 2048 MB default at every spawn site (jobs work, jobs supervisor, autopilot, MinionSupervisor): clamp(0.5 × min(cgroupLimit, totalRAM), 4096, 16384). It is cgroup-aware, so a 4GB container on a larger host doesn't get a 16GB cap and get SIGKILLed anyway.
  • 80% soft-warn (peak RSS + in-flight job kind) before the kill.
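
As a rough illustration of the sizing rule and the 80% soft-warn above, here is a minimal sketch. It is not the code in rss-default.ts: the function names `sketchDefaultMaxRssMb` and `classifyRss` are hypothetical, the memory figures are passed in rather than read from the OS, and treating "RSS at or above the cap" as the drain condition is an assumption.

```typescript
// Hypothetical sketch of the sizing rule quoted in the PR description:
//   clamp(0.5 × min(cgroupLimit, totalRAM), 4096, 16384)
// The real resolveDefaultMaxRssMb() presumably discovers these inputs itself;
// here they are plain parameters so the arithmetic is easy to check.
const MIN_CAP_MB = 4096;
const MAX_CAP_MB = 16384;
const SOFT_WARN_RATIO = 0.8; // the "80% soft-warn" from the description

function clamp(value: number, lo: number, hi: number): number {
  return Math.min(Math.max(value, lo), hi);
}

/** cgroupLimitMb is null when no cgroup memory limit applies (assumed convention). */
function sketchDefaultMaxRssMb(totalRamMb: number, cgroupLimitMb: number | null): number {
  const effectiveMb =
    cgroupLimitMb === null ? totalRamMb : Math.min(cgroupLimitMb, totalRamMb);
  return clamp(Math.floor(0.5 * effectiveMb), MIN_CAP_MB, MAX_CAP_MB);
}

type RssState = "ok" | "soft_warn" | "drain";

/** Assumption: the watchdog drains at or above the cap and warns at 80% of it. */
function classifyRss(rssMb: number, capMb: number): RssState {
  if (rssMb >= capMb) return "drain";
  if (rssMb >= SOFT_WARN_RATIO * capMb) return "soft_warn";
  return "ok";
}

console.log(sketchDefaultMaxRssMb(65536, null)); // 16384 (ceiling)
console.log(sketchDefaultMaxRssMb(16384, null)); // 8192 (half of RAM)
console.log(sketchDefaultMaxRssMb(65536, 4096)); // 4096 (cgroup limit wins, then floor)
console.log(classifyRss(3300, 4096)); // "soft_warn"
```

With the stated bounds, a 4GB container resolves to the 4096 MB floor rather than to half of the host's RAM.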

Problem 2 — pooler-reap cascade had no recovery

  • CONNECTION_ENDED now classified retryable (code + message).
  • PostgresEngine.sql getter throws a tailored retryable error on a reaped instance pool instead of the misleading module-singleton fallthrough — and transaction() / withReservedConnection() now route through that same getter (pre-landing review caught they bypassed it).
  • promoteDelayed reconnect-retries; claim recovers on the next poll tick (no double-claim); the lock-renewal tick gets a bounded reconnect-once dep (no background retry racing the renewal deadline).
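
The "retryable by code + message" classification and the bounded reconnect-once behaviour can be sketched roughly as follows. This is an illustration under assumptions, not the project's retry-matcher or lock-renewal code: `isConnectionEnded`, `withReconnectOnce`, and the `DbErrorLike` shape are hypothetical names.

```typescript
// Hypothetical error shape; real driver errors may carry more fields.
interface DbErrorLike {
  code?: string;
  message?: string;
}

/** Treat an error as a pooler reap if either its code or its message says so. */
function isConnectionEnded(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const e = err as DbErrorLike;
  return (
    e.code === "CONNECTION_ENDED" ||
    (e.message ?? "").includes("CONNECTION_ENDED")
  );
}

/**
 * Run `op`; on a connection-ended error, reconnect exactly once and try again.
 * There is no loop and no background retry, so the caller's deadline stays
 * bounded: a second failure propagates to the caller.
 */
async function withReconnectOnce<T>(
  op: () => Promise<T>,
  reconnect: () => Promise<void>,
): Promise<T> {
  try {
    return await op();
  } catch (err) {
    if (!isConnectionEnded(err)) throw err; // unrelated errors are not retried
    await reconnect();
    return op();
  }
}
```

Keeping the retry to a single attempt mirrors the description's concern about a background retry racing the renewal deadline.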

Problem 3 — extract_atoms backlog ran silently forever

  • New extract_atoms_backlog doctor check (page-backlog-only label) warns with the exact --drain command when the active pack doesn't run the phase; pack-gated skips now carry a greppable pack_gated marker.
  • New gbrain dream --phase extract_atoms --drain [--window N]: single-hold bounded drain that takes the same cycle lock id, rediscovers eligibility each batch, reports {extracted, skipped, remaining}, and exits non-zero while work remains (a failed count also counts as incomplete).
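
To make the drain's exit contract concrete, here is a small sketch of a bounded drain loop and its exit-code rule. It is illustrative only: `sketchDrain`, `drainExitCode`, and `BatchResult` are hypothetical names, interpreting the bound as a maximum number of batches is an assumption about what `--window N` means, and the numeric value of `EXIT_DRAIN_INCOMPLETE` is assumed (the PR names the constant but not its value).

```typescript
interface DrainReport {
  extracted: number;
  skipped: number;
  remaining: number | null; // null = the backlog count query failed
}

interface BatchResult {
  extracted: number;
  skipped: number;
}

const EXIT_OK = 0;
const EXIT_DRAIN_INCOMPLETE = 1; // assumed value

/** Non-zero while work remains; an unverifiable (null) count is also incomplete. */
function drainExitCode(report: DrainReport): number {
  if (report.remaining === null) return EXIT_DRAIN_INCOMPLETE;
  return report.remaining > 0 ? EXIT_DRAIN_INCOMPLETE : EXIT_OK;
}

/**
 * Bounded drain: run at most `maxBatches` batches. `runBatch` is expected to
 * rediscover eligible pages on every call instead of working from a stale list.
 */
async function sketchDrain(
  maxBatches: number,
  runBatch: () => Promise<BatchResult>,
  countBacklog: () => Promise<number | null>,
): Promise<DrainReport> {
  let extracted = 0;
  let skipped = 0;
  for (let i = 0; i < maxBatches; i++) {
    const batch = await runBatch();
    extracted += batch.extracted;
    skipped += batch.skipped;
    if (batch.extracted + batch.skipped === 0) break; // nothing eligible was found
  }
  return { extracted, skipped, remaining: await countBacklog() };
}
```

Returning non-zero on a null count means automation that loops on the exit code keeps running until the backlog is verifiably empty.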

Bonus — the dream cycle could kill its own DB connection (found via E2E)

The lint phase's resolveLintContentSanity created + disconnected a module-style engine to read 4 config values, which cascaded to db.disconnect() and nulled the shared singleton mid-cycle — every later phase then threw connect() has not been called. Triggers whenever loadConfig() reports a connection string (i.e. every production Postgres gbrain dream). Lint now reuses the caller's live engine; standalone gbrain lint keeps the create-own path.
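
The ownership rule behind this fix (only disconnect an engine you created) can be sketched as below. This is a simplified illustration, not the actual resolveLintContentSanity: the `Engine` interface, `readLintConfig`, and the `createEngine` factory are hypothetical stand-ins.

```typescript
// Hypothetical minimal engine surface for the sketch.
interface Engine {
  getConfig(key: string): Promise<string | undefined>;
  disconnect(): Promise<void>;
}

/**
 * Read config values through the caller's live engine when one is supplied.
 * Only when no engine is passed (the standalone CLI case) does the function
 * create its own, and only then does it disconnect afterwards.
 */
async function readLintConfig(
  keys: string[],
  createEngine: () => Promise<Engine>,
  callerEngine?: Engine,
): Promise<Record<string, string | undefined>> {
  const owned = callerEngine === undefined;
  const engine = callerEngine ?? (await createEngine());
  try {
    const values: Record<string, string | undefined> = {};
    for (const key of keys) {
      values[key] = await engine.getConfig(key);
    }
    return values;
  } finally {
    // Never tear down a connection that someone else is still using.
    if (owned) await engine.disconnect();
  }
}
```

Inside a cycle the caller threads its engine through, so the shared connection survives the lint phase; the standalone path still cleans up after itself.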

Review + verification

  • Planned + /plan-eng-review (CLEAN) + Codex outside-voice (12 findings, all absorbed) before implementation; a second Codex adversarial pass on the final diff caught 4 more (2 fixed here: getter-bypass + drain-null-count; 2 documented follow-ups: claim idempotent-recovery, PGLite drain lock-parity — both already mitigated, see TODOS).
  • bun run verify 29/29 · tsc clean · 12,333 unit pass (1 pre-existing isolation flake fixed) · real-Postgres E2E green incl. cycle/dream (was 3 pass/6 fail → 9/0) and db-singleton-shared-recovery.
  • ~70 new test cases across watchdog, rss-default, retry-matcher, lock-renewal, queue, getter-selfheal, extract-atoms-backlog, drain, and lint-shared-engine.

Merged master's v0.42.1.0 (skillopt) cleanly; re-versioned 0.41.39.0 → 0.42.2.0. Plan: ~/.claude/plans/sorry-it-s-an-issue-tranquil-russell.md.

🤖 Generated with Claude Code

Documentation

  • CHANGELOG.md — v0.42.2.0 entry (written at ship time, ELI10-lead-first): the watchdog crash-loop story + all three #1678 fixes + the cycle lint DB-disconnect bonus + the "To take advantage of v0.42.2.0" upgrade block.
  • CLAUDE.md — Key Files entries added for worker-exit-codes.ts, rss-default.ts, extract-atoms-drain.ts; v0.42.2.0 annotations on worker.ts (watchdog exit + claim-catch), child-worker-supervisor.ts (rss_watchdog breaker), lock-renewal-tick.ts (reconnect dep), and dream.ts (--drain). llms-full.txt regenerated to match (test/build-llms.test.ts gate).
  • TODOS.md — section re-versioned to v0.42.2.0; documented two Codex pre-landing follow-ups (claim idempotent-recovery, PGLite drain lock-path parity).

Coverage

All new public surface is documented: gbrain dream --phase extract_atoms --drain [--window N] (CHANGELOG + CLAUDE.md), extract_atoms_backlog doctor check (CHANGELOG + CLAUDE.md), auto-sized --max-rss default (CHANGELOG). No architecture diagrams drifted. No documentation debt.

garrytan and others added 7 commits June 1, 2026 08:11
…pooler-reap self-heal (#1678)

Problem 1: distinct WORKER_EXIT_RSS_WATCHDOG exit code + cause-keyed supervisor
breaker (bypasses the stable-run reset that hid the 400x/24h loop) + rss_watchdog
audit bucket + 80% soft-warn; cgroup-aware resolveDefaultMaxRssMb replaces the
flat 2048 default at every spawn site.

Problem 2: CONNECTION_ENDED classified retryable; postgres-engine sql getter
throws a retryable error on a reaped instance pool instead of the misleading
module-singleton fallthrough; promoteDelayed reconnect-retry; claim recovers on
the next poll tick (no double-claim); lock-renewal tick reconnect-once dep.

… fix lint clobbering the shared DB connection (#1678)

Problem 3: extract_atoms_backlog doctor check + pack_gated skip marker +
shared countExtractAtomsBacklog; `gbrain dream --phase extract_atoms --drain
[--window N]` single-hold bounded drain (same cycleLockIdFor, rediscover each
batch, reports remaining, exits non-zero while work remains).

Also fixes a real production bug found via E2E: the cycle lint phase's
resolveLintContentSanity created + disconnected a module-style engine that
nulled the shared db singleton mid-cycle, breaking every later phase with
"connect() has not been called". Lint now reuses the caller's live engine
(cycle + Minion handlers thread it; standalone CLI keeps the create-own path).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…to v0.42.2.0

Resolved VERSION/package.json/CHANGELOG to 0.42.2.0 (fix wave on top of master's
0.42.1.0). jobs.ts + cycle.ts auto-merged cleanly. Renumbered #1678 wave version
refs 0.41.39.0 -> 0.42.2.0.

…tion through the sql getter + drain treats failed count as incomplete

Codex adversarial review findings:
- #2: transaction(), withReservedConnection(), and one other site bypassed the
  v0.42.2.0 sql-getter self-heal via `this._sql || db.getConnection()`, so a
  reaped instance pool fell through to the module singleton there. Route all
  three through `this.sql` so they throw the retryable instance-pool error and
  recover consistently (MinionQueue.transaction hits this).
- #4: `gbrain dream --drain` treated a null backlog count (query failure) as
  success via `remaining ?? 0`; now null exits EXIT_DRAIN_INCOMPLETE so
  automation never believes an unverified backlog drained.
- #1 (claim orphan) + #3 (PGLite drain lock) documented as follow-ups in TODOS.

Adds Key Files entries for worker-exit-codes.ts, rss-default.ts, and
extract-atoms-drain.ts, plus v0.42.2.0 annotations on worker.ts,
child-worker-supervisor.ts, lock-renewal-tick.ts, and dream.ts. Regenerated
llms-full.txt to match (test/build-llms.test.ts gate).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…HANGELOG/docs/comments

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@garrytan garrytan changed the title v0.42.2.0 fix(minions): RSS watchdog opacity + pooler-reap self-heal + silent lens backlog + cycle lint DB-disconnect (#1678) v0.42.5.0 fix(minions): RSS watchdog opacity + pooler-reap self-heal + silent lens backlog + cycle lint DB-disconnect (#1678) Jun 1, 2026
garrytan added 2 commits June 1, 2026 19:11
# Conflicts:
#	CHANGELOG.md
#	TODOS.md
#	VERSION
#	package.json
#	test/audit/batch-retry-audit.test.ts
# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
@garrytan garrytan merged commit 766604d into master Jun 2, 2026
21 checks passed
garrytan added a commit that referenced this pull request Jun 3, 2026
…ked doctor + OOM-loop line + auto-drain + pool-reap (#1685) (#1802)

* feat(minions): pool-recovery audit + reconnect reason-threading + shared drain helper (#1685 GAP B, 5A)

- pool-recovery-audit.ts: reap_detected (CONNECTION_ENDED) vs reconnect_other; recovered/failed split
- postgres-engine reconnect(ctx?) classifies the triggering error so only true pooler reaps are tagged (CODEX #8)
- retry.ts reconnect callback widened to thread the error; retry-matcher isConnectionEndedError
- runExtractAtomsDrainForSource shared helper (cycleLockIdFor + withRefreshingLock) — one drain path (5A)
- supervisor-audit readRecentSupervisorEvents (current+prev ISO week, CODEX #7)
- extract-atoms-drain PROTECTED; autopilot.auto_drain.* config keys

* feat(doctor): worker_oom_loop + pool_reap_health checks + cause-ranked top_issues (#1685 GAP A/B/C)

- computeWorkerOomLoopCheck: unions supervisor rss_watchdog + minion_jobs watchdog-abort (CODEX #5), cap fallback to resolveDefaultMaxRssMb (CODEX #6)
- computePoolReapHealthCheck: reaps-not-recovering fail, thrash warn
- doctor-cause-rank rankIssues: tier ordering + grounded downstream_of (CODEX #9) + drift guard (4A)
- supervisor causeStr + queue_health cross-reference worker_oom_loop (DRY 1C)
- register both checks in doctor-categories ops

* feat(autopilot): per-source extract_atoms auto-drain + handler + dream --drain refactor (#1685 GAP D)

- autopilot per-source gate: enabled + !packDeclares + backlog>threshold + daily cap; time-slotted idempotency key (CODEX #2)
- extract-atoms-drain Minion handler (thin wrapper, LockUnavailableError -> deferred)
- dream --drain routes through the shared helper (5A)

* chore: bump version and changelog (v0.42.12.0)

#1685 brain-health-as-solved-problem: cause-ranked doctor, worker_oom_loop
line, per-source auto-drain, pool-reap health. Layers on #1678/#1735.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(todos): file #1685 GAP E + remote-path follow-ups (v0.42.12.0)

* fix(#1685): pre-landing review — multi-source auto-drain, honest pool-reap signal, lock-renewal reap labeling

- autopilot: drop maxWaiting (coalesces by name+queue not source → only one source drained + cap over-count); pre-check idempotency key so only genuinely-new sources submit+count
- pool_reap_health: fail on reconnect FAILURES (the real signal), not reaps>0&&failures>0 (false causality when a recovered reap + unrelated failure co-occur)
- lock-renewal-tick threads its triggering error to reconnect() so a CONNECTION_ENDED pooler reap is labeled reap_detected not reconnect_other (pool_reap_health now fires for the #1678 incident path)

* chore: re-version v0.42.12.0 → v0.42.16.0 (#1685)

Slot collision avoidance per queue.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: restore slim CLAUDE.md + move #1685 entries to KEY_FILES.md (fix check:doc-history)

The master merge wrongly kept the pre-restructure 577KB CLAUDE.md; the
check:doc-history guard caps it at 60KB. Take master's slim CLAUDE.md and
record the #1685 files (doctor-cause-rank, pool-recovery-audit, worker_oom_loop
+ pool_reap_health checks, auto-drain, 5A helper) as current-state prose in
docs/architecture/KEY_FILES.md (no release markers). llms regenerated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mgunnin added a commit to mgunnin/gbrain that referenced this pull request Jun 3, 2026
* upstream/master:
  v0.42.8.0 feat: content-quality gate on sync — quarantine junk + flag boilerplate (garrytan#1699) (garrytan#1756)
  v0.42.7.0 feat(extract): link/timeline extraction freshness watermark — gbrain extract --stale + doctor lag check (garrytan#1696) (garrytan#1755)
  v0.42.6.0 feat(enrich): gbrain enrich --thin — brain-internal grounded synthesis for stub pages (garrytan#1700) (garrytan#1757)
  v0.42.5.0 fix(minions): RSS watchdog opacity + pooler-reap self-heal + silent lens backlog + cycle lint DB-disconnect (garrytan#1678) (garrytan#1735)
  v0.42.4.0 fix: think --model fails loud — slash-form ids + never persist empty synthesis (garrytan#1698) (garrytan#1736)
  v0.42.3.0 feat(search): autocut — score-discontinuity result-sizing (garrytan#1663 wave 1) (garrytan#1682)
  v0.42.2.0 feat: gbrain connect — one-command Claude Code onboarding from a bearer token (garrytan#1683)