v0.42.5.0 fix(minions): RSS watchdog opacity + pooler-reap self-heal + silent lens backlog + cycle lint DB-disconnect (#1678) #1735
Merged
…pooler-reap self-heal (#1678)
Problem 1: distinct WORKER_EXIT_RSS_WATCHDOG exit code + cause-keyed supervisor breaker (bypasses the stable-run reset that hid the 400x/24h loop) + rss_watchdog audit bucket + 80% soft-warn; cgroup-aware resolveDefaultMaxRssMb replaces the flat 2048 default at every spawn site.
Problem 2: CONNECTION_ENDED classified retryable; postgres-engine sql getter throws a retryable error on a reaped instance pool instead of the misleading module-singleton fallthrough; promoteDelayed reconnect-retry; claim recovers on the next poll tick (no double-claim); lock-renewal tick reconnect-once dep.
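One way to picture the cause-keyed breaker is a sliding window of exit timestamps kept per cause, so a long stable run between crashes never clears the history for that cause. The sketch below is illustrative only: `CauseKeyedBreaker`, `classifyExit`, the threshold, the window length, and the numeric exit code are hypothetical; only the `WORKER_EXIT_RSS_WATCHDOG` name and the `rss_watchdog` cause come from the commit message.

```typescript
// Hypothetical sketch of a cause-keyed sliding-window breaker.
// The numeric exit code, class name, and thresholds are assumptions,
// not the actual supervisor implementation.
const WORKER_EXIT_RSS_WATCHDOG = 87; // assumed value, the real code is not stated here

type ExitCause = "rss_watchdog" | "clean_drain" | "unknown";

function classifyExit(code: number): ExitCause {
  if (code === WORKER_EXIT_RSS_WATCHDOG) return "rss_watchdog";
  if (code === 0) return "clean_drain";
  return "unknown";
}

class CauseKeyedBreaker {
  private readonly events = new Map<ExitCause, number[]>();
  private readonly maxEvents: number;
  private readonly windowMs: number;

  constructor(maxEvents: number, windowMs: number) {
    this.maxEvents = maxEvents;
    this.windowMs = windowMs;
  }

  // Record one exit. Returns true when the breaker for that cause trips.
  // There is deliberately no "reset on stable run" hook: old events only
  // leave the window by ageing out.
  record(cause: ExitCause, nowMs: number): boolean {
    const cutoff = nowMs - this.windowMs;
    const recent = (this.events.get(cause) ?? []).filter((t) => t > cutoff);
    recent.push(nowMs);
    this.events.set(cause, recent);
    return recent.length >= this.maxEvents;
  }
}
```

Keying the window by cause means clean drains and watchdog drains are counted separately, which is the property the commit message relies on to surface the loop.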
… fix lint clobbering the shared DB connection (#1678)
Problem 3: extract_atoms_backlog doctor check + pack_gated skip marker + shared countExtractAtomsBacklog; `gbrain dream --phase extract_atoms --drain [--window N]` single-hold bounded drain (same cycleLockIdFor, rediscover each batch, reports remaining, exits non-zero while work remains).
Also fixes a real production bug found via E2E: the cycle lint phase's resolveLintContentSanity created + disconnected a module-style engine that nulled the shared db singleton mid-cycle, breaking every later phase with "connect() has not been called". Lint now reuses the caller's live engine (cycle + Minion handlers thread it; standalone CLI keeps the create-own path).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
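The bounded drain described in this commit (rediscover each batch, report what remains) might look roughly like the sketch below. It is a sketch under assumptions: the `DrainDeps` callbacks, the `drain` function name, and treating the bound as a maximum number of batches are all hypothetical, and the cycle lock is omitted entirely.

```typescript
// Hypothetical sketch of a bounded drain loop. Callback names and the
// "bound = max batches" interpretation are assumptions for illustration.
interface DrainReport {
  extracted: number;
  skipped: number;
  remaining: number | null; // null means the backlog count could not be verified
}

interface DrainDeps {
  discoverBatch: () => Promise<string[]>; // re-run eligibility discovery
  extractOne: (pageId: string) => Promise<boolean>; // true = extracted, false = skipped
  countBacklog: () => Promise<number | null>;
}

async function drain(deps: DrainDeps, maxBatches: number): Promise<DrainReport> {
  let extracted = 0;
  let skipped = 0;
  for (let batch = 0; batch < maxBatches; batch++) {
    // Rediscover on every batch instead of trusting a list built up front.
    const pages = await deps.discoverBatch();
    if (pages.length === 0) break;
    for (const id of pages) {
      if (await deps.extractOne(id)) extracted++;
      else skipped++;
    }
  }
  // Count once more at the end so the caller can see what is left.
  return { extracted, skipped, remaining: await deps.countBacklog() };
}
```

Rediscovering per batch keeps the loop honest when pages become eligible or ineligible while the drain is running.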
…to v0.42.2.0 Resolved VERSION/package.json/CHANGELOG to 0.42.2.0 (fix wave on top of master's 0.42.1.0). jobs.ts + cycle.ts auto-merged cleanly. Renumbered #1678 wave version refs 0.41.39.0 -> 0.42.2.0.
…tion through the sql getter + drain treats failed count as incomplete
Codex adversarial review findings:
- #2: transaction(), withReservedConnection(), and one other site bypassed the v0.42.2.0 sql-getter self-heal via `this._sql || db.getConnection()`, so a reaped instance pool fell through to the module singleton there. Route all three through `this.sql` so they throw the retryable instance-pool error and recover consistently (MinionQueue.transaction hits this).
- #4: `gbrain dream --drain` treated a null backlog count (query failure) as success via `remaining ?? 0`; now null exits EXIT_DRAIN_INCOMPLETE so automation never believes an unverified backlog drained.
- #1 (claim orphan) + #3 (PGLite drain lock) documented as follow-ups in TODOS.
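Finding #4 comes down to how a `null` count is mapped to an exit code. The sketch below contrasts the two behaviours. The numeric value of `EXIT_DRAIN_INCOMPLETE`, both function names, and the assumption that a non-empty backlog uses the same exit code are illustrative, not taken from the diff.

```typescript
// Hypothetical sketch: mapping a possibly-failed backlog count to an exit code.
const EXIT_OK = 0;
const EXIT_DRAIN_INCOMPLETE = 2; // assumed numeric value

// Old behaviour: `remaining ?? 0` turns a failed count into "nothing left".
function legacyDrainExitCode(remaining: number | null): number {
  return (remaining ?? 0) > 0 ? EXIT_DRAIN_INCOMPLETE : EXIT_OK;
}

// Fixed behaviour: an unverified backlog is never reported as drained.
function drainExitCode(remaining: number | null): number {
  if (remaining === null) return EXIT_DRAIN_INCOMPLETE;
  return remaining > 0 ? EXIT_DRAIN_INCOMPLETE : EXIT_OK;
}
```

With the fix, a scheduler that retries on non-zero exit keeps retrying until a count actually succeeds and returns zero.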
Adds Key Files entries for worker-exit-codes.ts, rss-default.ts, and extract-atoms-drain.ts, plus v0.42.2.0 annotations on worker.ts, child-worker-supervisor.ts, lock-renewal-tick.ts, and dream.ts. Regenerated llms-full.txt to match (test/build-llms.test.ts gate). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…HANGELOG/docs/comments Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Conflicts: # CHANGELOG.md # TODOS.md # VERSION # package.json # test/audit/batch-retry-audit.test.ts
# Conflicts: # CHANGELOG.md # VERSION # package.json
garrytan added a commit that referenced this pull request on Jun 3, 2026
…ked doctor + OOM-loop line + auto-drain + pool-reap (#1685) (#1802)
* feat(minions): pool-recovery audit + reconnect reason-threading + shared drain helper (#1685 GAP B, 5A)
  - pool-recovery-audit.ts: reap_detected (CONNECTION_ENDED) vs reconnect_other; recovered/failed split
  - postgres-engine reconnect(ctx?) classifies the triggering error so only true pooler reaps are tagged (CODEX #8)
  - retry.ts reconnect callback widened to thread the error; retry-matcher isConnectionEndedError
  - runExtractAtomsDrainForSource shared helper (cycleLockIdFor + withRefreshingLock): one drain path (5A)
  - supervisor-audit readRecentSupervisorEvents (current+prev ISO week, CODEX #7)
  - extract-atoms-drain PROTECTED; autopilot.auto_drain.* config keys
* feat(doctor): worker_oom_loop + pool_reap_health checks + cause-ranked top_issues (#1685 GAP A/B/C)
  - computeWorkerOomLoopCheck: unions supervisor rss_watchdog + minion_jobs watchdog-abort (CODEX #5), cap fallback to resolveDefaultMaxRssMb (CODEX #6)
  - computePoolReapHealthCheck: reaps-not-recovering fail, thrash warn
  - doctor-cause-rank rankIssues: tier ordering + grounded downstream_of (CODEX #9) + drift guard (4A)
  - supervisor causeStr + queue_health cross-reference worker_oom_loop (DRY 1C)
  - register both checks in doctor-categories ops
* feat(autopilot): per-source extract_atoms auto-drain + handler + dream --drain refactor (#1685 GAP D)
  - autopilot per-source gate: enabled + !packDeclares + backlog>threshold + daily cap; time-slotted idempotency key (CODEX #2)
  - extract-atoms-drain Minion handler (thin wrapper, LockUnavailableError -> deferred)
  - dream --drain routes through the shared helper (5A)
* chore: bump version and changelog (v0.42.12.0)
  #1685 brain-health-as-solved-problem: cause-ranked doctor, worker_oom_loop line, per-source auto-drain, pool-reap health. Layers on #1678/#1735.
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* docs(todos): file #1685 GAP E + remote-path follow-ups (v0.42.12.0)
* fix(#1685): pre-landing review (multi-source auto-drain, honest pool-reap signal, lock-renewal reap labeling)
  - autopilot: drop maxWaiting (coalesces by name+queue not source → only one source drained + cap over-count); pre-check idempotency key so only genuinely-new sources submit+count
  - pool_reap_health: fail on reconnect FAILURES (the real signal), not reaps>0&&failures>0 (false causality when a recovered reap + unrelated failure co-occur)
  - lock-renewal-tick threads its triggering error to reconnect() so a CONNECTION_ENDED pooler reap is labeled reap_detected not reconnect_other (pool_reap_health now fires for the #1678 incident path)
* chore: re-version v0.42.12.0 → v0.42.16.0 (#1685)
  Slot collision avoidance per queue.
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* docs: restore slim CLAUDE.md + move #1685 entries to KEY_FILES.md (fix check:doc-history)
  The master merge wrongly kept the pre-restructure 577KB CLAUDE.md; the check:doc-history guard caps it at 60KB. Take master's slim CLAUDE.md and record the #1685 files (doctor-cause-rank, pool-recovery-audit, worker_oom_loop + pool_reap_health checks, auto-drain, 5A helper) as current-state prose in docs/architecture/KEY_FILES.md (no release markers). llms regenerated.
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
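The reconnect reason-threading in this follow-up can be sketched as a small classifier over the triggering error. Only the names `isConnectionEndedError`, `reap_detected`, and `reconnect_other` appear in the commit message. The error shape checked here (a `code` or `message` carrying `CONNECTION_ENDED`) and the `classifyReconnect` helper are assumptions.

```typescript
// Hypothetical sketch: label a reconnect by the error that triggered it.
type ReconnectReason = "reap_detected" | "reconnect_other";

// Assumed matcher: accepts either a driver-style `code` or a message match.
function isConnectionEndedError(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const { code, message } = err as { code?: unknown; message?: unknown };
  if (code === "CONNECTION_ENDED") return true;
  return typeof message === "string" && message.includes("CONNECTION_ENDED");
}

// A reconnect with no triggering error, or an unrelated one, is not a reap.
function classifyReconnect(trigger?: unknown): ReconnectReason {
  return isConnectionEndedError(trigger) ? "reap_detected" : "reconnect_other";
}
```

Threading the error through to the classifier is what lets a health check count true pooler reaps separately from every other reconnect.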
mgunnin added a commit to mgunnin/gbrain that referenced this pull request on Jun 3, 2026
* upstream/master:
  v0.42.8.0 feat: content-quality gate on sync - quarantine junk + flag boilerplate (garrytan#1699) (garrytan#1756)
  v0.42.7.0 feat(extract): link/timeline extraction freshness watermark - gbrain extract --stale + doctor lag check (garrytan#1696) (garrytan#1755)
  v0.42.6.0 feat(enrich): gbrain enrich --thin - brain-internal grounded synthesis for stub pages (garrytan#1700) (garrytan#1757)
  v0.42.5.0 fix(minions): RSS watchdog opacity + pooler-reap self-heal + silent lens backlog + cycle lint DB-disconnect (garrytan#1678) (garrytan#1735)
  v0.42.4.0 fix: think --model fails loud - slash-form ids + never persist empty synthesis (garrytan#1698) (garrytan#1736)
  v0.42.3.0 feat(search): autocut - score-discontinuity result-sizing (garrytan#1663 wave 1) (garrytan#1682)
  v0.42.2.0 feat: gbrain connect - one-command Claude Code onboarding from a bearer token (garrytan#1683)
v0.42.2.0 — fixes issue #1678 (worker crash-loop + pooler-reap cascade + silent lens backlog) + a real cycle DB-disconnect bug
A production worker crash-looped ~400×/24h. Every visible log blamed the database (`CONNECTION_ENDED`, `connect() has not been called`, `lock-renewal-failed`) but none were the cause: the RSS watchdog was draining the worker mid-cycle on a too-low default cap, and every DB error was downstream of that. This wave fixes all three systemic problems in #1678, plus a fourth real production bug found while validating against real Postgres.
Problem 1: RSS watchdog (opaque + footgun default)
- `WORKER_EXIT_RSS_WATCHDOG` exit code: a watchdog drain is no longer indistinguishable from a clean queue-drain. Supervisor classifies it as `likely_cause=rss_watchdog` with its own sliding-window breaker (independent of the stable-run reset that hid the 400×/day loop) + a loud `rss_watchdog_loop` alert naming the cap.
- `resolveDefaultMaxRssMb()` replaces the flat `2048` MB default at every spawn site (`jobs work`, `jobs supervisor`, autopilot, `MinionSupervisor`): `clamp(0.5 × min(cgroupLimit, totalRAM), 4096, 16384)`. cgroup-aware so a 4GB container doesn't get a 16GB cap and SIGKILL anyway.
Problem 2: pooler-reap cascade had no recovery
- `CONNECTION_ENDED` now classified retryable (code + message).
- `PostgresEngine.sql` getter throws a tailored retryable error on a reaped instance pool instead of the misleading module-singleton fallthrough, and `transaction()` / `withReservedConnection()` now route through that same getter (pre-landing review caught they bypassed it).
- `promoteDelayed` reconnect-retries; `claim` recovers on the next poll tick (no double-claim); the lock-renewal tick gets a bounded reconnect-once dep (no background retry racing the renewal deadline).
Problem 3: extract_atoms backlog ran silently forever
- `extract_atoms_backlog` doctor check (page-backlog-only label) warns with the exact `--drain` command when the active pack doesn't run the phase; pack-gated skips now carry a greppable `pack_gated` marker.
- `gbrain dream --phase extract_atoms --drain [--window N]`: single-hold bounded drain that takes the same cycle lock id, rediscovers eligibility each batch, reports `{extracted, skipped, remaining}`, and exits non-zero while work remains (a failed count also counts as incomplete).
Bonus: the dream cycle could kill its own DB connection (found via E2E)
The `lint` phase's `resolveLintContentSanity` created + disconnected a module-style engine to read 4 config values, which cascaded to `db.disconnect()` and nulled the shared singleton mid-cycle; every later phase then threw `connect() has not been called`. Triggers whenever `loadConfig()` reports a connection string (i.e. every production Postgres `gbrain dream`). Lint now reuses the caller's live engine; standalone `gbrain lint` keeps the create-own path.
Review + verification
- `/plan-eng-review` (CLEAN) + Codex outside-voice (12 findings, all absorbed) before implementation; a second Codex adversarial pass on the final diff caught 4 more (2 fixed here: getter-bypass + drain-null-count; 2 documented follow-ups: claim idempotent-recovery, PGLite drain lock-parity, both already mitigated, see TODOS).
- `bun run verify` 29/29 · `tsc` clean · 12,333 unit pass (1 pre-existing isolation flake fixed) · real-Postgres E2E green incl. cycle/dream (was 3 pass/6 fail → 9/0) and `db-singleton-shared-recovery`.
- Merged master's v0.42.1.0 (skillopt) cleanly; re-versioned 0.41.39.0 → 0.42.2.0. Plan: `~/.claude/plans/sorry-it-s-an-issue-tranquil-russell.md`.
🤖 Generated with Claude Code
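For reference, the auto-sized default from Problem 1 can be written out directly from the stated formula. This is a minimal sketch, not the contents of `rss-default.ts`: the function name `resolveDefaultMaxRssMbSketch`, taking both inputs as parameters, using `null` for "no cgroup limit", and rounding down with `Math.floor` are all assumptions.

```typescript
// Hypothetical sketch of clamp(0.5 × min(cgroupLimit, totalRAM), 4096, 16384).
// All values are in MB.
const MIN_CAP_MB = 4096;
const MAX_CAP_MB = 16384;

function clamp(value: number, lo: number, hi: number): number {
  return Math.min(Math.max(value, lo), hi);
}

// cgroupLimitMb is null when no cgroup memory limit applies (assumed convention).
function resolveDefaultMaxRssMbSketch(
  totalRamMb: number,
  cgroupLimitMb: number | null,
): number {
  const effectiveMb =
    cgroupLimitMb === null ? totalRamMb : Math.min(cgroupLimitMb, totalRamMb);
  // Half of the effective memory, bounded to the stated floor and ceiling.
  return clamp(Math.floor(0.5 * effectiveMb), MIN_CAP_MB, MAX_CAP_MB);
}
```

Under these assumptions a 64GB host with no cgroup limit resolves to the 16384 ceiling, a 16GB host resolves to 8192, and a 4GB container on a 64GB host resolves to the 4096 floor rather than a host-sized cap.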
Documentation
Key Files entries for `worker-exit-codes.ts`, `rss-default.ts`, `extract-atoms-drain.ts`; v0.42.2.0 annotations on `worker.ts` (watchdog exit + claim-catch), `child-worker-supervisor.ts` (rss_watchdog breaker), `lock-renewal-tick.ts` (reconnect dep), and `dream.ts` (`--drain`). `llms-full.txt` regenerated to match (`test/build-llms.test.ts` gate).
Coverage
All new public surface is documented: `gbrain dream --phase extract_atoms --drain [--window N]` (CHANGELOG + CLAUDE.md), `extract_atoms_backlog` doctor check (CHANGELOG + CLAUDE.md), auto-sized `--max-rss` default (CHANGELOG). No architecture diagrams drifted. No documentation debt.