Worker crash-loop (400+/24h) from too-low RSS watchdog default, mis-reported as connection/lock failure + silently-starved lens-phase backlog #1678

# Worker crash-loop (400+/24h) from a too-low default RSS watchdog masquerading as a connection/lock failure — plus a silently-starved lens phase backlog

**Version:** gbrain `0.41.34.0`
**Severity:** High (production worker fully wedged for ~24h; brain processing halted)
**Environment:** Supabase transaction-mode pooler (port 6543, `prepare:false`); 126GB box; minion supervisor + 1 worker, concurrency 3.

## TL;DR

A production worker crash-looped every 3–5 min for 24h+ (400+ exits). The proximate logs all pointed at the database: `CONNECTION_ENDED`, `No database connection: connect() has not been called`, `lock-renewal-failed`, `cycle_already_running`. **None of those were the cause.** The real cause was the **RSS watchdog SIGTERM-killing the worker** whenever an embed/contextual-reindex pass pushed RSS past the memory cap. Every downstream DB error was a symptom of the worker being shot mid-cycle. We spent hours chasing the connection ghost (and even opened a connection-retry PR, #1669) before the watchdog log line revealed the truth.

Separately, this incident surfaced a second long-standing bug: **`extract_atoms` (and other lens phases) silently never run in the routine cycle**, so an atom-extraction backlog (~686 pages) accumulated indefinitely with zero signal.

This issue proposes fixes for **three** distinct systemic problems so this can't recur for us or anyone else.

---

## Problem 1 — RSS watchdog default is dangerously low and the kill is indistinguishable from a real crash

### What happens
`worker.ts:checkMemoryLimit()` compares post-job RSS against `maxRssMb`; if exceeded it logs `[watchdog ...] rss=NNNNMB threshold=NNNNMB ... — draining` and drains/exits. The supervisor then sees `worker_exited code=1 ... likely_cause=runtime_error` and respawns. On a box where the embed/reindex working set legitimately needs ~10GB, an `8192`MB cap (or worse, the **gbrain default of `2048`MB** — `src/commands/jobs.ts:785` and `:1026`) guarantees a kill on every heavy cycle → infinite loop.
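The check described above can be sketched as follows. This is a minimal illustration, not the actual `worker.ts` code: the function name `checkRss` and the result shape are assumptions. It takes RSS in bytes (in a Node or Bun worker that value would come from `process.memoryUsage().rss`) and reports whether the cap was exceeded.

```typescript
// Hypothetical sketch of a post-job RSS check; names and result shape are illustrative.
interface RssCheck {
  rssMb: number;
  thresholdMb: number;
  exceeded: boolean;
}

// rssBytes would come from process.memoryUsage().rss in a Node/Bun worker.
function checkRss(rssBytes: number, maxRssMb: number): RssCheck {
  const rssMb = Math.round(rssBytes / (1024 * 1024));
  return { rssMb, thresholdMb: maxRssMb, exceeded: rssMb > maxRssMb };
}

// Numbers from this incident: a ~9.8GB working set against an 8192MB cap.
const incident = checkRss(9811 * 1024 * 1024, 8192);
// The same working set against the 16384MB cap that stopped the loop.
const afterFix = checkRss(9811 * 1024 * 1024, 16384);
```

With a working set that legitimately sits above the cap, `exceeded` is true after every heavy job, which is why the loop never converges on its own.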

### Why it's so hard to diagnose
1. The watchdog exit is `code=1` and the supervisor labels it `likely_cause=runtime_error` — **identical to an actual crash**. (`child-worker-supervisor.ts:287` treats `code=0` as clean drain, but the watchdog path exits **non-zero**.)
2. The *visible* errors are all the **downstream** DB failures from being SIGTERM'd mid-cycle (pooler reaps the socket → `CONNECTION_ENDED` → engine nulls `_sql` → every subsequent call throws `No database connection` → `lock-renewal-failed` → abort). The actual `[watchdog]` line scrolls by once per cycle and is easy to miss among hundreds of DB error lines.
3. `2048`MB default is absurdly low for any brain doing embeddings; most operators won't know to raise it until they hit this.

### Code refs
- `src/core/minions/worker.ts:594` `checkMemoryLimit()`, `:611` the watchdog log, `:803` `checkMemoryLimit('post-job')`
- `src/commands/jobs.ts:785` `const maxRssMb = maxRssExplicit ?? 2048;`, `:1026` `parseMaxRssFlag(args) ?? 2048`
- `src/core/minions/child-worker-supervisor.ts:287` (code=0 clean) vs the non-zero watchdog exit

### Proposed fixes
1. **Make the watchdog exit self-identifying.** Exit with a distinct, reserved code (e.g. `137`-style or a dedicated `code=2` + `reason=rss-watchdog`), and have the supervisor log `likely_cause=rss-watchdog rss=NNNN threshold=NNNN` instead of the generic `runtime_error`. One glance at `worker_exited` should say "OOM cap," not "runtime error."
2. **Raise the default cap and/or auto-size it.** `2048`MB is a footgun. Either bump the default to something embed-realistic (8–16GB) **or** default to a fraction of detected system memory (e.g. `min(0.25 * total_ram, 16384)`), with the explicit flag still overriding.
3. **Crash-loop circuit breaker keyed on cause.** When the supervisor sees N consecutive `rss-watchdog` exits within a window, it should (a) emit a loud one-line operator alert ("worker OOM-looping: cap=NNNN, peak=NNNN, raise --max-rss"), and (b) optionally back off respawn instead of hot-looping 400×/day. Today `max_crashes=10` exists but the loop still effectively never stops because the count resets / the cause is opaque.
4. **Per-job-kind RSS expectation + soft warn.** Track peak RSS per job kind (`embed-backfill`, `autopilot-cycle`, etc.). Warn at e.g. 80% of cap *before* the kill, so operators get a heads-up ("embed-backfill peaked at 9.8GB, cap 8GB — next run will be killed") rather than silent death.
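Fixes 1 and 3 could fit together roughly as below. This is a sketch under stated assumptions: `classifyExit`, `CrashLoopBreaker`, and `alertLine` do not exist in gbrain, and exit code `2` is the dedicated code suggested in fix 1, not a value the worker uses today.

```typescript
// Hypothetical sketch: none of these names exist in gbrain today.
const RSS_WATCHDOG_EXIT_CODE = 2; // assumed reserved code, as suggested in fix 1

type ExitCause = "clean-drain" | "rss-watchdog" | "runtime_error";

function classifyExit(code: number): ExitCause {
  if (code === 0) return "clean-drain";
  if (code === RSS_WATCHDOG_EXIT_CODE) return "rss-watchdog";
  return "runtime_error";
}

interface ExitEvent {
  cause: ExitCause;
  atMs: number;
}

// Counts consecutive exits with the same cause inside a sliding time window.
class CrashLoopBreaker {
  private streak: ExitEvent[] = [];
  private readonly maxConsecutive: number;
  private readonly windowMs: number;

  constructor(maxConsecutive: number, windowMs: number) {
    this.maxConsecutive = maxConsecutive;
    this.windowMs = windowMs;
  }

  // Returns true when the supervisor should alert and back off respawning.
  record(cause: ExitCause, nowMs: number): boolean {
    if (cause === "clean-drain") {
      this.streak = [];
      return false;
    }
    const first = this.streak[0];
    // A different cause starts a new streak.
    if (first !== undefined && first.cause !== cause) {
      this.streak = [];
    }
    this.streak.push({ cause, atMs: nowMs });
    // Drop exits that have aged out of the window.
    this.streak = this.streak.filter((e) => nowMs - e.atMs <= this.windowMs);
    return this.streak.length >= this.maxConsecutive;
  }
}

// One-line operator alert in the wording proposed in fix 3.
function alertLine(capMb: number, peakMb: number): string {
  return `worker OOM-looping: cap=${capMb}, peak=${peakMb}, raise --max-rss`;
}
```

Keying the streak on cause means a run of watchdog kills trips the breaker quickly, while an unrelated one-off crash resets the count and does not.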

---

## Problem 2 — Downstream DB-error cascade has no "the pool got reaped, rebuild it" recovery in the hot lock paths

Even independent of the OOM trigger, a transaction-mode pooler **will** reap idle sockets between lock-renewal ticks. When postgres.js throws `CONNECTION_ENDED`, the engine nulls `_sql`, and the `engine.sql` getter falls through to the module-level singleton (`db.ts:154`) which was never connected on an instance-pool worker → throws `No database connection: connect() has not been called`. After that, **every** call (`promoteDelayed`, `renewLock`, `claim`) fails until process restart.

### Code refs
- `src/core/db.ts:154-155` the "connect() has not been called" throw
- `src/core/minions/queue.ts` `renewLock` → raw `executeRaw` with no retry/reconnect
- `src/core/minions/worker.ts:463` `promoteDelayed()` raw path, no retry
- `src/core/retry-matcher.ts` (already lists `CONNECTION_ENDED` / `No database connection` patterns)
- `src/core/minions/lock-renewal-tick.ts` (`runLockRenewalTick`, `lock-renewal-failed`)

### Proposed fixes
1. **Wrap the hot raw-SQL paths (`renewLock`, `promoteDelayed`, `claim`) in `withRetry` with a real `resolveReconnect()`** that rebuilds the instance pool (`engine.reconnect()`), not the module singleton. This is what PR #1669 began (`db-lock.ts` + `retry-matcher.ts`) but it must cover the **Minion job-lock** paths, not just the cycle lock.
2. **Fix the `engine.sql` getter fallback.** When `_sql` is null on an instance-pool engine, it should **rebuild its own pool**, not silently fall through to the never-connected module singleton and throw a misleading error. The current fallthrough is what makes the failure look like "you never called connect()" when in fact the pool was reaped.
3. **Classify `CONNECTION_ENDED` as retryable everywhere** (by both error code and message). The patterns are already present in `retry-matcher.ts`, but verify that every call site actually routes through it.
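The retry-plus-reconnect shape from fix 1 might look like the sketch below. Everything here is illustrative: `withRetrySketch`, `isRetryable`, and the `reconnect` callback are hypothetical stand-ins, not the real `withRetry`, `retry-matcher.ts`, or `engine.reconnect()` signatures, and the two patterns are simply the error strings quoted in this issue.

```typescript
// Hypothetical sketch: these helpers stand in for gbrain's real retry utilities.
const RETRYABLE_PATTERNS = [/CONNECTION_ENDED/, /No database connection/];

function isRetryable(err: unknown): boolean {
  const code =
    typeof err === "object" && err !== null && "code" in err
      ? String((err as { code: unknown }).code)
      : "";
  const msg = err instanceof Error ? err.message : String(err);
  // Match on either the error code or the message text.
  return RETRYABLE_PATTERNS.some((p) => p.test(code) || p.test(msg));
}

async function withRetrySketch<T>(
  op: () => Promise<T>,
  reconnect: () => Promise<void>,
  maxAttempts = 3,
): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastErr = err;
      // Non-retryable errors and the final attempt propagate unchanged.
      if (!isRetryable(err) || attempt === maxAttempts) throw err;
      // Rebuild the instance pool before the next attempt.
      await reconnect();
    }
  }
  throw lastErr; // only reached if maxAttempts < 1
}
```

A call site would wrap the raw query, for example `withRetrySketch(() => renewLock(jobId), () => engine.reconnect())`, where both callee names are taken from this issue and their exact signatures are assumed. A production version would likely add backoff between attempts.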

---

## Problem 3 — `extract_atoms` / lens phases silently never run in the routine cycle → unbounded backlog with zero signal

### What happens
The routine 5-min autopilot-cycle runs `opts.phases ?? ALL_PHASES` (`cycle.ts:1285`), but each lens phase is **pack-gated**: `extract_atoms` only executes if the active schema pack *declares* it (`cycle.ts:1583` → `packDeclaresPhase`). If the pack doesn't declare it, the phase no-ops with `extract_atoms: active pack does not declare this phase` (`cycle.ts:1588`) — **silently, as a "successful" skip.** Operators see a green cycle and assume atoms are being extracted. They aren't. The backlog grows forever.

Workarounds people reach for then fail:
- **A separate 6h lens cron** (running the phase out-of-band) **starves** against the 5-min autopilot's single global cycle lock — it loses the lock fight ~every time and no-ops in ~4s with `cycle_already_running`.
- **Parallel shell-minions** can't work at all: one global cycle lock means every parallel `dream --phase` just hits "Skipped: another cycle is already running (locked)."
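The pack gate described above, and the missing signal, can be sketched as follows. The types, `runGatedPhase`, `backlogWarnings`, and the `embed` phase name are assumptions for illustration and not gbrain's real structures; only the skip message and the ~686-page backlog come from this incident.

```typescript
// Hypothetical sketch of a pack-gated phase. Types and names are illustrative, not gbrain's.
interface SchemaPack {
  phases: string[];
}

interface PhaseResult {
  phase: string;
  status: "ran" | "skipped";
  reason?: string;
  pendingEligible?: number; // backlog size, reported even when nothing ran
}

function runGatedPhase(
  pack: SchemaPack,
  phase: string,
  pendingEligible: number,
  run: () => void,
): PhaseResult {
  if (!pack.phases.includes(phase)) {
    // Today this path reads as a successful no-op. Carrying the pending
    // count is what would let a health check notice the growing backlog.
    return {
      phase,
      status: "skipped",
      reason: "active pack does not declare this phase",
      pendingEligible,
    };
  }
  run();
  return { phase, status: "ran" };
}

// Turns skips that leave eligible work behind into operator-facing warnings.
function backlogWarnings(results: PhaseResult[]): string[] {
  return results
    .filter((r) => r.status === "skipped" && (r.pendingEligible ?? 0) > 0)
    .map((r) => `${r.phase} skipped (${r.reason}): ${r.pendingEligible} eligible pages pending`);
}

const pack: SchemaPack = { phases: ["embed"] }; // assumed pack that omits extract_atoms
const skippedResult = runGatedPhase(pack, "extract_atoms", 686, () => {});
const warnings = backlogWarnings([skippedResult]);
```

Without the `pendingEligible` field, the skipped result is indistinguishable from a phase that had nothing to do, which is the situation operators are in today.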

### Code refs
- `src/core/cycle.ts:1285` `const phases = opts.phases ?? ALL_PHASES`
- `src/core/cycle.ts:1573-1588` the pack-gate that turns a missing declaration into a silent skip
- `src/core/cycle.ts:773-800` `packDeclaresPhase` / `phases:` resolution

### Proposed fixes
1. **Surface the silent skip as a health signal.** A pack-gated phase that skips should increment a visible counter / emit a `doctor` warning: "extract_atoms skipped (pack does not declare it) — N eligible pages pending, backlog growing." Never let "I did nothing" report as a clean success with no backlog visibility.
2. **`doctor` backlog metric.** Add a `doctor` check: "extract_atoms backlog = N eligible pages, last successful extraction = T." Anything stale/growing should warn. (Same for `synthesize_concepts` and `conversation_facts`.)
3. **A first-class, bounded backlog-drain mode for slow phases.** `dream --phase extract_atoms --drain --window <seconds>` that holds the lock for a bounded window, processes a chunk, and **releases cooperatively** so the 5-min autopilot isn't starved — with progress in the JSON report (`{extracted, skipped, remaining}`). This makes the "slow phase with a big backlog" case a supported first-class operation instead of a brittle cron hack.
4. **Cooperative lock yielding.** The single global cycle lock should support a "yield after N seconds / N items so other waiters can run" mode, so a long backlog drain and the routine cycle can interleave instead of one starving the other.
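A bounded drain along the lines of fix 3 could be sketched like this. `drainWindow` is a hypothetical helper: the lock handling is assumed to live in the caller (acquire before, release after), and only the `{extracted, skipped, remaining}` report shape comes from the proposal above.

```typescript
// Hypothetical sketch of a bounded backlog drain; not an existing gbrain function.
interface DrainReport {
  extracted: number;
  skipped: number;
  remaining: number;
}

function drainWindow<T>(
  backlog: T[],
  processItem: (item: T) => boolean, // true = extracted, false = skipped
  windowMs: number,
  now: () => number = Date.now,
): DrainReport {
  const deadline = now() + windowMs;
  let extracted = 0;
  let skipped = 0;
  for (const item of backlog) {
    // Stop at the deadline so the caller can release the lock for other waiters.
    if (now() >= deadline) break;
    if (processItem(item)) {
      extracted++;
    } else {
      skipped++;
    }
  }
  return { extracted, skipped, remaining: backlog.length - extracted - skipped };
}
```

The deadline is only checked between items, so a single slow item can overrun the window. A real implementation would probably also want an item-count bound, as fix 4 suggests.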

---

## Repro (Problem 1, the headline)

1. Run a worker with `--max-rss 8192` (or the `2048` default) on a brain with a real embed backlog (17K+ stale chunks).
2. Trigger an `embed-backfill` / contextual-reindex pass; RSS climbs to ~9.8GB.
3. Observe `[watchdog] rss=9811MB threshold=8192MB ... — draining` → `worker_exited code=1 likely_cause=runtime_error` → respawn → repeat every 3–5 min.
4. Note the *visible* errors are all DB-connection/lock failures (the downstream cascade), not the watchdog line.

**Fix that stopped it for us:** raise `--max-rss` to 16384 (box has 126GB; embed working set ~10GB fits with headroom). Worker then ran multi-hour with zero crashes.
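For reference, the auto-sizing rule proposed under Problem 1 (`min(0.25 * total_ram, 16384)`) lands on the same value for this box. A minimal sketch, assuming a hypothetical `resolveMaxRssMb` helper; in Node or Bun the total would come from `os.totalmem()`, which returns bytes.

```typescript
// Hypothetical sketch of the auto-sizing rule from Problem 1, fix 2; names are illustrative.
const AUTO_CAP_CEILING_MB = 16384;
const AUTO_CAP_RAM_FRACTION = 0.25;

// totalRamMb would be os.totalmem() / (1024 * 1024) in a Node/Bun process.
function resolveMaxRssMb(explicitMb: number | undefined, totalRamMb: number): number {
  if (explicitMb !== undefined) return explicitMb; // the explicit flag still wins
  return Math.floor(Math.min(AUTO_CAP_RAM_FRACTION * totalRamMb, AUTO_CAP_CEILING_MB));
}

const boxRamMb = 126 * 1024; // the 126GB box from this incident
const autoCap = resolveMaxRssMb(undefined, boxRamMb); // 0.25 * 129024 = 32256, capped to 16384
const smallBox = resolveMaxRssMb(undefined, 8 * 1024); // 0.25 * 8192 = 2048
const explicitCap = resolveMaxRssMb(8192, boxRamMb); // flag overrides auto-sizing
```

Note that on an 8GB machine this rule yields 2048MB, the same as today's default, so auto-sizing alone would not help small boxes that run embeddings.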

## Asks (priority order)
1. Distinct exit code + `likely_cause=rss-watchdog` on watchdog kills (P1 — makes this 5-minute-diagnosable instead of 5-hour).
2. Sane default cap (auto-size to RAM) + crash-loop breaker keyed on cause (P1).
3. Pool-reaped self-heal in `renewLock`/`promoteDelayed`/`claim` + fix the misleading `engine.sql` getter fallback (P2 — finishes PR #1669's intent for the Minion lock paths).
4. Surface silently-skipped pack-gated phases + `doctor` backlog metric + first-class bounded `--drain` mode with cooperative lock yielding (P2/P3 — stops invisible backlogs).

