Worker crash-loop (400+/24h) from too-low RSS watchdog default, mis-reported as connection/lock failure + silently-starved lens-phase backlog #1678

@garrytan-agents

Worker crash-loop (400+/24h) from a too-low default RSS watchdog masquerading as a connection/lock failure — plus a silently-starved lens phase backlog

Version: gbrain 0.41.34.0
Severity: High (production worker fully wedged for ~24h; brain processing halted)
Environment: Supabase transaction-mode pooler (port 6543, prepare:false); 126GB box; minion supervisor + 1 worker, concurrency 3.

TL;DR

A production worker crash-looped every 3–5 min for 24h+ (400+ exits). The proximate logs all pointed at the database: CONNECTION_ENDED, No database connection: connect() has not been called, lock-renewal-failed, cycle_already_running. None of those were the cause. The real cause was the RSS watchdog SIGTERM-killing the worker when an embed/contextual-reindex pass blew past the memory cap. Every DB error downstream was a symptom of the worker being shot mid-cycle. We spent hours chasing the connection ghost (and even opened a connection-retry PR, #1669) before the watchdog log line revealed the truth.

Separately, this incident surfaced a second long-standing bug: extract_atoms (and other lens phases) silently never run in the routine cycle, so an atom-extraction backlog (~686 pages) accumulated indefinitely with zero signal.

This issue proposes fixes for three distinct systemic problems so this can't recur for us or anyone else.


Problem 1 — RSS watchdog default is dangerously low and the kill is indistinguishable from a real crash

What happens

worker.ts:checkMemoryLimit() compares post-job RSS against maxRssMb; if the cap is exceeded, it logs [watchdog ...] rss=NNNNMB threshold=NNNNMB ... — draining, then drains and exits. The supervisor then sees worker_exited code=1 ... likely_cause=runtime_error and respawns. On a box where the embed/reindex working set legitimately needs ~10GB, an 8192MB cap (or worse, the gbrain default of 2048MB, set at src/commands/jobs.ts:785 and :1026) guarantees a kill on every heavy cycle → infinite loop.

Why it's so hard to diagnose

  1. The watchdog exit is code=1 and the supervisor labels it likely_cause=runtime_error, identical to an actual crash. (child-worker-supervisor.ts:287 treats code=0 as clean drain, but the watchdog path exits non-zero.)
  2. The visible errors are all the downstream DB failures from being SIGTERM'd mid-cycle (pooler reaps the socket → CONNECTION_ENDED → engine nulls _sql → every subsequent call throws No database connection → lock-renewal-failed → abort). The actual [watchdog] line scrolls by once per cycle and is easy to miss among hundreds of DB error lines.
  3. 2048MB default is absurdly low for any brain doing embeddings; most operators won't know to raise it until they hit this.

Code refs

  • src/core/minions/worker.ts:594 checkMemoryLimit(), :611 the watchdog log, :803 checkMemoryLimit('post-job')
  • src/commands/jobs.ts:785 const maxRssMb = maxRssExplicit ?? 2048;, :1026 parseMaxRssFlag(args) ?? 2048
  • src/core/minions/child-worker-supervisor.ts:287 (code=0 clean) vs the non-zero watchdog exit

Proposed fixes

  1. Make the watchdog exit self-identifying. Exit with a distinct, reserved code (e.g. 137-style or a dedicated code=2 + reason=rss-watchdog), and have the supervisor log likely_cause=rss-watchdog rss=NNNN threshold=NNNN instead of the generic runtime_error. One glance at worker_exited should say "OOM cap," not "runtime error."
  2. Raise the default cap and/or auto-size it. 2048MB is a footgun. Either bump the default to something embed-realistic (8–16GB) or default to a fraction of detected system memory (e.g. min(0.25 * total_ram, 16384)), with the explicit flag still overriding.
  3. Crash-loop circuit breaker keyed on cause. When the supervisor sees N consecutive rss-watchdog exits within a window, it should (a) emit a loud one-line operator alert ("worker OOM-looping: cap=NNNN, peak=NNNN, raise --max-rss"), and (b) optionally back off respawn instead of hot-looping 400×/day. Today max_crashes=10 exists but the loop still effectively never stops because the count resets / the cause is opaque.
  4. Per-job-kind RSS expectation + soft warn. Track peak RSS per job kind (embed-backfill, autopilot-cycle, etc.). Warn at e.g. 80% of cap before the kill, so operators get a heads-up ("embed-backfill peaked at 9.8GB, cap 8GB — next run will be killed") rather than silent death.
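
Fixes 1, 3, and 4 could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not gbrain code: the reserved exit code value 2, classifyExit, shouldSoftWarn, and CauseKeyedBreaker are all hypothetical names, and the 80% fraction and window sizes are placeholders.

```typescript
// Hypothetical sketch; none of these names exist in gbrain today.
const RSS_WATCHDOG_EXIT_CODE = 2; // assumed reserved code (fix 1)

type ExitCause = "clean_drain" | "rss_watchdog" | "runtime_error";

// Map a worker exit code to a cause the supervisor can log directly.
function classifyExit(code: number): ExitCause {
  if (code === 0) return "clean_drain";
  if (code === RSS_WATCHDOG_EXIT_CODE) return "rss_watchdog";
  return "runtime_error";
}

// Fix 4: warn once peak RSS reaches a fraction of the cap (80% assumed).
function shouldSoftWarn(peakMb: number, capMb: number, fraction = 0.8): boolean {
  return peakMb >= capMb * fraction;
}

// Fix 3: trips after maxExits consecutive rss_watchdog exits inside windowMs.
class CauseKeyedBreaker {
  private exits: number[] = []; // timestamps (ms) of recent watchdog exits
  private readonly maxExits: number;
  private readonly windowMs: number;

  constructor(maxExits: number, windowMs: number) {
    this.maxExits = maxExits;
    this.windowMs = windowMs;
  }

  // Returns true when the supervisor should alert and back off respawning.
  record(cause: ExitCause, nowMs: number): boolean {
    if (cause !== "rss_watchdog") {
      this.exits = []; // any other exit breaks the consecutive run
      return false;
    }
    // Drop exits that have aged out of the window, then count this one.
    this.exits = this.exits.filter((t) => nowMs - t < this.windowMs);
    this.exits.push(nowMs);
    return this.exits.length >= this.maxExits;
  }
}
```

With a limit of 3 and a 30-minute window, a worker dying every 3–5 min would trip the breaker on its third watchdog exit rather than hot-looping all day.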

Problem 2 — Downstream DB-error cascade has no "the pool got reaped, rebuild it" recovery in the hot lock paths

Even independent of the OOM trigger, a transaction-mode pooler will reap idle sockets between lock-renewal ticks. When postgres.js throws CONNECTION_ENDED, the engine nulls _sql, and the engine.sql getter falls through to the module-level singleton (db.ts:154) which was never connected on an instance-pool worker → throws No database connection: connect() has not been called. After that, every call (promoteDelayed, renewLock, claim) fails until process restart.

Code refs

  • src/core/db.ts:154-155 the "connect() has not been called" throw
  • src/core/minions/queue.ts renewLock → raw executeRaw with no retry/reconnect
  • src/core/minions/worker.ts:463 promoteDelayed() raw path, no retry
  • src/core/retry-matcher.ts (already lists CONNECTION_ENDED / No database connection patterns)
  • src/core/minions/lock-renewal-tick.ts (runLockRenewalTick, lock-renewal-failed)

Proposed fixes

  1. Wrap the hot raw-SQL paths (renewLock, promoteDelayed, claim) in withRetry with a real resolveReconnect() that rebuilds the instance pool (engine.reconnect()), not the module singleton. This is what PR #1669 (fix(db-lock): self-heal cycle-lock refresh on pooler-reaped CONNECTION_ENDED) began in db-lock.ts and retry-matcher.ts, but it must cover the Minion job-lock paths, not just the cycle lock.
  2. Fix the engine.sql getter fallback. When _sql is null on an instance-pool engine, it should rebuild its own pool, not silently fall through to the never-connected module singleton and throw a misleading error. The current fallthrough is what makes the failure look like "you never called connect()" when in fact the pool was reaped.
  3. Classify CONNECTION_ENDED as retryable everywhere (by code and by message). The patterns are confirmed present in retry-matcher.ts, but verify that all call sites actually route through it.
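
The shape of fix 1 might look like the sketch below. It is self-contained and does not use gbrain's actual withRetry or engine types, whose signatures are not shown in this issue: ReconnectableEngine, isRetryable, and retryWithReconnect are hypothetical stand-ins, and the two patterns are only the ones named above.

```typescript
// Hypothetical shapes; gbrain's real withRetry / engine.reconnect() may differ.
interface ReconnectableEngine {
  reconnect(): Promise<void>; // assumed to rebuild the instance pool
}

// The two failure signatures named in this issue.
const RETRYABLE_PATTERNS = [/CONNECTION_ENDED/, /No database connection/];

// Match on both the error code and the message.
function isRetryable(err: unknown): boolean {
  const e = err as { code?: unknown; message?: unknown } | null;
  const text = `${String(e?.code ?? "")} ${String(e?.message ?? err)}`;
  return RETRYABLE_PATTERNS.some((p) => p.test(text));
}

// Run op; on a pool-reaped error, rebuild the pool and try again.
async function retryWithReconnect<T>(
  engine: ReconnectableEngine,
  op: () => Promise<T>,
  maxAttempts = 3,
): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastErr = err;
      // Non-retryable errors and the final attempt propagate unchanged.
      if (!isRetryable(err) || attempt === maxAttempts) throw err;
      await engine.reconnect();
    }
  }
  throw lastErr; // not reached; keeps the return type total
}
```

A call site would then read something like retryWithReconnect(engine, () => renewLock(jobId)), where renewLock and jobId stand in for whatever the real hot path passes.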

Problem 3 — extract_atoms / lens phases silently never run in the routine cycle → unbounded backlog with zero signal

What happens

The routine 5-min autopilot-cycle runs opts.phases ?? ALL_PHASES (cycle.ts:1285), but each lens phase is pack-gated: extract_atoms only executes if the active schema pack declares it (cycle.ts:1583, packDeclaresPhase). If the pack doesn't declare it, the phase no-ops with extract_atoms: active pack does not declare this phase (cycle.ts:1588) and is reported as a silent, "successful" skip. Operators see a green cycle and assume atoms are being extracted. They aren't. The backlog grows forever.

The workarounds people reach for also fail:

  • A separate 6h lens cron (running the phase out-of-band) starves against the 5-min autopilot's single global cycle lock — it loses the lock fight ~every time and no-ops in ~4s with cycle_already_running.
  • Parallel shell-minions can't work at all: one global cycle lock means every parallel dream --phase just hits "Skipped: another cycle is already running (locked)."

Code refs

  • src/core/cycle.ts:1285 const phases = opts.phases ?? ALL_PHASES
  • src/core/cycle.ts:1573-1588 the pack-gate that turns a missing declaration into a silent skip
  • src/core/cycle.ts:773-800 packDeclaresPhase / phases: resolution

Proposed fixes

  1. Surface the silent skip as a health signal. A pack-gated phase that skips should increment a visible counter / emit a doctor warning: "extract_atoms skipped (pack does not declare it) — N eligible pages pending, backlog growing." Never let "I did nothing" report as a clean success with no backlog visibility.
  2. doctor backlog metric. Add a doctor check: "extract_atoms backlog = N eligible pages, last successful extraction = T." Anything stale/growing should warn. (Same for synthesize_concepts, conversation_facts.)
  3. A first-class, bounded backlog-drain mode for slow phases. dream --phase extract_atoms --drain --window <seconds> that holds the lock for a bounded window, processes a chunk, and releases cooperatively so the 5-min autopilot isn't starved — with progress in the JSON report ({extracted, skipped, remaining}). This makes the "slow phase with a big backlog" case a supported first-class operation instead of a brittle cron hack.
  4. Cooperative lock yielding. The single global cycle lock should support a "yield after N seconds / N items so other waiters can run" mode, so a long backlog drain and the routine cycle can interleave instead of one starving the other.
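
The bounded drain in fix 3 could be sketched as below. This is an illustration only: drainWithinWindow, DrainReport, and processPage are hypothetical names, the report fields are the {extracted, skipped, remaining} proposed above, and acquiring and releasing the cycle lock around the call is deliberately left out.

```typescript
// Hypothetical sketch of a time-bounded backlog drain.
interface DrainReport {
  extracted: number;
  skipped: number;
  remaining: number;
}

// processPage is assumed to resolve true when a page was extracted,
// false when it was skipped. The clock is injectable for testing.
async function drainWithinWindow<P>(
  pending: P[],
  processPage: (page: P) => Promise<boolean>,
  windowMs: number,
  now: () => number = Date.now,
): Promise<DrainReport> {
  const deadline = now() + windowMs;
  let extracted = 0;
  let skipped = 0;
  for (const page of pending) {
    // Stop before starting a page once the window is spent, so the
    // caller can release the lock and let the routine cycle run.
    if (now() >= deadline) break;
    if (await processPage(page)) extracted++;
    else skipped++;
  }
  const processed = extracted + skipped;
  return { extracted, skipped, remaining: pending.length - processed };
}
```

The window is only checked between pages, so a single slow page can overrun it; a real implementation would probably also need a per-page timeout.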

Repro (Problem 1, the headline)

  1. Run a worker with --max-rss 8192 (or the 2048 default) on a brain with a real embed backlog (17K+ stale chunks).
  2. Trigger an embed-backfill / contextual-reindex pass; RSS climbs to ~9.8GB.
  3. Observe [watchdog] rss=9811MB threshold=8192MB ... — draining → worker_exited code=1 likely_cause=runtime_error → respawn → repeat every 3–5 min.
  4. Note the visible errors are all DB-connection/lock failures (the downstream cascade), not the watchdog line.

Fix that stopped it for us: raise --max-rss to 16384 (box has 126GB; embed working set ~10GB fits with headroom). Worker then ran multi-hour with zero crashes.
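
For comparison, the auto-sizing formula proposed under Problem 1, min(0.25 * total_ram, 16384), would have picked the same value on this box. A minimal sketch, assuming a hypothetical resolveMaxRssMb helper (not an existing gbrain function) and that an explicit --max-rss value always wins:

```typescript
import { totalmem } from "node:os";

const HARD_CEILING_MB = 16384; // upper bound from the proposed formula
const RAM_FRACTION = 0.25; // fraction of system memory from the proposal

// Hypothetical helper: an explicit flag value overrides auto-sizing.
function resolveMaxRssMb(
  explicitMb: number | undefined,
  totalRamMb: number = Math.floor(totalmem() / (1024 * 1024)),
): number {
  if (explicitMb !== undefined) return explicitMb;
  return Math.min(Math.floor(RAM_FRACTION * totalRamMb), HARD_CEILING_MB);
}

// 126GB box: 0.25 * 129024MB = 32256MB, clamped to the 16384MB ceiling.
console.log(resolveMaxRssMb(undefined, 126 * 1024)); // 16384
// 16GB box: 0.25 * 16384MB = 4096MB.
console.log(resolveMaxRssMb(undefined, 16 * 1024)); // 4096
```

Note that on an 8GB box this formula lands back on 2048MB, so a floor may be worth considering alongside the ceiling.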

Asks (priority order)

  1. Distinct exit code + likely_cause=rss-watchdog on watchdog kills (P1 — makes this 5-minute-diagnosable instead of 5-hour).
  2. Sane default cap (auto-size to RAM) + crash-loop breaker keyed on cause (P1).
  3. Pool-reaped self-heal in renewLock/promoteDelayed/claim + fix the misleading engine.sql getter fallback (P2; finishes the intent of PR #1669, fix(db-lock): self-heal cycle-lock refresh on pooler-reaped CONNECTION_ENDED, for the Minion lock paths).
  4. Surface silently-skipped pack-gated phases + doctor backlog metric + first-class bounded --drain mode with cooperative lock yielding (P2/P3 — stops invisible backlogs).
