Worker crash-loop (400+/24h) from a too-low default RSS watchdog masquerading as a connection/lock failure — plus a silently-starved lens phase backlog
A production worker crash-looped every ~5 min for 24h+ (400+ exits). The proximate logs all pointed at the database: CONNECTION_ENDED, No database connection: connect() has not been called, lock-renewal-failed, cycle_already_running. None of those were the cause. The real cause was the RSS watchdog SIGTERM-killing the worker when an embed/contextual-reindex pass blew past the memory cap. Every DB error downstream was a symptom of the worker being shot mid-cycle. We spent hours chasing the connection ghost (even opened a connection-retry PR, #1669) before the watchdog log line revealed the truth.
Separately, this incident surfaced a second long-standing bug: extract_atoms (and other lens phases) silently never run in the routine cycle, so an atom-extraction backlog (~686 pages) accumulated indefinitely with zero signal.
This issue proposes fixes for three distinct systemic problems so this can't recur for us or anyone else.
Problem 1 — RSS watchdog default is dangerously low and the kill is indistinguishable from a real crash
What happens
worker.ts:checkMemoryLimit() compares post-job RSS against maxRssMb; if exceeded it logs [watchdog ...] rss=NNNNMB threshold=NNNNMB ... — draining and drains/exits. The supervisor then sees worker_exited code=1 ... likely_cause=runtime_error and respawns. On a box where the embed/reindex working set legitimately needs ~10GB, an 8192MB cap (or worse, the gbrain default of 2048MB — src/commands/jobs.ts:785 and :1026) guarantees a kill on every heavy cycle → infinite loop.
Why it's so hard to diagnose
The watchdog exit is code=1 and the supervisor labels it likely_cause=runtime_error — identical to an actual crash. (child-worker-supervisor.ts:287 treats code=0 as clean drain, but the watchdog path exits non-zero.)
The visible errors are all the downstream DB failures from being SIGTERM'd mid-cycle (pooler reaps the socket → CONNECTION_ENDED → engine nulls _sql → every subsequent call throws No database connection → lock-renewal-failed → abort). The actual [watchdog] line scrolls by once per cycle and is easy to miss among hundreds of DB error lines.
2048MB default is absurdly low for any brain doing embeddings; most operators won't know to raise it until they hit this.
Code refs
src/core/minions/worker.ts:594 checkMemoryLimit(), :611 the watchdog log, :803 checkMemoryLimit('post-job')
src/core/minions/child-worker-supervisor.ts:287 (code=0 clean) vs the non-zero watchdog exit
Proposed fixes
Make the watchdog exit self-identifying. Exit with a distinct, reserved code (e.g. 137-style or a dedicated code=2 + reason=rss-watchdog), and have the supervisor log likely_cause=rss-watchdog rss=NNNN threshold=NNNN instead of the generic runtime_error. One glance at worker_exited should say "OOM cap," not "runtime error."
Raise the default cap and/or auto-size it. 2048MB is a footgun. Either bump the default to something embed-realistic (8–16GB) or default to a fraction of detected system memory (e.g. min(0.25 * total_ram, 16384)), with the explicit flag still overriding.
Crash-loop circuit breaker keyed on cause. When the supervisor sees N consecutive rss-watchdog exits within a window, it should (a) emit a loud one-line operator alert ("worker OOM-looping: cap=NNNN, peak=NNNN, raise --max-rss"), and (b) optionally back off respawn instead of hot-looping 400×/day. Today max_crashes=10 exists but the loop still effectively never stops because the count resets / the cause is opaque.
Per-job-kind RSS expectation + soft warn. Track peak RSS per job kind (embed-backfill, autopilot-cycle, etc.). Warn at e.g. 80% of cap before the kill, so operators get a heads-up ("embed-backfill peaked at 9.8GB, cap 8GB — next run will be killed") rather than silent death.
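The four fixes above can be sketched together as follows. This is a minimal illustration, not gbrain's implementation: resolveMaxRssMb, classifyExit, CrashLoopBreaker, shouldWarn, and the choice of exit code 2 are hypothetical names and values picked for the example. Only the min(0.25 * total_ram, 16384) formula, the 80% warning threshold, and the rss-watchdog label come from the proposals.

```typescript
// Hypothetical names throughout; none of these are gbrain APIs.

/** Reserved exit code for watchdog drains (assumption: 2 is otherwise unused). */
const EXIT_RSS_WATCHDOG = 2;

/** Default cap: a quarter of system RAM, capped at 16384 MB, unless a flag is given. */
function resolveMaxRssMb(totalRamMb: number, explicitMb?: number): number {
  if (explicitMb !== undefined) return explicitMb; // explicit --max-rss always wins
  return Math.min(Math.floor(0.25 * totalRamMb), 16384);
}

type ExitCause = "clean_drain" | "rss-watchdog" | "runtime_error";

/** Map a worker exit code to the label the supervisor would log. */
function classifyExit(code: number): ExitCause {
  if (code === 0) return "clean_drain";
  if (code === EXIT_RSS_WATCHDOG) return "rss-watchdog";
  return "runtime_error";
}

/** True once peak RSS reaches the warn fraction of the cap (80% by default). */
function shouldWarn(peakRssMb: number, capMb: number, fraction = 0.8): boolean {
  return peakRssMb >= capMb * fraction;
}

/** Counts consecutive exits with the same cause inside a time window. */
class CrashLoopBreaker {
  private exits: { cause: ExitCause; atMs: number }[] = [];
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  /** Record an exit; returns true when the breaker should trip (alert and back off). */
  record(cause: ExitCause, atMs: number): boolean {
    if (cause === "clean_drain") {
      this.exits = []; // a clean drain breaks the streak
      return false;
    }
    // Keep only earlier exits that share this cause and are still inside the window.
    this.exits = this.exits.filter(
      (e) => e.cause === cause && atMs - e.atMs <= this.windowMs,
    );
    this.exits.push({ cause, atMs });
    return this.exits.length >= this.limit;
  }
}
```

With the 126GB box from this incident, resolveMaxRssMb(126 * 1024) yields 16384, the same cap that stopped the loop; an explicit flag still takes precedence.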
Problem 2 — Downstream DB-error cascade has no "the pool got reaped, rebuild it" recovery in the hot lock paths
Even independent of the OOM trigger, a transaction-mode pooler will reap idle sockets between lock-renewal ticks. When postgres.js throws CONNECTION_ENDED, the engine nulls _sql, and the engine.sql getter falls through to the module-level singleton (db.ts:154) which was never connected on an instance-pool worker → throws No database connection: connect() has not been called. After that, every call (promoteDelayed, renewLock, claim) fails until process restart.
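A toy model of that fallthrough shows why the resulting error message points away from the real cause. It assumes nothing about the real engine beyond the description above; ToyEngine, getModuleSql, and onConnectionEnded are made-up names.

```typescript
// Toy model only: names are hypothetical, not gbrain's real code.
type Sql = (query: string) => string;

// Module-level singleton; on an instance-pool worker it is never connected.
let moduleSql: Sql | null = null;

function getModuleSql(): Sql {
  if (!moduleSql) {
    throw new Error("No database connection: connect() has not been called");
  }
  return moduleSql;
}

class ToyEngine {
  // Instance pool, connected at startup.
  private _sql: Sql | null = (q) => `ok: ${q}`;

  /** Mirrors the described getter: a null instance pool falls through to the singleton. */
  get sql(): Sql {
    return this._sql ?? getModuleSql();
  }

  /** What the engine is described as doing when CONNECTION_ENDED is thrown. */
  onConnectionEnded(): void {
    this._sql = null;
  }
}
```

After onConnectionEnded() runs, every access to sql throws the "connect() has not been called" error even though the engine was connected moments earlier.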
Code refs
src/core/db.ts:154-155 the "connect() has not been called" throw
src/core/minions/queue.ts renewLock → raw executeRaw with no retry/reconnect
src/core/minions/worker.ts:463 promoteDelayed() raw path, no retry
src/core/retry-matcher.ts (already lists CONNECTION_ENDED / No database connection patterns)
Proposed fixes
Wrap the hot raw-SQL paths (renewLock, promoteDelayed, claim) in withRetry with a real resolveReconnect() that rebuilds the instance pool (engine.reconnect()), not the module singleton. This is what PR #1669 (fix(db-lock): self-heal cycle-lock refresh on pooler-reaped CONNECTION_ENDED) began in db-lock.ts and retry-matcher.ts, but it must cover the Minion job-lock paths, not just the cycle lock.
Fix the engine.sql getter fallback. When _sql is null on an instance-pool engine, it should rebuild its own pool, not silently fall through to the never-connected module singleton and throw a misleading error. The current fallthrough is what makes the failure look like "you never called connect()" when in fact the pool was reaped.
Classify CONNECTION_ENDED as retryable everywhere (code + message). The patterns are confirmed present in retry-matcher.ts, but verify that all call sites actually route through it.
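A minimal sketch of the retry-with-reconnect shape, under stated assumptions: the withRetry signature, RetryOptions, and isPoolReaped below are illustrative and may not match the helpers that exist in gbrain. The reconnect callback stands in for whatever rebuilds the instance pool (engine.reconnect() in the proposal).

```typescript
// Sketch only: this withRetry signature and the helper names are assumptions.
interface RetryOptions {
  attempts: number;                       // total tries, including the first
  isRetryable: (err: unknown) => boolean; // e.g. backed by retry-matcher patterns
  reconnect: () => Promise<void>;         // should rebuild the *instance* pool
}

/** Matches the two failure signatures named in this issue, by code or message. */
function isPoolReaped(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const { code, message } = err as { code?: unknown; message?: unknown };
  const text = typeof message === "string" ? message : "";
  return (
    code === "CONNECTION_ENDED" ||
    text.includes("CONNECTION_ENDED") ||
    text.includes("No database connection")
  );
}

async function withRetry<T>(fn: () => Promise<T>, opts: RetryOptions): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 1; attempt <= opts.attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      // Give up on non-retryable errors or when out of attempts.
      if (!opts.isRetryable(err) || attempt === opts.attempts) throw err;
      await opts.reconnect(); // rebuild the pool before the next try
    }
  }
  throw lastErr; // only reached when attempts < 1
}
```

Reconnecting before each retry, rather than only waiting, matters here because a socket reaped by the pooler will not come back by itself.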
Problem 3 — extract_atoms / lens phases silently never run in the routine cycle → unbounded backlog with zero signal
What happens
The routine 5-min autopilot-cycle runs opts.phases ?? ALL_PHASES (cycle.ts:1285), but each lens phase is pack-gated: extract_atoms only executes if the active schema pack declares it (cycle.ts:1583 → packDeclaresPhase). If the pack doesn't declare it, the phase no-ops with extract_atoms: active pack does not declare this phase (cycle.ts:1588) — silently, as a "successful" skip. Operators see a green cycle and assume atoms are being extracted. They aren't. The backlog grows forever.
Workarounds people reach for then fail:
A separate 6h lens cron (running the phase out-of-band) starves against the 5-min autopilot's single global cycle lock — it loses the lock fight ~every time and no-ops in ~4s with cycle_already_running.
Parallel shell-minions can't work at all: one global cycle lock means every parallel dream --phase just hits "Skipped: another cycle is already running (locked)."
Proposed fixes
Surface the silent skip as a health signal. A pack-gated phase that skips should increment a visible counter / emit a doctor warning: "extract_atoms skipped (pack does not declare it) — N eligible pages pending, backlog growing." Never let "I did nothing" report as a clean success with no backlog visibility.
doctor backlog metric. Add a doctor check: "extract_atoms backlog = N eligible pages, last successful extraction = T." Anything stale/growing should warn. (Same for synthesize_concepts, conversation_facts.)
A first-class, bounded backlog-drain mode for slow phases. Add dream --phase extract_atoms --drain --window <seconds>, which holds the lock for a bounded window, processes a chunk, and releases cooperatively so the 5-min autopilot isn't starved, with progress in the JSON report ({extracted, skipped, remaining}). This makes the "slow phase with a big backlog" case a supported first-class operation instead of a brittle cron hack.
Cooperative lock yielding. The single global cycle lock should support a "yield after N seconds / N items so other waiters can run" mode, so a long backlog drain and the routine cycle can interleave instead of one starving the other.
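The drain mode and the skip signal could look roughly like this. It is a sketch with hypothetical names (drainWindow, skippedPhaseWarning, DrainReport); lock acquisition and release are left to the caller, and the clock is injectable only so the window logic can be exercised without waiting.

```typescript
// Sketch with hypothetical names; the real lock and phase APIs are not shown.
interface DrainReport {
  extracted: number;
  skipped: number;
  remaining: number;
}

/**
 * Process queued pages until the window elapses, then stop so the caller
 * can release the cycle lock and let other waiters run.
 */
async function drainWindow<T>(
  backlog: T[],
  extract: (page: T) => Promise<boolean>, // true = extracted, false = skipped
  windowMs: number,
  now: () => number = Date.now,
): Promise<DrainReport> {
  const deadline = now() + windowMs;
  const report: DrainReport = { extracted: 0, skipped: 0, remaining: backlog.length };
  for (const page of backlog) {
    if (now() >= deadline) break; // window used up: yield the lock
    if (await extract(page)) report.extracted++;
    else report.skipped++;
    report.remaining--;
  }
  return report;
}

/** A pack-gated skip that still reports how much work is pending. */
function skippedPhaseWarning(phase: string, pendingPages: number): string | null {
  if (pendingPages === 0) return null; // nothing pending, so the skip is harmless
  return `${phase} skipped (pack does not declare it): ${pendingPages} eligible pages pending`;
}
```

Because the deadline is checked before each item, a single slow item can overrun the window; a stricter version would also need a per-item timeout.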
Repro (Problem 1, the headline)
Run a worker with --max-rss 8192 (or the 2048 default) on a brain with a real embed backlog (17K+ stale chunks).
Trigger an embed-backfill / contextual-reindex pass; RSS climbs to ~9.8GB.
Observe [watchdog] rss=9811MB threshold=8192MB ... — draining → worker_exited code=1 likely_cause=runtime_error → respawn → repeat every 3–5 min.
Note the visible errors are all DB-connection/lock failures (the downstream cascade), not the watchdog line.
Fix that stopped it for us: raise --max-rss to 16384 (box has 126GB; embed working set ~10GB fits with headroom). Worker then ran multi-hour with zero crashes.
Asks (priority order)
Distinct exit code + likely_cause=rss-watchdog on watchdog kills (P1 — makes this 5-minute-diagnosable instead of 5-hour).
Sane default cap (auto-size to RAM) + crash-loop breaker keyed on cause (P1).
Retry/reconnect for renewLock/promoteDelayed/claim + fix the misleading engine.sql getter fallback (P2: finishes the intent of PR #1669, fix(db-lock): self-heal cycle-lock refresh on pooler-reaped CONNECTION_ENDED, for the Minion lock paths).
doctor backlog metric + first-class bounded --drain mode with cooperative lock yielding (P2/P3: stops invisible backlogs).
Version: gbrain 0.41.34.0
Severity: High (production worker fully wedged for ~24h; brain processing halted)
Environment: Supabase transaction-mode pooler (port 6543, prepare:false); 126GB box; minion supervisor + 1 worker, concurrency 3.
Additional code refs
Problem 1: src/commands/jobs.ts:785 const maxRssMb = maxRssExplicit ?? 2048;, :1026 parseMaxRssFlag(args) ?? 2048
Problem 2: src/core/minions/lock-renewal-tick.ts (runLockRenewalTick, lock-renewal-failed)
Problem 3: src/core/cycle.ts:1285 const phases = opts.phases ?? ALL_PHASES
Problem 3: src/core/cycle.ts:1573-1588 the pack-gate that turns a missing declaration into a silent skip
Problem 3: src/core/cycle.ts:773-800 packDeclaresPhase / phases: resolution