[Meta/Posture] Brain health should be a solved problem — zero forensics, self-diagnosing, self-healing, loud-but-precise
Companion to #1678 (which has the three concrete technical bugs + line-level fixes). This issue is the posture-level ask that #1678's bugs are evidence for.
The real failure
A production worker died 400+ times in 24h. A page backlog grew unbounded for weeks. Total operator/agent time to figure out why: hours — grepping worker logs, reading source, opening a wrong-track PR, tracing a connection cascade that turned out to be a symptom, not the disease.
That is the bug. Not the memory cap, not the lock path, not the pack-gate. The bug is that gbrain made a human/agent do forensic archaeology to discover its own failure mode — and while doing so, its own signals actively pointed the wrong way:
- The worker died with code=1 likely_cause=runtime_error when the real cause was an RSS-watchdog OOM kill.
- The loud, repeated errors were all downstream DB noise (CONNECTION_ENDED, No database connection, lock-renewal-failed), all symptoms of being SIGTERM'd mid-cycle.
- The one log line that named the truth ([watchdog] rss=9811MB threshold=8192MB) scrolled by once per cycle, buried.
- The growing atom backlog reported itself as a clean, successful cycle (extract_atoms: active pack does not declare this phase → green).
The brand promise is "the agent runs while you sleep." The reality was "you wake up to a crime scene that needs a detective." That gap is what this issue is about.
The standard gbrain should hold itself to
An operator (human or agent) should never have to grep a worker log or read source to learn that the brain is unhealthy or why. Three capabilities make that true:
1. Self-diagnosing: gbrain doctor is the single source of truth for "is the brain healthy," and it reports cause, not symptoms.
Every failure in this incident is a question doctor should already answer in one command:
- "Worker is OOM-looping: cap=8192MB, peak=9811MB on embed-backfill, 400 kills/24h → raise --max-rss or it will keep dying." (Today: invisible; you must grep.)
- "Atom-extraction backlog = 686 eligible pages, last successful extraction = 14 days ago, growing." (Today: invisible; reported as healthy.)
- "Lens phase extract_atoms is enabled but the active pack doesn't declare it → it has been no-op'ing every cycle." (Today: a silent green skip.)
- "DB pool was reaped N times in the last hour and is not auto-recovering." (Today: surfaces as a misleading 'connect() not called'.)
doctor should aggregate worker-supervisor state, job-queue backlogs per kind, phase freshness, pool health, and memory headroom — and rank by cause with a one-line fix for each. If a class of failure can happen, doctor should be able to name it. The acceptance test: every diagnosis in this incident is a single gbrain doctor line, no grep, no source-reading.
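A minimal sketch of what cause-ranked aggregation could look like, assuming a hypothetical state snapshot. The field names, severity scale, `Finding` type, and the fix wording for the phase and pool cases are all illustrative and are not gbrain's actual schema or output; only the incident numbers and the --max-rss flag come from this issue.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Finding:
    cause: str     # root-cause name (hypothetical identifiers)
    severity: int  # higher sorts first; the scale is an assumption
    detail: str
    fix: str


def diagnose(state: dict) -> List[Finding]:
    """Turn a health snapshot into findings ranked by cause severity."""
    findings: List[Finding] = []

    worker = state["worker"]
    if worker["oom_kills_24h"] > 0:
        findings.append(Finding(
            cause="worker_oom_loop",
            severity=3,
            detail=(f"cap={worker['rss_cap_mb']}MB, peak={worker['peak_rss_mb']}MB on "
                    f"{worker['job_kind']}, {worker['oom_kills_24h']} kills/24h"),
            fix="raise --max-rss",
        ))

    for phase in state["phases"]:
        # An enabled phase the pack never declares is a silent no-op, not a success.
        if phase["enabled"] and not phase["declared_by_pack"]:
            findings.append(Finding(
                cause="phase_not_declared",
                severity=2,
                detail=(f"{phase['name']} is enabled but the active pack does not "
                        f"declare it; backlog={phase['backlog']} eligible pages"),
                fix="declare the phase in the pack or disable it",
            ))

    pool = state["pool"]
    if pool["reaped_last_hour"] > 0 and not pool["recovered"]:
        findings.append(Finding(
            cause="db_pool_reaped",
            severity=2,
            detail=f"pool reaped {pool['reaped_last_hour']} times in the last hour",
            fix="rebuild the pool or restart the worker",
        ))

    # sorted() is stable, so equal-severity findings keep insertion order.
    return sorted(findings, key=lambda f: f.severity, reverse=True)


def render(findings: List[Finding]) -> List[str]:
    return [f"[{f.cause}] {f.detail} -> fix: {f.fix}" for f in findings]


# Snapshot shaped after the numbers reported in this incident.
incident = {
    "worker": {"rss_cap_mb": 8192, "peak_rss_mb": 9811,
               "job_kind": "embed-backfill", "oom_kills_24h": 400},
    "phases": [{"name": "extract_atoms", "enabled": True,
                "declared_by_pack": False, "backlog": 686}],
    "pool": {"reaped_last_hour": 0, "recovered": True},
}

for line in render(diagnose(incident)):
    print(line)
```

The point of the shape is that each check emits a cause plus a fix, so the report never needs a second pass through logs to be actionable.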
2. Self-healing: the mechanical cases fix themselves, no human and no agent in the loop.
None of these need judgment — they're deterministic:
- Auto-size the RSS cap to a fraction of detected system RAM (e.g. min(0.25 × total, 16384)), explicit flag still overrides. A 2GB default on a 126GB box is a footgun that guarantees this incident on any embedding brain.
- Auto-drain slow backlogs on a cadence with cooperative lock-yielding, so extract_atoms/synthesize_concepts never silently pile up. A backlog should be a transient, not a permanent state nobody sees.
- Rebuild a reaped pool in-process (renewLock/promoteDelayed/claim) instead of wedging every subsequent call until restart.
- Cause-aware crash-loop breaker: N consecutive same-cause kills → stop hot-looping, emit one loud alert, back off. 400 identical deaths should trip a breaker at ~5, not run all day.
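Two of the cases above are small enough to sketch directly. The following is an illustration under stated assumptions, not gbrain's implementation: `auto_rss_cap_mb`, `CrashLoopBreaker`, and the action names "restart", "alert", and "backoff" are hypothetical; the formula and the threshold of roughly 5 come from the bullets above.

```python
from typing import Optional


def auto_rss_cap_mb(total_ram_mb: int, explicit_mb: Optional[int] = None) -> int:
    """Size the RSS cap as min(0.25 x total, 16384) MB unless a flag overrides it."""
    if explicit_mb is not None:
        return explicit_mb  # an explicit flag always wins
    return min(int(0.25 * total_ram_mb), 16384)


class CrashLoopBreaker:
    """Stop hot-looping after N consecutive exits with the same cause."""

    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self.last_cause: Optional[str] = None
        self.streak = 0
        self.tripped = False

    def record_exit(self, cause: str) -> str:
        # A different cause starts a fresh streak and re-arms the breaker.
        if cause != self.last_cause:
            self.last_cause = cause
            self.streak = 0
            self.tripped = False
        self.streak += 1

        if self.tripped:
            return "backoff"   # already alerted; stay quiet and back off
        if self.streak >= self.threshold:
            self.tripped = True
            return "alert"     # emitted exactly once per streak
        return "restart"


# A 126GB box, expressed in MB, with no explicit override.
print(auto_rss_cap_mb(126 * 1024))

breaker = CrashLoopBreaker(threshold=5)
actions = [breaker.record_exit("oom") for _ in range(400)]
print(actions.count("restart"), actions.count("alert"), actions.count("backoff"))
```

Keying the streak on cause rather than on exit count alone is what makes the breaker cause-aware: a mix of unrelated failures keeps restarting, while 400 identical deaths produce one alert.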
3. Loud-but-precise: when it genuinely can't self-heal, the first line names the real cause.
Not 200 lines of downstream noise burying one truth. The worker_exited line itself should read:
worker_exited reason=oom cap=8192 peak=9811 job_kind=embed-backfill fix="raise --max-rss; box has 126GB"
Diagnosis in one glance, not one hour. Downstream cascade errors should be tagged as consequences (secondary=true cause_ref=oom-kill-<id>) so they can't masquerade as the root cause.
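As a rough sketch of how this could be wired, assuming the supervisor can see whether the watchdog initiated the kill: the function names, the `watchdog_kill` record, and the `oom-kill-42` id below are hypothetical. The emitted field layout mirrors the example line above.

```python
from typing import Optional


def classify_exit(exit_code: int, watchdog_kill: Optional[dict]) -> str:
    """Prefer the watchdog's own record over the generic exit code."""
    if watchdog_kill is not None:
        return "oom"  # the watchdog killed the worker; the exit code is a side effect
    return "runtime_error" if exit_code != 0 else "clean"


def format_worker_exit(reason: str, cap_mb: int, peak_mb: int,
                       job_kind: str, fix: str) -> str:
    """Build the single exit line that names the cause and the fix."""
    return (f"worker_exited reason={reason} cap={cap_mb} peak={peak_mb} "
            f'job_kind={job_kind} fix="{fix}"')


def tag_secondary(line: str, cause_ref: str) -> str:
    """Mark a downstream error as a consequence of an already-named cause."""
    return f"{line} secondary=true cause_ref={cause_ref}"


# Watchdog record shaped after the log line quoted earlier in this issue.
kill = {"rss_mb": 9811, "threshold_mb": 8192}
reason = classify_exit(exit_code=1, watchdog_kill=kill)

print(format_worker_exit(reason, kill["threshold_mb"], kill["rss_mb"],
                         "embed-backfill", "raise --max-rss; box has 126GB"))
print(tag_secondary("CONNECTION_ENDED", "oom-kill-42"))
```

With the cause attached at the source, anything logged after the kill can carry a reference back to it instead of competing with it for attention.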
Why this matters beyond us
This is the difference between gbrain being "a powerful engine you must babysit" and "infrastructure you can trust unattended." Every operator who runs an embedding brain on a real box hits the 2GB-cap OOM loop. Every operator with a pack that doesn't declare a lens phase grows an invisible backlog. The current design makes each of them re-run the same multi-hour investigation we just did. Solve it once in the product and nobody ever does this archaeology again.
Ask
Adopt "brain health is a solved problem" as a design invariant, tracked across:
- doctor as the single health truth, cause-ranked (covers the diagnosis gap).
- Self-heal the mechanical cases (auto-size cap, auto-drain, pool rebuild, cause-aware breaker).
- First-log-line-names-the-cause + secondary-error tagging (covers the misleading-signal gap).
Concrete bug-level fixes and exact code line cites live in #1678 — this issue is the umbrella the work should ladder up to.