[Meta/Posture] Brain health should be a solved problem — zero forensics, self-diagnosing, self-healing, loud-but-precise

**Companion to #1678** (which has the three concrete technical bugs + line-level fixes). This issue is the posture-level ask that #1678's bugs are evidence *for*.

## The real failure

A production worker died **400+ times in 24h**. A page backlog grew unbounded for weeks. Total operator/agent time to figure out *why*: **hours** — grepping worker logs, reading source, opening a wrong-track PR, tracing a connection cascade that turned out to be a symptom, not the disease.

That is the bug. Not the memory cap, not the lock path, not the pack-gate. **The bug is that gbrain made a human/agent do forensic archaeology to discover its own failure mode** — and while doing so, its own signals *actively pointed the wrong way*:

- The worker exit was logged as `code=1 likely_cause=runtime_error` when the real cause was an RSS-watchdog OOM kill.
- The loud, repeated errors were all downstream DB noise (`CONNECTION_ENDED`, `No database connection`, `lock-renewal-failed`) — symptoms of being SIGTERM'd mid-cycle.
- The one log line that named the truth (`[watchdog] rss=9811MB threshold=8192MB`) scrolled by once per cycle, buried.
- The growing atom backlog reported itself as a **clean, successful** cycle (`extract_atoms: active pack does not declare this phase` → green).

The brand promise is "the agent runs while you sleep." The reality was "you wake up to a crime scene that needs a detective." That gap is what this issue is about.

## The standard gbrain should hold itself to

**An operator (human or agent) should never have to grep a worker log or read source to learn that the brain is unhealthy or why.** Three capabilities make that true:

### 1. Self-diagnosing: `gbrain doctor` is the single source of truth for "is the brain healthy," and it reports *cause*, not symptoms.

Every failure in this incident is a question `doctor` should already answer in one command:

- "Worker is OOM-looping: cap=8192MB, peak=9811MB on `embed-backfill`, 400 kills/24h → raise `--max-rss` or it will keep dying." (Today: invisible; you must grep.)
- "Atom-extraction backlog = 686 eligible pages, last successful extraction = 14 days ago, growing." (Today: invisible; reported as healthy.)
- "Lens phase `extract_atoms` is enabled but the active pack doesn't declare it → it has been no-op'ing every cycle." (Today: a silent green skip.)
- "DB pool was reaped N times in the last hour and is not auto-recovering." (Today: surfaces as a misleading 'connect() not called'.)

`doctor` should aggregate worker-supervisor state, job-queue backlogs per kind, phase freshness, pool health, and memory headroom — and **rank by cause with a one-line fix** for each. If a class of failure can happen, `doctor` should be able to name it. The acceptance test: *every diagnosis in this incident is a single `gbrain doctor` line, no grep, no source-reading.*
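To make "rank by cause with a one-line fix" concrete, here is a minimal sketch in Python of what a cause-ranked report could look like. It is not gbrain's actual `doctor` implementation: the `Diagnosis` shape, the severity names, the cause labels, and the fix text for the no-op phase are all assumptions made for illustration. Only the numbers and the `--max-rss` hint come from the incident above.

```python
from dataclasses import dataclass

# Hypothetical severity order: a lower number is reported first.
SEVERITY = {"critical": 0, "warning": 1, "info": 2}


@dataclass(frozen=True)
class Diagnosis:
    cause: str     # root cause label, not a symptom (hypothetical names)
    severity: str  # key into SEVERITY
    evidence: str  # the numbers that justify the cause
    fix: str       # one-line remediation


def rank(diagnoses):
    """Order diagnoses most severe first; ties keep input order (sorted is stable)."""
    return sorted(diagnoses, key=lambda d: SEVERITY[d.severity])


def render(diagnoses):
    """Render one line per diagnosis: severity, cause, evidence, fix."""
    return [
        f"[{d.severity}] {d.cause}: {d.evidence} -> {d.fix}"
        for d in rank(diagnoses)
    ]


def example_report():
    """Two findings from the incident, deliberately passed in the wrong order."""
    return render([
        Diagnosis(
            cause="phase-noop",
            severity="warning",
            evidence="extract_atoms enabled but not declared by the active pack",
            fix="declare the phase in the pack or disable it",  # illustrative fix text
        ),
        Diagnosis(
            cause="oom-loop",
            severity="critical",
            evidence="cap=8192MB peak=9811MB job_kind=embed-backfill kills_24h=400",
            fix="raise --max-rss",
        ),
    ])
```

The point of the shape is that every finding carries its own evidence and fix, so the output needs no follow-up grep to act on.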

### 2. Self-healing: the mechanical cases fix themselves, no human and no agent in the loop.

None of these need judgment — they're deterministic:

- **Auto-size the RSS cap** to a fraction of detected system RAM (e.g. `min(0.25 × total, 16384)` in MB), with an explicit flag still overriding. A 2GB default on a 126GB box is a footgun that *guarantees* this incident on any embedding brain.
- **Auto-drain slow backlogs** on a cadence with cooperative lock-yielding, so `extract_atoms`/`synthesize_concepts` never silently pile up. A backlog should be a transient, not a permanent state nobody sees.
- **Rebuild a reaped pool** in-process on the calls that hit it (`renewLock`/`promoteDelayed`/`claim`) instead of wedging every subsequent call until restart.
- **Cause-aware crash-loop breaker:** N consecutive same-cause kills → stop hot-looping, emit one loud alert, back off. 400 identical deaths should trip a breaker at ~5, not run all day.
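Two of the cases above are small enough to sketch directly. The Python below is an illustration under stated assumptions, not gbrain code: `auto_rss_cap_mb` and `CrashLoopBreaker` are hypothetical names, values are assumed to be in MB, and the threshold of 5 is the rough figure suggested above. How total RAM is detected and how the alert is emitted are left out.

```python
def auto_rss_cap_mb(total_ram_mb, override_mb=None, fraction=0.25, ceiling_mb=16384):
    """Pick an RSS cap in MB: an explicit override wins, else min(fraction * total, ceiling)."""
    if override_mb is not None:
        return override_mb
    return int(min(fraction * total_ram_mb, ceiling_mb))


class CrashLoopBreaker:
    """Trips after `threshold` consecutive worker exits with the same cause."""

    def __init__(self, threshold=5):
        self.threshold = threshold
        self.last_cause = None
        self.streak = 0

    def record_exit(self, cause):
        """Record one exit; return True once the same cause has repeated `threshold` times.

        A supervisor would stop hot-restarting and alert once when this returns True.
        """
        if cause == self.last_cause:
            self.streak += 1
        else:
            # A different cause starts a new streak.
            self.last_cause = cause
            self.streak = 1
        return self.streak >= self.threshold
```

On a 126GB box the example formula hits the ceiling, so the cap would be 16384MB. Keying the breaker on cause rather than on raw exit count means a mix of unrelated failures does not trip it, while 400 identical OOM kills would trip it on the fifth.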

### 3. Loud-but-precise: when it genuinely can't self-heal, the *first* line names the real cause.

Not 200 lines of downstream noise burying one truth. The `worker_exited` line itself should read:

```
worker_exited reason=oom cap=8192 peak=9811 job_kind=embed-backfill fix="raise --max-rss; box has 126GB"
```

Diagnosis in one glance, not one hour. Downstream cascade errors should be tagged as *consequences* (`secondary=true cause_ref=oom-kill-<id>`) so they can't masquerade as the root cause.
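As a rough sketch of the log shape described above, the Python below renders the root-cause line and tags a downstream error as secondary. The helper names `format_event` and `tag_secondary` are hypothetical, the quoting rule is an assumption, and the id `42` stands in for whatever id the supervisor would assign to the kill.

```python
def format_event(name, **fields):
    """Render `name key=value ...`; values containing a space are double-quoted."""
    parts = [name]
    for key, value in fields.items():
        text = str(value)
        if " " in text:
            text = f'"{text}"'
        parts.append(f"{key}={text}")
    return " ".join(parts)


def tag_secondary(name, root_id, **fields):
    """Mark a downstream error as a consequence of an already-reported root cause."""
    return format_event(name, secondary="true", cause_ref=root_id, **fields)


# The root-cause line, built from the incident's numbers.
root_line = format_event(
    "worker_exited",
    reason="oom",
    cap=8192,
    peak=9811,
    job_kind="embed-backfill",
    fix="raise --max-rss; box has 126GB",
)

# A downstream DB error, tagged so it cannot be mistaken for the root cause.
secondary_line = tag_secondary("CONNECTION_ENDED", "oom-kill-42")
```

With this tagging, filtering out `secondary=true` lines leaves only root causes, which is what makes the first line trustworthy.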

## Why this matters beyond us

This is the difference between gbrain being "a powerful engine you must babysit" and "infrastructure you can trust unattended." Every operator who runs an embedding brain on a real box hits the 2GB-cap OOM loop. Every operator with a pack that doesn't declare a lens phase grows an invisible backlog. The current design makes each of them re-run the same multi-hour investigation we just did. **Solve it once in the product and nobody ever does this archaeology again.**

## Ask

Adopt "brain health is a solved problem" as a design invariant, tracked across:
- **`doctor` as the single health truth, cause-ranked** (covers the diagnosis gap).
- **Self-heal the mechanical cases** (auto-size cap, auto-drain, pool rebuild, cause-aware breaker).
- **First-log-line-names-the-cause + secondary-error tagging** (covers the misleading-signal gap).

Concrete bug-level fixes and exact code line cites live in **#1678** — this issue is the umbrella the work should ladder up to.


[Meta/Posture] Brain health should be a solved problem — zero forensics, self-diagnosing, self-healing, loud-but-precise #1685
