[Meta/Posture] Brain health should be a solved problem — zero forensics, self-diagnosing, self-healing, loud-but-precise #1685

@garrytan-agents

Companion to #1678 (which has the three concrete technical bugs + line-level fixes). This issue is the posture-level ask that #1678's bugs are evidence for.

The real failure

A production worker died 400+ times in 24h. A page backlog grew unbounded for weeks. Total operator/agent time to figure out why: hours — grepping worker logs, reading source, opening a wrong-track PR, tracing a connection cascade that turned out to be a symptom, not the disease.

That is the bug. Not the memory cap, not the lock path, not the pack-gate. The bug is that gbrain made a human/agent do forensic archaeology to discover its own failure mode — and while doing so, its own signals actively pointed the wrong way:

  • The worker's exit was logged as code=1 likely_cause=runtime_error when the real cause was an RSS-watchdog OOM kill.
  • The loud, repeated errors were all downstream DB noise (CONNECTION_ENDED, No database connection, lock-renewal-failed) — symptoms of being SIGTERM'd mid-cycle.
  • The one log line that named the truth ([watchdog] rss=9811MB threshold=8192MB) scrolled by once per cycle, buried.
  • The growing atom backlog reported itself as a clean, successful cycle (extract_atoms: active pack does not declare this phase → green).

The brand promise is "the agent runs while you sleep." The reality was "you wake up to a crime scene that needs a detective." That gap is what this issue is about.

The standard gbrain should hold itself to

An operator (human or agent) should never have to grep a worker log or read source to learn that the brain is unhealthy or why. Three capabilities make that true:

1. Self-diagnosing: gbrain doctor is the single source of truth for "is the brain healthy," and it reports cause, not symptoms.

Every failure in this incident is a question doctor should already answer in one command:

  • "Worker is OOM-looping: cap=8192MB, peak=9811MB on embed-backfill, 400 kills/24h → raise --max-rss or it will keep dying." (Today: invisible; you must grep.)
  • "Atom-extraction backlog = 686 eligible pages, last successful extraction = 14 days ago, growing." (Today: invisible; reported as healthy.)
  • "Lens phase extract_atoms is enabled but the active pack doesn't declare it → it has been no-op'ing every cycle." (Today: a silent green skip.)
  • "DB pool was reaped N times in the last hour and is not auto-recovering." (Today: surfaces as a misleading 'connect() not called'.)

doctor should aggregate worker-supervisor state, job-queue backlogs per kind, phase freshness, pool health, and memory headroom — and rank by cause with a one-line fix for each. If a class of failure can happen, doctor should be able to name it. The acceptance test: every diagnosis in this incident is a single gbrain doctor line, no grep, no source-reading.
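
One way to picture "rank by cause with a one-line fix" is the minimal sketch below. It is not gbrain's actual doctor: the Finding type, the check_* helpers, the severity scale, and the fix strings for the backlog and undeclared-phase cases are all hypothetical, and only the numbers come from the incident above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Finding:
    """One doctor line: a root cause, its evidence, and a one-line fix."""
    cause: str
    evidence: str
    fix: str
    severity: int  # hypothetical scale: higher means more urgent


def check_oom_loop(cap_mb: int, peak_mb: int, kills_24h: int,
                   job_kind: str) -> Optional[Finding]:
    # A kill only counts as an OOM loop if the peak actually exceeded the cap.
    if kills_24h > 0 and peak_mb > cap_mb:
        return Finding(
            cause="worker is OOM-looping",
            evidence=f"cap={cap_mb}MB peak={peak_mb}MB on {job_kind}, "
                     f"{kills_24h} kills/24h",
            fix="raise --max-rss",
            severity=3,
        )
    return None


def check_backlog(kind: str, eligible: int, days_since_success: int,
                  max_age_days: int = 1) -> Optional[Finding]:
    # max_age_days is an assumed freshness budget, not a gbrain setting.
    if eligible > 0 and days_since_success > max_age_days:
        return Finding(
            cause=f"{kind} backlog is growing",
            evidence=f"{eligible} eligible pages, last success "
                     f"{days_since_success} days ago",
            fix="drain the backlog",  # assumed remediation
            severity=2,
        )
    return None


def check_undeclared_phase(phase: str, enabled: bool,
                           declared_by_pack: bool) -> Optional[Finding]:
    if enabled and not declared_by_pack:
        return Finding(
            cause=f"phase {phase} is a silent no-op",
            evidence="enabled, but the active pack does not declare it",
            fix="declare the phase in the pack or disable it",  # assumed
            severity=2,
        )
    return None


def doctor(findings: List[Optional[Finding]]) -> List[Finding]:
    """Drop healthy checks and rank the rest, most urgent cause first."""
    real = [f for f in findings if f is not None]
    return sorted(real, key=lambda f: f.severity, reverse=True)


def render(findings: List[Finding]) -> List[str]:
    if not findings:
        return ["OK: no known failure class detected"]
    return [f"{f.cause}: {f.evidence} -> {f.fix}" for f in findings]
```

Fed the incident's numbers (cap 8192, peak 9811, 400 kills, 686 eligible pages, 14 days), this yields three lines with the OOM loop first. That is the shape the acceptance test asks for.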

2. Self-healing: the mechanical cases fix themselves, no human and no agent in the loop.

None of these need judgment — they're deterministic:

  • Auto-size the RSS cap to a fraction of detected system RAM (e.g. min(0.25 × total, 16384 MB)); an explicit flag still overrides. A 2GB default on a 126GB box is a footgun that guarantees this incident on any embedding brain.
  • Auto-drain slow backlogs on a cadence with cooperative lock-yielding, so extract_atoms/synthesize_concepts never silently pile up. A backlog should be a transient, not a permanent state nobody sees.
  • Rebuild a reaped pool in-process (on the renewLock/promoteDelayed/claim paths) instead of wedging every subsequent call until restart.
  • Cause-aware crash-loop breaker: N consecutive same-cause kills → stop hot-looping, emit one loud alert, back off. 400 identical deaths should trip a breaker at ~5, not run all day.
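
Two of the cases above are small enough to sketch directly: the auto-sized cap and the cause-aware breaker. This is an illustration under assumptions, not gbrain code: auto_rss_cap_mb and CrashLoopBreaker are hypothetical names, the 0.25 fraction and 16384 MB ceiling are the example values from the list, and the threshold of 5 is the "~5" suggested above.

```python
from typing import Optional


def auto_rss_cap_mb(total_ram_mb: int,
                    override_mb: Optional[int] = None,
                    fraction: float = 0.25,
                    ceiling_mb: int = 16384) -> int:
    """Pick an RSS cap from detected RAM; an explicit override always wins."""
    if override_mb is not None:
        return override_mb
    return min(int(total_ram_mb * fraction), ceiling_mb)


class CrashLoopBreaker:
    """Trip after N consecutive exits that share the same cause."""

    def __init__(self, threshold: int = 5) -> None:
        self.threshold = threshold
        self.last_cause: Optional[str] = None
        self.streak = 0
        self.tripped = False

    def record_exit(self, cause: str) -> bool:
        """Record one worker exit. Returns True exactly once, at the moment
        the breaker trips, so the caller can emit a single loud alert."""
        if cause == self.last_cause:
            self.streak += 1
        else:
            self.last_cause = cause
            self.streak = 1
        if self.streak >= self.threshold and not self.tripped:
            self.tripped = True
            return True
        return False

    def record_success(self) -> None:
        """A clean cycle clears the streak and re-arms the breaker."""
        self.last_cause = None
        self.streak = 0
        self.tripped = False

    def should_restart(self) -> bool:
        # Once tripped, the supervisor should back off instead of hot-looping.
        return not self.tripped
```

On a 126GB box (129024 MB) the default path hits the 16384 MB ceiling rather than 2GB, and five consecutive oom exits trip the breaker with a single alert instead of 400 restarts.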

3. Loud-but-precise: when it genuinely can't self-heal, the first line names the real cause.

Not 200 lines of downstream noise burying one truth. The worker_exited line itself should read:

worker_exited reason=oom cap=8192 peak=9811 job_kind=embed-backfill fix="raise --max-rss; box has 126GB"

Diagnosis in one glance, not one hour. Downstream cascade errors should be tagged as consequences (secondary=true cause_ref=oom-kill-<id>) so they can't masquerade as the root cause.
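
The proposed line format and the secondary tagging could be sketched as below. The helper names and the watchdog_killed flag are hypothetical, and the "42" id in the usage note is a made-up placeholder. The field names (reason, cap, peak, job_kind, fix, secondary, cause_ref) come from the proposal above.

```python
def classify_exit(exit_code: int, watchdog_killed: bool) -> str:
    """Name the cause, not the symptom: a watchdog kill is an OOM even
    when the process happens to exit with code 1."""
    if watchdog_killed:
        return "oom"
    return "clean" if exit_code == 0 else "runtime_error"


def format_worker_exited(reason: str, cap_mb: int, peak_mb: int,
                         job_kind: str, fix: str) -> str:
    """Build the single worker_exited line proposed above."""
    return (f"worker_exited reason={reason} cap={cap_mb} peak={peak_mb} "
            f'job_kind={job_kind} fix="{fix}"')


def tag_secondary(message: str, cause_ref: str) -> str:
    """Mark a downstream error as a consequence of an earlier root cause."""
    return f"{message} secondary=true cause_ref={cause_ref}"
```

For example, tag_secondary("CONNECTION_ENDED", "oom-kill-42") yields "CONNECTION_ENDED secondary=true cause_ref=oom-kill-42", so a log reader or doctor can fold the cascade under the kill that caused it.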

Why this matters beyond us

This is the difference between gbrain being "a powerful engine you must babysit" and "infrastructure you can trust unattended." Every operator who runs an embedding brain on a real box hits the 2GB-cap OOM loop. Every operator with a pack that doesn't declare a lens phase grows an invisible backlog. The current design makes each of them re-run the same multi-hour investigation we just did. Solve it once in the product and nobody ever does this archaeology again.

Ask

Adopt "brain health is a solved problem" as a design invariant, tracked across:

  • doctor as the single health truth, cause-ranked (covers the diagnosis gap).
  • Self-heal the mechanical cases (auto-size cap, auto-drain, pool rebuild, cause-aware breaker).
  • First-log-line-names-the-cause + secondary-error tagging (covers the misleading-signal gap).

Concrete bug-level fixes and exact line-level code citations live in #1678; this issue is the umbrella that work should ladder up to.
