Skip to content

fix(doctor): classify supervisor crashes vs clean restarts + auto-heal spec#1022

Closed
garrytan-agents wants to merge 9 commits into
garrytan:masterfrom
garrytan-agents:fix/doctor-crash-classification
Closed

fix(doctor): classify supervisor crashes vs clean restarts + auto-heal spec#1022
garrytan-agents wants to merge 9 commits into
garrytan:masterfrom
garrytan-agents:fix/doctor-crash-classification

Conversation

@garrytan-agents

@garrytan-agents garrytan-agents commented May 15, 2026

Copy link
Copy Markdown
Contributor

Two-tier PR

Tier 1: Code fix (~20 lines)

The supervisor health check counted ALL worker_exited events as crashes, including clean exits (code 0). A stable supervisor showed "62 crashes in 24h" when there were only 1-2 real errors.

Fix: Filter on code !== 0. Clean restarts shown separately for visibility.

Before:

[WARN] supervisor: Supervisor running but worker crashed 62x in last 24h.

After:

[OK] supervisor: running=true pid=578679 crashes=2 clean_restarts=60

Files: src/commands/doctor.ts, src/commands/jobs.ts

Tier 2: Spec for Claude Code implementation

docs/issues/doctor-auto-heal-and-scoring.md — comprehensive issue doc with 7 improvements:

  1. Frontmatter severity levels (6,900 cosmetic warnings drowning 280 real issues)
  2. Temporal contradiction awareness (60% false positive reduction)
  3. Multi-source drift baseline (acknowledge unfixable pre-v0.30.3 pages)
  4. Image assets acknowledgment
  5. Auto-heal mode (doctor --auto-heal via job queue)
  6. Score delta tracking (history + trend)
  7. Weighted scoring

Each improvement includes evidence, root cause, proposed fix, and test cases for red-test-first implementation.

@garrytan-agents garrytan-agents force-pushed the fix/doctor-crash-classification branch from f5e41eb to d4c7b10 Compare May 15, 2026 13:30
@garrytan-agents garrytan-agents changed the title fix(doctor): classify supervisor crashes vs clean restarts docs: doctor health score improvements — crash classification, auto-heal, scoring May 15, 2026
@garrytan-agents garrytan-agents changed the title docs: doctor health score improvements — crash classification, auto-heal, scoring fix(doctor): classify supervisor crashes vs clean restarts + auto-heal spec May 15, 2026
garrytan and others added 9 commits May 15, 2026 18:36
…olver + stub guard

- doctor.ts/jobs.ts: classify worker exits with code !== 0 as real crashes
  vs code === 0 clean restarts (separate counter); fixes false-positive
  WARN on healthy supervisors
- entities/resolve.ts: prefix-expansion step between fuzzy match and
  slugify fallback catches bare first names that score too low on pg_trgm;
  picks highest-connection candidate as tiebreaker
- facts/fence-write.ts: stub-creation guard refuses to spawn unprefixed
  entity pages at brain root
- facts/backstop.ts: routes stubGuardBlocked facts to engine.insertFact
  so the fact still persists even when no markdown file is created
- docs/issues/doctor-auto-heal-and-scoring.md: spec for follow-up doctor
  health-score improvements
- .gitignore: guard reports/network-intelligence/ (private brain exports)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d JSDoc

Replace YC partner names with placeholders per CLAUDE.md privacy rule:
alice-example, bob-example, charlie-example, dave-example. Stripe and
Stripe Atlas retained (allowed household brands; exercises the two-word
company-prefix case).

Test semantics preserved:
- Alice / Dave: single-match cases
- Bob / Charlie: multi-match tiebreaker cases (winner has more chunks)

All 13 entity-resolve cases pass with the scrubbed fixtures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three call sites were inline-classifying worker exits: supervisor's
restart policy (child-worker-supervisor.ts:291), doctor's supervisor
check (doctor.ts:1016), and jobs supervisor status (jobs.ts:806). Same
rule, three copies — drift risk if one is updated without the others.

Extract to src/core/minions/exit-classification.ts as a pure function.
Signature consumes audit-JSON shape ({ code: number | null }) so doctor
and jobs (which read serialized events from JSONL) and supervisor (which
reads Node's exit callback) call the same function. Helper's classification
rule: code === 0 → clean_exit, everything else (non-zero, null, undefined,
missing) → crash. Default-to-crash prevents corrupted rows from silently
demoting into the clean-restart bucket.

5 hermetic unit tests (test/exit-classification.test.ts) pin all edge cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wire telemetry into the v0.34.5 stub-guard at fence-write.ts:190. Every
guard fire now appends a JSONL line to
~/.gbrain/audit/stub-guard-YYYY-Www.jsonl with {ts, slug, source_id,
fact_count}. Operator visibility for the sunset criterion: when the new
audit log reads <5 hits/week for 3 consecutive weeks on production
brains, the prefix-expansion in resolveEntitySlug is sufficient and the
guard can be removed in v0.36.

Reader (readRecentStubGuardEvents) deliberately diverges from
supervisor-audit.ts:readSupervisorEvents — it reads BOTH the current AND
previous ISO-week file before filtering by ts. supervisor-audit's reader
only reads the current week, which loses 24h-window correctness across
Monday 00:00 UTC (a Sunday 23:55 event lives in last week's file). The
2-file read costs nothing and makes the window actually 24h.

9 hermetic unit tests pin filename math, the writer's
swallows-errors contract, the cross-week-boundary read, sort order,
missing-file behavior, and malformed-row tolerance. The cross-week test
is the regression guard: if a future refactor copies the supervisor's
single-file pattern, that test fails.

Follow-up TODO (not in this PR): fix readSupervisorEvents to use the
same 2-file pattern. The new stub-guard reader becomes the canonical
template to copy back.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a new doctor check that reads ~/.gbrain/audit/stub-guard-YYYY-Www.jsonl
(via the dual-week-aware reader from T8) and surfaces the 24h fire count.
WARN at >10 fires — at that rate the prefix-expansion in resolveEntitySlug
is probably missing a case (typo prefix, alias, non-Latin script) and
operators should grep the audit log for the offending slugs. Below the
threshold but non-zero shows as OK with a count, so operators can watch
the v0.36 sunset criterion (<5/week for 3 weeks → guard can be removed).
Zero hits emits no check, keeping the doctor output clean on healthy
brains.

5 source-grep regression tests pin the contract: check name, WARN
threshold, fix hint mentions the audit log + the resolver function name,
reader is the dual-week-aware variant (NOT the supervisor-audit single-
week pattern), and zero-hits stays silent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…layers

- fence-write.test.ts: 3 new cases for the v0.34.5 stub guard. Bare slugs
  return {inserted: 0, stubGuardBlocked: true, ids: []} and create no
  file/.tmp at brain root. Prefixed slugs bypass the guard (regression
  guard against accidentally inverting the slug.includes('/') check).
  Empty facts array short-circuits before the guard fires.
- facts-backstop.test.ts: 1 new case for the end-to-end routing. A
  bare-name LLM extraction resolves through to a bare slug, hits the
  guard, and lands in the facts table via engine.insertFact (DB-only).
  No phantom .md file; entity_slug stores the bare slug;
  source_markdown_slug is null. This is the routing contract Codex
  flagged as a "split-brain" data shape — the test pins the by-design
  behavior so a future refactor can't silently drop these facts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
12 new cases on top of the 5 helper unit tests:
- doctor.ts / jobs.ts / child-worker-supervisor.ts each import the helper
- All three call classifyWorkerExit at least once
- doctor.ts and jobs.ts no longer carry the pre-T7 inline filter
- supervisor uses the helper result to choose the clean_exit branch
- audit-event shape round-trip: code=0 → clean_exit, code=1 → crash,
  code=null+SIGKILL → crash (catches future shape changes)

The regression guards (3) and the wire-up checks (6) close the gap that
motivated T7 in the first place: if a future change accidentally re-inlines
the filter or shifts the audit event shape, the test fails before
production sees the silent divergence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the derived-table JOIN shape in tryPrefixExpansion with
correlated subqueries. The pre-fix SQL did

  LEFT JOIN (SELECT to_page_id, COUNT(*) FROM links GROUP BY to_page_id) li ON ...

which forced the planner to aggregate the entire links + content_chunks
tables on every prefix-expansion call — O(N) per call where N is total
links/chunks in the brain. On a 100K-link / 50K-chunk brain that's slow
enough to bottleneck fact-extraction.

New shape uses correlated subqueries:

  (SELECT COUNT(*) FROM links WHERE to_page_id = p.id)
    + (SELECT COUNT(*) FROM links WHERE from_page_id = p.id)
    + (SELECT COUNT(*) FROM content_chunks WHERE page_id = p.id)

The slug LIKE filter is already selective (typical brain has 0-5 pages
per prefix), so the three subqueries run N≈3 times per matched row
against the existing indexes on links.to_page_id, links.from_page_id,
and content_chunks.page_id. Behavior preserved: 13/13 entity-resolve
tests pass (single-match + multi-match tiebreaker + edge cases).

Codex's outside-voice review caught the dead-end design that an earlier
draft of this plan proposed (a CTE with `LIMIT 50` candidate cap — would
have excluded correct high-connection candidates if their slug sorted
late). Correlated subqueries without a candidate cap are the cleaner
shape that lets the LIKE filter do the bounding work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hermetic PGLite benchmark with 5K pages + 50K links + 25K chunks. Runs
the pre-T12 derived-table shape and the new correlated-subquery shape
side-by-side against the same fixture, asserts NEW >= 5x faster than OLD.
Baseline-ratio, not absolute wall-clock — different machines / Bun
versions / CI load can shift absolute timings by 10x without indicating
a real regression, but the SHAPE difference between "aggregate the full
tables" and "correlated subquery per candidate" is what we care about.

Measured: old_median=18.16ms, new_median=0.31ms, speedup=58.22x.
The 5x assertion has plenty of headroom.

The OLD SQL is embedded verbatim as the regression baseline. If a future
refactor re-introduces full-table aggregation (LEFT JOIN against
SELECT...GROUP BY over the whole links or content_chunks table), the
test fails. PGLite-only — Postgres planner can shape derived-table
JOINs differently enough that the 5x ratio could be noise on a 5K-page
fixture. The structural correctness of the rewrite is the same on both;
this is purely a planner-shape regression guard.

.slow.test.ts suffix keeps it out of the fast loop (run via
`bun run test:slow`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan garrytan force-pushed the fix/doctor-crash-classification branch from 58a1863 to 5b44bae Compare May 16, 2026 01:56
@garrytan

Copy link
Copy Markdown
Owner

Closing this PR. A replacement with sanitized content will land via the base repo. No further action needed on this thread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants