fix(doctor): classify supervisor crashes vs clean restarts + auto-heal spec by garrytan-agents · Pull Request #1022 · garrytan/gbrain

garrytan-agents · 2026-05-15T13:19:11Z

Two-tier PR

Tier 1: Code fix (~20 lines)

The supervisor health check counted ALL worker_exited events as crashes, including clean exits (code 0). A stable supervisor showed "62 crashes in 24h" when there were only 1-2 real errors.

Fix: Filter on code !== 0. Clean restarts shown separately for visibility.

Before:

[WARN] supervisor: Supervisor running but worker crashed 62x in last 24h.

After:

[OK] supervisor: running=true pid=578679 crashes=2 clean_restarts=60

Files: src/commands/doctor.ts, src/commands/jobs.ts

Tier 2: Spec for Claude Code implementation

docs/issues/doctor-auto-heal-and-scoring.md — comprehensive issue doc with 7 improvements:

Frontmatter severity levels (6,900 cosmetic warnings drowning 280 real issues)
Temporal contradiction awareness (60% false positive reduction)
Multi-source drift baseline (acknowledge unfixable pre-v0.30.3 pages)
Image assets acknowledgment
Auto-heal mode (doctor --auto-heal via job queue)
Score delta tracking (history + trend)
Weighted scoring

Each improvement includes evidence, root cause, proposed fix, and test cases for red-test-first implementation.

…olver + stub guard - doctor.ts/jobs.ts: classify worker exits with code !== 0 as real crashes vs code === 0 clean restarts (separate counter); fixes false-positive WARN on healthy supervisors - entities/resolve.ts: prefix-expansion step between fuzzy match and slugify fallback catches bare first names that score too low on pg_trgm; picks highest-connection candidate as tiebreaker - facts/fence-write.ts: stub-creation guard refuses to spawn unprefixed entity pages at brain root - facts/backstop.ts: routes stubGuardBlocked facts to engine.insertFact so the fact still persists even when no markdown file is created - docs/issues/doctor-auto-heal-and-scoring.md: spec for follow-up doctor health-score improvements - .gitignore: guard reports/network-intelligence/ (private brain exports) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…d JSDoc Replace YC partner names with placeholders per CLAUDE.md privacy rule: alice-example, bob-example, charlie-example, dave-example. Stripe and Stripe Atlas retained (allowed household brands; exercises the two-word company-prefix case). Test semantics preserved: - Alice / Dave: single-match cases - Bob / Charlie: multi-match tiebreaker cases (winner has more chunks) All 13 entity-resolve cases pass with the scrubbed fixtures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three call sites were inline-classifying worker exits: supervisor's restart policy (child-worker-supervisor.ts:291), doctor's supervisor check (doctor.ts:1016), and jobs supervisor status (jobs.ts:806). Same rule, three copies — drift risk if one is updated without the others. Extract to src/core/minions/exit-classification.ts as a pure function. Signature consumes audit-JSON shape ({ code: number | null }) so doctor and jobs (which read serialized events from JSONL) and supervisor (which reads Node's exit callback) call the same function. Helper's classification rule: code === 0 → clean_exit, everything else (non-zero, null, undefined, missing) → crash. Default-to-crash prevents corrupted rows from silently demoting into the clean-restart bucket. 5 hermetic unit tests (test/exit-classification.test.ts) pin all edge cases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Wire telemetry into the v0.34.5 stub-guard at fence-write.ts:190. Every guard fire now appends a JSONL line to ~/.gbrain/audit/stub-guard-YYYY-Www.jsonl with {ts, slug, source_id, fact_count}. Operator visibility for the sunset criterion: when the new audit log reads <5 hits/week for 3 consecutive weeks on production brains, the prefix-expansion in resolveEntitySlug is sufficient and the guard can be removed in v0.36. Reader (readRecentStubGuardEvents) deliberately diverges from supervisor-audit.ts:readSupervisorEvents — it reads BOTH the current AND previous ISO-week file before filtering by ts. supervisor-audit's reader only reads the current week, which loses 24h-window correctness across Monday 00:00 UTC (a Sunday 23:55 event lives in last week's file). The 2-file read costs nothing and makes the window actually 24h. 9 hermetic unit tests pin filename math, the writer's swallows-errors contract, the cross-week-boundary read, sort order, missing-file behavior, and malformed-row tolerance. The cross-week test is the regression guard: if a future refactor copies the supervisor's single-file pattern, that test fails. Follow-up TODO (not in this PR): fix readSupervisorEvents to use the same 2-file pattern. The new stub-guard reader becomes the canonical template to copy back. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a new doctor check that reads ~/.gbrain/audit/stub-guard-YYYY-Www.jsonl (via the dual-week-aware reader from T8) and surfaces the 24h fire count. WARN at >10 fires — at that rate the prefix-expansion in resolveEntitySlug is probably missing a case (typo prefix, alias, non-Latin script) and operators should grep the audit log for the offending slugs. Below the threshold but non-zero shows as OK with a count, so operators can watch the v0.36 sunset criterion (<5/week for 3 weeks → guard can be removed). Zero hits emits no check, keeping the doctor output clean on healthy brains. 5 source-grep regression tests pin the contract: check name, WARN threshold, fix hint mentions the audit log + the resolver function name, reader is the dual-week-aware variant (NOT the supervisor-audit single- week pattern), and zero-hits stays silent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…layers - fence-write.test.ts: 3 new cases for the v0.34.5 stub guard. Bare slugs return {inserted: 0, stubGuardBlocked: true, ids: []} and create no file/.tmp at brain root. Prefixed slugs bypass the guard (regression guard against accidentally inverting the slug.includes('/') check). Empty facts array short-circuits before the guard fires. - facts-backstop.test.ts: 1 new case for the end-to-end routing. A bare-name LLM extraction resolves through to a bare slug, hits the guard, and lands in the facts table via engine.insertFact (DB-only). No phantom .md file; entity_slug stores the bare slug; source_markdown_slug is null. This is the routing contract Codex flagged as a "split-brain" data shape — the test pins the by-design behavior so a future refactor can't silently drop these facts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

12 new cases on top of the 5 helper unit tests: - doctor.ts / jobs.ts / child-worker-supervisor.ts each import the helper - All three call classifyWorkerExit at least once - doctor.ts and jobs.ts no longer carry the pre-T7 inline filter - supervisor uses the helper result to choose the clean_exit branch - audit-event shape round-trip: code=0 → clean_exit, code=1 → crash, code=null+SIGKILL → crash (catches future shape changes) The regression guards (3) and the wire-up checks (6) close the gap that motivated T7 in the first place: if a future change accidentally re-inlines the filter or shifts the audit event shape, the test fails before production sees the silent divergence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the derived-table JOIN shape in tryPrefixExpansion with correlated subqueries. The pre-fix SQL did LEFT JOIN (SELECT to_page_id, COUNT(*) FROM links GROUP BY to_page_id) li ON ... which forced the planner to aggregate the entire links + content_chunks tables on every prefix-expansion call — O(N) per call where N is total links/chunks in the brain. On a 100K-link / 50K-chunk brain that's slow enough to bottleneck fact-extraction. New shape uses correlated subqueries: (SELECT COUNT(*) FROM links WHERE to_page_id = p.id) + (SELECT COUNT(*) FROM links WHERE from_page_id = p.id) + (SELECT COUNT(*) FROM content_chunks WHERE page_id = p.id) The slug LIKE filter is already selective (typical brain has 0-5 pages per prefix), so the three subqueries run N≈3 times per matched row against the existing indexes on links.to_page_id, links.from_page_id, and content_chunks.page_id. Behavior preserved: 13/13 entity-resolve tests pass (single-match + multi-match tiebreaker + edge cases). Codex's outside-voice review caught the dead-end design that an earlier draft of this plan proposed (a CTE with `LIMIT 50` candidate cap — would have excluded correct high-connection candidates if their slug sorted late). Correlated subqueries without a candidate cap are the cleaner shape that lets the LIKE filter do the bounding work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Hermetic PGLite benchmark with 5K pages + 50K links + 25K chunks. Runs the pre-T12 derived-table shape and the new correlated-subquery shape side-by-side against the same fixture, asserts NEW >= 5x faster than OLD. Baseline-ratio, not absolute wall-clock — different machines / Bun versions / CI load can shift absolute timings by 10x without indicating a real regression, but the SHAPE difference between "aggregate the full tables" and "correlated subquery per candidate" is what we care about. Measured: old_median=18.16ms, new_median=0.31ms, speedup=58.22x. The 5x assertion has plenty of headroom. The OLD SQL is embedded verbatim as the regression baseline. If a future refactor re-introduces full-table aggregation (LEFT JOIN against SELECT...GROUP BY over the whole links or content_chunks table), the test fails. PGLite-only — Postgres planner can shape derived-table JOINs differently enough that the 5x ratio could be noise on a 5K-page fixture. The structural correctness of the rewrite is the same on both; this is purely a planner-shape regression guard. .slow.test.ts suffix keeps it out of the fast loop (run via `bun run test:slow`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

garrytan · 2026-05-16T01:56:27Z

Closing this PR. A replacement with sanitized content will land via the base repo. No further action needed on this thread.

garrytan-agents force-pushed the fix/doctor-crash-classification branch from f5e41eb to d4c7b10 Compare May 15, 2026 13:30

garrytan-agents changed the title ~~fix(doctor): classify supervisor crashes vs clean restarts~~ docs: doctor health score improvements — crash classification, auto-heal, scoring May 15, 2026

garrytan-agents changed the title ~~docs: doctor health score improvements — crash classification, auto-heal, scoring~~ fix(doctor): classify supervisor crashes vs clean restarts + auto-heal spec May 15, 2026

garrytan and others added 9 commits May 15, 2026 18:36

garrytan force-pushed the fix/doctor-crash-classification branch from 58a1863 to 5b44bae Compare May 16, 2026 01:56

garrytan closed this May 16, 2026

garrytan mentioned this pull request May 16, 2026

v0.35.4.0 fix(doctor,entities): supervisor crash classification + bare-name resolver + 58x perf + stub guard observability #1085

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(doctor): classify supervisor crashes vs clean restarts + auto-heal spec#1022

fix(doctor): classify supervisor crashes vs clean restarts + auto-heal spec#1022
garrytan-agents wants to merge 9 commits into
garrytan:masterfrom
garrytan-agents:fix/doctor-crash-classification

garrytan-agents commented May 15, 2026 •

edited

Loading

Uh oh!

garrytan commented May 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

garrytan-agents commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Two-tier PR

Tier 1: Code fix (~20 lines)

Tier 2: Spec for Claude Code implementation

Uh oh!

garrytan commented May 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

garrytan-agents commented May 15, 2026 •

edited

Loading