fix(doctor): classify supervisor crashes vs clean restarts + auto-heal spec#1022
Closed
garrytan-agents wants to merge 9 commits into
Closed
fix(doctor): classify supervisor crashes vs clean restarts + auto-heal spec#1022garrytan-agents wants to merge 9 commits into
garrytan-agents wants to merge 9 commits into
Conversation
f5e41eb to
d4c7b10
Compare
…olver + stub guard - doctor.ts/jobs.ts: classify worker exits with code !== 0 as real crashes vs code === 0 clean restarts (separate counter); fixes false-positive WARN on healthy supervisors - entities/resolve.ts: prefix-expansion step between fuzzy match and slugify fallback catches bare first names that score too low on pg_trgm; picks highest-connection candidate as tiebreaker - facts/fence-write.ts: stub-creation guard refuses to spawn unprefixed entity pages at brain root - facts/backstop.ts: routes stubGuardBlocked facts to engine.insertFact so the fact still persists even when no markdown file is created - docs/issues/doctor-auto-heal-and-scoring.md: spec for follow-up doctor health-score improvements - .gitignore: guard reports/network-intelligence/ (private brain exports) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d JSDoc Replace YC partner names with placeholders per CLAUDE.md privacy rule: alice-example, bob-example, charlie-example, dave-example. Stripe and Stripe Atlas retained (allowed household brands; exercises the two-word company-prefix case). Test semantics preserved: - Alice / Dave: single-match cases - Bob / Charlie: multi-match tiebreaker cases (winner has more chunks) All 13 entity-resolve cases pass with the scrubbed fixtures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three call sites were inline-classifying worker exits: supervisor's
restart policy (child-worker-supervisor.ts:291), doctor's supervisor
check (doctor.ts:1016), and jobs supervisor status (jobs.ts:806). Same
rule, three copies — drift risk if one is updated without the others.
Extract to src/core/minions/exit-classification.ts as a pure function.
Signature consumes audit-JSON shape ({ code: number | null }) so doctor
and jobs (which read serialized events from JSONL) and supervisor (which
reads Node's exit callback) call the same function. Helper's classification
rule: code === 0 → clean_exit, everything else (non-zero, null, undefined,
missing) → crash. Default-to-crash prevents corrupted rows from silently
demoting into the clean-restart bucket.
5 hermetic unit tests (test/exit-classification.test.ts) pin all edge cases.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wire telemetry into the v0.34.5 stub-guard at fence-write.ts:190. Every guard fire now appends a JSONL line to ~/.gbrain/audit/stub-guard-YYYY-Www.jsonl with {ts, slug, source_id, fact_count}. Operator visibility for the sunset criterion: when the new audit log reads <5 hits/week for 3 consecutive weeks on production brains, the prefix-expansion in resolveEntitySlug is sufficient and the guard can be removed in v0.36. Reader (readRecentStubGuardEvents) deliberately diverges from supervisor-audit.ts:readSupervisorEvents — it reads BOTH the current AND previous ISO-week file before filtering by ts. supervisor-audit's reader only reads the current week, which loses 24h-window correctness across Monday 00:00 UTC (a Sunday 23:55 event lives in last week's file). The 2-file read costs nothing and makes the window actually 24h. 9 hermetic unit tests pin filename math, the writer's swallows-errors contract, the cross-week-boundary read, sort order, missing-file behavior, and malformed-row tolerance. The cross-week test is the regression guard: if a future refactor copies the supervisor's single-file pattern, that test fails. Follow-up TODO (not in this PR): fix readSupervisorEvents to use the same 2-file pattern. The new stub-guard reader becomes the canonical template to copy back. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a new doctor check that reads ~/.gbrain/audit/stub-guard-YYYY-Www.jsonl (via the dual-week-aware reader from T8) and surfaces the 24h fire count. WARN at >10 fires — at that rate the prefix-expansion in resolveEntitySlug is probably missing a case (typo prefix, alias, non-Latin script) and operators should grep the audit log for the offending slugs. Below the threshold but non-zero shows as OK with a count, so operators can watch the v0.36 sunset criterion (<5/week for 3 weeks → guard can be removed). Zero hits emits no check, keeping the doctor output clean on healthy brains. 5 source-grep regression tests pin the contract: check name, WARN threshold, fix hint mentions the audit log + the resolver function name, reader is the dual-week-aware variant (NOT the supervisor-audit single- week pattern), and zero-hits stays silent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…layers
- fence-write.test.ts: 3 new cases for the v0.34.5 stub guard. Bare slugs
return {inserted: 0, stubGuardBlocked: true, ids: []} and create no
file/.tmp at brain root. Prefixed slugs bypass the guard (regression
guard against accidentally inverting the slug.includes('/') check).
Empty facts array short-circuits before the guard fires.
- facts-backstop.test.ts: 1 new case for the end-to-end routing. A
bare-name LLM extraction resolves through to a bare slug, hits the
guard, and lands in the facts table via engine.insertFact (DB-only).
No phantom .md file; entity_slug stores the bare slug;
source_markdown_slug is null. This is the routing contract Codex
flagged as a "split-brain" data shape — the test pins the by-design
behavior so a future refactor can't silently drop these facts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
12 new cases on top of the 5 helper unit tests: - doctor.ts / jobs.ts / child-worker-supervisor.ts each import the helper - All three call classifyWorkerExit at least once - doctor.ts and jobs.ts no longer carry the pre-T7 inline filter - supervisor uses the helper result to choose the clean_exit branch - audit-event shape round-trip: code=0 → clean_exit, code=1 → crash, code=null+SIGKILL → crash (catches future shape changes) The regression guards (3) and the wire-up checks (6) close the gap that motivated T7 in the first place: if a future change accidentally re-inlines the filter or shifts the audit event shape, the test fails before production sees the silent divergence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the derived-table JOIN shape in tryPrefixExpansion with
correlated subqueries. The pre-fix SQL did
LEFT JOIN (SELECT to_page_id, COUNT(*) FROM links GROUP BY to_page_id) li ON ...
which forced the planner to aggregate the entire links + content_chunks
tables on every prefix-expansion call — O(N) per call where N is total
links/chunks in the brain. On a 100K-link / 50K-chunk brain that's slow
enough to bottleneck fact-extraction.
New shape uses correlated subqueries:
(SELECT COUNT(*) FROM links WHERE to_page_id = p.id)
+ (SELECT COUNT(*) FROM links WHERE from_page_id = p.id)
+ (SELECT COUNT(*) FROM content_chunks WHERE page_id = p.id)
The slug LIKE filter is already selective (typical brain has 0-5 pages
per prefix), so the three subqueries run N≈3 times per matched row
against the existing indexes on links.to_page_id, links.from_page_id,
and content_chunks.page_id. Behavior preserved: 13/13 entity-resolve
tests pass (single-match + multi-match tiebreaker + edge cases).
Codex's outside-voice review caught the dead-end design that an earlier
draft of this plan proposed (a CTE with `LIMIT 50` candidate cap — would
have excluded correct high-connection candidates if their slug sorted
late). Correlated subqueries without a candidate cap are the cleaner
shape that lets the LIKE filter do the bounding work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hermetic PGLite benchmark with 5K pages + 50K links + 25K chunks. Runs the pre-T12 derived-table shape and the new correlated-subquery shape side-by-side against the same fixture, asserts NEW >= 5x faster than OLD. Baseline-ratio, not absolute wall-clock — different machines / Bun versions / CI load can shift absolute timings by 10x without indicating a real regression, but the SHAPE difference between "aggregate the full tables" and "correlated subquery per candidate" is what we care about. Measured: old_median=18.16ms, new_median=0.31ms, speedup=58.22x. The 5x assertion has plenty of headroom. The OLD SQL is embedded verbatim as the regression baseline. If a future refactor re-introduces full-table aggregation (LEFT JOIN against SELECT...GROUP BY over the whole links or content_chunks table), the test fails. PGLite-only — Postgres planner can shape derived-table JOINs differently enough that the 5x ratio could be noise on a 5K-page fixture. The structural correctness of the rewrite is the same on both; this is purely a planner-shape regression guard. .slow.test.ts suffix keeps it out of the fast loop (run via `bun run test:slow`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
58a1863 to
5b44bae
Compare
Owner
|
Closing this PR. A replacement with sanitized content will land via the base repo. No further action needed on this thread. |
6 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Two-tier PR
Tier 1: Code fix (~20 lines)
The supervisor health check counted ALL
worker_exitedevents as crashes, including clean exits (code 0). A stable supervisor showed "62 crashes in 24h" when there were only 1-2 real errors.Fix: Filter on
code !== 0. Clean restarts shown separately for visibility.Before:
After:
Files:
src/commands/doctor.ts,src/commands/jobs.tsTier 2: Spec for Claude Code implementation
docs/issues/doctor-auto-heal-and-scoring.md— comprehensive issue doc with 7 improvements:doctor --auto-healvia job queue)Each improvement includes evidence, root cause, proposed fix, and test cases for red-test-first implementation.