v0.35.4.0 fix(doctor,entities): supervisor crash classification + bare-name resolver + 58x perf + stub guard observability by garrytan · Pull Request #1085 · garrytan/gbrain

garrytan · 2026-05-16T19:43:45Z

Summary

This branch replaces the closed PR #1022 (which contained leaked private-brain artifacts) with a sanitized, hardened, and tested version of the same fix. Built end-to-end via /plan-eng-review + /codex outside-voice + 13 implementation tasks (T1-T13).

Privacy / containment:

Three reports/network-intelligence/*.md files containing real founder/family names removed. Branch rebuilt from origin/master so leak commits never enter the branch graph.
Two dev scripts (scripts/sql.mjs, scripts/supersede.mjs) dropped — the latter hardcoded /data/brain and would corrupt a different brain than the one it was pointed at.
Test fixtures + JSDoc real names → alice-example / bob-example / charlie-example / dave-example per CLAUDE.md privacy rule.
Fork branch on garrytan-agents/gbrain force-pushed to scrubbed SHA; PR #1022 (fix(doctor): classify supervisor crashes vs clean restarts + auto-heal spec) closed.

Bug fixes:

doctor + jobs supervisor status now correctly classify code === 0 worker exits as clean restarts (not crashes). Matches the v0.34.3.0 supervisor's restart-policy semantics. No more false-positive WARNs after autopilot drain.
resolveEntitySlug adds a prefix-expansion step between fuzzy match and slugify fallback. Bare first names like "Alice" now resolve to people/alice-example instead of spawning an unparented alice.md at brain root.
writeFactsToFence adds a defensive stub-creation guard for unprefixed slugs. Facts still persist via the legacy DB-only path; only the phantom markdown file is refused.
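
The prefix-expansion step can be pictured with a small in-memory sketch. This is an illustration only, not the code in src/core/entities/resolve.ts: the real resolver queries the database, and the `PageCandidate` type, the contents of `PREFIX_EXPANSION_DIRS`, and the `dir/name-` match pattern below are assumptions.

```typescript
// Hypothetical in-memory stand-in for the pages table.
interface PageCandidate {
  slug: string;        // e.g. "people/alice-example"
  connections: number; // inbound links + outbound links + chunks
}

// Assumed directory list; the PR only names the constant.
const PREFIX_EXPANSION_DIRS = ["people", "companies"];

// A bare name is a single token: no whitespace, hyphen, or directory prefix.
function isBareName(input: string): boolean {
  const s = input.trim();
  return s.length > 0 && !/[\s\-\/]/.test(s);
}

// Returns the best prefixed slug for a bare name, or null so the caller
// can fall through to its slugify fallback.
function tryPrefixExpansion(input: string, pages: PageCandidate[]): string | null {
  if (!isBareName(input)) return null;
  const name = input.trim().toLowerCase();
  const matches = pages.filter((p) =>
    PREFIX_EXPANSION_DIRS.some(
      (dir) => p.slug === `${dir}/${name}` || p.slug.startsWith(`${dir}/${name}-`),
    ),
  );
  if (matches.length === 0) return null;
  // Tiebreaker: the candidate with the highest connection count wins.
  return matches.reduce((best, p) => (p.connections > best.connections ? p : best)).slug;
}
```

Multi-word and hyphenated inputs return null here, which mirrors the "bypass" cases listed under Test Coverage.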

DRY:

New src/core/minions/exit-classification.ts is the one source of truth for "is this exit a crash?" Three call sites (supervisor restart policy, doctor's supervisor check, jobs supervisor status) call the same helper. Signature consumes audit-JSON shape (code: number | null) so audit-log readers and Node-callback writers stay aligned.

Observability:

New stub_guard_24h doctor check reads ~/.gbrain/audit/stub-guard-YYYY-Www.jsonl. WARN at >10 hits/24h; OK with count when non-zero; silent when zero. Wired with a v0.36 sunset criterion baked into the JSDoc.
New stub-guard audit log uses a dual-week-aware reader that reads both current AND previous ISO-week files before timestamp-filtering — deliberately diverges from supervisor-audit.ts:readSupervisorEvents which loses 24h-window correctness across Monday 00:00 UTC. Follow-up TODO filed to fix the supervisor reader with the same pattern.

Performance:

tryPrefixExpansion rewritten from derived-table JOINs (whole-table aggregation per call) to correlated subqueries scoped to slug-LIKE candidates. Measured 58x speedup (18.16ms → 0.31ms median on 5K pages + 50K links + 25K chunks). Behavior preserved; tiebreaker semantics unchanged.

Test Coverage

NEW FILES:                                            EXISTING FILES EXTENDED:
[+] src/core/minions/exit-classification.ts           [+] test/fence-write.test.ts (+3 stub-guard cases)
  └── classifyWorkerExit()                              ├── [★★★] Bare slug refused, audit fired
      ├── [★★★] code=0 → clean_exit                     ├── [★★★] Prefixed slug bypasses guard
      ├── [★★★] code=1, null, undefined, 137 → crash    └── [★★★] Empty facts no-op (no guard fire)
      └── [★★★] Consumer wire-up + audit-shape roundtrip
                                                      [+] test/facts-backstop.test.ts (+1 case)
[+] src/core/facts/stub-guard-audit.ts                  └── [★★★] Bare-name routes to engine.insertFact
  ├── computeStubGuardAuditFilename()                       (no phantom file, fact persists)
  │   ├── [★★★] Mid-week ISO-week math
  │   └── [★★★] Year-boundary 2027-01-01 → W53/2026     [+] test/doctor.test.ts (+5 cases)
  ├── logStubGuardEvent()                                 ├── [★★★] Check name stub_guard_24h
  │   ├── [★★★] JSONL append with ts                      ├── [★★★] WARN threshold > 10
  │   └── [★★★] Never throws on unwritable dir            ├── [★★★] Fix hint points at audit log
  └── readRecentStubGuardEvents()                         ├── [★★★] Uses dual-week reader, not supervisor
      ├── [★★★] DUAL-WEEK boundary read (Sun→Mon)         └── [★★★] Zero hits emits no check
      ├── [★★★] Sorted oldest-first
      ├── [★★★] Missing file returns []
      ├── [★★★] Malformed JSON tolerated
      └── [★★★] Required fields enforced

[+] src/core/entities/resolve.ts (prefix-expansion + SQL rewrite)
  └── tryPrefixExpansion() + isBareName()
      ├── [★★★] Single match + lowercase + slug fallback (×3)
      ├── [★★★] Multi-match tiebreaker (connection count)
      ├── [★★★] Multi-word + hyphenated bypass
      └── [★★★] Empty input

PERF REGRESSION GUARD:
[+] test/entity-resolve-perf.slow.test.ts
  └── [★★★] OLD shape vs NEW shape side-by-side, 5x median assertion
      └── Measured: old=18.16ms, new=0.31ms, speedup=58.22x

Tests added: 36 new (17 exit-classification + 9 stub-guard-audit + 13 entity-resolve +
  3 fence-write + 1 backstop + 5 doctor + 1 perf regression — minus reuse double-counts).
Test counts: 6621 parallel pass + 19 serial files pass on full suite, 0 failures.

Pre-Landing Review

Plan-eng-review run inline (in plan mode) before implementation. 9 decisions locked (D1-D9):

D1: PII reports → full sweep (force-push fork + close PR)
D2: real names in tests/comments → full scrub to placeholders
D3: dev scripts → drop both
D4: crash classification duplication → extract helper (T7)
D5: stub guard hardening → audit + doctor + tests + sunset comment
D6: doctor/jobs consumer tests → both
D7: prefix-expansion perf → correlated-subquery rewrite + perf gate
D8: full lake scope kept
D9: rebuild from master (not git-rm-at-tip)

Codex outside-voice surfaced 16 findings; 6 plan corrections absorbed inline:

T3 rebuild workflow (was git-rm-at-tip → rebuild for branch-history correctness)
T7 helper signature consumes audit-JSON shape (not Node callback)
T8 dual-week reader (don't copy supervisor-audit's boundary bug)
T10 extends test/fence-write.test.ts (not new file)
T12 correlated subqueries (no CTE cap that excluded correct candidates)
T13 baseline-ratio perf assertion (not absolute wall-clock)

Plan Completion

All 13 implementation tasks (T1-T13) completed and verified. No deferred items. 4 follow-up TODOs filed for v0.36.x:

Fix supervisor-audit.ts:77 reader's ISO-week boundary bug (use stub-guard reader as template)
Decommission stub guard once sunset criterion holds (track via stub_guard_24h doctor surface)
Make PREFIX_EXPANSION_DIRS config-driven
Sweep pre-existing "wintermute" references out of CHANGELOG.md narrative text

TODOS

4 new follow-ups added to TODOS.md under the "kinshasa-v3 follow-ups (v0.35.2.0)" section. No existing TODOs marked complete (this PR's work was bug fixes + new code, not addressing pre-existing TODO items).

Plan file

Lives at ~/.claude/plans/review-all-these-changes-compiled-bee.md. 13 tasks, 9 decisions, GSTACK REVIEW REPORT as terminal section. Plan-eng-review + Codex outside-voice both CLEARED.

Test plan

bun run test passes (6621 unit + 19 serial files, 0 failures)
bun run typecheck clean

Verify the supervisor + stub_guard_24h doctor checks render correctly:

gbrain doctor --json | jq '.checks[] | select(.name == "supervisor" or .name == "stub_guard_24h")'

Confirm no reports/network-intelligence/ directory in the branch tree
Confirm no wintermute references introduced (pre-existing CHANGELOG references flagged in follow-up TODO)
Branch rebuilt from origin/master via T3 (not git-rm-at-tip); leak commits never enter the branch graph

🤖 Generated with Claude Code

…olver + stub guard

- doctor.ts/jobs.ts: classify worker exits with code !== 0 as real crashes vs code === 0 clean restarts (separate counter); fixes false-positive WARN on healthy supervisors
- entities/resolve.ts: prefix-expansion step between fuzzy match and slugify fallback catches bare first names that score too low on pg_trgm; picks highest-connection candidate as tiebreaker
- facts/fence-write.ts: stub-creation guard refuses to spawn unprefixed entity pages at brain root
- facts/backstop.ts: routes stubGuardBlocked facts to engine.insertFact so the fact still persists even when no markdown file is created
- docs/issues/doctor-auto-heal-and-scoring.md: spec for follow-up doctor health-score improvements
- .gitignore: guard reports/network-intelligence/ (private brain exports)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…d JSDoc

Replace YC partner names with placeholders per CLAUDE.md privacy rule: alice-example, bob-example, charlie-example, dave-example. Stripe and Stripe Atlas retained (allowed household brands; exercises the two-word company-prefix case).

Test semantics preserved:

- Alice / Dave: single-match cases
- Bob / Charlie: multi-match tiebreaker cases (winner has more chunks)

All 13 entity-resolve cases pass with the scrubbed fixtures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three call sites were inline-classifying worker exits: supervisor's restart policy (child-worker-supervisor.ts:291), doctor's supervisor check (doctor.ts:1016), and jobs supervisor status (jobs.ts:806). Same rule, three copies — drift risk if one is updated without the others. Extract to src/core/minions/exit-classification.ts as a pure function. Signature consumes audit-JSON shape ({ code: number | null }) so doctor and jobs (which read serialized events from JSONL) and supervisor (which reads Node's exit callback) call the same function. Helper's classification rule: code === 0 → clean_exit, everything else (non-zero, null, undefined, missing) → crash. Default-to-crash prevents corrupted rows from silently demoting into the clean-restart bucket. 5 hermetic unit tests (test/exit-classification.test.ts) pin all edge cases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
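
The rule described above is small enough to sketch in full. The return values (`clean_exit`, `crash`) and the `{ code: number | null }` input shape come from the commit message; the type names and the `countExits` consumer are hypothetical and only show how a doctor-style caller might keep separate counters.

```typescript
type WorkerExitClass = "clean_exit" | "crash";

// Audit-JSON shape. `code` can be null (signal kill) or missing (corrupted row).
interface WorkerExitEvent {
  code?: number | null;
}

// Only an explicit exit code of 0 counts as clean; everything else defaults to crash.
function classifyWorkerExit(event: WorkerExitEvent): WorkerExitClass {
  return event.code === 0 ? "clean_exit" : "crash";
}

// Hypothetical consumer: tally crashes and clean restarts separately.
function countExits(events: WorkerExitEvent[]): { crashes: number; cleanRestarts: number } {
  let crashes = 0;
  let cleanRestarts = 0;
  for (const event of events) {
    if (classifyWorkerExit(event) === "crash") crashes += 1;
    else cleanRestarts += 1;
  }
  return { crashes, cleanRestarts };
}
```

Defaulting to `crash` means a row with a missing or null code can never be counted as a clean restart by accident.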

Wire telemetry into the v0.34.5 stub-guard at fence-write.ts:190. Every guard fire now appends a JSONL line to ~/.gbrain/audit/stub-guard-YYYY-Www.jsonl with {ts, slug, source_id, fact_count}.

Operator visibility for the sunset criterion: when the new audit log reads <5 hits/week for 3 consecutive weeks on production brains, the prefix-expansion in resolveEntitySlug is sufficient and the guard can be removed in v0.36.

Reader (readRecentStubGuardEvents) deliberately diverges from supervisor-audit.ts:readSupervisorEvents: it reads BOTH the current AND previous ISO-week file before filtering by ts. supervisor-audit's reader only reads the current week, which loses 24h-window correctness across Monday 00:00 UTC (a Sunday 23:55 event lives in last week's file). The 2-file read costs nothing and makes the window actually 24h.

9 hermetic unit tests pin filename math, the writer's swallows-errors contract, the cross-week-boundary read, sort order, missing-file behavior, and malformed-row tolerance. The cross-week test is the regression guard: if a future refactor copies the supervisor's single-file pattern, that test fails.

Follow-up TODO (not in this PR): fix readSupervisorEvents to use the same 2-file pattern. The new stub-guard reader becomes the canonical template to copy back.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
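
A minimal sketch of the filename math and the dual-week read, under stated assumptions: the function names and the row fields come from this PR, but the bodies are illustrative, `source_id` is assumed to be a string, and the `readFile` callback is a hypothetical injection point standing in for the real file access so the sketch stays side-effect free.

```typescript
interface StubGuardEvent {
  ts: string;         // ISO-8601 timestamp
  slug: string;
  source_id: string;  // assumed type
  fact_count: number;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// ISO-8601 week: the week belongs to the year that contains its Thursday.
function computeStubGuardAuditFilename(now: Date): string {
  const t = new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate()));
  const dayOfWeek = t.getUTCDay() || 7;          // Mon=1 .. Sun=7
  t.setUTCDate(t.getUTCDate() + 4 - dayOfWeek);  // move to this week's Thursday
  const yearStart = Date.UTC(t.getUTCFullYear(), 0, 1);
  const week = Math.ceil(((t.getTime() - yearStart) / DAY_MS + 1) / 7);
  return `stub-guard-${t.getUTCFullYear()}-W${String(week).padStart(2, "0")}.jsonl`;
}

// Reads the previous AND current ISO-week files, then filters by timestamp.
// `readFile` returns null when the file does not exist.
function readRecentStubGuardEvents(
  now: Date,
  readFile: (filename: string) => string | null,
  windowMs: number = DAY_MS,
): StubGuardEvent[] {
  const filenames = [
    computeStubGuardAuditFilename(new Date(now.getTime() - 7 * DAY_MS)),
    computeStubGuardAuditFilename(now),
  ];
  const cutoff = now.getTime() - windowMs;
  const events: StubGuardEvent[] = [];
  for (const filename of filenames) {
    const text = readFile(filename);
    if (text === null) continue; // missing file contributes nothing
    for (const line of text.split("\n")) {
      if (line.trim() === "") continue;
      try {
        const row = JSON.parse(line);
        // Required fields: skip rows without a usable ts and slug.
        if (typeof row?.ts !== "string" || typeof row?.slug !== "string") continue;
        const when = Date.parse(row.ts);
        if (Number.isNaN(when) || when < cutoff) continue;
        events.push(row as StubGuardEvent);
      } catch {
        // Malformed JSON rows are tolerated and skipped.
      }
    }
  }
  return events.sort((a, b) => Date.parse(a.ts) - Date.parse(b.ts)); // oldest first
}
```

With `now` set shortly after Monday 00:00 UTC, a Sunday 23:55 row is still returned because the previous week's file is always consulted.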

Adds a new doctor check that reads ~/.gbrain/audit/stub-guard-YYYY-Www.jsonl (via the dual-week-aware reader from T8) and surfaces the 24h fire count.

WARN at >10 fires; at that rate the prefix-expansion in resolveEntitySlug is probably missing a case (typo prefix, alias, non-Latin script) and operators should grep the audit log for the offending slugs. Below the threshold but non-zero shows as OK with a count, so operators can watch the v0.36 sunset criterion (<5/week for 3 weeks → guard can be removed). Zero hits emits no check, keeping the doctor output clean on healthy brains.

5 source-grep regression tests pin the contract: check name, WARN threshold, fix hint mentions the audit log + the resolver function name, reader is the dual-week-aware variant (NOT the supervisor-audit single-week pattern), and zero-hits stays silent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
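
The three-way outcome (silent, OK with count, WARN) could look roughly like this. Only the check name and the >10 threshold come from the commit message; the `DoctorCheck` shape, the function name, and the message wording are assumptions made for illustration.

```typescript
// Hypothetical check shape; the real doctor check type may differ.
interface DoctorCheck {
  name: string;
  status: "ok" | "warn";
  message: string;
}

const STUB_GUARD_WARN_THRESHOLD = 10;

// Returns null when there were no hits, so nothing is rendered on healthy brains.
function buildStubGuardCheck(hits24h: number): DoctorCheck | null {
  if (hits24h === 0) return null;
  if (hits24h > STUB_GUARD_WARN_THRESHOLD) {
    return {
      name: "stub_guard_24h",
      status: "warn",
      message:
        `${hits24h} stub-guard hits in 24h; check the stub-guard audit log ` +
        `for slugs that resolveEntitySlug is missing`,
    };
  }
  return { name: "stub_guard_24h", status: "ok", message: `${hits24h} stub-guard hits in 24h` };
}
```

Note the strict comparison: exactly 10 hits still reports OK, matching "WARN at >10".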

…layers

- fence-write.test.ts: 3 new cases for the v0.34.5 stub guard. Bare slugs return {inserted: 0, stubGuardBlocked: true, ids: []} and create no file/.tmp at brain root. Prefixed slugs bypass the guard (regression guard against accidentally inverting the slug.includes('/') check). Empty facts array short-circuits before the guard fires.
- facts-backstop.test.ts: 1 new case for the end-to-end routing. A bare-name LLM extraction resolves through to a bare slug, hits the guard, and lands in the facts table via engine.insertFact (DB-only). No phantom .md file; entity_slug stores the bare slug; source_markdown_slug is null. This is the routing contract Codex flagged as a "split-brain" data shape; the test pins the by-design behavior so a future refactor can't silently drop these facts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
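
As a rough sketch of the guard these tests pin: the blocked result shape and the `slug.includes('/')` check are taken from the commit message, while the function name `applyStubGuard`, the `ids: number[]` element type, and the audit callback are hypothetical.

```typescript
// Result shape returned when the guard refuses to create a stub page.
interface BlockedWriteResult {
  inserted: 0;
  stubGuardBlocked: true;
  ids: number[]; // element type is an assumption; always empty when blocked
}

type AuditFn = (event: { slug: string; fact_count: number }) => void;

// Returns a blocked result for unprefixed slugs, or null when the normal
// fence write should proceed (or when there is nothing to write).
function applyStubGuard(slug: string, factCount: number, audit: AuditFn): BlockedWriteResult | null {
  if (factCount === 0) return null;    // empty writes short-circuit before the guard
  if (slug.includes("/")) return null; // prefixed slugs (e.g. "people/...") bypass the guard
  audit({ slug, fact_count: factCount });
  return { inserted: 0, stubGuardBlocked: true, ids: [] };
}
```

A caller that receives the blocked result would then persist the facts through the DB-only path rather than dropping them, as the backstop test describes.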

12 new cases on top of the 5 helper unit tests:

- doctor.ts / jobs.ts / child-worker-supervisor.ts each import the helper
- All three call classifyWorkerExit at least once
- doctor.ts and jobs.ts no longer carry the pre-T7 inline filter
- supervisor uses the helper result to choose the clean_exit branch
- audit-event shape round-trip: code=0 → clean_exit, code=1 → crash, code=null+SIGKILL → crash (catches future shape changes)

The regression guards (3) and the wire-up checks (6) close the gap that motivated T7 in the first place: if a future change accidentally re-inlines the filter or shifts the audit event shape, the test fails before production sees the silent divergence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the derived-table JOIN shape in tryPrefixExpansion with correlated subqueries.

The pre-fix SQL did

  LEFT JOIN (SELECT to_page_id, COUNT(*) FROM links GROUP BY to_page_id) li ON ...

which forced the planner to aggregate the entire links + content_chunks tables on every prefix-expansion call: O(N) per call where N is total links/chunks in the brain. On a 100K-link / 50K-chunk brain that's slow enough to bottleneck fact-extraction.

New shape uses correlated subqueries:

  (SELECT COUNT(*) FROM links WHERE to_page_id = p.id)
  + (SELECT COUNT(*) FROM links WHERE from_page_id = p.id)
  + (SELECT COUNT(*) FROM content_chunks WHERE page_id = p.id)

The slug LIKE filter is already selective (typical brain has 0-5 pages per prefix), so the three subqueries run N≈3 times per matched row against the existing indexes on links.to_page_id, links.from_page_id, and content_chunks.page_id.

Behavior preserved: 13/13 entity-resolve tests pass (single-match + multi-match tiebreaker + edge cases).

Codex's outside-voice review caught the dead-end design that an earlier draft of this plan proposed (a CTE with `LIMIT 50` candidate cap, which would have excluded correct high-connection candidates if their slug sorted late). Correlated subqueries without a candidate cap are the cleaner shape that lets the LIKE filter do the bounding work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
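
For readers who want to see the whole statement, here is one way the correlated-subquery shape might be assembled. The three subqueries are quoted from the commit message; the `pages` table name, the `dir/name-%` LIKE pattern, the secondary `ORDER BY p.slug`, and the `buildPrefixExpansionQuery` helper are assumptions, not the query in resolve.ts.

```typescript
interface SqlQuery {
  text: string;
  values: string[];
}

// Escape LIKE wildcards so a name containing % or _ is matched literally
// (backslash is the default LIKE escape character in Postgres).
function escapeLike(value: string): string {
  return value.replace(/[\\%_]/g, "\\$&");
}

// Hypothetical builder: one parameterized query per candidate directory.
function buildPrefixExpansionQuery(dir: string, bareName: string): SqlQuery {
  const text = `
    SELECT p.slug,
           (SELECT COUNT(*) FROM links WHERE to_page_id = p.id)
         + (SELECT COUNT(*) FROM links WHERE from_page_id = p.id)
         + (SELECT COUNT(*) FROM content_chunks WHERE page_id = p.id) AS connections
    FROM pages p
    WHERE p.slug LIKE $1
    ORDER BY connections DESC, p.slug ASC
    LIMIT 1`;
  return { text, values: [`${dir}/${escapeLike(bareName.toLowerCase())}-%`] };
}
```

The subqueries only execute for rows that survive the LIKE filter, and there is no cap on the candidate set before ranking, so the highest-connection candidate is still considered no matter how its slug sorts.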

Hermetic PGLite benchmark with 5K pages + 50K links + 25K chunks. Runs the pre-T12 derived-table shape and the new correlated-subquery shape side-by-side against the same fixture, asserts NEW >= 5x faster than OLD.

Baseline-ratio, not absolute wall-clock: different machines / Bun versions / CI load can shift absolute timings by 10x without indicating a real regression, but the SHAPE difference between "aggregate the full tables" and "correlated subquery per candidate" is what we care about.

Measured: old_median=18.16ms, new_median=0.31ms, speedup=58.22x. The 5x assertion has plenty of headroom.

The OLD SQL is embedded verbatim as the regression baseline. If a future refactor re-introduces full-table aggregation (LEFT JOIN against SELECT...GROUP BY over the whole links or content_chunks table), the test fails.

PGLite-only: Postgres planner can shape derived-table JOINs differently enough that the 5x ratio could be noise on a 5K-page fixture. The structural correctness of the rewrite is the same on both; this is purely a planner-shape regression guard.

.slow.test.ts suffix keeps it out of the fast loop (run via `bun run test:slow`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
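
The baseline-ratio idea can be sketched independently of PGLite. The helper names below are hypothetical and the sample timings in the usage note are made up for illustration; only the 5x minimum ratio and the use of medians come from the commit message.

```typescript
// Median of a non-empty list of timing samples (milliseconds).
function median(samples: number[]): number {
  if (samples.length === 0) throw new Error("median of empty sample set");
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Ratio of old median to new median; > 1 means the new shape is faster.
function speedupRatio(oldSamplesMs: number[], newSamplesMs: number[]): number {
  return median(oldSamplesMs) / median(newSamplesMs);
}

// Throws when the new shape is not at least `minRatio` times faster than the old one.
function assertSpeedupAtLeast(oldSamplesMs: number[], newSamplesMs: number[], minRatio = 5): void {
  const ratio = speedupRatio(oldSamplesMs, newSamplesMs);
  if (ratio < minRatio) {
    throw new Error(`expected >= ${minRatio}x speedup, measured ${ratio.toFixed(2)}x`);
  }
}
```

Because both shapes are timed on the same machine in the same run, a call such as `assertSpeedupAtLeast([18, 19, 20], [0.3, 0.4, 0.5])` passes or fails on the relative shape of the queries rather than on absolute wall-clock time.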

Wave content: - Privacy scrub: PII rebuilt out of branch history; real names → placeholders - Bug fix: doctor + jobs no longer count clean worker exits as crashes - Bug fix: entity resolver prefix-expansion catches bare first names - DRY refactor: classifyWorkerExit() helper (one rule, 3 call sites) - Observability: stub_guard_24h doctor check + ISO-week audit log - Perf: 58x speedup on tryPrefixExpansion query shape Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

VERSION/package.json/CHANGELOG header rebumped to v0.35.4.0 per user request (queue allocation). TODOS.md rephrased to not literally name the banned private-agent string — that was the CI failure root cause on the v0.35.2.0 push. CHANGELOG.md is on check-privacy.sh's allow-list (meta-documentation exception); TODOS.md is not. CI re-runs against this commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# Conflicts: # CHANGELOG.md # VERSION # package.json

* upstream/master:
  v0.37.0.0 feat(skillpack): registry cathedral — third-party publish + install + 10/10 quality bar (garrytan#1208)
  v0.36.6.0 feat: cross-modal search wave (text↔image + unified column + LLM intent) (garrytan#1165)
  v0.36.5.0 feat: secure DATABASE_URL access for shell jobs (inherit: ["database_url"]) (garrytan#1192)
  v0.36.4.0 feat: brain-health-100 — autonomous remediation via doctor --remediate + Minions (garrytan#1193)
  fix(docs): comprehensive drift audit — contradictions, broken links, stale refs (garrytan#1201)
  v0.36.3.0 feat: dynamic embedding column selection for search (garrytan#1164)
  v0.36.2.0 feat: ZeroEntropy as default + zero-based README rewrite (garrytan#1136)
  v0.36.1.1 fix-wave: community PR triage + 28 atomic fixes (garrytan#1182)
  v0.36.1.0 Hindsight calibration wave: brain learns how you tend to be wrong (garrytan#1139)
  v0.36.0.0 feat(skillpack): scaffold + reference + harvest (retire managed-block install) (garrytan#1130)
  v0.35.8.0 feat(cycle): phantom-page redirect inside extract_facts (garrytan#1138)
  v0.35.7.0 feat: temporal trajectory + founder scorecard (Phases 2-4) (garrytan#1131)
  v0.35.6.0 feat(search): floor-ratio gate for metadata boost stages (closes garrytan#1091) (garrytan#1129)
  v0.35.5.1 fix(doctor): stop counting clean supervisor exits as crashes (garrytan#1108)
  v0.35.5.0 fix wave: bootstrap + orphans + think MCP + worktree + walker (garrytan#1111)
  v0.35.4.0 fix(doctor,entities): supervisor crash classification + bare-name resolver + 58x perf + stub guard observability (garrytan#1085)
  v0.35.3.1 feat(eval): temporal-aware contradiction probe + verdict enum (garrytan#1052)
  v0.35.3.0 fix wave: extract_facts items + git --no-recurse-submodules placement (garrytan#1053)

# Conflicts: # src/core/postgres-engine.ts # test/schema-bootstrap-coverage.test.ts

garrytan and others added 12 commits May 15, 2026 18:36

Merge remote-tracking branch 'origin/master' into garrytan/kinshasa-v3

99afb3c

garrytan changed the title ~~v0.35.2.0 fix(doctor,entities): supervisor crash classification + bare-name resolver + 58x perf + stub guard observability~~ v0.35.4.0 fix(doctor,entities): supervisor crash classification + bare-name resolver + 58x perf + stub guard observability May 16, 2026

garrytan added 2 commits May 17, 2026 08:02

Merge remote-tracking branch 'origin/master' into garrytan/kinshasa-v3

01f7cf1

# Conflicts: # CHANGELOG.md # VERSION # package.json

Merge remote-tracking branch 'origin/master' into garrytan/kinshasa-v3

5b2368e

# Conflicts: # CHANGELOG.md # VERSION # package.json

This was referenced May 17, 2026

fix(entities): prefix-expansion resolver + stub-guard + dropped-fact audit #1010

Closed

v0.35.3.2 docs(designs): MERGE_PHANTOMS retrospective + future-implementation guide #1109

Open

garrytan merged commit 0c6fcab into master May 17, 2026
7 checks passed

garrytan merged 14 commits into master from garrytan/kinshasa-v3
