v0.35.4.0 fix(doctor,entities): supervisor crash classification + bare-name resolver + 58x perf + stub guard observability#1085

Merged
garrytan merged 14 commits into master from garrytan/kinshasa-v3
May 17, 2026
Conversation

@garrytan

Owner

Summary

This branch replaces the closed PR #1022 (which contained leaked private-brain artifacts) with a sanitized, hardened, and tested version of the same fix. Built end-to-end via /plan-eng-review + /codex outside-voice + 13 implementation tasks (T1-T13).

Privacy / containment:

  • Three reports/network-intelligence/*.md files containing real founder/family names removed. Branch rebuilt from origin/master so leak commits never enter the branch graph.
  • Two dev scripts (scripts/sql.mjs, scripts/supersede.mjs) dropped — the latter hardcoded /data/brain and would corrupt a different brain than the one it was pointed at.
  • Test fixtures + JSDoc real names → alice-example / bob-example / charlie-example / dave-example per CLAUDE.md privacy rule.
  • Fork branch on garrytan-agents/gbrain force-pushed to scrubbed SHA; PR #1022 (fix(doctor): classify supervisor crashes vs clean restarts + auto-heal spec) closed.

Bug fixes:

  • doctor + jobs supervisor status now correctly classify code === 0 worker exits as clean restarts (not crashes). Matches the v0.34.3.0 supervisor's restart-policy semantics. No more false-positive WARNs after autopilot drain.
  • resolveEntitySlug adds a prefix-expansion step between fuzzy match and slugify fallback. Bare first names like "Alice" now resolve to people/alice-example instead of spawning an unparented alice.md at brain root.
  • writeFactsToFence adds a defensive stub-creation guard for unprefixed slugs. Facts still persist via the legacy DB-only path; only the phantom markdown file is refused.
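
The resolution order described above can be sketched roughly as follows. This is a simplified, synchronous illustration, not the code in src/core/entities/resolve.ts: the `ResolveStep` type, the injected `fuzzyMatch` / `prefixExpansion` callbacks, the `slugify` helper, and the exact bare-name rule (single token, no whitespace, hyphen, or slash) are all assumptions made for the example.

```typescript
// Hypothetical step signature: a step returns a slug, or null when it has no answer.
type ResolveStep = (name: string) => string | null;

// Assumption: a "bare name" is one token with no whitespace, hyphen, or slash.
function isBareName(name: string): boolean {
  const trimmed = name.trim();
  return trimmed.length > 0 && !/[\s\-\/]/.test(trimmed);
}

// Simplified slugify, for illustration only.
function slugify(name: string): string {
  return name
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

function resolveEntitySlugSketch(
  name: string,
  fuzzyMatch: ResolveStep,
  prefixExpansion: ResolveStep,
): string {
  // Step 1: fuzzy match wins when it finds something.
  const fuzzy = fuzzyMatch(name);
  if (fuzzy !== null) return fuzzy;

  // Step 2: prefix expansion, only for bare names.
  // Multi-word and hyphenated input bypasses this step.
  if (isBareName(name)) {
    const expanded = prefixExpansion(name);
    if (expanded !== null) return expanded;
  }

  // Step 3: slugify fallback.
  return slugify(name);
}
```

With a prefix-expansion step that knows about people/alice-example, "Alice" resolves to that page, while "Alice Smith" skips the step and falls through to the slugify fallback.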

DRY:

  • New src/core/minions/exit-classification.ts is the one source of truth for "is this exit a crash?" Three call sites (supervisor restart policy, doctor's supervisor check, jobs supervisor status) call the same helper. Signature consumes audit-JSON shape (code: number | null) so audit-log readers and Node-callback writers stay aligned.
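
A minimal sketch of the shared rule, assuming the helper looks roughly like this. The `clean_exit` / `crash` labels and the `code: number | null` shape come from this PR; the `WorkerExitEvent` interface name and the optional `code` field are illustrative assumptions.

```typescript
type ExitClassification = "clean_exit" | "crash";

// Audit-JSON shape: `code` is null when the worker died from a signal.
// Marked optional here (an assumption) so rows with a missing field still type-check.
interface WorkerExitEvent {
  code?: number | null;
}

function classifyWorkerExit(event: WorkerExitEvent): ExitClassification {
  // Only an explicit exit code of 0 is clean. Non-zero, null, undefined,
  // and missing codes all default to crash, so a corrupted row can never
  // be demoted into the clean-restart bucket.
  return event.code === 0 ? "clean_exit" : "crash";
}

// Example consumer: count crashes among events parsed from an audit log.
function countCrashes(events: WorkerExitEvent[]): number {
  return events.filter((e) => classifyWorkerExit(e) === "crash").length;
}
```

Because the function takes the serialized shape, a caller holding Node's exit callback arguments would wrap them as `{ code }` before classifying, and a JSONL reader can pass parsed rows straight through.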

Observability:

  • New stub_guard_24h doctor check reads ~/.gbrain/audit/stub-guard-YYYY-Www.jsonl. WARN at >10 hits/24h; OK with count when non-zero; silent when zero. Wired with a v0.36 sunset criterion baked into the JSDoc.
  • New stub-guard audit log uses a dual-week-aware reader that reads both current AND previous ISO-week files before timestamp-filtering — deliberately diverges from supervisor-audit.ts:readSupervisorEvents which loses 24h-window correctness across Monday 00:00 UTC. Follow-up TODO filed to fix the supervisor reader with the same pattern.
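
To make the ISO-week file naming concrete, here is a hedged sketch of how the weekly filename and the two-file candidate list could be computed. The function names below are hypothetical (the PR's helper is computeStubGuardAuditFilename, whose real signature is not shown here), and zero-padding the week number is an assumption based on the `YYYY-Www` pattern.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// ISO-8601 week: weeks start on Monday and belong to the year
// that contains their Thursday.
function isoWeekOf(date: Date): { year: number; week: number } {
  const d = new Date(
    Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate()),
  );
  const dayOfWeek = d.getUTCDay() || 7; // Monday = 1 ... Sunday = 7
  d.setUTCDate(d.getUTCDate() + 4 - dayOfWeek); // shift to this week's Thursday
  const yearStart = Date.UTC(d.getUTCFullYear(), 0, 1);
  const week = Math.ceil(((d.getTime() - yearStart) / DAY_MS + 1) / 7);
  return { year: d.getUTCFullYear(), week };
}

function stubGuardAuditFilenameSketch(date: Date): string {
  const { year, week } = isoWeekOf(date);
  return `stub-guard-${year}-W${String(week).padStart(2, "0")}.jsonl`;
}

// Files a 24h-window reader should open: the current week's file and the
// previous week's file, so a window that straddles Monday 00:00 UTC is covered.
function candidateAuditFiles(now: Date): string[] {
  const current = stubGuardAuditFilenameSketch(now);
  const previous = stubGuardAuditFilenameSketch(new Date(now.getTime() - 7 * DAY_MS));
  return [current, previous];
}
```

The year-boundary case from the test matrix falls out of the Thursday rule: 2027-01-01 is a Friday, so its week belongs to 2026 and the file is stub-guard-2026-W53.jsonl.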

Performance:

  • tryPrefixExpansion rewritten from derived-table JOINs (whole-table aggregation per call) to correlated subqueries scoped to slug-LIKE candidates. Measured 58x speedup (18.16ms → 0.31ms median on 5K pages + 50K links + 25K chunks). Behavior preserved; tiebreaker semantics unchanged.

Test Coverage

NEW FILES:                                            EXISTING FILES EXTENDED:
[+] src/core/minions/exit-classification.ts           [+] test/fence-write.test.ts (+3 stub-guard cases)
  └── classifyWorkerExit()                              ├── [★★★] Bare slug refused, audit fired
      ├── [★★★] code=0 → clean_exit                     ├── [★★★] Prefixed slug bypasses guard
      ├── [★★★] code=1, null, undefined, 137 → crash    └── [★★★] Empty facts no-op (no guard fire)
      └── [★★★] Consumer wire-up + audit-shape roundtrip
                                                      [+] test/facts-backstop.test.ts (+1 case)
[+] src/core/facts/stub-guard-audit.ts                  └── [★★★] Bare-name routes to engine.insertFact
  ├── computeStubGuardAuditFilename()                       (no phantom file, fact persists)
  │   ├── [★★★] Mid-week ISO-week math
  │   └── [★★★] Year-boundary 2027-01-01 → W53/2026     [+] test/doctor.test.ts (+5 cases)
  ├── logStubGuardEvent()                                 ├── [★★★] Check name stub_guard_24h
  │   ├── [★★★] JSONL append with ts                      ├── [★★★] WARN threshold > 10
  │   └── [★★★] Never throws on unwritable dir            ├── [★★★] Fix hint points at audit log
  └── readRecentStubGuardEvents()                         ├── [★★★] Uses dual-week reader, not supervisor
      ├── [★★★] DUAL-WEEK boundary read (Sun→Mon)         └── [★★★] Zero hits emits no check
      ├── [★★★] Sorted oldest-first
      ├── [★★★] Missing file returns []
      ├── [★★★] Malformed JSON tolerated
      └── [★★★] Required fields enforced

[+] src/core/entities/resolve.ts (prefix-expansion + SQL rewrite)
  └── tryPrefixExpansion() + isBareName()
      ├── [★★★] Single match + lowercase + slug fallback (×3)
      ├── [★★★] Multi-match tiebreaker (connection count)
      ├── [★★★] Multi-word + hyphenated bypass
      └── [★★★] Empty input

PERF REGRESSION GUARD:
[+] test/entity-resolve-perf.slow.test.ts
  └── [★★★] OLD shape vs NEW shape side-by-side, 5x median assertion
      └── Measured: old=18.16ms, new=0.31ms, speedup=58.22x

Tests added: 36 new (17 exit-classification + 9 stub-guard-audit + 13 entity-resolve +
  3 fence-write + 1 backstop + 5 doctor + 1 perf regression — minus reuse double-counts).
Test counts: 6621 parallel pass + 19 serial files pass on full suite, 0 failures.

Pre-Landing Review

Plan-eng-review run inline (in plan mode) before implementation. 9 decisions locked (D1-D9):

  • D1: PII reports → full sweep (force-push fork + close PR)
  • D2: real names in tests/comments → full scrub to placeholders
  • D3: dev scripts → drop both
  • D4: crash classification duplication → extract helper (T7)
  • D5: stub guard hardening → audit + doctor + tests + sunset comment
  • D6: doctor/jobs consumer tests → both
  • D7: prefix-expansion perf → correlated-subquery rewrite + perf gate
  • D8: full lake scope kept
  • D9: rebuild from master (not git-rm-at-tip)

Codex outside-voice surfaced 16 findings; 6 plan corrections absorbed inline:

  • T3 rebuild workflow (was git-rm-at-tip → rebuild for branch-history correctness)
  • T7 helper signature consumes audit-JSON shape (not Node callback)
  • T8 dual-week reader (don't copy supervisor-audit's boundary bug)
  • T10 extends test/fence-write.test.ts (not new file)
  • T12 correlated subqueries (no CTE cap that excluded correct candidates)
  • T13 baseline-ratio perf assertion (not absolute wall-clock)

Plan Completion

All 13 implementation tasks (T1-T13) completed and verified. No deferred items. 4 follow-up TODOs filed for v0.36.x:

  • Fix supervisor-audit.ts:77 reader's ISO-week boundary bug (use stub-guard reader as template)
  • Decommission stub guard once sunset criterion holds (track via stub_guard_24h doctor surface)
  • Make PREFIX_EXPANSION_DIRS config-driven
  • Sweep pre-existing "wintermute" references out of CHANGELOG.md narrative text

TODOS

4 new follow-ups added to TODOS.md under "kinshasa-v3 follow-ups (v0.35.2.0)" section. No existing TODOs marked complete (this PR's work was bug fixes + new code, not addressing pre-existing TODO items).

Plan file

Lives at ~/.claude/plans/review-all-these-changes-compiled-bee.md. 13 tasks, 9 decisions, GSTACK REVIEW REPORT as terminal section. Plan-eng-review + Codex outside-voice both CLEARED.

Test plan

  • bun run test passes (6621 unit + 19 serial files, 0 failures)
  • bun run typecheck clean
  • Verify the supervisor + stub_guard_24h doctor checks render correctly:
    gbrain doctor --json | jq '.checks[] | select(.name == "supervisor" or .name == "stub_guard_24h")'
  • Confirm no reports/network-intelligence/ directory in the branch tree
  • Confirm no wintermute references introduced (pre-existing CHANGELOG references flagged in follow-up TODO)
  • Branch rebuilt from origin/master via T3 (not git-rm-at-tip); leak commits never enter the branch graph

🤖 Generated with Claude Code

garrytan and others added 12 commits May 15, 2026 18:36
…olver + stub guard

- doctor.ts/jobs.ts: classify worker exits with code !== 0 as real crashes
  vs code === 0 clean restarts (separate counter); fixes false-positive
  WARN on healthy supervisors
- entities/resolve.ts: prefix-expansion step between fuzzy match and
  slugify fallback catches bare first names that score too low on pg_trgm;
  picks highest-connection candidate as tiebreaker
- facts/fence-write.ts: stub-creation guard refuses to spawn unprefixed
  entity pages at brain root
- facts/backstop.ts: routes stubGuardBlocked facts to engine.insertFact
  so the fact still persists even when no markdown file is created
- docs/issues/doctor-auto-heal-and-scoring.md: spec for follow-up doctor
  health-score improvements
- .gitignore: guard reports/network-intelligence/ (private brain exports)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d JSDoc

Replace YC partner names with placeholders per CLAUDE.md privacy rule:
alice-example, bob-example, charlie-example, dave-example. Stripe and
Stripe Atlas retained (allowed household brands; exercises the two-word
company-prefix case).

Test semantics preserved:
- Alice / Dave: single-match cases
- Bob / Charlie: multi-match tiebreaker cases (winner has more chunks)

All 13 entity-resolve cases pass with the scrubbed fixtures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three call sites were inline-classifying worker exits: supervisor's
restart policy (child-worker-supervisor.ts:291), doctor's supervisor
check (doctor.ts:1016), and jobs supervisor status (jobs.ts:806). Same
rule, three copies — drift risk if one is updated without the others.

Extract to src/core/minions/exit-classification.ts as a pure function.
Signature consumes audit-JSON shape ({ code: number | null }) so doctor
and jobs (which read serialized events from JSONL) and supervisor (which
reads Node's exit callback) call the same function. Helper's classification
rule: code === 0 → clean_exit, everything else (non-zero, null, undefined,
missing) → crash. Default-to-crash prevents corrupted rows from silently
demoting into the clean-restart bucket.

5 hermetic unit tests (test/exit-classification.test.ts) pin all edge cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wire telemetry into the v0.34.5 stub-guard at fence-write.ts:190. Every
guard fire now appends a JSONL line to
~/.gbrain/audit/stub-guard-YYYY-Www.jsonl with {ts, slug, source_id,
fact_count}. Operator visibility for the sunset criterion: when the new
audit log reads <5 hits/week for 3 consecutive weeks on production
brains, the prefix-expansion in resolveEntitySlug is sufficient and the
guard can be removed in v0.36.

Reader (readRecentStubGuardEvents) deliberately diverges from
supervisor-audit.ts:readSupervisorEvents — it reads BOTH the current AND
previous ISO-week file before filtering by ts. supervisor-audit's reader
only reads the current week, which loses 24h-window correctness across
Monday 00:00 UTC (a Sunday 23:55 event lives in last week's file). The
2-file read costs nothing and makes the window actually 24h.
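
The filtering half of that reader might look something like the sketch below. It works on file contents already loaded into strings (file I/O is left out), and the function names, the string type of `source_id`, and the choice to require all four fields are assumptions for illustration rather than the actual readRecentStubGuardEvents implementation.

```typescript
interface StubGuardEvent {
  ts: string; // ISO-8601 timestamp
  slug: string;
  source_id: string; // assumed to be a string for this sketch
  fact_count: number;
}

// Parse one JSONL file's contents, skipping malformed rows and rows
// that lack any of the required fields.
function parseAuditLines(jsonl: string): StubGuardEvent[] {
  const events: StubGuardEvent[] = [];
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    try {
      const row = JSON.parse(line);
      const valid =
        row !== null &&
        typeof row === "object" &&
        typeof row.ts === "string" &&
        !Number.isNaN(Date.parse(row.ts)) &&
        typeof row.slug === "string" &&
        typeof row.source_id === "string" &&
        typeof row.fact_count === "number";
      if (valid) events.push(row as StubGuardEvent);
    } catch (_err) {
      // Malformed JSON: tolerate it and keep reading.
    }
  }
  return events;
}

// Merge the current and previous week's contents, keep events inside the
// window ending at `now`, and return them sorted oldest-first.
function recentStubGuardEventsSketch(
  fileContents: string[],
  now: Date,
  windowMs: number = 24 * 60 * 60 * 1000,
): StubGuardEvent[] {
  const cutoff = now.getTime() - windowMs;
  const merged: StubGuardEvent[] = [];
  for (const contents of fileContents) {
    for (const event of parseAuditLines(contents)) {
      const t = Date.parse(event.ts);
      if (t >= cutoff && t <= now.getTime()) merged.push(event);
    }
  }
  return merged.sort((a, b) => Date.parse(a.ts) - Date.parse(b.ts));
}
```

Reading only the current week's contents at Monday 00:05 UTC would drop a Sunday 23:55 event; passing both files' contents keeps it inside the 24h window.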

9 hermetic unit tests pin filename math, the writer's
swallows-errors contract, the cross-week-boundary read, sort order,
missing-file behavior, and malformed-row tolerance. The cross-week test
is the regression guard: if a future refactor copies the supervisor's
single-file pattern, that test fails.

Follow-up TODO (not in this PR): fix readSupervisorEvents to use the
same 2-file pattern. The new stub-guard reader becomes the canonical
template to copy back.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a new doctor check that reads ~/.gbrain/audit/stub-guard-YYYY-Www.jsonl
(via the dual-week-aware reader from T8) and surfaces the 24h fire count.
WARN at >10 fires — at that rate the prefix-expansion in resolveEntitySlug
is probably missing a case (typo prefix, alias, non-Latin script) and
operators should grep the audit log for the offending slugs. Below the
threshold but non-zero shows as OK with a count, so operators can watch
the v0.36 sunset criterion (<5/week for 3 weeks → guard can be removed).
Zero hits emits no check, keeping the doctor output clean on healthy
brains.
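
As a rough illustration of the three outcomes described above (silent at zero, OK with a count below the threshold, WARN above it), the check could be shaped like this. The `DoctorCheck` interface, the lowercase status strings, and the message wording are assumptions; only the check name and the >10 threshold come from this commit.

```typescript
type CheckStatus = "ok" | "warn";

// Hypothetical check shape for this sketch.
interface DoctorCheck {
  name: string;
  status: CheckStatus;
  message: string;
}

const STUB_GUARD_WARN_THRESHOLD = 10;

function stubGuardCheckSketch(hits24h: number): DoctorCheck | null {
  // Zero hits: emit nothing so healthy brains get clean doctor output.
  if (hits24h === 0) return null;

  if (hits24h > STUB_GUARD_WARN_THRESHOLD) {
    return {
      name: "stub_guard_24h",
      status: "warn",
      message:
        `${hits24h} stub-guard fires in 24h; grep the audit log under ` +
        `~/.gbrain/audit/ for slugs that resolveEntitySlug is missing`,
    };
  }

  // Non-zero but at or below the threshold: OK, with the count visible
  // so operators can track the sunset criterion.
  return {
    name: "stub_guard_24h",
    status: "ok",
    message: `${hits24h} stub-guard fires in 24h`,
  };
}
```

Note the strict comparison: exactly 10 fires still reports OK, and 11 is the first count that warns.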

5 source-grep regression tests pin the contract: check name, WARN
threshold, fix hint mentions the audit log + the resolver function name,
reader is the dual-week-aware variant (NOT the supervisor-audit single-
week pattern), and zero-hits stays silent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…layers

- fence-write.test.ts: 3 new cases for the v0.34.5 stub guard. Bare slugs
  return {inserted: 0, stubGuardBlocked: true, ids: []} and create no
  file/.tmp at brain root. Prefixed slugs bypass the guard (regression
  guard against accidentally inverting the slug.includes('/') check).
  Empty facts array short-circuits before the guard fires.
- facts-backstop.test.ts: 1 new case for the end-to-end routing. A
  bare-name LLM extraction resolves through to a bare slug, hits the
  guard, and lands in the facts table via engine.insertFact (DB-only).
  No phantom .md file; entity_slug stores the bare slug;
  source_markdown_slug is null. This is the routing contract Codex
  flagged as a "split-brain" data shape — the test pins the by-design
  behavior so a future refactor can't silently drop these facts.
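
The ordering these tests pin (empty-facts short-circuit first, then the guard, then the normal write) can be sketched as below. The result fields `inserted`, `stubGuardBlocked`, and `ids` and the `slug.includes('/')` check come from the commit text; the function name, the injected `audit` and `writeFence` callbacks, the numeric id type, and the empty-facts return shape are assumptions for illustration.

```typescript
interface FenceWriteResult {
  inserted: number;
  ids: number[]; // id type assumed numeric for this sketch
  stubGuardBlocked?: boolean;
}

function writeFactsToFenceSketch(
  slug: string,
  facts: string[],
  audit: (slug: string, factCount: number) => void,
  writeFence: (slug: string, facts: string[]) => number[],
): FenceWriteResult {
  // Empty input short-circuits before the guard can fire.
  if (facts.length === 0) return { inserted: 0, ids: [] };

  // Stub guard: an unprefixed slug would create a page at brain root,
  // so refuse the markdown write and record the event instead.
  if (!slug.includes("/")) {
    audit(slug, facts.length);
    return { inserted: 0, stubGuardBlocked: true, ids: [] };
  }

  // Prefixed slug: proceed with the normal fence write.
  const ids = writeFence(slug, facts);
  return { inserted: ids.length, ids };
}
```

A caller that sees `stubGuardBlocked: true` is then responsible for routing the facts to the DB-only path, which is the backstop behavior the second test case covers.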

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
12 new cases on top of the 5 helper unit tests:
- doctor.ts / jobs.ts / child-worker-supervisor.ts each import the helper
- All three call classifyWorkerExit at least once
- doctor.ts and jobs.ts no longer carry the pre-T7 inline filter
- supervisor uses the helper result to choose the clean_exit branch
- audit-event shape round-trip: code=0 → clean_exit, code=1 → crash,
  code=null+SIGKILL → crash (catches future shape changes)

The regression guards (3) and the wire-up checks (6) close the gap that
motivated T7 in the first place: if a future change accidentally re-inlines
the filter or shifts the audit event shape, the test fails before
production sees the silent divergence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the derived-table JOIN shape in tryPrefixExpansion with
correlated subqueries. The pre-fix SQL did

  LEFT JOIN (SELECT to_page_id, COUNT(*) FROM links GROUP BY to_page_id) li ON ...

which forced the planner to aggregate the entire links + content_chunks
tables on every prefix-expansion call — O(N) per call where N is total
links/chunks in the brain. On a 100K-link / 50K-chunk brain that's slow
enough to bottleneck fact-extraction.

New shape uses correlated subqueries:

  (SELECT COUNT(*) FROM links WHERE to_page_id = p.id)
    + (SELECT COUNT(*) FROM links WHERE from_page_id = p.id)
    + (SELECT COUNT(*) FROM content_chunks WHERE page_id = p.id)

The slug LIKE filter is already selective (typical brain has 0-5 pages
per prefix), so each matched row costs only three index-backed
subqueries against the existing indexes on links.to_page_id, links.from_page_id,
and content_chunks.page_id. Behavior preserved: 13/13 entity-resolve
tests pass (single-match + multi-match tiebreaker + edge cases).

Codex's outside-voice review caught the dead-end design that an earlier
draft of this plan proposed (a CTE with `LIMIT 50` candidate cap — would
have excluded correct high-connection candidates if their slug sorted
late). Correlated subqueries without a candidate cap are the cleaner
shape that lets the LIKE filter do the bounding work.
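
For readers who want the whole query in one place, the sketch below embeds a query of the new shape in TypeScript. The three correlated subqueries and the `p` alias are taken from this commit message; the `pages` table name, its `id` and `slug` columns, the `connections` alias, the secondary sort on slug, and the `dir/name-%` pattern are assumptions and may differ from the real tryPrefixExpansion.

```typescript
// Assumed query shape. The final LIMIT 1 is applied after every LIKE
// candidate has been scored, so it is not a candidate cap.
const PREFIX_EXPANSION_SQL_SKETCH = `
  SELECT p.slug,
         (SELECT COUNT(*) FROM links WHERE to_page_id = p.id)
       + (SELECT COUNT(*) FROM links WHERE from_page_id = p.id)
       + (SELECT COUNT(*) FROM content_chunks WHERE page_id = p.id) AS connections
  FROM pages p
  WHERE p.slug LIKE $1
  ORDER BY connections DESC, p.slug ASC
  LIMIT 1
`;

// Build the LIKE pattern for a bare name inside one directory, escaping
// LIKE wildcards so user input cannot widen the match.
// The "dir/name-%" layout is an assumption based on slugs such as
// people/alice-example.
function prefixPattern(dir: string, bareName: string): string {
  const escaped = bareName.toLowerCase().replace(/[\\%_]/g, (c) => `\\${c}`);
  return `${dir}/${escaped}-%`;
}
```

The pattern would be passed as the `$1` parameter, for example `prefixPattern("people", "Alice")` producing `people/alice-%`.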

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hermetic PGLite benchmark with 5K pages + 50K links + 25K chunks. Runs
the pre-T12 derived-table shape and the new correlated-subquery shape
side-by-side against the same fixture, asserts NEW >= 5x faster than OLD.
Baseline-ratio, not absolute wall-clock — different machines / Bun
versions / CI load can shift absolute timings by 10x without indicating
a real regression, but the SHAPE difference between "aggregate the full
tables" and "correlated subquery per candidate" is what we care about.

Measured: old_median=18.16ms, new_median=0.31ms, speedup=58.22x.
The 5x assertion has plenty of headroom.
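
A baseline-ratio assertion of this kind can be sketched as follows. The helper names and the error message are hypothetical, and the timing loop that would produce the samples is omitted; only the median comparison and the 5x floor come from this commit.

```typescript
// Median of a non-empty list of timings in milliseconds.
function median(samples: number[]): number {
  if (samples.length === 0) throw new Error("median of empty sample set");
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Ratio of old to new median: values above 1 mean the new shape is faster.
function speedupRatio(oldMs: number[], newMs: number[]): number {
  return median(oldMs) / median(newMs);
}

// Fail when the new shape is not at least `minRatio` times faster.
// Comparing medians from the same run on the same machine keeps the check
// stable across hardware and CI load, unlike an absolute wall-clock budget.
function assertSpeedup(oldMs: number[], newMs: number[], minRatio: number = 5): void {
  const ratio = speedupRatio(oldMs, newMs);
  if (ratio < minRatio) {
    throw new Error(
      `expected >= ${minRatio}x speedup, measured ${ratio.toFixed(2)}x`,
    );
  }
}
```

Using the median rather than the mean keeps a single slow outlier run (GC pause, cold cache) from skewing either side of the ratio.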

The OLD SQL is embedded verbatim as the regression baseline. If a future
refactor re-introduces full-table aggregation (LEFT JOIN against
SELECT...GROUP BY over the whole links or content_chunks table), the
test fails. PGLite-only — Postgres planner can shape derived-table
JOINs differently enough that the 5x ratio could be noise on a 5K-page
fixture. The structural correctness of the rewrite is the same on both;
this is purely a planner-shape regression guard.

.slow.test.ts suffix keeps it out of the fast loop (run via
`bun run test:slow`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wave content:
- Privacy scrub: PII rebuilt out of branch history; real names → placeholders
- Bug fix: doctor + jobs no longer count clean worker exits as crashes
- Bug fix: entity resolver prefix-expansion catches bare first names
- DRY refactor: classifyWorkerExit() helper (one rule, 3 call sites)
- Observability: stub_guard_24h doctor check + ISO-week audit log
- Perf: 58x speedup on tryPrefixExpansion query shape

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
VERSION/package.json/CHANGELOG header rebumped to v0.35.4.0 per user
request (queue allocation). TODOS.md rephrased to not literally name
the banned private-agent string — that was the CI failure root cause
on the v0.35.2.0 push. CHANGELOG.md is on check-privacy.sh's allow-list
(meta-documentation exception); TODOS.md is not.

CI re-runs against this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan changed the title from "v0.35.2.0 fix(doctor,entities): supervisor crash classification + bare-name resolver + 58x perf + stub guard observability" to "v0.35.4.0 fix(doctor,entities): supervisor crash classification + bare-name resolver + 58x perf + stub guard observability" on May 16, 2026
garrytan added 2 commits May 17, 2026 08:02
# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
@garrytan garrytan merged commit 0c6fcab into master May 17, 2026
7 checks passed
brandonlipman added a commit to brandonlipman/gbrain that referenced this pull request May 29, 2026
* upstream/master:
  v0.37.0.0 feat(skillpack): registry cathedral — third-party publish + install + 10/10 quality bar (garrytan#1208)
  v0.36.6.0 feat: cross-modal search wave (text↔image + unified column + LLM intent) (garrytan#1165)
  v0.36.5.0 feat: secure DATABASE_URL access for shell jobs (inherit: ["database_url"]) (garrytan#1192)
  v0.36.4.0 feat: brain-health-100 — autonomous remediation via doctor --remediate + Minions (garrytan#1193)
  fix(docs): comprehensive drift audit — contradictions, broken links, stale refs (garrytan#1201)
  v0.36.3.0 feat: dynamic embedding column selection for search (garrytan#1164)
  v0.36.2.0 feat: ZeroEntropy as default + zero-based README rewrite (garrytan#1136)
  v0.36.1.1 fix-wave: community PR triage + 28 atomic fixes (garrytan#1182)
  v0.36.1.0 Hindsight calibration wave: brain learns how you tend to be wrong (garrytan#1139)
  v0.36.0.0 feat(skillpack): scaffold + reference + harvest (retire managed-block install) (garrytan#1130)
  v0.35.8.0 feat(cycle): phantom-page redirect inside extract_facts (garrytan#1138)
  v0.35.7.0 feat: temporal trajectory + founder scorecard (Phases 2-4) (garrytan#1131)
  v0.35.6.0 feat(search): floor-ratio gate for metadata boost stages (closes garrytan#1091) (garrytan#1129)
  v0.35.5.1 fix(doctor): stop counting clean supervisor exits as crashes (garrytan#1108)
  v0.35.5.0 fix wave: bootstrap + orphans + think MCP + worktree + walker (garrytan#1111)
  v0.35.4.0 fix(doctor,entities): supervisor crash classification + bare-name resolver + 58x perf + stub guard observability (garrytan#1085)
  v0.35.3.1 feat(eval): temporal-aware contradiction probe + verdict enum (garrytan#1052)
  v0.35.3.0 fix wave: extract_facts items + git --no-recurse-submodules placement (garrytan#1053)

# Conflicts:
#	src/core/postgres-engine.ts
#	test/schema-bootstrap-coverage.test.ts