
v0.20.0 feat: extract BrainBench to sibling gbrain-evals repo #195

Merged
garrytan merged 41 commits into master from garrytan/gbrain-evals
Apr 24, 2026

Conversation

garrytan (Owner) commented Apr 18, 2026

Summary

Extracts BrainBench — gbrain's 10/12-Cat benchmark harness + 4-adapter scorecard + 418-item amara-life-v1 fictional corpus + 314 tests — to a public sibling repo at github.com/garrytan/gbrain-evals. gbrain stays the knowledge-brain CLI + library and never pulls the ~5MB eval tree or pdf-parse devDep at install time. gbrain-evals depends on gbrain via GitHub URL and consumes it through a new public exports map.

Also folds in the v0.10.5 inferLinkType regex expansion (works_at 58% → >85% expected on rich prose; advises 41% → >85% via a new EMPLOYEE_ROLE_RE prior + broader advisor phrasings) and a test-infra fix for PGLite parallel-load stability.

What moved

| Stays in gbrain | Moves to gbrain-evals |
|-----------------|----------------------|
| src/ (CLI, MCP, engines, operations, skills runtime) | eval/ (runners, adapters, generators, schemas, gold, cli) |
| Page.type enum incl. email/slack/calendar-event/note/meeting | test/eval/ (14 files, 314 tests) |
| inferType() heuristics for inbox/chat/calendar/note/meeting dirs | docs/benchmarks/*.md (all scorecards) |
| 11 new public exports (gbrain/engine, gbrain/pglite-engine, gbrain/search/hybrid, …) | eval:* package.json scripts |
| v0.10.5 link-extraction regex expansion | pdf-parse devDep |

Test Coverage

Step 7 subagent coverage audit: 85% coverage, 1 gap (closed). The new inferType() directory heuristics (email/slack/cal/notes/meetings) had no direct unit tests. Closed with a table-driven test.each block in test/markdown.test.ts covering 9 path variants against realistic amara-life-v1 fixture paths. The test doubles as a contract check between the two repos.
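A minimal sketch of what that table-driven contract check looks like. The prefix → type mapping and `inferTypeSketch` here are illustrative stand-ins assembled from the slug conventions described in this PR; the real heuristics live in src/core/markdown.ts and the real test uses bun:test's `test.each`:

```typescript
// Illustrative only: the shipped inferType() lives in src/core/markdown.ts.
// Prefix → type mapping assumed from the one-slash slug conventions in this PR.
type PageType = "email" | "slack" | "calendar-event" | "note" | "meeting" | "page";

const PREFIX_TYPES: Array<[RegExp, PageType]> = [
  [/^emails\//, "email"],
  [/^slack\//, "slack"],
  [/^cal\//, "calendar-event"],
  [/^notes\//, "note"],
  [/^meetings?\//, "meeting"],
];

function inferTypeSketch(slug: string): PageType {
  for (const [re, type] of PREFIX_TYPES) if (re.test(slug)) return type;
  return "page"; // default when no directory heuristic fires
}

// Table-driven check in the spirit of the test.each block described above.
const cases: Array<[string, PageType]> = [
  ["emails/em-0001", "email"],
  ["slack/sl-0042", "slack"],
  ["cal/evt-0007", "calendar-event"],
  ["notes/2026-04-18-standup", "note"],
  ["meetings/mtg-0003", "meeting"],
  ["people/alice-chen", "page"],
];
for (const [slug, expected] of cases) {
  if (inferTypeSketch(slug) !== expected) throw new Error(`inferType(${slug})`);
}
```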

Net-new code paths to audit were tiny because this is primarily a deletion-heavy refactor:

  • src/core/types.ts (6 lines) — PageType enum
  • src/core/markdown.ts (7 lines) — inferType heuristics
  • src/core/link-extraction.ts (51 lines) — v0.10.5 regex expansion, covered by existing tests
  • src/commands/migrations/v0_12_0.ts (1 line) — banner text

Test suite

Before this branch: 2253 pass / 18 fail (PGLite parallel-load flakes in dream/orphans/cycle/extract-db/brain-allowlist/multi-source-integration)

After the fix in 626aebf: 2317 pass / 0 fail at full parallelism. Root cause: bun's default 5s hook timeout wasn't enough for PGLite.create() + 20 migrations under 136-way file parallelism. bunfig.toml's timeout = 60_000 covers tests, not hooks — hooks need per-call timeout as the third arg. Six files updated with explicit 60s hook timeouts.

With the new inferType test: 2326 pass / 0 fail.

Plan Completion

Subagent audit against ~/.claude/plans/.../abstract-treehouse.md (BrainBench v1.1 Delta):

  • Phase 1 (Polish): 3/3 shipped to gbrain-evals
  • Phase 2 (Credibility): 5/5 shipped (all 3 external adapters + N=5 tolerance bands + Tier 5/5.5 queries + sealed qrels)
  • Phase 3 (Contributor explorer): 3/4 shipped; world.html explorer is a follow-up in gbrain-evals

Deferred items (3) are all explicit v1.2+/v2 punts from the original plan, not gaps in this PR.

Pre-Landing Review

Adversarial subagent produced 4 findings. Processed:

  • Scope creep (false positive): subagent diffed against stale local master (v0.18.2); actual base origin/master is v0.19.0 (skillify). Against the real base, src changes are 4 surgical files.
  • inferType /media/ before /meetings/: intentional ordering (media directory wins). Not a bug.
  • Public exports expose internals (./pglite-engine, ./extract, ./config): acknowledged trade-off; gbrain-evals depends on deep integration. CHANGELOG/docs already flag this as the v0.20 contract.
  • 60s hook timeout masks hypothetical PGLite deadlock: accepted. The macOS WASM bug (PGLite WASM crash on macOS 26.3 with Bun 1.3.11 #223) has its own error wrapping in pglite-engine.ts.

Quality score: 9.0/10. No shipping blockers.

TODOS

Marked complete in v0.20.0:

  • BrainBench Cats 5/6/8/9/11 (all shipped to gbrain-evals)
  • v0.10.5 inferLinkType residuals (shipped in-tree via regex expansion)

Remaining P1: BrainBench Cat 1+2 at full scale (2-3K pages vs current 240).

Documentation

  • CLAUDE.md — added 7 new v0.19 source modules to the Key Files list (resolver-filenames, skillify, skillify-check, skillpack, skill-manifest, routing-eval, filing-audit); added 8 new test files + openclaw-reference-compat E2E to the test index; repointed the release-summary template's benchmark source to gbrain-evals/docs/benchmarks/ since the benchmark dir moved.
  • CHANGELOG.md — voice polish on the v0.20.0 entry (em dashes → periods/parens/ellipses per project style). No content changes.
  • README, CONTRIBUTING, AGENTS.md — verified accurate against the diff.
  • Commit: 9e567bb

Test plan

  • Full unit suite: 2326 pass / 0 fail (bun test, 280s wall clock)
  • Typecheck: clean (tsc --noEmit)
  • bun run build:llms: llms.txt + llms-full.txt regenerated and committed
  • Smoke of renamed adapter imports in gbrain-evals (cat13-conceptual.ts + multi-adapter.ts both run end-to-end)

Benchmarks committed to gbrain-evals

  • 2026-04-23-brainbench-v0.20.0.md — Cats 1+2 baseline. gbrain 49.1% P@5 / 97.9% R@5. Within ±0.1 pts of v0.12.1 reference → no retrieval regression across v0.16 → v0.20.
  • 2026-04-23-brainbench-cat13-conceptual.md — Cat 13 Conceptual Recall (NEW). vector 49.1% > fusion 47.5% > gbrain 47.1% > grep-only 46.2%. The ordering flips vs Cats 1+2: vectors earn their keep on paraphrase/synonym queries. Two-axis scorecard (relational + conceptual) is now the honest way to publish adapter comparisons.

🤖 Generated with Claude Code

garrytan and others added 8 commits April 18, 2026 23:21
…ch prose

Extends inferLinkType patterns to cover rich-prose phrasings that miss with
v0.10.4 regexes. Targets the residuals called out in TODOS.md: works_at at
58% type accuracy, advises at 41%.

WORKS_AT_RE additions:
- Rank-prefixed: "senior engineer at", "staff engineer at", "principal/lead"
- Discipline-prefixed: "backend/frontend/full-stack/ML/data/security engineer at"
- Possessive time: "his/her/their/my time at"
- Leadership beyond "leads engineering": "heads up X at", "manages engineering at",
  "runs product at", "leads the [team] at"
- Role nouns: "role at", "position at", "tenure as", "stint as"
- Promotion patterns: "promoted to staff/senior/principal at"

ADVISES_RE additions:
- Advisory capacity: "in an advisory capacity", "advisory engagement/partnership/contract"
- "as an advisor": "joined as an advisor", "serves as technical advisor"
- Prefixed advisor nouns: "strategic/technical/security/product/industry advisor to|at"
- Consulting: "consults for", "consulting role at|with"

New EMPLOYEE_ROLE_RE page-level prior: fires when the page describes the subject
as an employee (senior/staff/principal engineer, director, VP, CTO/CEO/CFO) at
some company. Biases outbound company refs toward works_at when per-edge context
is possessive or narrative without an explicit work verb. Scoped to person -> company
links only. Precedence: investor > advisor > employee (investors often hold board
seats which would otherwise mis-classify as advise/works_at).

ADVISOR_ROLE_RE broadened from "full-time/professional/advises multiple" to catch
any page that self-identifies the subject as an advisor ("is an advisor",
"serves as advisor", possessive "her advisory work/role/engagement").

Tests: 65 pass (16 new v0.10.5 coverage tests + 4 regression guards against
v0.10.4 tightenings). Templated benchmark still 88.9% type_accuracy (10/10 on
works_at and advises). Rich-prose measurement requires the multi-axis report
upgrade (next commit) to validate retroactively.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
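The works_at phrasing classes listed in the commit above can be condensed into one illustrative pattern. This regex is a hypothetical subset for demonstration, not the shipped WORKS_AT_RE (which lives in src/core/link-extraction.ts and is broader):

```typescript
// Illustrative subset of the v0.10.5 works_at phrasings; NOT the shipped regex.
const WORKS_AT_SKETCH =
  /\b(?:(?:senior|staff|principal|lead|backend|frontend|full-stack|ML|data|security)\s+engineer\s+at|(?:his|her|their|my)\s+time\s+at|(?:heads\s+up|manages|runs)\s+\w+\s+at|(?:role|position)\s+at|(?:tenure|stint)\s+as|promoted\s+to\s+(?:staff|senior|principal))\b/i;

// Rich-prose phrasings the v0.10.4 patterns reportedly missed:
const hits = [
  "She is a senior engineer at Orbit Labs.",
  "During his time at Meridian Bio he shipped the pipeline.",
  "He heads up platform at Northwind.",
].map((s) => WORKS_AT_SKETCH.test(s));
```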
New Category 2 in BrainBench: per-link-type accuracy measured directly on the
240-page rich-prose world-v1 corpus. Distinct from Cat 1's retrieval metrics,
this measures whether inferLinkType() correctly classifies extracted edges
when the prose varies (the 58% works_at and 41% advises residuals that v0.10.5
regexes targeted).

How it works:
  1. Loads all pages from eval/data/world-v1/
  2. Derives GOLD expected edges from each page's _facts metadata
     (founders → founded, investors → invested_in, advisors → advises,
      employees → works_at, attendees → attended, primary_affiliation +
      role drives person-page outbound type)
  3. Runs extractPageLinks() on each page → INFERRED edges
  4. Per (from, to) pair, compares inferred type vs gold type
  5. Emits per-link-type table: correct / mistyped / missed / spurious +
     type accuracy + recall + precision + strict F1 (triple match)
  6. Full confusion matrix rows=gold, cols=inferred
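Steps 4-5 above can be sketched as a per-(from, to) comparison. Shapes here are illustrative; the real runner consumes `extractPageLinks()` output and `_facts`-derived gold:

```typescript
// Sketch of steps 4-5: compare inferred vs gold edge types per (from, to) pair.
type Edge = { from: string; to: string; type: string };

function compareEdges(gold: Edge[], inferred: Edge[]) {
  const key = (e: Edge) => `${e.from}→${e.to}`;
  const goldMap = new Map(gold.map((e): [string, string] => [key(e), e.type]));
  const infMap = new Map(inferred.map((e): [string, string] => [key(e), e.type]));
  let correct = 0, mistyped = 0, missed = 0, spurious = 0;
  for (const [k, goldType] of goldMap) {
    const inf = infMap.get(k);
    if (inf === undefined) missed++;       // gold edge never extracted
    else if (inf === goldType) correct++;  // right pair, right type
    else mistyped++;                       // right pair, wrong type
  }
  for (const k of infMap.keys()) if (!goldMap.has(k)) spurious++;
  // "Type accuracy" is conditional on the edge being found at all.
  const typeAccuracy = correct / Math.max(1, correct + mistyped);
  return { correct, mistyped, missed, spurious, typeAccuracy };
}
```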

v0.10.5 validation on 240-page corpus (up from pre-v0.10.5 baselines):
  - works_at:    58%  → 100.0%   (+42 pts) — 10/10 correct, 0 mistyped
  - advises:     41%  → 88.2%    (+47 pts) — 15/17 correct
  - attended:    —    → 100.0%   131/134 recall
  - founded:    100%  → 100.0%   40/40
  - invested_in: 89%  → 92.0%    69/75
  - Overall:    88.5% → 95.7%    type accuracy (conditional on edge found)

Strict F1 overall: 53.7%. Lower because the _facts-based gold set only
captures core relationships; rich prose extracts many peripheral mentions
(190 spurious "mentions" edges) that aren't bugs but are correctly-typed
prose references without a _facts counterpart. Spurious counts are signal
for future type-precision tuning, not failure.

Wired into eval/runner/all.ts as Cat 2 so every full benchmark run includes
the rich-prose type accuracy table alongside retrieval metrics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 2 credibility unlock: BrainBench now compares gbrain to external
baselines on the same corpus and queries. Transforms the benchmark from
internal ablation ("gbrain-graph beats gbrain-grep") to category comparison
("gbrain-graph beats classic BM25 by 32 pts P@5"). This is the #1 fix
from the 4-review arc — addresses Codex's core critique that v1's
before/after was self-referential.

Added:
  eval/runner/types.ts                      — Adapter interface (v1.1 spec)
  eval/runner/adapters/ripgrep-bm25.ts      — EXT-1 classic IR baseline
  eval/runner/adapters/ripgrep-bm25.test.ts — 11 unit tests, all pass
  eval/runner/multi-adapter.ts              — side-by-side scorer

Adapter interface (eng pass 2 spec):
  - Thin 3-method Strategy: init(rawPages, config), query(q, state), snapshot(state)
  - BrainState is opaque to runner (never inspected)
  - Raw pages passed in-memory; gold/ never crosses adapter boundary
    (structural ingestion-boundary enforcement)
  - PoisonDisposition enum reserved for future poison-resistance scoring
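The 3-method Strategy above, sketched with illustrative shapes (simplified to synchronous; the real interface lives in eval/runner/types.ts, and `teardown?` was added in a later commit for engine-backed adapters):

```typescript
// Shapes are illustrative and simplified; not the shipped interface.
interface PublicPage { slug: string; title: string; body: string }

interface Adapter<State> {
  init(rawPages: PublicPage[], config: Record<string, unknown>): State;
  query(q: { text: string }, state: State): string[]; // ranked page slugs
  snapshot(state: State): unknown; // opaque BrainState; the runner never inspects it
  teardown?(state: State): void;   // optional; for adapters holding DB resources
}

// Trivial substring adapter showing the contract: only raw pages cross
// this boundary — gold/ never does (ingestion-boundary enforcement).
const substringAdapter: Adapter<PublicPage[]> = {
  init: (pages) => pages,
  query: (q, pages) =>
    pages
      .filter((p) => p.body.toLowerCase().includes(q.text.toLowerCase()))
      .map((p) => p.slug),
  snapshot: (pages) => ({ pageCount: pages.length }),
};
```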

EXT-1 ripgrep+BM25:
  - Classic Lucene-variant IDF + k1/b tuned at standard 1.5/0.75
  - Title tokens double-weighted for entity-page slug-match bias
  - Stopword filter, alphanumeric tokenization, stable lexicographic tie-break
  - Pure in-memory inverted index — no external deps, ~100 LOC core

First side-by-side results on 240-page rich-prose corpus, 145 relational queries:

| Adapter       | P@5    | R@5    | Correct top-5 |
|---------------|--------|--------|---------------|
| gbrain-after  | 49.1%  | 97.9%  | 248/261       |
| ripgrep-bm25  | 17.1%  | 62.4%  | 124/261       |
| Delta         | +32.0  | +35.5  | +124          |

gbrain-after is the hybrid graph+grep config from PR #188. Ripgrep+BM25 is
a genuinely strong classic-IR baseline (BM25 is what Lucene/Elasticsearch
ship). gbrain's ~+32-point lead on relational queries reflects real work
by the knowledge graph layer: typed links + traversePaths surface the
correct answers in top-K that BM25 only pulls in via partial-text overlap.

Next in Phase 2: EXT-2 vector-only RAG + EXT-3 hybrid-without-graph
adapters. Both plug into the same Adapter interface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Second external baseline for BrainBench. Pure cosine-similarity ranking
using the SAME text-embedding-3-large model gbrain uses internally —
apples-to-apples on the embedding layer so any gbrain lead reflects the
graph + hybrid fusion, not a better embedder.

Files:
  eval/runner/adapters/vector-only.ts      ~130 LOC
  eval/runner/adapters/vector-only.test.ts 6 unit tests (cosine math)

Design:
  - One vector per page (title + compiled_truth + timeline, capped 8K chars).
  - No chunking (intentional; chunked vector RAG would be EXT-2b later).
  - No keyword fallback (that's EXT-3 hybrid-without-graph).
  - Embeddings in batches of 50 via existing src/core/embedding.ts (retry+backoff).
  - Cost on 240 pages: ~$0.02/run.
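The ranking math for this adapter is plain cosine similarity over one vector per page. A self-contained sketch (the embeddings themselves come from src/core/embedding.ts in the real adapter):

```typescript
// Cosine-similarity ranking: one embedding vector per page, query ranked
// against all of them. Vector shapes here are toy examples.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1); // guard zero vectors
}

function rankByCosine(queryVec: number[], pages: { slug: string; vec: number[] }[]) {
  return pages
    .map((p) => ({ slug: p.slug, score: cosine(queryVec, p.vec) }))
    .sort((a, b) => b.score - a.score);
}
```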

Three-adapter side-by-side on 240-page rich-prose corpus, 145 relational queries:

| Adapter       | P@5    | R@5    | Correct top-5 |
|---------------|--------|--------|---------------|
| gbrain-after  | 49.1%  | 97.9%  | 248/261       |
| ripgrep-bm25  | 17.1%  | 62.4%  | 124/261       |
| vector-only   | 10.8%  | 40.7%  |  78/261       |

Interesting finding: vector-only scores WORSE than BM25 on relational queries
like "Who invested in X?" — exact entity match matters more than semantic
similarity for these templates. BM25 nails the entity-name term; vector-only
returns topically-similar-but-not-mentioning pages. This is the known failure
mode of pure-vector RAG on precise relational/identity queries. Real-world
vector RAG systems always add keyword fallback; EXT-3 (hybrid-without-graph)
will be that fairer comparator.

gbrain's lead widens in vector-only comparison: +38.4 pts P@5, +57.2 pts R@5.
The graph layer is doing the heavy lifting for relational traversal; pure
vector RAG can't express "traverse 'attended' edges from this meeting page."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Third and closest-to-gbrain external baseline. Runs gbrain's full hybrid
search (vector + keyword + RRF fusion + dedup) WITHOUT the knowledge-graph
layer. Same engine, same embedder, same chunking, same hybrid fusion —
only traversePaths + typed-link extraction turned off.

This is the decisive comparator for "does the knowledge graph do useful
work?" Same everything-else, only graph differs. Any lead gbrain-after has
over EXT-3 is 100% attributable to the graph layer.

Files:
  eval/runner/adapters/hybrid-nograph.ts   — ~110 LOC

Implementation:
  - New PGLiteEngine per run; auto_link set to 'false' (belt).
  - importFromContent() used instead of bare putPage() so chunks +
    embeddings get populated (hybridSearch needs them).
  - NO runExtract() call — typed links/timeline stay empty (suspenders).
  - hybridSearch(engine, q.text) answers every query. Aggregate chunks
    to page-level by best chunk score.

FOUR-adapter side-by-side on 240-page rich-prose corpus, 145 relational queries:

| Adapter         | P@5    | R@5    | Correct/Gold |
|-----------------|--------|--------|--------------|
| gbrain-after    | 49.1%  | 97.9%  | 248/261      |
| hybrid-nograph  | 17.8%  | 65.1%  | 129/261      |
| ripgrep-bm25    | 17.1%  | 62.4%  | 124/261      |
| vector-only     | 10.8%  | 40.7%  |  78/261      |

The headline delta nobody can hand-wave away:
  gbrain-after → hybrid-nograph  = +31.4 P@5, +32.9 R@5
  hybrid-nograph → ripgrep-bm25  = +0.7 P@5,  +2.7 R@5

Hybrid search (vector+keyword+RRF) over pure BM25 gains ~1 point. The
knowledge graph layer over hybrid gains ~31 points. The graph is doing
the work; adding it to a retrieval stack is what actually moves the needle
on relational queries. The vector/keyword/BM25 debate is a footnote.

Timing: hybrid-nograph init is ~2 min (embeds 240 pages once); query loop
is fast. gbrain-after is ~1.5s total because traversePaths doesn't need
embeddings. Runs at ~$0.02 Opus-equivalent in embedding cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ic + N=5 tolerance bands

Closes multiple Phase 2 items in one commit since they form a cohesive
package: query schema enforcement + new query tiers + per-query-set
statistical rigor.

Added:
  eval/runner/queries/validator.ts               — hand-rolled Query schema validator
  eval/runner/queries/validator.test.ts          — 24 unit tests, all pass
  eval/runner/queries/tier5-fuzzy.ts             — 30 hand-authored Tier 5 Fuzzy/Vibe queries
  eval/runner/queries/tier5_5-synthetic.ts       — 50 SYNTHETIC-labeled outsider-style queries (author: "synthetic-outsider-v1")
  eval/runner/queries/index.ts                   — aggregator + validateAll()

Modified:
  eval/runner/multi-adapter.ts                   — N=5 runs per adapter (BRAINBENCH_N override), page-order shuffle, mean±stddev reporting

Query validator (hand-rolled, no zod dep to match gbrain codebase style):
  - Temporal verb regex enforces as_of_date (per eng pass 2 spec):
    /\\b(is|was|were|current|now|at the time|during|as of|when did)\\b/i
  - Validates tier enum, expected_output_type enum, gold shape per type
  - gold.relevant must be non-empty slug[] for cited-source-pages queries
  - abstention requires gold.expected_abstention === true
  - externally-authored tier requires author field
  - batch validation catches duplicate IDs
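The temporal-verb rule above, sketched as a standalone check (query shape simplified; the shipped validator in eval/runner/queries/validator.ts enforces the other rules the same way):

```typescript
// Queries whose text matches the temporal regex must carry an as_of_date.
const TEMPORAL_RE = /\b(is|was|were|current|now|at the time|during|as of|when did)\b/i;

function validateTemporal(q: { text: string; as_of_date?: string }): string[] {
  const errors: string[] = [];
  if (TEMPORAL_RE.test(q.text) && !q.as_of_date) {
    errors.push(`query "${q.text}" uses a temporal verb but has no as_of_date`);
  }
  return errors;
}
```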

Tier 5 Fuzzy/Vibe (30 queries, hand-authored):
  - Vague recall: "Someone who was a senior engineer at a biotech company..."
  - Trait-based: "The engineer who pushed back on microservices"
  - Cultural/epithet: "Who is known as a 'systems builder' in security?"
  - Abstention bait: "Which Layer 1 project did the crypto guy leave?" (prose
    mentions but never names; good systems abstain)
  - Addresses Codex's circularity critique — vague queries where graph-heavy
    systems shouldn't inherently win.

Tier 5.5 Synthetic Outsider (50 queries, AI-authored placeholder):
  - Clearly labeled author: "synthetic-outsider-v1"
  - Phrasing variety not in the 4 template families:
    * fragment style ("crypto founder Goldman Sachs background")
    * polite/natural ("Can you pull up what we have on...")
    * comparison ("What is the difference between X and Y?")
    * follow-up ("And who else advises Orbit Labs?")
    * typos/misspellings ("adam lopez bioinformatcis")
    * similarity ("Find me someone like Alice Davis...")
    * imperative ("Pull up Alice Davis")
  - Real Tier 5.5 from outside researchers supersedes synthetic via
    PRs to eval/external-authors/ (docs ship in follow-up commit).

N=5 tolerance bands:
  - Default N=5, override via BRAINBENCH_N env var (e.g. BRAINBENCH_N=1 for dev loops)
  - Per-run seeded Fisher-Yates shuffle of page ingest order (LCG seed = run_idx+1)
  - Surfaces order-dependent adapter bugs (tie-break-by-first-seen etc.)
  - Reports mean ± sample-stddev per metric
  - "stddev = 0" is honest signal that the adapter is deterministic, not a bug.
    LLM-judge metrics (future) will naturally produce non-zero stddev.
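The N-run machinery above can be sketched in three small pieces: an LCG, a seeded Fisher-Yates shuffle of ingest order, and mean ± sample-stddev reporting. LCG constants are a common textbook choice, not necessarily the ones the runner uses:

```typescript
// Seeded LCG (Numerical Recipes constants; illustrative choice).
function lcg(seed: number): () => number {
  let s = seed >>> 0;
  return () => {
    s = (Math.imul(s, 1664525) + 1013904223) >>> 0;
    return s / 4294967296;
  };
}

// Fisher-Yates shuffle driven by the seeded LCG: same seed → same order.
function shuffled<T>(items: T[], seed: number): T[] {
  const rand = lcg(seed);
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

// Mean ± sample stddev (n-1 denominator) over per-run metric values.
function meanStddev(xs: number[]): { mean: number; stddev: number } {
  const mean = xs.reduce((a, b) => a + b, 0) / xs.length;
  const variance =
    xs.length > 1 ? xs.reduce((a, x) => a + (x - mean) ** 2, 0) / (xs.length - 1) : 0;
  return { mean, stddev: Math.sqrt(variance) };
}
```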

Validation: all 80 Tier 5 + 5.5 queries pass validateAll(). 24 validator
unit tests pass.

Next commit: world.html contributor explorer (Phase 3).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor DX magical moment. Static HTML explorer renders the full
canonical world (240 entities) as an explorable tree, opens in any browser,
zero install. Every string HTML-entity-encoded (XSS-safe — direct vuln
class per eng pass 2, confidence 9/10).

Added:
  eval/generators/world-html.ts         — renderer (~240 LOC; single-file
                                          HTML with inline CSS + minimal JS)
  eval/generators/world-html.test.ts    — 16 tests (XSS + rendering correctness)
  eval/cli/world-view.ts                — render + open in default browser
  eval/cli/query-validate.ts            — CLI wrapper for queries/validator
  eval/cli/query-new.ts                 — scaffold a query template

Modified:
  package.json                          — 7 new eval:* scripts
  .gitignore                            — ignore generated world.html

package.json scripts shipped:
  bun run test:eval                 all eval unit tests (57 pass)
  bun run eval:run                  full 4-adapter N=5 side-by-side
  bun run eval:run:dev              N=1 fast dev iteration
  bun run eval:world:view           render world.html + open in browser
  bun run eval:world:render         render only (CI-friendly, --no-open)
  bun run eval:query:validate       validate built-in T5+T5.5 (or a file path)
  bun run eval:query:new            scaffold a new Query JSON template
  bun run eval:type-accuracy        per-link-type accuracy report

XSS safety:
  escapeHtml() encodes the 5 critical chars (& < > " '). Tested directly
  with representative Opus-generated attacks:
    <img src=x onerror=alert('xss')>  → &lt;img src=x onerror=alert(&#39;xss&#39;)&gt;
    <script>fetch('/steal')</script>  → &lt;script&gt;fetch(&#39;/steal&#39;)&lt;/script&gt;
  Ledger metadata (generated_at, model) also escaped — covers the less
  obvious attack surface where Opus could emit tag-like content into the
  metadata file.
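The encoder is small enough to show in full. A sketch of the 5-char escaper described above (ordering matters: & must be replaced first so the entities the other replacements emit aren't double-encoded):

```typescript
// Encode the 5 HTML-critical characters; & first to avoid double-encoding.
function escapeHtml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}
```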

world.html structure:
  - Left rail: entities grouped by type with counts (companies, people,
    meetings, concepts), alphabetical within type
  - Right pane: per-entity cards with title + slug + compiled_truth +
    timeline + canonical _facts as collapsed JSON
  - URL fragment deep-links (#people/alice-chen)
  - Sticky rail on desktop; responsive stack on mobile
  - Vanilla JS for active-link highlighting on scroll (no framework)

Generated file: ~1MB for 240 entities (full prose). Gitignored; rebuild
with `bun run eval:world:view`. Regeneration is ~50ms.

Contributor TTHW (Tier 5.5 query authoring):
  1. bun run eval:world:view                         # see entities
  2. bun run eval:query:new --tier externally-authored --author "@me"
  3. edit template with real slug + query text
  4. bun run eval:query:validate path/to/file.json
  5. submit PR

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ships the contributor-onboarding surface promised in the plan. With this
commit, external researchers have a self-serve path from clone to PR in
under 5 minutes.

Added:
  eval/README.md                                — 5-minute quickstart,
                                                  directory map, methodology
                                                  one-pager, adapter scorecard
  eval/CONTRIBUTING.md                          — three contributor paths:
                                                    1. Write Tier 5.5 queries
                                                    2. Submit an external adapter
                                                    3. Reproduce a scorecard
  eval/RUNBOOK.md                               — operational troubleshooting:
                                                  generation failures, runner
                                                  failures, query validation,
                                                  world.html rendering, CI
  eval/CREDITS.md                               — contributor attribution
                                                  (synthetic-outsider-v1 labeled
                                                  as placeholder; real submissions
                                                  land here)
  .github/PULL_REQUEST_TEMPLATE/tier5-queries.md — structured PR template
                                                  for Tier 5.5 submissions
  .github/workflows/eval-tests.yml              — CI: validates queries,
                                                  runs all eval unit tests,
                                                  renders world.html on every PR
                                                  touching eval/** or
                                                  src/core/link-extraction.ts

CI scope (intentionally narrow):
  - Triggers on paths: eval/**, src/core/link-extraction.ts, src/core/search/**
  - Runs: bun run eval:query:validate (80 queries), test:eval (57 tests),
          eval:world:render (smoke-test the HTML renderer)
  - Pinned actions by commit SHA (matches existing .github/workflows/test.yml)
  - Zero API calls — all Opus/OpenAI paths stubbed or skipped in unit tests
  - Fast: ~30s total wall clock

Contributor TTHW (clone → first merged PR):
  - Path 1 (Tier 5.5 queries): ~5 min
  - Path 2 (external adapter): ~30 min for a simple adapter
  - Path 3 (reproduce scorecard): ~15 min wall clock (N=5 run)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan garrytan changed the title BrainBench v1.1: v0.10.5 extraction + Phase 2 external baselines (gbrain beats all 3 by 32 pts) BrainBench v1.1: extraction fixes + 3 external baselines + N=5 + Tier 5/5.5 + world.html + contributor docs (all 3 phases) Apr 18, 2026
garrytan and others added 9 commits April 19, 2026 08:34
The multi-adapter runner left PGLite engines alive after each run.
GbrainAfterAdapter and HybridNoGraphAdapter both instantiate a
PGLiteEngine in init() but never disconnect it; Bun's shutdown path
exits with code 99 when embedded-Postgres workers outlive main().

Added optional `teardown?(state)` to the Adapter interface, implemented
it on both engine-backed adapters, and call it from scoreOneRun after
the N=5 loop. ripgrep-bm25 and vector-only hold no DB resources and
don't need a teardown.

Verified: gbrain-after, hybrid-nograph, ripgrep-bm25, vector-only all
exit 0 at N=1. Full test:eval passes (57 tests). No metric change.
Reproducibility run of the 4-adapter side-by-side at commit b81373d
(branch garrytan/gbrain-evals). N=5, 240-page corpus, 145 relational
queries from world-v1.

Headline: gbrain-after 49.1% P@5 / 97.9% R@5. hybrid-nograph 17.8% /
65.1%. ripgrep-bm25 17.1% / 62.4%. vector-only 10.8% / 40.7%. All
adapters deterministic (stddev = 0 across the 5 runs per adapter).

Matches the scorecard in eval/README.md byte-for-byte for the three
deterministic adapters; hybrid-nograph matches within tolerance bands.
Runs the same eval harness against two gbrain src/ trees on the same
240-page corpus and 145 queries. Patches the v0.11 copy's gbrain-after
adapter to use getLinks/getBacklinks (v0.11 has no traversePaths)
with identical direction+linkType semantics.

gbrain-after P@5 22.1% -> 49.1% (+27 pts); R@5 54.6% -> 97.9% (+43
pts); correct-in-top-5 99 -> 248 (+149). hybrid-nograph flat at 17.8%
/ 65.1% on both (v0.12 didn't touch hybridSearch / chunking).

Driver is extraction quality, not graph presence: v0.12 emits 499
typed links (v0.11: 136, x3.7) and 2,208 timeline entries (v0.11: 27,
x82) on the same 240 pages. Sharpens the April-18 "graph layer does
the work" claim -- on v0.11 that architecture only beat hybrid-nograph
by 4.3 points; the 31-point lead in the multi-adapter scorecard comes
from graph + high-quality extract in combination.
Adds the v1→v2 contract boundary for BrainBench. 6 JSON schemas at
eval/schemas/ pin the shape of every artifact a stack must emit to be
scorable: corpus-manifest, public-probe (PublicQuery with gold stripped),
tool-schema (12 read + 3 dry_run tools, 32K tool-output cap), transcript,
scorecard (N ∈ {1, 5, 10}), evidence-contract (structured judge input).

8 gold file templates at eval/data/gold/ scaffold the sealed qrels,
contradictions, poison items, and citation labels. Empty-but-valid
skeletons; Day 3b fills them with real content once the amara-life-v1
corpus generates.

48 tests validate schema syntax, $schema/$id/title/type headers,
round-trip stability, and cross-schema coherence (new Page types in
manifest enum, tool counts, token cap, N enum).

When v2 ports to Python + Inspect AI + Docker, these schemas are the
boundary. Same fixtures, same tool contracts, zero rework.
…al/note

Deterministic procedural generator for the twin-amara-lite fictional-life
corpus (BrainBench v1 Cat 5/8/9/11 target). 15 contacts picked from
world-v1, 50 emails + 300 Slack messages across 4 channels + 20 calendar
events + 8 meeting transcripts + 40 first-person notes. Mulberry32 PRNG
gives byte-identical output under reseed.
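Mulberry32 is a standard public-domain 32-bit PRNG; the sketch below shows why reseeding gives byte-identical output (same seed, same sequence):

```typescript
// Mulberry32: deterministic 32-bit PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}
```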

Plants 10 contradictions + 5 stale facts + 5 poison items + 3 implicit
preferences at deterministic positions. Fixture_ids are unique across the
corpus so gold/contradictions.json + gold/poison.json + gold/implicit-
preferences.json can cross-reference by stable ID.

PageType extended in both src/core/types.ts and eval/runner/types.ts to
include email | slack | calendar-event | note (+ meeting on the production
side). src/core/markdown.ts inferType() heuristics updated for the new
one-slash slug prefixes (emails/em-NNNN, slack/sl-NNNN, cal/evt-NNNN,
notes/YYYY-MM-DD-topic, meeting/mtg-NNNN).

17 tests cover counts (50/300/20/8/40), perturbation counts (exact
10/5/5/3), seed determinism + divergence, slug regex conformance (matches
eval/runner/queries/validator.ts:131 one-slash rule), unique fixture_ids,
amara-in-every-email invariant, calendar dtstart < dtend, and Amara-is-
attendee on every meeting.
Opus prose expansion of the amara-life-v1 skeleton. Per-item structured
cache key = sha256({schema_version, template_id, template_hash, model_id,
model_params, seed, item_spec_hash}). Prompt-template tweak changes
template_hash; only those items regenerate. Schema bump changes
schema_version; everything invalidates cleanly. Interrupted runs resume
from the last cached item; zero re-spend.
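The cache-key construction can be sketched with node:crypto. Field names mirror the list above; canonicalization here is top-level key sorting only (an assumption — the real generator may canonicalize recursively):

```typescript
import { createHash } from "node:crypto";

// Canonical (sorted-key) JSON so object-key order never changes the key.
function canonicalJson(obj: Record<string, unknown>): string {
  return JSON.stringify(
    Object.fromEntries(Object.entries(obj).sort(([a], [b]) => (a < b ? -1 : 1)))
  );
}

function cacheKey(spec: {
  schema_version: number;
  template_id: string;
  template_hash: string;
  model_id: string;
  model_params: Record<string, unknown>;
  seed: number;
  item_spec_hash: string;
}): string {
  // Any field change (template_hash, schema_version, …) yields a new key,
  // which is what drives selective regeneration and clean invalidation.
  return createHash("sha256").update(canonicalJson(spec)).digest("hex");
}
```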

Cost-gated at $20 hard-stop with Anthropic input/output pricing tracking.
Dry-run mode (--dry-run) executes the full pipeline with stub bodies for
smoke-testing the I/O layout without LLM spend. --max N caps items per
type for debugging. --force ignores cache.

Writes per-format outputs under eval/data/amara-life-v1/:
  inbox/emails.jsonl (one email per line with body_text appended)
  slack/messages.jsonl (one message per line with text appended)
  calendar.ics (RFC-5545 VEVENT format, templated — no LLM)
  meetings/<id>.md (transcript with YAML frontmatter)
  notes/<YYYY-MM-DD-topic>.md (first-person journal)
  docs/*.md (6 reference docs, templated — no LLM)
  corpus-manifest.json (per eval/schemas/corpus-manifest.schema.json,
    including per-item content_sha256 and generator_cache_key)

Perturbation hints (contradiction, stale-fact, poison, implicit-
preference) flow through the prompt so Opus weaves the specific claim
into each item's body. Poison items are hand-crafted to include
paraphrased prompt-injection attempts (not literal 'IGNORE ALL
PREVIOUS' — defense is the structured-evidence judge contract at
Day 5, not regex redaction).

New package.json scripts:
  eval:generate-amara-life       # real run (~$12 Opus estimated)
  eval:generate-amara-life:dry   # smoke test, zero spend

test:eval extended to include test/eval/. 10 cache-key tests cover
determinism, invalidation across every field of the key, canonical JSON
stability under object-key reorder, and per-skeleton-item spec-hash
uniqueness (50 distinct hashes for 50 distinct emails).
Resets package.json from stale 0.13.1 to 0.15.0 (matches VERSION).
v0.14.0 shipped with the stale package.json version; this sync catches
that up and moves to v0.15.0 in one step.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@garrytan garrytan changed the title BrainBench v1.1: extraction fixes + 3 external baselines + N=5 + Tier 5/5.5 + world.html + contributor docs (all 3 phases) feat: BrainBench v1 — 4-adapter benchmark + portable schemas (v0.15.0) Apr 20, 2026
garrytan and others added 11 commits April 20, 2026 21:36
CLAUDE.md: adds a full BrainBench section to the Key Files list — 14 new
entries covering eval/README.md, multi-adapter.ts, types.ts (with new
PublicPage/PublicQuery), adapters/, queries/, type-accuracy.ts,
adversarial.ts, all.ts, world.ts/gen.ts, world-html.ts, amara-life.ts,
amara-life-gen.ts, schemas/, data/world-v1/, data/gold/,
data/amara-life-v1/, docs/benchmarks/, and test/eval/. Adds 3 new
test/eval/ lines to the unit-tests catalog.

eval/README.md: file tree updated to reflect v0.15 additions —
data/amara-life-v1/, data/gold/, schemas/, generators/amara-life.ts +
amara-life-gen.ts, runner/all.ts + adversarial.ts.

README.md: updates hero benchmark numbers (L7 intro + L353 mid-page)
from v0.10.5 PR #188 numbers (R@5 83→95, P@5 39→45) to current v0.12.1
4-adapter numbers (P@5 49.1% · R@5 97.9% · +31.4 pts vs hybrid-nograph).
Adds the v0.11→v0.12 regression comparison as the secondary reference.
Deeper-section tables (L422+) labeled "BrainBench v1 (PR #188)" are
preserved as historical data.

CHANGELOG is untouched — /ship already wrote the v0.15.0 entry.
TODOS.md is untouched — Cat 5/6/8/9/11 remain open (only foundations
shipped in v0.15.0; Cat runners ship in v1 Complete follow-ups).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…n + expand:false)

Three infrastructure modules for BrainBench v1 Complete Cats 5/8/9/11.

**eval/runner/loaders/pdf.ts** — Thin pdf-parse wrapper. Lazy import keeps
pdf-parse out of the module-load path (avoids library debug-mode side
effects). Size cap (50MB default), encryption detection, structured error
classes (PdfEncryptedError, PdfTooLargeError, PdfParseError). Only Cat 11
multimodal will import this; production bundle never sees pdf-parse.

**eval/runner/tool-bridge.ts** — Maps 12 read-only operations from
src/core/operations.ts to Anthropic tool definitions + adds 3 dry_run write
tools. Three structural invariants enforced:

  1. No hidden LLM calls. `operations.query` defaults expand=true which
     routes through expansion.ts → Haiku. Bridge strips `expand` from the
     query tool's input schema AND executor hard-sets expand:false. Zero
     nested Haiku calls in any agent trace.

  2. Mutating ops throw ForbiddenOpError. put_page, add_link, delete_page,
     etc. are rejected by name. Agents record intent via dry_run_put_page /
     dry_run_add_link / dry_run_add_timeline_entry which persist to the
     flight-recorder without mutating the engine. This is how Cat 8's
     back_link_compliance + citation_format metrics measure anything with
     a read-only tool surface.

  3. Poison tagged by the bridge, not the judge. Every tool result is
     scanned for slugs matching gold/poison.json fixtures. Matched
     fixture_ids flow into tool_call_summary.saw_poison_items for the
     structured-evidence judge contract. Judge never reads raw tool
     output — Section-3 defense against paraphrased prompt injections
     (poison payloads never reach the judge model at all).

Tool results are capped at 32K tokens (~128K chars) with an "…[truncated]" suffix.
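Invariant 1 plus the truncation cap can be sketched as follows (shapes and helper names are assumptions; the real bridge wraps src/core/operations.ts):

```typescript
// Hypothetical tool-schema shape for the sketch.
type ToolInputSchema = { type: "object"; properties: Record<string, unknown> };

// The query tool's schema never advertises `expand`...
function stripExpand(schema: ToolInputSchema): ToolInputSchema {
  const { expand: _dropped, ...properties } = schema.properties;
  return { ...schema, properties };
}

// ...and the executor overrides it even if a model sends it anyway,
// so no tool call can route through the Haiku expansion path.
function executeQuery(
  run: (input: Record<string, unknown>) => string,
  input: Record<string, unknown>,
): string {
  const result = run({ ...input, expand: false }); // hard-set
  return truncate(result, 128_000); // ~32K tokens
}

function truncate(text: string, maxChars: number): string {
  return text.length <= maxChars ? text : text.slice(0, maxChars) + "…[truncated]";
}
```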

**eval/runner/recorder.ts** — Per-run flight-recorder bundle emitter. Full
6-artifact bundle (transcript.md, brain-export.json, entity-graph.json,
citations.json, scorecard.json, judge-notes.md) when the adapter provides
an AdapterExport; 3-artifact fallback (transcript + scorecard +
judge-notes) otherwise. Atomic writes via tmp+rename. Collision-safe:
duplicate directory names get incremental -2, -3 suffixes. `safeStringify`
handles circular references without throwing and JSON-serializes
Float32Array embeddings.
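A minimal sketch of what that `safeStringify` could look like (internals assumed; shared non-cyclic references also collapse to "[Circular]" in this version):

```typescript
// Survives cycles via a WeakSet and serializes Float32Array embeddings
// as plain number arrays instead of index-keyed objects.
function safeStringify(value: unknown, space = 2): string {
  const seen = new WeakSet<object>();
  return JSON.stringify(
    value,
    (_key, v) => {
      if (v instanceof Float32Array) return Array.from(v);
      if (typeof v === "object" && v !== null) {
        if (seen.has(v)) return "[Circular]"; // cycle (or repeated ref)
        seen.add(v);
      }
      return v;
    },
    space,
  );
}
```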

**package.json:** adds pdf-parse@2.4.5 as a devDependency. Scoped to eval/
use only; production gbrain binary unaffected.

**Tests:** 63 new — 30 tool-bridge, 21 recorder, 12 pdf-loader. All pass.
Fake engine uses a Proxy with `__default__` fallback so poison-matching
tests don't have to mock the exact engine method name that each operation
calls (some route via searchKeyword, others via getPage — proxy handles
both uniformly).

Total eval suite now: 132 pass, 0 fail, 923 expect() calls.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ntract

Two modules that together wire Cat 8 / Cat 9 / Cat 5 end-to-end scoring.

**eval/runner/judge.ts** — Haiku 4.5 via tool-use `score_answer`. Input is
the structured JudgeEvidence contract (fix #16 from the plan's codex
review): probe + final_answer_text + evidence_refs + tool_call_summary +
ground_truth_pages + rubric. Raw tool output NEVER reaches the judge —
that's the Section-3 defense against paraphrased prompt-injection payloads
in gold/poison.json.

Retry policy: one retry on malformed tool_use response. If the second
attempt is still malformed, score the probe as `judge_failed` (all scores
0, verdict=fail) so the run still completes.

Aggregation: weighted mean across rubric criteria. Canonical thresholds
(pass ≥3.5, partial ≥2.5 and <3.5, fail <2.5). The judge can propose a
verdict, but the computed verdict from the weighted mean is what the
scorecard records. This prevents the model from inflating or deflating
its own verdict.

Score values are clamped to 0-5 on parse even if the model returns out of
range. `assertNoRawToolOutput(evidence)` is a regression guard that
returns the list of forbidden fields (tool_result, raw_transcript, etc.)
if any leak into the evidence contract.
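The clamp-then-aggregate logic can be sketched as (rubric shape hypothetical; thresholds as stated above):

```typescript
type Criterion = { score: number; weight: number };

// Scores are clamped to 0-5 even if the model returns out of range.
const clamp = (score: number): number => Math.min(5, Math.max(0, score));

// Weighted mean over criteria; the computed verdict is what the
// scorecard records, regardless of what the judge model proposed.
function computeVerdict(criteria: Criterion[]): { mean: number; verdict: string } {
  const totalWeight = criteria.reduce((sum, c) => sum + c.weight, 0);
  const mean =
    criteria.reduce((sum, c) => sum + clamp(c.score) * c.weight, 0) / totalWeight;
  const verdict = mean >= 3.5 ? "pass" : mean < 2.5 ? "fail" : "partial";
  return { mean, verdict };
}
```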

**eval/runner/adapters/claude-sonnet-with-tools.ts** — The agent adapter.
Implements `Adapter` interface minimally: `init()` spins up PGLite and
seeds it, `query()` throws because the adapter is Cat 8/9-only and emits
a final-answer text, not a RankedDoc[]. Retrieval scorecard stays at 4
adapters.

`runAgentLoop(probeId, text, state, config)` drives the multi-turn loop:
Sonnet → tool_use → tool-bridge.executeTool → tool_result → back to
Sonnet. Turn cap 10. max_tokens 1024. System prompt (brain-first iron
law, citation format, amara context) is cached via cache_control.
Exponential backoff on rate-limit errors (1s, 2s, 4s).

Emits a `Transcript` per eval/schemas/transcript.schema.json — consumed
directly by recorder.ts for the flight-recorder bundle.

`brain_first_ordering` classifies Cat 8's flagship metric: did the agent
call search/get_page BEFORE producing the final answer? The `no_brain_calls`
case (agent answers from general knowledge without ever hitting the brain)
is the compliance failure to surface.

ForbiddenOpError + UnknownToolError from the bridge are caught in the
agent loop and surfaced as tool_result with is_error=true — keeps the
loop going and preserves full audit trail for the judge.

**Tests (35 new):** judge (23) — happy path, retry, fallback, evidence
contract sanitization, rendered prompt does not contain raw tool_result
text, verdict thresholds, score clamping, weighted mean with mixed
weights, parseToolUse rejects malformed input. agent-adapter (12) —
Adapter.query() throws, init() seeds PGLite, end-to-end tool loop with
stubbed Sonnet, turn cap exhaustion, mutating-op rejection surfaces as
tool_result error, extractSlugs regex.

All 12 agent tests take ~23s because PGLite runs 13 schema migrations per
test; sharing one engine across tests was considered and rejected in
favor of full per-test isolation.

Total eval suite now: 167 pass, 0 fail.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…11 multi-modal

Three modules that together cover BrainBench v1 Cat 6 (prose-scale
extraction fidelity) and Cat 11 (multi-modal ingest fidelity).

**eval/runner/adversarial-injections.ts** — 6 deterministic content
transforms shared by Cat 10 (adversarial.ts, 22 hand-crafted cases) and
Cat 6 (prose-scale variants). Each injection produces a modified content
string + a structured GoldDelta describing what the extractor MUST and
MUST NOT produce. Kinds:
  - code_fence_leak — fake [X](people/fake) inside ``` fence, must NOT extract
  - inline_code_slug — `people/fake` in backticks, must NOT extract
  - substring_collision — "SamAI" near real `people/sam`, exactly one link
  - ambiguous_role — "works with" vs "works at", downgrade type to mentions
  - prose_only_mention — strip markdown link syntax, bare name → mentions only
  - multi_entity_sentence — pack 4+ entities into one clause, extract all

Mulberry32 PRNG keeps variant generation deterministic under fixed seed.
Codex flagged the original plan's wording ("extract injection engine from
adversarial.ts") as overstated — adversarial.ts is a static case list,
not a reusable engine. This module is NEW code.
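Mulberry32 is a standard tiny 32-bit PRNG; a self-contained TypeScript version (matching the well-known reference implementation, not necessarily this module's exact code):

```typescript
// Same seed → same variant stream; outputs are floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}
```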

**eval/runner/cat6-prose-scale.ts** — Runner. Loads world-v1, applies all
6 injection kinds to sampled base pages (default 50 variants per kind ×
6 kinds = 300 variants), runs extractPageLinks on each, compares to gold
delta. Emits per-kind + overall metrics (precision, recall, F1,
code_fence_leak_rate, substring_fp_rate, pages_with_links_coverage,
mean_links_per_page). **v1 verdict is always "baseline_only"** — no
gating threshold per codex fix #9 (current extractor residuals make
>0.80 unreachable; v1 records a baseline, and the regression guard
triggers on any drop below it).

**eval/runner/cat11-multimodal.ts** — PDF + HTML + audio runners.
Fixtures load from eval/data/multimodal/<modality>/fixtures.json
manifests; each modality skips gracefully when manifest missing or
(audio) when neither GROQ_API_KEY nor OPENAI_API_KEY is set. Metrics:
  - PDF: char-level similarity via Levenshtein + optional entity_recall
  - HTML: word-recall over normalized tokens (multiset semantics)
  - Audio: WER (word error rate) via Levenshtein on word sequences
Fixtures are NOT committed; a future eval:fetch-multimodal script will
download them hash-verified from public sources (arXiv CC-licensed
papers, Wikipedia CC-BY-SA, Common Voice CC0).
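The three metric formulas above reduce to a small amount of math; a sketch under assumed function names (`wer`, `wordRecall`):

```typescript
// Classic edit distance over arbitrary sequences (words or chars).
function levenshtein<T>(a: T[], b: T[]): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= a.length; i++)
    for (let j = 1; j <= b.length; j++)
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,
        dp[i][j - 1] + 1,
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1),
      );
  return dp[a.length][b.length];
}

// WER: word-level edit distance normalized by reference length.
function wer(reference: string, hypothesis: string): number {
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.toLowerCase().split(/\s+/).filter(Boolean);
  return ref.length === 0 ? 0 : levenshtein(ref, hyp) / ref.length;
}

// Word recall with multiset semantics: each reference token must be
// matched by a distinct extracted token.
function wordRecall(reference: string[], extracted: string[]): number {
  const pool = new Map<string, number>();
  for (const w of extracted) pool.set(w, (pool.get(w) ?? 0) + 1);
  let hit = 0;
  for (const w of reference) {
    const n = pool.get(w) ?? 0;
    if (n > 0) { hit++; pool.set(w, n - 1); }
  }
  return reference.length === 0 ? 1 : hit / reference.length;
}
```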

Injectable audio transcriber (`opts.transcribe`) means tests don't need
GROQ/OpenAI keys — stubbed transcriptions exercise the WER math path
directly.

**Tests (60 new):** adversarial-injections (19) — per-kind assertions +
dispatcher coverage + slug regex conformance; cat6 (12) — variant
determinism, scoreVariant shape, aggregate per-kind + overall metrics,
corpus resolver slug rules; cat11 (29) — charSimilarity / wordRecall /
wer math, htmlToText strips scripts + decodes entities, HTML modality
with real fixtures, audio modality gracefully skips without key + uses
stub transcriber correctly.

All 60 tests pass in 48ms + 41ms.
Total eval suite now: 227 pass, 0 fail.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…im judge

**eval/runner/cat5-provenance.ts** — BrainBench Cat 5 scoring. Samples
claims from gbrain brain-export and classifies each against its source
material via a dedicated Haiku judge (classify_claim tool with a
three-label enum: supported | unsupported | over-generalized).

Separate from judge.ts by design: Cat 5 is a single three-way
classification per claim, not a weighted rubric. Rather than overload
judge.ts with a mode switch, Cat 5 has its own tool definition
(CLASSIFY_CLAIM_TOOL) and prompt. The retry-once pattern, $20 cost gate
semantics, and structured parsing are mirrored from judge.ts so failures
look the same across Cats.

Metric: `citation_accuracy` = fraction where predicted label equals
gold expected_label. Threshold (informational): >0.90 per design-doc
METRICS.md. v1 ships with `enableThreshold: false` so the verdict is
always baseline_only — we don't have hand-authored gold claims yet, and
codex flagged that threshold gating should wait until the amara-life-v1
corpus + gold file authoring lands in Day 3b.

runCat5 uses a bounded-concurrency worker pool (default 4) to respect
Haiku rate limits across 100+ claim batches. Evidence pages are looked
up by slug from a caller-provided pagesBySlug map — missing pages don't
crash, they just pass an empty source list to the judge (correct
behavior for genuinely unsupported claims).

**Tests (23):** classifyClaim happy/retry/fallback paths with stubbed
Haiku, aggregate accuracy math, threshold gating (pass/fail vs
baseline_only), runCat5 concurrency + missing-page handling,
renderClaimPrompt embeds claim + sources correctly, parseClassification
rejects invalid enum values + plain-text responses.

Total eval suite now: 250 pass, 0 fail.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
**eval/runner/cat8-skill-compliance.ts** — Deterministic, judge-free Cat 8
scoring. Replays inbound signals through the agent adapter (Day 5) and
extracts four iron-law metrics directly from the tool-bridge state:

  - brain_first_compliance: agent called search/get_page BEFORE producing
    its final answer. Non-compliance = hallucinating from general knowledge.
  - back_link_compliance: every dry_run_put_page intent has at least one
    markdown [Name](slug) back-link in its compiled_truth.
  - citation_format: timeline entries use canonical `- **YYYY-MM-DD** |
    Source — Summary`; long final answers cite at least one slug.
  - tier_escalation: simple probes use light tooling (≥1 brain call);
    complex probes require ≥2 brain calls or a dry_run write when
    expects_dry_run_write is set.

No judge call required — everything is computable from
`tool_bridge_state.made_dry_run_writes` + `count_by_tool` + final_answer
regex. Fast, deterministic, reproducible.
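Two of the checks reduce to a regex and a scan of the pre-answer tool log; a sketch with the canonical entry shape quoted above (state shapes assumed):

```typescript
// citation_format: every timeline entry matches
// `- **YYYY-MM-DD** | Source — Summary`.
const TIMELINE_ENTRY_RE = /^- \*\*\d{4}-\d{2}-\d{2}\*\* \| .+ — .+$/;

function citationFormatOk(timelineEntries: string[]): boolean {
  return timelineEntries.every((e) => TIMELINE_ENTRY_RE.test(e));
}

// brain_first: given the names of tool calls made BEFORE the final
// answer, at least one must be a brain read.
function brainFirstCompliant(preAnswerToolCalls: string[]): boolean {
  return preAnswerToolCalls.some((name) => name === "search" || name === "get_page");
}
```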

A bounded-concurrency (p-limit-style) worker pool, default 4, keeps
Sonnet rate-limit headroom comfortable across 100-probe batches.

**eval/runner/cat9-workflows.ts** — Rubric-graded Cat 9. 5 canonical
workflows (meeting_ingestion, email_to_brain, daily_task_prep, briefing,
sync) × ~10 scenarios each. Each scenario runs through the agent adapter,
then judge.ts scores the answer against a per-scenario rubric.

`buildEvidence(scenario, agentResult, pagesBySlug)` composes the
JudgeEvidence contract: resolves ground_truth_slugs to full
GroundTruthPage[] from a slug-map, pulls tool_call_summary directly from
tool_bridge_state (no raw tool_result content — Section-3 defense),
attaches rubric from the scenario.

Per-workflow rollup: each workflow gets its own pass_rate so the verdict
can fail one workflow without failing the whole Cat. Overall verdict
requires every populated workflow's pass_rate ≥ threshold (default 0.80)
when enableThreshold=true.

Both Cats default to verdict=baseline_only in v1 per codex fix #9: real
thresholds return after 10-probe Haiku-vs-hand-score calibration (κ > 0.7)
runs against the Day 3b amara-life-v1 corpus.

**Tests (23):** Cat 8 per-metric scorer unit tests covering every branch
(brain_first ordering, back-link compliance on mixed writes, long vs
short answer citation requirement, tier escalation for simple/complex/
writey probes, finalAnswerCiteCount dedups across syntaxes). Cat 9
buildEvidence contract shape — evidence_refs flow from agent, missing
slugs skip gracefully, no raw_transcript/tool_result leakage to judge.
Cat 9 runCat9 integration with stubbed agent + mixed-verdict judge
produces fractional pass rates correctly.

Total eval suite now: 273 pass, 0 fail.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ter boundary

Codex fixes #1, #2, #3 from the plan's outside-voice review. Enforcement
shifts from SOFT-VIA-TYPE-COMMENT to SOFT-VIA-SANITIZED-OBJECT. Hard
enforcement via process isolation waits for BrainBench v2 Docker sandbox.

**eval/runner/types.ts** additions:
  - `PublicPage = Pick<Page, 'slug' | 'type' | 'title' | 'compiled_truth' |
    'timeline'>` — the exact 5 fields adapters should see. No _facts.
    No frontmatter (a known hiding spot for accidental gold leaks).
  - `sanitizePage(p: Page): PublicPage` — returns a NEW object with the 5
    fields only. Cannot be bypassed by `(page as any)._facts` because the
    field does not exist on the sanitized object.
  - `PublicQuery = Omit<Query, 'gold'>` — strips the gold field.
  - `sanitizeQuery(q: Query): PublicQuery` — enumerates public fields
    explicitly (not spread+delete) so no prototype weirdness leaves gold
    reachable.
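The sealing idea in `sanitizePage` can be sketched as (Page shape abridged for the example):

```typescript
type Page = {
  slug: string; type: string; title: string;
  compiled_truth: string; timeline: string[];
  _facts?: unknown; frontmatter?: unknown; // hidden fields the seal strips
};
type PublicPage = Pick<Page, "slug" | "type" | "title" | "compiled_truth" | "timeline">;

// Returns a NEW object with explicit field enumeration (no spread+delete),
// so `(page as any)._facts` is undefined on the result by construction.
function sanitizePage(p: Page): PublicPage {
  return {
    slug: p.slug, type: p.type, title: p.title,
    compiled_truth: p.compiled_truth, timeline: p.timeline,
  };
}
```

Because the hidden keys never exist on the sanitized object, no cast or prototype trick on the adapter side can reach them.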

**eval/runner/multi-adapter.ts** — scoreOneRun now calls sanitizePage /
sanitizeQuery before passing to adapter.init / adapter.query. The scorer
retains the full Query shape (including gold.relevant) for precision /
recall computation. Adapter signatures unchanged — the sealing is at the
OBJECT level, not the type level. This keeps existing adapters
(ripgrep-bm25, vector-only, hybrid-nograph, gbrain-after) binary-compatible.
Verified: no existing adapter reads q.gold or page._facts, so the change
is safe without further adapter updates.

**test/eval/sealed-qrels.test.ts** (17 tests):
  - sanitizePage strips _facts + frontmatter + arbitrary hidden keys
  - Output has exactly the 5 public keys (deep introspection)
  - Proxy tripwire simulates a malicious adapter: any access to _facts or
    gold throws `sealed-qrels violation`
  - sanitizeQuery retains optional fields (as_of_date, tags, author,
    acceptable_variants, known_failure_modes) but omits undefined ones
  - Honest documentation of the seal's limits: filesystem bypass and
    Proxy attacks would still work in v1; Docker isolation (v2) is the
    real enforcement

Every existing eval test still passes (273 before + 17 sealed-qrels = 290).

Total eval suite now: 290 pass, 0 fail.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Final wiring of BrainBench v1 Complete. all.ts now orchestrates the full
Cat catalog (1-12) via a mix of subprocess dispatch (Cats 1, 2, 3, 4, 6,
7, 10, 11, 12 — standalone runners with CLI entry points) and
programmatic invocation (Cats 5, 8, 9 — require runtime inputs that
can't come via CLI flags). Subprocess Cats run concurrently under a
p-limit(2) bound to cap peak memory at ~800MB (two PGLite instances
at ~400MB each).

Cats 5/8/9 show as "programmatic" in the report with a one-line
reference to their `runCatN({...})` harness API. They're deliberately
left out of the master runner because their inputs (claim catalog,
probe catalog, scenario catalog, pre-seeded agent state, evidence
pagesBySlug) are task-specific and assembled at the caller.

**eval/runner/all.ts** — rewritten:
  - CATEGORIES is a tagged union of SubprocessCategory | ProgrammaticCategory
  - runCatSubprocess spawns Bun with piped stdout/stderr and a 10-min
    timeout per Cat (exit code 124 + SIGTERM on timeout; no hung
    subprocesses)
  - runConcurrently is a bounded worker pool preserving input order
  - buildReport emits the full markdown with per-Cat elapsed times,
    migration-noise filter, and a separate programmatic-only section
  - Honors BRAINBENCH_N (1/5/10 for smoke/iteration/published),
    BRAINBENCH_CONCURRENCY (default 2),
    BRAINBENCH_LLM_CONCURRENCY (default 4, consumed by llm-budget)
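The order-preserving bounded pool can be sketched as (a minimal version; the real `runConcurrently` also threads per-Cat timing and errors):

```typescript
// Runs at most `limit` tasks at once; results land at their input
// index, so output order matches input order regardless of finish order.
async function runConcurrently<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0; // shared cursor is safe: JS workers interleave on awaits only
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}
```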

**eval/runner/llm-budget.ts** — shared LLM rate-limit semaphore. A full
N=10 published scorecard makes ~900 Anthropic calls (150 Cat 8/9 probes
× N=10 + 100 Cat 5 claims × N=10). Without coordination, concurrent
adapters trigger 429s on per-minute limits.

  - LlmBudget class: acquireSlot/releaseSlot + withLlmSlot(fn) wrapper
    that releases on success AND throw (try/finally)
  - getDefaultLlmBudget() singleton reads BRAINBENCH_LLM_CONCURRENCY,
    falls back to 4 on missing/garbage values
  - capacity enforced ≥1 (rejects 0/negative)
  - Double-release is a no-op (guards against upstream double-call bugs)
  - Active + waiting counts exposed for observability / tests
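The semaphore's core behaviors (release on throw, slot handoff to waiters) can be sketched as follows (class and method names from the commit; internals assumed, observability counters omitted):

```typescript
class LlmBudget {
  private active = 0;
  private waiting: Array<() => void> = [];

  constructor(private capacity: number) {
    if (capacity < 1) throw new Error("capacity must be >= 1");
  }

  private async acquireSlot(): Promise<void> {
    if (this.active < this.capacity) { this.active++; return; }
    // Park until releaseSlot hands this waiter the slot directly.
    await new Promise<void>((resolve) => this.waiting.push(resolve));
  }

  private releaseSlot(): void {
    if (this.active === 0) return; // double-release is a no-op
    const next = this.waiting.shift();
    if (next) next(); // hand the slot to a waiter without dropping active
    else this.active--;
  }

  async withLlmSlot<R>(fn: () => Promise<R>): Promise<R> {
    await this.acquireSlot();
    try {
      return await fn();
    } finally {
      this.releaseSlot(); // released on success AND throw
    }
  }
}
```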

**package.json** scripts:
  - eval:brainbench           — default N=5 iteration
  - eval:brainbench:smoke     — N=1 for fast iteration
  - eval:brainbench:published — N=10 for committed baselines
  - eval:cat6 / eval:cat11    — individual new subprocess Cats

**Tests (24):** CATEGORIES catalog enforces the exact Cat-number partition
(subprocess: 1,2,3,4,6,7,10,11,12; programmatic: 5,8,9). runConcurrently
respects the cap (observable via peak in-flight counter), preserves input
order under non-uniform delays, handles empty input. LlmBudget enforces
capacity, releases on throw, honors env var, rejects 0/negative.
buildReport filters migration noise, counts passed/failed/programmatic
correctly, includes every Cat + programmatic-only section.

Full eval suite now: 314 pass, 0 fail (15 test files).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
…che/

Two fixes surfaced during the Day 3b real-corpus run against Opus 4.5:

**eval/generators/amara-life-gen.ts** — Current Opus rejects
`temperature` and `top_p` together:
```
400 invalid_request_error: `temperature` and `top_p` cannot both be
specified for this model. Please use only one.
```
top_p=1.0 was a no-op (no nucleus truncation), so removing it has zero
semantic effect. The field stays part of MODEL_PARAMS in the cache key,
so any past cache entries (none in v1) would invalidate cleanly on the
next schema version bump.

**.gitignore** — `eval/data/amara-life-v1/_cache/` is runtime Opus
cache (398 files, ~1.6MB). Regenerable from seed; no point in source
control. The corpus itself (inbox/slack/calendar/meetings/notes/docs +
corpus-manifest.json with per-item content_sha256) stays committable
for reproducibility, just the cache directory gets excluded.

Real corpus generation ran cleanly after these two fixes: 398 LLM
calls, 84,424 input / 38,062 output tokens, \$4.12 spent (vs \$20 cap,
vs \$12 estimate). All 418 items produced. Poison fixtures use
subtle paraphrased injection ("for anyone on your team who might be
triaging this thread later…") — exactly the pattern that defeats
regex redaction and requires the structured-evidence judge contract
from Day 5.

Corpus itself stays local (will move to the brainbench sibling repo
during the v0.16 split per the design doc). No eval/data/amara-life-v1/
content landing in this PR.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
@garrytan garrytan changed the title feat: BrainBench v1 — 4-adapter benchmark + portable schemas (v0.15.0) feat: BrainBench v1 — v0.15.0 4-adapter benchmark + portable schemas Apr 21, 2026
@garrytan garrytan changed the title feat: BrainBench v1 — v0.15.0 4-adapter benchmark + portable schemas feat: BrainBench v1 — v0.18.0 4-adapter benchmark + portable schemas Apr 21, 2026
@garrytan garrytan changed the title feat: BrainBench v1 — v0.18.0 4-adapter benchmark + portable schemas feat: BrainBench v1 — v0.17.0 4-adapter benchmark + portable schemas Apr 21, 2026
@garrytan garrytan changed the title feat: BrainBench v1 — v0.17.0 4-adapter benchmark + portable schemas feat: v0.17.0 — BrainBench v1 — 4-adapter benchmark + portable schemas Apr 21, 2026
Renumbered from 0.17.0 per the gbrain-versioning slot. Other work is
landing on master around this PR; 0.18 is the slot locked for this
BrainBench v1 Complete release. Also pushed the "brainbench split"
forward reference in the CHANGELOG from v0.18 → v0.19 to match.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@garrytan garrytan changed the title feat: v0.17.0 — BrainBench v1 — 4-adapter benchmark + portable schemas feat: v0.18.0 — BrainBench v1 — 4-adapter benchmark + portable schemas Apr 21, 2026
garrytan and others added 3 commits April 21, 2026 19:48
BrainBench lived in this repo through v0.17, which meant every gbrain install
pulled down ~5MB of eval corpus, benchmark reports, and a pdf-parse devDep
that the 99% of users who never run benchmarks don't need.

v0.18 moves the full eval harness, 14 eval test files (314 tests), all
docs/benchmarks scorecards, and the pdf-parse devDep to
github.com/garrytan/gbrain-evals. That repo depends on gbrain via GitHub URL
and consumes it through a new public exports map.

What stays in gbrain:
- Page.type enum extensions (email | slack | calendar-event | note | meeting)
  useful for any ingested format, not just evals
- inferType() heuristics for /emails/, /slack/, /cal/, /notes/, /meetings/
- 11 new public exports covering the gbrain internals gbrain-evals consumes
  (gbrain/engine, gbrain/pglite-engine, gbrain/search/hybrid, etc.) — now
  gbrain's stable third-party contract

What moved:
- eval/ — 4.6MB of schemas, runners, adapters, generators, CLI tools
- test/eval/ — 14 test files, 314 tests
- docs/benchmarks/ — all scorecards and regression reports
- eval:* package.json scripts
- pdf-parse devDep

Tests: 1760 pass, 0 fail, 174 skipped (E2E require DATABASE_URL).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Master landed significant work since this branch was cut (v0.15.x → v0.16.x →
v0.17.0 gbrain dream + runCycle → v0.18.0 multi-source brains → v0.18.1 RLS
hardening). Bumped this branch's version from the claimed 0.18.0 to 0.19.0
because master already owns 0.18.x.

Conflicts resolved:
- VERSION: 0.19.0 (was 0.18.0 on HEAD vs 0.18.1 on master)
- package.json: 0.19.0, kept all 11 eval-facing exports, merged master's
  typescript devDep + postinstall script + test script (typecheck added)
- src/core/types.ts: union of both PageType additions. Master had added
  `meeting | note`; this branch added `email | slack | calendar-event`
  for inbox/chat/calendar ingest. Final enum carries all five.
- CHANGELOG.md: renumbered the BrainBench-extraction entry to 0.19.0 and
  placed it above master's 0.18.1 RLS entry. Tweaked copy ("In v0.17 it
  lived inside this repo" → "Previously it lived inside this repo") to
  stop implying a specific version that never shipped.
- CLAUDE.md: adjusted "BrainBench in a sibling repo" heading from
  (v0.18+) → (v0.19+).
- docs/benchmarks/2026-04-18-minions-vs-openclaw-production.md:
  resolved modify-vs-delete conflict in favor of delete (the extraction).
- scripts/llms-config.ts: dropped the docs/benchmarks/ entry (directory
  no longer exists here; lives in gbrain-evals).
- llms.txt / llms-full.txt: regenerated after the config change.
- bun.lock: accepted master's (master already dropped pdf-parse as a
  drive-by; aligned with our removal).

Tests: 2094 pass, 236 skip, 18 fail. Spot-checked failures — build-llms,
dream, orphans tests all pass in isolation. Failures reproduce only under
full-suite parallel load and are pre-existing master flakiness (matches the
graph-quality flake noted in the earlier summary). Not merge-introduced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Master is now at v0.18.2 (migration hardening + RLS + multi-source brains).
BrainBench extraction ships as v0.20.0 to leave v0.19 free for any in-flight
work on other branches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan garrytan changed the title feat: v0.18.0 — BrainBench v1 — 4-adapter benchmark + portable schemas feat: v0.20.0 — extract BrainBench to sibling gbrain-evals repo Apr 23, 2026
@garrytan garrytan force-pushed the garrytan/gbrain-evals branch from b21adec to 871227c on April 23, 2026 17:48
garrytan and others added 7 commits April 23, 2026 11:09
# Conflicts:
#	CHANGELOG.md
#	VERSION
#	docs/benchmarks/2026-04-18-minions-vs-openclaw-production.md
#	llms-full.txt
#	llms.txt
#	package.json
#	scripts/llms-config.ts
#	src/core/engine.ts
#	src/core/migrate.ts
#	src/core/postgres-engine.ts
#	src/core/types.ts
#	test/migrate.test.ts
The Eval tests workflow ran `bun run eval:query:validate`, `test:eval`, and
`eval:world:render` — all three scripts moved to the gbrain-evals repo when
BrainBench was extracted in v0.20.0. The workflow has been failing on master
since the split because the scripts no longer exist here.

Eval CI now runs from gbrain-evals's own workflows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
Six test files spin up PGLite + 20 migrations + git repos in beforeEach/
beforeAll hooks. Under 136-way parallel test file execution, bun's default
5s hook timeout wasn't enough, producing 18 flaky failures that only
reproduced under full-suite parallel load (all 6 files passed in isolation).

Root cause: PGLite.create() + initSchema() takes ~3-5s under idle load, but
under 136 concurrent WASM instantiations the OS thrashes and hooks stall
well past 5s. The bunfig.toml `timeout = 60_000` applies to TESTS, not HOOKS
— bun requires per-hook timeouts as the third beforeEach/beforeAll argument.

Files touched (hook timeouts added, no test logic changed):
- test/dream.test.ts           — 5 describe blocks × before/afterEach
- test/orphans.test.ts         — 1 beforeEach + afterEach
- test/core/cycle.test.ts      — shared beforeAll + afterAll
- test/brain-allowlist.test.ts — beforeAll + afterAll
- test/extract-db.test.ts      — beforeAll + afterAll
- test/multi-source-integration.test.ts — beforeAll + afterAll

Results: 2317 pass / 0 fail (was 2253 pass / 18 fail).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the one gap surfaced by the Step 7 coverage audit. 9 table-driven
assertions cover the new Page.type branches:
  emails/*.md, email/*.md       -> 'email'
  slack/*.md                    -> 'slack'
  cal/*.md, calendar/*.md       -> 'calendar-event'
  notes/*.md, note/*.md         -> 'note'
  meetings/*.md, meeting/*.md   -> 'meeting'

The fixtures use realistic paths from the amara-life-v1 corpus in the
sibling gbrain-evals repo (em-0001, sl-0037, evt-0042, mtg-0003) so the
test doubles as a contract check between the two repos.
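The shape of the table-driven check, with a stand-in heuristic (the real `inferType` lives in gbrain and its regexes may differ; the path table mirrors the mapping above):

```typescript
// Stand-in for the directory heuristic under test: classify a page by
// the name of its parent directory (assumed logic for this sketch).
function inferType(path: string): string {
  const dir = path.split("/").slice(-2, -1)[0] ?? "";
  if (/^emails?$/.test(dir)) return "email";
  if (dir === "slack") return "slack";
  if (/^cal(endar)?$/.test(dir)) return "calendar-event";
  if (/^notes?$/.test(dir)) return "note";
  if (/^meetings?$/.test(dir)) return "meeting";
  return "doc";
}

// Table-driven cases using the amara-life-v1 fixture naming scheme.
const cases: Array<[string, string]> = [
  ["emails/em-0001.md", "email"],
  ["email/em-0002.md", "email"],
  ["slack/sl-0037.md", "slack"],
  ["cal/evt-0042.md", "calendar-event"],
  ["calendar/evt-0043.md", "calendar-event"],
  ["notes/2026-04-18-topic.md", "note"],
  ["meetings/mtg-0003.md", "meeting"],
];
for (const [path, expected] of cases) {
  if (inferType(path) !== expected) throw new Error(`${path} -> ${inferType(path)}`);
}
```

In the real suite each row becomes a `test.each` case so every path variant reports as its own test.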

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…as completed

All five BrainBench categories shipped in v0.20.0 (to the gbrain-evals
sibling repo). v0.10.5 inferLinkType regex expansion shipped in-tree.

Remaining P1 BrainBench work: Cat 1+2 at full scale (2-3K pages) —
currently 240 pages in world-v1 corpus.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CLAUDE.md: add v0.19 commands to key-files list (skillify, skillpack,
routing-eval, filing-audit, skill-manifest, resolver-filenames);
add 8 new test files + openclaw-reference-compat E2E to test index;
repoint the release-summary template's benchmark source from
`docs/benchmarks/[latest].md` to `gbrain-evals/docs/benchmarks/` since
those files now live in the sibling repo.

CHANGELOG voice polish for v0.20.0: replace em dashes with periods,
parens, or ellipses per project style guide. No content changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan garrytan changed the title feat: v0.20.0 — extract BrainBench to sibling gbrain-evals repo v0.20.0 feat: extract BrainBench to sibling gbrain-evals repo Apr 24, 2026
garrytan and others added 2 commits April 23, 2026 23:12
…es CI)

The v0.20.0 doc-sync commit (9e567bb) added 7 new v0.19 modules to the
CLAUDE.md Key Files index and polished CHANGELOG voice. Both are
includeInFull: true inputs to llms-full.txt but the generator wasn't
re-run, so the drift-detection guard (test/build-llms.test.ts) failed CI.

One-line fix: regenerate. No content changes beyond what the two source
docs already carry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
@garrytan garrytan merged commit 8b3c24c into master Apr 24, 2026
4 checks passed
garrytan added a commit that referenced this pull request Apr 24, 2026
Pulls upstream v0.20.0 (#195): extract BrainBench to sibling
gbrain-evals repo. Evals move out of gbrain proper.

Conflicts resolved:
- VERSION — kept 0.21.0; upstream is 0.20.0
- package.json — v0.21.0 wins
- CHANGELOG.md — v0.21.0 preserved above upstream's v0.20.0

Build clean: 0.21.0 binary runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan added a commit that referenced this pull request Apr 24, 2026
Master shipped its own v0.19.1 (smoke-test skillpack, PR #369) and then
v0.20.0 (BrainBench extraction to sibling repo, PR #195) while this
branch was in review. Bumping our release from v0.19.1 to v0.20.1
follows the CLAUDE.md rule: "VERSION must be higher than master's."

Resolved conflicts:

- VERSION: 0.19.1 (ours) + 0.20.0 (master) → 0.20.1
- package.json: same bump applied to the version field
- CHANGELOG.md: our queue-resilience entry renamed from v0.19.1 to
  v0.20.1 (6 inline refs updated across the body: numbers-that-matter
  table, "To take advantage" block, pre-v0.20.1 code reference,
  adversarial-review mention, and the v0.19.2 → v0.20.2 deferral
  reference for composite indexes). Entry stays at the top of the
  file, followed by master's v0.20.0 (BrainBench) and v0.19.1
  (smoke-test skillpack). Sequence is now 0.20.1 → 0.20.0 → 0.19.1 →
  0.19.0 → 0.18.2 → ...

Rebuilt binary reports gbrain 0.20.1.

Pre-merge verification carried forward:
- 143 minions + 13 doctor unit tests pass
- typecheck clean
- 189 E2E tests pass against real Postgres 16 + pgvector
- All 3 smoke cases pass (basic, --sigkill-rescue, --wedge-rescue)
- queue_health doctor check fires correctly on a forged stalled-forever job

No source changes — conflict resolution was version-label surgery only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ChenyqThu pushed a commit to ChenyqThu/jarvis-knowledge-os-v2 that referenced this pull request Apr 27, 2026
Merge upstream/master (commit 11abb24, gbrain v0.20.4) into KOS v2 fork.

Six upstream commits land:
- v0.19.0 check-resolvable OpenClaw fallback (garrytan#326)
- v0.19.1 smoke-test skillpack (garrytan#369)
- v0.20.0 BrainBench extracted to sibling repo (garrytan#195)
- v0.20.2 jobs supervisor (garrytan#364) — Postgres-only, PGLite skips
- v0.20.3 queue resilience + queue_health doctor (garrytan#379) — Postgres-only
- v0.20.4 minion-orchestrator skill consolidation (garrytan#381)

Conflicts resolved (2 real, 5 auto):
- .gitignore: union both fork (.omc/, kos-jarvis log globs) and upstream
  (eval/data/world-v1/world.html, amara-life-v1 cache) entries.
- skills/manifest.json: append upstream's smoke-test skill plus retain
  the 9 kos-jarvis fork skills (39 total).
- CLAUDE.md / README.md / package.json (0.20.4) / skills/RESOLVER.md /
  src/cli.ts (mode 0755) auto-merged cleanly.
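The .gitignore union resolution above (keep every line from both sides) can be reproduced with an order-preserving dedup; the sample entries below stand in for the fork and upstream globs:

```shell
# Union-merge two diverged .gitignore copies: first occurrence of each
# line wins, duplicates are dropped.
printf '.omc/\nkos-jarvis-*.log\n' > ours.gitignore
printf 'eval/data/world-v1/world.html\n.omc/\n' > theirs.gitignore
awk '!seen[$0]++' ours.gitignore theirs.gitignore > merged.gitignore
cat merged.gitignore
```

For future merges, declaring `.gitignore merge=union` in .gitattributes tells git to keep both sides' lines automatically instead of raising this conflict again.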

Fork-local patches preserved (verified post-merge):
- src/core/pglite-schema.ts:65 — idx_pages_source_id commented out
  (upstream garrytan#370 still open, fix retained).
- src/core/pglite-engine.ts:87 — pg_switch_wal() before close()
  (WAL durability patch, no upstream issue filed yet).
- src/cli.ts mode 100755 — bun shim executable bit.

Issue garrytan#332 (v0_13_0 process.execPath) was fixed upstream in v0.19.0;
running gbrain apply-migrations --yes will clear the partial-ledger
remainder that has been stuck in doctor since the v0.13 sync.

v0.20's headline features (jobs supervisor, queue_health, wedge-rescue,
backpressure-audit) are Postgres-only and skip on our PGLite engine.
The sync is preventive: it keeps the fork mergeable rather than buying
new runtime capability.

Pre-merge baseline (HEAD 170876f):
- pages 1988, chunks 3750 (100% embedded), links 8522, timeline 10881
- doctor health 60/100 (failed: minions_migration partial 0.13.0)
- brain_score 86/100

Rollback: git tag pre-sync-v0.20-1777105378
PGLite snapshot: ~/.gbrain/brain.pglite.pre-sync-v0.20-1777105391 (416M)
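The recorded rollback path has two halves: reset the repo to the pre-sync tag, then restore the PGLite data dir from the snapshot copy. A sketch simulated in a temp dir so it runs anywhere; the real locations are the ~/.gbrain paths and tag quoted above:

```shell
# Stand-in for the snapshot and live data dirs.
work=$(mktemp -d)
mkdir -p "$work/brain.pglite.pre-sync" "$work/brain.pglite"
echo "pre-sync state" > "$work/brain.pglite.pre-sync/marker"
echo "post-sync state" > "$work/brain.pglite/marker"

# Repo side (run inside the clone):
# git reset --hard pre-sync-v0.20-1777105378

# Data side: drop the live dir, restore from the snapshot.
rm -rf "$work/brain.pglite"
cp -a "$work/brain.pglite.pre-sync" "$work/brain.pglite"
cat "$work/brain.pglite/marker"   # → pre-sync state
```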