v0.25.0 feat: BrainBench-Real session capture + public-exports contract test by garrytan · Pull Request #437 · garrytan/gbrain

garrytan · 2026-04-26T05:49:09Z

Summary

v0.25.0 ships BrainBench-Real: when capture is enabled (off by default; contributors opt in with GBRAIN_CONTRIBUTOR_MODE=1 or eval.capture in config), every real query and search your agents run via MCP, CLI, or the subagent tool-bridge is captured into an eval_candidates table, scrubbed of PII at write time, and streamable as NDJSON for replay against gbrain-evals. A public-exports contract test also locks the 17-subpath surface gbrain-evals depends on.

Capture (R1)

New v30 migration: eval_candidates + eval_capture_failures (Postgres + PGLite). RLS gated on BYPASSRLS, CHECK constraint on length(query) <= 51200, indexes on created_at DESC for export and ts DESC for doctor's 24h window.
Op-layer capture wrapper in src/core/eval-capture.ts decorates query/search handlers in src/core/operations.ts — covers MCP, CLI, and subagent tool-bridge from one site.
PII scrubber src/core/eval-capture-scrub.ts — 6 regex families (email, phone, SSN, Luhn-verified CC, JWT, bearer). Adversarial-input safe.
Cross-process visibility via engine.logEvalCaptureFailure + gbrain doctor 24h breakdown.
gbrain eval export [--since DUR] [--limit N] [--tool ...] streams NDJSON with schema_version: 1 prefix per row. EPIPE-safe, deterministic ordering (created_at DESC, id DESC).
gbrain eval prune --older-than DUR [--dry-run] for explicit retention.
BrainEngine interface gains 5 methods (drives v0.25.0 minor bump for downstream custom-engine implementers).
hybridSearch opts gain onMeta?: (meta) => void callback. Cathedral II callers unaffected — return type stays Promise<SearchResult[]>.
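Config eval: { capture?, scrub_pii? } in ~/.gbrain/config.json (file-plane only). scrub_pii defaults to true; capture defaults to off unless GBRAIN_CONTRIBUTOR_MODE=1 is set (the original draft defaulted both to true; the CONTRIBUTOR_MODE commit flipped capture to opt-in).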
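The Luhn gate on the credit-card family can be illustrated with a minimal sketch. The function names, the candidate regex, and the `[REDACTED_CC]` placeholder are hypothetical and not taken from src/core/eval-capture-scrub.ts; the sketch only shows the idea of redacting a digit run when its checksum validates.

```typescript
// Hypothetical sketch: names, regex, and placeholder are illustrative,
// not gbrain's actual scrubber.

/** Returns true when the digit string passes the Luhn checksum. */
function passesLuhn(digits: string): boolean {
  let sum = 0;
  let doubleNext = false;
  // Walk right to left, doubling every second digit.
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48;
    if (d < 0 || d > 9) return false;
    if (doubleNext) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    doubleNext = !doubleNext;
  }
  return digits.length > 0 && sum % 10 === 0;
}

// 13 to 19 digits, optionally separated by single spaces or hyphens.
const CARD_CANDIDATE = /\b\d(?:[ -]?\d){12,18}\b/g;

/** Redacts only those digit runs whose checksum validates. */
function scrubCards(text: string): string {
  return text.replace(CARD_CANDIDATE, (match) => {
    const digits = match.replace(/[ -]/g, "");
    return passesLuhn(digits) ? "[REDACTED_CC]" : match;
  });
}
```

A checksum gate of this kind keeps most arbitrary digit runs intact, but a Luhn-valid order ID or invoice number would still be redacted, which is the tradeoff the adversarial review flags.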

Public exports contract (R2)

test/public-exports.test.ts imports each of the 17 subpaths via package name + pins canary symbols.
scripts/check-exports-count.sh CI guard wired into bun test — count change fails the build.
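The two layers can be sketched as pure functions. This is a hypothetical illustration, not the contents of test/public-exports.test.ts or check-exports-count.sh: the function names are invented, and treating every key of the "exports" field as one subpath is an assumption.

```typescript
// Hypothetical sketch; the real guard is a shell script plus a bun test.
type ExportsField = Record<string, unknown>;

const EXPECTED_COUNT = 17; // baseline stated in the PR description

/** Returns an error message when the subpath count drifts, or null when it matches. */
function checkExportsCount(
  exportsField: ExportsField,
  expected: number = EXPECTED_COUNT,
): string | null {
  const count = Object.keys(exportsField).length;
  if (count === expected) return null;
  const direction = count < expected ? "shrank" : "grew";
  return `public exports ${direction}: expected ${expected}, found ${count}`;
}

/** Lists canary symbols that a loaded module no longer exports. */
function missingCanaries(
  mod: Record<string, unknown>,
  canaries: string[],
): string[] {
  return canaries.filter((name) => !(name in mod));
}
```

Failing on growth as well as shrinkage forces whoever adds a subpath to update the baseline and pin a canary deliberately.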

Adversarial review fixes (commit c4dd09d)

gbrain doctor eval_capture check now distinguishes pre-v30 missing-table (ok / skipped) from RLS-denied SELECT (warn) and other DB errors (warn). Previously every DB error was reported as ok / skipped, which masked the most diagnostic class.
hybridSearch.onMeta invocation wrapped in try/catch — a throwing user-supplied callback can't break the search hot path.
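A guarded callback invocation of this kind might look like the sketch below. The helper name emitMeta appears in the commit message, but its signature here is an assumption, as are the field types of HybridSearchMeta and the searchWithMeta stand-in.

```typescript
// Field names come from the PR text; their types here are assumptions.
interface HybridSearchMeta {
  vector_enabled: boolean;
  expansion_applied: boolean;
  detail_resolved: string | null;
}

type OnMeta = (meta: HybridSearchMeta) => void;

/** Invokes the optional callback and swallows anything it throws. */
function emitMeta(onMeta: OnMeta | undefined, meta: HybridSearchMeta): void {
  if (!onMeta) return;
  try {
    onMeta(meta);
  } catch {
    // A faulty observer must not affect search results.
  }
}

// Hypothetical stand-in for the search: results are returned
// regardless of what the callback does.
function searchWithMeta(results: string[], onMeta?: OnMeta): string[] {
  emitMeta(onMeta, {
    vector_enabled: false,
    expansion_applied: false,
    detail_resolved: null,
  });
  return results;
}
```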

Plan completion (CEO + Eng + Codex reviewed)
17/17 final-scope items shipped. 3 CHANGED with documented rationale: migration v25 → v30 (slots taken), hybridSearch meta as onMeta callback vs return shape, and v0.25.0 vs original v0.21.0 (rebased forward through Cathedral II). 3 deferred items (E3 bounded queue, eval log/stats, answer_text writes) handled correctly per plan.

Test Coverage

92% AI-assessed coverage. Two minor gaps documented:

src/commands/doctor.ts eval_capture check has no dedicated unit test for ok / warn / skipped branches.
src/core/config.ts eval key has no loadConfig() round-trip test.

v0.25.0 BrainBench-Real Substrate — Coverage
============================================
src/core/eval-capture-scrub.ts        ★★★  (17 cases — regex families + Luhn + adversarial)
src/core/eval-capture.ts              ★★★  (21 cases — build, classify, capture, gates)
src/core/search/hybrid.ts onMeta      ★★★  ( 7 cases — meta states + no-callback contract)
src/core/operations.ts query/search   ★★★  (11 cases — MCP/CLI/subagent + off-switch)
src/core/migrate.ts v30               ★★★  ( 7 unit + 5 E2E Postgres cases)
src/core/pglite-engine.ts (5 methods) ★★★  (14 cases — every method + CHECK + clamp)
src/core/postgres-engine.ts (5 mthds) ★★   ( 5 E2E cases via test/e2e/eval-capture.test.ts)
src/commands/eval-export.ts           ★★★  ( 9 cases — schema_version, filters, error exits)
src/commands/eval-prune.ts            ★★★  ( 5 cases — delete, dry-run, validation)
src/commands/eval.ts sub-dispatch     ★★   (covered indirectly via export+prune CLI tests)
src/commands/doctor.ts eval_capture   ★    GAP — no dedicated test for ok/warn/skipped
src/core/config.ts eval key           ★    GAP — no loadConfig roundtrip test
public exports + count guard          ★★★  (test/public-exports.test.ts + check-exports-count.sh)

Tests: 184 v0.25.0-related test cases (8 new files + migrate.test.ts extension + Postgres E2E gated on DATABASE_URL).

Pre-Landing Review

Ran prior to this PR (logged via plan-eng-review × 2 rounds + plan-ceo-review + codex outside-voice). Status CLEAR. Two surgical fixes applied this run from /ship adversarial review (commit c4dd09d).

Adversarial Review

Found 8 actionable items: 2 fixed in this PR (commit c4dd09d), 6 TODO'd as P1 v0.25.1 follow-ups in TODOS.md (filed as v0.22.1 before the version rename):

gbrain eval prune --dry-run: replace listEvalCandidates(limit:100k) + filter with a real engine.countEvalCandidatesBefore(date). Today the 100k-cap warning at eval-prune.ts:107-109 is honest, but the dry-run count can still mislead on a brain with more than 100k rows.
PII scrubber CC false-positive rate: Luhn-valid 16-digit order IDs / invoice numbers get redacted. Either add a contextual prefix gate or document the tradeoff.
eval_capture_failures.reason enum value 'scrubber_exception' is dead telemetry (regex-only scrubber never throws). Remove or wrap.
CLAUDE.md id DESC claim doesn't hold across overlapping windows when LIMIT < total. Either add an id-cursor for export or scope the doc claim.
6 of 17 public-exports subpaths have empty canary lists — pin a canary symbol per subpath.
EXPECTED_COUNT duplicated in scripts/check-exports-count.sh and test/public-exports.test.ts — single source.

Test triage

Full bun test reports 27 pre-existing failures across cathedral-ii-pglite.test.ts, cathedral-ii-brainbench.test.ts, sync.test.ts, reindex-code.test.ts — all pre-existing on master (git diff origin/master...HEAD --stat against those files is empty). Each failing file passes in isolation. Pattern: error: PGLite not connected. Call connect() first. — concurrent PGLite init exhaustion under bun's parallel runner. Not introduced by this branch. Tracked as P0 in TODOS.md for a dedicated investigation branch.

All v0.25.0 unit tests (184 cases across 10 files) pass cleanly.

Plan Completion

17/17 final-scope items shipped. See ~/.claude/plans/system-instruction-you-are-working-humming-giraffe.md for the full plan (CEO + Eng + Codex + second-eng-review).

TODOS

P0 added: PGLite test-runner concurrency flake
P1 added: 6-item v0.25.1 follow-up (adversarial review hardenings)

Documentation

Refreshed by /document-release after the CONTRIBUTOR_MODE pivot (commit 175524a5). All five entry-point docs now reflect the off-by-default capture model:

README.md — top "New in v0.25.0" callout rewritten to lead with GBRAIN_CONTRIBUTOR_MODE=1. Was implying capture-on-by-default; corrected. Contributor section near the end already pointed at docs/eval-bench.md from the prior commit.
AGENTS.md — new "Eval retrieval changes" common-task entry so non-Claude agents (Codex, Cursor, Aider) get the same one-line path: env var → export → replay.
CLAUDE.md — "Key commands added in v0.25.0" section gains the gbrain eval replay line and a CONTRIBUTOR_MODE bullet covering the resolution order. Existing eval-capture key-files entry already updated in 7a80ce25 to document the gate.
CHANGELOG.md — headline corrected (was "every real query gets captured"; now leads with the contributor benchmarking workflow). Stale "v0.22 ships the substrate" → v0.25. Test count 82 → 144 (added 16 replay + 9 CONTRIBUTOR_MODE + 8 v31-shape tests since the original count). Two new metric rows in the numbers table: default-off posture, in-tree replay tooling. "To take advantage" block split into user vs contributor branches with shell-rc instructions.
TODOS.md — v0.22.1 follow-up reference corrected to v0.25.1.

Plus docs/eval-capture.md and docs/eval-bench.md (added in earlier commits, now cross-referenced consistently). llms.txt + llms-full.txt regenerated.

Verification: bun run typecheck clean. 198/198 v0.25.0 tests still green. Branch is 14 commits ahead of master.

Test plan

bun test (174 files, 2651 pass — 27 pre-existing master flakes triaged)
v0.25.0 specific tests: 184/184 pass (10 files)
bun run typecheck clean
CI guards pass (check-jsonb-pattern.sh, check-progress-to-stdout.sh, check-wasm-embedded.sh, check-exports-count.sh)
Postgres E2E test ready (test/e2e/eval-capture.test.ts — gated on DATABASE_URL)
Manual: dogfood gbrain eval export --since 1d on Garry's brain post-merge

🤖 Generated with Claude Code

R1 substrate for BrainBench-Real, replayed onto master after Cathedral II landed.
Migration v30 (slotted after master's v25-v29 Cathedral II wave) creates two tables:
- eval_candidates: per-call capture of MCP/CLI/subagent query+search traffic. Column set lets gbrain-evals replay with full fidelity: source_ids from v0.18 multi-source, vector_enabled/detail_resolved/expansion_applied so replay knows what hybridSearch actually did, remote + job_id + subagent_id so rows are traceable to their origin. query is CHECK-capped at 50KB; PII scrubber (Lane 1B) runs before insert.
- eval_capture_failures: cross-process audit trail. In-process counters don't work because `gbrain doctor` runs in a separate process from the MCP server. Persistent rows let doctor query capture health via COUNT(*) GROUP BY reason over the last 24h.
Both tables get RLS on Postgres gated on BYPASSRLS (matches v24/v29 posture). PGLite ignores RLS; sqlFor split carries only DDL.
5 new BrainEngine methods (breaking-interface addition, drives v0.22.0 minor bump): logEvalCandidate, listEvalCandidates, deleteEvalCandidatesBefore, logEvalCaptureFailure, listEvalCaptureFailures.
listEvalCandidates uses ORDER BY created_at DESC, id DESC so `gbrain eval export` is deterministic across same-millisecond inserts.
Also adds HybridSearchMeta type for the side-channel callback used by Lane 1C's op-layer capture (no change to hybridSearch return shape; that respects Cathedral II's existing SearchResult[] contract).
Tests: 14 PGLite round-trip cases + 8 v30 structural assertions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
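Why the id tiebreaker matters can be shown with an in-memory equivalent of ORDER BY created_at DESC, id DESC. This is a sketch under assumptions: the row shape is reduced to two fields, and both are modeled as numbers, which may not match the real column types.

```typescript
// Assumed minimal row shape; real rows carry many more columns.
interface CandidateRow {
  id: number;
  created_at: number; // epoch milliseconds
}

/** Newest first; ties on created_at are broken by id, also descending. */
function compareNewestFirst(a: CandidateRow, b: CandidateRow): number {
  if (a.created_at !== b.created_at) return b.created_at - a.created_at;
  return b.id - a.id;
}

/** Returns a sorted copy so the caller's array is left untouched. */
function exportOrder(rows: CandidateRow[]): CandidateRow[] {
  return rows.slice().sort(compareNewestFirst);
}
```

Without the second key, two rows inserted in the same millisecond could come back in either order depending on how the input happened to be arranged.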

Replayed onto master post-Cathedral II. Same semantics as the original v0.21.0 work, only adjusted to import HybridSearchMeta from types.ts (canonical home) instead of redeclaring it locally.
src/core/eval-capture-scrub.ts - pure-function regex scrubber with 6 pattern families: emails, phones (US + E.164), SSN (year-aware), Luhn-verified credit cards, JWT-shaped tokens, bearer tokens. Zero deps. Adversarial-input safe.
src/core/eval-capture.ts - op-layer hook helper:
- buildEvalCandidateInput(ctx, {scrub_pii}) - pure row builder
- classifyCaptureFailure(err) - Postgres SQLSTATE → reason tag
- captureEvalCandidate(engine, ctx, opts) - best-effort, never throws
- isEvalCaptureEnabled / isEvalScrubEnabled - file-plane config checks
GBrainConfig gains `eval?: {capture?, scrub_pii?}`. Both default ON. File-plane only: `gbrain config set` writes the DB plane, doesn't control capture.
Tests: 17 scrubber + 21 capture-module cases. Zero regressions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replayed onto master. Adapted from the original v0.21.0 work to keep Cathedral II's contract intact: hybridSearch's return stays `Promise<SearchResult[]>` (unchanged), and meta surfaces via an optional `onMeta?: (meta: HybridSearchMeta) => void` callback in HybridSearchOpts. Cathedral II callers leave onMeta undefined and pay no cost.
The op-layer capture wrapper passes a closure that threads meta into the captured row so gbrain-evals can distinguish:
- "with OPENAI_API_KEY" vs "keyword-only fallback" (vector_enabled)
- "expansion fired" vs "expansion requested + silently fell back" (expansion_applied)
- what hybridSearch actually used after auto-detect (detail_resolved)
Op-layer capture wired into both `query` and `search` op handlers in src/core/operations.ts. Single hook site catches MCP dispatch + CLI + subagent tool-bridge from the same place. Fire-and-forget, never throws, respects ctx.config.eval.capture off-switch.
Tests:
- test/hybrid-meta.test.ts (8 cases) - onMeta accuracy across the 4 return paths in hybridSearch + verification that omitting onMeta leaves Cathedral II callers unchanged.
- test/mcp-eval-capture.test.ts (10 cases) - query/search ops capture correctly with MCP/CLI/subagent contexts, scrub on/off, capture=false off-switch, non-captured ops (list_pages, get_page), F1 failure isolation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…Lane 1D) Replayed onto master. Same semantics as the original v0.21.0 work.
CLI:
  gbrain eval export [--since DUR] [--limit N] [--tool query|search]
    NDJSON to stdout, every row prefixed with "schema_version":1 per docs/eval-capture.md contract. EPIPE-safe streaming, stderr heartbeats, deterministic ordering (created_at DESC, id DESC).
  gbrain eval prune --older-than DUR [--dry-run]
    Explicit retention cleanup. Requires --older-than (never deletes without a window). Duration strings: 30d, 7d, 1h, 90m, 3600s.
Legacy bare `gbrain eval --qrels …` still works via sub-subcommand fall-through.
gbrain doctor gains an eval_capture check between markdown_body_completeness and queue_health: reads eval_capture_failures for the last 24h, groups by reason, warns when non-zero. Pre-v30 brains get "Skipped (table unavailable)" (non-fatal).
docs/eval-capture.md ships the stable NDJSON schema reference for gbrain-evals consumers.
Tests: 9 export cases + 5 prune cases. Doctor check covered by existing doctor tests on master.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
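The duration strings and the per-row schema_version prefix can be sketched as two small helpers. Both are hypothetical: the function names are invented, the accepted units are inferred from the five examples in the commit message, and reading "prefixed" as "first key of each JSON object" is an assumption.

```typescript
// Units inferred from the examples 30d, 7d, 1h, 90m, 3600s.
const UNIT_MS: Record<string, number> = {
  s: 1000,
  m: 60000,
  h: 3600000,
  d: 86400000,
};

/** Parses strings such as "90m" into milliseconds; throws on anything else. */
function parseDurationMs(input: string): number {
  const match = /^(\d+)([smhd])$/.exec(input.trim());
  const unitMs = match ? UNIT_MS[match[2] ?? ""] : undefined;
  if (!match || unitMs === undefined) {
    throw new Error(`invalid duration: ${input}`);
  }
  return Number(match[1]) * unitMs;
}

/** Serializes one row as an NDJSON line with schema_version as the first key. */
function toNdjsonLine(row: Record<string, unknown>): string {
  // String keys keep insertion order, so schema_version is emitted first.
  return JSON.stringify({ schema_version: 1, ...row }) + "\n";
}
```

Rejecting unparseable input outright matches the stated rule that prune never deletes without an explicit window.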

…/ R2) Master locks 17 public subpath exports as gbrain's stable third-party contract. Zero enforcement existed. This PR locks the surface in two layers:
1. test/public-exports.test.ts - runtime contract test. Reads package.json "exports" at startup. For each subpath, imports via the package name ("gbrain/engine"), NOT the relative filesystem path; that's the difference between exercising the actual resolver and bypassing it. Every subpath gets a canary symbol pinned (e.g. gbrain/search/hybrid must export hybridSearch + rrfFusion) so a refactor that renames or removes one fails CI before downstream consumers (gbrain-evals) silently break.
2. scripts/check-exports-count.sh - CI structural guard. Wired into `bun test` after check-jsonb-pattern.sh + check-progress-to-stdout.sh + check-wasm-embedded.sh per master's precedent. EXPECTED_COUNT=17 baseline: shrinks fail loudly, growth also fails so the new canary must be pinned in the runtime test deliberately.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ne 3) Bump VERSION + package.json to 0.22.0 (next free slot after master's v0.21.0 Code Cathedral II minor).
CHANGELOG.md v0.22.0 entry follows the Garry voice template:
- Bold 2-line headline
- Lead paragraph contextualizing v0.20 + v0.21 + v0.22 progression
- Numbers-that-matter table comparing v0.21.0 → v0.22.0
- "What this means for you" sectioned by audience
- "## To take advantage of v0.22.0" operator runbook
- Itemized changes
CLAUDE.md updates:
- Key files: 8 new module entries (eval-capture*, eval-export, eval-prune, docs/eval-capture.md, public-exports test). hybrid.ts entry rewritten to reflect the additive `onMeta` callback (return shape unchanged).
- Key commands: new v0.22.0 section for `gbrain eval export`, `gbrain eval prune`, and the doctor `eval_capture` check, with the file-plane vs DB-plane config gotcha called out.
README.md: one-paragraph pointer after the BrainBench blurb so anyone reading the landing page sees the new session-capture feature.
llms.txt + llms-full.txt regenerated to pick up the doc additions.
test/e2e/eval-capture.test.ts (Postgres-only E1 spec):
- CHECK violation surfaces as Postgres SQLSTATE 23514 on oversize input
- RLS is actually enabled on both eval_candidates + eval_capture_failures
- 50 concurrent logEvalCandidate calls: no deadlock, all distinct IDs
Skips gracefully when DATABASE_URL is unset.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pre-existing on master, surfaces ~27 false failures when bun test runs all 174 files together. Each failing file passes in isolation. Tracked for a dedicated investigation branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two surgical fixes from /ship adversarial review, plus 6 follow-ups TODO'd into v0.22.1:
- doctor.ts: distinguish pre-v30 missing-table (42P01, ok skip) from RLS-denied SELECT (42501, warn) and other DB errors (warn). The check exists specifically to surface capture-failure misconfigs cross-process, so silently reporting "ok / skipped" on the most diagnostic class defeated the purpose.
- hybrid.ts: wrap onMeta invocation in try/catch via small emitMeta helper. The callback is part of the public gbrain/search/hybrid contract; a throwing user-supplied closure must never break the search hot path.
- TODOS.md: 6 P1 follow-ups (eval prune real COUNT, scrubber CC false positives, dead 'scrubber_exception' enum value, id-cursor for cross-window dedup, public-export canary pinning, EXPECTED_COUNT dedup).
- TODOS.md: P0 entry for the pre-existing PGLite test-runner concurrency flake (~27 false failures in full bun test on master).
- CHANGELOG.md: 2 bullets noting the doctor + onMeta hardening.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
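The three-way split in the doctor check can be sketched as a classifier over the error's SQLSTATE. The two codes and the "Skipped (table unavailable)" wording come from the commit messages; the function name, the result shape, the other message strings, and the assumption that the driver exposes the code on an `err.code` property are all illustrative.

```typescript
type CheckStatus = "ok" | "warn";

interface CheckResult {
  status: CheckStatus;
  message: string;
}

/** Maps a database error to a doctor result based on its SQLSTATE code. */
function classifyDoctorError(err: unknown): CheckResult {
  // Assumption: the driver attaches the SQLSTATE as a `code` property.
  const code =
    typeof err === "object" && err !== null && "code" in err
      ? String((err as { code: unknown }).code)
      : "";
  if (code === "42P01") {
    // undefined_table: the brain predates migration v30.
    return { status: "ok", message: "Skipped (table unavailable)" };
  }
  if (code === "42501") {
    // insufficient_privilege: RLS or grants block the SELECT.
    return { status: "warn", message: "permission denied reading eval_capture_failures" };
  }
  return { status: "warn", message: `eval_capture check failed (${code || "unknown error"})` };
}
```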

Master is at v0.21.0. Open PRs claim v0.21.1 (#432) and v0.24.0 (#387). v0.25 is the first uncontested slot, so this branch claims it. Pure rename across VERSION, package.json, CHANGELOG header, and every "v0.22.0" reference in CLAUDE.md / README.md / TODOS.md / docs/eval-capture.md / src/ / test/ files. CHANGELOG date bumped to 2026-04-26. llms.txt + llms-full.txt regenerated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…n-capture # Conflicts: # CHANGELOG.md # CLAUDE.md # VERSION # llms-full.txt # package.json # src/commands/doctor.ts # test/migrate.test.ts

…n-capture # Conflicts: # CHANGELOG.md # VERSION # package.json # src/core/migrate.ts # src/core/schema-embedded.ts # src/schema.sql

Closes the gap between "session capture works" (this PR's core) and "contributors actually use it before merging." Three artifacts:
- src/commands/eval-replay.ts (~340 LOC) - reads NDJSON from `gbrain eval export`, re-runs each captured query/search against the current brain, computes set-Jaccard@k, top-1 stability, and latency delta. Stable JSON shape (schema_version:1) for CI gating; human mode prints a regression table sorted worst-first. Pure Bun, zero new deps. Stub-engine tests cover Jaccard math, NDJSON parser (including v2 forward-compat rejection + line-numbered errors), --limit, --verbose, --json, and graceful per-row error handling. 16/16 passing.
- docs/eval-bench.md (~80 lines) - contributor guide. The 4-command loop (export → change → replay → diff), metric definitions with healthy ranges (Jaccard ≥0.85, top-1 ≥85%, latency Δ within ±50ms), trigger paths, CI integration snippet, hand-crafted NDJSON corpus path for fresh installs, and the off-switch. Pairs with the existing docs/eval-capture.md which is the consumer-facing wire format.
- CONTRIBUTING.md gains a "Running real-world eval benchmarks (touching retrieval code)" section with the trigger paths and a link to docs/eval-bench.md. Reviewers now have a one-line ask: "did you run replay?"
CLAUDE.md key files updated. CHANGELOG bullets added. llms.txt regenerated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
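The two set-based replay metrics can be sketched as follows. This is not the code in src/commands/eval-replay.ts: the function names, the use of string result IDs, and the convention that two empty result lists score 1 are assumptions made for illustration.

```typescript
/** Set-Jaccard over the top-k results of two ranked lists: |A ∩ B| / |A ∪ B|. */
function jaccardAtK(before: string[], after: string[], k: number): number {
  const a = new Set(before.slice(0, k));
  const b = new Set(after.slice(0, k));
  // Assumed convention: two empty result sets count as identical.
  if (a.size === 0 && b.size === 0) return 1;
  let shared = 0;
  a.forEach((id) => {
    if (b.has(id)) shared++;
  });
  return shared / (a.size + b.size - shared);
}

/** True when the first-ranked result is unchanged between the two runs. */
function top1Stable(before: string[], after: string[]): boolean {
  return before[0] === after[0];
}
```

For example, captured results [a, b, c] replayed as [b, c, d] share two of four distinct IDs at k = 3, so the score is 0.5, well below the 0.85 healthy threshold quoted in the contributor guide.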

Eval capture was on for everyone in the v0.25.0 draft. Privacy footgun: end users had retrieval traffic accumulate in their brain DB without asking, even with PII scrubbing. Flips to off by default + explicit opt-in for contributors who actually use the replay loop.
Resolution order in isEvalCaptureEnabled():
1. config.eval.capture === true → on
2. config.eval.capture === false → off
3. process.env.GBRAIN_CONTRIBUTOR_MODE === '1' → on
4. otherwise → off
The env var is the contributor-facing toggle (one line in .zshrc, no JSON edit). Explicit config wins both directions for users who want to override per-brain. PII scrubbing gate stays independent (default true regardless of CONTRIBUTOR_MODE) so any brain that does capture still scrubs.
Tests rewritten: env var hygiene per-test (origMode preserved + restored in finally). 9/9 pass; total v0.25.0 suite is 198/198.
Docs:
- README.md gains a Contributing-section pointer to the env var.
- CONTRIBUTING.md gains a "CONTRIBUTOR_MODE — turn on the dev loop" section with verification commands and resolution-order table.
- docs/eval-bench.md leads with the prerequisite (must set the env var for the rest of the doc to be useful).
- docs/eval-capture.md "Config" section split into Path A (env var) + Path B (config) with explicit resolution-order rules.
- CHANGELOG v0.25.0 entry corrected ("on by default" was wrong) plus a new top itemized bullet calling out the gate change.
- CLAUDE.md eval-capture entry annotated with the new gate logic.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
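The four-step resolution order translates directly into a small pure function. The sketch below is an illustration rather than the shipped isEvalCaptureEnabled(): the name resolveEvalCapture, the config type, and passing the environment in as a parameter (instead of reading process.env) are assumptions chosen to keep it testable.

```typescript
interface EvalConfig {
  capture?: boolean;
  scrub_pii?: boolean;
}

interface ConfigLike {
  eval?: EvalConfig;
}

/** Explicit config wins in both directions; the env var only applies when config is silent. */
function resolveEvalCapture(
  config: ConfigLike | undefined,
  env: Record<string, string | undefined>,
): boolean {
  const explicit = config?.eval?.capture;
  if (explicit === true) return true; // step 1
  if (explicit === false) return false; // step 2
  return env["GBRAIN_CONTRIBUTOR_MODE"] === "1"; // steps 3 and 4
}
```

Only the exact string "1" enables capture, so values such as "true" or an empty string leave it off.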

Cross-references every doc against the final state of the branch (CONTRIBUTOR_MODE flag, eval replay tool, off-by-default capture):
- README.md: top callout rewritten; it was implying capture-on-by-default, contradicting the gate landed in 7a80ce2. Now leads with "contributor opt-in" and links docs/eval-bench.md alongside docs/eval-capture.md.
- AGENTS.md: new "Eval retrieval changes" task entry with the CONTRIBUTOR_MODE+replay one-liner so non-Claude agents (Codex, Cursor, Aider) have the same path.
- CLAUDE.md: "Key commands added in v0.25.0" gains the replay command and a CONTRIBUTOR_MODE bullet covering the resolution order.
- CHANGELOG.md: headline rewritten to match the actual feature ("benchmark retrieval changes against real captured queries before merging"; was "every real query is captured"). Stale "v0.22 ships the substrate" → v0.25. Test count corrected 82 → 144 (added 16 replay + 9 CONTRIBUTOR_MODE + 8 v31-shape tests since the original count). Two metric rows added to the numbers table: default-off posture, in-tree replay tooling. "To take advantage" block split into user vs contributor branches with shell-rc instructions.
- TODOS.md: v0.22.1 follow-up reference corrected to v0.25.1.
llms.txt + llms-full.txt regenerated. Typecheck clean. 198/198 v0.25.0 tests still green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

garrytan and others added 9 commits April 25, 2026 21:50

garrytan changed the title ~~v0.22.0 feat: BrainBench-Real session capture + public-exports contract test~~ v0.25.0 feat: BrainBench-Real session capture + public-exports contract test Apr 26, 2026

garrytan and others added 5 commits April 29, 2026 11:49

Merge remote-tracking branch 'origin/master' into garrytan/mcp-sessio…

9fd8c09


Merge remote-tracking branch 'origin/master' into garrytan/mcp-sessio…

ec32a56


garrytan merged commit 736e8de into master May 1, 2026
7 checks passed

garrytan merged 14 commits into master from garrytan/mcp-session-capture
