v0.25.0 feat: BrainBench-Real session capture + public-exports contract test #437
Merged
R1 substrate for BrainBench-Real, replayed onto master after Cathedral II
landed. Migration v30 (slotted after master's v25-v29 Cathedral II wave)
creates two tables:
eval_candidates: per-call capture of MCP/CLI/subagent query+search
traffic. Column set lets gbrain-evals replay with full fidelity —
source_ids from v0.18 multi-source, vector_enabled/detail_resolved/
expansion_applied so replay knows what hybridSearch actually did,
remote + job_id + subagent_id so rows are traceable to their origin.
query is CHECK-capped at 50KB; PII scrubber (Lane 1B) runs before insert.
eval_capture_failures: cross-process audit trail. In-process counters
don't work because `gbrain doctor` runs in a separate process from
the MCP server. Persistent rows let doctor query capture health via
COUNT(*) GROUP BY reason over the last 24h.
Both tables get RLS on Postgres gated on BYPASSRLS (matches v24/v29
posture). PGLite ignores RLS; sqlFor split carries only DDL.
5 new BrainEngine methods (breaking-interface addition, drives v0.22.0
minor bump): logEvalCandidate, listEvalCandidates,
deleteEvalCandidatesBefore, logEvalCaptureFailure, listEvalCaptureFailures.
listEvalCandidates uses ORDER BY created_at DESC, id DESC so
`gbrain eval export` is deterministic across same-millisecond inserts.
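The tie-break can be illustrated with a small in-memory comparator. This is only a sketch: the `CandidateRow` shape and the `compareForExport` name are hypothetical, and the actual ordering is done in SQL by the engine.

```typescript
// Hypothetical row shape for illustration; the real table has more columns.
interface CandidateRow {
  id: number;
  created_at: number; // epoch milliseconds
}

// Mirrors ORDER BY created_at DESC, id DESC: newest first, and rows that
// share a millisecond fall back to descending id, so the order is stable.
function compareForExport(a: CandidateRow, b: CandidateRow): number {
  if (a.created_at !== b.created_at) return b.created_at - a.created_at;
  return b.id - a.id;
}

const sampleRows: CandidateRow[] = [
  { id: 1, created_at: 1000 },
  { id: 3, created_at: 1000 }, // same millisecond as id 1
  { id: 2, created_at: 2000 },
];
const orderedIds = [...sampleRows].sort(compareForExport).map((r) => r.id);
console.log(orderedIds); // [2, 3, 1]
```

Without the `id` tie-break, the two rows sharing `created_at = 1000` could come back in either order between runs.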
Also adds HybridSearchMeta type for the side-channel callback used by
Lane 1C's op-layer capture (no change to hybridSearch return shape —
that respects Cathedral II's existing SearchResult[] contract).
Tests: 14 PGLite round-trip cases + 8 v30 structural assertions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replayed onto master post-Cathedral II. Same semantics as the original
v0.21.0 work — only adjusted to import HybridSearchMeta from types.ts
(canonical home) instead of redeclaring it locally.
src/core/eval-capture-scrub.ts — pure-function regex scrubber with 6
pattern families: emails, phones (US + E.164), SSN (year-aware),
Luhn-verified credit cards, JWT-shaped tokens, bearer tokens. Zero
deps. Adversarial-input safe.
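To make "Luhn-verified" concrete, here is a minimal sketch of a card scrubber that only redacts digit runs passing the Luhn checksum. The function names, the candidate regex, and the `[REDACTED_CC]` placeholder are assumptions for illustration, not the contents of eval-capture-scrub.ts.

```typescript
// Returns true when a string of digits passes the Luhn checksum.
function passesLuhn(digits: string): boolean {
  if (!/^\d+$/.test(digits)) return false;
  let sum = 0;
  let doubleIt = false; // the rightmost digit is not doubled
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48; // '0' is char code 48
    if (doubleIt) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    doubleIt = !doubleIt;
  }
  return sum % 10 === 0;
}

// Hypothetical scrub step: find 13-19 digit runs (spaces or dashes allowed
// between digits) and redact only those that pass the checksum.
function scrubCards(text: string): string {
  return text.replace(/\b\d(?:[ -]?\d){12,18}\b/g, (match) => {
    const digits = match.replace(/[ -]/g, "");
    return passesLuhn(digits) ? "[REDACTED_CC]" : match;
  });
}

console.log(scrubCards("card 4111 1111 1111 1111 ok")); // card [REDACTED_CC] ok
```

The checksum step is what keeps arbitrary long numbers (order IDs, timestamps) from being redacted just because they look card-shaped.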
src/core/eval-capture.ts — op-layer hook helper:
- buildEvalCandidateInput(ctx, {scrub_pii}) — pure row builder
- classifyCaptureFailure(err) — Postgres SQLSTATE → reason tag
- captureEvalCandidate(engine, ctx, opts) — best-effort, never throws
- isEvalCaptureEnabled / isEvalScrubEnabled — file-plane config checks
GBrainConfig gains `eval?: {capture?, scrub_pii?}`. Both default ON.
File-plane only — `gbrain config set` writes the DB plane, doesn't
control capture.
Tests: 17 scrubber + 21 capture-module cases. Zero regressions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replayed onto master. Adapted from the original v0.21.0 work to keep
Cathedral II's contract intact: hybridSearch's return stays
`Promise<SearchResult[]>` (unchanged), and meta surfaces via an optional
`onMeta?: (meta: HybridSearchMeta) => void` callback in HybridSearchOpts.
Cathedral II callers leave onMeta undefined and pay no cost. The
op-layer capture wrapper passes a closure that threads meta into the
captured row so gbrain-evals can distinguish:
- "with OPENAI_API_KEY" vs "keyword-only fallback" (vector_enabled)
- "expansion fired" vs "expansion requested + silently fell back" (expansion_applied)
- what hybridSearch actually used after auto-detect (detail_resolved)
Op-layer capture wired into both `query` and `search` op handlers in
src/core/operations.ts. Single hook site catches MCP dispatch + CLI +
subagent tool-bridge from the same place. Fire-and-forget, never throws,
respects ctx.config.eval.capture off-switch.
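The side-channel pattern described above can be sketched as follows. The three meta field names come from this commit message; their types, the `sketchSearch` stand-in, and the `captureRun` wrapper are assumptions. The stand-in is synchronous for brevity, whereas hybridSearch returns a Promise.

```typescript
// Field names are from the commit message; the field types are assumed.
interface HybridSearchMeta {
  vector_enabled: boolean;
  expansion_applied: boolean;
  detail_resolved: string;
}
interface SearchResult { slug: string; score: number }
interface SearchOpts { onMeta?: (meta: HybridSearchMeta) => void }

// Hypothetical stand-in for a search function. The return shape carries
// results only; meta leaves through the optional callback.
function sketchSearch(query: string, opts: SearchOpts = {}): SearchResult[] {
  const meta: HybridSearchMeta = {
    vector_enabled: false, // e.g. keyword-only fallback
    expansion_applied: false,
    detail_resolved: "low",
  };
  if (opts.onMeta) {
    try {
      opts.onMeta(meta);
    } catch {
      // Defensive choice in this sketch: a throwing callback is swallowed
      // so it cannot fail the search.
    }
  }
  return [{ slug: `result-for-${query}`, score: 1 }];
}

// Capture wrapper: a closure collects meta and threads it into the row.
function captureRun(query: string) {
  const seen: HybridSearchMeta[] = [];
  const results = sketchSearch(query, { onMeta: (m) => { seen.push(m); } });
  return {
    query,
    result_count: results.length,
    vector_enabled: seen.length > 0 ? seen[0].vector_enabled : null,
  };
}
```

Callers that never pass `onMeta` take the plain path and see the same array return as before.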
Tests:
- test/hybrid-meta.test.ts (8 cases) — onMeta accuracy across the 4
return paths in hybridSearch + verification that omitting onMeta
leaves Cathedral II callers unchanged.
- test/mcp-eval-capture.test.ts (10 cases) — query/search ops capture
correctly with MCP/CLI/subagent contexts, scrub on/off, capture=false
off-switch, non-captured ops (list_pages, get_page), F1 failure
isolation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Lane 1D)
Replayed onto master. Same semantics as the original v0.21.0 work.
CLI:
gbrain eval export [--since DUR] [--limit N] [--tool query|search]
NDJSON to stdout, every row prefixed with "schema_version":1 per
docs/eval-capture.md contract. EPIPE-safe streaming, stderr
heartbeats, deterministic ordering (created_at DESC, id DESC).
gbrain eval prune --older-than DUR [--dry-run]
Explicit retention cleanup. Requires --older-than (never deletes
without a window). Duration strings: 30d, 7d, 1h, 90m, 3600s.
Legacy bare `gbrain eval --qrels …` still works via sub-subcommand
fall-through.
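The duration strings accepted by `--since` and `--older-than` could be parsed along these lines. This is a sketch under the assumption that only the four suffixes shown above (s, m, h, d) are valid; `parseDurationMs` and `cutoffFor` are hypothetical names, not the CLI's actual helpers.

```typescript
const UNIT_MS = new Map<string, number>([
  ["s", 1_000],
  ["m", 60_000],
  ["h", 3_600_000],
  ["d", 86_400_000],
]);

// Parses strings such as "30d", "1h", "90m", "3600s" into milliseconds.
// Anything else throws, so a malformed window never reaches a DELETE.
function parseDurationMs(input: string): number {
  const match = /^(\d+)([smhd])$/.exec(input.trim());
  const unitMs = match ? UNIT_MS.get(match[2]) : undefined;
  if (!match || unitMs === undefined) {
    throw new Error(`Invalid duration: ${input}`);
  }
  return Number(match[1]) * unitMs;
}

// Rows created before this instant fall outside the retention window.
function cutoffFor(now: Date, olderThan: string): Date {
  return new Date(now.getTime() - parseDurationMs(olderThan));
}

console.log(parseDurationMs("90m")); // 5400000
```

Rejecting unparseable input instead of defaulting to zero fits the "never deletes without a window" rule for prune.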
gbrain doctor gains an eval_capture check between markdown_body_completeness
and queue_health: reads eval_capture_failures for the last 24h, groups by
reason, warns when non-zero. Pre-v30 brains get "Skipped (table
unavailable)" — non-fatal.
docs/eval-capture.md ships the stable NDJSON schema reference for
gbrain-evals consumers.
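A consumer of the export stream might read it roughly like this. The sketch assumes only what is stated above (one JSON object per line, each carrying `schema_version: 1`); the `tool` and `query` fields in the sample and the choice to reject unknown versions are illustrative assumptions, and docs/eval-capture.md remains the authoritative schema.

```typescript
interface ExportRow {
  schema_version: 1;
  [key: string]: unknown;
}

// Parses NDJSON text, skipping blank lines. Errors carry a 1-based line
// number so a bad row in a large export is easy to locate.
function parseExportNdjson(text: string): ExportRow[] {
  const rows: ExportRow[] = [];
  text.split("\n").forEach((line, index) => {
    if (line.trim() === "") return;
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      throw new Error(`line ${index + 1}: invalid JSON`);
    }
    const version =
      typeof parsed === "object" && parsed !== null
        ? (parsed as { schema_version?: unknown }).schema_version
        : undefined;
    if (version !== 1) {
      throw new Error(`line ${index + 1}: unsupported schema_version`);
    }
    rows.push(parsed as ExportRow);
  });
  return rows;
}

const sampleExport =
  '{"schema_version":1,"tool":"query","query":"first"}\n' +
  '{"schema_version":1,"tool":"search","query":"second"}\n';
console.log(parseExportNdjson(sampleExport).length); // 2
```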
Tests: 9 export cases + 5 prune cases. Doctor check covered by
existing doctor tests on master.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…/ R2)
Master locks 17 public subpath exports as gbrain's stable third-party
contract. Zero enforcement existed. This PR locks the surface in two
layers:
1. test/public-exports.test.ts — runtime contract test.
Reads package.json "exports" at startup. For each subpath, imports
via the package name ("gbrain/engine"), NOT the relative filesystem
path — that's the difference between exercising the actual resolver
and bypassing it. Every subpath gets a canary symbol pinned (e.g.
gbrain/search/hybrid must export hybridSearch + rrfFusion) so a
refactor that renames or removes one fails CI before downstream
consumers (gbrain-evals) silently break.
2. scripts/check-exports-count.sh — CI structural guard.
Wired into `bun test` after check-jsonb-pattern.sh +
check-progress-to-stdout.sh + check-wasm-embedded.sh per master's
precedent. EXPECTED_COUNT=17 baseline — shrinks fail loudly,
growth also fails so the new canary must be pinned in the runtime
test deliberately.
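The two layers can be combined into one in-memory check for illustration. This sketch does not import anything through the package resolver, which is what the real test does; instead `loaded` is a hypothetical map of already-imported modules, and `checkExportsContract` is an invented name.

```typescript
type ExportsMap = Record<string, unknown>;
type LoadedModules = Record<string, Record<string, unknown>>;

// Returns a list of contract violations; an empty list means the surface
// matches the pinned count and every subpath exposes its canary symbols.
function checkExportsContract(
  exportsMap: ExportsMap,
  expectedCount: number,
  canaries: Record<string, string[]>,
  loaded: LoadedModules,
): string[] {
  const problems: string[] = [];
  const subpaths = Object.keys(exportsMap);
  // Both shrinkage and growth are violations: growth forces a new canary.
  if (subpaths.length !== expectedCount) {
    problems.push(`expected ${expectedCount} subpaths, found ${subpaths.length}`);
  }
  for (const subpath of subpaths) {
    const symbols = canaries[subpath];
    if (!symbols || symbols.length === 0) {
      problems.push(`${subpath}: no canary symbol pinned`);
      continue;
    }
    const mod = loaded[subpath] ?? {};
    for (const symbol of symbols) {
      if (!(symbol in mod)) problems.push(`${subpath}: missing export ${symbol}`);
    }
  }
  return problems;
}
```

For example, with a single `./search/hybrid` subpath pinned to `hybridSearch` and `rrfFusion`, renaming `rrfFusion` yields one "missing export" problem, and adding a second subpath without a canary yields both a count mismatch and a "no canary symbol pinned" problem.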
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ne 3)
Bump VERSION + package.json to 0.22.0 (next free slot after master's
v0.21.0 Code Cathedral II minor).
CHANGELOG.md v0.22.0 entry follows the Garry voice template:
- Bold 2-line headline
- Lead paragraph contextualizing v0.20 + v0.21 + v0.22 progression
- Numbers-that-matter table comparing v0.21.0 → v0.22.0
- "What this means for you" sectioned by audience
- "## To take advantage of v0.22.0" operator runbook
- Itemized changes
CLAUDE.md updates:
- Key files: 8 new module entries (eval-capture*, eval-export,
eval-prune, docs/eval-capture.md, public-exports test).
hybrid.ts entry rewritten to reflect the additive `onMeta` callback
(return shape unchanged).
- Key commands: new v0.22.0 section for `gbrain eval export`,
`gbrain eval prune`, and the doctor `eval_capture` check, with the
file-plane vs DB-plane config gotcha called out.
README.md: one-paragraph pointer after the BrainBench blurb so anyone
reading the landing page sees the new session-capture feature.
llms.txt + llms-full.txt regenerated to pick up the doc additions.
test/e2e/eval-capture.test.ts (Postgres-only E1 spec):
- CHECK violation surfaces as Postgres SQLSTATE 23514 on oversize input
- RLS is actually enabled on both eval_candidates + eval_capture_failures
- 50 concurrent logEvalCandidate calls — no deadlock, all distinct IDs
Skips gracefully when DATABASE_URL is unset.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-existing on master, surfaces ~27 false failures when bun test runs
all 174 files together. Each failing file passes in isolation. Tracked
for a dedicated investigation branch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two surgical fixes from /ship adversarial review, plus 6 follow-ups
TODO'd into v0.22.1:
- doctor.ts: distinguish pre-v30 missing-table (42P01, ok skip) from
RLS-denied SELECT (42501, warn) and other DB errors (warn). The check
exists specifically to surface capture-failure misconfigs
cross-process, so silently reporting "ok / skipped" on the most
diagnostic class defeated the purpose.
- hybrid.ts: wrap onMeta invocation in try/catch via small emitMeta
helper. The callback is part of the public gbrain/search/hybrid
contract; a throwing user-supplied closure must never break the search
hot path.
- TODOS.md: 6 P1 follow-ups (eval prune real COUNT, scrubber CC false
positives, dead 'scrubber_exception' enum value, id-cursor for
cross-window dedup, public-export canary pinning, EXPECTED_COUNT dedup).
- TODOS.md: P0 entry for the pre-existing PGLite test-runner concurrency
flake (~27 false failures in full bun test on master).
- CHANGELOG.md: 2 bullets noting the doctor + onMeta hardening.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
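The doctor.ts distinction can be sketched as a small classifier over the SQLSTATE code. The codes and the ok/warn outcomes are from the commit message above; the function name, the message strings other than "Skipped (table unavailable)", and the assumption that the driver exposes the code as an `err.code` property are illustrative.

```typescript
interface CheckResult {
  status: "ok" | "warn";
  message: string;
}

// Maps a failed SELECT on eval_capture_failures to a doctor outcome.
// Assumes the database driver attaches the SQLSTATE as `code`.
function classifyDoctorError(err: unknown): CheckResult {
  const code =
    typeof err === "object" && err !== null && "code" in err
      ? String((err as { code: unknown }).code)
      : "";
  if (code === "42P01") {
    // undefined_table: brain predates migration v30, nothing to report.
    return { status: "ok", message: "Skipped (table unavailable)" };
  }
  if (code === "42501") {
    // insufficient_privilege: RLS denied the read, which is a misconfig.
    return { status: "warn", message: "SELECT denied on eval_capture_failures" };
  }
  return { status: "warn", message: `Unexpected database error (${code || "no SQLSTATE"})` };
}

console.log(classifyDoctorError({ code: "42P01" }).status); // ok
```

Only the missing-table case is treated as benign; everything else surfaces as a warning so a capture misconfiguration is not hidden behind a skip.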
Master is at v0.21.0. Open PRs claim v0.21.1 (#432) and v0.24.0 (#387).
v0.25 is the first uncontested slot, so this branch claims it. Pure
rename across VERSION, package.json, CHANGELOG header, and every
"v0.22.0" reference in CLAUDE.md / README.md / TODOS.md /
docs/eval-capture.md / src/ / test/ files. CHANGELOG date bumped to
2026-04-26. llms.txt + llms-full.txt regenerated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n-capture
# Conflicts:
# CHANGELOG.md
# CLAUDE.md
# VERSION
# llms-full.txt
# package.json
# src/commands/doctor.ts
# test/migrate.test.ts
…n-capture
# Conflicts:
# CHANGELOG.md
# VERSION
# package.json
# src/core/migrate.ts
# src/core/schema-embedded.ts
# src/schema.sql
Closes the gap between "session capture works" (this PR's core) and
"contributors actually use it before merging." Three artifacts:
- src/commands/eval-replay.ts (~340 LOC): reads NDJSON from
`gbrain eval export`, re-runs each captured query/search against the
current brain, computes set-Jaccard@k, top-1 stability, and latency
delta. Stable JSON shape (schema_version:1) for CI gating; human mode
prints a regression table sorted worst-first. Pure Bun, zero new deps.
Stub-engine tests cover Jaccard math, NDJSON parser (including v2
forward-compat rejection + line-numbered errors), --limit, --verbose,
--json, and graceful per-row error handling. 16/16 passing.
- docs/eval-bench.md (~80 lines): contributor guide. The 4-command loop
(export → change → replay → diff), metric definitions with healthy
ranges (Jaccard ≥0.85, top-1 ≥85%, latency Δ within ±50ms), trigger
paths, CI integration snippet, hand-crafted NDJSON corpus path for
fresh installs, and the off-switch. Pairs with the existing
docs/eval-capture.md, which is the consumer-facing wire format.
- CONTRIBUTING.md gains a "Running real-world eval benchmarks (touching
retrieval code)" section with the trigger paths and a link to
docs/eval-bench.md. Reviewers now have a one-line ask: "did you run
replay?"
CLAUDE.md key files updated. CHANGELOG bullets added. llms.txt
regenerated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
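The two set-based replay metrics named above can be sketched as follows. This follows the standard Jaccard definition (intersection over union of the top-k result sets); the function names, the use of string IDs for results, and the convention that two empty result sets score 1 are assumptions, not a copy of eval-replay.ts.

```typescript
// Jaccard similarity between the top-k results of the captured run and
// the replayed run: |A ∩ B| / |A ∪ B|.
function jaccardAtK(captured: string[], replayed: string[], k: number): number {
  const a = new Set(captured.slice(0, k));
  const b = new Set(replayed.slice(0, k));
  if (a.size === 0 && b.size === 0) return 1; // assumed convention
  let intersection = 0;
  a.forEach((id) => {
    if (b.has(id)) intersection++;
  });
  return intersection / (a.size + b.size - intersection);
}

// Top-1 stability for one row: did the first result stay the same?
// Two empty result lists compare as stable in this sketch.
function top1Stable(captured: string[], replayed: string[]): boolean {
  return captured[0] === replayed[0];
}

console.log(jaccardAtK(["a", "b", "c"], ["b", "c", "d"], 3)); // 0.5
```

In the example, two of four distinct results overlap, giving 0.5, which would fall below the 0.85 healthy range quoted in docs/eval-bench.md.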
Eval capture was on for everyone in the v0.25.0 draft. Privacy footgun:
end users had retrieval traffic accumulate in their brain DB without
asking, even with PII scrubbing. Flips to off by default + explicit
opt-in for contributors who actually use the replay loop.
Resolution order in isEvalCaptureEnabled():
1. config.eval.capture === true → on
2. config.eval.capture === false → off
3. process.env.GBRAIN_CONTRIBUTOR_MODE === '1' → on
4. otherwise → off
The env var is the contributor-facing toggle (one line in .zshrc, no
JSON edit). Explicit config wins both directions for users who want to
override per-brain.
PII scrubbing gate stays independent — default true regardless of
CONTRIBUTOR_MODE — so any brain that does capture still scrubs.
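The resolution order above reads naturally as a short function. This is a sketch, not the body of isEvalCaptureEnabled(): the environment is passed in as a parameter (rather than read from `process.env`) to keep the example testable, the type and function names are hypothetical, and it assumes an explicit `scrub_pii: false` is the only way to turn scrubbing off.

```typescript
interface EvalConfigSketch {
  capture?: boolean;
  scrub_pii?: boolean;
}
interface ConfigSketch {
  eval?: EvalConfigSketch;
}
type Env = Record<string, string | undefined>;

// Explicit config wins in both directions; the env var only applies
// when config says nothing; otherwise capture stays off.
function isCaptureEnabled(config: ConfigSketch, env: Env): boolean {
  const explicit = config.eval?.capture;
  if (explicit === true) return true;
  if (explicit === false) return false;
  return env["GBRAIN_CONTRIBUTOR_MODE"] === "1";
}

// Independent gate: scrubbing is on unless explicitly disabled, and it
// does not consult CONTRIBUTOR_MODE at all.
function isScrubEnabled(config: ConfigSketch): boolean {
  return config.eval?.scrub_pii !== false;
}

console.log(isCaptureEnabled({}, { GBRAIN_CONTRIBUTOR_MODE: "1" })); // true
console.log(isCaptureEnabled({ eval: { capture: false } }, { GBRAIN_CONTRIBUTOR_MODE: "1" })); // false
```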
Tests rewritten: env var hygiene per-test (origMode preserved + restored
in finally). 9/9 pass; total v0.25.0 suite is 198/198.
Docs:
- README.md gains a Contributing-section pointer to the env var.
- CONTRIBUTING.md gains a "CONTRIBUTOR_MODE — turn on the dev loop"
section with verification commands and resolution-order table.
- docs/eval-bench.md leads with the prerequisite (must set the env var
for the rest of the doc to be useful).
- docs/eval-capture.md "Config" section split into Path A (env var) +
Path B (config) with explicit resolution-order rules.
- CHANGELOG v0.25.0 entry corrected ("on by default" was wrong) plus a
new top itemized bullet calling out the gate change.
- CLAUDE.md eval-capture entry annotated with the new gate logic.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cross-references every doc against the final state of the branch
(CONTRIBUTOR_MODE flag, eval replay tool, off-by-default capture):
- README.md: top callout rewritten. It was implying
capture-on-by-default, contradicting the gate landed in 7a80ce2. Now
leads with "contributor opt-in" and links docs/eval-bench.md alongside
docs/eval-capture.md.
- AGENTS.md: new "Eval retrieval changes" task entry with the
CONTRIBUTOR_MODE+replay one-liner so non-Claude agents (Codex, Cursor,
Aider) have the same path.
- CLAUDE.md: "Key commands added in v0.25.0" gains the replay command
and a CONTRIBUTOR_MODE bullet covering the resolution order.
- CHANGELOG.md: headline rewritten to match the actual feature
("benchmark retrieval changes against real captured queries before
merging"; was "every real query is captured"). Stale "v0.22 ships the
substrate" → v0.25. Test count corrected 82 → 144 (added 16 replay + 9
CONTRIBUTOR_MODE + 8 v31-shape tests since the original count). Two
metric rows added to the numbers table: default-off posture, in-tree
replay tooling. "To take advantage" block split into user vs
contributor branches with shell-rc instructions.
- TODOS.md: v0.22.1 follow-up reference corrected to v0.25.1.
llms.txt + llms-full.txt regenerated. Typecheck clean. 198/198 v0.25.0
tests still green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
v0.25.0 ships BrainBench-Real: once capture is enabled (it is off by default; opt in with `GBRAIN_CONTRIBUTOR_MODE=1` or `eval.capture` in config), every real `query` and `search` your agents run via MCP, CLI, or the subagent tool-bridge gets captured into an `eval_candidates` table, scrubbed of PII at write, and streamable as NDJSON for replay against gbrain-evals. Plus a public-exports contract test locks the 17-subpath surface gbrain-evals depends on.

Capture (R1)
- `eval_candidates` + `eval_capture_failures` (Postgres + PGLite). RLS gated on BYPASSRLS, CHECK constraint on `length(query) <= 51200`, indexes on `created_at DESC` for export and `ts DESC` for doctor's 24h window.
- `src/core/eval-capture.ts` decorates `query`/`search` handlers in `src/core/operations.ts`, covering MCP, CLI, and subagent tool-bridge from one site.
- `src/core/eval-capture-scrub.ts`: 6 regex families (email, phone, SSN, Luhn-verified CC, JWT, bearer). Adversarial-input safe.
- `engine.logEvalCaptureFailure` + `gbrain doctor` 24h breakdown.
- `gbrain eval export [--since DUR] [--limit N] [--tool ...]` streams NDJSON with `schema_version: 1` prefix per row. EPIPE-safe, deterministic ordering (`created_at DESC, id DESC`).
- `gbrain eval prune --older-than DUR [--dry-run]` for explicit retention.
- `BrainEngine` interface gains 5 methods (drives v0.25.0 minor bump for downstream custom-engine implementers).
- `hybridSearch` opts gain `onMeta?: (meta) => void` callback. Cathedral II callers unaffected; return type stays `Promise<SearchResult[]>`.
- `eval: { capture?, scrub_pii? }` in `~/.gbrain/config.json` (file-plane only; `scrub_pii` defaults to true, `capture` is off unless set explicitly or `GBRAIN_CONTRIBUTOR_MODE=1`).

Public exports contract (R2)
- `test/public-exports.test.ts` imports each of the 17 subpaths via package name + pins canary symbols.
- `scripts/check-exports-count.sh` CI guard wired into `bun test`: a count change fails the build.

Adversarial review fixes (commit c4dd09d)
- `gbrain doctor` `eval_capture` check now distinguishes pre-v30 missing-table (ok / skipped) from RLS-denied SELECT (warn) and other DB errors (warn). Previously masked the most diagnostic class.
- `hybridSearch.onMeta` invocation wrapped in try/catch: a throwing user-supplied callback can't break the search hot path.

Plan completion (CEO + Eng + Codex reviewed)
17/17 final-scope items shipped. 3 CHANGED with documented rationale: migration v25 → v30 (slots taken), hybridSearch meta as `onMeta` callback vs return shape, and v0.25.0 vs original v0.21.0 (rebased forward through Cathedral II). 3 deferred items (E3 bounded queue, eval log/stats, answer_text writes) handled correctly per plan.

Test Coverage
92% AI-assessed coverage. Two minor gaps documented:
- `src/commands/doctor.ts` `eval_capture` check has no dedicated unit test for ok / warn / skipped branches.
- `src/core/config.ts` `eval` key has no `loadConfig()` round-trip test.

Tests: 184 v0.25.0-related test cases (8 new files + migrate.test.ts extension + Postgres E2E gated on `DATABASE_URL`).

Pre-Landing Review
Ran prior to this PR (logged via `plan-eng-review` × 2 rounds + `plan-ceo-review` + codex outside-voice). Status CLEAR. Two surgical fixes applied this run from /ship adversarial review (commit c4dd09d).

Adversarial Review
Found 8 actionable items: 2 fixed in this PR (commit c4dd09d), 6 TODO'd as P1 v0.22.1 follow-ups in `TODOS.md`:
- `gbrain eval prune --dry-run`: replace `listEvalCandidates(limit:100k) + filter` with real `engine.countEvalCandidatesBefore(date)` (today the 100k-cap warning at `eval-prune.ts:107-109` is honest but a brain with > 100k rows could still confuse).
- `eval_capture_failures.reason` enum value `'scrubber_exception'` is dead telemetry (regex-only scrubber never throws). Remove or wrap.
- `id DESC` claim doesn't hold across overlapping windows when LIMIT < total. Either add an `id`-cursor for export or scope the doc claim.
- `EXPECTED_COUNT` duplicated in `scripts/check-exports-count.sh` and `test/public-exports.test.ts`: move to a single source.

Test triage
Full `bun test` reports 27 pre-existing failures across `cathedral-ii-pglite.test.ts`, `cathedral-ii-brainbench.test.ts`, `sync.test.ts`, `reindex-code.test.ts`, all pre-existing on master (`git diff origin/master...HEAD --stat` against those files is empty). Each failing file passes in isolation. Pattern: `error: PGLite not connected. Call connect() first.` (concurrent PGLite init exhaustion under bun's parallel runner). Not introduced by this branch. Tracked as P0 in `TODOS.md` for a dedicated investigation branch.

All v0.25.0 unit tests (184 cases across 10 files) pass cleanly.
Plan Completion
17/17 final-scope items shipped. See `~/.claude/plans/system-instruction-you-are-working-humming-giraffe.md` for the full plan (CEO + Eng + Codex + second-eng-review).

TODOS
Documentation
Refreshed by `/document-release` after the CONTRIBUTOR_MODE pivot (commit `175524a5`). All five entry-point docs now reflect the off-by-default capture model:
- `GBRAIN_CONTRIBUTOR_MODE=1`. Was implying capture-on-by-default; corrected. Contributor section near the end already pointed at `docs/eval-bench.md` from the prior commit.
- `gbrain eval replay` line and a CONTRIBUTOR_MODE bullet covering the resolution order. Existing eval-capture key-files entry already updated in `7a80ce25` to document the gate.

Plus `docs/eval-capture.md` and `docs/eval-bench.md` (added in earlier commits, now cross-referenced consistently). `llms.txt` + `llms-full.txt` regenerated.

Verification: `bun run typecheck` clean. 198/198 v0.25.0 tests still green. Branch is 14 commits ahead of master.

Test plan
- `bun test` (174 files, 2651 pass; 27 pre-existing master flakes triaged)
- `bun run typecheck` clean
- (`check-jsonb-pattern.sh`, `check-progress-to-stdout.sh`, `check-wasm-embedded.sh`, `check-exports-count.sh`)
- (`test/e2e/eval-capture.test.ts`, gated on `DATABASE_URL`)
- `gbrain eval export --since 1d` on Garry's brain post-merge

🤖 Generated with Claude Code