feat(eval): cat14 calibration A/B — gates v0.36.1.0 calibration wave on advice quality by garrytan · Pull Request #9 · garrytan/gbrain-evals

garrytan · 2026-05-18T23:22:45Z

Summary

cat14 is the headline eval for gbrain v0.36.1.0's Hindsight calibration wave: does gbrain think --with-calibration produce better answers than plain gbrain think on questions where the user has a relevant track record?

If this category fails, the calibration wave is theater. If it passes, the wave moves the needle.

What ships

8 hand-authored probes across 6 categories (4 positive + 4 negative). Negatives include the strict failure modes that make calibration annoying instead of useful: force-fitting irrelevant bias, leaking academic voice, force-claiming on cold brains.
5-axis Haiku judge with structured tool-use output: mentions_relevant_bias_tag, presents_counter_prior, changes_recommendation_meaningfully, voice_conversational, doesnt_force_fit_irrelevant_bias.
Per-probe JSON dumps for the fix-feedback loop. Failing axes write rationale-tagged dumps to eval/reports/cat14-calibration/<probe_id>.json — you read the rationale field, find the failure mode in the README's fix-mapping table, and know exactly which file in gbrain to edit.
Gate logic that fails the run if win-rate < 55%, voice < 95%, or force-fit-prevention < 90%. Three thresholds chosen because over-eager bias surfacing is worse than under-claiming.
Hermetic smoke test runs without an API key — validates fixture schema, prompt-builder shape, empty-profile cold-brain path, and aggregate gate logic.
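
The gate described above can be sketched as a pure function over aggregate rates. This is a minimal illustration, not the runner's actual code: the `Cat14Aggregate` and `GateVerdict` shapes and the `evaluateGate` name are hypothetical, and only the three thresholds come from this PR.

```typescript
// Hypothetical aggregate shape; field names are illustrative, not the runner's types.
interface Cat14Aggregate {
  winRateCalibrated: number; // fraction of probes where the calibrated answer won
  voiceConversational: number; // fraction of probes passing the voice axis
  noForceFit: number; // fraction passing doesnt_force_fit_irrelevant_bias
}

interface GateVerdict {
  pass: boolean;
  failures: string[];
}

// Thresholds as stated in the PR description.
const WIN_RATE_MIN = 0.55;
const VOICE_MIN = 0.95;
const FORCE_FIT_MIN = 0.9;

// Fails the run if any single rate falls below its threshold.
function evaluateGate(agg: Cat14Aggregate): GateVerdict {
  const failures: string[] = [];
  if (agg.winRateCalibrated < WIN_RATE_MIN) failures.push("win_rate_calibrated");
  if (agg.voiceConversational < VOICE_MIN) failures.push("voice_conversational");
  if (agg.noForceFit < FORCE_FIT_MIN) failures.push("doesnt_force_fit_irrelevant_bias");
  return { pass: failures.length === 0, failures };
}
```

Returning the list of failed axes, rather than a bare boolean, is one way to keep the verdict tied to the fix-mapping table: each failure name points at a row.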

Why the failure-loop matters

This eval is designed for the if-it-fails-fix-the-feature loop. Each axis failure maps to a specific file in gbrain. Full table in eval/data/cat14-calibration/README.md.

Test plan

bun test eval/runner/cat14-calibration.test.ts — 8/8 hermetic tests pass
CAT14_DRY_RUN=1 bun eval/runner/cat14-calibration.ts — full runner exercises end-to-end with stubbed judge
Live run pending ANTHROPIC_API_KEY + ~$0.05 spend
First baseline vs gbrain v0.36.1.0; threshold-gate verdict drives v0.37 calibration prompt iteration

🤖 Generated with Claude Code

Sidecar to the existing AdapterConfig that lets the matrix runner pass {embedder, dim, reranker?, searchMode?, cell?} into per-cell adapter init without env-var parsing inside each adapter. Pure type + validator only. No imports from gbrain. Subsequent commits wire vector.ts + vector-grep-rrf-fusion.ts adapters to consume it via configureGateway() from gbrain/ai/gateway (exposed in gbrain v0.35.1.0, PR garrytan/gbrain#1055). 9 unit cases pin shape and rejection behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
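
A "pure type + validator" sidecar of this kind might look roughly like the sketch below. The field names follow the commit message and `assertEvalAdapterConfig` is named in a later commit, but the body, the error messages, and the assumed `provider:model` form of `embedder` are illustrative guesses, not the shipped code.

```typescript
// Fields follow the commit message; the validation rules are assumptions.
interface EvalAdapterConfig {
  embedder: string; // assumed "provider:model" form
  dim: number;
  reranker?: string;
  searchMode?: string;
  cell?: string;
}

// Throws on any malformed field so a bad matrix cell fails before adapter init.
function assertEvalAdapterConfig(value: unknown): asserts value is EvalAdapterConfig {
  if (typeof value !== "object" || value === null) {
    throw new TypeError("shootout config must be an object");
  }
  const v = value as Record<string, unknown>;
  if (typeof v.embedder !== "string" || !/^[^:\s]+:\S+$/.test(v.embedder)) {
    throw new TypeError("embedder must look like provider:model");
  }
  if (typeof v.dim !== "number" || !Number.isInteger(v.dim) || v.dim <= 0) {
    throw new TypeError("dim must be a positive integer");
  }
  for (const key of ["reranker", "searchMode", "cell"]) {
    if (v[key] !== undefined && typeof v[key] !== "string") {
      throw new TypeError(key + " must be a string when present");
    }
  }
}
```

Using a TypeScript assertion signature means callers get a narrowed `EvalAdapterConfig` after the call without a separate cast.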

When AdapterConfig.shootout is set, the vector adapter:

1. Validates the typed config (assertEvalAdapterConfig).
2. Calls configureGateway({embedding_model, embedding_dimensions}) BEFORE any embed call, so every downstream embedBatch + embed routes through the configured provider.
3. Captures the configured embedder string into the BrainState receipt (instead of the EMBEDDING_MODEL constant).

When shootout is unset, behavior is identical to v0.35.0: same OpenAI text-embedding-3-large + EMBEDDING_MODEL receipt. Existing test file unchanged (pure cosine helpers).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rough engine config

When AdapterConfig.shootout is set, the hybrid adapter:

1. assertEvalAdapterConfig() the typed config before any work.
2. configureGateway({embedding_model, embedding_dimensions}) BEFORE spinning up PGLite so the first importFromContent embed call uses the per-cell provider.
3. engine.setConfig('search.mode', shootout.searchMode ?? 'tokenmax')
4. If shootout.reranker is set:
     search.reranker.enabled = true
     search.reranker.model = shootout.reranker
   Else:
     search.reranker.enabled = false

The explicit `false` is load-bearing: tokenmax's mode bundle defaults reranker=true, so without it the "no-rerank" matrix cells would silently rerank with whatever ZE config the env has.

Behavior unchanged when shootout is unset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
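
The search-mode and reranker branching described in this commit can be sketched as a function that returns the key/value pairs to set. This is an illustration under assumptions: `ShootoutSearchKnobs` and `searchConfigEntries` are hypothetical names, and whether the real `engine.setConfig` takes booleans or strings is not stated in this PR.

```typescript
// Hypothetical subset of the shootout config that affects search settings.
interface ShootoutSearchKnobs {
  reranker?: string;
  searchMode?: string;
}

// Returns the [key, value] pairs an adapter might hand to engine.setConfig.
// ASSUMPTION: value types (boolean vs string) are illustrative.
function searchConfigEntries(shootout: ShootoutSearchKnobs): Array<[string, string | boolean]> {
  const entries: Array<[string, string | boolean]> = [
    ["search.mode", shootout.searchMode ?? "tokenmax"],
  ];
  if (shootout.reranker) {
    entries.push(["search.reranker.enabled", true]);
    entries.push(["search.reranker.model", shootout.reranker]);
  } else {
    // Explicit false: the tokenmax bundle defaults the reranker on,
    // so omitting this entry would leave "no-rerank" cells reranking.
    entries.push(["search.reranker.enabled", false]);
  }
  return entries;
}
```

Keeping the mapping pure makes the load-bearing `false` easy to pin in a unit test without spinning up an engine.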

eval/runner/smoke.ts is the pre-flight check the per-cell wrapper runs BEFORE spending judge tokens on LongMemEval. Three phases:

Phase 1 (wiring): 5 short queries × embed roundtrip. Asserts the returned vector dim matches the configured dim (catches a typo'd --dim arg BEFORE the first page is embedded).

Phase 2 (long-haystack): 1 ~50K-token synthetic payload. Asserts the provider handles the long-content path without hitting a token-limit error or response cap.

Phase 3 (rerank payload, only when --reranker is set): 30 docs of ~400 tokens each → real reranker call. Asserts response shape and that the payload stays under the recipe's max_payload_bytes cap.

Fail-loud env check rejects (provider, missing-key) combos before any HTTP call so the operator sees the actionable error immediately.

Exit codes: 0 OK, 1 usage, 2 config invalid, 3 missing env, 4 phase failure.

Also fixes a latent bug in the two adapter rewrites: configureGateway() now passes `env: process.env` so the gateway's `resolveAuth` path can read provider API keys (the previous calls would throw at first embed).

Verified locally against:
- zeroentropyai:zembed-1 @ 2560 (no rerank)
- zeroentropyai:zembed-1 @ 2560 + zerank-2
- voyage:voyage-4-large @ 2048
- zeroentropyai:zembed-1 @ 1280 + zerank-2 (Matryoshka ablation)

All 4 returned [smoke] OK with the expected vector dims.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
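
The Phase 1 dimension check and the fail-loud env check could be sketched as two small pure helpers. Everything here is hypothetical except the exit codes and the env var names, which come from this PR; in particular the provider-prefix to env-var mapping is an assumption made for illustration.

```typescript
// Exit codes as listed in the commit message.
const SMOKE_EXIT = { ok: 0, usage: 1, configInvalid: 2, missingEnv: 3, phaseFailure: 4 } as const;

// ASSUMPTION: this provider-prefix -> env var mapping is illustrative only.
const PROVIDER_ENV_KEY: Record<string, string> = {
  openai: "OPENAI_API_KEY",
  voyage: "VOYAGE_API_KEY",
  zeroentropyai: "ZEROENTROPY_API_KEY",
};

// Returns the name of the missing env var for this embedder, or null if nothing is missing.
function missingKeyFor(embedder: string, env: Record<string, string | undefined>): string | null {
  const provider = embedder.slice(0, Math.max(0, embedder.indexOf(":")));
  const key: string | undefined = PROVIDER_ENV_KEY[provider];
  if (key === undefined) return null; // unknown provider: nothing to check in this sketch
  return env[key] ? null : key;
}

// Phase 1 style check: every returned vector must match the configured dim.
function firstDimMismatch(vectors: number[][], expectedDim: number): string | null {
  const i = vectors.findIndex((v) => v.length !== expectedDim);
  if (i === -1) return null;
  return "vector " + i + " does not match configured dim " + expectedDim;
}
```

A wrapper would map a non-null `missingKeyFor` result to `SMOKE_EXIT.missingEnv` and a non-null `firstDimMismatch` result to `SMOKE_EXIT.phaseFailure`, both before any paid call is made.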

Replaces the auto-built relational queries with a curated JSON subset loaded from eval/data/gold/brainbench-<NAME>-subset.json.

Used by the v0.35.1.0 embedder-shootout to run a Cat 13 conceptual-recall cell that is actually embedder-sensitive: the existing relational corpus is graph/keyword-dominated (codex outside-voice finding from the plan review) and produces near-zero signal for embedder choice.

Subset file shape:

    {
      "schema_version": 1,
      "subset": "<name>",
      "queries": [
        { "id": "...", "text": "...", "relevant_chunk_ids": ["..."], "inclusion_reason": "..." (optional) },
        ...
      ]
    }

Loaded queries are normalized to the existing Query type and tagged "embedder-sensitive" so reviewers can spot them in the scorecard.

Run the shootout matrix twice per cell: once without the flag (existing relational) and once with --include-subset=cat13-embedder. The Cat 13 subset itself lands in the next commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
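
A loader for the subset shape above might validate the parsed JSON before handing queries to the runner. The sketch below is an assumption about how that could look: `parseSubsetFile`, `SubsetQuery`, and `SubsetFile` are hypothetical names, and the normalization to the existing Query type is not shown because that type is not part of this PR text.

```typescript
interface SubsetQuery {
  id: string;
  text: string;
  relevant_chunk_ids: string[];
  inclusion_reason?: string;
}

interface SubsetFile {
  schema_version: 1;
  subset: string;
  queries: SubsetQuery[];
}

function isStringArray(x: unknown): x is string[] {
  return Array.isArray(x) && x.every((s) => typeof s === "string");
}

// Parses and validates the raw file contents; throws on any shape violation.
function parseSubsetFile(raw: string): SubsetFile {
  const data = JSON.parse(raw) as Record<string, unknown> | null;
  if (typeof data !== "object" || data === null) throw new Error("subset file must be a JSON object");
  if (data.schema_version !== 1) throw new Error("unsupported schema_version");
  if (typeof data.subset !== "string") throw new Error("subset must be a string");
  if (!Array.isArray(data.queries)) throw new Error("queries must be an array");
  const queries = data.queries.map((q: unknown, i: number): SubsetQuery => {
    const e = (q ?? {}) as Record<string, unknown>;
    if (typeof e.id !== "string" || typeof e.text !== "string" || !isStringArray(e.relevant_chunk_ids)) {
      throw new Error("queries[" + i + "] is malformed");
    }
    const out: SubsetQuery = { id: e.id, text: e.text, relevant_chunk_ids: e.relevant_chunk_ids };
    if (typeof e.inclusion_reason === "string") out.inclusion_reason = e.inclusion_reason;
    return out;
  });
  return { schema_version: 1, subset: data.subset, queries };
}
```

Rejecting unknown schema versions up front keeps a future shape change from silently scoring against half-parsed queries.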

…tive queries)

50 hand-curated queries across 27 of the 30 concept pages in eval/data/world-v1/.

Curation rule: a query is included only if the phrasing shares <2 distinct content words with the target page title. A grep/BM25 adapter would miss most of these (no keyword overlap with the slug); a competent semantic embedder should find them via the semantic neighborhood.

Source material: the existing SYNONYMS dictionary in eval/runner/cat13-conceptual.ts (hand-authored by the BrainBench creator). This commit lifts the strongest paraphrases into a static JSON gold file so the multi-adapter runner can score them deterministically via --include-subset=cat13-embedder. Each entry carries an inclusion_reason field documenting the curation rationale.

Spot-check: validator script confirms 0 queries share more than one title content-word with their target, and target-concept coverage spans 27 distinct concepts (vs 30 total; the three remaining were 'second-time-founder' / 'carbon-credits' / 'permitting-reform', whose synonyms shared too much title surface to be embedder-sensitive under the curation rule).

Used by the v0.35.1.0 embedder shootout; see docs/designs/2026_05_EVAL_PLAN.md in gbrain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
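
The curation rule (fewer than 2 distinct content words shared with the target title) can be expressed as a short check. The validator script itself is not shown in this PR, so the tokenization and the stopword list below are assumptions chosen for illustration, and the example query strings are made up.

```typescript
// ASSUMPTION: a small illustrative stopword list; the real validator's list is unknown.
const STOPWORDS = new Set([
  "a", "an", "the", "of", "and", "or", "to", "in", "on", "for", "with",
  "is", "are", "what", "how", "who", "do", "does",
]);

// Lowercases, splits on non-alphanumerics, drops stopwords, and dedupes.
function contentWords(text: string): Set<string> {
  const words = text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((w) => w.length > 0 && !STOPWORDS.has(w));
  return new Set(words);
}

// Counts distinct content words that appear in both the query and the title.
function sharedTitleWords(query: string, title: string): number {
  const titleWords = contentWords(title);
  let shared = 0;
  contentWords(query).forEach((w) => {
    if (titleWords.has(w)) shared++;
  });
  return shared;
}

// Curation rule from the commit message: include only if fewer than 2 words are shared.
function isEmbedderSensitive(query: string, title: string): boolean {
  return sharedTitleWords(query, title) < 2;
}
```

For example, a paraphrase with no title words in common with `second-time-founder` passes the rule, while one that reuses "second", "time", and "founder" does not.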

Three operator-facing files for executing Sessions 4-5 of the gbrain embedder-shootout plan (docs/designs/2026_05_EVAL_PLAN.md):

scripts/run-shootout-phase1.sh: 7 LongMemEval cells, serial. Per cell: smoke gate -> gbrain eval longmemeval --mode tokenmax --expansion (answer-gen mode) -> evaluate_qa.py scoring. Each cell is independently resumable via the v0.35.1.0 --resume-from flag I added in PR garrytan/gbrain#1055. Fails loud at startup if any of {OPENAI_API_KEY, ANTHROPIC_API_KEY, VOYAGE_API_KEY, ZEROENTROPY_API_KEY} is unset or if the LongMemEval evaluator isn't checked out. Cost: ~$476. Wallclock: ~10.5h serial.

scripts/run-shootout-phase2.sh: 7 BrainBench cells, serial. Per cell: relational corpus then Cat 13 conceptual subset via the same HybridNoGraphAdapter wired with the per-cell {embedder, dim, reranker} config. References eval/runner/shootout-driver.ts, which is a Session 5 gap (the wrapper exits non-zero with the actionable "add the driver" message if missing; fail-loud rather than silent-skip). Cost: ~$56. Wallclock: ~3.5h serial.

scripts/RUNBOOK_SHOOTOUT.md: single doc the operator reads to execute the shootout. Covers: one-time prereqs (HF dataset, Python venv for evaluate_qa.py, API keys), kick-off both phases, what lands where, abort/recovery, cost dashboard.

Phase 1 is the long pole and ready to run as-is. Phase 2's driver gap is the only remaining code task before either phase can actually execute.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

eval/runner/shootout-driver.ts is the per-cell entry point referenced by scripts/run-shootout-phase2.sh. Replaces what would otherwise be multi-adapter.ts wired with per-cell shootout config (the existing runner is fixed to N adapters × N runs and doesn't accept the {embedder, dim, reranker} matrix knobs).

Flags:
  --embedder <provider:model>   required
  --dim <N>                     required
  --output <path>               required (receipt JSON destination)
  --reranker <provider:model>   optional; sets search.reranker.enabled
  --subset <name>               optional; loads brainbench-<name>-subset.json
  --cell <label>                optional; A0/B1/C2/... for the receipt

Per-cell behavior:
1. Load eval/data/world-v1 corpus.
2. Build queries: relational (multi-adapter.ts:buildQueries replicated) OR curated subset from eval/data/gold/.
3. Instantiate HybridNoGraphAdapter and pass AdapterConfig.shootout with the matrix config. Adapter's existing logic (from Session 2 commits) calls configureGateway() + sets search.reranker.enabled + search.mode.
4. scoreOneRun semantics: sanitize pages and queries (Day 9 sealed-qrels), run adapter.init then per-query adapter.query, sum P@K + R@K + correct.
5. Emit a deterministic JSON receipt for the comparison writeup.

Help screen works without API spend (verified). End-to-end run gated on the ambient (provider, key) presence per the smoke harness; that's the operator's gate (see scripts/RUNBOOK_SHOOTOUT.md).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
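
Step 4 sums P@K and R@K per query. For readers who want the definitions pinned down, here is a minimal sketch using the standard formulas; it is not the driver's `scoreOneRun`, the function names are hypothetical, and the "correct" counter mentioned in the commit is left out because its definition is not given here.

```typescript
// Standard precision@k: relevant hits in the top k, divided by k.
function precisionAtK(ranked: string[], relevant: Set<string>, k: number): number {
  if (k <= 0) return 0;
  const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / k;
}

// Standard recall@k: relevant hits in the top k, divided by the number of relevant items.
function recallAtK(ranked: string[], relevant: Set<string>, k: number): number {
  if (k <= 0 || relevant.size === 0) return 0;
  const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / relevant.size;
}
```

With a ranking of `["a", "b", "c", "d"]` and relevant set `{a, c, x}`, P@2 is 0.5 and R@2 is 1/3; at k = 4 the precision stays at 0.5 while recall rises to 2/3.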

…on wave moves advice quality

The headline product question for gbrain v0.36.1.0's Hindsight calibration wave: does `gbrain think --with-calibration` produce better answers than plain `gbrain think` on questions where the user has a relevant track record?

If this category fails, the entire calibration wave is theater. If it passes, the wave moves the needle.

What ships:

- eval/data/cat14-calibration/probes.jsonl
  8 hand-authored probes covering 6 categories:
  - calibration-pattern-relevant (positive: bias is relevant to question)
  - calibration-pattern-confidence-boost (positive: strong track record reinforces)
  - calibration-empty-profile (negative: cold brain → behaves like baseline)
  - calibration-bias-irrelevant (negative: don't force-fit geography bias on tech question)
  - calibration-multi-bias (negative: triage which bias applies)
  - calibration-voice-stress (negative: voice stays friend-not-doctor under emotional framing)

- eval/runner/cat14-calibration.ts
  Per-probe flow: build baseline + calibrated system prompts, run both through Anthropic chat in parallel, send (question, both answers, expected.*) to a Haiku judge with structured tool-use scoring on 5 axes:
  1. mentions_relevant_bias_tag
  2. presents_counter_prior
  3. changes_recommendation_meaningfully
  4. voice_conversational
  5. doesnt_force_fit_irrelevant_bias
  Per-probe JSON dump for the fix-feedback loop (failing axes drive prompt iteration).
  Aggregate gate logic:
  - win_rate_calibrated >= 55% (calibration_net_negative threshold)
  - voice_conversational >= 95% (cheap axis must not regress)
  - doesnt_force_fit_irrelevant_bias >= 90% (over-eager bias surfacing is worse than under-claiming)

- eval/data/cat14-calibration/README.md
  Failure-mode → fix-location playbook. Each axis failure maps to a specific file in gbrain (buildThinkSystemPrompt, buildCalibrationBlock, voice-gate rubric, profile aggregation math) so a failing run produces an actionable next step instead of a metric blob.

- eval/runner/cat14-calibration.test.ts
  Hermetic smoke (no API key required). Tests:
  - fixture loads + schema is well-formed
  - empty-profile path produces baseline-shaped prompt (cold-brain regression)
  - non-empty profile injects bias tags + pattern statements
  - gate logic catches the documented failure modes

Run:
  bun test eval/runner/cat14-calibration.test.ts        # smoke, no API
  CAT14_DRY_RUN=1 bun eval/runner/cat14-calibration.ts  # judge stubbed
  bun eval/runner/cat14-calibration.ts                  # full ~$0.05

Design choices flagged in README:
- synthetic seeded brain (not user's real one) → known ground truth
- per-probe JSON dumps even on pass → failure-loop demands per-example visibility
- negative probes are strict half (force-fit is worse than under-claim)

v2 follow-up (deferred):
- 30+ probes covering long tail (mounts, cross-brain attribution, abandoned threads)
- shadow eval against anonymized real-brain export
- auto-iterate: failing probes → prompt mutations → re-run → delta

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
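
The two prompt-shape properties the hermetic tests pin (empty profile falls back to the baseline prompt; non-empty profile injects bias tags and pattern statements) can be illustrated with a toy builder. This is a hypothetical sketch, not gbrain's `buildThinkSystemPrompt` or `buildCalibrationBlock`: the types, the prompt wording, and the `Sketch` function names are all invented for illustration.

```typescript
// Hypothetical profile shape: one bias tag plus one pattern statement per entry.
interface BiasPatternSketch {
  tag: string;
  statement: string;
}

interface CalibrationProfileSketch {
  patterns: BiasPatternSketch[];
}

// Invented baseline wording, used only to make the example concrete.
const BASE_PROMPT_SKETCH = "You are a thoughtful friend helping the user think through a decision.";

// Returns an empty string for a cold brain so nothing is force-claimed.
function buildCalibrationBlockSketch(profile: CalibrationProfileSketch): string {
  if (profile.patterns.length === 0) return "";
  const lines = profile.patterns.map((p) => "- [" + p.tag + "] " + p.statement);
  return ["Known track-record patterns for this user:"].concat(lines).join("\n");
}

// With an empty block the output is byte-identical to the baseline prompt.
function buildSystemPromptSketch(profile: CalibrationProfileSketch, withCalibration: boolean): string {
  const block = withCalibration ? buildCalibrationBlockSketch(profile) : "";
  return block === "" ? BASE_PROMPT_SKETCH : BASE_PROMPT_SKETCH + "\n\n" + block;
}
```

Making the cold-brain output identical to the baseline, rather than merely similar, is what lets a test assert the regression with a plain string comparison.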

Two adds in one commit:

## cat15: propose_takes precision/recall

Companion to cat14. cat14 measures the OUTPUT side of calibration (does think --with-calibration produce better answers). cat15 measures the INPUT side (does extract-takes find the gradeable claims hiding in prose, so the calibration loop has fuel).

8 probes against the hand-labeled synthetic corpus shipped in gbrain v0.36.1.0 at test/fixtures/calibration/ (3 training + 5 holdout, 6 genre categories: concept-with-timeline, meeting-notes, daily-journal, people-page, essay-on-self-calibration, decision-log).

Runner shape:
- Read page body + ground-truth JSON
- Call Sonnet with EXTRACT_TAKES_PROMPT, parse JSON array of claims
- Call Haiku matcher judge with structured tool-use to label TP/FP/FN
- Compute precision/recall/F1 per probe; aggregate per-split

Gate thresholds:
- training avg F1 >= 0.85
- holdout avg F1 >= 0.80
- train-holdout gap < 0.10 (overfitting signal)

**First live run results (claude-sonnet-4-6 + claude-haiku-4-5-20251001):**

  training avg F1:    0.952 (target 0.85, +10 points)
  holdout avg F1:     0.922 (target 0.80, +12 points)
  train-holdout gap:  0.03 (no overfitting)
  8/8 probes pass their individual F1 targets
  per-genre F1 floor: 0.80 (people-pages, the hardest genre)

Cost: ~$0.10 per full run (8 pages * 2 LLM calls).

## cat14 iteration log

Evidence the failure-loop methodology actually closes. Three prompt variants tested same-day:

- v1 (original 5 rules): 75% win, 100% voice, gate PASS
- v2 (split bias-tag direction): 63% win, 88% voice, gate FAIL
- v3 (epistemic humility on both): 75% win, 75% voice + 75% force-fit, gate FAIL

The eval caught two distinct regressions caused by over-correction. v1 was reverted; iteration-log.md preserves the loop as evidence that the 95% voice gate and 90% force-fit gate catch the failure modes that would make calibration annoying instead of useful.

Lesson logged in the doc: longer prompts leak meta-language; the simplest working prompt wins; iterations that lose voice ground should not ship even if they win on other axes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
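
The cat15 scoring and gate can be sketched from the TP/FP/FN labels the matcher judge emits. The precision, recall, and F1 formulas are standard; the function names are hypothetical, and treating the gap as training mean minus holdout mean (rather than an absolute difference) is an assumption based on its description as an overfitting signal.

```typescript
interface MatchCounts {
  tp: number; // extracted claims that match a ground-truth claim
  fp: number; // extracted claims with no ground-truth match
  fn: number; // ground-truth claims the extractor missed
}

// Standard precision / recall / F1, returning 0 when a denominator is 0.
function scoreClaims(c: MatchCounts): { precision: number; recall: number; f1: number } {
  const precision = c.tp + c.fp === 0 ? 0 : c.tp / (c.tp + c.fp);
  const recall = c.tp + c.fn === 0 ? 0 : c.tp / (c.tp + c.fn);
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

function meanOf(xs: number[]): number {
  return xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Thresholds from the commit message. ASSUMPTION: gap = training mean - holdout mean.
function cat15GatePasses(trainF1: number[], holdoutF1: number[]): boolean {
  const train = meanOf(trainF1);
  const holdout = meanOf(holdoutF1);
  return train >= 0.85 && holdout >= 0.8 && train - holdout < 0.1;
}
```

Plugging in the reported split averages (0.952 training, 0.922 holdout) gives a gap of about 0.03, which clears all three thresholds.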

…p eval)

Follows the existing format established by 2026-04-23-cat13-conceptual.md and 2026-05-07-longmemeval-s.md.

Sections:
1. Headline (75% calibrated win / 0% baseline, 0.92 holdout F1)
2. What is the calibration loop (4-phase pipeline explained)
3. Cat 14 detail (advice quality A/B, 5-axis rubric, iteration log)
4. Cat 15 detail (extraction F1, corpus design, per-genre breakdown)
5. What changes (4 takeaways)
6. Methodology (reproduction commands, models, variance bounds)
7. SOTA framing (honest: no prior published benchmark in this category)
8. Known gaps to close in v2 (7 items)

SOTA framing is honest: Hindsight introduced the calibration-loop concept as a skills demo without quantified evaluation. Personal-AI projects like Mem0, MemPalace, Notion AI don't ship a calibration loop. Academic work covers human forecaster calibration, not AI implementations. cat14 + cat15 stake out the category as benchmarkable.

The benchmark report is the load-bearing artifact for anyone evaluating whether the v0.36.1.0 calibration wave is worth adopting. Linked from gbrain's CHANGELOG and README "Receipts on the evals" section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

User feedback was that the original benchmark report was correct but inaccessible: it dropped jargon (F1, precision, recall, voice gate, force-fit, holdout, overfitting, judge model, rubric, axis) without explaining what any of it means in plain English first. The CLAUDE.md "lead ELI10, get precise after" rule applies to benchmark reports as much as it does to CHANGELOG entries.

Rewritten following that structure:

- Section 1 "The plain-English version": opens with what the feature does and what we tested, in 200 words, before any technical term appears.
- Section 2 "Why this matters": explains the gap in current AI memory systems in everyday language. Names Hindsight as prior art.
- Section 3 "The headline numbers": gives the win-rate result, but explains F1 / precision / recall in plain English BEFORE the score table appears.
- Section 4 "The four pieces of the calibration loop": walks through the pipeline in plain English. Names what each step does and what cat14 vs cat15 tests.
- Section 5 cat14 detail: opens with "what we're measuring in plain English" + a worked example BEFORE the 6-category test taxonomy and 5-axis rubric. The rubric axes are listed with their plain-English question first ("Does the calibrated answer mention the relevant bias when it should?") with the technical name in parens after.
- Section 6 cat15 detail: opens with what the test is asking, then explains training-vs-holdout and overfitting BEFORE introducing those terms in the score tables.
- Section 7 "What changes for gbrain users": material impact in plain language. No code paths, no commit refs.
- Section 8 "How to reproduce": copy-pasteable commands, plus the honest caveat about why the corpus is synthetic.
- Section 9 "What this is the first to publish": honest SOTA framing about why the category is open, with named comparisons to adjacent systems and academic work.
- Section 10 "Known gaps": 7 things this report does NOT measure, with what would be needed to close each.
- Section 11 "Per-test-case raw data": points to the dumps and explains why they're load-bearing for future regression-prevention.

Every section follows the same pattern: plain-English lead, then introduce the technical term as it comes up, then drill into precision.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

garrytan and others added 12 commits May 15, 2026 18:50

garrytan merged commit 89445dd into main May 19, 2026

garrytan deleted the cat14-calibration branch May 19, 2026 02:18

garrytan merged 12 commits into main from cat14-calibration
