feat(eval): cat14 calibration A/B — gates v0.36.1.0 calibration wave on advice quality #9
Merged
Sidecar to the existing AdapterConfig that lets the matrix runner pass
{embedder, dim, reranker?, searchMode?, cell?} into per-cell adapter
init without env-var parsing inside each adapter.
Pure type + validator only. No imports from gbrain. Subsequent commits
wire vector.ts + vector-grep-rrf-fusion.ts adapters to consume it via
configureGateway() from gbrain/ai/gateway (exposed in gbrain v0.35.1.0,
PR garrytan/gbrain#1055).
9 unit cases pin shape and rejection behavior.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
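
A minimal sketch of what a pure type-plus-validator module for this sidecar could look like. Only the field names {embedder, dim, reranker?, searchMode?, cell?} come from the commit message; the function name `assertEvalAdapterConfigSketch`, the `provider:model` check, and the error wording are assumptions for illustration, not the shipped code.

```typescript
// Hypothetical shape: field names are from the commit message, the rest is assumed.
interface EvalAdapterConfig {
  embedder: string; // assumed "provider:model" form, e.g. "voyage:voyage-4-large"
  dim: number;
  reranker?: string;
  searchMode?: string;
  cell?: string;
}

// Throws on the first invalid field; returns nothing on success.
function assertEvalAdapterConfigSketch(value: unknown): asserts value is EvalAdapterConfig {
  if (typeof value !== "object" || value === null) {
    throw new TypeError("shootout config must be an object");
  }
  const v = value as Record<string, unknown>;
  const embedder = v.embedder;
  if (typeof embedder !== "string" || !/^[^:\s]+:\S+$/.test(embedder)) {
    throw new TypeError("embedder must be a 'provider:model' string");
  }
  const dim = v.dim;
  if (typeof dim !== "number" || !Number.isInteger(dim) || dim <= 0) {
    throw new TypeError("dim must be a positive integer");
  }
  // Optional fields only need to be strings when they are present.
  for (const key of ["reranker", "searchMode", "cell"]) {
    const field = v[key];
    if (field !== undefined && typeof field !== "string") {
      throw new TypeError(`${key} must be a string when present`);
    }
  }
}
```

Keeping the validator free of imports means the unit cases can run without the gateway or any provider key.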
When AdapterConfig.shootout is set, the vector adapter:
1. Validates the typed config (assertEvalAdapterConfig)
2. Calls configureGateway({embedding_model, embedding_dimensions})
BEFORE any embed call, so every downstream embedBatch + embed
routes through the configured provider.
3. Captures the configured embedder string into the BrainState
receipt (instead of EMBEDDING_MODEL constant).
When shootout is unset, behavior is identical to v0.35.0 — same OpenAI
text-embedding-3-large + EMBEDDING_MODEL receipt.
Existing test file unchanged (pure cosine helpers).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rough engine config
When AdapterConfig.shootout is set, the hybrid adapter:
1. assertEvalAdapterConfig() the typed config before any work.
2. configureGateway({embedding_model, embedding_dimensions}) BEFORE
spinning up PGLite so the first importFromContent embed call uses
the per-cell provider.
3. engine.setConfig('search.mode', shootout.searchMode ?? 'tokenmax')
4. If shootout.reranker is set:
search.reranker.enabled = true
search.reranker.model = shootout.reranker
Else:
search.reranker.enabled = false
The explicit `false` is load-bearing — tokenmax's mode bundle
defaults reranker=true, so without this the "no-rerank" matrix
cells would silently rerank using whatever ZE config the env has.
Behavior unchanged when shootout is unset.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
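
The reranker branch described above can be sketched as a pure function that returns the key/value pairs to apply. The config keys and the 'tokenmax' default are from the commit message; the helper name `searchConfigEntries` and the use of string values are assumptions, since the real `engine.setConfig` signature is not shown here.

```typescript
type ConfigEntry = [key: string, value: string];

// Hypothetical helper: computes the settings a cell would apply, without touching an engine.
function searchConfigEntries(shootout: { searchMode?: string; reranker?: string }): ConfigEntry[] {
  const entries: ConfigEntry[] = [["search.mode", shootout.searchMode ?? "tokenmax"]];
  if (shootout.reranker) {
    entries.push(["search.reranker.enabled", "true"]);
    entries.push(["search.reranker.model", shootout.reranker]);
  } else {
    // Always written explicitly, so a mode bundle that defaults the
    // reranker on cannot leak into a "no-rerank" cell.
    entries.push(["search.reranker.enabled", "false"]);
  }
  return entries;
}
```

Returning the entries instead of applying them keeps the "explicit false" rule checkable in a unit test.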
eval/runner/smoke.ts is the pre-flight check the per-cell wrapper runs
BEFORE spending judge tokens on LongMemEval. Three phases:
Phase 1 — wiring: 5 short queries × embed roundtrip. Asserts the
returned vector dim matches the configured dim (catches a typo'd
--dim arg BEFORE the first page is embedded).
Phase 2 — long-haystack: 1 ~50K-token synthetic payload. Asserts the
provider handles the long-content path without hitting a token-limit
error or response cap.
Phase 3 — rerank payload (only when --reranker is set): 30 docs of
~400 tokens each → real reranker call. Asserts response shape and
that the payload stays under the recipe's max_payload_bytes cap.
Fail-loud env check rejects (provider, missing-key) combos before any
HTTP call so the operator sees the actionable error immediately. Exit
codes: 0 OK, 1 usage, 2 config invalid, 3 missing env, 4 phase failure.
Also fixes a latent bug in the two adapter rewrites: configureGateway()
now passes `env: process.env` so the gateway's `resolveAuth` path can
read provider API keys (the previous calls would throw at first embed).
Verified locally against:
- zeroentropyai:zembed-1 @ 2560 (no rerank)
- zeroentropyai:zembed-1 @ 2560 + zerank-2
- voyage:voyage-4-large @ 2048
- zeroentropyai:zembed-1 @ 1280 + zerank-2 (Matryoshka ablation)
All 4 returned [smoke] OK with the expected vector dims.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
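
A rough sketch of the fail-loud env check and the Phase 1 dimension assertion, using the exit codes listed above. The provider-prefix to env-var mapping is an assumption (the variable names appear in the phase 1 script's startup check, but which prefix uses which key is inferred), and `preflight` and `checkDims` are hypothetical names.

```typescript
// Exit codes as documented in the commit message.
const EXIT = { OK: 0, USAGE: 1, CONFIG_INVALID: 2, MISSING_ENV: 3, PHASE_FAILURE: 4 } as const;

// Assumed mapping from provider prefix to the env var holding its API key.
const PROVIDER_KEY: Record<string, string> = {
  openai: "OPENAI_API_KEY",
  voyage: "VOYAGE_API_KEY",
  zeroentropyai: "ZEROENTROPY_API_KEY",
};

// Rejects (provider, missing-key) combos before any HTTP call is made.
function preflight(embedder: string, env: Record<string, string | undefined>): number {
  const provider = embedder.split(":")[0] ?? "";
  const keyName = PROVIDER_KEY[provider];
  if (keyName === undefined) return EXIT.CONFIG_INVALID;
  if (!env[keyName]) return EXIT.MISSING_ENV;
  return EXIT.OK;
}

// Phase 1 style check: every returned vector must have the configured dimension.
function checkDims(vectors: number[][], expectedDim: number): number {
  return vectors.every((v) => v.length === expectedDim) ? EXIT.OK : EXIT.PHASE_FAILURE;
}
```

In a real script the returned code would be handed to `process.exit` after printing which key or dimension was wrong.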
Replaces the auto-built relational queries with a curated JSON subset
loaded from eval/data/gold/brainbench-<NAME>-subset.json. Used by the
v0.35.1.0 embedder-shootout to run a Cat 13 conceptual-recall cell that
is actually embedder-sensitive — the existing relational corpus is
graph/keyword-dominated (codex outside-voice finding from the plan
review) and produces near-zero signal for embedder choice.
Subset file shape:
{
"schema_version": 1,
"subset": "<name>",
"queries": [
{ "id": "...", "text": "...", "relevant_chunk_ids": ["..."],
"inclusion_reason": "..." (optional) },
...
]
}
Loaded queries are normalized to the existing Query type and tagged
"embedder-sensitive" so reviewers can spot them in the scorecard.
Run the shootout matrix twice per cell: once without the flag (existing
relational) and once with --include-subset=cat13-embedder. The Cat 13
subset itself lands in the next commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
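
A sketch of loading and tagging a subset file with the shape shown above. It parses from a string so it stays hermetic; the `TaggedQuery` type, the `tags` field, and the rule that the `subset` field must equal the requested name are assumptions, since the existing Query type is not shown in this commit.

```typescript
interface SubsetQuery {
  id: string;
  text: string;
  relevant_chunk_ids: string[];
  inclusion_reason?: string;
}

// Hypothetical normalized form; the real Query type may differ.
interface TaggedQuery extends SubsetQuery {
  tags: string[];
}

function isStringArray(x: unknown): x is string[] {
  return Array.isArray(x) && x.every((s) => typeof s === "string");
}

function parseSubset(raw: string, expectedName: string): TaggedQuery[] {
  const doc = JSON.parse(raw) as Record<string, unknown>;
  if (doc.schema_version !== 1) throw new Error("unsupported schema_version");
  if (doc.subset !== expectedName) throw new Error(`subset name mismatch: ${String(doc.subset)}`);
  const queries = doc.queries;
  if (!Array.isArray(queries)) throw new Error("queries must be an array");
  return queries.map((q: unknown, i: number): TaggedQuery => {
    const { id, text, relevant_chunk_ids: ids, inclusion_reason: reason } = q as Record<string, unknown>;
    if (typeof id !== "string" || typeof text !== "string" || !isStringArray(ids)) {
      throw new Error(`query ${i} is malformed`);
    }
    if (reason !== undefined && typeof reason !== "string") {
      throw new Error(`query ${i} has a non-string inclusion_reason`);
    }
    const tagged: TaggedQuery = { id, text, relevant_chunk_ids: ids, tags: ["embedder-sensitive"] };
    if (reason !== undefined) tagged.inclusion_reason = reason;
    return tagged;
  });
}
```

A caller would read eval/data/gold/brainbench-<NAME>-subset.json from disk and pass its contents as `raw`.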
…tive queries)
50 hand-curated queries across 27 of the 30 concept pages in
eval/data/world-v1/. Curation rule: a query is included only if the
phrasing shares <2 distinct content words with the target page title.
A grep/BM25 adapter would miss most of these (no keyword overlap with
the slug); a competent semantic embedder should find them via the
semantic neighborhood.
Source material: the existing SYNONYMS dictionary in
eval/runner/cat13-conceptual.ts (hand-authored by the BrainBench
creator). This commit lifts the strongest paraphrases into a static
JSON gold file so the multi-adapter runner can score them
deterministically via --include-subset=cat13-embedder. Each entry
carries an inclusion_reason field documenting the curation rationale.
Spot-check: validator script confirms 0 queries share more than one
title content-word with their target, and target-concept coverage spans
27 distinct concepts (vs 30 total; the three remaining were
'second-time-founder' / 'carbon-credits' / 'permitting-reform' whose
synonyms shared too much title surface to be embedder-sensitive under
the curation rule).
Used by the v0.35.1.0 embedder shootout; see
docs/designs/2026_05_EVAL_PLAN.md in gbrain.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
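
The curation rule (fewer than 2 distinct content words shared with the target title) can be sketched as below. This is not the validator script from the commit: the stopword list, the tokenizer, and the absence of stemming are all assumptions, so counts near the boundary may differ from the real check.

```typescript
// Assumed stopword list; the real validator's notion of "content word" is not shown.
const STOPWORDS = new Set([
  "a", "an", "the", "of", "to", "in", "on", "for", "and", "or",
  "is", "are", "what", "who", "how", "do", "does", "with",
]);

// Lowercases, splits on non-alphanumerics, and drops stopwords. No stemming.
function contentWords(text: string): Set<string> {
  const words = text.toLowerCase().split(/[^a-z0-9]+/);
  return new Set(words.filter((w) => w.length > 0 && !STOPWORDS.has(w)));
}

// Number of distinct content words the query shares with the title.
function sharedContentWords(query: string, title: string): number {
  const q = contentWords(query);
  return Array.from(contentWords(title)).filter((w) => q.has(w)).length;
}

// A query qualifies as embedder-sensitive when it shares fewer than 2 such words.
function passesCurationRule(query: string, title: string): boolean {
  return sharedContentWords(query, title) < 2;
}
```

Because titles here are slugs such as 'carbon-credits', splitting on non-alphanumerics treats each hyphen-separated part as a separate word.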
Three operator-facing files for executing Sessions 4-5 of the gbrain
embedder-shootout plan (docs/designs/2026_05_EVAL_PLAN.md):
scripts/run-shootout-phase1.sh: 7 LongMemEval cells, serial.
Per cell: smoke gate -> gbrain eval longmemeval --mode tokenmax
--expansion (answer-gen mode) -> evaluate_qa.py scoring. Each cell is
independently resumable via the v0.35.1.0 --resume-from flag I added in
PR garrytan/gbrain#1055. Fails loud at startup if any of
{OPENAI_API_KEY, ANTHROPIC_API_KEY, VOYAGE_API_KEY, ZEROENTROPY_API_KEY}
is unset or if the LongMemEval evaluator isn't checked out.
Cost: ~$476. Wallclock: ~10.5h serial.
scripts/run-shootout-phase2.sh: 7 BrainBench cells, serial.
Per cell: relational corpus then Cat 13 conceptual subset via the same
HybridNoGraphAdapter wired with the per-cell {embedder, dim, reranker}
config. References eval/runner/shootout-driver.ts which is a Session 5
gap (the wrapper exits non-zero with the actionable "add the driver"
message if missing; fail-loud rather than silent-skip).
Cost: ~$56. Wallclock: ~3.5h serial.
scripts/RUNBOOK_SHOOTOUT.md: single doc the operator reads to execute
the shootout. Covers: one-time prereqs (HF dataset, Python venv for
evaluate_qa.py, API keys), kick-off both phases, what lands where,
abort/recovery, cost dashboard.
Phase 1 is the long pole and ready to run as-is. Phase 2's driver gap
is the only remaining code task before either phase can actually
execute.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
eval/runner/shootout-driver.ts is the per-cell entry point referenced
by scripts/run-shootout-phase2.sh. Replaces what would otherwise be
multi-adapter.ts wired with per-cell shootout config (the existing
runner is fixed to N adapters × N runs and doesn't accept the
{embedder, dim, reranker} matrix knobs).
Flags:
--embedder <provider:model> required
--dim <N> required
--output <path> required (receipt JSON destination)
--reranker <provider:model> optional; sets search.reranker.enabled
--subset <name> optional; loads brainbench-<name>-subset.json
--cell <label> optional; A0/B1/C2/... for the receipt
Per-cell behavior:
1. Load eval/data/world-v1 corpus.
2. Build queries: relational (multi-adapter.ts:buildQueries replicated) OR
curated subset from eval/data/gold/.
3. Instantiate HybridNoGraphAdapter and pass AdapterConfig.shootout with
the matrix config. Adapter's existing logic (from Session 2 commits)
calls configureGateway() + sets search.reranker.enabled + search.mode.
4. scoreOneRun semantics: sanitize pages and queries (Day 9 sealed-qrels),
run adapter.init then per-query adapter.query, sum P@K + R@K + correct.
5. Emit a deterministic JSON receipt for the comparison writeup.
Help screen works without API spend (verified). End-to-end run gated on
the ambient (provider, key) presence per the smoke harness; that's the
operator's gate (see scripts/RUNBOOK_SHOOTOUT.md).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
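
For reference, the per-query P@K and R@K that step 4 sums can be computed as in this sketch. It uses the conventional definitions (hits in the top K divided by K, and by the number of relevant items); whether scoreOneRun uses exactly these denominators is an assumption, and `precisionRecallAtK` is a hypothetical name.

```typescript
// Conventional precision@K and recall@K over a ranked list of chunk ids.
function precisionRecallAtK(
  ranked: string[],
  relevant: Set<string>,
  k: number,
): { precision: number; recall: number } {
  const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
  return {
    precision: k === 0 ? 0 : hits / k,
    recall: relevant.size === 0 ? 0 : hits / relevant.size,
  };
}
```

For example, with ranked results ["a", "b", "c", "d"], relevant set {"b", "d", "z"}, and K = 2, one of the top two results is relevant, so precision is 1/2 and recall is 1/3.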
…on wave moves advice quality
The headline product question for gbrain v0.36.1.0's Hindsight calibration
wave: does `gbrain think --with-calibration` produce better answers than
plain `gbrain think` on questions where the user has a relevant track
record?
If this category fails, the entire calibration wave is theater. If it
passes, the wave moves the needle.
What ships:
- eval/data/cat14-calibration/probes.jsonl
8 hand-authored probes covering 6 categories:
- calibration-pattern-relevant (positive: bias is relevant to question)
- calibration-pattern-confidence-boost (positive: strong track record reinforces)
- calibration-empty-profile (negative: cold brain → behaves like baseline)
- calibration-bias-irrelevant (negative: don't force-fit geography bias on tech question)
- calibration-multi-bias (negative: triage which bias applies)
- calibration-voice-stress (negative: voice stays friend-not-doctor under emotional framing)
- eval/runner/cat14-calibration.ts
Per-probe flow: build baseline + calibrated system prompts, run both
through Anthropic chat in parallel, send (question, both answers,
expected.*) to a Haiku judge with structured tool-use scoring on
5 axes:
1. mentions_relevant_bias_tag
2. presents_counter_prior
3. changes_recommendation_meaningfully
4. voice_conversational
5. doesnt_force_fit_irrelevant_bias
Per-probe JSON dump for the fix-feedback loop (failing axes drive
prompt iteration). Aggregate gate logic:
- win_rate_calibrated >= 55% (calibration_net_negative threshold)
- voice_conversational >= 95% (cheap axis must not regress)
- doesnt_force_fit_irrelevant_bias >= 90% (over-eager bias surfacing
is worse than under-claiming)
- eval/data/cat14-calibration/README.md
Failure-mode → fix-location playbook. Each axis failure maps to a
specific file in gbrain (buildThinkSystemPrompt, buildCalibrationBlock,
voice-gate rubric, profile aggregation math) so a failing run produces
an actionable next step instead of a metric blob.
- eval/runner/cat14-calibration.test.ts
Hermetic smoke (no API key required). Tests:
- fixture loads + schema is well-formed
- empty-profile path produces baseline-shaped prompt (cold-brain regression)
- non-empty profile injects bias tags + pattern statements
- gate logic catches the documented failure modes
Run:
bun test eval/runner/cat14-calibration.test.ts # smoke, no API
CAT14_DRY_RUN=1 bun eval/runner/cat14-calibration.ts # judge stubbed
bun eval/runner/cat14-calibration.ts # full ~$0.05
Design choices flagged in README:
- synthetic seeded brain (not user's real one) → known ground truth
- per-probe JSON dumps even on pass → failure-loop demands per-example visibility
- negative probes are strict half (force-fit is worse than under-claim)
v2 follow-up (deferred):
- 30+ probes covering long tail (mounts, cross-brain attribution, abandoned threads)
- shadow eval against anonymized real-brain export
- auto-iterate: failing probes → prompt mutations → re-run → delta
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
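
The aggregate gate described above can be sketched as a threshold table plus a filter. The three thresholds are the ones stated in this commit message; expressing rates as fractions in [0, 1] and the names `Cat14Rates` and `cat14Gate` are assumptions, not the runner's actual types.

```typescript
// Rates are fractions in [0, 1]; keys mirror the axis names in the commit message.
interface Cat14Rates {
  win_rate_calibrated: number;
  voice_conversational: number;
  doesnt_force_fit_irrelevant_bias: number;
}

const CAT14_THRESHOLDS: Cat14Rates = {
  win_rate_calibrated: 0.55,
  voice_conversational: 0.95,
  doesnt_force_fit_irrelevant_bias: 0.9,
};

// Returns every axis that fell below its threshold, so a failing run names what to fix.
function cat14Gate(rates: Cat14Rates): { pass: boolean; failures: string[] } {
  const axes = Object.keys(CAT14_THRESHOLDS) as (keyof Cat14Rates)[];
  const failures = axes.filter((axis) => rates[axis] < CAT14_THRESHOLDS[axis]);
  return { pass: failures.length === 0, failures };
}
```

Reporting the failing axes rather than a single boolean is what lets each failure be traced to a fix location in the README playbook.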
Two adds in one commit:
## cat15: propose_takes precision/recall
Companion to cat14. cat14 measures the OUTPUT side of calibration (does
think --with-calibration produce better answers). cat15 measures the
INPUT side (does extract-takes find the gradeable claims hiding in
prose, so the calibration loop has fuel).
8 probes against the hand-labeled synthetic corpus shipped in gbrain
v0.36.1.0 at test/fixtures/calibration/ (3 training + 5 holdout, 6 genre
categories: concept-with-timeline, meeting-notes, daily-journal,
people-page, essay-on-self-calibration, decision-log).
Runner shape:
- Read page body + ground-truth JSON
- Call Sonnet with EXTRACT_TAKES_PROMPT, parse JSON array of claims
- Call Haiku matcher judge with structured tool-use to label TP/FP/FN
- Compute precision/recall/F1 per probe; aggregate per-split
Gate thresholds:
- training avg F1 >= 0.85
- holdout avg F1 >= 0.80
- train-holdout gap < 0.10 (overfitting signal)
**First live run results (claude-sonnet-4-6 + claude-haiku-4-5-20251001):**
training avg F1: 0.952 (target 0.85, +10 points)
holdout avg F1: 0.922 (target 0.80, +12 points)
train-holdout gap: 0.03 (no overfitting)
8/8 probes pass their individual F1 targets
per-genre F1 floor: 0.80 (people-pages, the hardest genre)
Cost: ~$0.10 per full run (8 pages * 2 LLM calls).
## cat14 iteration log
Evidence the failure-loop methodology actually closes. Three prompt
variants tested same-day:
- v1 (original 5 rules): 75% win, 100% voice, gate PASS
- v2 (split bias-tag direction): 63% win, 88% voice, gate FAIL
- v3 (epistemic humility on both): 75% win, 75% voice + 75% force-fit, gate FAIL
The eval caught two distinct regressions caused by over-correction. The
prompt was reverted to v1; iteration-log.md preserves the loop as
evidence that the 95% voice gate and 90% force-fit gate catch the
failure modes that would make calibration annoying instead of useful.
Lesson logged in the doc: longer prompts leak meta-language; the
simplest working prompt wins; iterations that lose voice ground should
not ship even if they win on other axes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
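
The cat15 scoring arithmetic can be sketched as follows: precision, recall, and F1 from TP/FP/FN counts, then the three gate thresholds over per-split averages. The formulas are the standard ones and the thresholds are from this commit message; defining the gap as training mean minus holdout mean, and the names `prf` and `cat15Gate`, are assumptions.

```typescript
interface Prf {
  precision: number;
  recall: number;
  f1: number;
}

// Standard precision/recall/F1; empty denominators are treated as 0.
function prf(tp: number, fp: number, fn: number): Prf {
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

function mean(xs: number[]): number {
  return xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Gate: training avg F1 >= 0.85, holdout avg F1 >= 0.80, train-holdout gap < 0.10.
function cat15Gate(trainF1: number[], holdoutF1: number[]) {
  const train = mean(trainF1);
  const holdout = mean(holdoutF1);
  const gap = train - holdout; // assumed sign convention: positive means train scores higher
  return { train, holdout, gap, pass: train >= 0.85 && holdout >= 0.8 && gap < 0.1 };
}
```

The gap check is what distinguishes a prompt that generalizes from one tuned to the training pages: a high training average with a much lower holdout average fails even if both clear their individual floors.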
…p eval)
Follows the existing format established by 2026-04-23-cat13-conceptual.md
and 2026-05-07-longmemeval-s.md.
Sections:
1. Headline (75% calibrated win / 0% baseline, 0.92 holdout F1)
2. What is the calibration loop (4-phase pipeline explained)
3. Cat 14 detail (advice quality A/B, 5-axis rubric, iteration log)
4. Cat 15 detail (extraction F1, corpus design, per-genre breakdown)
5. What changes (4 takeaways)
6. Methodology (reproduction commands, models, variance bounds)
7. SOTA framing (honest: no prior published benchmark in this category)
8. Known gaps to close in v2 (7 items)
SOTA framing is honest: Hindsight introduced the calibration-loop
concept as a skills demo without quantified evaluation. Personal-AI
projects like Mem0, MemPalace, Notion AI don't ship a calibration loop.
Academic work covers human forecaster calibration, not AI
implementations. cat14 + cat15 stake out the category as benchmarkable.
The benchmark report is the load-bearing artifact for anyone evaluating
whether the v0.36.1.0 calibration wave is worth adopting. Linked from
gbrain's CHANGELOG and README "Receipts on the evals" section.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User feedback was that the original benchmark report was correct but
inaccessible — it dropped jargon (F1, precision, recall, voice gate,
force-fit, holdout, overfitting, judge model, rubric, axis) without
explaining what any of it means in plain English first.
The CLAUDE.md "lead ELI10, get precise after" rule applies to benchmark
reports as much as it does to CHANGELOG entries. Rewritten following
that structure:
- Section 1 "The plain-English version" — opens with what the feature
does and what we tested, in 200 words, before any technical term
appears.
- Section 2 "Why this matters" — explains the gap in current AI memory
systems in everyday language. Names Hindsight as prior art.
- Section 3 "The headline numbers" — gives the win-rate result, but
explains F1 / precision / recall in plain English BEFORE the score
table appears.
- Section 4 "The four pieces of the calibration loop" — walks through
the pipeline in plain English. Names what each step does and what
cat14 vs cat15 tests.
- Section 5 cat14 detail — opens with "what we're measuring in plain
English" + a worked example BEFORE the 6-category test taxonomy and
5-axis rubric. The rubric axes are listed with their plain-English
question first ("Does the calibrated answer mention the relevant bias
when it should?") with the technical name in parens after.
- Section 6 cat15 detail — opens with what the test is asking, then
explains training-vs-holdout and overfitting BEFORE introducing those
terms in the score tables.
- Section 7 "What changes for gbrain users" — material impact in plain
language. No code paths, no commit refs.
- Section 8 "How to reproduce" — copy-pasteable commands, plus the
honest caveat about why the corpus is synthetic.
- Section 9 "What this is the first to publish" — honest SOTA framing
about why the category is open, with named comparisons to adjacent
systems and academic work.
- Section 10 "Known gaps" — 7 things this report does NOT measure, with
what would be needed to close each.
- Section 11 "Per-test-case raw data" — points to the dumps and
explains why they're load-bearing for future regression-prevention.
Every section follows the same pattern: plain-English lead, then
introduce the technical term as it comes up, then drill into precision.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
cat14 is the headline eval for gbrain v0.36.1.0's Hindsight calibration wave: does `gbrain think --with-calibration` produce better answers than plain `gbrain think` on questions where the user has a relevant track record?
If this category fails, the calibration wave is theater. If it passes, the wave moves the needle.
What ships
Why the failure-loop matters
This eval is designed for the if-it-fails-fix-the-feature loop. Each axis failure maps to a specific file in gbrain. Full table in eval/data/cat14-calibration/README.md.
Test plan
🤖 Generated with Claude Code