
feat(eval): cat14 calibration A/B — gates v0.36.1.0 calibration wave on advice quality #9

Merged
garrytan merged 12 commits into main from cat14-calibration
May 19, 2026


@garrytan

Owner

Summary

cat14 is the headline eval for gbrain v0.36.1.0's Hindsight calibration wave: does gbrain think --with-calibration produce better answers than plain gbrain think on questions where the user has a relevant track record?

If this category fails, the calibration wave is theater. If it passes, the wave moves the needle.

What ships

  • 8 hand-authored probes across 6 categories (4 positive + 4 negative). Negatives include the strict failure modes that make calibration annoying instead of useful: force-fitting irrelevant bias, leaking academic voice, force-claiming on cold brains.
  • 5-axis Haiku judge with structured tool-use output: mentions_relevant_bias_tag, presents_counter_prior, changes_recommendation_meaningfully, voice_conversational, doesnt_force_fit_irrelevant_bias.
  • Per-probe JSON dumps for the fix-feedback loop. Failing axes write rationale-tagged dumps to eval/reports/cat14-calibration/<probe_id>.json — you read the rationale field, find the failure mode in the README's fix-mapping table, and know exactly which file in gbrain to edit.
  • Gate logic that fails the run if win-rate < 55%, voice < 95%, or force-fit-prevention < 90%. Three thresholds chosen because over-eager bias surfacing is worse than under-claiming.
  • Hermetic smoke test runs without an API key — validates fixture schema, prompt-builder shape, empty-profile cold-brain path, and aggregate gate logic.
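
The gate described in the list above can be sketched as follows. This is a minimal illustration, assuming the aggregate rates are fractions between 0 and 1; the names `AggregateRates`, `GateVerdict`, and `evaluateGate` are hypothetical and are not taken from the runner. Only the three thresholds come from this PR.

```typescript
// Hypothetical aggregate shape; field names are illustrative, not the runner's own.
interface AggregateRates {
  winRateCalibrated: number;   // fraction of probes where the calibrated answer wins
  voiceConversational: number; // fraction of probes passing the voice axis
  forceFitPrevention: number;  // fraction passing doesnt_force_fit_irrelevant_bias
}

interface GateVerdict {
  pass: boolean;
  failures: string[]; // names of the thresholds that were missed
}

// Thresholds as stated in the PR summary.
const THRESHOLDS = { winRate: 0.55, voice: 0.95, forceFit: 0.9 } as const;

function evaluateGate(rates: AggregateRates): GateVerdict {
  const failures: string[] = [];
  if (rates.winRateCalibrated < THRESHOLDS.winRate) failures.push("win_rate_calibrated");
  if (rates.voiceConversational < THRESHOLDS.voice) failures.push("voice_conversational");
  if (rates.forceFitPrevention < THRESHOLDS.forceFit) {
    failures.push("doesnt_force_fit_irrelevant_bias");
  }
  // The run fails if any single threshold is missed.
  return { pass: failures.length === 0, failures };
}
```

Returning the list of missed thresholds, rather than a bare boolean, keeps the verdict usable for the fix-feedback loop: a failing run names the axis to look at.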

Why the failure-loop matters

This eval is designed for the if-it-fails-fix-the-feature loop. Each axis failure maps to a specific file in gbrain. Full table in eval/data/cat14-calibration/README.md.

Test plan

  • bun test eval/runner/cat14-calibration.test.ts — 8/8 hermetic tests pass
  • CAT14_DRY_RUN=1 bun eval/runner/cat14-calibration.ts — full runner exercises end-to-end with stubbed judge
  • Live run pending ANTHROPIC_API_KEY + ~$0.05 spend
  • First baseline vs gbrain v0.36.1.0; threshold-gate verdict drives v0.37 calibration prompt iteration

🤖 Generated with Claude Code

garrytan and others added 12 commits May 15, 2026 18:50
Sidecar to the existing AdapterConfig that lets the matrix runner pass
{embedder, dim, reranker?, searchMode?, cell?} into per-cell adapter
init without env-var parsing inside each adapter.

Pure type + validator only. No imports from gbrain. Subsequent commits
wire vector.ts + vector-grep-rrf-fusion.ts adapters to consume it via
configureGateway() from gbrain/ai/gateway (exposed in gbrain v0.35.1.0,
PR garrytan/gbrain#1055).

9 unit cases pin shape and rejection behavior.
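
A minimal sketch of what a pure type plus validator of this kind might look like. The field names and the `assertEvalAdapterConfig` name come from the commit messages in this PR; the field types, the `provider:model` check, and the error messages are assumptions made for illustration.

```typescript
// Field names follow the commit message; the types are assumed.
interface EvalAdapterConfig {
  embedder: string;    // assumed "provider:model" form, e.g. "voyage:voyage-4-large"
  dim: number;         // embedding dimension
  reranker?: string;
  searchMode?: string;
  cell?: string;       // matrix cell label
}

// Throws on a malformed config; narrows the type for the caller on success.
function assertEvalAdapterConfig(value: unknown): asserts value is EvalAdapterConfig {
  if (typeof value !== "object" || value === null) {
    throw new TypeError("shootout config must be an object");
  }
  const c = value as Record<string, unknown>;
  if (typeof c.embedder !== "string" || !/^[^:\s]+:\S+$/.test(c.embedder)) {
    throw new TypeError("embedder must look like <provider>:<model>");
  }
  if (typeof c.dim !== "number" || !Number.isInteger(c.dim) || c.dim <= 0) {
    throw new TypeError("dim must be a positive integer");
  }
  for (const key of ["reranker", "searchMode", "cell"] as const) {
    if (c[key] !== undefined && typeof c[key] !== "string") {
      throw new TypeError(`${key} must be a string when present`);
    }
  }
}
```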

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When AdapterConfig.shootout is set, the vector adapter:
  1. Validates the typed config (assertEvalAdapterConfig)
  2. Calls configureGateway({embedding_model, embedding_dimensions})
     BEFORE any embed call, so every downstream embedBatch + embed
     routes through the configured provider.
  3. Captures the configured embedder string into the BrainState
     receipt (instead of EMBEDDING_MODEL constant).

When shootout is unset, behavior is identical to v0.35.0 — same OpenAI
text-embedding-3-large + EMBEDDING_MODEL receipt.

Existing test file unchanged (pure cosine helpers).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rough engine config

When AdapterConfig.shootout is set, the hybrid adapter:
  1. assertEvalAdapterConfig() the typed config before any work.
  2. configureGateway({embedding_model, embedding_dimensions}) BEFORE
     spinning up PGLite so the first importFromContent embed call uses
     the per-cell provider.
  3. engine.setConfig('search.mode', shootout.searchMode ?? 'tokenmax')
  4. If shootout.reranker is set:
       search.reranker.enabled = true
       search.reranker.model = shootout.reranker
     Else:
       search.reranker.enabled = false
     The explicit `false` is load-bearing — tokenmax's mode bundle
     defaults reranker=true, so without this the "no-rerank" matrix
     cells would silently rerank with whatever ZE config the env has.

Behavior unchanged when shootout is unset.
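
The reranker branch above can be sketched as a pure function that returns the settings to apply. The config keys and the `tokenmax` default are taken from the commit message; `ShootoutConfig` and `resolveSearchSettings` are hypothetical names, and how the pairs are handed to `engine.setConfig` is left out.

```typescript
// Hypothetical subset of the shootout config used for this sketch.
interface ShootoutConfig {
  embedder: string;
  dim: number;
  reranker?: string;
  searchMode?: string;
}

type Setting = [key: string, value: string | boolean];

// Returns the key/value pairs an adapter could pass to its engine config.
function resolveSearchSettings(shootout: ShootoutConfig): Setting[] {
  const settings: Setting[] = [["search.mode", shootout.searchMode ?? "tokenmax"]];
  if (shootout.reranker) {
    settings.push(["search.reranker.enabled", true]);
    settings.push(["search.reranker.model", shootout.reranker]);
  } else {
    // Always written, never omitted: the tokenmax bundle defaults the reranker on,
    // so leaving the key unset would rerank in the "no-rerank" cells.
    settings.push(["search.reranker.enabled", false]);
  }
  return settings;
}
```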

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
eval/runner/smoke.ts is the pre-flight check the per-cell wrapper runs
BEFORE spending judge tokens on LongMemEval. Three phases:

  Phase 1 — wiring: 5 short queries × embed roundtrip. Asserts the
    returned vector dim matches the configured dim (catches a typo'd
    --dim arg BEFORE the first page is embedded).

  Phase 2 — long-haystack: 1 ~50K-token synthetic payload. Asserts the
    provider handles the long-content path without hitting a token-limit
    error or response cap.

  Phase 3 — rerank payload (only when --reranker is set): 30 docs of
    ~400 tokens each → real reranker call. Asserts response shape and
    that the payload stays under the recipe's max_payload_bytes cap.

Fail-loud env check rejects (provider, missing-key) combos before any
HTTP call so the operator sees the actionable error immediately. Exit
codes: 0 OK, 1 usage, 2 config invalid, 3 missing env, 4 phase failure.
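
A rough sketch of the fail-loud env check and the exit-code table. The exit-code values are the ones listed above; the provider-to-env-var mapping is an assumption inferred from the provider prefixes and key names that appear elsewhere in this PR, and `checkEnv` is a hypothetical name. The environment is passed in as a plain object so the check stays testable without touching the real process environment.

```typescript
// Exit codes as listed in the commit message.
const ExitCode = {
  OK: 0,
  USAGE: 1,
  CONFIG_INVALID: 2,
  MISSING_ENV: 3,
  PHASE_FAILURE: 4,
} as const;

// Assumed mapping from provider prefix to the env var that holds its API key.
const PROVIDER_KEY: Record<string, string> = {
  openai: "OPENAI_API_KEY",
  voyage: "VOYAGE_API_KEY",
  zeroentropyai: "ZEROENTROPY_API_KEY",
};

interface EnvCheck {
  code: number;
  message: string;
}

// Rejects (provider, missing-key) combos before any HTTP call is attempted.
function checkEnv(embedder: string, env: Record<string, string | undefined>): EnvCheck {
  const provider = embedder.split(":", 1)[0] ?? "";
  const keyName = PROVIDER_KEY[provider];
  if (keyName === undefined) {
    return { code: ExitCode.CONFIG_INVALID, message: `unknown provider "${provider}"` };
  }
  if (!env[keyName]) {
    return { code: ExitCode.MISSING_ENV, message: `${keyName} is not set; export it and re-run` };
  }
  return { code: ExitCode.OK, message: "env OK" };
}
```

A caller would typically print `message` and exit with `code` whenever `code` is non-zero.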

Also fixes a latent bug in the two adapter rewrites: configureGateway()
now passes `env: process.env` so the gateway's `resolveAuth` path can
read provider API keys (the previous calls would throw at first embed).

Verified locally against:
  - zeroentropyai:zembed-1 @ 2560 (no rerank)
  - zeroentropyai:zembed-1 @ 2560 + zerank-2
  - voyage:voyage-4-large @ 2048
  - zeroentropyai:zembed-1 @ 1280 + zerank-2 (Matryoshka ablation)
All 4 returned [smoke] OK with the expected vector dims.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the auto-built relational queries with a curated JSON subset
loaded from eval/data/gold/brainbench-<NAME>-subset.json. Used by the
v0.35.1.0 embedder-shootout to run a Cat 13 conceptual-recall cell that
is actually embedder-sensitive — the existing relational corpus is
graph/keyword-dominated (codex outside-voice finding from the plan
review) and produces near-zero signal for embedder choice.

Subset file shape:
  {
    "schema_version": 1,
    "subset": "<name>",
    "queries": [
      { "id": "...", "text": "...", "relevant_chunk_ids": ["..."],
        "inclusion_reason": "..." (optional) },
      ...
    ]
  }

Loaded queries are normalized to the existing Query type and tagged
"embedder-sensitive" so reviewers can spot them in the scorecard.

Run the shootout matrix twice per cell: once without the flag (existing
relational) and once with --include-subset=cat13-embedder. The Cat 13
subset itself lands in the next commit.
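
A minimal sketch of loading and normalizing a subset file of the shape shown above. The JSON field names and the "embedder-sensitive" tag come from this commit message; the normalized `Query` shape here is a hypothetical stand-in for the runner's existing type, and `parseSubset` is an illustrative name.

```typescript
// Shape of one entry in the subset file, per the commit message.
interface SubsetQuery {
  id: string;
  text: string;
  relevant_chunk_ids: string[];
  inclusion_reason?: string;
}

interface SubsetFile {
  schema_version: number;
  subset: string;
  queries: SubsetQuery[];
}

// Hypothetical normalized shape; the runner's real Query type may differ.
interface Query {
  id: string;
  text: string;
  relevant: string[];
  tags: string[];
}

function parseSubset(jsonText: string): Query[] {
  const raw = JSON.parse(jsonText) as Partial<SubsetFile> | null;
  if (raw === null || typeof raw !== "object") throw new Error("subset file must be a JSON object");
  if (raw.schema_version !== 1) throw new Error("unsupported schema_version");
  if (typeof raw.subset !== "string" || !Array.isArray(raw.queries)) {
    throw new Error("subset file needs a subset name and a queries array");
  }
  return raw.queries.map((q, i) => {
    if (typeof q.id !== "string" || typeof q.text !== "string" || !Array.isArray(q.relevant_chunk_ids)) {
      throw new Error(`query at index ${i} is malformed`);
    }
    // Tag every loaded query so it is easy to spot in the scorecard.
    return { id: q.id, text: q.text, relevant: [...q.relevant_chunk_ids], tags: ["embedder-sensitive"] };
  });
}
```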

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tive queries)

50 hand-curated queries across 27 of the 30 concept pages in
eval/data/world-v1/. Curation rule: a query is included only if the
phrasing shares <2 distinct content words with the target page title.
A grep/BM25 adapter would miss most of these (no keyword overlap with
the slug); a competent semantic embedder should find them via the
semantic neighborhood.

Source material: the existing SYNONYMS dictionary in
eval/runner/cat13-conceptual.ts (hand-authored by the BrainBench
creator). This commit lifts the strongest paraphrases into a static
JSON gold file so the multi-adapter runner can score them
deterministically via --include-subset=cat13-embedder.

Each entry carries an inclusion_reason field documenting the curation
rationale. Spot-check: validator script confirms 0 queries share more
than one title content-word with their target, and target-concept
coverage spans 27 distinct concepts (vs 30 total — the three remaining
were 'second-time-founder' / 'carbon-credits' / 'permitting-reform'
whose synonyms shared too much title surface to be embedder-sensitive
under the curation rule).
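
The curation rule (fewer than two distinct content words shared with the target page title) could be checked along these lines. This is a sketch, not the validator script mentioned above: the stopword list, the tokenization, and the function names are all assumptions.

```typescript
// Assumed stopword list; the real validator's notion of "content word" may differ.
const STOPWORDS = new Set([
  "a", "an", "the", "of", "to", "in", "on", "for", "and", "or", "is", "are",
  "what", "how", "why", "who", "do", "does", "with", "that", "this", "it", "as", "by",
]);

// Lowercases, splits on non-alphanumerics (so slugs work), and drops stopwords.
function contentWords(text: string): Set<string> {
  const words = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return new Set(words.filter((w) => !STOPWORDS.has(w)));
}

// Number of distinct content words that appear in both the query and the title.
function sharedContentWords(query: string, title: string): number {
  const titleWords = contentWords(title);
  return Array.from(contentWords(query)).filter((w) => titleWords.has(w)).length;
}

// A query qualifies only if it shares fewer than 2 content words with the title.
function passesCurationRule(query: string, title: string): boolean {
  return sharedContentWords(query, title) < 2;
}
```

Under these assumptions a paraphrase with no title overlap passes, while a query that repeats most of a slug such as `second-time-founder` is rejected.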

Used by the v0.35.1.0 embedder shootout — see
docs/designs/2026_05_EVAL_PLAN.md in gbrain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three operator-facing files for executing Sessions 4-5 of the gbrain
embedder-shootout plan (docs/designs/2026_05_EVAL_PLAN.md):

scripts/run-shootout-phase1.sh — 7 LongMemEval cells, serial.
  Per cell: smoke gate -> gbrain eval longmemeval --mode tokenmax
  --expansion (answer-gen mode) -> evaluate_qa.py scoring. Each cell
  is independently resumable via the v0.35.1.0 --resume-from flag I
  added in PR garrytan/gbrain#1055. Fails loud at startup if any of
  {OPENAI_API_KEY, ANTHROPIC_API_KEY, VOYAGE_API_KEY, ZEROENTROPY_API_KEY}
  is unset or if the LongMemEval evaluator isn't checked out.
  Cost: ~$476. Wallclock: ~10.5h serial.

scripts/run-shootout-phase2.sh — 7 BrainBench cells, serial.
  Per cell: relational corpus then Cat 13 conceptual subset via the
  same HybridNoGraphAdapter wired with the per-cell {embedder, dim,
  reranker} config. References eval/runner/shootout-driver.ts which
  is a Session 5 gap (the wrapper exits non-zero with the actionable
  "add the driver" message if missing — fail-loud rather than
  silent-skip).
  Cost: ~$56. Wallclock: ~3.5h serial.

scripts/RUNBOOK_SHOOTOUT.md — single doc the operator reads to
  execute the shootout. Covers: one-time prereqs (HF dataset, Python
  venv for evaluate_qa.py, API keys), kick-off both phases, what
  lands where, abort/recovery, cost dashboard.

Phase 1 is the long pole and ready to run as-is. Phase 2's driver gap
is the only remaining code task before either phase can actually
execute.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
eval/runner/shootout-driver.ts is the per-cell entry point referenced
by scripts/run-shootout-phase2.sh. Replaces what would otherwise be
multi-adapter.ts wired with per-cell shootout config (the existing
runner is fixed to N adapters × N runs and doesn't accept the
{embedder, dim, reranker} matrix knobs).

Flags:
  --embedder <provider:model>     required
  --dim <N>                       required
  --output <path>                 required (receipt JSON destination)
  --reranker <provider:model>     optional; sets search.reranker.enabled
  --subset <name>                 optional; loads brainbench-<name>-subset.json
  --cell <label>                  optional; A0/B1/C2/... for the receipt
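
A hedged sketch of parsing the flag set listed above. The flag names and which ones are required come from the commit message; the parsing strategy (space-separated `--flag value` pairs only), the `DriverArgs` shape, and `parseDriverArgs` are assumptions for illustration and not the driver's actual code.

```typescript
interface DriverArgs {
  embedder: string;
  dim: number;
  output: string;
  reranker?: string;
  subset?: string;
  cell?: string;
}

const KNOWN_FLAGS = new Set(["embedder", "dim", "output", "reranker", "subset", "cell"]);

// Parses ["--embedder", "voyage:voyage-4-large", "--dim", "2048", ...] style argv.
function parseDriverArgs(argv: string[]): DriverArgs {
  const values = new Map<string, string>();
  for (let i = 0; i < argv.length; i += 2) {
    const flag = argv[i];
    const value = argv[i + 1];
    if (flag === undefined || !flag.startsWith("--") || !KNOWN_FLAGS.has(flag.slice(2))) {
      throw new Error(`unknown flag: ${String(flag)}`);
    }
    if (value === undefined || value.startsWith("--")) {
      throw new Error(`missing value for ${flag}`);
    }
    values.set(flag.slice(2), value);
  }
  const need = (name: string): string => {
    const v = values.get(name);
    if (v === undefined) throw new Error(`missing required flag --${name}`);
    return v;
  };
  const dim = Number(need("dim"));
  if (!Number.isInteger(dim) || dim <= 0) throw new Error("--dim must be a positive integer");
  return {
    embedder: need("embedder"),
    dim,
    output: need("output"),
    reranker: values.get("reranker"),
    subset: values.get("subset"),
    cell: values.get("cell"),
  };
}
```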

Per-cell behavior:
  1. Load eval/data/world-v1 corpus.
  2. Build queries: relational (multi-adapter.ts:buildQueries replicated) OR
     curated subset from eval/data/gold/.
  3. Instantiate HybridNoGraphAdapter and pass AdapterConfig.shootout with
     the matrix config. Adapter's existing logic (from Session 2 commits)
     calls configureGateway() + sets search.reranker.enabled + search.mode.
  4. scoreOneRun semantics: sanitize pages and queries (Day 9 sealed-qrels),
     run adapter.init then per-query adapter.query, sum P@K + R@K + correct.
  5. Emit a deterministic JSON receipt for the comparison writeup.
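
For step 4, the per-query P@K and R@K terms can be sketched as below. This uses the common textbook definitions (precision divides by K, recall divides by the number of relevant items) as an assumption; the function names are illustrative and the driver's actual scoreOneRun may handle ties, duplicates, or empty gold sets differently.

```typescript
// Distinct relevant ids found in the top-k of a ranked result list.
function hitsAtK(ranked: string[], relevant: Set<string>, k: number): number {
  const topK = Array.from(new Set(ranked.slice(0, Math.max(k, 0))));
  return topK.filter((id) => relevant.has(id)).length;
}

// Precision@K: share of the top-k slots filled with relevant results.
function precisionAtK(ranked: string[], relevant: Set<string>, k: number): number {
  if (k <= 0) return 0;
  return hitsAtK(ranked, relevant, k) / k;
}

// Recall@K: share of all relevant results that appear in the top-k.
function recallAtK(ranked: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 0;
  return hitsAtK(ranked, relevant, k) / relevant.size;
}

// Sums per-query scores, mirroring the "sum P@K + R@K" wording above.
function sumScores(
  runs: Array<{ ranked: string[]; relevant: Set<string> }>,
  k: number,
): { precisionSum: number; recallSum: number } {
  let precisionSum = 0;
  let recallSum = 0;
  for (const run of runs) {
    precisionSum += precisionAtK(run.ranked, run.relevant, k);
    recallSum += recallAtK(run.ranked, run.relevant, k);
  }
  return { precisionSum, recallSum };
}
```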

Help screen works without API spend (verified). End-to-end run gated on
the ambient (provider, key) presence per the smoke harness; that's the
operator's gate (see scripts/RUNBOOK_SHOOTOUT.md).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…on wave moves advice quality

The headline product question for gbrain v0.36.1.0's Hindsight calibration
wave: does `gbrain think --with-calibration` produce better answers than
plain `gbrain think` on questions where the user has a relevant track
record?

If this category fails, the entire calibration wave is theater. If it
passes, the wave moves the needle.

What ships:

- eval/data/cat14-calibration/probes.jsonl
  8 hand-authored probes covering 6 categories:
    - calibration-pattern-relevant (positive: bias is relevant to question)
    - calibration-pattern-confidence-boost (positive: strong track record reinforces)
    - calibration-empty-profile (negative: cold brain → behaves like baseline)
    - calibration-bias-irrelevant (negative: don't force-fit geography bias on tech question)
    - calibration-multi-bias (negative: triage which bias applies)
    - calibration-voice-stress (negative: voice stays friend-not-doctor under emotional framing)

- eval/runner/cat14-calibration.ts
  Per-probe flow: build baseline + calibrated system prompts, run both
  through Anthropic chat in parallel, send (question, both answers,
  expected.*) to a Haiku judge with structured tool-use scoring on
  5 axes:
    1. mentions_relevant_bias_tag
    2. presents_counter_prior
    3. changes_recommendation_meaningfully
    4. voice_conversational
    5. doesnt_force_fit_irrelevant_bias

  Per-probe JSON dump for the fix-feedback loop (failing axes drive
  prompt iteration). Aggregate gate logic:
    - win_rate_calibrated >= 55% (calibration_net_negative threshold)
    - voice_conversational >= 95% (cheap axis must not regress)
    - doesnt_force_fit_irrelevant_bias >= 90% (over-eager bias surfacing
      is worse than under-claiming)

- eval/data/cat14-calibration/README.md
  Failure-mode → fix-location playbook. Each axis failure maps to a
  specific file in gbrain (buildThinkSystemPrompt, buildCalibrationBlock,
  voice-gate rubric, profile aggregation math) so a failing run produces
  an actionable next step instead of a metric blob.

- eval/runner/cat14-calibration.test.ts
  Hermetic smoke (no API key required). Tests:
    - fixture loads + schema is well-formed
    - empty-profile path produces baseline-shaped prompt (cold-brain regression)
    - non-empty profile injects bias tags + pattern statements
    - gate logic catches the documented failure modes

Run:
  bun test eval/runner/cat14-calibration.test.ts        # smoke, no API
  CAT14_DRY_RUN=1 bun eval/runner/cat14-calibration.ts  # judge stubbed
  bun eval/runner/cat14-calibration.ts                  # full ~$0.05

Design choices flagged in README:
  - synthetic seeded brain (not user's real one) → known ground truth
  - per-probe JSON dumps even on pass → failure-loop demands per-example visibility
  - negative probes are strict half (force-fit is worse than under-claim)

v2 follow-up (deferred):
  - 30+ probes covering long tail (mounts, cross-brain attribution, abandoned threads)
  - shadow eval against anonymized real-brain export
  - auto-iterate: failing probes → prompt mutations → re-run → delta

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two adds in one commit:

## cat15 — propose_takes precision/recall

Companion to cat14. cat14 measures the OUTPUT side of calibration (does
think --with-calibration produce better answers). cat15 measures the
INPUT side (does extract-takes find the gradeable claims hiding in
prose, so the calibration loop has fuel).

8 probes against the hand-labeled synthetic corpus shipped in gbrain
v0.36.1.0 at test/fixtures/calibration/ (3 training + 5 holdout, 6
genre categories: concept-with-timeline, meeting-notes, daily-journal,
people-page, essay-on-self-calibration, decision-log).

Runner shape:
  - Read page body + ground-truth JSON
  - Call Sonnet with EXTRACT_TAKES_PROMPT, parse JSON array of claims
  - Call Haiku matcher judge with structured tool-use to label TP/FP/FN
  - Compute precision/recall/F1 per probe; aggregate per-split

Gate thresholds:
  - training avg F1 >= 0.85
  - holdout avg F1 >= 0.80
  - train-holdout gap < 0.10 (overfitting signal)
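
A small sketch of the scoring and gate arithmetic described above. The three thresholds are taken from this commit message; treating the gap as training average minus holdout average is an assumption, and `f1Score`, `mean`, and `cat15Gate` are hypothetical names.

```typescript
interface Counts {
  tp: number; // claims extracted and present in ground truth
  fp: number; // claims extracted but not in ground truth
  fn: number; // ground-truth claims that were missed
}

// F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall.
function f1Score(c: Counts): number {
  const denom = 2 * c.tp + c.fp + c.fn;
  return denom === 0 ? 0 : (2 * c.tp) / denom;
}

function mean(xs: number[]): number {
  return xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Gate: training avg F1 >= 0.85, holdout avg F1 >= 0.80, gap < 0.10.
function cat15Gate(trainF1: number[], holdoutF1: number[]): { pass: boolean; gap: number } {
  const train = mean(trainF1);
  const holdout = mean(holdoutF1);
  const gap = train - holdout; // assumed direction: a large positive gap suggests overfitting
  return { pass: train >= 0.85 && holdout >= 0.8 && gap < 0.1, gap };
}
```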

**First live run results (claude-sonnet-4-6 + claude-haiku-4-5-20251001):**

  training avg F1: 0.952  (target 0.85, +10 points)
  holdout  avg F1: 0.922  (target 0.80, +12 points)
  train-holdout gap: 0.03 (no overfitting)
  8/8 probes pass their individual F1 targets
  per-genre F1 floor: 0.80 (people-pages, the hardest genre)

Cost: ~$0.10 per full run (8 pages * 2 LLM calls).

## cat14 iteration log

Evidence the failure-loop methodology actually closes. Three prompt
variants tested same-day:
  - v1 (original 5 rules): 75% win, 100% voice, gate PASS
  - v2 (split bias-tag direction): 63% win, 88% voice, gate FAIL
  - v3 (epistemic humility on both): 75% win, 75% voice + 75% force-fit, gate FAIL

The eval caught two distinct regressions caused by over-correction. The
prompt was reverted to v1; iteration-log.md preserves the loop as
evidence that the 95% voice gate and 90% force-fit gate catch the
failure modes that would make calibration annoying instead of useful.

Lesson logged in the doc: longer prompts leak meta-language; the
simplest working prompt wins; iterations that lose voice ground should
not ship even if they win on other axes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…p eval)

Follows the existing format established by 2026-04-23-cat13-conceptual.md
and 2026-05-07-longmemeval-s.md. Sections:

  1. Headline (75% calibrated win / 0% baseline, 0.92 holdout F1)
  2. What is the calibration loop (4-phase pipeline explained)
  3. Cat 14 detail (advice quality A/B, 5-axis rubric, iteration log)
  4. Cat 15 detail (extraction F1, corpus design, per-genre breakdown)
  5. What changes (4 takeaways)
  6. Methodology (reproduction commands, models, variance bounds)
  7. SOTA framing (honest — no prior published benchmark in this category)
  8. Known gaps to close in v2 (7 items)

SOTA framing is honest: Hindsight introduced the calibration-loop
concept as a skills demo without quantified evaluation. Personal-AI
projects like Mem0, MemPalace, Notion AI don't ship a calibration
loop. Academic work covers human forecaster calibration, not AI
implementations. cat14 + cat15 stake out the category as benchmarkable.

The benchmark report is the load-bearing artifact for anyone evaluating
whether the v0.36.1.0 calibration wave is worth adopting. Linked from
gbrain's CHANGELOG and README "Receipts on the evals" section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User feedback was that the original benchmark report was correct but
inaccessible — it dropped jargon (F1, precision, recall, voice gate,
force-fit, holdout, overfitting, judge model, rubric, axis) without
explaining what any of it means in plain English first.

The CLAUDE.md "lead ELI10, get precise after" rule applies to benchmark
reports as much as it does to CHANGELOG entries. Rewritten following
that structure:

- Section 1 "The plain-English version" — opens with what the feature
  does and what we tested, in 200 words, before any technical term
  appears.
- Section 2 "Why this matters" — explains the gap in current AI memory
  systems in everyday language. Names Hindsight as prior art.
- Section 3 "The headline numbers" — gives the win-rate result, but
  explains F1 / precision / recall in plain English BEFORE the score
  table appears.
- Section 4 "The four pieces of the calibration loop" — walks through
  the pipeline in plain English. Names what each step does and what
  cat14 vs cat15 tests.
- Section 5 cat14 detail — opens with "what we're measuring in plain
  English" + a worked example BEFORE the 6-category test taxonomy and
  5-axis rubric. The rubric axes are listed with their plain-English
  question first ("Does the calibrated answer mention the relevant bias
  when it should?") with the technical name in parens after.
- Section 6 cat15 detail — opens with what the test is asking, then
  explains training-vs-holdout and overfitting BEFORE introducing those
  terms in the score tables.
- Section 7 "What changes for gbrain users" — material impact in plain
  language. No code paths, no commit refs.
- Section 8 "How to reproduce" — copy-pasteable commands, plus the
  honest caveat about why the corpus is synthetic.
- Section 9 "What this is the first to publish" — honest SOTA framing
  about why the category is open, with named comparisons to adjacent
  systems and academic work.
- Section 10 "Known gaps" — 7 things this report does NOT measure, with
  what would be needed to close each.
- Section 11 "Per-test-case raw data" — points to the dumps and
  explains why they're load-bearing for future regression-prevention.

Every section follows the same pattern: plain-English lead, then
introduce the technical term as it comes up, then drill into precision.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan garrytan merged commit 89445dd into main May 19, 2026
@garrytan garrytan deleted the cat14-calibration branch May 19, 2026 02:18