v0.35.1.0: embedder shootout prereqs (pricing + gateway export + --resume-from)#1055

Merged
garrytan merged 5 commits into master from garrytan/lyon-v3 on May 16, 2026

Conversation

@garrytan

Owner

Summary

Infrastructure release setting up the OpenAI vs Voyage vs ZeroEntropy comparison documented in docs/designs/2026_05_EVAL_PLAN.md. Three additive changes; no breaking changes; no migration required.

  • Pricing: voyage:voyage-4-large ($0.18/MTok) + zeroentropyai:zembed-1 ($0.05/MTok) added to EMBEDDING_PRICING. gbrain upgrade's cost-estimate prompt no longer falls through to "estimate unavailable" for these two models.
  • Exports: gbrain/ai/gateway is now in the package.json exports map (17→18 entries). External eval consumers can call configureGateway() directly without forking gbrain.
  • --resume-from: gbrain eval longmemeval survives mid-run aborts. Re-run with --resume-from <jsonl> to skip already-answered question_ids and continue in append mode.

Bisect-friendly commit chain

5 commits, each independently green:

  1. docs(designs): 2026-05 embedder shootout eval plan — the plan that drove this release.
  2. feat(pricing): add voyage-4-large + zembed-1 to EMBEDDING_PRICING — 11 unit test cases.
  3. feat(exports): expose gbrain/ai/gateway with canary test — exports count bumped 17→18, canary symbols pinned.
  4. feat(eval): add --resume-from <jsonl> to gbrain eval longmemeval — 6 new test cases; new exported loadResumeSet() helper.
  5. chore: v0.35.1.0 — VERSION + package.json + CHANGELOG rollup.

Test plan

  • bun run verify — green (typecheck + all 12 pre-checks)
  • bun run test — 6573 pass, 0 fail, 0 skip
  • Trio audit — VERSION / package.json / CHANGELOG all show 0.35.1.0
  • New test/embedding-pricing.test.ts — 11 cases, 46 expect() calls
  • test/eval-longmemeval.test.ts — 18 cases (12 prior + 6 new), all green

🤖 Generated with Claude Code

garrytan and others added 5 commits May 15, 2026 18:15

docs(designs): 2026-05 embedder shootout eval plan

Adds docs/designs/2026_05_EVAL_PLAN.md — the approved plan + 6 Conductor session
briefs for the OpenAI vs Voyage vs ZeroEntropy embedder comparison.

Why: produce a publishable comparison report for v0.35.x release notes pinning
"which embedder wins, and does zerank-2 carry the win for ZeroEntropy" against
public LongMemEval + in-house BrainBench.

Each session brief is self-contained — repo, branch, commits, verify, ship,
deliverable, hand-off. One section is stewarded per Conductor session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(pricing): add voyage-4-large + zembed-1 to EMBEDDING_PRICING

v0.35.0.0 shipped ZeroEntropy zembed-1 + zerank-2 reranker support and
expanded the Voyage allow-list to include voyage-4-large. The pricing
table missed both, so `gbrain upgrade`'s post-upgrade reembed prompt
silently fell back to "estimate unavailable" for users on these models.

- voyage:voyage-4-large @ $0.18/MTok (same as voyage-3-large)
- zeroentropyai:zembed-1 @ $0.05/MTok

New test file pins both entries plus the openai/voyage-3-large baselines,
case-insensitive provider matching, bare-model openai-default fallback,
table integrity (lowercase providers, finite non-negative prices), and
the estimateCostFromChars approximation. 11 cases, 46 expect() calls.
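
The lookup rules this commit pins (case-insensitive provider, bare model defaulting to openai, character-based cost approximation) can be sketched roughly as below. This is an illustration only: the function names, the 4-characters-per-token ratio, and the openai baseline price are assumptions, not gbrain's actual EMBEDDING_PRICING code. The two new prices and the voyage-3-large price are taken from this commit message.

```typescript
// Hypothetical sketch. Names and the chars-per-token ratio are assumptions,
// not gbrain's actual implementation.
const PRICING_USD_PER_MTOK: Record<string, number> = {
  "openai:text-embedding-3-large": 0.13, // assumed baseline value for illustration
  "voyage:voyage-3-large": 0.18,
  "voyage:voyage-4-large": 0.18, // from this commit
  "zeroentropyai:zembed-1": 0.05, // from this commit
};

// Resolve "provider:model" to a price. Provider matching is case-insensitive,
// and a bare model name (no "provider:" prefix) defaults to openai.
function lookupPricePerMTok(model: string): number | undefined {
  const sep = model.indexOf(":");
  const provider = sep === -1 ? "openai" : model.slice(0, sep).toLowerCase();
  const name = sep === -1 ? model : model.slice(sep + 1);
  return PRICING_USD_PER_MTOK[`${provider}:${name}`];
}

// Approximate cost from a character count. Returns undefined when the model
// is missing from the table, which corresponds to "estimate unavailable".
function estimateCostUsd(
  model: string,
  chars: number,
  charsPerToken = 4, // assumed ratio
): number | undefined {
  const price = lookupPricePerMTok(model);
  if (price === undefined) return undefined;
  return (chars / charsPerToken / 1_000_000) * price;
}
```

Returning `undefined` for unknown models, rather than 0, keeps "no price on file" distinguishable from "free".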

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(exports): expose gbrain/ai/gateway with canary test

Adds ./ai/gateway to the package.json exports map so external eval
consumers (notably gbrain-evals, the sibling repo running the embedder
shootout in docs/designs/2026_05_EVAL_PLAN.md) can call configureGateway
directly to swap embedding providers per cell.

Why: pre-v0.35.1.0, gbrain-evals adapters hardcoded gbrain/embedding,
which means every retrieval adapter was OpenAI-only. The newly-exposed
gateway lets adapters route through Voyage and ZeroEntropy without
forking gbrain or duplicating the recipe wiring.

- package.json: add "./ai/gateway" -> "./src/core/ai/gateway.ts"
- scripts/check-exports-count.sh: bump expected count 17 -> 18
- test/public-exports.test.ts: add canary pinning configureGateway + embed,
  bump expected count assertion

Pre-existing import-resolution failures in this test file (16 on master)
are unrelated to this change — they're a longstanding Bun package
self-import behavior. The count + EXPECTED_EXPORTS list-match assertions
both pass cleanly.
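
A count-plus-list-match check of the kind described above might look like the following sketch. The function name and the sample map are hypothetical. Only the `./ai/gateway` key and its target path come from this commit; the other entry is a placeholder standing in for the remaining 17 exports.

```typescript
// Hypothetical sketch of an exports canary; not the actual test code.
type ExportsMap = Record<string, string>;

// Returns a list of human-readable problems; an empty list means the map
// matches the expected keys exactly (same count, same names).
function checkExports(exportsMap: ExportsMap, expectedKeys: string[]): string[] {
  const problems: string[] = [];
  const actual = Object.keys(exportsMap);
  if (actual.length !== expectedKeys.length) {
    problems.push(`expected ${expectedKeys.length} exports, found ${actual.length}`);
  }
  for (const key of expectedKeys) {
    if (!(key in exportsMap)) problems.push(`missing export: ${key}`);
  }
  for (const key of actual) {
    if (!expectedKeys.includes(key)) problems.push(`unexpected export: ${key}`);
  }
  return problems;
}

// Two-entry stand-in for the real map. The "./embedding" target is a placeholder.
const sampleExports: ExportsMap = {
  "./embedding": "./src/placeholder/embedding.ts",
  "./ai/gateway": "./src/core/ai/gateway.ts",
};
```

Checking both the count and the key list means an accidental addition fails as loudly as an accidental removal.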

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(eval): add --resume-from <jsonl> to gbrain eval longmemeval

Multi-cell embedder shootouts spend $50+/cell on the gpt-4o judge after
gbrain emits hypotheses. A mid-run abort (rate-limit, cost-cap, OS
interrupt, SIGKILL) previously meant re-paying the full cell. This flag
makes those aborts cheap: re-invoke with --resume-from pointed at the
partial JSONL and only the unanswered question_ids re-run.

Behavior:
- Read question_ids from the file; skip them on this run.
- Rows with non-empty hypothesis count as done.
- Rows with hypothesis="" AND an error field are NOT skipped (retry case
  for per-question failures recorded by the existing try/catch).
- Corrupt trailing lines (SIGKILL'd writer mid-line) are skipped with a
  warning on stderr.
- When --resume-from path == --output path, the output emitter opens the
  file in append mode instead of truncating, so the existing rows survive.
- Empty resume case (all questions already done) returns immediately
  without spinning up the brain or calling the client.

New exported helper loadResumeSet() makes the parser unit-testable.
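
The resume rules listed above can be sketched as a pure parser over the JSONL text. This is not the real loadResumeSet(); the function names are hypothetical, and two behaviors are assumptions: a row with an empty hypothesis and no error field is treated as done, and the append-mode decision compares paths as plain strings.

```typescript
// Hypothetical sketch of the resume rules; not the actual loadResumeSet().
interface ResumeRow {
  question_id?: unknown;
  hypothesis?: unknown;
  error?: unknown;
}

// Returns the question_ids that count as already answered.
function parseResumeSet(
  jsonl: string,
  warn: (msg: string) => void = console.warn, // console.warn writes to stderr in Node and Bun
): Set<string> {
  const done = new Set<string>();
  jsonl.split("\n").forEach((line, i) => {
    if (line.trim() === "") return;
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      // Typically a final line truncated by a killed writer.
      warn(`resume: skipping unparseable line ${i + 1}`);
      return;
    }
    if (typeof parsed !== "object" || parsed === null) return;
    const row = parsed as ResumeRow;
    if (typeof row.question_id !== "string") return;
    const hypothesis = typeof row.hypothesis === "string" ? row.hypothesis : "";
    // Empty hypothesis plus an error field marks a failed question: retry it.
    const failed = hypothesis === "" && row.error !== undefined;
    if (!failed) done.add(row.question_id);
  });
  return done;
}

// Append when resuming into the same file, otherwise truncate.
// Assumption: paths are compared as given, without normalization.
function outputOpenFlag(resumeFrom: string | undefined, output: string): "a" | "w" {
  return resumeFrom !== undefined && resumeFrom === output ? "a" : "w";
}
```

Keeping the parser free of file I/O is one way to make the retry and truncation cases easy to unit-test with inline strings.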

6 new test cases pinning:
- File-not-found returns empty set
- Well-formed JSONL load
- Error-row retry semantics (empty hypothesis + error -> not in set)
- Truncated final line recovery
- End-to-end resume against the 5-question mini fixture
- All-done early-return (stub client must NOT be invoked)

All 18 cases in test/eval-longmemeval.test.ts green; bun run typecheck
clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore: v0.35.1.0

Bumps VERSION + package.json + CHANGELOG entry for the embedder-shootout
prereq release. Three additive changes from the prior 4 commits:

- pricing: voyage-4-large + zembed-1 entries
- exports: gbrain/ai/gateway is now public
- eval: gbrain eval longmemeval --resume-from <jsonl>

Each commit on this branch is independently bisect-friendly and CI-green;
the CHANGELOG entry is the user-facing rollup. No migrations, no breaking
changes — the gateway export expands the surface, the resume-from flag is
additive, the pricing patch only changes "estimate unavailable" -> a real
dollar figure for two specific models.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan garrytan merged commit 3933eb6 into master May 16, 2026
7 checks passed
@garrytan garrytan deleted the garrytan/lyon-v3 branch May 16, 2026 01:54
garrytan added a commit to garrytan/gbrain-evals that referenced this pull request May 16, 2026
Three operator-facing files for executing Sessions 4-5 of the gbrain
embedder-shootout plan (docs/designs/2026_05_EVAL_PLAN.md):

scripts/run-shootout-phase1.sh — 7 LongMemEval cells, serial.
  Per cell: smoke gate -> gbrain eval longmemeval --mode tokenmax
  --expansion (answer-gen mode) -> evaluate_qa.py scoring. Each cell
  is independently resumable via the v0.35.1.0 --resume-from flag I
  added in PR garrytan/gbrain#1055. Fails loud at startup if any of
  {OPENAI_API_KEY, ANTHROPIC_API_KEY, VOYAGE_API_KEY, ZEROENTROPY_API_KEY}
  is unset or if the LongMemEval evaluator isn't checked out.
  Cost: ~$476. Wallclock: ~10.5h serial.

scripts/run-shootout-phase2.sh — 7 BrainBench cells, serial.
  Per cell: relational corpus then Cat 13 conceptual subset via the
  same HybridNoGraphAdapter wired with the per-cell {embedder, dim,
  reranker} config. References eval/runner/shootout-driver.ts which
  is a Session 5 gap (the wrapper exits non-zero with the actionable
  "add the driver" message if missing — fail-loud rather than
  silent-skip).
  Cost: ~$56. Wallclock: ~3.5h serial.

scripts/RUNBOOK_SHOOTOUT.md — single doc the operator reads to
  execute the shootout. Covers: one-time prereqs (HF dataset, Python
  venv for evaluate_qa.py, API keys), kick-off both phases, what
  lands where, abort/recovery, cost dashboard.

Phase 1 is the long pole and ready to run as-is. Phase 2's driver gap
is the only remaining code task before either phase can actually
execute.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan added a commit to garrytan/gbrain-evals that referenced this pull request May 19, 2026
…on advice quality (#9)

* feat(eval): typed EvalAdapterConfig for embedder-shootout matrix

Sidecar to the existing AdapterConfig that lets the matrix runner pass
{embedder, dim, reranker?, searchMode?, cell?} into per-cell adapter
init without env-var parsing inside each adapter.

Pure type + validator only. No imports from gbrain. Subsequent commits
wire vector.ts + vector-grep-rrf-fusion.ts adapters to consume it via
configureGateway() from gbrain/ai/gateway (exposed in gbrain v0.35.1.0,
PR garrytan/gbrain#1055).

9 unit cases pin shape and rejection behavior.
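
A pure type-plus-validator of this shape could look like the sketch below. The field names come from the commit message above; the validation rules (provider prefix required, positive integer dim, non-empty optional strings) and the function names are assumptions for illustration, not the shipped validator.

```typescript
// Hypothetical sketch; the validation rules are assumed, not taken from the repo.
interface EvalAdapterConfig {
  embedder: string; // "provider:model"
  dim: number;
  reranker?: string;
  searchMode?: string;
  cell?: string;
}

// Read an optional string field, rejecting empty or non-string values.
function optionalString(obj: Record<string, unknown>, key: string): string | undefined {
  const field = obj[key];
  if (field === undefined) return undefined;
  if (typeof field !== "string" || field === "") {
    throw new TypeError(`${key} must be a non-empty string when set`);
  }
  return field;
}

// Validate an untyped value and return it as a typed config, or throw.
function validateEvalAdapterConfig(value: unknown): EvalAdapterConfig {
  if (typeof value !== "object" || value === null) {
    throw new TypeError("config must be an object");
  }
  const obj = value as Record<string, unknown>;
  const embedder = obj["embedder"];
  if (typeof embedder !== "string" || !/^[^:\s]+:\S+$/.test(embedder)) {
    throw new TypeError("embedder must look like provider:model");
  }
  const dim = obj["dim"];
  if (typeof dim !== "number" || !Number.isInteger(dim) || dim <= 0) {
    throw new TypeError("dim must be a positive integer");
  }
  return {
    embedder,
    dim,
    reranker: optionalString(obj, "reranker"),
    searchMode: optionalString(obj, "searchMode"),
    cell: optionalString(obj, "cell"),
  };
}
```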

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(adapter): vector accepts EvalAdapterConfig + configureGateway swap

When AdapterConfig.shootout is set, the vector adapter:
  1. Validates the typed config (assertEvalAdapterConfig)
  2. Calls configureGateway({embedding_model, embedding_dimensions})
     BEFORE any embed call, so every downstream embedBatch + embed
     routes through the configured provider.
  3. Captures the configured embedder string into the BrainState
     receipt (instead of EMBEDDING_MODEL constant).

When shootout is unset, behavior is identical to v0.35.0 — same OpenAI
text-embedding-3-large + EMBEDDING_MODEL receipt.

Existing test file unchanged (pure cosine helpers).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(adapter): vector-grep-rrf-fusion wires reranker + search.mode through engine config

When AdapterConfig.shootout is set, the hybrid adapter:
  1. assertEvalAdapterConfig() the typed config before any work.
  2. configureGateway({embedding_model, embedding_dimensions}) BEFORE
     spinning up PGLite so the first importFromContent embed call uses
     the per-cell provider.
  3. engine.setConfig('search.mode', shootout.searchMode ?? 'tokenmax')
  4. If shootout.reranker is set:
       search.reranker.enabled = true
       search.reranker.model = shootout.reranker
     Else:
       search.reranker.enabled = false
     The explicit `false` is load-bearing: tokenmax's mode bundle
     defaults reranker=true, so without this the "no-rerank" matrix
     cells would silently rerank using whatever ZE config the env has.

Behavior unchanged when shootout is unset.
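
The branch in steps 3 and 4 can be sketched as a pure function that returns the settings to apply. The config keys are the ones named above; the function name, the input type, and the return shape are hypothetical, and how the real adapter passes these values to engine.setConfig is not shown in this commit.

```typescript
// Hypothetical sketch of the settings derivation; not the adapter's real code.
interface ShootoutKnobs {
  reranker?: string;
  searchMode?: string;
}

function engineSettingsFor(shootout: ShootoutKnobs): Record<string, string | boolean> {
  const settings: Record<string, string | boolean> = {
    "search.mode": shootout.searchMode ?? "tokenmax",
  };
  if (shootout.reranker !== undefined) {
    settings["search.reranker.enabled"] = true;
    settings["search.reranker.model"] = shootout.reranker;
  } else {
    // Written explicitly so a mode default of reranker=true cannot leak in.
    settings["search.reranker.enabled"] = false;
  }
  return settings;
}
```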

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(smoke): 3-phase smoke harness gates each matrix cell

eval/runner/smoke.ts is the pre-flight check the per-cell wrapper runs
BEFORE spending judge tokens on LongMemEval. Three phases:

  Phase 1 — wiring: 5 short queries × embed roundtrip. Asserts the
    returned vector dim matches the configured dim (catches a typo'd
    --dim arg BEFORE the first page is embedded).

  Phase 2 — long-haystack: 1 ~50K-token synthetic payload. Asserts the
    provider handles the long-content path without hitting a token-limit
    error or response cap.

  Phase 3 — rerank payload (only when --reranker is set): 30 docs of
    ~400 tokens each → real reranker call. Asserts response shape and
    that the payload stays under the recipe's max_payload_bytes cap.

Fail-loud env check rejects (provider, missing-key) combos before any
HTTP call so the operator sees the actionable error immediately. Exit
codes: 0 OK, 1 usage, 2 config invalid, 3 missing env, 4 phase failure.
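
The exit-code contract and the pre-HTTP checks can be sketched as follows. The exit codes are the ones listed above and the key names appear elsewhere in this thread; the provider-to-key mapping, the function names, and the decision to classify an unknown provider as "config invalid" are assumptions.

```typescript
// Hypothetical sketch of the preflight logic; not the contents of smoke.ts.
const EXIT = { OK: 0, USAGE: 1, CONFIG_INVALID: 2, MISSING_ENV: 3, PHASE_FAILURE: 4 } as const;
type ExitCode = (typeof EXIT)[keyof typeof EXIT];

// Assumed mapping from provider prefix to the env var it needs.
const REQUIRED_KEY: Record<string, string> = {
  openai: "OPENAI_API_KEY",
  voyage: "VOYAGE_API_KEY",
  zeroentropyai: "ZEROENTROPY_API_KEY",
};

// Runs before any HTTP call: reject bad config, then missing keys.
function preflight(
  embedder: string,
  dim: number,
  env: Record<string, string | undefined>,
): ExitCode {
  const sep = embedder.indexOf(":");
  const provider = sep === -1 ? "" : embedder.slice(0, sep).toLowerCase();
  const keyName = REQUIRED_KEY[provider];
  if (keyName === undefined || !Number.isInteger(dim) || dim <= 0) {
    return EXIT.CONFIG_INVALID;
  }
  if (!env[keyName]) return EXIT.MISSING_ENV;
  return EXIT.OK;
}

// Phase 1 assertion: the returned vector length must match the configured dim.
function checkDim(vector: number[], expectedDim: number): ExitCode {
  return vector.length === expectedDim ? EXIT.OK : EXIT.PHASE_FAILURE;
}
```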

Also fixes a latent bug in the two adapter rewrites: configureGateway()
now passes `env: process.env` so the gateway's `resolveAuth` path can
read provider API keys (the previous calls would throw at first embed).

Verified locally against:
  - zeroentropyai:zembed-1 @ 2560 (no rerank)
  - zeroentropyai:zembed-1 @ 2560 + zerank-2
  - voyage:voyage-4-large @ 2048
  - zeroentropyai:zembed-1 @ 1280 + zerank-2 (Matryoshka ablation)
All 4 returned [smoke] OK with the expected vector dims.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(runner): --include-subset NAME flag for BrainBench

Replaces the auto-built relational queries with a curated JSON subset
loaded from eval/data/gold/brainbench-<NAME>-subset.json. Used by the
v0.35.1.0 embedder-shootout to run a Cat 13 conceptual-recall cell that
is actually embedder-sensitive — the existing relational corpus is
graph/keyword-dominated (codex outside-voice finding from the plan
review) and produces near-zero signal for embedder choice.

Subset file shape:
  {
    "schema_version": 1,
    "subset": "<name>",
    "queries": [
      { "id": "...", "text": "...", "relevant_chunk_ids": ["..."],
        "inclusion_reason": "..." (optional) },
      ...
    ]
  }

Loaded queries are normalized to the existing Query type and tagged
"embedder-sensitive" so reviewers can spot them in the scorecard.

Run the shootout matrix twice per cell: once without the flag (existing
relational) and once with --include-subset=cat13-embedder. The Cat 13
subset itself lands in the next commit.
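
The subset file shape and the normalization step can be sketched as below. The file fields follow the shape shown above; the `Query` fields, the function name, and the check that the file's `subset` matches the requested name are assumptions, since the runner's real types are not shown in this commit.

```typescript
// Hypothetical sketch of a subset loader; the Query type here is assumed.
interface SubsetQuery {
  id: string;
  text: string;
  relevant_chunk_ids: string[];
  inclusion_reason?: string;
}

interface SubsetFile {
  schema_version: number;
  subset: string;
  queries: SubsetQuery[];
}

interface Query {
  id: string;
  text: string;
  relevant: string[];
  tags: string[];
}

// Parse the JSON text, validate the envelope, and tag each query.
function loadSubset(jsonText: string, expectedName: string): Query[] {
  const file = JSON.parse(jsonText) as SubsetFile;
  if (file.schema_version !== 1) {
    throw new Error(`unsupported schema_version: ${String(file.schema_version)}`);
  }
  if (file.subset !== expectedName) {
    throw new Error(`subset name mismatch: wanted ${expectedName}, got ${String(file.subset)}`);
  }
  if (!Array.isArray(file.queries)) throw new Error("queries must be an array");
  return file.queries.map((q) => {
    if (typeof q.id !== "string" || typeof q.text !== "string" || !Array.isArray(q.relevant_chunk_ids)) {
      throw new Error("malformed query entry");
    }
    return {
      id: q.id,
      text: q.text,
      relevant: [...q.relevant_chunk_ids],
      tags: ["embedder-sensitive"],
    };
  });
}
```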

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(eval): curate Cat 13 conceptual-recall subset (50 embedder-sensitive queries)

50 hand-curated queries across 27 of the 30 concept pages in
eval/data/world-v1/. Curation rule: a query is included only if the
phrasing shares <2 distinct content words with the target page title.
A grep/BM25 adapter would miss most of these (no keyword overlap with
the slug); a competent semantic embedder should find them via the
semantic neighborhood.

Source material: the existing SYNONYMS dictionary in
eval/runner/cat13-conceptual.ts (hand-authored by the BrainBench
creator). This commit lifts the strongest paraphrases into a static
JSON gold file so the multi-adapter runner can score them
deterministically via --include-subset=cat13-embedder.

Each entry carries an inclusion_reason field documenting the curation
rationale. Spot-check: validator script confirms 0 queries share more
than one title content-word with their target, and target-concept
coverage spans 27 distinct concepts (vs 30 total — the three remaining
were 'second-time-founder' / 'carbon-credits' / 'permitting-reform'
whose synonyms shared too much title surface to be embedder-sensitive
under the curation rule).
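
The curation rule (fewer than 2 distinct content words shared with the target title) can be sketched as a small overlap check. The stopword list, the tokenization, and the function names are assumptions; the validator script mentioned above may define "content word" differently.

```typescript
// Hypothetical sketch of the overlap rule; stopwords and tokenizer are assumed.
const STOPWORDS = new Set([
  "a", "an", "the", "of", "and", "or", "to", "in", "on", "for", "with",
  "is", "are", "how", "what", "why", "do", "does", "their",
]);

// Lowercase, split on non-alphanumerics (so slugs split on hyphens),
// drop stopwords, and de-duplicate.
function contentWords(text: string): string[] {
  const words = text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((w) => w.length > 0 && !STOPWORDS.has(w));
  return Array.from(new Set(words));
}

// Count distinct content words that appear in both the query and the title.
function sharedContentWords(query: string, title: string): number {
  const queryWords = new Set(contentWords(query));
  return contentWords(title).filter((w) => queryWords.has(w)).length;
}

// A query qualifies only when it shares fewer than 2 content words.
function isEmbedderSensitive(query: string, title: string): boolean {
  return sharedContentWords(query, title) < 2;
}
```

For instance, under these assumptions an invented query such as "people starting another company" shares no content words with the slug `second-time-founder` and would qualify, while "advice for a second time entrepreneur" shares two and would not.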

Used by the v0.35.1.0 embedder shootout — see
docs/designs/2026_05_EVAL_PLAN.md in gbrain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(shootout): wrapper scripts + RUNBOOK for Phase 1 + Phase 2

Three operator-facing files for executing Sessions 4-5 of the gbrain
embedder-shootout plan (docs/designs/2026_05_EVAL_PLAN.md):

scripts/run-shootout-phase1.sh — 7 LongMemEval cells, serial.
  Per cell: smoke gate -> gbrain eval longmemeval --mode tokenmax
  --expansion (answer-gen mode) -> evaluate_qa.py scoring. Each cell
  is independently resumable via the v0.35.1.0 --resume-from flag I
  added in PR garrytan/gbrain#1055. Fails loud at startup if any of
  {OPENAI_API_KEY, ANTHROPIC_API_KEY, VOYAGE_API_KEY, ZEROENTROPY_API_KEY}
  is unset or if the LongMemEval evaluator isn't checked out.
  Cost: ~$476. Wallclock: ~10.5h serial.

scripts/run-shootout-phase2.sh — 7 BrainBench cells, serial.
  Per cell: relational corpus then Cat 13 conceptual subset via the
  same HybridNoGraphAdapter wired with the per-cell {embedder, dim,
  reranker} config. References eval/runner/shootout-driver.ts which
  is a Session 5 gap (the wrapper exits non-zero with the actionable
  "add the driver" message if missing — fail-loud rather than
  silent-skip).
  Cost: ~$56. Wallclock: ~3.5h serial.

scripts/RUNBOOK_SHOOTOUT.md — single doc the operator reads to
  execute the shootout. Covers: one-time prereqs (HF dataset, Python
  venv for evaluate_qa.py, API keys), kick-off both phases, what
  lands where, abort/recovery, cost dashboard.

Phase 1 is the long pole and ready to run as-is. Phase 2's driver gap
is the only remaining code task before either phase can actually
execute.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(driver): single-cell shootout driver for Phase 2

eval/runner/shootout-driver.ts is the per-cell entry point referenced
by scripts/run-shootout-phase2.sh. Replaces what would otherwise be
multi-adapter.ts wired with per-cell shootout config (the existing
runner is fixed to N adapters × N runs and doesn't accept the
{embedder, dim, reranker} matrix knobs).

Flags:
  --embedder <provider:model>     required
  --dim <N>                       required
  --output <path>                 required (receipt JSON destination)
  --reranker <provider:model>     optional; sets search.reranker.enabled
  --subset <name>                 optional; loads brainbench-<name>-subset.json
  --cell <label>                  optional; A0/B1/C2/... for the receipt

Per-cell behavior:
  1. Load eval/data/world-v1 corpus.
  2. Build queries: relational (multi-adapter.ts:buildQueries replicated) OR
     curated subset from eval/data/gold/.
  3. Instantiate HybridNoGraphAdapter and pass AdapterConfig.shootout with
     the matrix config. Adapter's existing logic (from Session 2 commits)
     calls configureGateway() + sets search.reranker.enabled + search.mode.
  4. scoreOneRun semantics: sanitize pages and queries (Day 9 sealed-qrels),
     run adapter.init then per-query adapter.query, sum P@K + R@K + correct.
  5. Emit a deterministic JSON receipt for the comparison writeup.
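
For readers unfamiliar with the metrics in step 4, P@K and R@K are conventionally computed as in the sketch below. This uses the standard textbook definitions and assumes the ranked list has no duplicate ids; the driver's actual scoreOneRun logic, including how "correct" is defined, is not shown in this commit.

```typescript
// Standard precision/recall at K; a sketch, not the driver's scoring code.

// Number of relevant ids among the top-k results.
function hitsAtK(ranked: string[], relevant: Set<string>, k: number): number {
  return ranked.slice(0, Math.max(0, k)).filter((id) => relevant.has(id)).length;
}

// Fraction of the top-k slots filled with relevant results.
function precisionAtK(ranked: string[], relevant: Set<string>, k: number): number {
  if (k <= 0) return 0;
  return hitsAtK(ranked, relevant, k) / k;
}

// Fraction of all relevant items that appear in the top-k results.
function recallAtK(ranked: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 0;
  return hitsAtK(ranked, relevant, k) / relevant.size;
}
```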

Help screen works without API spend (verified). End-to-end run gated on
the ambient (provider, key) presence per the smoke harness; that's the
operator's gate (see scripts/RUNBOOK_SHOOTOUT.md).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(eval): cat14 calibration A/B — gates whether v0.36.1.0 calibration wave moves advice quality

The headline product question for gbrain v0.36.1.0's Hindsight calibration
wave: does `gbrain think --with-calibration` produce better answers than
plain `gbrain think` on questions where the user has a relevant track
record?

If this category fails, the entire calibration wave is theater. If it
passes, the wave moves the needle.

What ships:

- eval/data/cat14-calibration/probes.jsonl
  8 hand-authored probes covering 6 categories:
    - calibration-pattern-relevant (positive: bias is relevant to question)
    - calibration-pattern-confidence-boost (positive: strong track record reinforces)
    - calibration-empty-profile (negative: cold brain → behaves like baseline)
    - calibration-bias-irrelevant (negative: don't force-fit geography bias on tech question)
    - calibration-multi-bias (negative: triage which bias applies)
    - calibration-voice-stress (negative: voice stays friend-not-doctor under emotional framing)

- eval/runner/cat14-calibration.ts
  Per-probe flow: build baseline + calibrated system prompts, run both
  through Anthropic chat in parallel, send (question, both answers,
  expected.*) to a Haiku judge with structured tool-use scoring on
  5 axes:
    1. mentions_relevant_bias_tag
    2. presents_counter_prior
    3. changes_recommendation_meaningfully
    4. voice_conversational
    5. doesnt_force_fit_irrelevant_bias

  Per-probe JSON dump for the fix-feedback loop (failing axes drive
  prompt iteration). Aggregate gate logic:
    - win_rate_calibrated >= 55% (calibration_net_negative threshold)
    - voice_conversational >= 95% (cheap axis must not regress)
    - doesnt_force_fit_irrelevant_bias >= 90% (over-eager bias surfacing
      is worse than under-claiming)

- eval/data/cat14-calibration/README.md
  Failure-mode → fix-location playbook. Each axis failure maps to a
  specific file in gbrain (buildThinkSystemPrompt, buildCalibrationBlock,
  voice-gate rubric, profile aggregation math) so a failing run produces
  an actionable next step instead of a metric blob.

- eval/runner/cat14-calibration.test.ts
  Hermetic smoke (no API key required). Tests:
    - fixture loads + schema is well-formed
    - empty-profile path produces baseline-shaped prompt (cold-brain regression)
    - non-empty profile injects bias tags + pattern statements
    - gate logic catches the documented failure modes
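
The aggregate gate described for eval/runner/cat14-calibration.ts can be sketched as three threshold checks. The thresholds are the ones listed above; the type, field names, and function name are hypothetical, and rates are expressed here as fractions between 0 and 1.

```typescript
// Hypothetical sketch of the cat14 aggregate gate; not the runner's code.
interface Cat14Aggregate {
  winRateCalibrated: number;   // fraction of probes where the calibrated answer wins
  voiceConversational: number; // fraction passing the voice axis
  doesntForceFit: number;      // fraction passing the force-fit axis
}

function cat14Gate(agg: Cat14Aggregate): { pass: boolean; failures: string[] } {
  const failures: string[] = [];
  if (agg.winRateCalibrated < 0.55) failures.push("win_rate_calibrated < 55%");
  if (agg.voiceConversational < 0.95) failures.push("voice_conversational < 95%");
  if (agg.doesntForceFit < 0.9) failures.push("doesnt_force_fit_irrelevant_bias < 90%");
  return { pass: failures.length === 0, failures };
}
```

Returning the list of failed thresholds, rather than a bare boolean, fits the stated goal of turning a failing run into an actionable next step.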

Run:
  bun test eval/runner/cat14-calibration.test.ts        # smoke, no API
  CAT14_DRY_RUN=1 bun eval/runner/cat14-calibration.ts  # judge stubbed
  bun eval/runner/cat14-calibration.ts                  # full ~$0.05

Design choices flagged in README:
  - synthetic seeded brain (not user's real one) → known ground truth
  - per-probe JSON dumps even on pass → failure-loop demands per-example visibility
  - negative probes are strict half (force-fit is worse than under-claim)

v2 follow-up (deferred):
  - 30+ probes covering long tail (mounts, cross-brain attribution, abandoned threads)
  - shadow eval against anonymized real-brain export
  - auto-iterate: failing probes → prompt mutations → re-run → delta

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(eval): cat15 propose_takes F1 + cat14 iteration log

Two adds in one commit:

## cat15 — propose_takes precision/recall

Companion to cat14. cat14 measures the OUTPUT side of calibration (does
think --with-calibration produce better answers). cat15 measures the
INPUT side (does extract-takes find the gradeable claims hiding in
prose, so the calibration loop has fuel).

8 probes against the hand-labeled synthetic corpus shipped in gbrain
v0.36.1.0 at test/fixtures/calibration/ (3 training + 5 holdout, 6
genre categories: concept-with-timeline, meeting-notes, daily-journal,
people-page, essay-on-self-calibration, decision-log).

Runner shape:
  - Read page body + ground-truth JSON
  - Call Sonnet with EXTRACT_TAKES_PROMPT, parse JSON array of claims
  - Call Haiku matcher judge with structured tool-use to label TP/FP/FN
  - Compute precision/recall/F1 per probe; aggregate per-split

Gate thresholds:
  - training avg F1 >= 0.85
  - holdout avg F1 >= 0.80
  - train-holdout gap < 0.10 (overfitting signal)

**First live run results (claude-sonnet-4-6 + claude-haiku-4-5-20251001):**

  training avg F1: 0.952  (target 0.85, +10 points)
  holdout  avg F1: 0.922  (target 0.80, +12 points)
  train-holdout gap: 0.03 (no overfitting)
  8/8 probes pass their individual F1 targets
  per-genre F1 floor: 0.80 (people-pages, the hardest genre)
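
The per-probe F1 and the per-split gate can be sketched as below, using the standard precision/recall/F1 definitions. The function names are hypothetical, and the gap is assumed to be training average minus holdout average; the runner may compute it differently.

```typescript
// Hypothetical sketch of the cat15 scoring and gate; not the runner's code.

// Standard definitions from true-positive, false-positive, false-negative counts.
function f1FromCounts(tp: number, fp: number, fn: number) {
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

function mean(values: number[]): number {
  if (values.length === 0) throw new Error("cannot average an empty split");
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Gate: training avg >= 0.85, holdout avg >= 0.80, train-holdout gap < 0.10.
function cat15Gate(trainingF1s: number[], holdoutF1s: number[]) {
  const training = mean(trainingF1s);
  const holdout = mean(holdoutF1s);
  const gap = training - holdout; // assumed direction
  const pass = training >= 0.85 && holdout >= 0.8 && gap < 0.1;
  return { training, holdout, gap, pass };
}
```

With the averages reported above (0.952 training, 0.922 holdout), the gap is about 0.03 and all three conditions hold.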

Cost: ~$0.10 per full run (8 pages * 2 LLM calls).

## cat14 iteration log

Evidence the failure-loop methodology actually closes. Three prompt
variants tested same-day:
  - v1 (original 5 rules): 75% win, 100% voice, gate PASS
  - v2 (split bias-tag direction): 63% win, 88% voice, gate FAIL
  - v3 (epistemic humility on both): 75% win, 75% voice + 75% force-fit, gate FAIL

The eval caught two distinct regressions caused by over-correction. v1
was reverted; iteration-log.md preserves the loop as evidence that the
95% voice gate and 90% force-fit gate catch the failure modes that
would make calibration annoying instead of useful.

Lesson logged in the doc: longer prompts leak meta-language; the
simplest working prompt wins; iterations that lose voice ground should
not ship even if they win on other axes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: cat14 + cat15 benchmark report (first published calibration-loop eval)

Follows the existing format established by 2026-04-23-cat13-conceptual.md
and 2026-05-07-longmemeval-s.md. Sections:

  1. Headline (75% calibrated win / 0% baseline, 0.92 holdout F1)
  2. What is the calibration loop (4-phase pipeline explained)
  3. Cat 14 detail (advice quality A/B, 5-axis rubric, iteration log)
  4. Cat 15 detail (extraction F1, corpus design, per-genre breakdown)
  5. What changes (4 takeaways)
  6. Methodology (reproduction commands, models, variance bounds)
  7. SOTA framing (honest — no prior published benchmark in this category)
  8. Known gaps to close in v2 (7 items)

SOTA framing is honest: Hindsight introduced the calibration-loop
concept as a skills demo without quantified evaluation. Personal-AI
projects like Mem0, MemPalace, Notion AI don't ship a calibration
loop. Academic work covers human forecaster calibration, not AI
implementations. cat14 + cat15 stake out the category as benchmarkable.

The benchmark report is the load-bearing artifact for anyone evaluating
whether the v0.36.1.0 calibration wave is worth adopting. Linked from
gbrain's CHANGELOG and README "Receipts on the evals" section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: rewrite cat14/cat15 benchmark report ELI10-first

User feedback was that the original benchmark report was correct but
inaccessible — it dropped jargon (F1, precision, recall, voice gate,
force-fit, holdout, overfitting, judge model, rubric, axis) without
explaining what any of it means in plain English first.

The CLAUDE.md "lead ELI10, get precise after" rule applies to benchmark
reports as much as it does to CHANGELOG entries. Rewritten following
that structure:

- Section 1 "The plain-English version" — opens with what the feature
  does and what we tested, in 200 words, before any technical term
  appears.
- Section 2 "Why this matters" — explains the gap in current AI memory
  systems in everyday language. Names Hindsight as prior art.
- Section 3 "The headline numbers" — gives the win-rate result, but
  explains F1 / precision / recall in plain English BEFORE the score
  table appears.
- Section 4 "The four pieces of the calibration loop" — walks through
  the pipeline in plain English. Names what each step does and what
  cat14 vs cat15 tests.
- Section 5 cat14 detail — opens with "what we're measuring in plain
  English" + a worked example BEFORE the 6-category test taxonomy and
  5-axis rubric. The rubric axes are listed with their plain-English
  question first ("Does the calibrated answer mention the relevant bias
  when it should?") with the technical name in parens after.
- Section 6 cat15 detail — opens with what the test is asking, then
  explains training-vs-holdout and overfitting BEFORE introducing those
  terms in the score tables.
- Section 7 "What changes for gbrain users" — material impact in plain
  language. No code paths, no commit refs.
- Section 8 "How to reproduce" — copy-pasteable commands, plus the
  honest caveat about why the corpus is synthetic.
- Section 9 "What this is the first to publish" — honest SOTA framing
  about why the category is open, with named comparisons to adjacent
  systems and academic work.
- Section 10 "Known gaps" — 7 things this report does NOT measure, with
  what would be needed to close each.
- Section 11 "Per-test-case raw data" — points to the dumps and
  explains why they're load-bearing for future regression-prevention.

Every section follows the same pattern: plain-English lead, then
introduce the technical term as it comes up, then drill into precision.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
brandonlipman added a commit to brandonlipman/gbrain that referenced this pull request May 29, 2026
* upstream/master:
  v0.35.1.0: embedder shootout prereqs (pricing + gateway export + --resume-from) (garrytan#1055)
  v0.35.0.0 feat: ZeroEntropy zembed-1 + zerank-2 reranker (garrytan#1008)
  v0.34.4.0 fix(embed): cursor-paginated --stale hardening wave (D2/D3/D4/D6/D7/D8 + regression test) (garrytan#991)
  v0.34.3.0 fix: supervisor treats code=0 watchdog exits as crashes (garrytan#1003)
  v0.34.2.0 fix(import): path-based checkpoint resume — kills parallel-drop + failed-file-skip + sort-flip bugs (garrytan#988)
  v0.34.1.0 fix(mcp): MCP fix wave — source-isolation P0 + PKCE DCR + federated_read + 3 more (garrytan#996)
  v0.34.0.0 feat: Cathedral III — recursive code intelligence + Leiden clusters + eval gate (garrytan#994)
  v0.33.3.0 feat(v0.33.3): code intelligence MCP foundation (v0.34 W0a-c + W3) (garrytan#934)
  v0.33.2.1 docs: fork-PR workflow for garrytan-agents (garrytan#992)
  fix(sync): raise maxBuffer to 100 MiB to prevent silent ENOBUFS crash (garrytan#982)
  v0.33.2.0 feat(search-lite): token budget + semantic query cache + intent weighting (garrytan#897)
  v0.33.1.1 fix: Voyage output_dimension + flexible-dim guard + OOM-cap rethrow (garrytan#962)