v0.40.1.0 Track D — eval infrastructure (catch retrieval regressions, prove answer-quality wins) by garrytan · Pull Request #1298 · garrytan/gbrain

garrytan · 2026-05-22T16:12:05Z

Summary

Three things change about how gbrain measures itself.

1. Per-question-type recall is now machine-readable. gbrain eval longmemeval --by-type emits a {kind:"by_type_summary", recall_by_type, aggregate} line at the end of the JSONL output. Every per-question row also gains question, question_type, and (when ground truth is available) recall_hit. Resume-safe — the summary is rebuilt from existing rows so a 5-times-resumed 500-question run ends with one cumulative summary, not five. Optional --by-type-floor F exits non-zero per breached type.

2. PRs touching src/core/search/** now run a hermetic qrels gate. A new unit-shard test (test/eval-replay-gate.test.ts) seeds a PGLite engine with synthetic placeholder pages, embeds them via basis vectors, and runs 12 hand-curated queries against engine.searchVector. Asserts top1_match_rate >= 0.80 AND recall_at_10 >= 0.85. Zero API keys, no DATABASE_URL, fully deterministic. Refresh discipline via Why: commit-body convention.

3. gbrain eval cross-modal --batch runs quality scoring over a LongMemEval slice in one command with cost guardrails. --batch <jsonl> [--limit N] [--concurrent N] [--max-usd FLOAT] [--yes]. Semaphore-bounded fan-out (default 3 questions × 3 model slots = 9 simultaneous calls, below tier-1 rate limits across all 3 providers). Pre-flight cost cap; per-question receipts land in a tempdir and are cleaned up so ~/.gbrain/eval-receipts/ stays the audit trail, not a dumping ground. Exit precedence: ERROR > FAIL > INCONCLUSIVE > PASS.

Plus a 4th opt-in surface: nightly autopilot quality probe (autopilot.nightly_quality_probe.enabled) wraps the longmemeval + cross-modal pipeline into a 24h-cadenced check. Default DISABLED. ~$0.35/night expected, $5/run cap. Doctor surfaces health via new nightly_quality_probe_health check. Phase callable in isolation today; cycle-loop dispatcher wiring filed as v0.41+ TODO.

Test Coverage

| Lane | Files | New tests | Status |
|---|---|---|---|
| `eval longmemeval --by-type` | `src/commands/eval-longmemeval.ts`, `test/eval-longmemeval.test.ts` | 6 (5 base + 1 CDX-3 regression) | ★★ partial → ★★★ after CDX-3 fix |
| Hermetic qrels gate | `test/eval-replay-gate.test.ts` + `qrels-search.json` | 5 | ★★★ |
| `eval cross-modal --batch` | `src/commands/eval-cross-modal.ts`, `test/eval-cross-modal-batch.test.ts` | 17 (15 base + 2 CDX-1/2 regressions) | ★★ → ★★★ after CDX fixes |
| Nightly probe + audit + doctor | 4 src + 1 fixture + `test/nightly-quality-probe.test.ts` | 22 (14 base + 7 doctor branches + 1 CDX-5 mix + parser strict) | ★★★ |

Total: 76 hermetic tests, 0 fail, 993 expect() calls, ~14s wallclock.

Pre-Landing Review

Pre-merge code review surfaced 1 substantive finding: CHANGELOG voice violations ((T1 + T2), (per D8 reshape), Codex #6, "What we caught and fixed" section) — all forbidden by CLAUDE.md's CHANGELOG voice rules. Fixed in commit 0719d11a. Other findings (prototype-pollution surface in seedRecallByTypeFromFile, Windows path-separator in summaryDir) were classified low-confidence and deferred.

Codex Adversarial Review

Codex independent review flagged 5 real bugs in the Track D code. 4 closed in this PR via commit ac8132d2:

CDX-1: cross-modal --batch silently dropped upstream-error rows. gbrain eval longmemeval emits {error:..., hypothesis:''} on runOneQuestion throws. Pre-fix, the batch reader's missing-field skip threw those away and shrank the denominator — a green eval on a surviving subset was possible. Now error rows carry question + question_type, the batch reader returns {rows, upstream_errors, malformed_count}, BatchSummary gains upstream_error_count and malformed_count, and ERROR exit precedence widens to include both.
CDX-2: --limit 0 was a CI bypass. Empty result fell through to verdict='pass' with total:0. Fixed with a hard limit >= 1 gate.
CDX-3: resume + --by-type-floor was a real gate skip. No-op resume returned early before summary emission AND floor enforcement. Now no-op resume still seeds recallByType from the existing file, emits the by_type_summary, and runs the floor gate.
CDX-5: doctor flagged only 3 of the 6 non-PASS outcomes (fail, error, budget_exceeded). no_embedding_key / rate_limited / inconclusive were silently reported as PASS, hiding misconfigurations. Filter is now outcome !== 'pass' and the counts string surfaces every bucket.
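The exit precedence after the CDX-1 and CDX-2 fixes can be sketched as a single pure function. This is a minimal illustration, not the code in `eval-cross-modal.ts`: the `BatchCounts` shape and the `batchVerdictSketch` name are hypothetical, and treating a batch with zero scored rows as an error is an assumption made here to mirror the intent of the `limit >= 1` gate.

```typescript
// Hypothetical counts shape. Only upstream_error_count and malformed_count
// are field names taken from the PR text; the rest are illustrative.
type BatchVerdict = "error" | "fail" | "inconclusive" | "pass";

interface BatchCounts {
  error: number;
  fail: number;
  inconclusive: number;
  pass: number;
  upstream_error_count: number;
  malformed_count: number;
}

// Applies ERROR > FAIL > INCONCLUSIVE > PASS.
function batchVerdictSketch(c: BatchCounts): BatchVerdict {
  // Dropped or broken rows count as errors so they cannot shrink the denominator silently.
  if (c.error + c.upstream_error_count + c.malformed_count > 0) return "error";
  if (c.fail > 0) return "fail";
  if (c.inconclusive > 0) return "inconclusive";
  // Assumption: an empty batch must never read as green.
  if (c.pass === 0) return "error";
  return "pass";
}
```

The point of the ordering is that a single upstream failure outranks any number of passing questions, so a green result always covers the full input.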

The 5th finding (CDX-4: --max-usd as preflight estimate, --concurrent unbounded) was acknowledged as a defense-in-depth concern. The --max-usd cap is honored at preflight; runtime overshoot is bounded by --concurrent N × cycles × 3 slots. Operator-controlled --concurrent is design intent. Not a silent bypass — filed as v0.41+ TODO for runtime spend-tracking.

Plan Completion

12 plan tasks: 10 DONE, 2 deferred to v0.41+ TODOs (autopilot scheduler wiring; contributor-mode CI capture for BrainBench-Real replay). All plan-track checkboxes resolved.

Test plan

bun run typecheck clean
bun run verify clean (privacy + jsonb + progress + wasm + typecheck + 10 other guards)
76/76 Track D tests pass in isolation
8410+/8414 pass in the full parallel unit sweep (the 4 flakes are pre-existing cross-file gateway/PGLite contention in shards 1, 3, and 4; all pass in isolation; documented in the v0.41+ test-isolation cleanup TODO)
Three-line trio audit: VERSION = package.json = CHANGELOG header = 0.40.1.0
llms-full.txt regenerated via bun run build:llms

Things to watch after upgrade

gbrain doctor will show nightly_quality_probe_health: disabled (opt-in) until you set the config flag. Informational, not a warning.
The LongMemEval --by-type summary is appended as a new JSON line at the tail. Existing consumers (LongMemEval's evaluate_qa.py) ignore unknown fields.
The qrels gate uses synthetic queries with placeholder names. When a real ranking change moves expected slugs, refresh test/fixtures/eval-baselines/qrels-search.json directly with a Why: line in the commit body.

🤖 Generated with Claude Code

Per-question JSONL row gains `question`, `question_type`, and (when ground truth is available) `recall_hit` — additive fields that existing consumers (LongMemEval's `evaluate_qa.py`) ignore. New `--by-type` flag emits a `{kind:"by_type_summary", recall_by_type, aggregate}` line at the end of the output, resume-safe: rebuilt from existing rows so the final aggregate covers cumulative resumed questions, prior summary at the tail replaced rather than appended. New `--by-type-floor F` exits non-zero per breached question_type. Empty-bucket guard emits null rate not NaN. Exports `buildByTypeSummary` + `emitByTypeSummary` + `seedRecallByTypeFromFile` for unit testing.
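A minimal sketch of how such a summary could be rebuilt from rows, assuming a simplified row shape. Only `question_type`, `recall_hit`, `kind`, `recall_by_type`, and `aggregate` come from the description above; the bucket fields (`hits`, `total`, `rate`) and the function names are hypothetical and are not the exported `buildByTypeSummary`.

```typescript
interface EvalRowSketch {
  question_type: string;
  recall_hit?: boolean; // absent when ground truth is unavailable
}

interface BucketSketch {
  hits: number;
  total: number;
  rate: number | null; // null, not NaN, when nothing in the bucket was scorable
}

interface ByTypeSummarySketch {
  kind: "by_type_summary";
  recall_by_type: Record<string, BucketSketch>;
  aggregate: BucketSketch;
}

function rateOrNull(hits: number, total: number): number | null {
  return total === 0 ? null : hits / total;
}

// Rebuilds the summary from every row in the file, so a resumed run
// produces one cumulative summary rather than one per session.
function summarizeByTypeSketch(rows: EvalRowSketch[]): ByTypeSummarySketch {
  const counts = new Map<string, { hits: number; total: number }>();
  for (const row of rows) {
    const c = counts.get(row.question_type) ?? { hits: 0, total: 0 };
    if (row.recall_hit !== undefined) {
      c.total += 1;
      if (row.recall_hit) c.hits += 1;
    }
    counts.set(row.question_type, c);
  }
  const recall_by_type: Record<string, BucketSketch> = {};
  let hits = 0;
  let total = 0;
  counts.forEach((c, type) => {
    recall_by_type[type] = { hits: c.hits, total: c.total, rate: rateOrNull(c.hits, c.total) };
    hits += c.hits;
    total += c.total;
  });
  return { kind: "by_type_summary", recall_by_type, aggregate: { hits, total, rate: rateOrNull(hits, total) } };
}

// Types whose recall is below the floor. Assumption: an empty bucket
// (rate === null) is not treated as a breach.
function breachedTypesSketch(summary: ByTypeSummarySketch, floor: number): string[] {
  return Object.keys(summary.recall_by_type).filter((type) => {
    const r = summary.recall_by_type[type].rate;
    return r !== null && r < floor;
  });
}
```

A caller enforcing `--by-type-floor` would exit non-zero whenever `breachedTypesSketch` returns a non-empty list.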

Adds `--batch <jsonl> [--limit N] [--concurrent N] [--max-usd FLOAT] [--yes]` to the existing eval cross-modal command. Mutually exclusive with --task. Reads LongMemEval-shape JSONL output, filters by_type_summary rows automatically, fans out via a new `runWithLimit<T>` semaphore primitive (default --concurrent 3 x 3 model slots = 9 simultaneous calls; below tier-1 rate limits on all 3 providers). Pre-flight cost estimate refuses past --max-usd (default $5) unless --yes. Per-question receipts written to a per-batch tempdir + deleted at end of run so ~/.gbrain/eval-receipts/ stays clean; summary receipt inlines verdicts. Exit precedence (new batch-level policy, not inherited from aggregate.ts): ERROR > FAIL > INCONCLUSIVE > PASS — any per-question runtime error exits 2. New `runEvalCrossModal(args, opts?: {runEval?})` DI seam mirrors the existing eval-longmemeval pattern. Tests pass a stub runEval so unit tests don't need API keys; gateway availability check is also skipped when opts.runEval is provided. Pinned by 17 cases.
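The bounded fan-out can be illustrated with a small worker-pool helper. This is a sketch of the general technique, not the `runWithLimit<T>` primitive from this PR, whose signature is not shown here; the name `runWithLimitSketch` and its argument order are assumptions. A fixed pool of workers bounds concurrency the same way a counting semaphore would.

```typescript
// Runs the given task thunks with at most `limit` in flight at once.
// Results are returned in input order.
async function runWithLimitSketch<T>(
  limit: number,
  tasks: Array<() => Promise<T>>,
): Promise<T[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new Error("limit must be an integer >= 1");
  }
  const results: T[] = new Array(tasks.length);
  let next = 0; // index of the next unclaimed task

  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const i = next++; // claim synchronously, before the await below
      results[i] = await tasks[i]();
    }
  }

  // Each worker pulls tasks until none remain, so no more than
  // min(limit, tasks.length) tasks ever run concurrently.
  const workers = Array.from({ length: Math.min(limit, tasks.length) }, () => worker());
  await Promise.all(workers);
  return results;
}
```

With the default of 3 concurrent questions and 3 model slots per question, a helper like this would cap the batch at 9 simultaneous calls. In this sketch a rejected task rejects the whole batch; a real batch runner would more likely catch per-question errors and record them as ERROR rows.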

…rpus

Adds test/eval-replay-gate.test.ts as a unit-shard test (NOT under test/e2e/: the unit-shard CI matrix runs every PR via bun test; test/e2e/ is fixed-file). Seeds a PGLite engine with synthetic placeholder-name pages whose embeddings are basis vectors (same pattern as test/e2e/search-quality.test.ts:23-28) so retrieval is hermetic: no API keys, no DATABASE_URL, fully deterministic.

The qrels fixture at test/fixtures/eval-baselines/qrels-search.json has 12 hand-curated queries; each maps to a ranked list of relevant slugs + `first_relevant_slug` (expected top-1). Across those queries, the gate asserts `top1_match_rate >= 0.80` AND `recall_at_10 >= 0.85`. Env-overridable floors via GBRAIN_REPLAY_GATE_TOP1_FLOOR / GBRAIN_REPLAY_GATE_RECALL_FLOOR through withEnv(). Gate-fire prints per-query HIT/miss + recall to stderr.

When ranking changes intentionally move expected slugs, edit qrels-search.json directly with a 'Why:' line in the commit body (documented in docs/eval-bench.md).

scripts/check-test-real-names.sh allowlist gains 6 entries for the privacy-grep regression guard inside the test, which must literally spell the names it forbids to assert they're NOT in the fixture (same meta-rule exception as skillpack-harvest privacy tests).
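The two gate metrics can be sketched without any engine at all. In the example below, `first_relevant_slug` is the only field name taken from the fixture description; `relevant_slugs`, `vector`, the helper names, and the dot-product ranking are assumptions that stand in for `engine.searchVector`. Recall is macro-averaged over queries here, which is also an assumption about how the real gate aggregates.

```typescript
interface PageSketch { slug: string; embedding: number[] }
interface QrelsQuerySketch {
  vector: number[];            // hypothetical: the query's embedding
  relevant_slugs: string[];    // hypothetical field name
  first_relevant_slug: string; // expected top-1
}

// One-hot vector: each page owns one dimension, so similarity is exact
// and repeatable with no embedding API involved.
function basisVector(dim: number, index: number): number[] {
  const v = new Array<number>(dim).fill(0);
  v[index] = 1;
  return v;
}

function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Ranks pages by score, breaking ties by slug so the order is deterministic.
function rankSketch(queryVec: number[], pages: PageSketch[]): string[] {
  return pages
    .map((p) => ({ slug: p.slug, score: dot(queryVec, p.embedding) }))
    .sort((a, b) => b.score - a.score || (a.slug < b.slug ? -1 : a.slug > b.slug ? 1 : 0))
    .map((p) => p.slug);
}

function gateMetricsSketch(queries: QrelsQuerySketch[], pages: PageSketch[]) {
  let top1Hits = 0;
  let recallSum = 0;
  for (const q of queries) {
    if (q.relevant_slugs.length === 0) throw new Error("query has no relevant slugs");
    const ranked = rankSketch(q.vector, pages);
    if (ranked[0] === q.first_relevant_slug) top1Hits += 1;
    const top10 = new Set(ranked.slice(0, 10));
    const found = q.relevant_slugs.filter((s) => top10.has(s)).length;
    recallSum += found / q.relevant_slugs.length;
  }
  return {
    top1_match_rate: top1Hits / queries.length,
    recall_at_10: recallSum / queries.length,
  };
}
```

With 12 queries, a `top1_match_rate` floor of 0.80 tolerates at most two top-1 misses, since 10/12 ≈ 0.83 passes and 9/12 = 0.75 does not.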

Composes `gbrain eval longmemeval --by-type` + `gbrain eval cross-modal --batch` into a 24h-cadenced quality check. Default DISABLED: opt-in via `gbrain config set autopilot.nightly_quality_probe.enabled true` so new users don't discover background API spend.

src/core/cycle/nightly-quality-probe.ts ships the phase implementation with a full NightlyProbeDeps DI surface (isEnabled, hasEmbeddingProvider, resolveMaxUsd, resolveRepoRoot, runLongMemEval, runCrossModalBatch, now) so tests stub every external effect: no PGLite, no real LLM calls. Pure `shouldRunNightly(now, recentEvents, windowMs?)` rate-limit fn.

src/core/audit-quality-probe.ts is the ISO-week-rotated JSONL writer (mirrors audit-slug-fallback.ts; honors GBRAIN_AUDIT_DIR). One event per run: outcome (pass/fail/inconclusive/error/budget_exceeded/rate_limited/no_embedding_key), exit code, pass/fail/error counts, est_cost_usd, fixture_sha8.

src/commands/doctor.ts gains a `nightly_quality_probe_health` check: SKIPPED with paste-ready enable command when disabled; OK with timestamp when all PASS in last 7 days; WARN with per-outcome counts when any FAIL/ERROR/BUDGET_EXCEEDED. Extracted as pure `computeNightlyQualityProbeHealthCheck(probeEnabled, events)` for unit testing.

test/fixtures/longmemeval-nightly.jsonl is a 10-question placeholder dataset (synthetic names only) distinct from the existing 5-question mini fixture so the probe has consistent regression signal.

Real expected cost: ~$0.35/night = ~$10.50/month. Worst-case at default $5 cap: $150/month.

Pinned by 21 cases in test/nightly-quality-probe.test.ts covering the rate-limit pure function, every outcome branch, and all 7 branches of the doctor check.

Autopilot scheduler wiring deferred to v0.41+: the phase is callable in isolation today (via the DI surface); cycle-loop dispatcher integration filed in TODOS.md as a follow-up.
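The 24h rate limit is described as a pure function of the current time and recent audit events, which could look roughly like the sketch below. The parameter list follows the `shouldRunNightly(now, recentEvents, windowMs?)` signature quoted above; the event shape (a `ts` ISO timestamp) and the rule that any prior event inside the window blocks a new run, regardless of outcome, are assumptions.

```typescript
// Hypothetical event shape: `ts` is an assumed field name for the run timestamp.
interface ProbeEventSketch {
  ts: string; // ISO-8601 timestamp of a previous probe run
  outcome: string;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Returns true when no probe event falls inside the trailing window.
// Pure: the caller supplies `now`, so tests never touch the system clock.
function shouldRunNightlySketch(
  now: Date,
  recentEvents: ProbeEventSketch[],
  windowMs: number = DAY_MS,
): boolean {
  const cutoff = now.getTime() - windowMs;
  return !recentEvents.some((e) => {
    const t = Date.parse(e.ts);
    // Unparseable timestamps are ignored rather than blocking the run (assumption).
    return !Number.isNaN(t) && t > cutoff;
  });
}
```

Because `now` is injected, the same function can be exercised with fixed dates, which matches the DI approach the phase uses for its other dependencies.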

docs/eval-bench.md gains a 'v0.40.1.0 Track D: Eval infrastructure' section covering: --by-type usage + resume-replace semantics, the hermetic qrels gate workflow + 'Why:' commit-body refresh convention, --batch end-to-end with cost-bound + concurrency knobs, and the opt-in nightly probe enable workflow + cost ceiling.

TODOS.md files two follow-ups:
- v0.41+: contributor-mode CI capture for BrainBench-Real replay gate (the deferred original Task 2 design: replay against real captured queries is more valuable than synthetic qrels long-term, but needs CI secret + nightly capture pipeline + commit automation; deferred to a dedicated wave)
- v0.41+: wire the nightly quality probe into autopilot scheduling (phase callable in isolation today; cycle-loop dispatcher integration is a ~3-hour follow-up)

CLAUDE.md Key Files annotations extended for the four lanes: eval-longmemeval gains the --by-type description, eval-cross-modal gains the --batch + DI seam description, new entries for the qrels gate test + the nightly probe + audit-quality-probe writer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Codex adversarial review on the Track D wave found 4 real ways the new eval-gate code could silently bypass its gates. Each fix below either counts what was previously dropped, fails fast on a parser edge case, or enforces a gate that was previously skipped on an early-return path.

CDX-1: cross-modal --batch silently dropped failed/corrupt LongMemEval rows. `gbrain eval longmemeval` emits {error:..., hypothesis:''} when runOneQuestion throws; the batch reader's missing-field skip threw those rows away, shrinking the denominator. A green eval on a subset is now impossible:
- eval-longmemeval.ts: error rows now carry `question` + `question_type` so the batch consumer can identify them as upstream failures, not skip them as malformed.
- eval-cross-modal.ts: readBatchRows now returns {rows, upstream_errors, malformed_count}. Upstream errors fold into per_question with verdict 'upstream_error'. BatchSummary gains `upstream_error_count` and `malformed_count`. ERROR exit precedence widens to include both, so any upstream failure exits 2.

CDX-2: --limit 0 was a direct CI bypass: the zero-row check fired before slicing, then the empty result fell through to verdict='pass'. Fixed with a hard `limit >= 1` check.

CDX-3: --resume-from + --by-type-floor was a real gate skip. When a prior run had every question answered, the early "nothing to do" return fired BEFORE summary emission and floor enforcement. Now the no-op resume path still seeds recallByType from the existing file, emits the by_type_summary at the tail, and runs the floor gate.

CDX-5: doctor nightly_quality_probe_health only flagged fail / error / budget_exceeded as warn. no_embedding_key / rate_limited / inconclusive were silently reported as PASS, hiding misconfigurations and queue backpressure. The bad-event filter is now `outcome !== 'pass'`, and the counts string surfaces every bucket so the operator sees exactly what went wrong.

scripts/check-privacy.sh: adds test/eval-replay-gate.test.ts to the allowlist (the qrels test's privacy-grep regression guard literally names what it forbids, same meta-rule exception as the existing test/recency-decay.test.ts + skillpack-harvest allowlist entries).

Pinned by 8 new regression cases across eval-longmemeval (CDX-3), eval-cross-modal-batch (CDX-1 + CDX-2), and nightly-quality-probe (CDX-5). 76 Track D tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
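The post-CDX-5 doctor logic can be sketched as a pure function over recent events. The outcome names and the `outcome !== 'pass'` filter come from the commit message; the result shape, the message wording, and the handling of an enabled probe with no events are assumptions, and the function below is not `computeNightlyQualityProbeHealthCheck` itself.

```typescript
type OutcomeSketch =
  | "pass" | "fail" | "inconclusive" | "error"
  | "budget_exceeded" | "rate_limited" | "no_embedding_key";

interface HealthEventSketch { outcome: OutcomeSketch }
interface HealthResultSketch { status: "skipped" | "ok" | "warn"; message: string }

// `events` is assumed to be pre-filtered to the last 7 days by the caller.
function probeHealthSketch(probeEnabled: boolean, events: HealthEventSketch[]): HealthResultSketch {
  if (!probeEnabled) return { status: "skipped", message: "disabled (opt-in)" };

  // Anything that is not an explicit pass is surfaced, so new outcome
  // values can never be mistaken for success.
  const bad = events.filter((e) => e.outcome !== "pass");
  if (bad.length === 0) {
    // Assumption: an enabled probe with no runs yet is reported as ok.
    return { status: "ok", message: `${events.length} run(s), all pass` };
  }

  const counts: Partial<Record<OutcomeSketch, number>> = {};
  for (const e of events) counts[e.outcome] = (counts[e.outcome] ?? 0) + 1;
  const message = Object.keys(counts)
    .sort()
    .map((k) => `${k}=${counts[k as OutcomeSketch]}`)
    .join(" ");
  return { status: "warn", message };
}
```

Filtering on "not pass" instead of an allowlist of bad outcomes is what closes the gap: the earlier three-value list had to be kept in sync with the outcome enum, and this form does not.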

…eval-infra # Conflicts: # CHANGELOG.md # VERSION # package.json

* upstream/master: (22 commits)
  v0.41.4.0 wave: local providers + cross-platform stdin + gateway-routed dream judge (6 community PRs) (garrytan#1377)
  v0.41.3.0 fix(security/mcp): OAuth CORS lockdown + pre-register without DCR + validator surface (garrytan#1403)
  v0.41.2.0 feat: lens packs + epistemology unification — atoms + concepts as first-class units, calibration profile widening, gstack-learnings bridge (garrytan#1364)
  v0.41.1.0 feat: eval-loop wave — gbrain bench publish + gbrain eval gate close the LOOP (garrytan#1352)
  v0.41.0.0 feat(minions): fleet you supervise (4 field bugs + cathedral) (garrytan#1367)
  v0.40.10.0 feat: content sanity defense — junk-pattern throw + oversize-skip-embed (garrytan#1351)
  v0.40.9.0 feat(chunker): .sql indexing via tree-sitter + code-def on SQL DDL (garrytan#1173) (garrytan#1350)
  v0.40.8.1 docs: README rewrite + personal-brain + company-brain tutorials (garrytan#1345)
  v0.40.8.0 test: e2e + unit gap coverage + master flake root-cause fixes (garrytan#1313)
  v0.40.6.1 docs(todos): file v0.41 wave commitments + 7 verified-missing items (garrytan#1333)
  v0.40.7.0 Schema Cathedral v3 — agent-on-ramp + production rebuild of PR garrytan#1321 (garrytan#1327)
  v0.40.6.0 feat(sync): parallel sync --all + per-source lock invariant + sources status dashboard (productionized from PR garrytan#1314) (garrytan#1324)
  v0.40.5.0 Federated Sync v2 — parallel source sync + push triggers + per-source health (garrytan#1322)
  v0.40.4.0 feat(search): selective graph signals + per-stage attribution + audit-writer unification (garrytan#1300)
  v0.40.3.0 feat: contextual retrieval + cache invalidation gate + 4 deferred-item closures (garrytan#1323)
  v0.40.2.0 feat: trajectory routing for temporal + knowledge_update (gbrain think + LongMemEval) (garrytan#1296)
  v0.40.1.0 Track D — eval infrastructure (catch retrieval regressions, prove answer-quality wins) (garrytan#1298)
  v0.40.0.0 feat: agent-voice (Mars + Venus) + copy-into-host-repo skillpack paradigm (garrytan#1128)
  v0.39.3.0: productionize the v0.38 ingestion cathedral (smoke-test fix wave from PR garrytan#1299) (garrytan#1308)
  v0.39.2.0 feat(autopilot): per-source fan-out + cycle lock primitive + phase taxonomy (garrytan#1295)
  ...

garrytan and others added 10 commits May 22, 2026 09:00

chore: bump version and changelog (v0.40.1.0)

0719d11

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/master' into garrytan/v0.40.1.0-…

d456bf9

…eval-infra # Conflicts: # CHANGELOG.md # VERSION # package.json

Merge remote-tracking branch 'origin/master' into garrytan/v0.40.1.0-…

09c2a41

…eval-infra # Conflicts: # CHANGELOG.md # VERSION # package.json

Merge remote-tracking branch 'origin/master' into garrytan/v0.40.1.0-…

481b34d

…eval-infra # Conflicts: # CHANGELOG.md # VERSION # package.json

garrytan merged commit 94aaf7e into master May 23, 2026
8 checks passed

garrytan merged 10 commits into master from garrytan/v0.40.1.0-eval-infra
