v0.40.1.0 Track D — eval infrastructure (catch retrieval regressions, prove answer-quality wins) #1298

Merged
garrytan merged 10 commits into master from garrytan/v0.40.1.0-eval-infra
May 23, 2026

Summary

Three things change about how gbrain measures itself.

1. Per-question-type recall is now machine-readable. gbrain eval longmemeval --by-type emits a {kind:"by_type_summary", recall_by_type, aggregate} line at the end of the JSONL output. Every per-question row also gains question, question_type, and (when ground truth is available) recall_hit. Resume-safe — the summary is rebuilt from existing rows so a 5-times-resumed 500-question run ends with one cumulative summary, not five. Optional --by-type-floor F exits non-zero per breached type.

2. PRs touching src/core/search/** now run a hermetic qrels gate. A new unit-shard test (test/eval-replay-gate.test.ts) seeds a PGLite engine with synthetic placeholder pages, embeds them via basis vectors, and runs 12 hand-curated queries against engine.searchVector. Asserts top1_match_rate >= 0.80 AND recall_at_10 >= 0.85. Zero API keys, no DATABASE_URL, fully deterministic. When an intentional ranking change moves expected slugs, the fixture is refreshed by hand with a Why: line in the commit body.

3. gbrain eval cross-modal --batch runs quality scoring over a LongMemEval slice in one command with cost guardrails. --batch <jsonl> [--limit N] [--concurrent N] [--max-usd FLOAT] [--yes]. Semaphore-bounded fan-out (default 3 questions × 3 model slots = 9 simultaneous calls, below tier-1 rate limits across all 3 providers). Pre-flight cost cap; per-question receipts land in a tempdir and are cleaned up so ~/.gbrain/eval-receipts/ stays the audit trail, not a dumping ground. Exit precedence: ERROR > FAIL > INCONCLUSIVE > PASS.

Plus a 4th opt-in surface: nightly autopilot quality probe (autopilot.nightly_quality_probe.enabled) wraps the longmemeval + cross-modal pipeline into a 24h-cadenced check. Default DISABLED. ~$0.35/night expected, $5/run cap. Doctor surfaces health via new nightly_quality_probe_health check. Phase callable in isolation today; cycle-loop dispatcher wiring filed as v0.41+ TODO.

Test Coverage

| Lane | Files | New tests | Status |
| --- | --- | --- | --- |
| eval longmemeval --by-type | src/commands/eval-longmemeval.ts, test/eval-longmemeval.test.ts | 6 (5 base + 1 CDX-3 regression) | ★★ partial → ★★★ after CDX-3 fix |
| Hermetic qrels gate | test/eval-replay-gate.test.ts + qrels-search.json | 5 | ★★★ |
| eval cross-modal --batch | src/commands/eval-cross-modal.ts, test/eval-cross-modal-batch.test.ts | 17 (15 base + 2 CDX-1/2 regressions) | ★★ → ★★★ after CDX fixes |
| Nightly probe + audit + doctor | 4 src + 1 fixture + test/nightly-quality-probe.test.ts | 22 (14 base + 7 doctor branches + 1 CDX-5 mix + parser strict) | ★★★ |

Total: 76 hermetic tests, 0 fail, 993 expect() calls, ~14s wallclock.

Pre-Landing Review

Pre-merge code review surfaced 1 substantive finding: CHANGELOG voice violations ((T1 + T2), (per D8 reshape), Codex #6, "What we caught and fixed" section) — all forbidden by CLAUDE.md's CHANGELOG voice rules. Fixed in commit 0719d11a. Other findings (prototype-pollution surface in seedRecallByTypeFromFile, Windows path-separator in summaryDir) were classified low-confidence and deferred.

Codex Adversarial Review

Codex independent review flagged 5 real bugs in the Track D code. 4 closed in this PR via commit ac8132d2:

  • CDX-1: cross-modal --batch silently dropped upstream-error rows. gbrain eval longmemeval emits {error:..., hypothesis:''} on runOneQuestion throws. Pre-fix, the batch reader's missing-field skip threw those away and shrank the denominator — a green eval on a surviving subset was possible. Now error rows carry question + question_type, the batch reader returns {rows, upstream_errors, malformed_count}, BatchSummary gains upstream_error_count and malformed_count, and ERROR exit precedence widens to include both.
  • CDX-2: --limit 0 was a CI bypass. Empty result fell through to verdict='pass' with total:0. Fixed with a hard limit >= 1 gate.
  • CDX-3: resume + --by-type-floor was a real gate skip. No-op resume returned early before summary emission AND floor enforcement. Now no-op resume still seeds recallByType from the existing file, emits the by_type_summary, and runs the floor gate.
  • CDX-5: doctor flagged only 3 of the 6 non-PASS outcomes (fail, error, budget_exceeded). no_embedding_key / rate_limited / inconclusive were silently reported as PASS, hiding misconfigurations. Filter is now outcome !== 'pass' and the counts string surfaces every bucket.

The 5th finding (CDX-4: --max-usd as preflight estimate, --concurrent unbounded) was acknowledged as a defense-in-depth concern. The --max-usd cap is honored at preflight; runtime overshoot is bounded by --concurrent N × cycles × 3 slots. Operator-controlled --concurrent is design intent. Not a silent bypass — filed as v0.41+ TODO for runtime spend-tracking.
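To make the preflight-versus-runtime distinction in CDX-4 concrete, here is a minimal sketch of a preflight cap check. Every name in it (`preflightCheck`, `PreflightInput`, `estUsdPerCall`) is hypothetical, and the flat per-call cost estimate is an assumption; only the $5 default cap, the `--yes` override, and the 3 model slots come from the description above.

```typescript
// Hypothetical sketch: none of these names come from the gbrain source.
interface PreflightInput {
  questions: number;     // rows selected after --limit
  modelSlots: number;    // 3 in the PR description
  estUsdPerCall: number; // assumed flat per-call estimate
  maxUsd: number;        // --max-usd (default 5 per the PR)
  yes: boolean;          // --yes overrides the refusal
}

interface PreflightResult {
  estimatedUsd: number;
  proceed: boolean;
}

// Refuse before any call is made when the estimate exceeds the cap.
function preflightCheck(input: PreflightInput): PreflightResult {
  const estimatedUsd = input.questions * input.modelSlots * input.estUsdPerCall;
  const overCap = estimatedUsd > input.maxUsd;
  return { estimatedUsd, proceed: !overCap || input.yes };
}

// Upper bound on calls in flight at once: --concurrent N x model slots.
function maxInFlightCalls(concurrent: number, modelSlots: number): number {
  return concurrent * modelSlots;
}
```

Because the check runs only once, before the fan-out, actual spend can drift from the estimate. That gap is what the runtime spend-tracking TODO is meant to close.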

Plan Completion

12 plan tasks: 10 DONE, 2 deferred to v0.41+ TODOs (autopilot scheduler wiring; contributor-mode CI capture for BrainBench-Real replay). All plan-track checkboxes resolved.

Test plan

  • bun run typecheck clean
  • bun run verify clean (privacy + jsonb + progress + wasm + typecheck + 10 other guards)
  • 76/76 Track D tests pass in isolation
  • 8410+/8414 pass in the full parallel unit sweep (the 4 flakes are pre-existing cross-file gateway/PGLite contention in shards 1, 3, and 4; all pass in isolation; tracked in the v0.41+ test-isolation cleanup TODO)
  • Three-line trio audit: VERSION = package.json = CHANGELOG header = 0.40.1.0
  • llms-full.txt regenerated via bun run build:llms

Things to watch after upgrade

  • gbrain doctor will show nightly_quality_probe_health: disabled (opt-in) until you set the config flag. Informational, not a warning.
  • The LongMemEval --by-type summary is appended as a new JSON line at the tail. Existing consumers (LongMemEval's evaluate_qa.py) ignore unknown fields.
  • The qrels gate uses synthetic queries with placeholder names. When a real ranking change moves expected slugs, refresh test/fixtures/eval-baselines/qrels-search.json directly with a Why: line in the commit body.

🤖 Generated with Claude Code

garrytan and others added 10 commits May 22, 2026 09:00
Per-question JSONL row gains `question`, `question_type`, and (when
ground truth is available) `recall_hit` — additive fields that existing
consumers (LongMemEval's `evaluate_qa.py`) ignore. New `--by-type` flag
emits a `{kind:"by_type_summary", recall_by_type, aggregate}` line at
the end of the output, resume-safe: rebuilt from existing rows so the
final aggregate covers cumulative resumed questions, prior summary at
the tail replaced rather than appended. New `--by-type-floor F` exits
non-zero per breached question_type. Empty-bucket guard emits null rate
not NaN. Exports `buildByTypeSummary` + `emitByTypeSummary` +
`seedRecallByTypeFromFile` for unit testing.
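
The summary shape and the empty-bucket guard described in this commit can be sketched as follows. This is an illustration under assumptions, not the body of `buildByTypeSummary`: the bucket fields (`hits`, `total`, `rate`), the function names, and the choice to let a null-rate bucket pass the floor are all hypothetical.

```typescript
// Illustrative only: row shape and function names are assumptions,
// not the signatures of gbrain's buildByTypeSummary.
interface EvalRow {
  question_type: string;
  recall_hit?: boolean; // absent when no ground truth is available
}

interface Bucket { hits: number; total: number; rate: number | null }

interface ByTypeSummary {
  kind: "by_type_summary";
  recall_by_type: Record<string, Bucket>;
  aggregate: Bucket;
}

// Empty bucket yields null, never NaN (0 / 0).
function rate(hits: number, total: number): number | null {
  return total === 0 ? null : hits / total;
}

function sketchByTypeSummary(rows: EvalRow[]): ByTypeSummary {
  const counts = new Map<string, { hits: number; total: number }>();
  let hits = 0;
  let total = 0;
  for (const row of rows) {
    const c = counts.get(row.question_type) ?? { hits: 0, total: 0 };
    counts.set(row.question_type, c);
    if (row.recall_hit === undefined) continue; // type is listed but not scored
    c.total += 1;
    total += 1;
    if (row.recall_hit) {
      c.hits += 1;
      hits += 1;
    }
  }
  // Null-prototype object so arbitrary question_type strings are plain keys.
  const recall_by_type: Record<string, Bucket> = Object.create(null);
  for (const [type, c] of counts) {
    recall_by_type[type] = { hits: c.hits, total: c.total, rate: rate(c.hits, c.total) };
  }
  return { kind: "by_type_summary", recall_by_type, aggregate: { hits, total, rate: rate(hits, total) } };
}

// Types whose recall falls below the floor. Assumption: null rates do not breach.
function breachedTypes(summary: ByTypeSummary, floor: number): string[] {
  return Object.entries(summary.recall_by_type)
    .filter(([, b]) => b.rate !== null && b.rate < floor)
    .map(([type]) => type);
}
```

Rebuilding the summary from all rows in the file, rather than only the rows produced in the current invocation, is what makes a resumed run end with one cumulative summary.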

Adds `--batch <jsonl> [--limit N] [--concurrent N] [--max-usd FLOAT]
[--yes]` to the existing eval cross-modal command. Mutually exclusive
with --task. Reads LongMemEval-shape JSONL output, filters by_type_summary
rows automatically, fans out via a new `runWithLimit<T>` semaphore
primitive (default --concurrent 3 x 3 model slots = 9 simultaneous calls;
below tier-1 rate limits on all 3 providers). Pre-flight cost estimate
refuses past --max-usd (default $5) unless --yes. Per-question receipts
written to a per-batch tempdir + deleted at end of run so
~/.gbrain/eval-receipts/ stays clean; summary receipt inlines verdicts.
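
A semaphore-bounded fan-out of the kind this commit describes might look like the sketch below. The name `runWithLimit<T>` appears in the commit message, but the signature and body here are assumptions: a worker-pool formulation in which at most `limit` tasks are in flight at once.

```typescript
// Sketch only: the name runWithLimit comes from the PR; this signature and
// body are assumptions, not the gbrain implementation.
async function runWithLimit<T>(
  tasks: Array<() => Promise<T>>,
  limit: number,
): Promise<T[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new Error("limit must be an integer >= 1");
  }
  const results: T[] = new Array(tasks.length);
  let next = 0; // index of the next unclaimed task

  // Each worker claims the next unclaimed index until none remain, so no
  // more than `limit` tasks ever run concurrently.
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const i = next++;
      results[i] = await tasks[i]();
    }
  }

  const workers = Array.from({ length: Math.min(limit, tasks.length) }, () => worker());
  await Promise.all(workers);
  return results; // same order as `tasks`, regardless of completion order
}
```

With `limit` set to 3 and each task fanning out to 3 model slots, the ceiling is the 9 simultaneous calls mentioned above. In this sketch a rejected task rejects the whole batch; a batch runner that records per-question errors would catch inside each task instead.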

Exit precedence (new batch-level policy, not inherited from aggregate.ts):
ERROR > FAIL > INCONCLUSIVE > PASS — any per-question runtime error exits 2.
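
The precedence rule can be read as "the worst per-question verdict wins". A minimal sketch, with a hypothetical `batchVerdict` helper; rejecting an empty batch is a choice made for this sketch rather than something stated in this commit.

```typescript
// Hypothetical helper illustrating ERROR > FAIL > INCONCLUSIVE > PASS.
type Verdict = "error" | "fail" | "inconclusive" | "pass";

// Ordered from highest to lowest precedence.
const PRECEDENCE: readonly Verdict[] = ["error", "fail", "inconclusive", "pass"];

function batchVerdict(verdicts: Verdict[]): Verdict {
  if (verdicts.length === 0) {
    // Sketch-level choice: an empty batch is never reported as a pass.
    throw new Error("cannot compute a batch verdict from zero questions");
  }
  for (const v of PRECEDENCE) {
    if (verdicts.includes(v)) return v; // first match is the worst verdict present
  }
  return "pass"; // unreachable for well-typed input
}
```

Per the commit message, a batch verdict of `error` corresponds to exit code 2; the codes for the other verdicts are not stated here.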

New `runEvalCrossModal(args, opts?: {runEval?})` DI seam mirrors the
existing eval-longmemeval pattern. Tests pass a stub runEval so unit tests
don't need API keys; gateway availability check is also skipped when
opts.runEval is provided. Pinned by 17 cases.
…rpus

Adds test/eval-replay-gate.test.ts as a unit-shard test (NOT under
test/e2e/ — the unit-shard CI matrix runs every PR via bun test;
test/e2e/ is fixed-file). Seeds a PGLite engine with synthetic
placeholder-name pages whose embeddings are basis vectors (same pattern
as test/e2e/search-quality.test.ts:23-28) so retrieval is hermetic — no
API keys, no DATABASE_URL, fully deterministic.
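
Why basis vectors make retrieval hermetic: each page gets a one-hot embedding, so the similarity between a query vector and a page is exactly 1 or 0 and the ranking never depends on an embedding model. The sketch below shows the idea in isolation. The helper names and the in-memory ranking are assumptions and stand in for the PGLite engine and `engine.searchVector` used by the real test.

```typescript
// Illustrative stand-in for the seeded engine; names are hypothetical.
interface Page { slug: string; embedding: number[] }

// One-hot vector of length `dim` with a 1 at position `hot`.
function basisVector(dim: number, hot: number): number[] {
  if (hot < 0 || hot >= dim) throw new RangeError("hot index out of range");
  const v = new Array<number>(dim).fill(0);
  v[hot] = 1;
  return v;
}

function dot(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Rank pages by dot product, breaking ties by slug so output is deterministic.
function rankByDot(query: number[], pages: Page[]): string[] {
  return pages
    .map((p) => ({ slug: p.slug, score: dot(query, p.embedding) }))
    .sort((x, y) => y.score - x.score || (x.slug < y.slug ? -1 : x.slug > y.slug ? 1 : 0))
    .map((p) => p.slug);
}
```

A query that reuses a page's basis vector puts that page first with score 1, and every other page ties at 0.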

The qrels fixture at test/fixtures/eval-baselines/qrels-search.json has
12 hand-curated queries; each maps to a ranked list of relevant slugs +
`first_relevant_slug` (expected top-1). Aggregated over the 12 queries, the
gate asserts `top1_match_rate >= 0.80` AND `recall_at_10 >= 0.85`. Env-overridable
floors via GBRAIN_REPLAY_GATE_TOP1_FLOOR / GBRAIN_REPLAY_GATE_RECALL_FLOOR
through withEnv(). Gate-fire prints per-query HIT/miss + recall to stderr.
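
The two gate metrics can be computed roughly as below. Only `first_relevant_slug`, the two metric names, and the two floors come from the commit message; the other fixture field names (`query`, `relevant_slugs`), the `Ranker` type, and the choice to average recall per query are assumptions.

```typescript
// Sketch of the gate metrics; fixture field names other than
// first_relevant_slug are assumed.
interface QrelQuery {
  query: string;
  relevant_slugs: string[];    // ranked list of relevant slugs
  first_relevant_slug: string; // expected top-1 result
}

type Ranker = (query: string) => string[]; // ranked slugs, best first

interface GateMetrics { top1_match_rate: number; recall_at_10: number }

function scoreQrels(queries: QrelQuery[], rank: Ranker, k = 10): GateMetrics {
  if (queries.length === 0) throw new Error("qrels fixture is empty");
  let top1Hits = 0;
  let recallSum = 0;
  for (const q of queries) {
    const topK = rank(q.query).slice(0, k);
    if (topK[0] === q.first_relevant_slug) top1Hits += 1;
    const found = q.relevant_slugs.filter((s) => topK.includes(s)).length;
    // Assumption: recall is averaged per query (macro-average).
    recallSum += q.relevant_slugs.length === 0 ? 0 : found / q.relevant_slugs.length;
  }
  return {
    top1_match_rate: top1Hits / queries.length,
    recall_at_10: recallSum / queries.length,
  };
}

// Floors default to the values in the commit message.
function gatePasses(m: GateMetrics, top1Floor = 0.8, recallFloor = 0.85): boolean {
  return m.top1_match_rate >= top1Floor && m.recall_at_10 >= recallFloor;
}
```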

When ranking changes intentionally move expected slugs, edit
qrels-search.json directly with a 'Why:' line in the commit body —
documented in docs/eval-bench.md.

scripts/check-test-real-names.sh allowlist gains 6 entries for the
privacy-grep regression guard inside the test, which must literally
spell the names it forbids to assert they're NOT in the fixture (same
meta-rule exception as skillpack-harvest privacy tests).

Composes `gbrain eval longmemeval --by-type` + `gbrain eval cross-modal
--batch` into a 24h-cadenced quality check. Default DISABLED — opt-in via
`gbrain config set autopilot.nightly_quality_probe.enabled true` so new
users don't discover background API spend.

src/core/cycle/nightly-quality-probe.ts ships the phase implementation
with a full NightlyProbeDeps DI surface (isEnabled, hasEmbeddingProvider,
resolveMaxUsd, resolveRepoRoot, runLongMemEval, runCrossModalBatch, now)
so tests stub every external effect — no PGLite, no real LLM calls.
Pure `shouldRunNightly(now, recentEvents, windowMs?)` rate-limit fn.
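
A pure rate-limit function with that signature could be as small as the sketch below. The parameter order follows the commit message; the `Date` argument, the event shape (`ts` as an ISO-8601 string), and the 24-hour default are assumptions.

```typescript
// Sketch of a pure 24h rate-limit check. The event shape is assumed.
interface ProbeEvent { ts: string } // ISO-8601 timestamp of a previous run

const DAY_MS = 24 * 60 * 60 * 1000;

function shouldRunNightly(
  now: Date,
  recentEvents: ProbeEvent[],
  windowMs: number = DAY_MS,
): boolean {
  // Run only when no prior event falls inside the window ending at `now`.
  return !recentEvents.some((e) => {
    const t = Date.parse(e.ts);
    // Unparseable timestamps are ignored rather than blocking the run.
    return Number.isFinite(t) && now.getTime() - t < windowMs;
  });
}
```

Because `now` is injected rather than read from the clock, a test can pin the boundary cases without timers.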

src/core/audit-quality-probe.ts is the ISO-week-rotated JSONL writer
(mirrors audit-slug-fallback.ts; honors GBRAIN_AUDIT_DIR). One event per
run: outcome (pass/fail/inconclusive/error/budget_exceeded/rate_limited/
no_embedding_key), exit code, pass/fail/error counts, est_cost_usd,
fixture_sha8.
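
ISO-week rotation means the target file name changes once per ISO week, so each file holds at most seven nightly events. The sketch below derives a week key in UTC; the file-name pattern and helper names are hypothetical, and the real writer may format the key differently.

```typescript
// Sketch: ISO-8601 week key for a date, computed in UTC.
function isoWeekKey(d: Date): string {
  const t = new Date(Date.UTC(d.getUTCFullYear(), d.getUTCMonth(), d.getUTCDate()));
  const day = t.getUTCDay() || 7; // Monday = 1 ... Sunday = 7
  // Move to the Thursday of this week; its year is the ISO week-year.
  t.setUTCDate(t.getUTCDate() + 4 - day);
  const yearStart = Date.UTC(t.getUTCFullYear(), 0, 1);
  const week = Math.ceil(((t.getTime() - yearStart) / 86_400_000 + 1) / 7);
  return `${t.getUTCFullYear()}-W${String(week).padStart(2, "0")}`;
}

// Hypothetical file-name pattern; `dir` would be the resolved audit directory.
function auditFilePath(dir: string, d: Date): string {
  return `${dir}/quality-probe-${isoWeekKey(d)}.jsonl`;
}
```

Dates around New Year show why the Thursday shift matters: 2027-01-01 is a Friday and belongs to ISO week 53 of 2026.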

src/commands/doctor.ts gains a `nightly_quality_probe_health` check:
SKIPPED with paste-ready enable command when disabled; OK with timestamp
when all PASS in last 7 days; WARN with per-outcome counts when any
FAIL/ERROR/BUDGET_EXCEEDED. Extracted as pure
`computeNightlyQualityProbeHealthCheck(probeEnabled, events)` for
unit testing.

test/fixtures/longmemeval-nightly.jsonl is a 10-question placeholder
dataset (synthetic names only) distinct from the existing 5-question
mini fixture so the probe has consistent regression signal.

Real expected cost: ~$0.35/night = ~$10.50/month. Worst-case at
default $5 cap: $150/month.

Pinned by 21 cases in test/nightly-quality-probe.test.ts covering the
rate-limit pure function, every outcome branch, and all 7 branches of
the doctor check.

Autopilot scheduler wiring deferred to v0.41+ — the phase is callable
in isolation today (via the DI surface); cycle-loop dispatcher
integration filed in TODOS.md as a follow-up.

docs/eval-bench.md gains a 'v0.40.1.0 Track D — Eval infrastructure'
section covering: --by-type usage + resume-replace semantics, the
hermetic qrels gate workflow + 'Why:' commit-body refresh convention,
--batch end-to-end with cost-bound + concurrency knobs, and the opt-in
nightly probe enable workflow + cost ceiling.

TODOS.md files two follow-ups:
- v0.41+: contributor-mode CI capture for BrainBench-Real replay gate
  (the deferred original Task 2 design — replay against real captured
  queries is more valuable than synthetic qrels long-term, but needs CI
  secret + nightly capture pipeline + commit automation; deferred to a
  dedicated wave)
- v0.41+: wire the nightly quality probe into autopilot scheduling
  (phase callable in isolation today; cycle-loop dispatcher integration
  is a ~3-hour follow-up)

CLAUDE.md Key Files annotations extended for the four lanes:
eval-longmemeval gains the --by-type description, eval-cross-modal
gains the --batch + DI seam description, new entries for the qrels
gate test + the nightly probe + audit-quality-probe writer.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Codex adversarial review on the Track D wave found 4 real ways the new
eval-gate code could silently bypass its gates. Each fix below either
counts what was previously dropped, fails fast on a parser edge case,
or enforces a gate that was previously skipped on an early-return path.

CDX-1: cross-modal --batch silently dropped failed/corrupt LongMemEval
rows. `gbrain eval longmemeval` emits {error:..., hypothesis:''} when
runOneQuestion throws; the batch reader's missing-field skip threw those
rows away, shrinking the denominator. A green eval on a subset is now
impossible:
  - eval-longmemeval.ts: error rows now carry `question` + `question_type`
    so the batch consumer can identify them as upstream failures, not
    skip them as malformed.
  - eval-cross-modal.ts: readBatchRows now returns {rows, upstream_errors,
    malformed_count}. Upstream errors fold into per_question with verdict
    'upstream_error'. BatchSummary gains `upstream_error_count` and
    `malformed_count`. ERROR exit precedence widens to include both, so
    any upstream failure exits 2.
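
The three-way split in CDX-1 can be sketched as a line classifier. The return shape `{rows, upstream_errors, malformed_count}` comes from the commit message; the function name, the required fields, and the assumption that `error` is a non-empty string are illustrative.

```typescript
// Sketch of the CDX-1 classification; not the body of readBatchRows.
interface BatchRow { question: string; question_type: string; hypothesis: string }
interface UpstreamError { question: string; question_type: string; error: string }
interface BatchRead {
  rows: BatchRow[];
  upstream_errors: UpstreamError[];
  malformed_count: number;
}

function isNonEmptyString(v: unknown): v is string {
  return typeof v === "string" && v.length > 0;
}

function classifyBatchLines(lines: string[]): BatchRead {
  const out: BatchRead = { rows: [], upstream_errors: [], malformed_count: 0 };
  for (const line of lines) {
    if (line.trim() === "") continue; // blank lines are not rows
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      out.malformed_count += 1; // counted, not silently dropped
      continue;
    }
    if (typeof parsed !== "object" || parsed === null) {
      out.malformed_count += 1;
      continue;
    }
    const r = parsed as Record<string, unknown>;
    if (r.kind === "by_type_summary") continue; // summary line, filtered out
    const { question, question_type, hypothesis, error } = r;
    if (!isNonEmptyString(question) || !isNonEmptyString(question_type)) {
      out.malformed_count += 1;
    } else if (isNonEmptyString(error)) {
      // Upstream failure: stays in the denominator as an error.
      out.upstream_errors.push({ question, question_type, error });
    } else if (isNonEmptyString(hypothesis)) {
      out.rows.push({ question, question_type, hypothesis });
    } else {
      out.malformed_count += 1;
    }
  }
  return out;
}
```

The point of the fix is visible in the counters: nothing leaves the input without landing in `rows`, `upstream_errors`, or `malformed_count`, apart from blank lines and the summary line.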

CDX-2: --limit 0 was a direct CI bypass — zero-row check fired before
slicing, then the empty result fell through to verdict='pass'. Fixed
with a hard `limit >= 1` check.
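
A strict flag parser closes this kind of bypass at the edge. A minimal sketch with a hypothetical `parseLimit` helper; the real check may live elsewhere in the command.

```typescript
// Hypothetical parser for --limit: rejects 0, negatives, fractions, non-numbers.
function parseLimit(raw: string | undefined): number | undefined {
  if (raw === undefined) return undefined; // flag omitted: no slicing
  const n = Number(raw);
  if (!Number.isInteger(n) || n < 1) {
    throw new Error(`--limit must be an integer >= 1, got "${raw}"`);
  }
  return n;
}
```

Failing fast here means an empty batch can never reach verdict computation.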

CDX-3: --resume-from + --by-type-floor was a real gate skip. When a
prior run had every question answered, the early "nothing to do" return
fired BEFORE summary emission and floor enforcement. Now the no-op
resume path still seeds recallByType from the existing file, emits the
by_type_summary at the tail, and runs the floor gate.

CDX-5: doctor nightly_quality_probe_health only flagged fail / error /
budget_exceeded as warn. no_embedding_key / rate_limited / inconclusive
were silently reported as PASS — hiding misconfigurations and queue
backpressure. The bad-event filter is now `outcome !== 'pass'`, and the
counts string surfaces every bucket so the operator sees exactly what
went wrong.
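
The post-fix doctor logic can be sketched as a pure function over recent events. The outcome names, the `outcome !== 'pass'` filter, and the enable command come from this PR; the return shape, the message wording, and the handling of an enabled probe with no events are assumptions. The caller is assumed to pass only events from the last 7 days.

```typescript
// Sketch of the health check after CDX-5; shapes and wording are assumed.
type Outcome =
  | "pass" | "fail" | "inconclusive" | "error"
  | "budget_exceeded" | "rate_limited" | "no_embedding_key";

interface ProbeEvent { outcome: Outcome; ts: string }
interface HealthCheck { status: "skipped" | "ok" | "warn"; message: string }

const OUTCOME_ORDER: readonly Outcome[] = [
  "fail", "error", "budget_exceeded", "rate_limited", "no_embedding_key", "inconclusive",
];

function sketchHealthCheck(probeEnabled: boolean, events: ProbeEvent[]): HealthCheck {
  if (!probeEnabled) {
    return {
      status: "skipped",
      message: "disabled (opt-in): gbrain config set autopilot.nightly_quality_probe.enabled true",
    };
  }
  if (events.length === 0) {
    // Assumption: an enabled probe with no recorded runs is worth a warning.
    return { status: "warn", message: "no runs recorded in window" };
  }
  const bad = events.filter((e) => e.outcome !== "pass"); // every non-pass counts
  if (bad.length === 0) {
    const latest = events.map((e) => e.ts).sort().slice(-1)[0];
    return { status: "ok", message: `all PASS, last run ${latest}` };
  }
  // Surface every bucket so the operator sees exactly what went wrong.
  const counts = OUTCOME_ORDER
    .map((o) => [o, bad.filter((e) => e.outcome === o).length] as const)
    .filter(([, n]) => n > 0)
    .map(([o, n]) => `${o}=${n}`)
    .join(" ");
  return { status: "warn", message: counts };
}
```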

scripts/check-privacy.sh: adds test/eval-replay-gate.test.ts to the
allowlist (the qrels test's privacy-grep regression guard literally
names what it forbids, same meta-rule exception as the existing
test/recency-decay.test.ts + skillpack-harvest allowlist entries).

Pinned by 8 new regression cases across eval-longmemeval (CDX-3),
eval-cross-modal-batch (CDX-1 + CDX-2), and nightly-quality-probe
(CDX-5). 76 Track D tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eval-infra

# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
@garrytan garrytan merged commit 94aaf7e into master May 23, 2026
8 checks passed