v0.40.1.0 Track D — eval infrastructure (catch retrieval regressions, prove answer-quality wins) #1298
Merged
Per-question JSONL row gains `question`, `question_type`, and (when
ground truth is available) `recall_hit` — additive fields that existing
consumers (LongMemEval's `evaluate_qa.py`) ignore. New `--by-type` flag
emits a `{kind:"by_type_summary", recall_by_type, aggregate}` line at
the end of the output, resume-safe: rebuilt from existing rows so the
final aggregate covers cumulative resumed questions, prior summary at
the tail replaced rather than appended. New `--by-type-floor F` exits
non-zero per breached question_type. Empty-bucket guard emits null rate
not NaN. Exports `buildByTypeSummary` + `emitByTypeSummary` +
`seedRecallByTypeFromFile` for unit testing.
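The per-type summary and floor check described above can be sketched roughly as follows. This is a minimal illustration, not the code behind `buildByTypeSummary`: the function names `summarizeByType` and `breachedTypes`, the bucket shape (`hits`/`total`/`rate`), and the choice to skip null-rate buckets when applying the floor are all assumptions.

```typescript
// Hypothetical row shape: only the two fields the summary needs.
interface QuestionRow {
  question_type: string;
  recall_hit?: boolean; // absent when no ground truth is available
}

interface TypeBucket {
  hits: number;
  total: number;
  rate: number | null;
}

interface ByTypeSummary {
  kind: "by_type_summary";
  recall_by_type: Record<string, TypeBucket>;
  aggregate: TypeBucket;
}

// Empty-bucket guard: report null instead of dividing 0 by 0 (NaN).
function rateOrNull(hits: number, total: number): number | null {
  return total === 0 ? null : hits / total;
}

function summarizeByType(rows: QuestionRow[]): ByTypeSummary {
  const counts = new Map<string, { hits: number; total: number }>();
  let hits = 0;
  let total = 0;
  for (const row of rows) {
    let bucket = counts.get(row.question_type);
    if (bucket === undefined) {
      bucket = { hits: 0, total: 0 };
      counts.set(row.question_type, bucket);
    }
    if (row.recall_hit === undefined) continue; // type seen, nothing scoreable
    bucket.total += 1;
    total += 1;
    if (row.recall_hit) {
      bucket.hits += 1;
      hits += 1;
    }
  }
  const recall_by_type: Record<string, TypeBucket> = {};
  counts.forEach((b, type) => {
    recall_by_type[type] = { hits: b.hits, total: b.total, rate: rateOrNull(b.hits, b.total) };
  });
  return {
    kind: "by_type_summary",
    recall_by_type,
    aggregate: { hits, total, rate: rateOrNull(hits, total) },
  };
}

// Assumption: a bucket with a null rate cannot breach the floor.
function breachedTypes(summary: ByTypeSummary, floor: number): string[] {
  return Object.entries(summary.recall_by_type)
    .filter(([, b]) => b.rate !== null && b.rate < floor)
    .map(([type]) => type);
}
```

Because the summary is a pure function of the rows, rebuilding it from every row already in the output file gives the cumulative, resume-safe behavior the commit describes.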
Adds `--batch <jsonl> [--limit N] [--concurrent N] [--max-usd FLOAT]
[--yes]` to the existing eval cross-modal command. Mutually exclusive
with --task. Reads LongMemEval-shape JSONL output, filters by_type_summary
rows automatically, fans out via a new `runWithLimit<T>` semaphore
primitive (default --concurrent 3 x 3 model slots = 9 simultaneous calls;
below tier-1 rate limits on all 3 providers). Pre-flight cost estimate
refuses past --max-usd (default $5) unless --yes. Per-question receipts
written to a per-batch tempdir + deleted at end of run so
~/.gbrain/eval-receipts/ stays clean; summary receipt inlines verdicts.
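A bounded fan-out of the kind `runWithLimit<T>` provides might look like the sketch below. The real signature is not shown in this PR text, so the name `runWithLimitSketch`, the task-thunk argument, and the order-preserving result array are assumptions. With a question-level limit of 3 and 3 model slots per question, at most 3 x 3 = 9 calls are in flight.

```typescript
// Runs task thunks with at most `limit` in flight; results keep input order.
async function runWithLimitSketch<T>(
  tasks: Array<() => Promise<T>>,
  limit: number,
): Promise<T[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new Error("limit must be an integer >= 1");
  }
  const results: T[] = new Array(tasks.length);
  let next = 0; // index of the next unclaimed task

  // Each worker claims the next index and awaits that task before claiming another.
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const i = next++;
      results[i] = await tasks[i]();
    }
  }

  const workerCount = Math.min(limit, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

A worker-pool shape is used here instead of an explicit semaphore object because it needs no release bookkeeping; a rejected task rejects the whole call, which a batch runner would likely want to catch per question.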
Exit precedence (new batch-level policy, not inherited from aggregate.ts):
ERROR > FAIL > INCONCLUSIVE > PASS — any per-question runtime error exits 2.
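One way to express the ERROR > FAIL > INCONCLUSIVE > PASS precedence is a severity ranking reduced over the per-question verdicts. The sketch below is illustrative only: `batchVerdict` and the `SEVERITY` table are hypothetical names, and rejecting an empty list (instead of defaulting to pass) is a choice made for this sketch.

```typescript
type Verdict = "pass" | "inconclusive" | "fail" | "error";

// Higher number wins when verdicts are combined.
const SEVERITY: Record<Verdict, number> = {
  pass: 0,
  inconclusive: 1,
  fail: 2,
  error: 3,
};

function batchVerdict(perQuestion: Verdict[]): Verdict {
  if (perQuestion.length === 0) {
    // An empty batch has nothing to vouch for, so it must not read as a pass.
    throw new Error("cannot compute a verdict for an empty batch");
  }
  return perQuestion.reduce<Verdict>(
    (worst, v) => (SEVERITY[v] > SEVERITY[worst] ? v : worst),
    "pass",
  );
}
```

Per the commit message, a batch verdict of `error` corresponds to exit code 2; the exit codes for the other verdicts are not stated here, so the sketch stops at the verdict.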
New `runEvalCrossModal(args, opts?: {runEval?})` DI seam mirrors the
existing eval-longmemeval pattern. Tests pass a stub runEval so unit tests
don't need API keys; gateway availability check is also skipped when
opts.runEval is provided. Pinned by 17 cases.
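The DI seam can be pictured as an optional scorer argument that, when supplied, replaces the model-backed path and bypasses the availability check. Everything in this sketch is hypothetical (`scoreBatchSketch`, `RunEval`, `assertGatewayAvailable`, and the stand-in errors); it only shows the shape of the pattern, not the real `runEvalCrossModal`.

```typescript
type Verdict = "pass" | "fail" | "inconclusive";
type RunEval = (question: string, hypothesis: string) => Promise<Verdict>;

interface BatchOpts {
  runEval?: RunEval; // tests inject a stub here
}

// Stand-in for the real gateway check; in this sketch it always fails.
function assertGatewayAvailable(): void {
  throw new Error("gateway unavailable in this sketch");
}

// Stand-in for the real model-backed scorer, which would need API keys.
const realRunEval: RunEval = async () => {
  throw new Error("real scorer not wired in this sketch");
};

async function scoreBatchSketch(
  items: Array<{ question: string; hypothesis: string }>,
  opts: BatchOpts = {},
): Promise<Verdict[]> {
  if (opts.runEval === undefined) {
    // Only the real path needs the availability check.
    assertGatewayAvailable();
  }
  const runEval = opts.runEval ?? realRunEval;
  return Promise.all(items.map((i) => runEval(i.question, i.hypothesis)));
}
```

A unit test would pass a deterministic stub, for example one that returns `"pass"` for any non-empty hypothesis, and never touch the network.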
…rpus
Adds test/eval-replay-gate.test.ts as a unit-shard test (NOT under test/e2e/: the unit-shard CI matrix runs every PR via bun test; test/e2e/ is fixed-file). Seeds a PGLite engine with synthetic placeholder-name pages whose embeddings are basis vectors (same pattern as test/e2e/search-quality.test.ts:23-28) so retrieval is hermetic: no API keys, no DATABASE_URL, fully deterministic.
The qrels fixture at test/fixtures/eval-baselines/qrels-search.json has 12 hand-curated queries; each maps to a ranked list of relevant slugs + `first_relevant_slug` (expected top-1). Across the query set, the gate asserts `top1_match_rate >= 0.80` AND `recall_at_10 >= 0.85`. Env-overridable floors via GBRAIN_REPLAY_GATE_TOP1_FLOOR / GBRAIN_REPLAY_GATE_RECALL_FLOOR through withEnv(). Gate-fire prints per-query HIT/miss + recall to stderr.
When ranking changes intentionally move expected slugs, edit qrels-search.json directly with a 'Why:' line in the commit body (documented in docs/eval-bench.md).
scripts/check-test-real-names.sh allowlist gains 6 entries for the privacy-grep regression guard inside the test, which must literally spell the names it forbids to assert they're NOT in the fixture (same meta-rule exception as skillpack-harvest privacy tests).
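The two gate metrics can be computed along these lines. This is a sketch under stated assumptions: the fixture field name `relevant_slugs`, the helper names `scoreQrels` and `gatePasses`, and the definition of `recall_at_10` as the mean per-query fraction of relevant slugs found in the top 10 are not confirmed by the PR text; only `first_relevant_slug` and the two floors come from it.

```typescript
interface Qrel {
  query: string;
  relevant_slugs: string[]; // assumed field name for the ranked relevant list
  first_relevant_slug: string; // expected top-1, as named in the commit message
}

interface GateScores {
  top1_match_rate: number;
  recall_at_10: number;
}

// `search` returns slugs in ranked order for a query.
function scoreQrels(qrels: Qrel[], search: (query: string) => string[]): GateScores {
  if (qrels.length === 0) throw new Error("qrels fixture is empty");
  let top1Hits = 0;
  let recallSum = 0;
  for (const q of qrels) {
    const top10 = search(q.query).slice(0, 10);
    if (top10[0] === q.first_relevant_slug) top1Hits += 1;
    const found = q.relevant_slugs.filter((slug) => top10.includes(slug)).length;
    recallSum += q.relevant_slugs.length === 0 ? 0 : found / q.relevant_slugs.length;
  }
  return {
    top1_match_rate: top1Hits / qrels.length,
    recall_at_10: recallSum / qrels.length,
  };
}

// Both floors must hold for the gate to pass.
function gatePasses(s: GateScores, top1Floor = 0.8, recallFloor = 0.85): boolean {
  return s.top1_match_rate >= top1Floor && s.recall_at_10 >= recallFloor;
}
```

With 12 queries, a 0.80 top-1 floor tolerates at most two top-1 misses (10/12 ≈ 0.83 passes, 9/12 = 0.75 does not).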
Composes `gbrain eval longmemeval --by-type` + `gbrain eval cross-modal --batch` into a 24h-cadenced quality check. Default DISABLED; opt-in via `gbrain config set autopilot.nightly_quality_probe.enabled true` so new users don't discover background API spend.
src/core/cycle/nightly-quality-probe.ts ships the phase implementation with a full NightlyProbeDeps DI surface (isEnabled, hasEmbeddingProvider, resolveMaxUsd, resolveRepoRoot, runLongMemEval, runCrossModalBatch, now) so tests stub every external effect: no PGLite, no real LLM calls. Pure `shouldRunNightly(now, recentEvents, windowMs?)` rate-limit fn.
src/core/audit-quality-probe.ts is the ISO-week-rotated JSONL writer (mirrors audit-slug-fallback.ts; honors GBRAIN_AUDIT_DIR). One event per run: outcome (pass/fail/inconclusive/error/budget_exceeded/rate_limited/no_embedding_key), exit code, pass/fail/error counts, est_cost_usd, fixture_sha8.
src/commands/doctor.ts gains a `nightly_quality_probe_health` check: SKIPPED with paste-ready enable command when disabled; OK with timestamp when all PASS in last 7 days; WARN with per-outcome counts when any FAIL/ERROR/BUDGET_EXCEEDED. Extracted as pure `computeNightlyQualityProbeHealthCheck(probeEnabled, events)` for unit testing.
test/fixtures/longmemeval-nightly.jsonl is a 10-question placeholder dataset (synthetic names only) distinct from the existing 5-question mini fixture so the probe has consistent regression signal. Real expected cost: ~$0.35/night = ~$10.50/month. Worst-case at default $5 cap: $150/month.
Pinned by 21 cases in test/nightly-quality-probe.test.ts covering the rate-limit pure function, every outcome branch, and all 7 branches of the doctor check.
Autopilot scheduler wiring deferred to v0.41+: the phase is callable in isolation today (via the DI surface); cycle-loop dispatcher integration filed in TODOS.md as a follow-up.
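A pure rate-limit function with the `(now, recentEvents, windowMs?)` shape named above could look like this sketch. The event fields (`ts`, `outcome`), the name `shouldRunNightlySketch`, and the decision that `rate_limited` events do not themselves count as runs are assumptions made for illustration.

```typescript
// Hypothetical audit-event shape; only the timestamp and outcome are used.
interface ProbeEvent {
  ts: string; // ISO-8601 timestamp of the run
  outcome: string;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Returns true when no real run was recorded inside the window ending at `now`.
function shouldRunNightlySketch(
  now: Date,
  recentEvents: ProbeEvent[],
  windowMs: number = DAY_MS,
): boolean {
  const nowMs = now.getTime();
  const cutoff = nowMs - windowMs;
  const ranRecently = recentEvents.some((e) => {
    // Assumption: a skipped (rate_limited) attempt does not reset the window.
    if (e.outcome === "rate_limited") return false;
    const t = Date.parse(e.ts);
    return Number.isFinite(t) && t > cutoff && t <= nowMs;
  });
  return !ranRecently;
}
```

Passing `now` in as an argument keeps the function deterministic under test. The monthly figures in the commit message follow from 30 nightly runs: 30 × $0.35 = $10.50 expected, and 30 × $5 = $150 at the default cap.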
docs/eval-bench.md gains a 'v0.40.1.0 Track D — Eval infrastructure' section covering: --by-type usage + resume-replace semantics, the hermetic qrels gate workflow + 'Why:' commit-body refresh convention, --batch end-to-end with cost-bound + concurrency knobs, and the opt-in nightly probe enable workflow + cost ceiling.
TODOS.md files two follow-ups:
- v0.41+: contributor-mode CI capture for BrainBench-Real replay gate (the deferred original Task 2 design: replay against real captured queries is more valuable than synthetic qrels long-term, but needs CI secret + nightly capture pipeline + commit automation; deferred to a dedicated wave)
- v0.41+: wire the nightly quality probe into autopilot scheduling (phase callable in isolation today; cycle-loop dispatcher integration is a ~3-hour follow-up)
CLAUDE.md Key Files annotations extended for the four lanes: eval-longmemeval gains the --by-type description, eval-cross-modal gains the --batch + DI seam description, new entries for the qrels gate test + the nightly probe + audit-quality-probe writer.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex adversarial review on the Track D wave found 4 real ways the new
eval-gate code could silently bypass its gates. Each fix below either
counts what was previously dropped, fails fast on a parser edge case,
or enforces a gate that was previously skipped on an early-return path.
CDX-1: cross-modal --batch silently dropped failed/corrupt LongMemEval
rows. `gbrain eval longmemeval` emits {error:..., hypothesis:''} when
runOneQuestion throws; the batch reader's missing-field skip threw those
rows away, shrinking the denominator. A green eval on a subset is now
impossible:
- eval-longmemeval.ts: error rows now carry `question` + `question_type`
so the batch consumer can identify them as upstream failures, not
skip them as malformed.
- eval-cross-modal.ts: readBatchRows now returns {rows, upstream_errors,
malformed_count}. Upstream errors fold into per_question with verdict
'upstream_error'. BatchSummary gains `upstream_error_count` and
`malformed_count`. ERROR exit precedence widens to include both, so
any upstream failure exits 2.
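The three-way split that `readBatchRows` now returns can be sketched as a line classifier. The function name `partitionBatchLines` and the exact validity rules (what counts as malformed, treating an empty hypothesis without an error as malformed) are assumptions; the returned field names follow the commit message.

```typescript
interface BatchRow { question: string; question_type: string; hypothesis: string }
interface UpstreamError { question: string; question_type: string; error: string }
interface BatchRead {
  rows: BatchRow[];
  upstream_errors: UpstreamError[];
  malformed_count: number;
}

function partitionBatchLines(jsonl: string): BatchRead {
  const out: BatchRead = { rows: [], upstream_errors: [], malformed_count: 0 };
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      out.malformed_count += 1; // not JSON at all: count it, never drop silently
      continue;
    }
    if (typeof parsed !== "object" || parsed === null) {
      out.malformed_count += 1;
      continue;
    }
    const r = parsed as Record<string, unknown>;
    if (r["kind"] === "by_type_summary") continue; // summary line, not a question
    const question = r["question"];
    const questionType = r["question_type"];
    if (typeof question !== "string" || typeof questionType !== "string") {
      out.malformed_count += 1;
      continue;
    }
    const err = r["error"];
    if (err !== undefined && err !== null) {
      // Upstream failure: identifiable, so it stays in the denominator.
      out.upstream_errors.push({ question, question_type: questionType, error: String(err) });
      continue;
    }
    const hypothesis = r["hypothesis"];
    if (typeof hypothesis !== "string" || hypothesis === "") {
      out.malformed_count += 1;
      continue;
    }
    out.rows.push({ question, question_type: questionType, hypothesis });
  }
  return out;
}
```

The point of the shape is that every input line lands in exactly one bucket (or is a recognized summary line), so the caller can fail the run when either `upstream_errors` or `malformed_count` is non-empty.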
CDX-2: --limit 0 was a direct CI bypass — zero-row check fired before
slicing, then the empty result fell through to verdict='pass'. Fixed
with a hard `limit >= 1` check.
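A hard `limit >= 1` check at flag-parse time might look like the following. The helper name `parseLimit` and its error wording are hypothetical; only the `>= 1` rule comes from the commit message.

```typescript
// Returns undefined when the flag is omitted (no cap), otherwise a validated integer.
function parseLimit(raw: string | undefined): number | undefined {
  if (raw === undefined) return undefined;
  const n = Number(raw);
  // Rejects 0, negatives, fractions, NaN, and the empty string (Number("") is 0).
  if (!Number.isInteger(n) || n < 1) {
    throw new Error(`--limit must be an integer >= 1 (got "${raw}")`);
  }
  return n;
}
```

Failing at parse time means the zero-row case never reaches verdict computation, so it cannot be mistaken for a pass.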
CDX-3: --resume-from + --by-type-floor was a real gate skip. When a
prior run had every question answered, the early "nothing to do" return
fired BEFORE summary emission and floor enforcement. Now the no-op
resume path still seeds recallByType from the existing file, emits the
by_type_summary at the tail, and runs the floor gate.
CDX-5: doctor nightly_quality_probe_health only flagged fail / error /
budget_exceeded as warn. no_embedding_key / rate_limited / inconclusive
were silently reported as PASS — hiding misconfigurations and queue
backpressure. The bad-event filter is now `outcome !== 'pass'`, and the
counts string surfaces every bucket so the operator sees exactly what
went wrong.
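The widened bad-event filter can be illustrated with a reduced version of the doctor check. This sketch covers only three branches (the commit message mentions seven), assumes the caller has already restricted `events` to the last 7 days, and treats "enabled with no non-pass events" as OK; the name `probeHealthSketch` and the message formats are hypothetical.

```typescript
type Outcome =
  | "pass" | "fail" | "inconclusive" | "error"
  | "budget_exceeded" | "rate_limited" | "no_embedding_key";

interface ProbeEvent { ts: string; outcome: Outcome }
interface Health { status: "skipped" | "ok" | "warn"; message: string }

function probeHealthSketch(probeEnabled: boolean, events: ProbeEvent[]): Health {
  if (!probeEnabled) {
    return {
      status: "skipped",
      message: "enable with: gbrain config set autopilot.nightly_quality_probe.enabled true",
    };
  }
  // Anything other than a pass is worth surfacing, not just fail/error/budget_exceeded.
  const bad = events.filter((e) => e.outcome !== "pass");
  if (bad.length === 0) {
    return { status: "ok", message: `${events.length} run(s), all pass` };
  }
  const counts: Record<string, number> = {};
  for (const e of bad) counts[e.outcome] = (counts[e.outcome] ?? 0) + 1;
  const summary = Object.entries(counts)
    .map(([outcome, n]) => `${outcome}=${n}`)
    .join(", ");
  return { status: "warn", message: summary };
}
```

Filtering on `!== "pass"` is an allowlist of the one good outcome, so any outcome added to the union later defaults to a warning.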
scripts/check-privacy.sh: adds test/eval-replay-gate.test.ts to the
allowlist (the qrels test's privacy-grep regression guard literally
names what it forbids, same meta-rule exception as the existing
test/recency-decay.test.ts + skillpack-harvest allowlist entries).
Pinned by 8 new regression cases across eval-longmemeval (CDX-3),
eval-cross-modal-batch (CDX-1 + CDX-2), and nightly-quality-probe
(CDX-5). 76 Track D tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eval-infra # Conflicts: # CHANGELOG.md # VERSION # package.json
Summary
Three things change about how gbrain measures itself.
1. Per-question-type recall is now machine-readable. `gbrain eval longmemeval --by-type` emits a `{kind:"by_type_summary", recall_by_type, aggregate}` line at the end of the JSONL output. Every per-question row also gains `question`, `question_type`, and (when ground truth is available) `recall_hit`. Resume-safe: the summary is rebuilt from existing rows so a 5-times-resumed 500-question run ends with one cumulative summary, not five. Optional `--by-type-floor F` exits non-zero per breached type.
2. PRs touching `src/core/search/**` now run a hermetic qrels gate. A new unit-shard test (`test/eval-replay-gate.test.ts`) seeds a PGLite engine with synthetic placeholder pages, embeds them via basis vectors, and runs 12 hand-curated queries against `engine.searchVector`. Asserts `top1_match_rate >= 0.80` AND `recall_at_10 >= 0.85`. Zero API keys, no `DATABASE_URL`, fully deterministic. Refresh discipline via `Why:` commit-body convention.
3. `gbrain eval cross-modal --batch` runs quality scoring over a LongMemEval slice in one command with cost guardrails. `--batch <jsonl> [--limit N] [--concurrent N] [--max-usd FLOAT] [--yes]`. Semaphore-bounded fan-out (default 3 questions × 3 model slots = 9 simultaneous calls, below tier-1 rate limits across all 3 providers). Pre-flight cost cap; per-question receipts land in a tempdir and are cleaned up so `~/.gbrain/eval-receipts/` stays the audit trail, not a dumping ground. Exit precedence: ERROR > FAIL > INCONCLUSIVE > PASS.
Plus a 4th opt-in surface: nightly autopilot quality probe (`autopilot.nightly_quality_probe.enabled`) wraps the longmemeval + cross-modal pipeline into a 24h-cadenced check. Default DISABLED. ~$0.35/night expected, $5/run cap. Doctor surfaces health via new `nightly_quality_probe_health` check. Phase callable in isolation today; cycle-loop dispatcher wiring filed as v0.41+ TODO.
Test Coverage
- `eval longmemeval --by-type`: `src/commands/eval-longmemeval.ts`, `test/eval-longmemeval.test.ts`
- `test/eval-replay-gate.test.ts` + `qrels-search.json`
- `eval cross-modal --batch`: `src/commands/eval-cross-modal.ts`, `test/eval-cross-modal-batch.test.ts`
- `test/nightly-quality-probe.test.ts`
Total: 76 hermetic tests, 0 fail, 993 expect() calls, ~14s wallclock.
Pre-Landing Review
Pre-merge code review surfaced 1 substantive finding: CHANGELOG voice violations (`(T1 + T2)`, `(per D8 reshape)`, `Codex #6`, "What we caught and fixed" section), all forbidden by CLAUDE.md's CHANGELOG voice rules. Fixed in commit `0719d11a`. Other findings (prototype-pollution surface in `seedRecallByTypeFromFile`, Windows path-separator in `summaryDir`) were classified low-confidence and deferred.
Codex Adversarial Review
Codex independent review flagged 5 real bugs in the Track D code. 4 closed in this PR via commit `ac8132d2`:
- CDX-1: `gbrain eval longmemeval` emits `{error:..., hypothesis:''}` when runOneQuestion throws. Pre-fix, the batch reader's missing-field skip threw those rows away and shrank the denominator, so a green eval on a surviving subset was possible. Now error rows carry `question` + `question_type`, the batch reader returns `{rows, upstream_errors, malformed_count}`, BatchSummary gains `upstream_error_count` and `malformed_count`, and ERROR exit precedence widens to include both.
- CDX-2: `--limit 0` was a CI bypass: the zero-row check fired before slicing, so the empty result fell through to a passing verdict with `total:0`. Fixed with a hard `limit >= 1` gate.
- CDX-3: `--resume-from` + `--by-type-floor` skipped the gate when a prior run had already answered every question, because the early "nothing to do" return fired before summary emission and floor enforcement. The no-op resume path now seeds `recallByType` from the existing file, emits the by_type_summary, and runs the floor gate.
- CDX-5: doctor's `nightly_quality_probe_health` only flagged fail / error / budget_exceeded as warn. `no_embedding_key` / `rate_limited` / `inconclusive` were silently reported as PASS, hiding misconfigurations. Filter is now `outcome !== 'pass'` and the counts string surfaces every bucket.
The 5th finding (CDX-4: --max-usd as preflight estimate, --concurrent unbounded) was acknowledged as a defense-in-depth concern. The `--max-usd` cap is honored at preflight; runtime overshoot is bounded by `--concurrent N × cycles × 3 slots`. Operator-controlled `--concurrent` is design intent. Not a silent bypass; filed as v0.41+ TODO for runtime spend-tracking.
Plan Completion
12 plan tasks: 10 DONE, 2 deferred to v0.41+ TODOs (autopilot scheduler wiring; contributor-mode CI capture for BrainBench-Real replay). All plan-track checkboxes resolved.
Test plan
- `bun run typecheck` clean
- `bun run verify` clean (privacy + jsonb + progress + wasm + typecheck + 10 other guards)
- `0.40.1.0`
- `bun run build:llms`
Things to watch after upgrade
- `gbrain doctor` will show `nightly_quality_probe_health: disabled (opt-in)` until you set the config flag. Informational, not a warning.
- The `--by-type` summary is appended as a new JSON line at the tail. Existing consumers (LongMemEval's `evaluate_qa.py`) ignore unknown fields.
- When ranking changes intentionally move expected slugs, edit `test/fixtures/eval-baselines/qrels-search.json` directly with a `Why:` line in the commit body.
🤖 Generated with Claude Code