v0.41-launch: hermetic baseline + qrels for gbrain eval gate #13

Open
garrytan wants to merge 1 commit into main from garrytan/v0.41-launch-baselines

@garrytan

Summary

  • Adds the first published baseline + qrels for gbrain eval gate (the new CI verb landing in gbrain v0.41.0.0). Both files are hermetic-synthetic — placeholder names only per gbrain's D9 privacy posture.
  • baselines/v0.41-launch.baseline.ndjson (12 captured rows, source_hash 34e88041…) drives the regression gate.
  • qrels/v0.41-launch.qrels.json (12 hand-curated queries with known-right answers) drives the correctness gate.
  • scripts/generate-v0.41-launch.ts is the reproducible recipe; same input + fixed published_at → byte-identical output.

Closes the LOOP that gbrain v0.41 ships: users point CI at these files via gbrain eval gate --baseline X --qrels Y and fail PRs on retrieval regressions OR correctness drops without bootstrapping their own baseline.
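The byte-identical claim for the generator depends on two things: the timestamp is passed in rather than read from the clock, and serialization does not depend on key insertion order. A minimal sketch of that idea follows. It is not the code in scripts/generate-v0.41-launch.ts; `stableStringify`, `renderBaseline`, and the row fields are hypothetical names chosen for illustration.

```typescript
// Hypothetical row shape; the real baseline rows are defined by gbrain.
type Row = Record<string, unknown>;

// Serialize with object keys sorted, so key insertion order cannot change the bytes.
function stableStringify(value: unknown): string {
  if (value === undefined) return "null"; // mirror JSON's treatment of undefined in arrays
  if (Array.isArray(value)) {
    return "[" + value.map((item) => stableStringify(item)).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj)
      .filter((k) => obj[k] !== undefined) // JSON drops undefined object members
      .sort();
    return "{" + keys.map((k) => JSON.stringify(k) + ":" + stableStringify(obj[k])).join(",") + "}";
  }
  return JSON.stringify(value);
}

// The timestamp is an argument, never Date.now(), so reruns produce the same bytes.
function renderBaseline(rows: Row[], publishedAt: string): string {
  const header = { _kind: "baseline_metadata", published_at: publishedAt };
  return [header, ...rows].map((record) => stableStringify(record)).join("\n") + "\n";
}
```

Two calls with the same rows and the same `publishedAt` return identical strings even if the row objects were built with their keys in a different order.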

What's in the box

| File | Purpose |
| --- | --- |
| baselines/v0.41-launch.baseline.ndjson | Regression gate target. NDJSON: metadata header (_kind: 'baseline_metadata' + thresholds + source_hash) + 12 captured rows. |
| baselines/README.md | Privacy posture, file format, refresh discipline. |
| qrels/v0.41-launch.qrels.json | Correctness gate target. JSON object: {schema_version, queries: [...]}. Promoted from gbrain's existing test/fixtures/eval-baselines/qrels-search.json fixture. |
| qrels/README.md | File format docs (legacy + federated source_id-aware shapes). |
| scripts/generate-v0.41-launch.ts | Deterministic regenerator. Set GBRAIN_SRC=<path-to-gbrain> to use a local gbrain checkout instead of the npm dep. |
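For readers unfamiliar with the NDJSON layout described above, here is a rough sketch of splitting such a file into its metadata header and captured rows. It is not gbrain's parseBaselineFile. The `_kind`, `thresholds`, and `source_hash` names come from this PR, while the `thresholds` value type, the row shape, and the `splitBaseline` helper are assumptions.

```typescript
// Header fields named in this PR; the value types are assumptions.
interface BaselineMetadata {
  _kind: "baseline_metadata";
  source_hash?: string;
  thresholds?: Record<string, number>;
}

// Row contents are not specified here, so rows stay loosely typed.
type BaselineRow = Record<string, unknown>;

// Hypothetical helper: one JSON document per non-empty line, header first.
function splitBaseline(ndjson: string): { metadata: BaselineMetadata; rows: BaselineRow[] } {
  const records = ndjson
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as Record<string, unknown>);

  const [first, ...rest] = records;
  if (first === undefined || first["_kind"] !== "baseline_metadata") {
    throw new Error("expected a baseline_metadata header on the first line");
  }
  return { metadata: first as unknown as BaselineMetadata, rows: rest };
}
```

For the published file, a split like this would be expected to yield one header and 12 rows.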

Privacy posture (gbrain D9)

Every slug in both files is a *-example placeholder (people/alice-example, companies/widget-co-example, etc.) per gbrain's CLAUDE.md privacy rule. Real-user captures stay local in ~/.gbrain/baselines/ on each user's machine and never enter the public benchmark surface.
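The placeholder rule lends itself to a mechanical check. The sketch below assumes the rule is "the last path segment of every slug ends in -example", which matches the two examples given but may be narrower or broader than gbrain's actual rule. `findNonPlaceholderSlugs` is a hypothetical helper, not part of gbrain.

```typescript
// Returns the slugs whose final path segment does not end in "-example".
// An empty result means every slug looks like a placeholder under this assumed rule.
function findNonPlaceholderSlugs(slugs: string[]): string[] {
  return slugs.filter((slug) => {
    const lastSegment = slug.split("/").pop() ?? "";
    return !lastSegment.endsWith("-example");
  });
}

const offenders = findNonPlaceholderSlugs([
  "people/alice-example",
  "companies/widget-co-example",
  "people/real-person", // illustrative violation, not taken from the published files
]);
console.log(offenders); // [ 'people/real-person' ]
```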

Refresh discipline (gbrain D4)

When a ranking change intentionally moves expected slugs, edit the qrels or regenerate the baseline, then include a Why: line in the commit body so future maintainers can audit the trail. Without that discipline, the gate degrades to a rubber stamp within months.
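If the Why: convention is ever enforced automatically, the check could be as small as the sketch below. It assumes the line must start with `Why:` and be followed by some non-empty text. gbrain may define the convention differently, and `hasWhyLine` is a hypothetical name.

```typescript
// True when some line of the commit body starts with "Why:" and has text after it.
function hasWhyLine(commitBody: string): boolean {
  return commitBody
    .split(/\r?\n/)
    .some((line) => /^Why:\s*\S/.test(line.trim()));
}

console.log(hasWhyLine("Refresh baseline\n\nWhy: reranker now boosts exact title matches")); // true
console.log(hasWhyLine("Refresh baseline\n\nWhy:")); // false
```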

Test plan

  • Both files parse cleanly through gbrain's v0.41 parseBaselineFile + parseQrelsFile (verified locally).
  • Generator is deterministic — re-running with the same GBRAIN_SRC produces byte-identical output.
  • When gbrain v0.41.0.0 lands on master, run gbrain eval gate --baseline baselines/v0.41-launch.baseline.ndjson --qrels qrels/v0.41-launch.qrels.json against a known-good gbrain build and confirm exit 0.

🤖 Generated with Claude Code

Coordinated drop alongside gbrain v0.41.0.0. Both files are
hermetic-synthetic — placeholder names only per gbrain D9 privacy
posture. No real user queries, people, or companies.

- baselines/v0.41-launch.baseline.ndjson: 12 captured rows from a
  fixture-seeded brain (source_hash 34e88041..., mean latency 27ms).
  Consumed by gbrain eval gate --baseline. Catches retrieval
  REGRESSIONS during refactors.
- qrels/v0.41-launch.qrels.json: 12 hand-curated queries with
  known-right answers (promoted from gbrain's existing
  test/fixtures/eval-baselines/qrels-search.json). Consumed by
  gbrain eval gate --qrels. Catches retrieval QUALITY drops via
  recall@K + first-relevant-hit-rate + expected_top1-hit-rate.
- scripts/generate-v0.41-launch.ts: reproducible regenerator.
  Deterministic: same input + fixed published_at timestamp →
  byte-identical output. Same recipe usable for future v0.42+
  baselines.
- baselines/README.md + qrels/README.md: privacy posture, file
  format, refresh discipline (D4: include a "Why:" line in any
  commit body that intentionally moves expected slugs).

This closes the LOOP gbrain v0.41 ships: users can now point CI at
these files via gbrain eval gate --baseline X --qrels Y and fail PRs
on retrieval regressions OR correctness drops without bootstrapping
their own baseline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
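
Of the three correctness metrics named in the commit message, recall@K has a standard definition: the fraction of a query's relevant slugs that appear in the top K results. The sketch below uses that textbook definition and may not match gbrain's implementation in edge cases; the other two metrics are not defined in this PR, so they are left out. `recallAtK` and the sample slugs are hypothetical.

```typescript
// Textbook recall@K: |relevant ∩ top-K results| / |relevant|.
// Assumes `relevant` has no duplicates; returns 0 when there is nothing to find.
function recallAtK(ranked: string[], relevant: string[], k: number): number {
  if (relevant.length === 0) return 0;
  const topK = new Set(ranked.slice(0, k));
  const hits = relevant.filter((slug) => topK.has(slug)).length;
  return hits / relevant.length;
}

const ranked = ["people/alice-example", "companies/widget-co-example", "people/bob-example"];
const relevant = ["people/alice-example", "people/bob-example"];
console.log(recallAtK(ranked, relevant, 2)); // 0.5
console.log(recallAtK(ranked, relevant, 3)); // 1
```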