v0.41-launch: hermetic baseline + qrels for gbrain eval gate #13
Open
garrytan wants to merge 1 commit into
Coordinated drop alongside gbrain v0.41.0.0. Both files are hermetic-synthetic: placeholder names only, per gbrain D9 privacy posture. No real user queries, people, or companies.

- baselines/v0.41-launch.baseline.ndjson: 12 captured rows from a fixture-seeded brain (source_hash 34e88041..., mean latency 27ms). Consumed by gbrain eval gate --baseline. Catches retrieval REGRESSIONS during refactors.
- qrels/v0.41-launch.qrels.json: 12 hand-curated queries with known-right answers (promoted from gbrain's existing test/fixtures/eval-baselines/qrels-search.json). Consumed by gbrain eval gate --qrels. Catches retrieval QUALITY drops via recall@K + first-relevant-hit-rate + expected_top1-hit-rate.
- scripts/generate-v0.41-launch.ts: reproducible regenerator. Deterministic: same input + fixed published_at timestamp → byte-identical output. Same recipe usable for future v0.42+ baselines.
- baselines/README.md + qrels/README.md: privacy posture, file format, refresh discipline (D4: include a "Why:" line in any commit body that intentionally moves expected slugs).

This closes the LOOP that gbrain v0.41 ships: users can now point CI at these files via gbrain eval gate --baseline X --qrels Y and fail PRs on retrieval regressions OR correctness drops without bootstrapping their own baseline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
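For readers unfamiliar with the three qrels metrics named in the commit message, here is a minimal sketch of how they might be computed. It is not gbrain's implementation: the `QrelsQuery` shape, the `scoreQrels` helper, and the exact metric definitions (in particular, reading first-relevant-hit-rate as "at least one relevant slug in the top K") are assumptions made for illustration.

```typescript
// Hypothetical query shape; the real qrels schema may use different field names.
interface QrelsQuery {
  query: string;
  relevant: string[];      // slugs judged relevant for this query
  expected_top1?: string;  // slug expected at rank 1, when one is specified
}

interface GateMetrics {
  recallAtK: number;            // mean share of relevant slugs found in the top K
  firstRelevantHitRate: number; // share of queries with >= 1 relevant slug in the top K (assumed definition)
  expectedTop1HitRate: number;  // share of queries whose rank-1 slug equals expected_top1
}

// `results` maps each query string to its ranked list of slugs (best first).
function scoreQrels(
  queries: QrelsQuery[],
  results: Map<string, string[]>,
  k: number,
): GateMetrics {
  let recallSum = 0;
  let firstHits = 0;
  let top1Hits = 0;
  let top1Total = 0;

  for (const q of queries) {
    const ranked = results.get(q.query) ?? [];
    const topK = new Set(ranked.slice(0, k));
    const relevant = Array.from(new Set(q.relevant));
    const found = relevant.filter((slug) => topK.has(slug)).length;

    // A query with no relevant slugs contributes 0 rather than dividing by zero.
    recallSum += relevant.length === 0 ? 0 : found / relevant.length;
    if (found > 0) firstHits += 1;

    // Only queries that declare an expected top-1 slug count toward that rate.
    if (q.expected_top1 !== undefined) {
      top1Total += 1;
      if (ranked[0] === q.expected_top1) top1Hits += 1;
    }
  }

  const n = queries.length === 0 ? 1 : queries.length;
  return {
    recallAtK: recallSum / n,
    firstRelevantHitRate: firstHits / n,
    expectedTop1HitRate: top1Total === 0 ? 0 : top1Hits / top1Total,
  };
}
```

A gate built on a sketch like this would compare each rate against a threshold and exit non-zero when any falls below it; the thresholds themselves are whatever the baseline metadata specifies.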
Summary
Hermetic baseline + qrels for `gbrain eval gate` (the new CI verb landing in gbrain v0.41.0.0). Both files are hermetic-synthetic: placeholder names only, per gbrain's D9 privacy posture.

- `baselines/v0.41-launch.baseline.ndjson` (12 captured rows, source_hash `34e88041…`) drives the regression gate.
- `qrels/v0.41-launch.qrels.json` (12 hand-curated queries with known-right answers) drives the correctness gate.
- `scripts/generate-v0.41-launch.ts` is the reproducible recipe; same input + fixed `published_at` → byte-identical output.

Closes the LOOP that gbrain v0.41 ships: users point CI at these files via `gbrain eval gate --baseline X --qrels Y` and fail PRs on retrieval regressions OR correctness drops without bootstrapping their own baseline.

What's in the box
- `baselines/v0.41-launch.baseline.ndjson`: metadata line (`_kind: 'baseline_metadata'` + thresholds + `source_hash`) + 12 captured rows.
- `baselines/README.md`: privacy posture, file format, refresh discipline.
- `qrels/v0.41-launch.qrels.json`: `{schema_version, queries: [...]}`. Promoted from gbrain's existing `test/fixtures/eval-baselines/qrels-search.json` fixture.
- `qrels/README.md`: […] `source_id`-aware shapes).
- `scripts/generate-v0.41-launch.ts`: reproducible regenerator. Set `GBRAIN_SRC=<path-to-gbrain>` to use a local gbrain checkout instead of the npm dep.

Privacy posture (gbrain D9)
Every slug in both files is a `*-example` placeholder (`people/alice-example`, `companies/widget-co-example`, etc.) per gbrain's CLAUDE.md privacy rule. Real-user captures stay local in `~/.gbrain/baselines/` on each user's machine and never enter the public benchmark surface.

Refresh discipline (gbrain D4)
When a ranking change intentionally moves expected slugs, edit the qrels or regenerate the baseline, then include a `Why:` line in the commit body so future maintainers can audit the trail. Without that discipline, the gate degrades to a rubber stamp within months.

Test plan
- […] `parseBaselineFile` + `parseQrelsFile` (verified locally).
- […] `GBRAIN_SRC` produces byte-identical output.
- Run `gbrain eval gate --baseline baselines/v0.41-launch.baseline.ndjson --qrels qrels/v0.41-launch.qrels.json` against a known-good gbrain build and confirm exit 0.

🤖 Generated with Claude Code
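The byte-identical claim in the test plan rests on serialization that ignores incidental details such as object key order, with the timestamp pinned. The sketch below shows one way such a property could be checked. It is not the actual `scripts/generate-v0.41-launch.ts`: `stableStringify`, `renderNdjson`, `sha256`, and the row fields are hypothetical names, and the metadata line here carries only `_kind` and `published_at` (the real one also has thresholds and `source_hash`).

```typescript
import { createHash } from "node:crypto";

// Hypothetical row shape; real baseline rows may carry other fields.
type Row = Record<string, unknown>;

// Serialize with object keys sorted so insertion order cannot change the bytes.
// Assumes JSON-safe values (no undefined, functions, or cycles).
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(obj[key])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(value);
}

// One metadata line followed by one line per row, newline-terminated.
// The timestamp is passed in rather than read from the clock, so reruns match.
function renderNdjson(rows: Row[], publishedAt: string): string {
  const header = { _kind: "baseline_metadata", published_at: publishedAt };
  return [header, ...rows].map(stableStringify).join("\n") + "\n";
}

// Hex SHA-256 of the UTF-8 bytes; equal digests mean byte-identical output.
function sha256(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}
```

Rendering the same rows twice with the same `publishedAt` and comparing the two `sha256` digests is then a cheap regeneration check; a differing digest means some input, ordering, or timestamp leaked into the output.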