Populate bench_version on baseline JSONL output (closes #66) #67
The ProblemResult dataclass has bench_version as a field that defaults
to an empty string. The LLM runner plumbs it through correctly via
cli.py:run() → run_benchmark() → every ProblemResult call site. The
baseline runner had no equivalent wiring, so every row in
{python,typescript,aver}-baseline.jsonl shipped with bench_version="".
Result: baseline files have no version attribution today. They aren't
filename-stamped either (plain `{lang}-baseline.jsonl` rather than the
LLM result files' `{model}-{lang}-bench-{ver}-...jsonl`), so the only
attribution was mtime + which bench was installed when the run
happened. That's brittle for analyses that span versions.
Fix:
- cli.py:baselines() fetches `vera_bench.__version__` and passes it
to run_all_baselines, mirroring how cli.py:run() already handles it.
- run_all_baselines accepts a `bench_version` kwarg and stamps it onto
every ProblemResult between the per-language runner call and the
to_jsonl() write. Stamping centrally rather than threading it
through each per-language runner's ~18 ProblemResult call sites
keeps the attribution plumbing in one place.
- New regression test (test_baselines_populates_bench_version) loads
the produced JSONL via the CliRunner integration path and asserts
every row's bench_version equals vera_bench.__version__.
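The central-stamping approach from the Fix list can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's code: the `ProblemResult` below is a trimmed-down stand-in (only the `bench_version` field and its empty-string default come from the description above; `problem_id` and `language` are assumed names), and `stamp_bench_version` and the module-level `to_jsonl` are hypothetical helpers.

```python
import json
from dataclasses import asdict, dataclass, replace


@dataclass
class ProblemResult:
    # Hypothetical stand-in; the real dataclass has more fields.
    problem_id: str          # assumed field name
    language: str            # assumed field name
    run_correct: bool
    bench_version: str = ""  # defaults to empty, as described above


def stamp_bench_version(results, bench_version):
    """Return copies of the results with bench_version filled in."""
    return [replace(r, bench_version=bench_version) for r in results]


def to_jsonl(results):
    """Serialise results as one JSON object per line."""
    return "\n".join(json.dumps(asdict(r)) for r in results)


# Per-language runners produce results without a version...
raw = [
    ProblemResult("p1", "python", True),
    ProblemResult("p2", "python", False),
]
# ...and the version is stamped once, just before the JSONL write.
stamped = stamp_bench_version(raw, "0.0.11")
lines = to_jsonl(stamped).splitlines()
```

Stamping after the runner returns means the per-language runners need no signature change, which is the trade-off the commit message argues for.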
No bench-version bump: this is a bugfix (a field that was always
meant to be populated, wasn't) rather than a methodology change.
Same scoring, same problem counts, same run-order behaviour. Pre-fix
baselines have bench_version="" (an unambiguous signal that they
predate this fix); post-fix baselines have the populated value. No
semantic collision.
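For analyses that mix old and new baseline files, the empty-string signal can be used to separate the two populations. A small sketch, assuming each JSONL row is a JSON object with a `bench_version` key (the `problem_id` key and the `is_pre_fix` helper are hypothetical):

```python
import json


def is_pre_fix(row: dict) -> bool:
    # An empty (or missing) bench_version marks a row written before this fix.
    return row.get("bench_version", "") == ""


# Made-up sample rows, not real benchmark output.
sample = [
    '{"problem_id": "p1", "bench_version": ""}',
    '{"problem_id": "p1", "bench_version": "0.0.11"}',
]
rows = [json.loads(line) for line in sample]
pre_fix = [r for r in rows if is_pre_fix(r)]
post_fix = [r for r in rows if not is_pre_fix(r)]
```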
Verified end-to-end:
- pytest tests/ → 495 passed (the 494 from before + 1 new)
- regenerated all three baselines locally; bench_versions={'0.0.11'}
across 36 + 36 + 60 = 132 rows.
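A check along the lines of the regeneration step above could look like the sketch below. It assumes the `{lang}-baseline.jsonl` naming mentioned earlier; `summarize_baselines` is a hypothetical helper, and the demo files are written to a temporary directory with made-up row counts rather than the real 132 rows.

```python
import json
import tempfile
from pathlib import Path


def summarize_baselines(directory):
    """Collect distinct bench_version values and per-file row counts."""
    versions, counts = set(), {}
    for path in sorted(Path(directory).glob("*-baseline.jsonl")):
        rows = [
            json.loads(line)
            for line in path.read_text().splitlines()
            if line.strip()
        ]
        counts[path.name] = len(rows)
        versions.update(row.get("bench_version", "") for row in rows)
    return versions, counts


# Demo with synthetic files; the row counts here are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    for lang, n in [("python", 2), ("aver", 3)]:
        body = "\n".join(
            json.dumps({"bench_version": "0.0.11"}) for _ in range(n)
        )
        (Path(tmp) / f"{lang}-baseline.jsonl").write_text(body + "\n")
    versions, counts = summarize_baselines(tmp)
```

A single-element `versions` set with no empty string indicates every row was stamped by the same bench version.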
Closes #66.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CodeRabbit walkthrough: The pull request propagates the vera_bench package version through the baselines CLI command to the baseline runner, which now stamps each result with that version.
Codecov Report
All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #67 +/- ##
==========================================
+ Coverage 83.62% 83.65% +0.03%
==========================================
Files 10 10
Lines 1392 1395 +3
==========================================
+ Hits 1164 1167 +3
Misses 228 228
Summary
Closes #66.
`{python,typescript,aver}-baseline.jsonl` rows shipped with `bench_version=""` because the baseline runner didn't plumb version info, even though the corresponding fields in LLM result files are populated correctly. This PR closes the gap.
What changed
- `cli.py:baselines()` fetches `vera_bench.__version__` and passes it to `run_all_baselines`, mirroring how `cli.py:run()` already handles it for LLM runs.
- `run_all_baselines()` accepts a `bench_version` kwarg and stamps it onto every `ProblemResult` between the per-language runner call and the `to_jsonl()` write. Stamping centrally keeps the plumbing in one place rather than threading it through ~18 per-language `ProblemResult` call sites.
- `TestBaselinesCLI::test_baselines_populates_bench_version` loads the produced JSONL via the CliRunner integration path and asserts every row's `bench_version` equals `vera_bench.__version__`.
No version bump
This is a bugfix: `bench_version` is a field that was always meant to be populated and wasn't. Scoring values are identical (same `run_correct`, same `check_pass`, same problem counts), and pre-fix baselines (`bench_version=""`) vs post-fix baselines (`bench_version="0.0.11"`) carry an unambiguous distinguishing signal with no semantic collision. Mirrors the precedent set by PR #60 (Anthropic prompt caching), which also added attribution-style metadata without a bump.
Out of scope (deferred to follow-up)
The `vera_version` field is misnamed for non-Vera baselines and isn't populated for any language baseline by this PR. For Aver baselines, the natural attribution would be an `aver_version` field, which doesn't exist on `ProblemResult` today. Adding it would touch the dataclass + LLM runner call sites; that's a separate enhancement noted in the issue.
Test plan
- `pytest tests/` → 495 passed (494 + 1 new)
- `pytest tests/test_baseline.py -v` → 24 passed (all baseline tests including the new one)
- `ruff check` + `ruff format --check` clean
- `bench_version: "0.0.11"`
🤖 Generated with Claude Code