Populate bench_version on baseline JSONL output (closes #66) #67

Merged
aallan merged 1 commit into main from fix/baseline-bench-version
May 5, 2026

@aallan aallan commented May 5, 2026

Owner

Summary

Closes #66.

{python,typescript,aver}-baseline.jsonl rows shipped with bench_version="" because the baseline runner never plumbed version info through, even though the same field is populated correctly in LLM result files. This PR closes the gap.

What changed

  • cli.py:baselines() fetches vera_bench.__version__ and passes it to run_all_baselines, mirroring how cli.py:run() already handles it for LLM runs.
  • run_all_baselines() accepts a bench_version kwarg and stamps it onto every ProblemResult between the per-language runner call and the to_jsonl() write. Stamping centrally keeps the plumbing in one place rather than threading it through ~18 per-language ProblemResult call sites.
  • New regression test TestBaselinesCLI::test_baselines_populates_bench_version loads the produced JSONL via the CliRunner integration path and asserts every row's bench_version equals vera_bench.__version__.
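
The central stamping described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `ProblemResult` here is a cut-down stand-in with made-up fields (`problem_id`, `language`), `run_python_baseline` is a hypothetical per-language runner, and `to_jsonl()` is assumed to be a method on the result. Only the `bench_version` kwarg with its `""` default and the `result.bench_version = bench_version` assignment come from the PR itself.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ProblemResult:
    # Simplified stand-in; the real dataclass carries more fields.
    problem_id: str
    language: str
    run_correct: bool
    bench_version: str = ""  # defaults to empty, as described in the PR

    def to_jsonl(self) -> str:
        # Assumption: one JSON object per line.
        return json.dumps(asdict(self))


def run_python_baseline() -> list[ProblemResult]:
    # Hypothetical per-language runner; it never sets bench_version itself.
    return [
        ProblemResult("p001", "python", True),
        ProblemResult("p002", "python", False),
    ]


def run_all_baselines(bench_version: str = "") -> list[str]:
    results = run_python_baseline()
    # Stamp once, centrally, between the runner call and serialisation,
    # instead of threading the value through every ProblemResult call site.
    for result in results:
        result.bench_version = bench_version
    return [result.to_jsonl() for result in results]


lines = run_all_baselines(bench_version="0.0.11")
```

Stamping after the runner returns means a new per-language runner picks up attribution without any changes of its own, at the cost of the runner's intermediate results being unversioned until the loop has run.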

No version bump

This is a bugfix — bench_version is a field that was always meant to be populated and wasn't. Scoring values are identical (same run_correct, same check_pass, same problem counts), and pre-fix baselines (bench_version="") vs post-fix baselines (bench_version="0.0.11") carry an unambiguous distinguishing signal with no semantic collision. Mirrors the precedent set by PR #60 (Anthropic prompt caching) which also added attribution-style metadata without a bump.
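
Because pre-fix rows carry an empty bench_version, an analysis that spans both generations of baseline files can separate them without relying on file mtimes. A minimal sketch, assuming rows have already been parsed into dicts; `split_by_attribution` is a hypothetical helper name, not part of vera_bench:

```python
def split_by_attribution(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Partition rows into (pre_fix, post_fix) by whether bench_version is set."""
    pre_fix = [row for row in rows if not row.get("bench_version")]
    post_fix = [row for row in rows if row.get("bench_version")]
    return pre_fix, post_fix


rows = [
    {"problem_id": "p001", "bench_version": ""},        # pre-fix baseline row
    {"problem_id": "p001", "bench_version": "0.0.11"},  # post-fix baseline row
]
pre_fix, post_fix = split_by_attribution(rows)
```

A row with the key missing entirely is treated the same as an empty string here, which is a choice made for this sketch.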

Out of scope (deferred to follow-up)

The vera_version field is misnamed for non-Vera baselines and isn't populated for any language baseline by this PR. For Aver baselines, the natural attribution would be an aver_version field, which doesn't exist on ProblemResult today. Adding it would touch the dataclass + LLM runner call sites; that's a separate enhancement noted in the issue.

Test plan

  • pytest tests/ → 495 passed (494 + 1 new)
  • pytest tests/test_baseline.py -v → 24 passed (all baseline tests including the new one)
  • ruff check + ruff format --check clean
  • End-to-end: regenerated all three baselines locally; every row across 36 + 36 + 60 = 132 rows now has bench_version: "0.0.11"
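
The end-to-end check amounts to collecting the distinct bench_version values across the regenerated files and confirming that only one value appears. A rough sketch of such a check, where `bench_versions` is a hypothetical helper and the file paths are whatever the baselines command produced locally:

```python
import json
from pathlib import Path


def bench_versions(paths: list[Path]) -> set[str]:
    """Collect the distinct bench_version values found across JSONL files."""
    versions: set[str] = set()
    for path in paths:
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                if line.strip():  # skip blank lines
                    # A missing key is reported as "", like a pre-fix row.
                    versions.add(json.loads(line).get("bench_version", ""))
    return versions
```

If the returned set contains `""` or more than one value, at least one file was produced before the fix or under a different bench release.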

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Baseline results now include benchmark version information for each run.
  • Tests

    • Added regression test to verify that version information is correctly captured in baseline output.

The ProblemResult dataclass has bench_version as a field that defaults
to an empty string. The LLM runner plumbs it through correctly via
cli.py:run() → run_benchmark() → every ProblemResult call site. The
baseline runner had no equivalent wiring, so every row in
{python,typescript,aver}-baseline.jsonl shipped with bench_version="".

Result: baseline files have no version attribution today. They aren't
filename-stamped either (plain `{lang}-baseline.jsonl` rather than the
LLM result files' `{model}-{lang}-bench-{ver}-...jsonl`), so the only
attribution was mtime + which bench was installed when the run
happened. That's brittle for analyses that span versions.

Fix:
- cli.py:baselines() fetches `vera_bench.__version__` and passes it
  to run_all_baselines, mirroring how cli.py:run() already handles it.
- run_all_baselines accepts a `bench_version` kwarg and stamps it onto
  every ProblemResult between the per-language runner call and the
  to_jsonl() write. Stamping centrally rather than threading it
  through each per-language runner's ~18 ProblemResult call sites
  keeps the attribution plumbing in one place.
- New regression test (test_baselines_populates_bench_version) loads
  the produced JSONL via the CliRunner integration path and asserts
  every row's bench_version equals vera_bench.__version__.

No bench-version bump: this is a bugfix (a field that was always
meant to be populated, wasn't) rather than a methodology change.
Same scoring, same problem counts, same run-order behaviour. Pre-fix
baselines have bench_version="" (an unambiguous signal that they
predate this fix); post-fix baselines have the populated value. No
semantic collision.

Verified end-to-end:
- pytest tests/ → 495 passed (the 494 from before + 1 new)
- regenerated all three baselines locally; bench_versions={'0.0.11'}
  across 36 + 36 + 60 = 132 rows.

Closes #66.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai Bot commented May 5, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8e179c8e-b6b9-44a0-a7b4-d738d4f5d6c4

📥 Commits

Reviewing files that changed from the base of the PR and between 5b5b2a8 and 1026c37.

📒 Files selected for processing (3)
  • tests/test_baseline.py
  • vera_bench/baseline_runner.py
  • vera_bench/cli.py

📝 Walkthrough

The pull request propagates the vera_bench package version through the baselines CLI command to the baseline runner, which now stamps each result with bench_version. A new test validates that the populated version reaches the JSONL output.

Changes

Baseline Version Attribution

| Layer | File(s) | Summary |
|---|---|---|
| Function Signature | vera_bench/baseline_runner.py | run_all_baselines adds bench_version: str = "" parameter with docstring update describing per-result stamping. |
| Core Stamping Logic | vera_bench/baseline_runner.py | After each language-specific baseline runner call, result.bench_version = bench_version applies the version centrally to all ProblemResult rows. |
| CLI Integration | vera_bench/cli.py | baselines command now imports vera_bench, computes bench_ver = vera_bench.__version__, and passes it to run_all_baselines(bench_version=bench_ver). |
| Regression Test | tests/test_baseline.py | New test_baselines_populates_bench_version invokes the CLI, loads python-baseline.jsonl, parses all JSONL rows, and asserts each carries bench_version matching the installed package version. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • aallan/vera-bench#35: Adds version propagation into benchmark records and JSONL output; directly related feature work.
  • aallan/vera-bench#8: Extends baseline runner and CLI changes to thread versions through the evaluation flow.
  • aallan/vera-bench#36: Also addresses recording benchmark version in baseline JSONL and filenames.

Suggested labels

harness

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: populating the bench_version field on baseline JSONL output, directly addressing the linked issue #66. |
| Linked Issues check | ✅ Passed | All coding requirements from issue #66 are met: bench_version is now populated via run_all_baselines(), cli.baselines() fetches and passes vera_bench.__version__, and a regression test validates the implementation. |
| Out of Scope Changes check | ✅ Passed | All changes are directly aligned with issue #66 scope: bench_version population in baseline runner and CLI, plus supporting regression test. No vera_version or compiler-specific fields were added; these are correctly noted as out of scope. |

codecov Bot commented May 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.65%. Comparing base (5b5b2a8) to head (1026c37).

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #67      +/-   ##
==========================================
+ Coverage   83.62%   83.65%   +0.03%     
==========================================
  Files          10       10              
  Lines        1392     1395       +3     
==========================================
+ Hits         1164     1167       +3     
  Misses        228      228              
| Flag | Coverage Δ |
|---|---|
| python | 83.65% <100.00%> (+0.03%) ⬆️ |
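
The percentages in the diff can be reproduced from the Hits and Lines rows. The sketch below assumes the figures are truncated to two decimals rather than rounded; that is an inference from the numbers shown here (1167/1395 is roughly 83.656%, displayed as 83.65%), not documented Codecov behaviour, and `coverage_pct` is a hypothetical helper.

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Coverage as a percentage, truncated (not rounded) to two decimals."""
    # Integer arithmetic avoids floating-point surprises before truncation.
    return (hits * 10_000 // lines) / 100


base = coverage_pct(1164, 1392)  # main before the PR
head = coverage_pct(1167, 1395)  # after the 3 added, fully covered lines
```

All three new lines are hits, so misses stay at 228 while the denominator grows, which is why coverage moves up slightly.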

@aallan aallan merged commit e697ea2 into main May 5, 2026
10 checks passed
@aallan aallan deleted the fix/baseline-bench-version branch May 5, 2026 11:02
Linked issue: Baseline JSONL: populate bench_version field (currently empty)
