Populate bench_version on baseline JSONL output (closes #66) by aallan · Pull Request #67 · aallan/vera-bench

aallan · 2026-05-05T10:38:26Z

Summary

Closes #66.

{python,typescript,aver}-baseline.jsonl rows shipped with bench_version="" because the baseline runner didn't plumb version info, even though the corresponding fields in LLM result files are populated correctly. This PR closes the gap.

What changed

- cli.py:baselines() fetches vera_bench.__version__ and passes it to run_all_baselines, mirroring how cli.py:run() already handles it for LLM runs.
- run_all_baselines() accepts a bench_version kwarg and stamps it onto every ProblemResult between the per-language runner call and the to_jsonl() write. Stamping centrally keeps the plumbing in one place rather than threading it through ~18 per-language ProblemResult call sites.
- New regression test TestBaselinesCLI::test_baselines_populates_bench_version loads the produced JSONL via the CliRunner integration path and asserts every row's bench_version equals vera_bench.__version__.
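The central-stamping approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the trimmed-down `ProblemResult`, the `runners` mapping, and `fake_python_runner` are hypothetical stand-ins, and the real `run_all_baselines` signature may differ beyond the `bench_version: str = ""` parameter.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ProblemResult:
    # Hypothetical, trimmed-down stand-in for the real dataclass.
    problem_id: str
    language: str
    bench_version: str = ""  # defaults to empty, as the PR describes


def run_all_baselines(
    runners: Dict[str, Callable[[], List[ProblemResult]]],
    bench_version: str = "",
) -> Dict[str, List[ProblemResult]]:
    """Run each per-language runner, then stamp bench_version in one place."""
    stamped: Dict[str, List[ProblemResult]] = {}
    for lang, runner in runners.items():
        results = runner()
        # Stamp after the runner returns and before anything is written out,
        # so the per-language runners never need to know about the version.
        for result in results:
            result.bench_version = bench_version
        stamped[lang] = results
    return stamped


def fake_python_runner() -> List[ProblemResult]:
    # Hypothetical runner that leaves bench_version at its default.
    return [ProblemResult("p1", "python"), ProblemResult("p2", "python")]


results = run_all_baselines({"python": fake_python_runner}, bench_version="0.0.11")
```

The design point is that the per-language runners stay unchanged: a runner added later gets version attribution without having to thread a new argument through its own call sites.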

No version bump

This is a bugfix — bench_version is a field that was always meant to be populated and wasn't. Scoring values are identical (same run_correct, same check_pass, same problem counts), and pre-fix baselines (bench_version="") vs post-fix baselines (bench_version="0.0.11") carry an unambiguous distinguishing signal with no semantic collision. Mirrors the precedent set by PR #60 (Anthropic prompt caching) which also added attribution-style metadata without a bump.
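Because an empty bench_version marks a pre-fix row, an analysis that spans versions can separate the two populations without any extra metadata. A minimal sketch, assuming each JSONL row is a JSON object with a `bench_version` key; the helper name `split_by_attribution` is hypothetical and not part of vera-bench:

```python
import json
from typing import Iterable, List, Tuple


def split_by_attribution(lines: Iterable[str]) -> Tuple[List[dict], List[dict]]:
    """Split JSONL lines into (pre_fix, post_fix) rows by bench_version."""
    pre_fix: List[dict] = []
    post_fix: List[dict] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        row = json.loads(line)
        # A missing or empty bench_version means the row predates the fix.
        if row.get("bench_version", ""):
            post_fix.append(row)
        else:
            pre_fix.append(row)
    return pre_fix, post_fix


sample = [
    '{"problem_id": "p1", "bench_version": ""}',
    '{"problem_id": "p2", "bench_version": "0.0.11"}',
]
old_rows, new_rows = split_by_attribution(sample)
```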

Out of scope (deferred to follow-up)

The vera_version field is misnamed for non-Vera baselines and isn't populated for any language baseline by this PR. For Aver baselines, the natural attribution would be an aver_version field, which doesn't exist on ProblemResult today. Adding it would touch the dataclass + LLM runner call sites; that's a separate enhancement noted in the issue.

Test plan

- pytest tests/ → 495 passed (494 + 1 new)
- pytest tests/test_baseline.py -v → 24 passed (all baseline tests including the new one)
- ruff check + ruff format --check clean
- End-to-end: regenerated all three baselines locally; every row across 36 + 36 + 60 = 132 rows now has bench_version: "0.0.11"
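The end-to-end check could be reproduced with a short script along these lines. This is a sketch under assumptions: the baseline files are taken to sit in a single directory and match `*-baseline.jsonl`, and `summarize_bench_versions` is a hypothetical helper, not something shipped with vera-bench.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Dict, Set, Tuple


def summarize_bench_versions(directory: Path) -> Tuple[Set[str], Dict[str, int]]:
    """Return the distinct bench_version values and per-file row counts."""
    versions: Set[str] = set()
    row_counts: Counter = Counter()
    for path in sorted(directory.glob("*-baseline.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if not line.strip():
                    continue  # skip blank lines
                row = json.loads(line)
                # Treat a missing key the same as the pre-fix empty string.
                versions.add(row.get("bench_version", ""))
                row_counts[path.name] += 1
    return versions, dict(row_counts)
```

After regenerating the baselines, the expectation from the test plan is a single-element version set and row counts summing to 132.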

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

New Features
- Baseline results now include benchmark version information for each run.
Tests
- Added regression test to verify that version information is correctly captured in baseline output.

The ProblemResult dataclass has bench_version as a field that defaults to an empty string. The LLM runner plumbs it through correctly via cli.py:run() → run_benchmark() → every ProblemResult call site. The baseline runner had no equivalent wiring, so every row in {python,typescript,aver}-baseline.jsonl shipped with bench_version="".

Result: baseline files have no version attribution today. They aren't filename-stamped either (plain `{lang}-baseline.jsonl` rather than the LLM result files' `{model}-{lang}-bench-{ver}-...jsonl`), so the only attribution was mtime + which bench was installed when the run happened. That's brittle for analyses that span versions.

Fix:
- cli.py:baselines() fetches `vera_bench.__version__` and passes it to run_all_baselines, mirroring how cli.py:run() already handles it.
- run_all_baselines accepts a `bench_version` kwarg and stamps it onto every ProblemResult between the per-language runner call and the to_jsonl() write. Stamping centrally rather than threading it through each per-language runner's ~18 ProblemResult call sites keeps the attribution plumbing in one place.
- New regression test (test_baselines_populates_bench_version) loads the produced JSONL via the CliRunner integration path and asserts every row's bench_version equals vera_bench.__version__.

No bench-version bump: this is a bugfix (a field that was always meant to be populated, wasn't) rather than a methodology change. Same scoring, same problem counts, same run-order behaviour. Pre-fix baselines have bench_version="" (an unambiguous signal that they predate this fix); post-fix baselines have the populated value. No semantic collision.

Verified end-to-end:
- pytest tests/ → 495 passed (the 494 from before + 1 new)
- regenerated all three baselines locally; bench_versions={'0.0.11'} across 36 + 36 + 60 = 132 rows.

Closes #66.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai · 2026-05-05T10:38:36Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8e179c8e-b6b9-44a0-a7b4-d738d4f5d6c4

📥 Commits

Reviewing files that changed from the base of the PR and between 5b5b2a8 and 1026c37.

📒 Files selected for processing (3)

tests/test_baseline.py
vera_bench/baseline_runner.py
vera_bench/cli.py

📝 Walkthrough

Walkthrough

The pull request propagates the vera_bench package version through the baselines CLI command to the baseline runner, which now stamps each result with bench_version. A new test validates that the populated version reaches the JSONL output.

Changes

Baseline Version Attribution

Layer / File(s)	Summary
Function Signature `vera_bench/baseline_runner.py`	`run_all_baselines` adds `bench_version: str = ""` parameter with docstring update describing per-result stamping.
Core Stamping Logic `vera_bench/baseline_runner.py`	After each language-specific baseline runner call, `result.bench_version = bench_version` applies the version centrally to all `ProblemResult` rows.
CLI Integration `vera_bench/cli.py`	`baselines` command now imports `vera_bench`, computes `bench_ver = vera_bench.__version__`, and passes it to `run_all_baselines(bench_version=bench_ver)`.
Regression Test `tests/test_baseline.py`	New `test_baselines_populates_bench_version` invokes the CLI, loads `python-baseline.jsonl`, parses all JSONL rows, and asserts each carries `bench_version` matching the installed package version.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

aallan/vera-bench#35: Adds version propagation into benchmark records and JSONL output; directly related feature work.
aallan/vera-bench#8: Extends baseline runner and CLI changes to thread versions through the evaluation flow.
aallan/vera-bench#36: Also addresses recording benchmark version in baseline JSONL and filenames.

Suggested labels

harness

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main change: populating the bench_version field on baseline JSONL output, directly addressing the linked issue `#66`.
Linked Issues check	✅ Passed	All coding requirements from issue `#66` are met: bench_version is now populated via run_all_baselines(), cli.baselines() fetches and passes vera_bench.__version__, and a regression test validates the implementation.
Out of Scope Changes check	✅ Passed	All changes are directly aligned with issue `#66` scope: bench_version population in baseline runner and CLI, plus supporting regression test. No vera_version or compiler-specific fields added as correctly noted as out-of-scope.


codecov · 2026-05-05T10:44:44Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.65%. Comparing base (5b5b2a8) to head (1026c37).

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #67      +/-   ##
==========================================
+ Coverage   83.62%   83.65%   +0.03%     
==========================================
  Files          10       10              
  Lines        1392     1395       +3     
==========================================
+ Hits         1164     1167       +3     
  Misses        228      228

Flag	Coverage Δ
python	`83.65% <100.00%> (+0.03%)`	⬆️


aallan merged commit e697ea2 into main May 5, 2026
10 checks passed

aallan deleted the fix/baseline-bench-version branch May 5, 2026 11:02

aallan mentioned this pull request May 7, 2026

Migrate kimi-k2-turbo-preview → kimi-k2.6 (closes #68) #69

Merged

4 tasks

aallan mentioned this pull request May 22, 2026

Add scripts/plot_slide.py — v0.0.7 talk-slide renderer at 16:9 #71

Merged

5 tasks

aallan merged 1 commit into main from fix/baseline-bench-version
