Implement baseline runner (Phase 3) by aallan · Pull Request #8 · aallan/vera-bench

aallan · 2026-03-29T21:18:01Z

Summary

Add Python baseline execution for cross-language comparison against Vera LLM results.

baseline_runner.py: Subprocess-based execution with generated wrapper scripts. Each problem gets an isolated wrapper that imports the entry_point function, runs test cases, and prints JSON results.
cli.py: New vera-bench baselines command that runs all 24 testable problems and outputs results/python-baseline.jsonl.
11 new tests: File lookup, wrapper generation, actual execution (Tier 1 + Tier 4), error handling, JSONL serialization.

Usage

# Run Python baselines
vera-bench baselines

# Generate combined report (Vera + Python side-by-side)
vera-bench report results/

Design decisions

Subprocess over importlib: Each baseline runs in an isolated process. Avoids namespace collisions from ADT class definitions (List, Tree, Option) that differ across Tier 3/4 files.
Generated wrapper scripts: The baseline files have if __name__ blocks with hardcoded assertions, not a test harness interface. The wrapper dynamically generates import + call + JSON output from the problem's test_cases.
24 of 50 problems testable: Tier 2/3 have empty test_cases (string/ADT args can't be passed via vera run CLI). Same limitation applies to baselines — we only compare what's testable.
TypeScript deferred: Requires bun or ts-node runtime. Python-only is the practical MVP.
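
To make the first two decisions concrete, here is a minimal sketch of a generated wrapper executed in its own process. It is not the code in baseline_runner.py: the names build_wrapper and run_isolated, the file names baseline.py and wrapper.py, and the test-case shape ({"args": [...], "expected": ...}) are all assumptions made for illustration.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def build_wrapper(module_name: str, entry_point: str, test_cases: list[dict]) -> str:
    """Return the source of a throwaway script that runs every test case."""
    # repr() of the JSON text embeds the cases as an ordinary string literal.
    cases_literal = repr(json.dumps(test_cases))
    return (
        "import json\n"
        f"from {module_name} import {entry_point}\n"
        f"cases = json.loads({cases_literal})\n"
        "results = []\n"
        "for case in cases:\n"
        f"    actual = {entry_point}(*case['args'])\n"
        "    results.append({'passed': actual == case['expected'], 'actual': actual})\n"
        "print(json.dumps(results))\n"
    )


def run_isolated(baseline_src: str, entry_point: str, test_cases: list[dict],
                 timeout_s: float = 10.0) -> list[dict]:
    """Write baseline + wrapper to a temp dir and run the wrapper in a fresh process."""
    with tempfile.TemporaryDirectory() as tmp:
        work_dir = Path(tmp)
        (work_dir / "baseline.py").write_text(baseline_src)
        (work_dir / "wrapper.py").write_text(
            build_wrapper("baseline", entry_point, test_cases)
        )
        # check=True raises CalledProcessError on a non-zero exit;
        # timeout raises subprocess.TimeoutExpired.
        proc = subprocess.run(
            [sys.executable, "wrapper.py"],
            cwd=work_dir, capture_output=True, text=True,
            timeout=timeout_s, check=True,
        )
    return json.loads(proc.stdout)


# Hypothetical baseline and test cases, only to exercise the sketch.
demo_src = "def add(a, b):\n    return a + b\n"
results = run_isolated(
    demo_src, "add",
    [{"args": [1, 2], "expected": 3}, {"args": [2, 2], "expected": 5}],
)
```

Because each wrapper imports its baseline in a new interpreter, class names such as List or Tree defined by one baseline cannot leak into the next, which is the collision the subprocess approach is meant to avoid.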

Test plan

296 tests pass (285 existing + 11 new)
Ruff clean
vera-bench baselines produces python-baseline.jsonl
vera-bench report results/ shows both Vera and Python results

Generated with Claude Code

Summary by CodeRabbit

New Features
- Run Python baseline solutions against per-problem test cases with timeouts, per-problem summaries, timestamps and optional incremental JSONL output.
CLI
- New baselines command to discover problems, run baselines, show progress and print aggregated metrics; configurable language (python) and output directory.
Tests
- Added end-to-end tests covering baseline discovery, wrapper generation, execution outcomes and CLI integration.
Documentation
- README clarified installation steps and quick-start guidance.

Add Python baseline execution for cross-language comparison

baseline_runner.py:
- Subprocess-based execution with generated wrapper scripts
- Each problem gets an isolated wrapper that imports the entry_point, runs test cases, and prints JSON results
- Handles timeouts, missing files, and execution errors
- Incremental JSONL output (same format as LLM runner)

cli.py:
- Add 'vera-bench baselines' command
- Runs all 24 testable problems (those with test_cases)
- Skips 26 problems with empty test_cases (Tier 2/3 ADT/string)
- Outputs to results/python-baseline.jsonl

Tests: 11 new tests covering file lookup, wrapper generation, actual execution (Tier 1 + Tier 4), error handling, JSONL serialization, and CLI command registration. 296 total tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-03-29T21:18:12Z

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

Walkthrough

Adds a Python-only baseline execution harness, CLI integration, and tests: discovers Python baseline files, generates temporary wrappers to run entry points in subprocesses, captures per-test JSON results with timeout and error handling, computes ProblemResult fields (including None for empty tests), and exposes a baselines CLI command that writes JSONL results.

Changes

Cohort / File(s)	Summary
Baseline Execution Engine `vera_bench/baseline_runner.py`	Adds Python-only baseline discovery under `solutions/python`, `_find_baseline_file()` with prefix matching, `_build_python_wrapper()` to generate a temp runner, `run_python_baseline()` to execute wrappers with timeout/error/JSON parsing and compute `tests_total`/`tests_passed`/`run_correct`, `run_all_baselines()` to iterate problems and optionally append JSONL results, and `_now()` timestamp helper.
CLI Integration `vera_bench/cli.py`	Adds `@main.command()` `baselines` with `--language` (only `python`) and `--output-dir`; loads `problems/*/VB_.json`, removes existing `<output-dir>/<language>-baseline.jsonl` before running, calls `run_all_baselines()`, and prints metrics and output location. Also updates `run(...)` to unlink existing output file before writing.
Test Suite `tests/test_baseline.py`	New test file covering baseline discovery, wrapper generation (imports, entry-point invocation, JSON output, empty `test_cases`), `run_python_baseline()` behaviours (successful Tier runs, empty tests -> None, missing baseline -> failure with error message, timeout/error/non-JSON handling), serialization via `to_jsonl()`, and CLI `baselines` producing `python-baseline.jsonl`.
Documentation `README.md`	Minor wording and formatting changes: clarifies Vera language phrasing, adjusts prerequisites list formatting, expands Vera install guidance and quick-start note.
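
The summary fields named in the table can be illustrated with a small sketch. The Summary class below is a hypothetical, reduced stand-in for ProblemResult, and the rule that run_correct is True only when every test passes is an assumption; the source only states that empty test lists yield None.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Summary:
    """Hypothetical stand-in for the real ProblemResult (fields reduced)."""
    tests_total: int
    tests_passed: int
    run_correct: Optional[bool]


def summarise(per_test_passed: list[bool]) -> Summary:
    total = len(per_test_passed)
    passed = sum(per_test_passed)
    # None means "nothing to judge", which is distinct from a failing run.
    run_correct = None if total == 0 else passed == total
    return Summary(total, passed, run_correct)


all_pass = summarise([True, True])
one_fail = summarise([True, False])
no_tests = summarise([])
```

Keeping None separate from False lets a report print a placeholder for untestable problems instead of counting them as failures.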

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Fix run_correct reporting: show '-' when no tests exist #4 — related changes to run_correct handling for problems with no test_cases and CLI/metrics formatting.
Implement LLM runner harness (Phase 2) #3 — related modifications to vera_bench/cli.py and ProblemResult-style output handling.

Suggested labels

harness, docs

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 35.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly reflects the primary purpose of the changeset: implementing a Python baseline runner component for the Vera benchmark, enabling cross-language code generation comparison.


codecov-commenter · 2026-03-29T21:18:34Z

Codecov Report

❌ Patch coverage is 89.28571% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.06%. Comparing base (52c921b) to head (4ce586e).

Files with missing lines	Patch %	Lines
vera_bench/baseline_runner.py	90.24%	8 Missing ⚠️
vera_bench/cli.py	86.66%	4 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main       #8      +/-   ##
==========================================
+ Coverage   59.88%   66.06%   +6.17%     
==========================================
  Files           9       10       +1     
  Lines         713      825     +112     
==========================================
+ Hits          427      545     +118     
+ Misses        286      280       -6

Flag	Coverage Δ
python	`66.06% <89.28%> (+6.17%)`	⬆️


Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_baseline.py`:
- Around line 127-131: Replace the shallow registration check with an end-to-end
CLI invocation: use click.testing.CliRunner to invoke the Click group `main`'s
"baselines" command (or add a new test) and pass a pytest `tmp_path` as the
output directory, then assert the runner.exit_code is 0 and that the expected
output files or artifacts are actually created under `tmp_path`; target the
`main` Click group and the `baselines` command so you exercise the command path
rather than only checking `main.commands`.

In `@vera_bench/baseline_runner.py`:
- Around line 127-168: The ProblemResult returns for the timeout, non-zero-exit,
and bad-JSON branches currently omit the real test count and default to
tests_total=0; update each of those ProblemResult constructors (the one returned
on timeout, the one when result.returncode != 0, and the one in the
json.JSONDecodeError except block) to include tests_total=tests_total so failed
runs retain the actual test count, keeping other fields (problem_id,
model="baseline", language="python", attempt, check_pass, run_correct,
error_message, wall_time_s, timestamp=_now()) unchanged.
- Around line 194-215: The temp directory created by tempfile.mkdtemp() stored
in work_dir leaks on errors; change to use tempfile.TemporaryDirectory() as a
context manager and set work_dir = Path(tmpdir.name) (or Path(tmpdir) depending
on API) so the directory is automatically removed on exit; wrap the block that
uses work_dir (the with Progress(...) loop that calls run_python_baseline)
inside the TemporaryDirectory() context and import tempfile.TemporaryDirectory
if necessary to ensure cleanup on both success and failure.
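
The two baseline_runner.py findings above can be sketched together. This is an illustration under assumptions, not the project's code: Result is a hypothetical, reduced stand-in for ProblemResult, and run_script and run_many are invented names. Note that when TemporaryDirectory is used in a with statement, the bound value is already the path string, so Path(tmpdir) is the form that applies there.

```python
import json
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class Result:
    """Hypothetical, reduced stand-in for ProblemResult."""
    tests_total: int
    tests_passed: int = 0
    run_correct: Optional[bool] = False
    error_message: str = ""


def run_script(script: Path, tests_total: int, timeout_s: float) -> Result:
    # Every error branch passes tests_total through, so a failed run
    # still records how many tests it was supposed to execute.
    try:
        proc = subprocess.run(
            [sys.executable, str(script)],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return Result(tests_total=tests_total, error_message="timeout")
    if proc.returncode != 0:
        return Result(tests_total=tests_total,
                      error_message=f"exit code {proc.returncode}")
    try:
        outcomes = json.loads(proc.stdout)
    except json.JSONDecodeError:
        return Result(tests_total=tests_total, error_message="non-JSON output")
    passed = sum(1 for ok in outcomes if ok)
    return Result(tests_total, passed, passed == tests_total)


def run_many(scripts: dict[str, str], tests_total: int = 3) -> dict[str, Result]:
    results = {}
    # The directory is removed on exit from the block, on success or error.
    with tempfile.TemporaryDirectory() as tmpdir:
        work_dir = Path(tmpdir)
        for name, source in scripts.items():
            path = work_dir / f"{name}.py"
            path.write_text(source)
            results[name] = run_script(path, tests_total, timeout_s=5.0)
    return results


# Three made-up scripts: a clean run, a crash, and garbage output.
demo = run_many({
    "ok": "print('[true, true, true]')",
    "crash": "raise SystemExit(2)",
    "garbage": "print('not json')",
})
```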

In `@vera_bench/cli.py`:
- Around line 223-238: The output JSONL file at output_path must be truncated
before appending new results so stale rows don't persist; before calling
run_all_baselines (which opens the file in append mode), clear or recreate
output_path (e.g., open output_path for write/truncate or unlink+touch) using
the existing output_path variable (and keep solutions_dir and problems as-is),
so subsequent compute_metrics()/report reads only the fresh run's records;
ensure this truncation happens after output_dir.mkdir(...) and before the
run_all_baselines(...) call.
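
A minimal sketch of the truncate-then-append ordering described above, assuming a per-record writer that opens the file in append mode. The names append_record and fresh_run and the record contents are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def append_record(path: Path, record: dict) -> None:
    """Append one JSON object per line, as an incremental writer would."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def fresh_run(output_dir: Path, records: list[dict],
              name: str = "python-baseline.jsonl") -> Path:
    output_dir.mkdir(parents=True, exist_ok=True)
    output_path = output_dir / name
    # Drop rows left by an earlier run, then recreate an empty file so the
    # output exists even when there are no records (missing_ok: Python 3.8+).
    output_path.unlink(missing_ok=True)
    output_path.touch()
    for record in records:
        append_record(output_path, record)
    return output_path


with tempfile.TemporaryDirectory() as tmp:
    fresh_run(Path(tmp), [{"problem_id": "A"}, {"problem_id": "B"}])
    out = fresh_run(Path(tmp), [{"problem_id": "A"}])
    line_count = len(out.read_text(encoding="utf-8").splitlines())
```

Without the unlink step the second run would leave three lines in the file, and any metrics computed from it would mix rows from both runs.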


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 63f09734-a3f0-434c-bf7b-a95aa3330bf6

📥 Commits

Reviewing files that changed from the base of the PR and between 52c921b and 23cb6d4.

📒 Files selected for processing (3)

tests/test_baseline.py
vera_bench/baseline_runner.py
vera_bench/cli.py

- Include tests_total in error-path ProblemResults (timeout, non-zero exit, bad JSON) so failed runs retain the actual test count
- Use TemporaryDirectory context manager instead of mkdtemp to prevent temp dir leaks on errors
- Truncate stale JSONL before appending (both run and baselines commands)
- Add end-to-end CLI test using CliRunner that verifies baselines command produces output file

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@vera_bench/baseline_runner.py`:
- Around line 188-193: run_all_baselines currently accepts a language parameter
but always calls run_python_baseline; either remove the unused language
parameter or implement dispatching: inside run_all_baselines check the language
value and call run_python_baseline for "python" and raise NotImplementedError
(or add a TODO) for unsupported languages like "typescript"; update the function
signature and any callers if you remove the parameter, or add the dispatch logic
and a clear error for unsupported languages to avoid silently ignoring the
parameter.
- Around line 21-30: The _find_baseline_file function currently collapses both
zero and multiple glob matches into None; change it to return None only when no
matches are found and raise a clear ValueError when multiple matches exist to
surface naming conflicts. Specifically, in _find_baseline_file inspect the
matches list: if len(matches) == 0 return None; if len(matches) == 1 return
matches[0]; if len(matches) > 1 raise a ValueError that includes the prefix,
lang_dir and the list of matching paths (the variables matches, prefix, and
lang_dir) so callers and logs can debug the conflict.
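
Both suggestions can be sketched briefly. The function bodies, the glob pattern, and the file names below are assumptions for illustration; only the behaviours (None for no match, ValueError for several, NotImplementedError for unsupported languages) come from the review text.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_baseline_file(lang_dir: Path, prefix: str) -> Optional[Path]:
    """Return the single file matching prefix, None if absent, error if ambiguous."""
    matches = sorted(lang_dir.glob(f"{prefix}*.py"))
    if not matches:
        return None
    if len(matches) > 1:
        names = ", ".join(p.name for p in matches)
        raise ValueError(
            f"multiple baselines match {prefix!r} in {lang_dir}: {names}"
        )
    return matches[0]


def require_supported(language: str) -> str:
    """Fail loudly instead of silently running Python for another language."""
    if language != "python":
        raise NotImplementedError(
            f"baseline language {language!r} is not supported yet"
        )
    return language


# Made-up file names, only to exercise the three outcomes.
with tempfile.TemporaryDirectory() as tmp:
    lang_dir = Path(tmp)
    for name in ("VB_001_add.py", "VB_002_a.py", "VB_002_b.py"):
        (lang_dir / name).write_text("")
    found_name = find_baseline_file(lang_dir, "VB_001").name
    missing = find_baseline_file(lang_dir, "VB_003")
    try:
        find_baseline_file(lang_dir, "VB_002")
        conflict_raised = False
    except ValueError:
        conflict_raised = True
```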


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0b4de718-3329-4c4e-a7cc-53cb8a1e0ec1

📥 Commits

Reviewing files that changed from the base of the PR and between 23cb6d4 and e25f262.

📒 Files selected for processing (2)

README.md
vera_bench/baseline_runner.py

- _find_baseline_file: raise ValueError on multiple glob matches instead of silently returning None
- run_all_baselines: raise NotImplementedError for non-Python languages

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

aallan and others added 2 commits March 29, 2026 22:22

Polish README installation instructions

7301964

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix S607: use sys.executable instead of bare 'python'

e25f262

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai Bot reviewed Mar 29, 2026


coderabbitai Bot reviewed Mar 29, 2026


aallan merged commit e7a1f9b into main Mar 29, 2026
7 checks passed

coderabbitai Bot mentioned this pull request Mar 30, 2026

Add TypeScript support: baselines + LLM generation (v0.0.4) #18

Merged

4 tasks

aallan deleted the feature/baseline-runner branch March 30, 2026 15:51

This was referenced Mar 31, 2026

Include bench and vera versions in filenames and JSONL records (#20) #35

Merged

Increase test coverage to 83%, version in filenames (v0.0.6) #36

Merged

This was referenced Apr 7, 2026

Moonshot provider support + full benchmark script (v0.0.7) #38

Merged

Add Aver language support + language-neutral problem descriptions #48

Merged

coderabbitai Bot mentioned this pull request May 5, 2026

Populate bench_version on baseline JSONL output (closes #66) #67

Merged

4 tasks

This was referenced May 21, 2026

Add AILANG as a baseline target language #70

Merged

scripts/run_full_benchmark.py: include AILANG targets in sweep #75

Merged


aallan merged 5 commits into main from feature/baseline-runner

2 participants
