Implement baseline runner (Phase 3) #8

Merged
aallan merged 5 commits into main from feature/baseline-runner
Mar 29, 2026
Conversation

@aallan aallan commented Mar 29, 2026

Owner

Summary

Add Python baseline execution for cross-language comparison against Vera LLM results.

  • baseline_runner.py: Subprocess-based execution with generated wrapper scripts. Each problem gets an isolated wrapper that imports the entry_point function, runs test cases, and prints JSON results.
  • cli.py: New vera-bench baselines command that runs all 24 testable problems and outputs results/python-baseline.jsonl.
  • 11 new tests: File lookup, wrapper generation, actual execution (Tier 1 + Tier 4), error handling, JSONL serialization.

Usage

# Run Python baselines
vera-bench baselines

# Generate combined report (Vera + Python side-by-side)
vera-bench report results/

Design decisions

  • Subprocess over importlib: Each baseline runs in an isolated process. Avoids namespace collisions from ADT class definitions (List, Tree, Option) that differ across Tier 3/4 files.
  • Generated wrapper scripts: The baseline files have if __name__ blocks with hardcoded assertions, not a test harness interface. The wrapper dynamically generates import + call + JSON output from the problem's test_cases.
  • 24 of 50 problems testable: Tier 2/3 have empty test_cases (string/ADT args can't be passed via vera run CLI). Same limitation applies to baselines — we only compare what's testable.
  • TypeScript deferred: Requires bun or ts-node runtime. Python-only is the practical MVP.
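
The subprocess-plus-generated-wrapper approach described above could look roughly like the sketch below. It is an illustration only, not the code in baseline_runner.py: the helper names `build_wrapper` and `run_wrapper`, the test-case shape (`{"args": [...], "expected": ...}`), and the JSON output layout are all assumptions made for the example.

```python
import json
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def build_wrapper(module_name: str, entry_point: str, test_cases: list) -> str:
    """Return source for a script that calls entry_point on each test case.

    Hypothetical helper: the {"args": [...], "expected": ...} shape is an
    assumption, not the benchmark's actual test_cases schema.
    """
    return "\n".join([
        "import json",
        # Importing (rather than executing) skips the file's `if __name__` block.
        f"from {module_name} import {entry_point}",
        f"cases = json.loads({json.dumps(test_cases)!r})",
        "results = []",
        "for case in cases:",
        f"    actual = {entry_point}(*case['args'])",
        "    results.append({'actual': actual, 'passed': actual == case['expected']})",
        "print(json.dumps(results))",
    ])


def run_wrapper(baseline: Path, entry_point: str, test_cases: list,
                timeout_s: float = 10.0) -> list:
    """Run the baseline in its own interpreter and parse the JSON it prints."""
    with tempfile.TemporaryDirectory() as tmp:
        wrapper = Path(tmp) / "wrapper.py"
        wrapper.write_text(build_wrapper(baseline.stem, entry_point, test_cases))
        proc = subprocess.run(
            [sys.executable, str(wrapper)],
            capture_output=True, text=True, timeout=timeout_s,
            # Make the baseline importable in the child process only.
            env={**os.environ, "PYTHONPATH": str(baseline.parent)},
        )
    proc.check_returncode()
    return json.loads(proc.stdout)
```

Because each call starts a fresh interpreter, two baseline files that both define a class named `List` or `Tree` never share a namespace, which is the collision the importlib route would have to work around.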

Test plan

  • 296 tests pass (285 existing + 11 new)
  • Ruff clean
  • vera-bench baselines produces python-baseline.jsonl
  • vera-bench report results/ shows both Vera and Python results

Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Run Python baseline solutions against per-problem test cases with timeouts, per-problem summaries, timestamps and optional incremental JSONL output.
  • CLI
    • New baselines command to discover problems, run baselines, show progress and print aggregated metrics; configurable language (python) and output directory.
  • Tests
    • Added end-to-end tests covering baseline discovery, wrapper generation, execution outcomes and CLI integration.
  • Documentation
    • README clarified installation steps and quick-start guidance.

Add Python baseline execution for cross-language comparison:

baseline_runner.py:
- Subprocess-based execution with generated wrapper scripts
- Each problem gets an isolated wrapper that imports the entry_point,
  runs test cases, and prints JSON results
- Handles timeouts, missing files, and execution errors
- Incremental JSONL output (same format as LLM runner)

cli.py:
- Add 'vera-bench baselines' command
- Runs all 24 testable problems (those with test_cases)
- Skips 26 problems with empty test_cases (Tier 2/3 ADT/string)
- Outputs to results/python-baseline.jsonl

Tests: 11 new tests covering file lookup, wrapper generation,
actual execution (Tier 1 + Tier 4), error handling, JSONL
serialization, and CLI command registration.

296 total tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai Bot commented Mar 29, 2026


Caution

Review failed

Pull request was closed or merged during review

Walkthrough

Adds a Python-only baseline execution harness, CLI integration, and tests: discovers Python baseline files, generates temporary wrappers to run entry points in subprocesses, captures per-test JSON results with timeout and error handling, computes ProblemResult fields (including None for empty tests), and exposes a baselines CLI command that writes JSONL results.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Baseline Execution Engine: vera_bench/baseline_runner.py | Adds Python-only baseline discovery under solutions/python, _find_baseline_file() with prefix matching, _build_python_wrapper() to generate a temp runner, run_python_baseline() to execute wrappers with timeout/error/JSON parsing and compute tests_total/tests_passed/run_correct, run_all_baselines() to iterate problems and optionally append JSONL results, and _now() timestamp helper. |
| CLI Integration: vera_bench/cli.py | Adds @main.command() baselines with --language (only python) and --output-dir; loads problems/**/VB_*.json, removes existing <output-dir>/<language>-baseline.jsonl before running, calls run_all_baselines(), and prints metrics and output location. Also updates run(...) to unlink existing output file before writing. |
| Test Suite: tests/test_baseline.py | New test file covering baseline discovery, wrapper generation (imports, entry-point invocation, JSON output, empty test_cases), run_python_baseline() behaviours (successful Tier runs, empty tests -> None, missing baseline -> failure with error message, timeout/error/non-JSON handling), serialization via to_jsonl(), and CLI baselines producing python-baseline.jsonl. |
| Documentation: README.md | Minor wording and formatting changes: clarifies Vera language phrasing, adjusts prerequisites list formatting, expands Vera install guidance and quick-start note. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Suggested labels

harness, docs

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 35.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly reflects the primary purpose of the changeset: implementing a Python baseline runner component for the Vera benchmark, enabling cross-language code generation comparison. |


codecov-commenter commented Mar 29, 2026


Codecov Report

❌ Patch coverage is 89.28571% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.06%. Comparing base (52c921b) to head (4ce586e).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| vera_bench/baseline_runner.py | 90.24% | 8 Missing ⚠️ |
| vera_bench/cli.py | 86.66% | 4 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main       #8      +/-   ##
==========================================
+ Coverage   59.88%   66.06%   +6.17%     
==========================================
  Files           9       10       +1     
  Lines         713      825     +112     
==========================================
+ Hits          427      545     +118     
+ Misses        286      280       -6     

| Flag | Coverage Δ |
|---|---|
| python | 66.06% <89.28%> (+6.17%) ⬆️ |

aallan and others added 2 commits March 29, 2026 22:22
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_baseline.py`:
- Around line 127-131: Replace the shallow registration check with an end-to-end
CLI invocation: use click.testing.CliRunner to invoke the Click group `main`'s
"baselines" command (or add a new test) and pass a pytest `tmp_path` as the
output directory, then assert the runner.exit_code is 0 and that the expected
output files or artifacts are actually created under `tmp_path`; target the
`main` Click group and the `baselines` command so you exercise the command path
rather than only checking `main.commands`.

In `@vera_bench/baseline_runner.py`:
- Around line 127-168: The ProblemResult returns for the timeout, non-zero-exit,
and bad-JSON branches currently omit the real test count and default to
tests_total=0; update each of those ProblemResult constructors (the one returned
on timeout, the one when result.returncode != 0, and the one in the
json.JSONDecodeError except block) to include tests_total=tests_total so failed
runs retain the actual test count, keeping other fields (problem_id,
model="baseline", language="python", attempt, check_pass, run_correct,
error_message, wall_time_s, timestamp=_now()) unchanged.
- Around line 194-215: The temp directory created by tempfile.mkdtemp() stored
in work_dir leaks on errors; change to use tempfile.TemporaryDirectory() as a
context manager (`with tempfile.TemporaryDirectory() as tmpdir:`) and set
work_dir = Path(tmpdir), since the with statement yields the path as a string,
so the directory is automatically removed on exit; wrap the block that uses
work_dir (the with Progress(...) loop that calls run_python_baseline) inside the
TemporaryDirectory() context to ensure cleanup on both success and failure.
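
A minimal sketch of the two baseline_runner.py findings above, under stated assumptions: `Result` is a hypothetical stand-in for ProblemResult carrying only the fields needed here, and `run_script` is an illustrative name, not a function in the repository. It shows every failure branch passing `tests_total` through, and the work directory being managed by `TemporaryDirectory`.

```python
import json
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class Result:
    """Hypothetical stand-in for ProblemResult (assumed fields only)."""
    tests_total: int
    tests_passed: int = 0
    run_correct: bool = False
    error_message: Optional[str] = None


def run_script(source: str, tests_total: int, timeout_s: float = 5.0) -> Result:
    """Run source in a subprocess; every failure path keeps tests_total."""
    # The directory is removed when the with block exits, even on exceptions.
    with tempfile.TemporaryDirectory() as tmpdir:
        script = Path(tmpdir) / "wrapper.py"
        script.write_text(source)
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True, text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return Result(tests_total, error_message=f"timeout after {timeout_s}s")
        if proc.returncode != 0:
            return Result(tests_total, error_message=f"exit code {proc.returncode}")
        try:
            outcomes = json.loads(proc.stdout)
        except json.JSONDecodeError:
            return Result(tests_total, error_message="wrapper printed non-JSON output")
    # Assumes the wrapper prints a JSON list of per-test booleans.
    passed = sum(1 for ok in outcomes if ok)
    return Result(tests_total, passed, run_correct=(passed == tests_total))
```

With the count preserved, a timed-out problem with five test cases is reported as 0 of 5 rather than 0 of 0, so aggregate pass rates are not skewed by failed runs.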

In `@vera_bench/cli.py`:
- Around line 223-238: The output JSONL file at output_path must be truncated
before appending new results so stale rows don't persist; before calling
run_all_baselines (which opens the file in append mode), clear or recreate
output_path (e.g., open output_path for write/truncate or unlink+touch) using
the existing output_path variable (and keep solutions_dir and problems as-is),
so subsequent compute_metrics()/report reads only the fresh run's records;
ensure this truncation happens after output_dir.mkdir(...) and before the
run_all_baselines(...) call.
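
The ordering requested for cli.py (create the directory, clear the old file, then append) can be sketched as follows. The function names `start_fresh_run` and `append_jsonl` are hypothetical; only the `<language>-baseline.jsonl` file name comes from the PR description.

```python
import json
from pathlib import Path


def start_fresh_run(output_dir: Path, language: str = "python") -> Path:
    """Create the output directory and drop any results file from an earlier run."""
    output_dir.mkdir(parents=True, exist_ok=True)
    output_path = output_dir / f"{language}-baseline.jsonl"
    # missing_ok avoids an error on the first run (Python 3.8+).
    output_path.unlink(missing_ok=True)
    return output_path


def append_jsonl(path: Path, record: dict) -> None:
    """Append one record, the way an incremental runner writes results."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
```

Unlinking rather than opening with mode "w" keeps the runner's append-mode writes unchanged while still guaranteeing that a later metrics pass reads only the current run's rows.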

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 63f09734-a3f0-434c-bf7b-a95aa3330bf6

📥 Commits

Reviewing files that changed from the base of the PR and between 52c921b and 23cb6d4.

📒 Files selected for processing (3)
  • tests/test_baseline.py
  • vera_bench/baseline_runner.py
  • vera_bench/cli.py

- Include tests_total in error-path ProblemResults (timeout, non-zero
  exit, bad JSON) so failed runs retain the actual test count
- Use TemporaryDirectory context manager instead of mkdtemp to prevent
  temp dir leaks on errors
- Truncate stale JSONL before appending (both run and baselines commands)
- Add end-to-end CLI test using CliRunner that verifies baselines
  command produces output file

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@vera_bench/baseline_runner.py`:
- Around line 188-193: run_all_baselines currently accepts a language parameter
but always calls run_python_baseline; either remove the unused language
parameter or implement dispatching: inside run_all_baselines check the language
value and call run_python_baseline for "python" and raise NotImplementedError
(or add a TODO) for unsupported languages like "typescript"; update the function
signature and any callers if you remove the parameter, or add the dispatch logic
and a clear error for unsupported languages to avoid silently ignoring the
parameter.
- Around line 21-30: The _find_baseline_file function currently collapses both
zero and multiple glob matches into None; change it to return None only when no
matches are found and raise a clear ValueError when multiple matches exist to
surface naming conflicts. Specifically, in _find_baseline_file inspect the
matches list: if len(matches) == 0 return None; if len(matches) == 1 return
matches[0]; if len(matches) > 1 raise a ValueError that includes the prefix,
lang_dir and the list of matching paths (the variables matches, prefix, and
lang_dir) so callers and logs can debug the conflict.
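
Both findings above amount to replacing a silent fallback with an explicit error. A rough sketch, with assumptions flagged: the glob pattern `f"{prefix}*.py"` and the standalone `check_language` helper are guesses for illustration, and the real _find_baseline_file and run_all_baselines may be structured differently.

```python
from pathlib import Path
from typing import Optional


def find_baseline_file(lang_dir: Path, prefix: str) -> Optional[Path]:
    """Return the single match for prefix, None if absent, error if ambiguous."""
    # Assumed pattern; sorted() makes the error message deterministic.
    matches = sorted(lang_dir.glob(f"{prefix}*.py"))
    if not matches:
        return None
    if len(matches) > 1:
        names = ", ".join(m.name for m in matches)
        raise ValueError(
            f"Multiple baselines match prefix {prefix!r} in {lang_dir}: {names}"
        )
    return matches[0]


def check_language(language: str) -> None:
    """Fail loudly instead of silently running Python for another language."""
    if language != "python":
        raise NotImplementedError(f"Baselines for {language!r} are not implemented")
```

Returning None only for the zero-match case keeps "no baseline exists" distinguishable from "the naming is ambiguous", which would otherwise both show up as a missing baseline in the results.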

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0b4de718-3329-4c4e-a7cc-53cb8a1e0ec1

📥 Commits

Reviewing files that changed from the base of the PR and between 23cb6d4 and e25f262.

📒 Files selected for processing (2)
  • README.md
  • vera_bench/baseline_runner.py

- _find_baseline_file: raise ValueError on multiple glob matches
  instead of silently returning None
- run_all_baselines: raise NotImplementedError for non-Python languages

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2 participants