Implement baseline runner (Phase 3) #8
Add Python baseline execution for cross-language comparison:

baseline_runner.py:
- Subprocess-based execution with generated wrapper scripts
- Each problem gets an isolated wrapper that imports the entry_point, runs test cases, and prints JSON results
- Handles timeouts, missing files, and execution errors
- Incremental JSONL output (same format as LLM runner)

cli.py:
- Add 'vera-bench baselines' command
- Runs all 24 testable problems (those with test_cases)
- Skips 26 problems with empty test_cases (Tier 2/3 ADT/string)
- Outputs to results/python-baseline.jsonl

Tests: 11 new tests covering file lookup, wrapper generation, actual execution (Tier 1 + Tier 4), error handling, JSONL serialization, and CLI command registration. 296 total tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
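The wrapper approach described in this commit message can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the PR: `build_wrapper`, `run_wrapper`, the result dictionary keys, and the `{"args": ..., "expected": ...}` test-case shape are all hypothetical, and the sketch assumes the solution file's stem is a valid Python module name.

```python
import json
import os
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path


def build_wrapper(module_name: str, entry_point: str, test_cases: list) -> str:
    """Generate a script that imports the entry point, runs each case, prints JSON."""
    cases_literal = repr(json.dumps(test_cases))  # JSON text embedded as a Python string
    return textwrap.dedent(f"""\
        import json
        from {module_name} import {entry_point}

        results = []
        for case in json.loads({cases_literal}):
            try:
                passed = {entry_point}(*case["args"]) == case["expected"]
                results.append({{"passed": passed}})
            except Exception as exc:
                results.append({{"passed": False, "error": str(exc)}})
        print(json.dumps(results))
        """)


def run_wrapper(solution: Path, entry_point: str, test_cases: list,
                timeout_s: float = 10.0) -> dict:
    """Run one solution file against its test cases in a separate interpreter."""
    total = len(test_cases)
    with tempfile.TemporaryDirectory() as tmp:
        wrapper = Path(tmp) / "wrapper.py"
        wrapper.write_text(build_wrapper(solution.stem, entry_point, test_cases))
        # Make the solution importable from the wrapper's process.
        env = {**os.environ, "PYTHONPATH": str(solution.parent.resolve())}
        try:
            proc = subprocess.run([sys.executable, str(wrapper)], env=env,
                                  capture_output=True, text=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return {"ok": False, "tests_total": total, "error": "timeout"}
    # Every error branch keeps the real test count instead of defaulting to 0.
    if proc.returncode != 0:
        return {"ok": False, "tests_total": total, "error": proc.stderr.strip()}
    try:
        results = json.loads(proc.stdout)
    except json.JSONDecodeError:
        return {"ok": False, "tests_total": total, "error": "bad JSON from wrapper"}
    passed = sum(1 for r in results if r["passed"])
    return {"ok": passed == total, "tests_passed": passed, "tests_total": total}
```

Running the solution in a child interpreter keeps a crash or infinite loop in one baseline from taking down the whole run; the parent only ever parses one line of JSON from stdout.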
Caution: Review failed
Pull request was closed or merged during review.
📝 Walkthrough
Adds a Python-only baseline execution harness, CLI integration, and tests: discovers Python baseline files, generates temporary wrappers to run entry points in subprocesses, captures per-test JSON results with timeout and error handling, computes ProblemResult fields (including None for empty tests), and exposes a …
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Codecov Report
❌ Patch coverage is …
Additional details and impacted files
@@            Coverage Diff            @@
##             main      #8      +/-   ##
==========================================
+ Coverage   59.88%   66.06%   +6.17%
==========================================
  Files           9       10       +1
  Lines         713      825     +112
==========================================
+ Hits          427      545     +118
+ Misses        286      280       -6
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/test_baseline.py`:
- Around line 127-131: Replace the shallow registration check with an end-to-end
CLI invocation: use click.testing.CliRunner to invoke the Click group `main`'s
"baselines" command (or add a new test) and pass a pytest `tmp_path` as the
output directory, then assert the runner.exit_code is 0 and that the expected
output files or artifacts are actually created under `tmp_path`; target the
`main` Click group and the `baselines` command so you exercise the command path
rather than only checking `main.commands`.
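An end-to-end test of the kind requested here might look like the sketch below. The `main` group and `baselines` command are hypothetical stand-ins defined inline so the example runs on its own, and the `--output-dir` option name is an assumption; the real test would import `main` from `vera_bench.cli` and use whatever option the command actually defines.

```python
from pathlib import Path

import click
from click.testing import CliRunner


# Hypothetical stand-in for the project's Click group, so the sketch is self-contained.
@click.group()
def main():
    pass


@main.command()
@click.option("--output-dir", type=click.Path(), default="results")  # assumed option name
def baselines(output_dir):
    """Pretend to run baselines and write a one-row JSONL file."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "python-baseline.jsonl").write_text('{"problem_id": "demo"}\n')


def test_baselines_end_to_end(tmp_path):
    # Invoke the command through the group, as a user would from the shell.
    runner = CliRunner()
    result = runner.invoke(main, ["baselines", "--output-dir", str(tmp_path)])

    # A registration-only check would pass even if the command crashed here.
    assert result.exit_code == 0, result.output
    assert (tmp_path / "python-baseline.jsonl").exists()
```

Passing `result.output` as the assertion message makes a failing run print the command's own output, which is usually the quickest way to see why the exit code was non-zero.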
In `@vera_bench/baseline_runner.py`:
- Around line 127-168: The ProblemResult returns for the timeout, non-zero-exit,
and bad-JSON branches currently omit the real test count and default to
tests_total=0; update each of those ProblemResult constructors (the one returned
on timeout, the one when result.returncode != 0, and the one in the
json.JSONDecodeError except block) to include tests_total=tests_total so failed
runs retain the actual test count, keeping other fields (problem_id,
model="baseline", language="python", attempt, check_pass, run_correct,
error_message, wall_time_s, timestamp=_now()) unchanged.
- Around line 194-215: The temp directory created by tempfile.mkdtemp() stored
in work_dir leaks on errors; change to use tempfile.TemporaryDirectory() as a
context manager and set work_dir = Path(tmpdir.name) (or Path(tmpdir) depending
on API) so the directory is automatically removed on exit; wrap the block that
uses work_dir (the with Progress(...) loop that calls run_python_baseline)
inside the TemporaryDirectory() context and import tempfile.TemporaryDirectory
if necessary to ensure cleanup on both success and failure.
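The temp-directory change can be illustrated with a short sketch. `run_in_scratch_dir` is a hypothetical helper, not a function from the PR. Note that when `tempfile.TemporaryDirectory()` is used in a `with` statement, the bound name is the directory path as a `str`, so `Path(tmpdir)` is the form to use there.

```python
import tempfile
from pathlib import Path


def run_in_scratch_dir(work):
    """Call work(work_dir) inside a temp dir that is removed on success or failure."""
    with tempfile.TemporaryDirectory() as tmpdir:
        # `tmpdir` is a str path here, not the TemporaryDirectory object.
        work_dir = Path(tmpdir)
        return work(work_dir)
    # On exit from the with-block the directory and its contents are deleted,
    # even if work() raised.
```

With `tempfile.mkdtemp()` the caller owns cleanup, so any exception between creation and an explicit `shutil.rmtree` leaves the directory behind; the context manager removes that failure mode.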
In `@vera_bench/cli.py`:
- Around line 223-238: The output JSONL file at output_path must be truncated
before appending new results so stale rows don't persist; before calling
run_all_baselines (which opens the file in append mode), clear or recreate
output_path (e.g., open output_path for write/truncate or unlink+touch) using
the existing output_path variable (and keep solutions_dir and problems as-is),
so subsequent compute_metrics()/report reads only the fresh run's records;
ensure this truncation happens after output_dir.mkdir(...) and before the
run_all_baselines(...) call.
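A minimal sketch of the truncate-then-append pattern follows. The helper names `reset_output` and `append_result` are hypothetical; the PR may simply open the file for writing inline before calling `run_all_baselines`.

```python
import json
from pathlib import Path


def reset_output(output_path: Path) -> None:
    """Create the output file, or empty it if a previous run left rows behind."""
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text("")  # opening for write truncates existing content


def append_result(output_path: Path, record: dict) -> None:
    """Append one result as a single JSON line (incremental JSONL output)."""
    with output_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
```

Calling `reset_output` once per run and `append_result` once per problem keeps the incremental behaviour (partial results survive a crash) while ensuring a report only ever reads rows from the latest run.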
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 63f09734-a3f0-434c-bf7b-a95aa3330bf6
📒 Files selected for processing (3)
- tests/test_baseline.py
- vera_bench/baseline_runner.py
- vera_bench/cli.py
- Include tests_total in error-path ProblemResults (timeout, non-zero exit, bad JSON) so failed runs retain the actual test count
- Use TemporaryDirectory context manager instead of mkdtemp to prevent temp dir leaks on errors
- Truncate stale JSONL before appending (both run and baselines commands)
- Add end-to-end CLI test using CliRunner that verifies baselines command produces output file

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@vera_bench/baseline_runner.py`:
- Around line 188-193: run_all_baselines currently accepts a language parameter
but always calls run_python_baseline; either remove the unused language
parameter or implement dispatching: inside run_all_baselines check the language
value and call run_python_baseline for "python" and raise NotImplementedError
(or add a TODO) for unsupported languages like "typescript"; update the function
signature and any callers if you remove the parameter, or add the dispatch logic
and a clear error for unsupported languages to avoid silently ignoring the
parameter.
- Around line 21-30: The _find_baseline_file function currently collapses both
zero and multiple glob matches into None; change it to return None only when no
matches are found and raise a clear ValueError when multiple matches exist to
surface naming conflicts. Specifically, in _find_baseline_file inspect the
matches list: if len(matches) == 0 return None; if len(matches) == 1 return
matches[0]; if len(matches) > 1 raise a ValueError that includes the prefix,
lang_dir and the list of matching paths (the variables matches, prefix, and
lang_dir) so callers and logs can debug the conflict.
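The three-way lookup described in the second comment could be written along these lines. This is a sketch only: the public name `find_baseline_file`, the argument order, and the `{prefix}*.py` glob pattern are assumptions, since the review does not show the original function body.

```python
from pathlib import Path
from typing import Optional


def find_baseline_file(lang_dir: Path, prefix: str) -> Optional[Path]:
    """Return the single baseline file for a problem, or None if there is none."""
    # Assumed naming scheme: files start with the problem prefix and end in .py.
    matches = sorted(lang_dir.glob(f"{prefix}*.py"))
    if len(matches) == 0:
        return None
    if len(matches) > 1:
        # Surface naming conflicts instead of treating them as "not found".
        names = ", ".join(str(m) for m in matches)
        raise ValueError(
            f"Multiple baseline files match prefix {prefix!r} in {lang_dir}: {names}"
        )
    return matches[0]
```

Keeping `None` for the zero-match case lets callers skip problems without a baseline, while the `ValueError` makes an ambiguous directory fail loudly with enough detail to fix it.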
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 0b4de718-3329-4c4e-a7cc-53cb8a1e0ec1
📒 Files selected for processing (2)
- README.md
- vera_bench/baseline_runner.py
- _find_baseline_file: raise ValueError on multiple glob matches instead of silently returning None
- run_all_baselines: raise NotImplementedError for non-Python languages

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
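One way to make the `language` parameter meaningful is a small dispatch table, sketched below. The names `run_baseline` and `_RUNNERS` are hypothetical, and `run_python_baseline` is a placeholder with a simplified signature; the PR may use a plain `if language != "python"` check instead.

```python
from typing import Callable, Dict


def run_python_baseline(problem_id: str) -> str:
    """Placeholder for the real Python runner; returns a marker string."""
    return f"ran {problem_id} with python"


# Supported languages map to their runner; anything else is rejected explicitly.
_RUNNERS: Dict[str, Callable[[str], str]] = {"python": run_python_baseline}


def run_baseline(problem_id: str, language: str = "python") -> str:
    try:
        runner = _RUNNERS[language]
    except KeyError:
        raise NotImplementedError(
            f"Baseline language {language!r} is not supported yet"
        ) from None
    return runner(problem_id)
```

Adding a second language later would then be a one-line change to the table, and a caller who passes `"typescript"` today gets an error rather than silently receiving Python results.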
Summary
Add Python baseline execution for cross-language comparison against Vera LLM results.
`vera-bench baselines` command that runs all 24 testable problems and outputs `results/python-baseline.jsonl`.
Usage
Design decisions
… `if __name__` blocks with hardcoded assertions, not a test harness interface. The wrapper dynamically generates import + call + JSON output from the problem's test_cases.
Test plan
- `vera-bench baselines` produces python-baseline.jsonl
- `vera-bench report results/` shows both Vera and Python results

Generated with Claude Code
Summary by CodeRabbit