Add --language python for cross-language LLM comparison #11
Extend vera-bench run to generate Python code from the same problems:
prompts.py:
- build_python_prompt(): minimal prompt (no SKILL.md, no contracts)
asking the model to write a Python function from the NL description
runner.py:
- _evaluate_python_code(): subprocess-based execution with test wrapper
(same pattern as baseline_runner but for LLM-generated code)
- extract_code(): renamed from extract_vera_code, now matches python/py
fence tags too (backward-compatible alias kept)
- run_single_problem(): accepts language param, routes to vera or python
evaluation path. Python skips fix attempts (no check step).
- run_benchmark(): threads language through to run_single_problem
cli.py:
- --language flag on run command (vera or python)
- Python mode skips SKILL.md loading and vera runner creation
- Output file includes language suffix: {model}-python.jsonl
Usage:
vera-bench run --model claude-sonnet-4-20250514 --language python
vera-bench report results/ # shows Vera, Python LLM, and baselines
Tests: 10 new tests covering python prompt, code extraction, python
evaluation (correct/wrong/empty), and full pipeline with mock client.
308 total tests passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
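To illustrate the fence handling described in the commit message, here is a minimal sketch of an extractor that accepts `vera`, `python`, and `py` fence tags and keeps the old name as an alias. The regular expression, the handling of untagged fences, and the fallback of returning the stripped response when no fence is found are all assumptions for illustration, not the actual `runner.py` implementation.

```python
import re

# Assumed fence pattern: an optional vera/python/py tag, a newline, then the body.
_FENCE_RE = re.compile(
    r"```(?:vera|python|py)?[ \t]*\n(.*?)```",
    re.DOTALL | re.IGNORECASE,
)


def extract_code(response: str) -> str:
    """Return the body of the first matching fenced block, or the stripped text."""
    match = _FENCE_RE.search(response)
    if match:
        return match.group(1).strip()
    # Assumption: with no fence present, treat the whole response as code.
    return response.strip()


# Backward-compatible alias for callers that still use the old name.
extract_vera_code = extract_code
```

Keeping the alias as a plain name binding means existing imports of `extract_vera_code` keep working without a wrapper function.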
📝 Walkthrough
Adds first-class Python execution to the harness: CLI language flag, Python prompt builder, broadened fenced-code extraction, subprocess-based Python evaluation with generated test wrappers, language-aware runner flow/signatures, tests for Python paths, and supporting docs (CHANGELOG, CONTRIBUTING, CODE_OF_CONDUCT, SECURITY, README).
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #11 +/- ##
==========================================
- Coverage 66.14% 65.41% -0.73%
==========================================
Files 10 10
Lines 827 905 +78
==========================================
+ Hits 547 592 +45
- Misses 280 313 +33
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
vera_bench/cli.py (1)
47-57: ⚠️ Potential issue | 🟡 Minor
Reject Vera-specific flags in Python mode instead of silently ignoring them.
When `language == "python"`, both `--mode` and `--skill-md` become no-ops, yet the CLI still accepts them and even prints the selected mode. That makes runs easy to mislabel and hard to reproduce. Either error on those combinations or emit an explicit warning.
Suggested change:
     def run(
         model: str,
         tier: int | None,
         problem: str | None,
         language: str,
         mode: str,
         skill_md: Path | None,
         output_dir: Path | None,
         max_tokens: int,
         keep_temps: bool,
     ):
         """Run benchmark against an LLM model."""
         from vera_bench.metrics import compute_metrics
         from vera_bench.models import create_client
         from vera_bench.prompts import load_skill_md
         from vera_bench.runner import run_benchmark
         from vera_bench.vera_runner import VeraRunner

         root = _repo_root()
    +
    +    if language == "python":
    +        if skill_md is not None:
    +            raise click.UsageError("--skill-md is only supported with --language vera")
    +        if mode != "full-spec":
    +            raise click.UsageError("--mode has no effect with --language python")
Also applies to: 118-123, 140-143
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@vera_bench/prompts.py`:
- Around line 74-87: The Python prompt built by build_python_prompt currently
only includes problem['description'] and the entry_point name; update it to
include the canonical function signature and other required problem metadata
(SKILL.md content and any contracts) so the model is given description, the full
function signature (e.g., problem['signature'] or
problem['canonical_signature']), and optional contracts; modify the user_msg
construction in build_python_prompt to embed the canonical signature right after
the description and include SKILL.md content and contracts if present, and
ensure the system prompt (PYTHON_SYSTEM_PROMPT) remains used; also ensure the
prompt for fixes (if present elsewhere) includes the vera error message from
problem['last_error'] or similar.
In `@vera_bench/runner.py`:
- Around line 162-174: The result dictionary currently initializes "check_pass":
True prematurely; change the initializer in the block that creates result (the
variable result) to set "check_pass" to None or False, then only set
result["check_pass"]=True after the actual import/compile/check step completes
successfully (the function or code path that performs the Python import/check of
the written file). Also ensure the early return for empty test_cases uses the
actual check outcome (not the default) by running the import/check step before
returning or by explicitly setting result["check_pass"] based on that step;
apply the same fix to the analogous initialization/path referenced around lines
231-248 so no malformed generation is reported as a successful check.
- Around line 217-224: The subprocess.run call that executes the generated
Python (the invocation building proc with [sys.executable, str(wrapper_path)])
should be confined: add cwd=work_dir to restrict filesystem access and pass a
sanitized env (e.g., create env = os.environ.copy() and remove sensitive keys or
use a minimal env = {}) into subprocess.run to avoid leaking credentials; update
the call site where proc is created to include cwd and env arguments and ensure
wrapper_path and work_dir variables are used to locate the working directory and
sanitized environment.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 03b94cba-1ab0-4bbe-bddf-4422039b31eb
📒 Files selected for processing (4)
tests/test_runner.py
vera_bench/cli.py
vera_bench/prompts.py
vera_bench/runner.py
Adapted from the vera compiler repo templates for vera-bench context.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… accuracy
- Subprocess runs with cwd=work_dir and API keys stripped from env
- SyntaxError/ImportError in generated Python now sets check_pass=False (analogous to vera check failure)
- CLI warns when Vera-specific flags (--mode, --skill-md) are used with --language python
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@coderabbitai Re: outside-diff comment on cli.py lines 47-57 (reject Vera-specific flags in Python mode) — fixed in a052bde. The CLI now warns when |
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
vera_bench/cli.py (1)
169-172: ⚠️ Potential issue | 🟠 Major
Don't report Vera-only metrics for Python runs.
`compute_metrics()` is still Vera-centric: any passing Python attempt becomes verify-eligible, and any failing one becomes fix-eligible. That means `vera-bench run --language python` will show `verify@1 = 0%` and can show `fix@1 = 0%` even though Python has neither stage, which makes both the CLI summary here and the markdown report misleading.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@vera_bench/cli.py` around lines 169 - 172, Summary: Vera-only metrics (verify/fix) are being reported for Python runs; detect Python runs and strip those metrics before printing. Locate where compute_metrics(...) and _print_metrics(model, metrics) are called and after computing metrics from results, check model.language (or equivalent flag indicating the run language) and if it equals "python" remove Vera-only metric entries (e.g., keys like "verify@1", "fix@1", "verify", "fix" or any verify/fix aggregates) from the metrics dict before calling _print_metrics; alternatively, adapt compute_metrics to accept a flag (e.g., include_vera_metrics=False) and pass it for Python runs so verify/fix stats are not produced. Ensure only non-Vera metrics are passed to _print_metrics for Python runs.
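One way to read the first option in the prompt above is a small filter applied to the metrics dict before printing. The sketch below assumes metric keys begin with `verify` or `fix`; the real keys produced by `compute_metrics()` may differ, and `metrics_for_language` is a hypothetical helper name.

```python
# Assumed metric-name prefixes; the real keys from compute_metrics() may differ.
VERA_ONLY_PREFIXES = ("verify", "fix")


def metrics_for_language(metrics: dict, language: str) -> dict:
    """Drop verify/fix entries when the run language has no such stages."""
    if language != "python":
        return dict(metrics)
    return {
        name: value
        for name, value in metrics.items()
        if not name.startswith(VERA_ONLY_PREFIXES)
    }
```

Filtering at print time leaves the stored JSONL results untouched, so reports generated later can still decide for themselves which columns to show.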
♻️ Duplicate comments (1)
vera_bench/runner.py (1)
163-175: ⚠️ Potential issue | 🟠 Major
Drive `check_pass` from a real import step.
Line 174 still returns `check_pass=True` before the module is even written or imported, and Lines 237-239 only downgrade failures that mention `SyntaxError` or `ImportError`. Top-level crashes such as `ModuleNotFoundError`, `ZeroDivisionError`, or any other import-time exception will still inflate `check@1` for Python runs.
Based on learnings "The runner.py module must implement the complete pipeline: generate (LLM call) → write (file) → check → verify → run → fix attempt, with individual step failures recorded in JSONL metrics".
Also applies to: 235-243
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@vera_bench/runner.py` around lines 163 - 175, The result dict currently sets "check_pass": True before the module is written/imported, which inflates metrics; change the initial "check_pass" to False and only set it to True after the actual import/check step completes successfully (the code paths around the import/write/check logic referenced near the result initialization and the import exception handling around lines ~235-243). Update the import/check exception handling (the block that currently only downgrades on SyntaxError or ImportError) to catch any exception raised during import (e.g., ModuleNotFoundError, ZeroDivisionError, etc.), set result["check_pass"]=False, populate result["error_message"] with the exception details, and only leave check_pass True when the import succeeds; ensure the early return when not test_cases does not mistakenly preserve a true check_pass by returning the result after check_pass has been correctly set.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@CHANGELOG.md`:
- Line 10: The headings like "### Added" (and the other headings at the same
locations) are immediately followed by list content which triggers markdownlint
MD022; insert a single blank line after each affected heading (e.g., after the
"### Added" heading and the headings at the other noted locations) so each
heading is followed by an empty line before the list content to satisfy the
rule.
In `@SECURITY.md`:
- Around line 34-37: Update the subprocess policy sentence to reflect both
trusted-command exceptions: change the wording that currently limits the S603
suppression to the `vera` binary found via `shutil.which()` so it also mentions
that `vera_bench/runner.py` suppresses S603 for uses of `sys.executable` in the
Python evaluation path; explicitly reference the `S603` suppression and the
trusted commands (`shutil.which()`-located `vera` and `sys.executable`) so the
policy text matches the code.
In `@vera_bench/runner.py`:
- Around line 217-228: The current subprocess.run invocation
(subprocess.run([...], cwd=work_dir, env=run_env, ...)) still runs untrusted
model output with host privileges; change this by requiring an explicit opt-in
flag (e.g., allow_untrusted_execution) before executing wrapper_path and, when
not opted-in, replace the call with a safe failure or a dry-run; additionally
enforce a strict environment allowlist (build run_env from a whitelist instead
of filtering *_API_KEY), and document/require execution inside a proper sandbox
(container/VM/low-privilege user) for any opt-in path; update checks around
run_env, wrapper_path and work_dir to gate execution and to log a clear warning
when unsafe execution is refused.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: f1881646-cc30-478e-9a8e-a502e9fd7f09
📒 Files selected for processing (7)
CHANGELOG.md
CODE_OF_CONDUCT.md
CONTRIBUTING.md
README.md
SECURITY.md
vera_bench/cli.py
vera_bench/runner.py
|
SKILL.md loading and 🧠 Learnings used |
- CHANGELOG.md: add blank lines after headings (MD022)
- SECURITY.md: document S603 suppression for both vera and sys.executable
- runner.py: widen check_pass=False to catch NameError, ModuleNotFoundError, IndentationError, TabError (not just SyntaxError/ImportError)
- cli.py: skip verify@1 and fix@1 rows in metrics table for Python runs (these Vera-only metrics are meaningless for Python)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@coderabbitai Re: outside-diff comment on cli.py lines 169-172 (Vera-only metrics for Python runs) — fixed in 2e2ea53. |
|
One minor observation outside this fix's scope: the 🧠 Learnings used |
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
vera_bench/cli.py (1)
273-275: 🧹 Nitpick | 🔵 Trivial
Consider passing `language="python"` to `_print_metrics` for baselines.
The `baselines` command runs Python solutions, but `_print_metrics` is called without the `language` parameter, so it defaults to `"vera"` and displays `verify@1`/`fix@1` rows that aren't meaningful for Python baselines.
♻️ Suggested fix:
     if results:
         metrics = compute_metrics([json.loads(r.to_jsonl()) for r in results])
    -    _print_metrics(f"{language}-baseline", metrics)
    +    _print_metrics(f"{language}-baseline", metrics, language=language)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@vera_bench/cli.py` around lines 273 - 275, The baseline metrics call is using _print_metrics without specifying language so it defaults to "vera" and shows inappropriate verify/fix rows; change the call in the baselines path so that _print_metrics is invoked with the language parameter set to "python" (e.g., _print_metrics(f"{language}-baseline", metrics, language="python")) after computing metrics with compute_metrics([json.loads(r.to_jsonl()) for r in results]) so the printed table uses Python-specific metric columns.
♻️ Duplicate comments (1)
vera_bench/runner.py (1)
163-175: ⚠️ Potential issue | 🟠 Major
Early return for empty `test_cases` bypasses syntax validation.
When `test_cases` is empty (common for String/Array/ADT-returning problems), the function returns `check_pass=True` without ever writing or parsing the generated code. This means malformed Python generations are recorded as successful checks for these problems.
A lightweight `ast.parse(code)` guard before the early return would keep the `check_pass` metric honest:
🛡️ Proposed fix:
    +import ast
    +
     def _evaluate_python_code(
         code: str,
         problem: dict,
         work_dir: Path,
         attempt: int,
     ) -> dict:
         """Write Python code to a file and run test cases via subprocess."""
         entry_point = problem.get("entry_point", "")
         test_cases = problem.get("test_cases", [])

         result: dict = {
    -        "check_pass": True,
    +        "check_pass": False,
             "verify_pass": None,
             "verify_tier1": 0,
             "verify_tier3": 0,
             "run_correct": None,
             "tests_total": 0,
             "tests_passed": 0,
             "error_message": None,
         }

    +    # Syntax-check even when no test_cases exist
    +    try:
    +        ast.parse(code)
    +    except SyntaxError as exc:
    +        result["error_message"] = f"SyntaxError: {exc}"
    +        return result
    +    result["check_pass"] = True
    +
         if not test_cases:
             return result
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@vera_bench/runner.py` around lines 163 - 175, Before returning early when test_cases is empty, run a lightweight syntax check with ast.parse on the generated code: wrap ast.parse(code) in a try/except, and if it raises SyntaxError/Exception set result["check_pass"] = False and result["error_message"] = str(exception) (or keep True/no error if parse succeeds) before returning result; update the block that currently does `if not test_cases: return result` so it first validates the code via ast.parse and then returns the possibly-updated result dict.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@vera_bench/runner.py`:
- Around line 316-319: The code calls _evaluate_code(code, problem, vera,
work_dir, attempt=1) when language != "python" but the parameter vera is typed
as VeraRunner | None; add a defensive check at the start of the function
handling these lines: if language == "vera" (or whenever non-"python" paths
expect vera) assert or raise a clear exception (e.g., raise ValueError or
TypeError) if vera is None, or else only call _evaluate_code when vera is not
None; update the block around the language variable and the call sites to use
this guard so _evaluate_code never receives a None vera.
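The guard described above can be sketched as a small standalone check. The function name `ensure_runner_available` and the error wording are hypothetical; the runner argument is typed loosely as `object` here because `VeraRunner` is not importable in a standalone snippet.

```python
from typing import Optional


def ensure_runner_available(vera: Optional[object], language: str) -> None:
    """Fail fast if a non-Python run was started without a Vera runner."""
    if language != "python" and vera is None:
        raise ValueError(
            f"language={language!r} requires a VeraRunner instance, got None"
        )
```

Calling this at the top of the per-problem function turns a late `AttributeError` on `None` into an immediate, readable `ValueError`.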
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 5e7a2c69-e35e-49d9-a7cd-951b612c057e
📒 Files selected for processing (4)
CHANGELOG.md
SECURITY.md
vera_bench/cli.py
vera_bench/runner.py
- runner.py: raise ValueError if vera is None when language='vera'
- cli.py: pass language to _print_metrics in baselines command so verify/fix rows are hidden for Python baselines
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@coderabbitai Re: outside-diff comment on cli.py lines 273-275 (baselines metrics showing verify/fix) — fixed in 2d46028. Baselines command now passes language param to _print_metrics. |
|
_print_metrics(f"{language}-baseline", metrics, language=language)Since the 🧠 Learnings used |
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
♻️ Duplicate comments (1)
vera_bench/runner.py (1)
163-180: ⚠️ Potential issue | 🟠 Major
Still open: Python `check_pass` is set before the code is checked.
Line 174 can return `check_pass=True` before the file is even written, and Lines 238-246 only treat a short stderr allow-list as a check failure. On problems with intentionally empty `test_cases`, or on modules that blow up at import time with something like `ZeroDivisionError`, malformed Python still counts towards `check@1`. Please run a dedicated parse/import step first and let that single outcome drive `check_pass` before any early return or test execution.
Based on learnings, empty `test_cases` are expected for String/Array/ADT problems, and `runner.py` must record each pipeline step failure in JSONL metrics.
Also applies to: 235-247
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@vera_bench/runner.py` around lines 163 - 180, The current flow sets result["check_pass"] before validating the generated Python (leading to malformed modules or import-time errors being counted as passes); change runner.py to first write the code (use safe_id and code_path as currently constructed), then run a dedicated parse/import validation step (e.g., attempt to compile() or import the written module in a sandboxed subprocess) and set result["check_pass"] strictly from that outcome; only after that, if check_pass is True, proceed to early return on empty test_cases or to run tests, and ensure any parse/import failures are recorded in result["error_message"] and appropriate pipeline flags (same fix applied to the other block around check handling at the later 235-247 region).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 2d0fda06-98f8-4ccf-9779-4df2b06f1953
📒 Files selected for processing (2)
vera_bench/cli.py
vera_bench/runner.py
|
@coderabbitai Python check_pass is set before the code is checked. The concern is check_pass=True before validation, but:
Nothing to do here. |
|
Nothing to do here — noted for future reviews. ✏️ Learnings added
🧠 Learnings used |
Summary
Extend `vera-bench run` to ask the LLM to write Python for the same problems, enabling direct Vera vs Python comparison.
Usage
Design
`python check` step, so the retry-with-error-feedback loop only applies to Vera.
Key changes
- `prompts.py`: `build_python_prompt()` - system prompt for Python, NL description + entry_point name
- `runner.py`: `_evaluate_python_code()` - subprocess execution + test comparison; `extract_code()` now handles python/py fence tags; `run_single_problem()` and `run_benchmark()` accept `language` param
- `cli.py`: `--language` flag on run command; Python mode skips SKILL.md and vera runner
Test plan
- `vera-bench run --model claude-sonnet-4-20250514 --language python --problem VB-T1-001` (requires API key)
Generated with Claude Code
Summary by CodeRabbit
New Features
Tests
Documentation