Add --language python for cross-language LLM comparison #11

Merged
aallan merged 8 commits into main from feature/python-generation
Mar 30, 2026
@aallan aallan commented Mar 30, 2026

Owner

Summary

Extend vera-bench run to ask the LLM to write Python for the same problems, enabling direct Vera vs Python comparison.

Usage

# Vera (existing)
vera-bench run --model claude-sonnet-4-20250514

# Python (new)
vera-bench run --model claude-sonnet-4-20250514 --language python

# Three-way comparison report
vera-bench report results/

Design

  • Python prompt is deliberately minimal: no SKILL.md (Python is in training data), no contracts (Python has none). Just the NL description and function name. This is the fair comparison — Vera needs extra context because it is not in training data.
  • No fix attempts for Python: there is no python check step, so the retry-with-error-feedback loop only applies to Vera.
  • Execution reuses the subprocess wrapper pattern from baseline_runner: write generated code to temp file, build a test wrapper that imports and runs test cases, parse JSON results.
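The subprocess wrapper pattern in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the code in runner.py: the helper name `run_python_tests`, the module name `candidate`, and the test-case shape (`{"args": [...], "expected": ...}`) are assumptions made for the example. Only the overall flow (temp file, generated wrapper, JSON on stdout, a 30-second timeout) is taken from this PR.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def run_python_tests(code: str, entry_point: str, test_cases: list) -> dict:
    """Run generated code against test cases in a child interpreter.

    Hypothetical helper: the test-case shape {"args": [...], "expected": ...}
    is an assumption for this sketch, not the vera-bench problem schema.
    """
    with tempfile.TemporaryDirectory() as tmp:
        work_dir = Path(tmp)
        # 1. Write the generated code to a temporary module.
        (work_dir / "candidate.py").write_text(code, encoding="utf-8")

        # 2. Build a wrapper that imports the entry point, runs each case,
        #    and prints a single JSON object on stdout.
        wrapper = "\n".join(
            [
                "import json",
                f"from candidate import {entry_point} as fn",
                f"cases = json.loads({json.dumps(json.dumps(test_cases))})",
                "passed = 0",
                "for case in cases:",
                "    try:",
                "        if fn(*case['args']) == case['expected']:",
                "            passed += 1",
                "    except Exception:",
                "        pass  # a crashing case counts as a failure",
                "print(json.dumps({'total': len(cases), 'passed': passed}))",
            ]
        )
        wrapper_path = work_dir / "wrapper.py"
        wrapper_path.write_text(wrapper, encoding="utf-8")

        # 3. Execute with a timeout and parse the JSON result.
        try:
            proc = subprocess.run(
                [sys.executable, str(wrapper_path)],
                capture_output=True,
                text=True,
                timeout=30,
                cwd=work_dir,
            )
        except subprocess.TimeoutExpired:
            return {"total": len(test_cases), "passed": 0, "error": "timeout"}
        if proc.returncode != 0:
            # Import-time failures (e.g. a SyntaxError) surface on stderr.
            return {"total": len(test_cases), "passed": 0, "error": proc.stderr.strip()}
        return json.loads(proc.stdout)
```

Passing the cases as a JSON string literal keeps the wrapper free of any repr() quirks, at the cost of limiting test values to JSON-representable types.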

Key changes

  • prompts.py: build_python_prompt() — system prompt for Python, NL description + entry_point name
  • runner.py: _evaluate_python_code() — subprocess execution + test comparison; extract_code() now handles python/py fence tags; run_single_problem() and run_benchmark() accept language param
  • cli.py: --language flag on run command; Python mode skips SKILL.md and vera runner
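To illustrate the fence handling mentioned for extract_code(), here is a rough sketch of a regex that accepts vera, python, py, and bare fences. The exact pattern, and the choice to fall back to the whole response when no fence is found, are assumptions for illustration; the actual implementation in runner.py may differ.

```python
import re

# Built programmatically so this example can itself sit inside a code fence.
FENCE = "`" * 3

# Optional language tag (vera, python, py) or a bare fence, then the body.
_FENCE_RE = re.compile(
    FENCE + r"(?:vera|python|py)?[ \t]*\r?\n(.*?)" + FENCE,
    re.DOTALL,
)


def extract_code(response: str) -> str:
    """Return the body of the first recognised fence.

    Assumption: with no fence present, the stripped response is returned as-is.
    """
    match = _FENCE_RE.search(response)
    if match:
        return match.group(1).strip()
    return response.strip()


# Backward-compatible alias for callers that still use the old name.
extract_vera_code = extract_code
```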

Test plan

  • 308 tests pass (298 existing + 10 new)
  • Ruff + security lint clean
  • vera-bench run --model claude-sonnet-4-20250514 --language python --problem VB-T1-001 (requires API key)

Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added a language option (Vera or Python) for runs; outputs, filenames and reported metrics adapt per language. Python runs use a dedicated Python prompt and are evaluated via a subprocess execution path; code fence parsing recognises Python fences while remaining backward-compatible.
  • Tests

    • New tests covering prompt construction, fenced-code extraction, Python execution/evaluation and end-to-end run behaviour.
  • Documentation

    • Added CHANGELOG, CONTRIBUTING, CODE_OF_CONDUCT, SECURITY and updated README for Python benchmarking.

Extend vera-bench run to generate Python code from the same problems:

prompts.py:
- build_python_prompt(): minimal prompt (no SKILL.md, no contracts)
  asking the model to write a Python function from the NL description

runner.py:
- _evaluate_python_code(): subprocess-based execution with test wrapper
  (same pattern as baseline_runner but for LLM-generated code)
- extract_code(): renamed from extract_vera_code, now matches python/py
  fence tags too (backward-compatible alias kept)
- run_single_problem(): accepts language param, routes to vera or python
  evaluation path. Python skips fix attempts (no check step).
- run_benchmark(): threads language through to run_single_problem

cli.py:
- --language flag on run command (vera or python)
- Python mode skips SKILL.md loading and vera runner creation
- Output file includes language suffix: {model}-python.jsonl

Usage:
  vera-bench run --model claude-sonnet-4-20250514 --language python
  vera-bench report results/  # shows Vera, Python LLM, and baselines

Tests: 10 new tests covering python prompt, code extraction, python
evaluation (correct/wrong/empty), and full pipeline with mock client.
308 total tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Mar 30, 2026


Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9fbea7db-4ab2-490b-9a64-fc36c8bb2bc7

📥 Commits

Reviewing files that changed from the base of the PR and between 2d46028 and c1d1fd7.

📒 Files selected for processing (1)
  • README.md

📝 Walkthrough

Walkthrough

Adds first-class Python execution to the harness: CLI language flag, Python prompt builder, broadened fenced-code extraction, subprocess-based Python evaluation with generated test wrappers, language-aware runner flow/signatures, tests for Python paths, and supporting docs (CHANGELOG, CONTRIBUTING, CODE_OF_CONDUCT, SECURITY, README).

Changes

| Cohort / File(s) | Summary |
|---|---|
| Tests: tests/test_runner.py | Adds tests for build_python_prompt, extract_code (recognises python/py fences and preserves alias), _evaluate_python_code (pass/fail/None outcomes), and run_single_problem with language="python" (result language, flags, and no fix/retry attempts). |
| CLI: vera_bench/cli.py | Adds --language (vera |
| Prompts: vera_bench/prompts.py | Adds PYTHON_SYSTEM_PROMPT and build_python_prompt(problem) which extracts entry_point and constructs a Python-focused user message (omits contract content from user prompt when provided). |
| Runner / Execution: vera_bench/runner.py | Replaces extract_vera_code with extract_code (alias preserved), expands fence regex to include vera/python/py/bare; adds _evaluate_python_code that writes candidate .py, generates and runs a test wrapper via subprocess.run (30s timeout), parses JSON stdout for test results, and sets tests_total/tests_passed/run_correct/check_pass/error_message. Updates run_single_problem/run_benchmark signatures to accept `vera: VeraRunner |
| Docs & Governance: CHANGELOG.md, CONTRIBUTING.md, CODE_OF_CONDUCT.md, README.md, SECURITY.md | Adds CHANGELOG and project docs; README updated to show vera-bench run --language python and combined reporting guidance. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • aallan/vera-bench#3: Overlaps runner extractor refactor and signature/behavior changes for run_single_problem/run_benchmark.
  • aallan/vera-bench#8: Related CLI/run flow changes and Python subprocess execution paths.

Suggested labels

harness, docs

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 52.17% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarises the primary change: adding a --language python flag to enable cross-language LLM comparison between Vera and Python solutions. |


@codecov-commenter

codecov-commenter commented Mar 30, 2026


Codecov Report

❌ Patch coverage is 58.16327% with 41 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.41%. Comparing base (2ed883a) to head (c1d1fd7).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| vera_bench/runner.py | 69.56% | 21 Missing ⚠️ |
| vera_bench/cli.py | 16.66% | 20 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #11      +/-   ##
==========================================
- Coverage   66.14%   65.41%   -0.73%     
==========================================
  Files          10       10              
  Lines         827      905      +78     
==========================================
+ Hits          547      592      +45     
- Misses        280      313      +33     
Flag Coverage Δ
python 65.41% <58.16%> (-0.73%) ⬇️
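
The figures in this report are consistent with simple hits-over-lines ratios. As a quick check of the arithmetic (the patch size of 98 lines is not stated in the report; it is inferred here from 41 missing lines at 58.16327% coverage, so treat it as an assumption):

```python
def pct(hits: int, lines: int) -> float:
    """Coverage expressed as a percentage of covered lines."""
    return 100 * hits / lines


base = pct(547, 827)  # main: 547 hits out of 827 lines
head = pct(592, 905)  # PR head: 592 hits out of 905 lines
delta = round(head, 2) - round(base, 2)

# Patch coverage: 41 missing lines at 58.16327% implies 98 changed lines,
# of which 57 are covered. The 98 is inferred, not taken from the report.
patch = pct(98 - 41, 98)

print(f"base={base:.2f}% head={head:.2f}% delta={delta:+.2f}% patch={patch:.5f}%")
```

Project coverage drops even though hits increase, because the 78 added lines are covered at a lower rate than the existing code.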


aallan and others added 2 commits March 30, 2026 09:06
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
vera_bench/cli.py (1)

47-57: ⚠️ Potential issue | 🟡 Minor

Reject Vera-specific flags in Python mode instead of silently ignoring them.

When language == "python", both --mode and --skill-md become no-ops, yet the CLI still accepts them and even prints the selected mode. That makes runs easy to mislabel and hard to reproduce. Either error on those combinations or emit an explicit warning.

Suggested change
 def run(
     model: str,
     tier: int | None,
     problem: str | None,
     language: str,
     mode: str,
     skill_md: Path | None,
     output_dir: Path | None,
     max_tokens: int,
     keep_temps: bool,
 ):
     """Run benchmark against an LLM model."""
     from vera_bench.metrics import compute_metrics
     from vera_bench.models import create_client
     from vera_bench.prompts import load_skill_md
     from vera_bench.runner import run_benchmark
     from vera_bench.vera_runner import VeraRunner
 
     root = _repo_root()
+
+    if language == "python":
+        if skill_md is not None:
+            raise click.UsageError("--skill-md is only supported with --language vera")
+        if mode != "full-spec":
+            raise click.UsageError("--mode has no effect with --language python")

Also applies to: 118-123, 140-143

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@vera_bench/prompts.py`:
- Around line 74-87: The Python prompt built by build_python_prompt currently
only includes problem['description'] and the entry_point name; update it to
include the canonical function signature and other required problem metadata
(SKILL.md content and any contracts) so the model is given description, the full
function signature (e.g., problem['signature'] or
problem['canonical_signature']), and optional contracts; modify the user_msg
construction in build_python_prompt to embed the canonical signature right after
the description and include SKILL.md content and contracts if present, and
ensure the system prompt (PYTHON_SYSTEM_PROMPT) remains used; also ensure the
prompt for fixes (if present elsewhere) includes the vera error message from
problem['last_error'] or similar.

In `@vera_bench/runner.py`:
- Around line 162-174: The result dictionary currently initializes "check_pass":
True prematurely; change the initializer in the block that creates result (the
variable result) to set "check_pass" to None or False, then only set
result["check_pass"]=True after the actual import/compile/check step completes
successfully (the function or code path that performs the Python import/check of
the written file). Also ensure the early return for empty test_cases uses the
actual check outcome (not the default) by running the import/check step before
returning or by explicitly setting result["check_pass"] based on that step;
apply the same fix to the analogous initialization/path referenced around lines
231-248 so no malformed generation is reported as a successful check.
- Around line 217-224: The subprocess.run call that executes the generated
Python (the invocation building proc with [sys.executable, str(wrapper_path)])
should be confined: add cwd=work_dir to restrict filesystem access and pass a
sanitized env (e.g., create env = os.environ.copy() and remove sensitive keys or
use a minimal env = {}) into subprocess.run to avoid leaking credentials; update
the call site where proc is created to include cwd and env arguments and ensure
wrapper_path and work_dir variables are used to locate the working directory and
sanitized environment.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 03b94cba-1ab0-4bbe-bddf-4422039b31eb

📥 Commits

Reviewing files that changed from the base of the PR and between 2ed883a and 40a5eed.

📒 Files selected for processing (4)
  • tests/test_runner.py
  • vera_bench/cli.py
  • vera_bench/prompts.py
  • vera_bench/runner.py

Comment thread vera_bench/prompts.py
Comment thread vera_bench/runner.py
Comment thread vera_bench/runner.py
aallan and others added 2 commits March 30, 2026 09:15
Adapted from the vera compiler repo templates for vera-bench context.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… accuracy

- Subprocess runs with cwd=work_dir and API keys stripped from env
- SyntaxError/ImportError in generated Python now sets check_pass=False
  (analogous to vera check failure)
- CLI warns when Vera-specific flags (--mode, --skill-md) are used
  with --language python

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
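
As a rough sketch of the first item in this commit (running the wrapper with cwd=work_dir and API keys stripped from the environment): the helper names below are hypothetical, and the rule "drop any variable whose name ends in _API_KEY" is an assumption based on the review discussion rather than a copy of the committed code.

```python
import os
import subprocess
import sys
from pathlib import Path


def sanitized_env(environ=None) -> dict:
    """Copy the environment, dropping variables that look like API keys.

    Assumption: "looks like an API key" means the name ends in _API_KEY.
    """
    source = os.environ if environ is None else environ
    return {k: v for k, v in source.items() if not k.upper().endswith("_API_KEY")}


def run_confined(wrapper_path: Path, work_dir: Path) -> subprocess.CompletedProcess:
    """Run the test wrapper from work_dir with the filtered environment."""
    return subprocess.run(
        [sys.executable, str(wrapper_path)],
        capture_output=True,
        text=True,
        timeout=30,
        cwd=work_dir,
        env=sanitized_env(),
    )
```

Note that cwd only sets the child's working directory. It does not sandbox the process, so the generated code can still read any path the user can.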
@aallan

aallan commented Mar 30, 2026

Owner Author

@coderabbitai Re: outside-diff comment on cli.py lines 47-57 (reject Vera-specific flags in Python mode) — fixed in a052bde. The CLI now warns when --mode or --skill-md are used with --language python.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
vera_bench/cli.py (1)

169-172: ⚠️ Potential issue | 🟠 Major

Don’t report Vera-only metrics for Python runs.

compute_metrics() is still Vera-centric: any passing Python attempt becomes verify-eligible, and any failing one becomes fix-eligible. That means vera-bench run --language python will show verify@1 = 0% and can show fix@1 = 0% even though Python has neither stage, which makes both the CLI summary here and the markdown report misleading.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vera_bench/cli.py` around lines 169 - 172, Summary: Vera-only metrics
(verify/fix) are being reported for Python runs; detect Python runs and strip
those metrics before printing. Locate where compute_metrics(...) and
_print_metrics(model, metrics) are called and after computing metrics from
results, check model.language (or equivalent flag indicating the run language)
and if it equals "python" remove Vera-only metric entries (e.g., keys like
"verify@1", "fix@1", "verify", "fix" or any verify/fix aggregates) from the
metrics dict before calling _print_metrics; alternatively, adapt compute_metrics
to accept a flag (e.g., include_vera_metrics=False) and pass it for Python runs
so verify/fix stats are not produced. Ensure only non-Vera metrics are passed to
_print_metrics for Python runs.
♻️ Duplicate comments (1)
vera_bench/runner.py (1)

163-175: ⚠️ Potential issue | 🟠 Major

Drive check_pass from a real import step.

Line 174 still returns check_pass=True before the module is even written or imported, and Lines 237-239 only downgrade failures that mention SyntaxError or ImportError. Top-level crashes such as ModuleNotFoundError, ZeroDivisionError, or any other import-time exception will still inflate check@1 for Python runs.

Based on learnings "The runner.py module must implement the complete pipeline: generate (LLM call) → write (file) → check → verify → run → fix attempt, with individual step failures recorded in JSONL metrics".

Also applies to: 235-243
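
One way to read this suggestion is sketched below: treat the import itself as the check step, and let any exception raised while importing mark the check as failed. The helper name `check_import` is hypothetical, and the import is done in-process here only to keep the example short; in the harness the same try/except would presumably live inside the subprocess wrapper so that generated code never runs in the harness process.

```python
import importlib.util
import tempfile
from pathlib import Path


def check_import(code: str) -> dict:
    """Start from check_pass=False and flip it only after a clean import."""
    result = {"check_pass": False, "error_message": None}
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "candidate.py"
        path.write_text(code, encoding="utf-8")
        spec = importlib.util.spec_from_file_location("candidate", path)
        module = importlib.util.module_from_spec(spec)
        try:
            spec.loader.exec_module(module)
        except Exception as exc:
            # Covers SyntaxError, ModuleNotFoundError, ZeroDivisionError, and
            # any other exception raised by top-level code.
            result["error_message"] = f"{type(exc).__name__}: {exc}"
            return result
    result["check_pass"] = True
    return result
```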

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vera_bench/runner.py` around lines 163 - 175, The result dict currently sets
"check_pass": True before the module is written/imported, which inflates
metrics; change the initial "check_pass" to False and only set it to True after
the actual import/check step completes successfully (the code paths around the
import/write/check logic referenced near the result initialization and the
import exception handling around lines ~235-243). Update the import/check
exception handling (the block that currently only downgrades on SyntaxError or
ImportError) to catch any exception raised during import (e.g.,
ModuleNotFoundError, ZeroDivisionError, etc.), set result["check_pass"]=False,
populate result["error_message"] with the exception details, and only leave
check_pass True when the import succeeds; ensure the early return when not
test_cases does not mistakenly preserve a true check_pass by returning the
result after check_pass has been correctly set.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@CHANGELOG.md`:
- Line 10: The headings like "### Added" (and the other headings at the same
locations) are immediately followed by list content which triggers markdownlint
MD022; insert a single blank line after each affected heading (e.g., after the
"### Added" heading and the headings at the other noted locations) so each
heading is followed by an empty line before the list content to satisfy the
rule.

In `@SECURITY.md`:
- Around line 34-37: Update the subprocess policy sentence to reflect both
trusted-command exceptions: change the wording that currently limits the S603
suppression to the `vera` binary found via `shutil.which()` so it also mentions
that `vera_bench/runner.py` suppresses S603 for uses of `sys.executable` in the
Python evaluation path; explicitly reference the `S603` suppression and the
trusted commands (`shutil.which()`-located `vera` and `sys.executable`) so the
policy text matches the code.

In `@vera_bench/runner.py`:
- Around line 217-228: The current subprocess.run invocation
(subprocess.run([...], cwd=work_dir, env=run_env, ...)) still runs untrusted
model output with host privileges; change this by requiring an explicit opt-in
flag (e.g., allow_untrusted_execution) before executing wrapper_path and, when
not opted-in, replace the call with a safe failure or a dry-run; additionally
enforce a strict environment allowlist (build run_env from a whitelist instead
of filtering *_API_KEY), and document/require execution inside a proper sandbox
(container/VM/low-privilege user) for any opt-in path; update checks around
run_env, wrapper_path and work_dir to gate execution and to log a clear warning
when unsafe execution is refused.
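
A minimal sketch of the two ideas in the runner.py comment above, an environment allowlist and an explicit opt-in gate. The flag name allow_untrusted_execution comes from the comment; the allowlist contents, the exception class, and the helper names are assumptions made for illustration, and none of this replaces a real sandbox.

```python
import os

# Hypothetical allowlist: just enough for a child interpreter to start.
ALLOWED_ENV = ("PATH", "HOME", "LANG", "LC_ALL", "SYSTEMROOT", "TMPDIR", "TEMP", "TMP")


class UntrustedExecutionRefused(RuntimeError):
    """Raised when generated code would run without an explicit opt-in."""


def allowlisted_env(environ=None) -> dict:
    """Build the child environment from the allowlist rather than a denylist."""
    source = os.environ if environ is None else environ
    return {name: source[name] for name in ALLOWED_ENV if name in source}


def require_opt_in(allow_untrusted_execution: bool) -> None:
    """Refuse to proceed unless the caller explicitly opted in."""
    if not allow_untrusted_execution:
        raise UntrustedExecutionRefused(
            "refusing to execute model-generated code; "
            "pass allow_untrusted_execution=True inside a sandbox"
        )
```

An allowlist fails closed: a newly introduced secret is excluded by default, whereas a *_API_KEY filter only removes names it already anticipates.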


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: f1881646-cc30-478e-9a8e-a502e9fd7f09

📥 Commits

Reviewing files that changed from the base of the PR and between 40a5eed and a052bde.

📒 Files selected for processing (7)
  • CHANGELOG.md
  • CODE_OF_CONDUCT.md
  • CONTRIBUTING.md
  • README.md
  • SECURITY.md
  • vera_bench/cli.py
  • vera_bench/runner.py

Comment thread CHANGELOG.md
Comment thread SECURITY.md
Comment thread vera_bench/runner.py
@coderabbitai

coderabbitai Bot commented Mar 30, 2026


🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check the current state of cli.py around the Vera-specific flag warning logic
rg -n "mode|skill.md|language|warn" vera_bench/cli.py -A2 -B2 | head -80

Length of output: 2548


@aallan Thanks for the fix — confirmed in the codebase. The warning block (Lines 97–106) correctly fires:

  • [yellow]Warning: --skill-md is ignored with --language python[/yellow] when --skill-md is explicitly provided.
  • [yellow]Warning: --mode is ignored with --language python[/yellow] when --mode is set to a non-default value.

SKILL.md loading and VeraRunner instantiation are also already gated on language == "vera" (Lines 131 and 149), so nothing leaks through silently. Looks good! ✅
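
The warning behaviour confirmed above can be summarised in a small, framework-free sketch. The function name is hypothetical and the Rich colour markup is omitted. The default mode "full-spec" is taken from the earlier suggested change in this thread and should be treated as an assumption about the CLI.

```python
from pathlib import Path
from typing import List, Optional

DEFAULT_MODE = "full-spec"  # assumed default, per the earlier suggested change


def ignored_flag_warnings(
    language: str, mode: str, skill_md: Optional[Path]
) -> List[str]:
    """Return the warnings to print for Vera-only flags in Python mode."""
    if language != "python":
        return []
    warnings = []
    if skill_md is not None:
        warnings.append("Warning: --skill-md is ignored with --language python")
    if mode != DEFAULT_MODE:
        warnings.append("Warning: --mode is ignored with --language python")
    return warnings
```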


🧠 Learnings used
Learnt from: aallan
Repo: aallan/vera-bench PR: 11
File: vera_bench/prompts.py:74-87
Timestamp: 2026-03-30T08:18:58.874Z
Learning: In vera_bench/prompts.py, build_python_prompt() is intentionally minimal (NL description + entry_point only, no SKILL.md, no contracts, no signature). Design rationale: Vera is a novel language the model has never seen so it gets extra context (SKILL.md + contracts); Python is well-represented in training data so it gets only the NL description, matching the context level of standard coding benchmarks. This is a deliberate benchmark fairness decision and should not be flagged as a deficiency.

Learnt from: aallan
Repo: aallan/vera-bench PR: 3
File: vera_bench/cli.py:144-145
Timestamp: 2026-03-29T20:18:00.371Z
Learning: In vera_bench/runner.py and vera_bench/cli.py (aallan/vera-bench), the pattern `json.loads(r.to_jsonl())` in `cli.py` for converting `ProblemResult` objects before passing to `compute_metrics` is intentional: `ProblemResult.to_jsonl()` drops `None` fields, and `compute_metrics` in `vera_bench/metrics.py` relies on absent keys (not `None`-valued keys) when computing rates. Using `dataclasses.asdict()` would preserve `None` values and break metrics computation. A future refactor to a `to_dict(drop_none=True)` method is planned. Do not flag the `json.loads(to_jsonl())` pattern as redundant overhead.

Learnt from: CR
Repo: aallan/vera-bench PR: 0
File: BRIEFING.md:0-0
Timestamp: 2026-03-29T19:42:24.516Z
Learning: Applies to vera_bench/prompts.py : Prompts must include SKILL.md content, problem description, function signature, and optional contracts; fix prompts must include the error message from vera output

Learnt from: CR
Repo: aallan/vera-bench PR: 0
File: BRIEFING.md:0-0
Timestamp: 2026-03-29T19:42:24.516Z
Learning: Applies to vera_bench/**/*.py : Harness Python code must use subprocess calls with timeout protection, JSON parsing of vera --json output, and proper error handling for all vera CLI invocations

Learnt from: CR
Repo: aallan/vera-bench PR: 0
File: BRIEFING.md:0-0
Timestamp: 2026-03-29T19:42:24.516Z
Learning: Applies to vera_bench/runner.py : The runner.py module must implement the complete pipeline: generate (LLM call) → write (file) → check → verify → run → fix attempt, with individual step failures recorded in JSONL metrics
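
The second learning above, on json.loads(r.to_jsonl()), turns on the difference between an absent key and a None-valued key. The sketch below uses a cut-down, hypothetical ProblemResult to show that difference; the real dataclass has more fields, and its to_jsonl() may be implemented differently.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ProblemResult:  # hypothetical stand-in for the real class
    problem_id: str
    check_pass: bool
    verify_pass: Optional[bool] = None

    def to_jsonl(self) -> str:
        """Serialise to one JSON line, dropping None-valued fields."""
        return json.dumps({k: v for k, v in asdict(self).items() if v is not None})


r = ProblemResult("VB-T1-001", check_pass=True)
via_jsonl = json.loads(r.to_jsonl())  # verify_pass key is absent
via_asdict = asdict(r)                # verify_pass key is present, value None

print("verify_pass" in via_jsonl, "verify_pass" in via_asdict)
```

Code that counts eligibility with a test like `"verify_pass" in row` would give different rates for the two dictionaries, which is why the round trip is kept.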

- CHANGELOG.md: add blank lines after headings (MD022)
- SECURITY.md: document S603 suppression for both vera and sys.executable
- runner.py: widen check_pass=False to catch NameError, ModuleNotFoundError,
  IndentationError, TabError (not just SyntaxError/ImportError)
- cli.py: skip verify@1 and fix@1 rows in metrics table for Python runs
  (these Vera-only metrics are meaningless for Python)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@aallan

aallan commented Mar 30, 2026

Owner Author

@coderabbitai Re: outside-diff comment on cli.py lines 169-172 (Vera-only metrics for Python runs) — fixed in 2e2ea53. _print_metrics() now accepts a language parameter and skips verify@1 and fix@1 rows for Python runs.
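
A minimal sketch of the gating described in this comment, assuming the metrics are a flat dict keyed by row label. The function name and the key names other than verify@1 and fix@1 are illustrative, not taken from cli.py.

```python
VERA_ONLY_ROWS = ("verify@1", "fix@1")


def visible_metric_rows(metrics: dict, language: str = "vera") -> dict:
    """Return the rows to print, hiding Vera-only rows for other languages."""
    if language == "vera":
        return dict(metrics)
    return {k: v for k, v in metrics.items() if k not in VERA_ONLY_ROWS}
```

Defaulting language to "vera" keeps existing call sites unchanged, but it also means any caller that forgets the argument silently gets the Vera table.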

@coderabbitai

coderabbitai Bot commented Mar 30, 2026


🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check the current state of _print_metrics in cli.py
rg -n "_print_metrics|verify@1|fix@1|language" vera_bench/cli.py -A3 -B2

Length of output: 4278


@aallan Confirmed in the codebase — the fix looks correct:

  • Line 183: _print_metrics now has language: str = "vera" in its signature.
  • Lines 191–193: verify@1 and fix@1 rows are gated on language == "vera", so Python runs produce a clean table with only check@1 and run_correct.
  • Line 172: The call site in run passes language=language correctly.

One minor observation outside this fix's scope: the baselines command at Line 275 calls _print_metrics(f"{language}-baseline", metrics) without the language= keyword, so it silently defaults to "vera" and would show verify@1/fix@1 rows (as None/0%) for a Python baseline run. Worth a one-liner follow-up if that bothers you — but it's not a blocker. ✅


🧠 Learnings used
Learnt from: aallan
Repo: aallan/vera-bench PR: 1
File: problems/tier2/VB_T2_005_contains_substring.json:7-11
Timestamp: 2026-03-29T19:15:48.671Z
Learning: In vera-bench (aallan/vera-bench), Tier 2 problem JSON files intentionally use `ensures: ["true"]` (weak postconditions). The benchmark challenge for Tier 2 is built-in function *discovery* — whether an LLM agent can find the correct built-in (e.g. `string_contains`, `string_join`, `array_map`) and call it with the correct argument order. Adding a strong postcondition (e.g. `Bool.result == string_contains(String.1, String.0)`) would let Z3 verify a trivial wrapper and eliminate the discovery challenge entirely. Do not flag weak `ensures` as a deficiency for Tier 2 built-in discovery problems.

Learnt from: CR
Repo: aallan/vera-bench PR: 0
File: BRIEFING.md:0-0
Timestamp: 2026-03-29T19:42:24.516Z
Learning: Applies to vera_bench/metrics.py : Metrics computation must calculate check_rate, verify_rate, fix_rate, verify_tier1/tier3 breakdown, and wall_time_s per problem-model combination; aggregate results by tier

Learnt from: CR
Repo: aallan/vera-bench PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T19:41:57.150Z
Learning: Applies to **/vera_bench/*.py : Use `click` for CLI implementation in Python.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
vera_bench/cli.py (1)

273-275: 🧹 Nitpick | 🔵 Trivial

Consider passing language="python" to _print_metrics for baselines.

The baselines command runs Python solutions, but _print_metrics is called without the language parameter, so it defaults to "vera" and displays verify@1/fix@1 rows that aren't meaningful for Python baselines.

♻️ Suggested fix
     if results:
         metrics = compute_metrics([json.loads(r.to_jsonl()) for r in results])
-        _print_metrics(f"{language}-baseline", metrics)
+        _print_metrics(f"{language}-baseline", metrics, language=language)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vera_bench/cli.py` around lines 273 - 275, The baseline metrics call is using
_print_metrics without specifying language so it defaults to "vera" and shows
inappropriate verify/fix rows; change the call in the baselines path so that
_print_metrics is invoked with the language parameter set to "python" (e.g.,
_print_metrics(f"{language}-baseline", metrics, language="python")) after
computing metrics with compute_metrics([json.loads(r.to_jsonl()) for r in
results]) so the printed table uses Python-specific metric columns.
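
To illustrate what a language-aware metrics table might do, here is a small sketch. The helper `metric_rows`, the `run_correct_rate` key, and the row labels other than verify@1 / fix@1 / check@1 are assumptions made for the example; the real `_print_metrics` in `vera_bench/cli.py` may be structured quite differently.

```python
from __future__ import annotations


def metric_rows(metrics: dict, language: str = "vera") -> list[tuple[str, float]]:
    """Pick the rows a metrics table would show for a language (hypothetical)."""
    rows = [("check@1", metrics.get("check_rate", 0.0))]
    if language == "vera":
        # verify and fix steps only exist in the Vera pipeline,
        # so these rows are skipped for Python runs and baselines.
        rows.append(("verify@1", metrics.get("verify_rate", 0.0)))
        rows.append(("fix@1", metrics.get("fix_rate", 0.0)))
    rows.append(("run_correct", metrics.get("run_correct_rate", 0.0)))
    return rows
```

With a default of `"vera"`, a caller that forgets to pass `language` gets the Vera rows, which is the behaviour the comment above flags for the baselines path.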
♻️ Duplicate comments (1)
vera_bench/runner.py (1)

163-175: ⚠️ Potential issue | 🟠 Major

Early return for empty test_cases bypasses syntax validation.

When test_cases is empty (common for String/Array/ADT-returning problems), the function returns check_pass=True without ever writing or parsing the generated code. This means malformed Python generations are recorded as successful checks for these problems.

A lightweight ast.parse(code) guard before the early return would keep the check_pass metric honest:

🛡️ Proposed fix
+import ast
+
 def _evaluate_python_code(
     code: str,
     problem: dict,
     work_dir: Path,
     attempt: int,
 ) -> dict:
     """Write Python code to a file and run test cases via subprocess."""
     entry_point = problem.get("entry_point", "")
     test_cases = problem.get("test_cases", [])

     result: dict = {
-        "check_pass": True,
+        "check_pass": False,
         "verify_pass": None,
         "verify_tier1": 0,
         "verify_tier3": 0,
         "run_correct": None,
         "tests_total": 0,
         "tests_passed": 0,
         "error_message": None,
     }

+    # Syntax-check even when no test_cases exist
+    try:
+        ast.parse(code)
+    except SyntaxError as exc:
+        result["error_message"] = f"SyntaxError: {exc}"
+        return result
+    result["check_pass"] = True
+
     if not test_cases:
         return result
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vera_bench/runner.py` around lines 163 - 175, Before returning early when
test_cases is empty, run a lightweight syntax check with ast.parse on the
generated code: wrap ast.parse(code) in a try/except, and if it raises
SyntaxError/Exception set result["check_pass"] = False and
result["error_message"] = str(exception) (or keep True/no error if parse
succeeds) before returning result; update the block that currently does `if not
test_cases: return result` so it first validates the code via ast.parse and then
returns the possibly-updated result dict.
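
As a standalone illustration of the proposed guard, the sketch below wraps `ast.parse` in a helper. The name `syntax_check` and its return shape are assumptions for the example and are not part of `runner.py`.

```python
from __future__ import annotations

import ast


def syntax_check(code: str) -> tuple[bool, str | None]:
    """Return (ok, error_message) for a piece of generated Python source.

    ast.parse only parses; it never executes the code, so failures that
    happen at import time (for example ZeroDivisionError) are not detected.
    """
    try:
        ast.parse(code)
    except (SyntaxError, ValueError) as exc:
        # ValueError covers inputs such as embedded null bytes on older Pythons.
        return False, f"{type(exc).__name__}: {exc}"
    return True, None
```

Note the limitation in the docstring: a module body like `x = 1 / 0` parses cleanly, so this check would still report it as passing.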
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@vera_bench/runner.py`:
- Around line 316-319: The code calls _evaluate_code(code, problem, vera,
work_dir, attempt=1) when language != "python" but the parameter vera is typed
as VeraRunner | None; add a defensive check at the start of the function
handling these lines: if language == "vera" (or whenever non-"python" paths
expect vera) assert or raise a clear exception (e.g., raise ValueError or
TypeError) if vera is None, or else only call _evaluate_code when vera is not
None; update the block around the language variable and the call sites to use
this guard so _evaluate_code never receives a None vera.
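
A minimal sketch of such a guard, assuming it lives in a small helper. `require_vera` is a hypothetical name, and the empty `VeraRunner` class is only a stand-in so the example runs on its own.

```python
from __future__ import annotations


class VeraRunner:
    """Stand-in for the real runner class; present only so the sketch runs."""


def require_vera(language: str, vera: VeraRunner | None) -> VeraRunner | None:
    """Fail fast if a non-Python path is selected without a runner (hypothetical)."""
    if language != "python" and vera is None:
        raise ValueError(f"language={language!r} requires a VeraRunner, got None")
    return vera
```

Raising at the top of the function keeps the `None` case from surfacing later as an opaque `AttributeError` inside `_evaluate_code`.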

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 5e7a2c69-e35e-49d9-a7cd-951b612c057e

📥 Commits

Reviewing files that changed from the base of the PR and between a052bde and 2e2ea53.

📒 Files selected for processing (4)
  • CHANGELOG.md
  • SECURITY.md
  • vera_bench/cli.py
  • vera_bench/runner.py

Comment thread vera_bench/runner.py
- runner.py: raise ValueError if vera is None when language='vera'
- cli.py: pass language to _print_metrics in baselines command so
  verify/fix rows are hidden for Python baselines

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@aallan

aallan commented Mar 30, 2026

Owner Author

@coderabbitai Re: outside-diff comment on cli.py lines 273-275 (baselines metrics showing verify/fix) — fixed in 2d46028. Baselines command now passes language param to _print_metrics.

@coderabbitai

coderabbitai Bot commented Mar 30, 2026

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check the current state of the baselines command in cli.py around line 273-275
rg -n "_print_metrics|baselines|language" vera_bench/cli.py -A3 -B2 | tail -60

Length of output: 2008


@aallan Confirmed in the codebase — Line 275 now reads:

_print_metrics(f"{language}-baseline", metrics, language=language)

Since the baselines command only accepts language="python" (it's a click.Choice(["python"])), verify@1 and fix@1 will be correctly suppressed for all baseline runs. Clean fix! ✅


🧠 Learnings used
Learnt from: aallan
Repo: aallan/vera-bench PR: 11
File: vera_bench/prompts.py:74-87
Timestamp: 2026-03-30T08:18:58.874Z
Learning: In vera_bench/prompts.py, build_python_prompt() is intentionally minimal (NL description + entry_point only, no SKILL.md, no contracts, no signature). Design rationale: Vera is a novel language the model has never seen so it gets extra context (SKILL.md + contracts); Python is well-represented in training data so it gets only the NL description, matching the context level of standard coding benchmarks. This is a deliberate benchmark fairness decision and should not be flagged as a deficiency.

Learnt from: aallan
Repo: aallan/vera-bench PR: 3
File: vera_bench/cli.py:144-145
Timestamp: 2026-03-29T20:18:00.371Z
Learning: In vera_bench/runner.py and vera_bench/cli.py (aallan/vera-bench), the pattern `json.loads(r.to_jsonl())` in `cli.py` for converting `ProblemResult` objects before passing to `compute_metrics` is intentional: `ProblemResult.to_jsonl()` drops `None` fields, and `compute_metrics` in `vera_bench/metrics.py` relies on absent keys (not `None`-valued keys) when computing rates. Using `dataclasses.asdict()` would preserve `None` values and break metrics computation. A future refactor to a `to_dict(drop_none=True)` method is planned. Do not flag the `json.loads(to_jsonl())` pattern as redundant overhead.

Learnt from: aallan
Repo: aallan/vera-bench PR: 11
File: vera_bench/runner.py:217-228
Timestamp: 2026-03-30T08:33:29.948Z
Learning: In vera_bench/runner.py, the _evaluate_python_code() subprocess invocation (cwd=work_dir, env with *_API_KEY vars stripped) is intentionally the security boundary for Python evaluation. The maintainer considers this proportionate for a benchmark tool: invoking `vera-bench run --language python` is explicit user consent to execute LLM-generated code. Do not flag the absence of a container/VM sandbox or an opt-in execution flag as a deficiency; the cwd isolation and API-key env stripping are the accepted mitigations.

Learnt from: CR
Repo: aallan/vera-bench PR: 0
File: BRIEFING.md:0-0
Timestamp: 2026-03-29T19:42:24.516Z
Learning: Applies to vera_bench/**/*.py : Harness Python code must use subprocess calls with timeout protection, JSON parsing of vera --json output, and proper error handling for all vera CLI invocations

Learnt from: CR
Repo: aallan/vera-bench PR: 0
File: BRIEFING.md:0-0
Timestamp: 2026-03-29T19:42:24.516Z
Learning: Applies to vera_bench/prompts.py : Prompts must include SKILL.md content, problem description, function signature, and optional contracts; fix prompts must include the error message from vera output

Learnt from: CR
Repo: aallan/vera-bench PR: 0
File: BRIEFING.md:0-0
Timestamp: 2026-03-29T19:42:24.516Z
Learning: Applies to vera_bench/runner.py : The runner.py module must implement the complete pipeline: generate (LLM call) → write (file) → check → verify → run → fix attempt, with individual step failures recorded in JSONL metrics
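
The `json.loads(r.to_jsonl())` learning above turns on the difference between an absent key and a `None` value. The sketch below shows that difference with a cut-down `ProblemResult`; the three fields and the body of `to_jsonl` are assumptions for illustration, not the real class.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass


@dataclass
class ProblemResult:
    """Cut-down stand-in for the real result class (fields are assumptions)."""

    problem_id: str
    check_pass: bool
    verify_pass: bool | None = None

    def to_jsonl(self) -> str:
        # Drop None-valued fields so "not applicable" becomes an absent key.
        return json.dumps({k: v for k, v in asdict(self).items() if v is not None})


result = ProblemResult(problem_id="VB-T1-001", check_pass=True)
via_jsonl = json.loads(result.to_jsonl())  # no "verify_pass" key at all
via_asdict = asdict(result)                # "verify_pass" present with value None
```

A metrics function that tests `"verify_pass" in record` would treat these two dictionaries differently, which is why the round trip through `to_jsonl()` is kept.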

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
vera_bench/runner.py (1)

163-180: ⚠️ Potential issue | 🟠 Major

Still open: Python check_pass is set before the code is checked.

Line 174 can return check_pass=True before the file is even written, and Lines 238-246 only treat a short stderr allow-list as a check failure. On problems with intentionally empty test_cases, or on modules that blow up at import time with something like ZeroDivisionError, malformed Python still counts towards check@1. Please run a dedicated parse/import step first and let that single outcome drive check_pass before any early return or test execution.

Based on learnings, empty test_cases are expected for String/Array/ADT problems, and runner.py must record each pipeline step failure in JSONL metrics.

Also applies to: 235-247

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vera_bench/runner.py` around lines 163 - 180, The current flow sets
result["check_pass"] before validating the generated Python (leading to
malformed modules or import-time errors being counted as passes); change
runner.py to first write the code (use safe_id and code_path as currently
constructed), then run a dedicated parse/import validation step (e.g., attempt
to compile() or import the written module in a sandboxed subprocess) and set
result["check_pass"] strictly from that outcome; only after that, if check_pass
is True, proceed to early return on empty test_cases or to run tests, and ensure
any parse/import failures are recorded in result["error_message"] and
appropriate pipeline flags (same fix applied to the other block around check
handling at the later 235-247 region).
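
One way the dedicated import step described above could look, as a sketch only. The helper name `import_check`, the fixed module name `candidate`, and the 10-second timeout are assumptions for the example; they are not taken from `runner.py`.

```python
from __future__ import annotations

import subprocess
import sys
import tempfile
from pathlib import Path


def import_check(code: str, work_dir: Path, timeout_s: float = 10.0) -> tuple[bool, str | None]:
    """Write code to work_dir and import it in a child interpreter.

    Returns (check_pass, error_message). Any exception raised while the
    module is imported, SyntaxError included, makes the child exit non-zero.
    """
    (work_dir / "candidate.py").write_text(code, encoding="utf-8")
    try:
        proc = subprocess.run(
            # -B avoids writing .pyc files into the working directory.
            [sys.executable, "-B", "-c", "import candidate"],
            cwd=work_dir,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, "import timed out"
    if proc.returncode != 0:
        lines = proc.stderr.strip().splitlines()
        return False, lines[-1] if lines else "import failed"
    return True, None


with tempfile.TemporaryDirectory() as tmp:
    # Parses fine, but fails as soon as the module body runs.
    ok, error = import_check("x = 1 / 0\n", Path(tmp))
```

Unlike a pure parse, this executes the module body, so it carries the same trust assumptions as running the test cases themselves.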
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2d0fda06-98f8-4ccf-9779-4df2b06f1953

📥 Commits

Reviewing files that changed from the base of the PR and between 2e2ea53 and 2d46028.

📒 Files selected for processing (2)
  • vera_bench/cli.py
  • vera_bench/runner.py

@aallan

aallan commented Mar 30, 2026

Owner Author

@coderabbitai Re: "Python check_pass is set before the code is checked". The concern is that check_pass=True is set before validation, but:

  1. If test_cases is empty, we return early with check_pass=True — but there's nothing to validate against. No tests = no observable failure.
  2. If test_cases exist, we run the code and catch SyntaxError/ImportError/etc in the subprocess, setting check_pass=False at that point (fixed in a052bde and widened in 2e2ea53).
  3. Adding ast.parse or a compile step for the empty-test-cases path catches syntax errors in code we can't test anyway — it adds complexity for no actionable signal.

Nothing to do here.

@coderabbitai

coderabbitai Bot commented Mar 30, 2026

@aallan — Agreed, the reasoning is sound across all three paths:

  • Empty test cases → check_pass=True early return: there's no observable failure possible, so the sentinel value is correct and not misleading in the metrics.
  • Non-empty test cases → subprocess catches errors: SyntaxError, ImportError, and runtime failures all surface as non-zero exit or stderr, which correctly flips check_pass=False.
  • No ast.parse pre-check for the empty path: adding a compile step there would produce a signal (syntax valid/invalid) on code that can't be exercised anyway — noise with no actionable benchmark value.

Nothing to do here — noted for future reviews.


✏️ Learnings added
Learnt from: aallan
URL: https://github.com/aallan/vera-bench/pull/11

Timestamp: 2026-03-30T08:51:14.736Z
Learning: In vera_bench/runner.py, `_evaluate_python_code()` intentionally sets `check_pass=True` as an initial/early-return value before execution. Design rationale: (1) if `test_cases` is empty, early return with `check_pass=True` is correct — there is nothing to validate against, so no observable failure is possible; (2) if `test_cases` are present, the subprocess execution catches `SyntaxError`, `ImportError`, and runtime failures via non-zero exit, setting `check_pass=False` appropriately; (3) adding an `ast.parse` or compile pre-check for the empty-test-cases path adds complexity with no actionable benchmark signal. Do not flag `check_pass=True` before subprocess execution as a deficiency in `_evaluate_python_code()`.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
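
The subprocess boundary these learnings refer to (`cwd=work_dir`, `*_API_KEY` variables stripped, failures surfacing as a non-zero exit) can be sketched roughly as follows. The helper names `stripped_env` and `run_generated` and the timeout value are assumptions for illustration, not the code in `_evaluate_python_code()`.

```python
from __future__ import annotations

import os
import subprocess
import sys
from pathlib import Path


def stripped_env(environ: dict[str, str]) -> dict[str, str]:
    """Copy an environment, dropping every variable whose name ends in _API_KEY."""
    return {k: v for k, v in environ.items() if not k.endswith("_API_KEY")}


def run_generated(script: Path, work_dir: Path, timeout_s: float = 10.0) -> subprocess.CompletedProcess:
    """Run a generated script with cwd=work_dir and API keys removed."""
    return subprocess.run(
        [sys.executable, str(script.resolve())],
        cwd=work_dir,
        env=stripped_env(dict(os.environ)),
        capture_output=True,
        text=True,
        timeout=timeout_s,  # raises subprocess.TimeoutExpired if exceeded
    )
```

A caller would then derive `check_pass` from `returncode` and stderr, since a `SyntaxError` or `ImportError` in the generated file makes the child interpreter exit non-zero.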


🧠 Learnings used
Learnt from: aallan
Repo: aallan/vera-bench PR: 1
File: problems/tier3/VB_T3_008_tree_count_leaves.json:9-14
Timestamp: 2026-03-29T19:15:25.737Z
Learning: In vera-bench (aallan/vera-bench), problems whose entry-point function *accepts* ADT input parameters (e.g., Tree, List, Option, or any user-defined ADT) must have an empty `test_cases` array. `vera run --fn` cannot parse constructed ADT values (e.g., `Branch(Leaf(1), Leaf(2))` or `Cons(1, Nil)`) from CLI arguments. Validation for these problems is performed exclusively via `vera check` and `vera verify`. Do not flag empty `test_cases` as a deficiency for any problem whose entry-point takes ADT-typed parameters.

Learnt from: aallan
Repo: aallan/vera-bench PR: 1
File: problems/tier3/VB_T3_006_option_unwrap_or.json:14-14
Timestamp: 2026-03-29T19:15:21.168Z
Learning: In vera-bench (aallan/vera-bench), problems whose entry-point accepts ADT-typed parameters (e.g., Option, List, Tree, Cons, Nil) must leave `test_cases` empty. `vera run --fn` cannot accept constructed ADT values (e.g., Some(42), Cons(1, Nil)) as CLI arguments, so runtime test comparison is not feasible for these inputs. These problems are validated exclusively via `vera check` and `vera verify`. Additionally, `ensures: true` is intentional for Tier 3 ADT problems — the evaluation challenge is whether the agent can correctly define the ADT and write pattern matching with correct De Bruijn indices, not whether it can write strong postconditions. Do not flag empty `test_cases` or weak contracts as deficiencies for these problems.

Learnt from: aallan
Repo: aallan/vera-bench PR: 1
File: problems/tier5/VB_T5_005_checked_index.json:7-11
Timestamp: 2026-03-29T19:15:55.453Z
Learning: In vera-bench (aallan/vera-bench), `ensures: ["true"]` (a permissive/vacuous postcondition) is intentional for Tier 5 problems. The evaluation goal for Tier 5 is whether an LLM agent can correctly wire up effect handlers (State<T>, Exn<E>, IO patterns) — not whether it can write strong Z3-verifiable postconditions. Stronger ensures clauses would change the problem's difficulty axis from effect system understanding to contract verification. Do not flag `ensures: ["true"]` as a weakness for Tier 5 problems involving effect handlers (State/Exn/IO).

Learnt from: aallan
Repo: aallan/vera-bench PR: 1
File: problems/tier5/VB_T5_010_safe_head.json:9-14
Timestamp: 2026-03-29T19:16:05.346Z
Learning: In vera-bench (aallan/vera-bench), `ensures: ["true"]` (a permissive/vacuous postcondition) is intentional for Tier 5 effect-handler problems (State<Int>, Exn<E>, IO). The evaluation goal for these problems is whether an LLM agent can correctly wire up effect handlers (handle[State<Int>], handle[Exn<Int>], IO declarations, where-block helpers with effects annotations) — not whether it can write strong Z3-verifiable postconditions. Stronger ensures clauses would shift the challenge toward Z3 verification rather than effect-system understanding. Do not flag `ensures: ["true"]` as a weakness for Tier 5 problems involving State, Exn, or IO effects.

