Implement LLM runner harness (Phase 2) #3

Merged: aallan merged 4 commits into main from feature/llm-runner on Mar 29, 2026

aallan (Owner) commented Mar 29, 2026

Summary

Implements the complete benchmark evaluation pipeline — vera-bench run --model claude-sonnet-4-20250514 now works end-to-end.

  • models.py: Anthropic + OpenAI API abstraction with lazy imports, provider detection from model ID prefix, SDK built-in retry for rate limits
  • runner.py: generate → check → verify → run → fix pipeline with code extraction from markdown fences, JSONL output (incremental, crash-safe), temp file management
  • metrics.py: check_rate, verify_rate, fix_rate, run_correct_rate computation with per-tier breakdowns
  • report.py: markdown report generation (summary table, tier breakdown, per-problem detail)
  • cli.py: run and report commands fully wired up
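The provider detection mentioned for models.py can be pictured roughly as follows. This is a minimal sketch rather than the code in this PR: `detect_provider` and `_PROVIDER_PREFIXES` are hypothetical names, and only the prefixes themselves (claude-, gpt-, o1-, o3-) and the ValueError for unknown models are taken from the commit message and tests described in this conversation.

```python
# Hypothetical sketch of prefix-based provider detection (names are illustrative).
_PROVIDER_PREFIXES = {
    "claude-": "anthropic",
    "gpt-": "openai",
    "o1-": "openai",
    "o3-": "openai",
}


def detect_provider(model_id: str) -> str:
    """Return the provider name for a model ID, or raise ValueError."""
    for prefix, provider in _PROVIDER_PREFIXES.items():
        if model_id.startswith(prefix):
            return provider
    raise ValueError(f"Unsupported model: {model_id!r}")


print(detect_provider("claude-sonnet-4-20250514"))  # anthropic
```

Keeping the mapping in one table means a new prefix is a one-line change, which is presumably why prefix matching was preferred over per-model configuration.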

New CLI usage

# Run full benchmark
vera-bench run --model claude-sonnet-4-20250514

# Run single tier
vera-bench run --model claude-sonnet-4-20250514 --tier 1

# Run single problem
vera-bench run --model claude-sonnet-4-20250514 --problem VB-T1-001

# Spec-from-NL mode (agent writes contracts)
vera-bench run --model claude-sonnet-4-20250514 --mode spec-from-nl

# Generate report
vera-bench report results/

Key design decisions

  • No streaming — batch responses only, token counts come automatically
  • SDK retry — no custom rate limit handling, both SDKs retry 429s
  • Incremental JSONL — each result written immediately (survives crashes)
  • Code extraction — regex for markdown fences, longest block wins, falls back to raw text
  • One fix attempt — on check failure, feeds error back to model for one retry
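The code-extraction decision above (fenced blocks, longest block wins, raw-text fallback) can be sketched as below. The regex and the name `extract_code_sketch` are assumptions made for illustration; the actual `extract_vera_code()` in runner.py may differ in its pattern and edge-case handling.

```python
import re

# Assumed pattern: an opening fence with an optional language tag, the body,
# then a closing fence with an optional newline before it.
_FENCE_SKETCH = re.compile(r"```[^\n`]*\n(.*?)\n?```", re.DOTALL)


def extract_code_sketch(response: str) -> str:
    """Return the longest fenced block, or the raw text if no fence is found."""
    blocks = _FENCE_SKETCH.findall(response)
    code = max(blocks, key=len) if blocks else response
    # Normalise to stripped code with a single terminating newline.
    return code.strip() + "\n"


reply = "Try:\n```vera\nshort\n```\nor:\n```vera\nmuch longer block\n```\n"
print(repr(extract_code_sketch(reply)))         # 'much longer block\n'
print(repr(extract_code_sketch("plain code")))  # 'plain code\n'
```

Choosing the longest block is a heuristic: models often emit a short illustrative snippet before the full program, and the full program is usually the longer of the two.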

Test plan

  • 285 tests pass (259 existing + 26 new)
  • ruff check . && ruff format --check . clean
  • ruff check --select S vera_bench/ security lint clean
  • vera-bench run --model claude-sonnet-4-20250514 --problem VB-T1-001 (requires API key)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Fully functional CLI run command with benchmark execution, per-model JSONL results and Rich summary table
    • New run options: --max-tokens (default 4096), --keep-temps, improved --tier help and --skill-md support
    • Benchmark runner: end-to-end generation, evaluation, optional fix attempts and incremental result output
    • Markdown-only report generation that writes/prints summary.md and per-model reports
    • Automated metrics computation with tiered breakdowns and JSONL load support
    • Integrated LLM provider support with clearer errors for missing keys/unsupported models
  • Tests

    • Comprehensive test suite covering parsing, serialization, provider detection, metrics, reporting and retry behaviour
  • Documentation

    • README updated with prerequisites, installation, Vera setup and revised quick-start commands

Add the complete benchmark evaluation pipeline:

models.py — LLM API abstraction
- AnthropicClient and OpenAIClient with lazy imports
- Unified LLMResponse dataclass (text, tokens, wall_time)
- Provider detection from model ID prefix (claude-*, gpt-*, o1-*, o3-*)
- API keys from environment, SDK built-in retry for rate limits

runner.py — Pipeline orchestration
- extract_vera_code(): regex-based code extraction from markdown fences
- run_single_problem(): generate -> check -> verify -> run -> fix pipeline
- run_benchmark(): iterate problems with rich progress, JSONL output
- ProblemResult dataclass matching BRIEFING.md JSONL format
- Retry-with-error-feedback (one fix attempt on check failure)
- Temp file management with optional --keep-temps
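As a rough illustration of the incremental JSONL output, the sketch below appends one record per result and drops None-valued fields when serialising. `ResultSketch`, its field names other than `error_message`, and `append_result` are hypothetical stand-ins; the real `ProblemResult` follows the format in BRIEFING.md, which is not reproduced in this conversation.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional


@dataclass
class ResultSketch:
    """Hypothetical stand-in for ProblemResult; field names are illustrative."""

    problem_id: str
    attempt: int
    check_pass: bool
    error_message: Optional[str] = None

    def to_jsonl(self) -> str:
        # Drop None-valued fields so each record stays compact.
        data = {k: v for k, v in asdict(self).items() if v is not None}
        return json.dumps(data)


def append_result(path: Path, result: ResultSketch) -> None:
    # Append and flush one line per result so earlier results survive a crash.
    with path.open("a", encoding="utf-8") as fh:
        fh.write(result.to_jsonl() + "\n")
        fh.flush()
```

Because every line is a complete JSON document, a run interrupted partway through still leaves a file that can be loaded line by line.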

metrics.py — Result aggregation
- load_results(): parse JSONL files
- compute_metrics(): check_rate, verify_rate, fix_rate, run_correct_rate
- Per-tier breakdowns via problem ID parsing
- Handles multi-attempt results (best-attempt for verify/run, fix_rate)
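The per-tier aggregation can be sketched like this. It is an approximation under stated assumptions: the helper names are hypothetical, the row keys (`problem_id`, `attempt`, `check_pass`) are illustrative, a zero denominator is assumed to yield 0.0, and check_rate is assumed to count first attempts only. The tier is parsed from IDs shaped like VB-T1-001, as used in the CLI examples.

```python
from collections import defaultdict


def rate_sketch(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def tier_from_id_sketch(problem_id: str) -> int:
    """Parse the tier from an ID shaped like 'VB-T1-001'."""
    return int(problem_id.split("-")[1].lstrip("T"))


def check_rate_by_tier(results: list[dict]) -> dict[int, float]:
    """Compute the first-attempt check pass rate per tier."""
    passed: dict[int, int] = defaultdict(int)
    total: dict[int, int] = defaultdict(int)
    for row in results:
        if row["attempt"] != 1:
            continue  # Assumption: only first attempts count toward check_rate.
        tier = tier_from_id_sketch(row["problem_id"])
        total[tier] += 1
        passed[tier] += int(row["check_pass"])
    return {tier: rate_sketch(passed[tier], total[tier]) for tier in total}
```

For example, two tier-1 problems with one first-attempt pass and one tier-2 problem that passes would give `{1: 0.5, 2: 1.0}`.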

report.py — Markdown report generation
- Summary table (model x metrics)
- Tier breakdown matrix
- Per-problem detail listing
- Writes summary.md to results directory

cli.py — Wired up run and report commands
- vera-bench run --model MODEL [--tier N] [--problem ID] [--mode MODE]
- vera-bench report RESULTS_DIR
- Problem filtering, SKILL.md loading, output directory management
- Metrics summary printed on completion

tests/test_runner.py — 26 new tests
- Code extraction (plain, fenced, multi-fence, no-fence)
- ProblemResult JSONL serialization
- Provider detection (claude/gpt/unknown)
- Metrics computation with hand-crafted fixtures
- Report generation
- Full pipeline with mock LLMClient and VeraRunner

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai Bot commented Mar 29, 2026

Warning

Rate limit exceeded

@aallan has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 5 minutes and 23 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 5 minutes and 23 seconds.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: d77e22f1-4725-467c-a451-86497c6ff8ea

📥 Commits

Reviewing files that changed from the base of the PR and between 8874109 and f281579.

📒 Files selected for processing (5)
  • README.md
  • tests/test_runner.py
  • vera_bench/cli.py
  • vera_bench/models.py
  • vera_bench/runner.py
📝 Walkthrough

Adds a benchmark harness: LLM client abstractions, a runner that generates and evaluates Vera programs with optional fix attempts, metric computation, Markdown reporting, CLI wiring for run/report, and tests covering parsing, serialization, provider selection, metrics, reporting and retry behaviour.

Changes

| Cohort / File(s) | Summary |
|---|---|
| LLM client & models: vera_bench/models.py | Added LLMResponse dataclass and LLMClient Protocol; implemented create_client() factory, AnthropicClient and OpenAIClient with lazy SDK imports, API-key checks, timeout→TimeoutError handling, and safe extraction/defaulting of tokens, model and wall-time. |
| Runner / evaluation: vera_bench/runner.py | Added extract_vera_code() parser, ProblemResult with to_jsonl() that omits None, _evaluate_code() to write/run/check/verify/execute tests, run_single_problem() with fix-retry logic and run_benchmark() with Rich progress, temp-dir management, optional keep-temps and _now() timestamp helper. |
| Metrics computation: vera_bench/metrics.py | Added TierMetrics and BenchmarkMetrics dataclasses, load_results() and compute_metrics() plus helpers (_compute_by_tier, _tier_from_id, _rate) to group attempts by problem/tier and compute check/verify/fix/run_correct rates with zero-division protection and empty-input handling. |
| Reporting (Markdown): vera_bench/report.py | Replaced multi-format reporting with Markdown-only generate_report(results_dir: Path) -> str; scans *.jsonl, uses load_results and compute_metrics, writes summary.md, builds summary, tier breakdown and per-problem sections, and returns/report messages when no results found. |
| CLI integration: vera_bench/cli.py | Implemented run flow: discover problems, load SKILL.md, create client, instantiate VeraRunner and call run_benchmark; added --max-tokens and --keep-temps options, _repo_root() helper, improved --tier help, and updated report to call generate_report and print summary.md path. |
| Tests: tests/test_runner.py | New tests covering extract_vera_code(), ProblemResult.to_jsonl(), provider selection/errors for create_client(), compute_metrics() and load_results() behaviours, generate_report() outputs, CLI command presence, and mocked run_single_problem() retry semantics (including retry suppression when max_fix_attempts=0). |
| Docs / README: README.md | Rewrote Quick start and installation: explicit Python/Git prerequisites, virtualenv workflow, .[llm] extra, instructions to install Vera compiler and minimum Vera version, updated examples using vera-bench and results/ layout. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested labels

harness,docs

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 24.14%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Implement LLM runner harness (Phase 2)' directly and specifically describes the main change: completing the LLM evaluation pipeline with models, runner, metrics, reporting, and CLI integration. |

codecov-commenter commented Mar 29, 2026

Codecov Report

❌ Patch coverage is 70.11236% with 133 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.97%. Comparing base (a6749ee) to head (f281579).

| Files with missing lines | Patch % | Lines |
|---|---:|---|
| vera_bench/cli.py | 10.60% | 59 Missing ⚠️ |
| vera_bench/runner.py | 71.85% | 38 Missing ⚠️ |
| vera_bench/models.py | 52.38% | 30 Missing ⚠️ |
| vera_bench/metrics.py | 96.49% | 4 Missing ⚠️ |
| vera_bench/report.py | 97.01% | 2 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff             @@
##             main       #3       +/-   ##
===========================================
+ Coverage   40.14%   59.97%   +19.82%     
===========================================
  Files           5        9        +4     
  Lines         269      707      +438     
===========================================
+ Hits          108      424      +316     
- Misses        161      283      +122     

| Flag | Coverage Δ |
|---|---|
| python | 59.97% <70.11%> (+19.82%) ⬆️ |

coderabbitai Bot left a comment

Actionable comments posted: 8

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_runner.py`:
- Around line 109-124: Tests in TestCreateClient depend on external environment
state; make them deterministic by using pytest's monkeypatch to clear provider
API env vars before calling create_client. Update test_anthropic_prefix,
test_openai_prefix, and test_o1_prefix to accept a monkeypatch fixture and call
monkeypatch.delenv("ANTHROPIC_API_KEY", raising=False) /
monkeypatch.delenv("OPENAI_API_KEY", raising=False) /
monkeypatch.delenv("O1_API_KEY", raising=False) respectively (or the actual env
var names used by create_client), then run the existing with
pytest.raises((ImportError, EnvironmentError)): create_client(...) assertion so
the test no longer relies on CI secrets; keep the
test_unknown_raises_value_error unchanged.
- Around line 206-217: The test_jsonl_round_trip creates a temporary file with
tempfile.NamedTemporaryFile and calls path.unlink() after assertions, but that
cleanup won't run if an assertion fails; update the test to use pytest's
tmp_path fixture or a try/finally so the temp file is always removed.
Specifically, replace the tempfile.NamedTemporaryFile usage in
test_jsonl_round_trip (and the path variable) with tmp_path.joinpath / tmp_path
/ tmp_path fixture APIs to create/write the .jsonl file, or wrap the current
creation/assertion in try/finally to call path.unlink() in the finally block so
cleanup always occurs.

In `@vera_bench/cli.py`:
- Around line 143-144: The code is re-serialising ProblemResult objects via
to_jsonl() and json.loads(), which is wasteful; update compute_metrics (and its
callers like where compute_metrics is invoked before _print_metrics) to accept a
list of ProblemResult objects directly (or alternatively convert each
ProblemResult to a dict with dataclasses.asdict(result) or a dedicated to_dict()
method) and pass results returned from run_benchmark straight into
compute_metrics (replace json.loads(r.to_jsonl()) with either r or asdict(r));
adjust compute_metrics parameter type and internal handling to read fields from
ProblemResult instead of expecting pre-parsed dicts.
- Around line 62-67: The CLI accepts --max-tokens but it isn't forwarded to the
LLM call; update the call chain to thread max_tokens from the click handler into
run(), then into run_benchmark(), then into run_single_problem(), and finally
pass it to client.complete() (or the client's request payload) so the runtime
uses the user-specified value; update the function signatures for run(),
run_benchmark(), and run_single_problem() to accept a max_tokens:int (with
existing defaults preserved) and propagate that parameter when invoking
client.complete().

In `@vera_bench/metrics.py`:
- Around line 68-108: The logic that tallies check/verify/fix/run counts is
duplicated between compute_metrics and _compute_by_tier; extract it into a new
helper _compute_counts(by_problem: dict[str, list[dict]]) that returns the tuple
(check_pass_count, verify_pass_count, verify_eligible, fix_success,
fix_eligible, run_correct_count, run_eligible, total) using the exact selection
logic (attempt_1, attempt_2, best) shown in the diff, then replace the local
counting blocks in compute_metrics and _compute_by_tier to call _compute_counts
and map the returned values into their BenchmarkMetrics constructions (update
the arguments to _rate calls accordingly) so both functions reuse the single
implementation and remain consistent.

In `@vera_bench/models.py`:
- Around line 138-146: The code assumes choice.message is non-null when
computing text (choice.message.content), which can raise AttributeError; modify
the extraction to defensively check that choice and choice.message exist before
accessing .content (e.g., set text = choice.message.content if choice and
choice.message and choice.message.content else ""), update the logic around
response.choices and the LLMResponse construction (references: response.choices,
choice, choice.message, LLMResponse) so text falls back to an empty string when
message is None while preserving the existing usage and model fields.
- Around line 114-147: The complete method is passing timeout=timeout into
self._client.chat.completions.create which the OpenAI SDK 1.x does not accept;
remove the timeout kwarg from that call and instead either instantiate the
client with a timeout or call
self._client.with_options(timeout=timeout).chat.completions.create(...); update
the call site in complete (and any similar calls) to use
client.with_options(timeout=timeout).chat.completions.create(...) or ensure the
client was created with OpenAI(timeout=...) so you avoid the TypeError at
runtime.

In `@vera_bench/runner.py`:
- Around line 27-41: The regex _FENCE_RE used by extract_vera_code requires a
newline before the closing backticks so blocks like ```vera\ncode``` are missed;
update _FENCE_RE to allow an optional newline before the closing backticks (e.g.
make the pattern use \n? before ```), keep re.DOTALL, then ensure
extract_vera_code continues to pick the longest match and returns the stripped
code plus a terminating newline.
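The fence-regex finding can be reproduced with a small comparison. Both patterns below are assumptions standing in for the before and after forms of `_FENCE_RE`, which is not shown in this conversation; the point is only the effect of making the newline before the closing backticks optional.

```python
import re

# Assumed "before" pattern: requires a newline before the closing fence.
STRICT = re.compile(r"```(?:\w+)?\n(.*?)\n```", re.DOTALL)
# Assumed "after" pattern: the newline before the closing fence is optional.
LENIENT = re.compile(r"```(?:\w+)?\n(.*?)\n?```", re.DOTALL)

inline_close = "```vera\ncode```"
conventional = "```vera\ncode\n```"

print(STRICT.findall(inline_close))   # []
print(LENIENT.findall(inline_close))  # ['code']
print(STRICT.findall(conventional))   # ['code']
print(LENIENT.findall(conventional))  # ['code']
```

The lenient form still captures the same body for conventionally closed blocks, so the change should only widen what is accepted.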

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: d75ec1e4-9724-4227-bafb-f731073ea039

📥 Commits

Reviewing files that changed from the base of the PR and between a6749ee and f847788.

📒 Files selected for processing (6)
  • tests/test_runner.py
  • vera_bench/cli.py
  • vera_bench/metrics.py
  • vera_bench/models.py
  • vera_bench/report.py
  • vera_bench/runner.py

Add prerequisites, step-by-step clone/venv/install, separate Vera
compiler installation, and expanded CLI usage examples.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@README.md`:
- Around line 53-74: Add a new "Results summary table" section to README.md that
explains that running `vera-bench report results/` produces `results/summary.md`
and show the expected per-model columns (Model, check_rate, verify_rate,
fix_rate, run_correct_rate, wall_time_s) with a small example row; place this
section near the usage/CLI examples so it satisfies the README requirement to
document the results summary and reference `summary.md` as the source of the
table.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: d0dd6b9b-5e98-4088-9105-3123a83bed0b

📥 Commits

Reviewing files that changed from the base of the PR and between f847788 and 3c4647c.

📒 Files selected for processing (1)
  • README.md

The harness finds vera via shutil.which(), so it can be installed from
any location. Show both local clone and direct-from-GitHub options.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai Bot left a comment

♻️ Duplicate comments (1)
README.md (1)

57-78: ⚠️ Potential issue | 🟠 Major

Add an explicit results summary table section (still missing).

The README now covers installation and CLI usage well, but it still does not document a concrete results summary table (columns/example) or clearly tie it to results/summary.md output from vera-bench report. This is a required README element.

Suggested patch
 ## Quick start
@@
 # Generate a report from results
 vera-bench report results/

+## Results summary table
+
+Running:
+
+```bash
+vera-bench report results/
+```
+
+writes results/summary.md, including a per-model summary table. Typical columns:
+
+| Model | check_rate | verify_rate | fix_rate | run_correct_rate | wall_time_s |
+|------|------------:|------------:|---------:|-----------------:|------------:|
+| claude-sonnet-4-20250514 | ... | ... | ... | ... | ... |

As per coding guidelines, README.md must document installation, CLI usage, problem structure, metric definitions, results summary table, and citation information.

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@README.md`:
- Around line 57-78: The README is missing a "Results summary table" section;
add a short subsection explaining that running the CLI command vera-bench report
results/ writes results/summary.md and include a concrete example table
(per-model summary) with the typical columns used by the reporter (e.g., Model,
check_rate, verify_rate, fix_rate, run_correct_rate, wall_time_s) and an example
row (e.g., claude-sonnet-4-20250514 | ... | ... | ... | ... | ...), and mention
the file name results/summary.md so readers can correlate the CLI output to the
documented table.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 97734e42-572f-482d-913a-243cb3b85b00

📥 Commits

Reviewing files that changed from the base of the PR and between 3c4647cc9ac7293745ab6c48581a8d7538795dac and 8874109e4508291b398ee2f7b625e8b3206585a7.

📒 Files selected for processing (1)
  • README.md

Bugs fixed:
- Thread --max-tokens through CLI -> run_benchmark -> run_single_problem
  -> client.complete() (was accepted but silently ignored)
- OpenAI: use client.with_options(timeout=) instead of passing timeout
  kwarg to create() (not supported in SDK 1.x)
- OpenAI: defensive null check on choice.message before accessing .content
- Fence regex: allow optional trailing newline before closing backticks
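The defensive null check on choice.message can be sketched without the SDK. `first_choice_text` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that only mimic the attribute shape (`choices[0].message.content`) named in the review comment, not real OpenAI response types.

```python
from types import SimpleNamespace


def first_choice_text(response) -> str:
    """Return the first choice's message content, or "" if any piece is missing."""
    choices = getattr(response, "choices", None) or []
    choice = choices[0] if choices else None
    message = getattr(choice, "message", None)
    content = getattr(message, "content", None)
    return content or ""


# Stand-in responses (illustrative only).
ok = SimpleNamespace(choices=[SimpleNamespace(message=SimpleNamespace(content="hi"))])
no_message = SimpleNamespace(choices=[SimpleNamespace(message=None)])
no_choices = SimpleNamespace(choices=[])

print(repr(first_choice_text(ok)))          # 'hi'
print(repr(first_choice_text(no_message)))  # ''
print(repr(first_choice_text(no_choices)))  # ''
```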

Tests hardened:
- monkeypatch env vars in create_client tests for determinism
- Use tmp_path fixture for JSONL round-trip (cleanup on assertion failure)

README:
- Add Results section documenting summary.md output format

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@aallan aallan merged commit 36265db into main Mar 29, 2026
8 checks passed
@aallan aallan deleted the feature/llm-runner branch March 30, 2026 15:51
aallan added a commit to sunholo-voight-kampff/vera-bench that referenced this pull request May 22, 2026
Two of CR's three outside-diff findings on the latest review:

1. `_ailang_literal(value) -> str` was missing the parameter type
   hint on `value`. One-character fix matching the project's "type
   hints everywhere" rule from CLAUDE.md. The sibling `_aver_literal`
   has the same gap and predates this PR — that's a "do next time we
   touch the Aver path" mental note rather than scope-creep here.

2. Per-test subprocess failures in `_evaluate_aver_code` and
   `_evaluate_ailang_code` silently `continue` without capturing
   stderr — unlike the Python/TypeScript evaluators which record
   stderr into `ProblemResult.error_message`. Filed as aallan#72 with a
   shared-helper refactor proposal that fixes Aver and AILANG
   consistently. Roadmap'd under Milestone 1; not blocking this PR.

The third outside-diff finding (`AILANG_RESULTS.md:74` version pin
inconsistency) becomes moot once the file is removed per ask aallan#3 in
the consolidated review.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>