Implement LLM runner harness (Phase 2) #3
Add the complete benchmark evaluation pipeline:

models.py: LLM API abstraction
- AnthropicClient and OpenAIClient with lazy imports
- Unified LLMResponse dataclass (text, tokens, wall_time)
- Provider detection from model ID prefix (claude-*, gpt-*, o1-*, o3-*)
- API keys from environment, SDK built-in retry for rate limits

runner.py: Pipeline orchestration
- extract_vera_code(): regex-based code extraction from markdown fences
- run_single_problem(): generate -> check -> verify -> run -> fix pipeline
- run_benchmark(): iterate problems with rich progress, JSONL output
- ProblemResult dataclass matching BRIEFING.md JSONL format
- Retry-with-error-feedback (one fix attempt on check failure)
- Temp file management with optional --keep-temps

metrics.py: Result aggregation
- load_results(): parse JSONL files
- compute_metrics(): check_rate, verify_rate, fix_rate, run_correct_rate
- Per-tier breakdowns via problem ID parsing
- Handles multi-attempt results (best-attempt for verify/run, fix_rate)

report.py: Markdown report generation
- Summary table (model x metrics)
- Tier breakdown matrix
- Per-problem detail listing
- Writes summary.md to results directory

cli.py: Wired up run and report commands
- vera-bench run --model MODEL [--tier N] [--problem ID] [--mode MODE]
- vera-bench report RESULTS_DIR
- Problem filtering, SKILL.md loading, output directory management
- Metrics summary printed on completion

tests/test_runner.py: 26 new tests
- Code extraction (plain, fenced, multi-fence, no-fence)
- ProblemResult JSONL serialization
- Provider detection (claude/gpt/unknown)
- Metrics computation with hand-crafted fixtures
- Report generation
- Full pipeline with mock LLMClient and VeraRunner

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
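The provider detection mentioned in the commit message above (model ID prefixes claude-*, gpt-*, o1-*, o3-*) could look roughly like the sketch below. The function name `detect_provider` and the returned provider strings are illustrative assumptions; only the prefixes and the ValueError for unknown models come from this PR's description and tests.

```python
def detect_provider(model_id: str) -> str:
    """Map a model ID to a provider name based on its prefix.

    Hypothetical helper: the name and return values are assumptions,
    not the actual API in vera_bench/models.py.
    """
    if model_id.startswith("claude-"):
        return "anthropic"
    # str.startswith accepts a tuple of prefixes.
    if model_id.startswith(("gpt-", "o1-", "o3-")):
        return "openai"
    raise ValueError(f"Cannot detect provider for model ID {model_id!r}")
```

Keeping the mapping in one small function means a new prefix only has to be added in one place, and the unknown-model case fails loudly before any API key lookup happens.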
📝 Walkthrough
Adds a benchmark harness with LLM client abstractions, a runner that generates and evaluates Vera programs with optional fix attempts, metric computation and Markdown reporting, CLI wiring for run/report, and comprehensive tests covering parsing, serialization, provider selection, metrics, reporting and retry behaviour.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #3 +/- ##
===========================================
+ Coverage 40.14% 59.97% +19.82%
===========================================
Files 5 9 +4
Lines 269 707 +438
===========================================
+ Hits 108 424 +316
- Misses 161 283 +122
Actionable comments posted: 8
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/test_runner.py`:
- Around line 109-124: Tests in TestCreateClient depend on external environment
state; make them deterministic by using pytest's monkeypatch to clear provider
API env vars before calling create_client. Update test_anthropic_prefix,
test_openai_prefix, and test_o1_prefix to accept a monkeypatch fixture and call
monkeypatch.delenv("ANTHROPIC_API_KEY", raising=False) /
monkeypatch.delenv("OPENAI_API_KEY", raising=False) /
monkeypatch.delenv("O1_API_KEY", raising=False) respectively (or the actual env
var names used by create_client), then run the existing with
pytest.raises((ImportError, EnvironmentError)): create_client(...) assertion so
the test no longer relies on CI secrets; keep the
test_unknown_raises_value_error unchanged.
- Around line 206-217: The test_jsonl_round_trip creates a temporary file with
tempfile.NamedTemporaryFile and calls path.unlink() after assertions, but that
cleanup won't run if an assertion fails; update the test to use pytest's
tmp_path fixture or a try/finally so the temp file is always removed.
Specifically, replace the tempfile.NamedTemporaryFile usage in
test_jsonl_round_trip (and the path variable) with a path built from the
tmp_path fixture (tmp_path.joinpath(...) or the tmp_path / "name" operator) to
create and write the .jsonl file, or wrap the current creation and assertions in
try/finally and call path.unlink() in the finally block so cleanup always
occurs.
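A minimal sketch of the try/finally variant from the second comment above, using only the standard library. The helper name `round_trip_jsonl` and the record fields are illustrative assumptions, not the project's actual test code.

```python
import json
import tempfile
from pathlib import Path


def round_trip_jsonl(records: list[dict]) -> list[dict]:
    """Write records to a temporary JSONL file, read them back, always clean up."""
    # delete=False keeps the file on disk after the `with` block so it can be
    # reopened by path; cleanup is then our responsibility.
    with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
        path = Path(f.name)
    try:
        with path.open() as fh:
            return [json.loads(line) for line in fh if line.strip()]
    finally:
        # Runs whether the read succeeds or raises, so the file never leaks.
        path.unlink(missing_ok=True)
```

In a pytest test the same effect is usually simpler to get from the `tmp_path` fixture, since pytest creates and manages the directory itself and no explicit unlink is needed.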
In `@vera_bench/cli.py`:
- Around line 143-144: The code is re-serialising ProblemResult objects via
to_jsonl() and json.loads(), which is wasteful; update compute_metrics (and its
callers like where compute_metrics is invoked before _print_metrics) to accept a
list of ProblemResult objects directly (or alternatively convert each
ProblemResult to a dict with dataclasses.asdict(result) or a dedicated to_dict()
method) and pass results returned from run_benchmark straight into
compute_metrics (replace json.loads(r.to_jsonl()) with either r or asdict(r));
adjust compute_metrics parameter type and internal handling to read fields from
ProblemResult instead of expecting pre-parsed dicts.
- Around line 62-67: The CLI accepts --max-tokens but it isn't forwarded to the
LLM call; update the call chain to thread max_tokens from the click handler into
run(), then into run_benchmark(), then into run_single_problem(), and finally
pass it to client.complete() (or the client's request payload) so the runtime
uses the user-specified value; update the function signatures for run(),
run_benchmark(), and run_single_problem() to accept a max_tokens:int (with
existing defaults preserved) and propagate that parameter when invoking
client.complete().
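The --max-tokens threading described in the comment above can be sketched with a stand-in client. Everything here is simplified: `FakeClient`, the `DEFAULT_MAX_TOKENS` value and the reduced function signatures are assumptions for illustration, and the real functions in vera_bench take more parameters.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_MAX_TOKENS = 4096  # assumed default, not taken from the PR


@dataclass
class FakeClient:
    """Stand-in for an LLM client; records the max_tokens it was called with."""

    seen_max_tokens: Optional[int] = None

    def complete(self, prompt: str, max_tokens: int = DEFAULT_MAX_TOKENS) -> str:
        self.seen_max_tokens = max_tokens
        return "ok"


def run_single_problem(
    client: FakeClient, problem_id: str, max_tokens: int = DEFAULT_MAX_TOKENS
) -> str:
    # Forward the caller's limit instead of silently using the client default.
    return client.complete(f"Solve {problem_id}", max_tokens=max_tokens)


def run_benchmark(
    client: FakeClient, problem_ids: list[str], max_tokens: int = DEFAULT_MAX_TOKENS
) -> list[str]:
    # Each layer accepts max_tokens and passes it down unchanged.
    return [run_single_problem(client, pid, max_tokens=max_tokens) for pid in problem_ids]


client = FakeClient()
outputs = run_benchmark(client, ["VB-T1-001"], max_tokens=512)
```

Recording the value on the fake client gives a test a direct way to assert that the CLI option reaches the final call rather than being dropped along the way.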
In `@vera_bench/metrics.py`:
- Around line 68-108: The logic that tallies check/verify/fix/run counts is
duplicated between compute_metrics and _compute_by_tier; extract it into a new
helper _compute_counts(by_problem: dict[str, list[dict]]) that returns the tuple
(check_pass_count, verify_pass_count, verify_eligible, fix_success,
fix_eligible, run_correct_count, run_eligible, total) using the exact selection
logic (attempt_1, attempt_2, best) shown in the diff, then replace the local
counting blocks in compute_metrics and _compute_by_tier to call _compute_counts
and map the returned values into their BenchmarkMetrics constructions (update
the arguments to _rate calls accordingly) so both functions reuse the single
implementation and remain consistent.
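One possible shape for the shared counting helper suggested above. The selection rules are assumptions, because the diff the comment refers to is not reproduced here: attempt 1 decides check_rate, a second attempt counts toward fix_rate only when attempt 1 failed the check, and the latest attempt is treated as "best" for verify and run. The field names (`attempt`, `check_pass`, `verify_pass`, `run_correct`) are likewise hypothetical.

```python
from typing import NamedTuple


class Counts(NamedTuple):
    check_pass: int
    verify_pass: int
    verify_eligible: int
    fix_success: int
    fix_eligible: int
    run_correct: int
    run_eligible: int
    total: int


def compute_counts(by_problem: dict[str, list[dict]]) -> Counts:
    """Tally counts once so overall and per-tier metrics share the same rules.

    Assumes every problem has at least one recorded attempt.
    """
    check = verify = verify_el = fix = fix_el = run = run_el = 0
    for attempts in by_problem.values():
        ordered = sorted(attempts, key=lambda a: a["attempt"])
        attempt_1 = ordered[0]
        attempt_2 = ordered[1] if len(ordered) > 1 else None
        # Assumption: the latest attempt is the "best" one.
        best = attempt_2 if attempt_2 is not None else attempt_1
        if attempt_1["check_pass"]:
            check += 1
        elif attempt_2 is not None:
            # A fix attempt only exists after a failed first check.
            fix_el += 1
            fix += int(attempt_2["check_pass"])
        if best["check_pass"]:
            verify_el += 1
            verify += int(best.get("verify_pass", False))
            run_el += 1
            run += int(best.get("run_correct", False))
    return Counts(check, verify, verify_el, fix, fix_el, run, run_el, len(by_problem))


def rate(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when nothing was eligible."""
    return numerator / denominator if denominator else 0.0
```

A per-tier breakdown can then call the same helper on a filtered dictionary, for example only the entries whose problem ID contains a given tier marker, so the two code paths cannot drift apart.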
In `@vera_bench/models.py`:
- Around line 138-146: The code assumes choice.message is non-null when
computing text (choice.message.content), which can raise AttributeError; modify
the extraction to defensively check that choice and choice.message exist before
accessing .content (e.g., set text = choice.message.content if choice and
choice.message and choice.message.content else ""), update the logic around
response.choices and the LLMResponse construction (references: response.choices,
choice, choice.message, LLMResponse) so text falls back to an empty string when
message is None while preserving the existing usage and model fields.
- Around line 114-147: The complete method is passing timeout=timeout into
self._client.chat.completions.create which the OpenAI SDK 1.x does not accept;
remove the timeout kwarg from that call and instead either instantiate the
client with a timeout or call
self._client.with_options(timeout=timeout).chat.completions.create(...); update
the call site in complete (and any similar calls) to use
client.with_options(timeout=timeout).chat.completions.create(...) or ensure the
client was created with OpenAI(timeout=...) so you avoid the TypeError at
runtime.
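The defensive extraction suggested for `choice.message` can be sketched without the OpenAI SDK by using stand-in objects. `extract_text` is a hypothetical helper name, and `SimpleNamespace` only imitates the attribute layout (`response.choices[0].message.content`) named in the comment.

```python
from types import SimpleNamespace


def extract_text(response) -> str:
    """Return the first choice's message content, or "" if any piece is missing."""
    choices = getattr(response, "choices", None) or []
    choice = choices[0] if choices else None
    # getattr on None with a default simply returns the default.
    message = getattr(choice, "message", None)
    return getattr(message, "content", None) or ""


full = SimpleNamespace(choices=[SimpleNamespace(message=SimpleNamespace(content="hi"))])
no_message = SimpleNamespace(choices=[SimpleNamespace(message=None)])
no_choices = SimpleNamespace(choices=[])
```

Falling back to an empty string keeps the rest of the pipeline uniform: an empty completion flows into code extraction and fails the check step like any other bad output, instead of raising AttributeError in the client.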
In `@vera_bench/runner.py`:
- Around line 27-41: The regex _FENCE_RE used by extract_vera_code requires a
newline before the closing backticks so blocks like ```vera\ncode``` are missed;
update _FENCE_RE to allow an optional newline before the closing backticks (e.g.
make the pattern use \n? before ```), keep re.DOTALL, then ensure
extract_vera_code continues to pick the longest match and returns the stripped
code plus a terminating newline.
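A sketch of a fence pattern with the optional newline before the closing backticks. The exact pattern is an assumption, since the original `_FENCE_RE` is not shown in this conversation; the fall-back to the raw text when no fence is found is also assumed. The fence string is built with `"`" * 3` only so that this example can sit inside a fenced block itself.

```python
import re

FENCE = "`" * 3  # three backticks

# Assumed pattern: optional "vera" tag, a newline, a lazily matched body,
# then an optional newline before the closing fence.
_FENCE_RE = re.compile(FENCE + r"(?:vera)?[ \t]*\n(.*?)\n?" + FENCE, re.DOTALL)


def extract_vera_code(text: str) -> str:
    """Return the longest fenced block, or the whole text if no fence is found."""
    matches = _FENCE_RE.findall(text)
    code = max(matches, key=len) if matches else text
    # Normalise to stripped code plus exactly one terminating newline.
    return code.strip() + "\n"
```

Because the body is matched lazily and the newline before the closing fence is optional, a block whose last line runs directly into the closing backticks is captured in the same way as a conventionally formatted one.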
Add prerequisites, step-by-step clone/venv/install, separate Vera compiler installation, and expanded CLI usage examples. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@README.md`:
- Around line 53-74: Add a new "Results summary table" section to README.md that
explains that running `vera-bench report results/` produces `results/summary.md`
and show the expected per-model columns (Model, check_rate, verify_rate,
fix_rate, run_correct_rate, wall_time_s) with a small example row; place this
section near the usage/CLI examples so it satisfies the README requirement to
document the results summary and reference `summary.md` as the source of the
table.
The harness finds vera via shutil.which(), so it can be installed from any location. Show both local clone and direct-from-GitHub options. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
♻️ Duplicate comments (1)
README.md (1)
57-78: ⚠️ Potential issue | 🟠 Major
Add an explicit results summary table section (still missing).
The README now covers installation and CLI usage well, but it still does not document a concrete results summary table (columns/example) or clearly tie it to the `results/summary.md` output from `vera-bench report`. This is a required README element.
Suggested patch
 ## Quick start
@@
 # Generate a report from results
 vera-bench report results/
+## Results summary table
+
+Running:
+
+    vera-bench report results/
+
+writes `results/summary.md`, including a per-model summary table. Typical columns:
+
+| Model | check_rate | verify_rate | fix_rate | run_correct_rate | wall_time_s |
+|------|------------:|------------:|---------:|-----------------:|------------:|
+| claude-sonnet-4-20250514 | ... | ... | ... | ... | ... |
As per coding guidelines, `README.md` must document installation, CLI usage, problem structure, metric definitions, results summary table, and citation information.
🤖 Prompt for AI agents
Verify each finding against the current code and only fix it if needed.
In `@README.md`:
- Around lines 57-78: The README is missing a "Results summary table" section;
add a short subsection explaining that running the CLI command vera-bench report
results/ writes results/summary.md and include a concrete example table
(per-model summary) with the typical columns used by the reporter (e.g., Model,
check_rate, verify_rate, fix_rate, run_correct_rate, wall_time_s) and an example
row (e.g., claude-sonnet-4-20250514 | ... | ... | ... | ... | ...), and mention
the file name results/summary.md so readers can correlate the CLI output to the
documented table.
Commits: reviewing files that changed from the base of the PR and between 3c4647cc9ac7293745ab6c48581a8d7538795dac and 8874109e4508291b398ee2f7b625e8b3206585a7.
Bugs fixed:
- Thread --max-tokens through CLI -> run_benchmark -> run_single_problem -> client.complete() (was accepted but silently ignored)
- OpenAI: use client.with_options(timeout=) instead of passing timeout kwarg to create() (not supported in SDK 1.x)
- OpenAI: defensive null check on choice.message before accessing .content
- Fence regex: allow optional trailing newline before closing backticks

Tests hardened:
- monkeypatch env vars in create_client tests for determinism
- Use tmp_path fixture for JSONL round-trip (cleanup on assertion failure)

README:
- Add Results section documenting summary.md output format

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two of CR's three outside-diff findings on the latest review:

1. `_ailang_literal(value) -> str` was missing the parameter type hint on `value`. One-character fix matching the project's "type hints everywhere" rule from CLAUDE.md. The sibling `_aver_literal` has the same gap and predates this PR; that's a "do next time we touch the Aver path" mental note rather than scope-creep here.
2. Per-test subprocess failures in `_evaluate_aver_code` and `_evaluate_ailang_code` silently `continue` without capturing stderr, unlike the Python/TypeScript evaluators, which record stderr into `ProblemResult.error_message`. Filed as aallan#72 with a shared-helper refactor proposal that fixes Aver and AILANG consistently. Roadmap'd under Milestone 1; not blocking this PR.

The third outside-diff finding (`AILANG_RESULTS.md:74` version pin inconsistency) becomes moot once the file is removed per ask aallan#3 in the consolidated review.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Implements the complete benchmark evaluation pipeline: `vera-bench run --model claude-sonnet-4-20250514` now works end-to-end.
- `run` and `report` commands fully wired up
New CLI usage
Key design decisions
Test plan
- `ruff check . && ruff format --check .` clean
- `ruff check --select S vera_bench/` security lint clean
- `vera-bench run --model claude-sonnet-4-20250514 --problem VB-T1-001` (requires API key)
🤖 Generated with Claude Code