Update README with v0.0.7 multi-model benchmark results by aallan · Pull Request #43 · aallan/vera-bench

aallan · 2026-04-08T15:58:32Z

Summary

- Add benchmark chart (assets/benchmark_v0.0.7.png) generated by new scripts/plot_results.py
- Replace single-model v0.0.4 results in README with 6-model v0.0.7 results across 3 providers
- Key finding: Kimi K2.5 achieves 100% run_correct on Vera, beating Python (86%) and TypeScript (91%)
- References the earlier v0.0.4 Sonnet result and discusses run-to-run variance

Test plan

- Chart renders correctly with site colour palette
- README displays chart and tables properly

🤖 Generated with Claude Code

Summary by CodeRabbit

Documentation
- Replaced the prior "Initial Results" with a new v0.0.7 "Results" section covering 50 problems and six models, added a benchmark image, introduced tiered per‑model run tables, revised key findings and determinism note, raised minimum Vera version to v0.0.108+, and moved results into a new "Report generation" section describing report output and JSONL resumability.
New Features
- Added a CLI script to generate and save the composite v0.0.7 benchmark visualisation used in the README.

Add benchmark chart (plot_results.py) and update README with results from 6 models across 3 providers: Claude Opus 4/Sonnet 4, GPT-4.1/4o, Kimi K2.5/K2 Turbo. Key finding: Kimi K2.5 achieves 100% run_correct on Vera, beating both Python (86%) and TypeScript (91%).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-04-08T15:58:46Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

📝 Walkthrough

Replaces the README "Initial Results" narrative with a new "Report generation" / Results section for VeraBench v0.0.7 vs Vera v0.0.108 (50 problems, 6 models, multiple modes) and adds a new plotting CLI script scripts/plot_results.py that generates assets/benchmark_v0.0.7.png.

Changes

Cohort / File(s)	Summary
Documentation `README.md`	Replaced the prior v0.0.4 "Initial Results" content with a Results section for VeraBench v0.0.7 vs Vera v0.0.108 (50 problems, 6 models, multiple modes). Added results image, per-model `run_correct` tables grouped into Flagship and Sonnet tiers, rewrote "Key findings", updated minimum Vera version to v0.0.108+, and moved results under a new "Report generation" section describing `vera-bench report` output and JSONL resumability.
Plotting Script & Asset `scripts/plot_results.py`, `assets/benchmark_v0.0.7.png`	Added a new CLI plotting script with `main()` and helper functions (`_style_ax`, `plot_tier`, `plot_vera_vs_both`, `plot_all_modes`). Embeds benchmark data for Flagship and Sonnet tiers, composes a multi-panel figure plus footer, saves `assets/benchmark_v0.0.7.png`.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Add Initial Results section to README #23 — Earlier README "Initial Results" content that this change replaces and expands.

Suggested labels

docs, ci

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
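
A 60.00% figure is what you would get if, for example, three of five functions carried docstrings. The sketch below shows one way to compute such a ratio with the standard `ast` module. It assumes that only function definitions are counted; the checker used on this PR may also count modules and classes, and the function names in `SAMPLE` are hypothetical.

```python
import ast


def docstring_coverage(source: str) -> float:
    """Return the percentage of function definitions that have a docstring."""
    tree = ast.parse(source)
    funcs = [
        node
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    if not funcs:
        return 100.0  # nothing to document
    documented = sum(1 for func in funcs if ast.get_docstring(func) is not None)
    return 100.0 * documented / len(funcs)


# Hypothetical module: five functions, three of them documented.
SAMPLE = '''
def alpha():
    """Documented."""

def beta():
    """Documented."""

def gamma():
    """Documented."""

def delta():
    pass

def epsilon():
    pass
'''

if __name__ == "__main__":
    print(f"{docstring_coverage(SAMPLE):.2f}%")  # prints 60.00%
```

Under this counting rule, documenting one more of the five functions would reach exactly 80%.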

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly describes the main change: updating the README with new v0.0.7 multi-model benchmark results, which aligns with the primary modifications to README.md and the addition of the plotting script.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

README.md (1)
152-154: 🧹 Nitpick | 🔵 Trivial

Consider renaming this section to avoid duplicate heading.

There are now two ## Results headings in the README (line 10 and line 152). This could cause issues with table-of-contents generation and anchor links. Consider renaming this section to better distinguish it from the benchmark results, e.g., "Report Generation" or "Output Files".
📝 Suggested rename
-## Results
+## Report Generation
 
 Running `vera-bench report results/` generates `results/summary.md`...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` around lines 152 - 154, The README contains a duplicate "##
Results" heading; rename the lower "## Results" heading (the paragraph starting
"Running `vera-bench report results/`...") to a more specific title such as "##
Report Generation" or "## Output Files" and update any internal links/TOC
anchors that point to "Results" (e.g., markdown links like [Results](`#results`)
or autogenerated TOC entries) so they reference the new heading text/anchor;
ensure the section heading text you change matches the new anchor format
(kebab-case) so links work.
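
To illustrate the kind of check this comment describes, here is a minimal sketch that lists Markdown headings whose anchors would collide. The `slugify` rule is only an approximation of how GitHub derives anchors, and both function names are hypothetical helpers rather than anything in this repository.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*#*\s*$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def slugify(text: str) -> str:
    """Approximate a kebab-case anchor: lowercase, drop punctuation, hyphenate spaces."""
    text = re.sub(r"[^\w\- ]", "", text.strip().lower())
    return text.replace(" ", "-")


def duplicate_headings(markdown: str) -> dict[str, list[int]]:
    """Map each colliding anchor to the 1-based line numbers of its headings."""
    seen: dict[str, list[int]] = {}
    in_fence = False
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # skip '#' lines inside code blocks
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match:
            seen.setdefault(slugify(match.group(2)), []).append(lineno)
    return {slug: lines for slug, lines in seen.items() if len(lines) > 1}


if __name__ == "__main__":
    doc = "# Title\n\n## Results\n\ntext\n\n## Install\n\n## Results\n"
    print(duplicate_headings(doc))  # {'results': [3, 9]}
```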

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: f7e8e6c4-ef56-4ed9-bcc4-92bb09feeb67

📥 Commits

Reviewing files that changed from the base of the PR and between 63c1f5d and f802413.

⛔ Files ignored due to path filters (1)

assets/benchmark_v0.0.7.png is excluded by !**/*.png

📒 Files selected for processing (2)

README.md
scripts/plot_results.py

codecov · 2026-04-08T16:03:44Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.94%. Comparing base (63c1f5d) to head (8b8213e).
⚠️ Report is 8 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main      #43   +/-   ##
=======================================
  Coverage   84.94%   84.94%           
=======================================
  Files          10       10           
  Lines        1116     1116           
=======================================
  Hits          948      948           
  Misses        168      168

Flag	Coverage Δ
python	`84.94% <ø> (ø)`
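
The numbers in the table can be checked by hand: 948 hits plus 168 misses gives 1116 lines, and 948 / 1116 is about 84.946%. The sketch below reproduces that arithmetic. Reading the displayed 84.94% as a truncation to two decimals is an observation about these figures, not a statement about how Codecov formats its output.

```python
import math


def coverage_percent(hits: int, misses: int) -> float:
    """Line coverage as a percentage of all coverable lines."""
    return 100.0 * hits / (hits + misses)


if __name__ == "__main__":
    hits, misses = 948, 168  # values from the coverage diff above
    pct = coverage_percent(hits, misses)
    print(hits + misses)                # 1116
    print(math.floor(pct * 100) / 100)  # 84.94 (truncated to two decimals)
    print(round(pct, 2))                # 84.95 (rounded to two decimals)
```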

Explains the key context: Vera has no training data at all, models learn it entirely from SKILL.md in context, yet multiple models write better Vera than TypeScript. Language design matters more than training data volume.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

README.md (1)
112-112: 🧹 Nitpick | 🔵 Trivial

Consider documenting the recommended Vera version for reproducing v0.0.7 results.

Line 112 states "v0.0.104 or later" as the minimum requirement, whilst line 14 indicates the v0.0.7 benchmark was run against Vera v0.0.108. For exact reproducibility, consider clarifying that v0.0.108 is recommended.
📌 Optional clarification
-this should return v0.0.104 or later.
+this should return v0.0.104 or later. For reproducing the v0.0.7 benchmark results, use v0.0.108 or later.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` at line 112, Update the README phrasing so the recommended Vera
version for reproducing the v0.0.7 benchmark is explicit: locate the string
"v0.0.104 or later" and change it to state that v0.0.104 is the minimum but
v0.0.108 is recommended (or simply recommend "v0.0.108 or later") to match the
benchmark note on line 14; ensure the sentence references both the minimum and
the recommended versions so readers know which exact version to use for exact
reproducibility.
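
For readers who want to check an installed version against the minimum discussed here, a small sketch follows. It assumes version strings have the `vMAJOR.MINOR.PATCH` shape quoted in this thread; `parse_version` and `meets_minimum` are hypothetical helpers, not part of vera or vera-bench.

```python
import re


def parse_version(tag: str) -> tuple[int, ...]:
    """Turn a tag such as 'v0.0.108' into a tuple of integers: (0, 0, 108)."""
    match = re.fullmatch(r"v?(\d+(?:\.\d+)*)", tag.strip())
    if match is None:
        raise ValueError(f"unrecognised version string: {tag!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(installed: str, minimum: str = "v0.0.108") -> bool:
    """Compare numerically, so that v0.0.108 sorts after v0.0.99."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    print(meets_minimum("v0.0.104"))  # False
    print(meets_minimum("v0.0.108"))  # True
```

Comparing integer tuples rather than raw strings avoids the lexicographic trap where "v0.0.99" would sort after "v0.0.108".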

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@README.md`:
- Around line 16-42: The tables under the "run_correct by model (Vera vs Python
vs TypeScript)" heading are ambiguous about which evaluation mode(s) they
represent; update the README by adding a concise clarifying note directly
beneath that heading (or immediately before the tables) stating whether the
percentages are from a single mode (e.g., full-spec), averaged across all 4
modes, best-of-modes, or a different aggregation, and if relevant, add a
parenthetical indicating where readers can find per-mode breakdowns (e.g., refer
to a specific section or link). Ensure the note references the evaluation modes
by name (e.g., "full-spec", "spec-from-NL", etc.) so readers can unambiguously
interpret the table values.
- Line 14: Update the README metadata line that currently references "VeraBench
v0.0.7" and "Vera v0.0.108" to also include the SKILL.md version used (commit
SHA or tag), per-model release dates, and the LLM API versions; alternatively
create a small "Reproducibility / Metadata" section enumerating SKILL.md:
<commit-or-tag>, models: e.g., "Claude Opus 4: 2025-02-14", "gpt-4o-mini:
2025-03-01", and APIs: e.g., "OpenAI API v1", "Anthropic API v2025-02", ensuring
each model entry references its release date and API version so results tied to
VeraBench v0.0.7 and Vera v0.0.108 are fully reproducible.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7976bdbc-c85e-4b5e-81f0-a3d973a61f18

📥 Commits

Reviewing files that changed from the base of the PR and between f802413 and 1a8d8af.

📒 Files selected for processing (1)

README.md

- Rename second '## Results' to '## Report generation'
- Clarify run_correct table shows Vera full-spec mode
- Update recommended vera version from v0.0.104 to v0.0.108

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

aallan · 2026-04-08T16:11:13Z

Addressing CodeRabbit outside-diff comments:

Duplicate ## Results heading (line 152-154): Fixed in a720292 — renamed to ## Report generation.

Vera version v0.0.104 (line 112): Fixed in a720292 — updated to v0.0.108.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai

♻️ Duplicate comments (1)

README.md (1)
16-32: 🧹 Nitpick | 🔵 Trivial

Mode clarification improved, but Python/TypeScript context could be clearer.

The heading now explicitly states "(Vera full-spec vs Python vs TypeScript)", which clarifies that the Vera column shows full-spec mode results. However, it's not explicit whether Python and TypeScript have multiple evaluation modes or represent a single standard run. Consider adding a brief note (e.g., "Python and TypeScript are evaluated in a single standard mode for each problem") to eliminate remaining ambiguity.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` around lines 16 - 32, The table heading "run_correct by model
(Vera full-spec vs Python vs TypeScript)" is clearer but doesn't state whether
Python and TypeScript used multiple evaluation modes; add a short clarifying
sentence or footnote immediately under that heading — e.g., "Python and
TypeScript results reflect a single standard evaluation mode per problem" — to
explicitly state that Python/TypeScript are single-mode runs; update the README
heading area and ensure the sentence mentions "Python" and "TypeScript" by name
so readers know which columns it applies to.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: dd694d4f-be01-4144-8d11-224094a5d794

📥 Commits

Reviewing files that changed from the base of the PR and between 1a8d8af and a720292.

📒 Files selected for processing (1)

README.md

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/plot_results.py`:
- Around line 189-265: The save step in main() can fail if the assets directory
doesn't exist; before calling fig.savefig(out, ...), ensure the output directory
exists by creating os.path.dirname(out) with os.makedirs(..., exist_ok=True)
(and add an import os at top if not present); place this directory-creation just
before fig.savefig to guarantee the path for the out variable
("assets/benchmark_v0.0.7.png") exists.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a7fc6e7b-f577-4a48-b528-eb493d8c7c5d

📥 Commits

Reviewing files that changed from the base of the PR and between a720292 and c76dc8f.

⛔ Files ignored due to path filters (1)

assets/benchmark_v0.0.7.png is excluded by !**/*.png

📒 Files selected for processing (1)

scripts/plot_results.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

aallan · 2026-04-08T16:26:23Z

Re: fig.savefig failing if assets/ doesn't exist: Fixed in c89dfde — added Path(out).parent.mkdir(parents=True, exist_ok=True) before savefig.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/plot_results.py`:
- Around line 8-10: The script imports matplotlib.pyplot before setting a
backend, which can fail in headless CI; call matplotlib.use("Agg") immediately
after importing matplotlib (before importing matplotlib.pyplot) so the backend
is pinned; update the top of scripts/plot_results.py to import matplotlib, call
matplotlib.use("Agg"), then import matplotlib.pyplot as plt and numpy as np to
ensure deterministic image generation.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 62ded443-6afe-4b19-b06e-c497494865d2

📥 Commits

Reviewing files that changed from the base of the PR and between c76dc8f and c89dfde.

📒 Files selected for processing (1)

scripts/plot_results.py

aallan · 2026-04-08T16:27:11Z

@coderabbitai Re: fig.savefig failing if assets/ doesn't exist — fixed in c89dfde. Added Path(out).parent.mkdir(parents=True, exist_ok=True) before savefig.

coderabbitai · 2026-04-08T16:27:27Z

@aallan Thanks for confirming — Path(out).parent.mkdir(parents=True, exist_ok=True) is exactly the right call here. Using parents=True, exist_ok=True ensures the full path is created idempotently whether assets/ is missing or the script is run from a fresh clone.
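
As a standalone illustration of the pattern discussed here, the sketch below creates the parent directory before writing a file and shows that a second call is harmless. `prepare_output_path` is a hypothetical helper, and the `write_bytes` call merely stands in for `fig.savefig`; the actual code in scripts/plot_results.py may be arranged differently.

```python
import tempfile
from pathlib import Path


def prepare_output_path(out: str) -> Path:
    """Ensure the parent directory of `out` exists and return it as a Path."""
    path = Path(out)
    # parents=True creates missing intermediate directories;
    # exist_ok=True makes the call a no-op when the directory is already there.
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = prepare_output_path(f"{tmp}/assets/benchmark_v0.0.7.png")
        target.write_bytes(b"")           # stand-in for fig.savefig(target)
        prepare_output_path(str(target))  # second call does not raise
        print(target.parent.is_dir())     # True
```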

🧠 Learnings used

Learnt from: aallan
Repo: aallan/vera-bench PR: 43
File: README.md:14-14
Timestamp: 2026-04-08T16:11:17.628Z
Learning: In aallan/vera-bench, model identifiers (e.g., `claude-sonnet-4-20250514`, `gpt-4.1-2025-04-14`) are considered sufficient for reproducibility. Do not flag missing per-model release dates or LLM API versions in README.md — the identifiers already encode this information precisely.

Learnt from: aallan
Repo: aallan/vera-bench PR: 43
File: README.md:14-14
Timestamp: 2026-04-08T16:11:17.628Z
Learning: In aallan/vera-bench, the SKILL.md version used during evaluation is recorded as `skill_md_hash` in every JSONL result file (results/**/*.jsonl). There is no need to add a SKILL.md version reference to README.md — the JSONL files are the canonical reproducibility record.
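
A rough sketch of how the `skill_md_hash` values could be collected from result files follows. It assumes each line of a `.jsonl` file is a single JSON object, which is the usual JSONL convention but is not confirmed for this repository; `skill_md_hashes` is a hypothetical helper.

```python
import json
from collections import Counter
from pathlib import Path


def skill_md_hashes(results_dir: str) -> Counter:
    """Count records per skill_md_hash across every .jsonl file under results_dir."""
    counts: Counter = Counter()
    for path in sorted(Path(results_dir).glob("**/*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines
                record = json.loads(line)
                counts[record.get("skill_md_hash", "<missing>")] += 1
    return counts


if __name__ == "__main__":
    # More than one key in the output would mean the runs used different SKILL.md versions.
    print(skill_md_hashes("results"))
```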

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

aallan · 2026-04-08T16:41:46Z

@coderabbitai Re: matplotlib backend for headless CI — fixed in 8b8213e. Added matplotlib.use("Agg") immediately after importing matplotlib, before importing pyplot.

coderabbitai · 2026-04-08T16:42:06Z

@aallan: Perfect fix — matplotlib.use() must be called early in script execution, and crucially, before any import of pylab or matplotlib.pyplot. The Agg backend is a non-GUI backend used primarily for generating image files (like PNGs) in environments without a display server — exactly what scripts/plot_results.py needs. The placement in commit 8b8213e is exactly right.
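
A minimal sketch of the import ordering described above, assuming matplotlib is installed. The bar values are the Kimi K2.5 `run_correct` figures quoted in the PR summary; `render_chart` is a hypothetical function and not the layout of scripts/plot_results.py.

```python
import tempfile
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # pin the non-GUI backend before pyplot is imported

import matplotlib.pyplot as plt  # noqa: E402


def render_chart(out: Path) -> Path:
    """Draw a small bar chart and save it as a PNG without needing a display."""
    fig, ax = plt.subplots(figsize=(4, 3))
    ax.bar(["Vera", "Python", "TypeScript"], [100, 86, 91])
    ax.set_ylabel("run_correct (%)")
    out.parent.mkdir(parents=True, exist_ok=True)
    fig.savefig(out, dpi=100)
    plt.close(fig)  # release the figure's memory
    return out


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = render_chart(Path(tmp) / "assets" / "chart.png")
        print(path.exists())
```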

🧠 Learnings used

Learnt from: CR
Repo: aallan/vera-bench PR: 0
File: BRIEFING.md:0-0
Timestamp: 2026-03-29T19:42:24.529Z
Learning: Pin SKILL.md versions in context/ directory for reproducibility; document vera version compatibility in results metadata; track model release dates and LLM API versions

Learnt from: aallan
Repo: aallan/vera-bench PR: 36
File: tests/test_models.py:84-89
Timestamp: 2026-03-31T14:05:18.280Z
Learning: In `tests/test_models.py` in the `vera-bench` repository, the Anthropic and OpenAI client tests deliberately patch `vera_bench.models.anthropic` / `vera_bench.models.openai` (module-level references) rather than the constructors directly. This is an intentional choice: the `[llm]` extras (`pip install -e .[llm]`) are always installed in CI, so the SDK modules are guaranteed present in `sys.modules`. Do not flag this patch-target approach as incorrect during code review.

aallan and others added 2 commits April 8, 2026 16:57

Move benchmark chart to assets/ to avoid accidental deletion

f802413

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

aallan mentioned this pull request Apr 8, 2026

Update VeraBench results to v0.0.7 (6 models, 3 providers) aallan/vera#448

Merged

2 tasks

Fix ruff lint: import order and line length in plot_results.py

c76dc8f

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Ensure assets/ dir exists before saving chart, fix ruff format

c89dfde

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Use Agg backend for headless CI compatibility

8b8213e

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

aallan merged commit d4006f3 into main Apr 8, 2026
10 checks passed

aallan deleted the feature/v007-results branch April 8, 2026 16:42

coderabbitai Bot mentioned this pull request Apr 9, 2026

Update ROADMAP to reflect v0.0.7 milestones and provider registry refactor #46

Merged

2 tasks

coderabbitai Bot mentioned this pull request Apr 17, 2026

docs: document all scripts in scripts/README.md; make plot_results data-driven #59

Merged

5 tasks
