Update README with v0.0.7 multi-model benchmark results #43

Merged: aallan merged 7 commits into main from feature/v007-results on Apr 8, 2026

Conversation

@aallan

@aallan aallan commented Apr 8, 2026

Owner

Summary

  • Add benchmark chart (assets/benchmark_v0.0.7.png) generated by new scripts/plot_results.py
  • Replace single-model v0.0.4 results in README with 6-model v0.0.7 results across 3 providers
  • Key finding: Kimi K2.5 achieves 100% run_correct on Vera, beating Python (86%) and TypeScript (91%)
  • References the earlier v0.0.4 Sonnet result and discusses run-to-run variance

Test plan

  • Chart renders correctly with site colour palette
  • README displays chart and tables properly

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Documentation
    • Replaced the prior "Initial Results" with a new v0.0.7 "Results" section covering 50 problems and six models, added a benchmark image, introduced tiered per‑model run tables, revised key findings and determinism note, raised minimum Vera version to v0.0.108+, and moved results into a new "Report generation" section describing report output and JSONL resumability.
  • New Features
    • Added a CLI script to generate and save the composite v0.0.7 benchmark visualisation used in the README.

aallan and others added 2 commits April 8, 2026 16:57
Add benchmark chart (plot_results.py) and update README with results
from 6 models across 3 providers: Claude Opus 4/Sonnet 4, GPT-4.1/4o,
Kimi K2.5/K2 Turbo. Key finding: Kimi K2.5 achieves 100% run_correct
on Vera, beating both Python (86%) and TypeScript (91%).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Apr 8, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Replaces the README "Initial Results" narrative with a new "Report generation" / Results section for VeraBench v0.0.7 vs Vera v0.0.108 (50 problems, 6 models, multiple modes) and adds a new plotting CLI script scripts/plot_results.py that generates assets/benchmark_v0.0.7.png.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Documentation: README.md | Replaced the prior v0.0.4 "Initial Results" content with a Results section for VeraBench v0.0.7 vs Vera v0.0.108 (50 problems, 6 models, multiple modes). Added results image, per-model run_correct tables grouped into Flagship and Sonnet tiers, rewrote "Key findings", updated minimum Vera version to v0.0.108+, and moved results under a new "Report generation" section describing vera-bench report output and JSONL resumability. |
| Plotting Script & Asset: scripts/plot_results.py, assets/benchmark_v0.0.7.png | Added a new CLI plotting script with main() and helper functions (_style_ax, plot_tier, plot_vera_vs_both, plot_all_modes). Embeds benchmark data for Flagship and Sonnet tiers, composes a multi-panel figure plus footer, saves assets/benchmark_v0.0.7.png. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

docs, ci

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly describes the main change: updating the README with new v0.0.7 multi-model benchmark results, which aligns with the primary modifications to README.md and the addition of the plotting script. |
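
For readers wondering how a docstring-coverage figure like the 60.00% above can arise, here is a minimal sketch using Python's standard `ast` module. It is an illustration only: the counting rule (module plus every function and class) is an assumption and not necessarily the rule CodeRabbit applies, and the `SAMPLE` source and its function names are made up rather than taken from scripts/plot_results.py.

```python
import ast


def docstring_coverage(source: str) -> float:
    """Return the percentage of documentable nodes in `source` that have a docstring."""
    tree = ast.parse(source)
    # Assumed counting rule: the module itself plus every function and class definition.
    nodes = [tree] + [
        node
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
    documented = sum(1 for node in nodes if ast.get_docstring(node) is not None)
    return 100.0 * documented / len(nodes)


# Hypothetical module: 3 of its 5 documentable nodes carry a docstring.
SAMPLE = '''"""Module docstring."""

def documented_one():
    """Has a docstring."""

def undocumented_one(x):
    return x

def documented_two():
    """Has a docstring."""

def undocumented_two(x):
    return x
'''

if __name__ == "__main__":
    print(f"{docstring_coverage(SAMPLE):.2f}%")
```

Under this counting rule, adding a docstring to one of the two undocumented functions would lift the sample to 4 of 5 nodes, which meets an 80% threshold.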


@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
README.md (1)

152-154: 🧹 Nitpick | 🔵 Trivial

Consider renaming this section to avoid duplicate heading.

There are now two ## Results headings in the README (line 10 and line 152). This could cause issues with table-of-contents generation and anchor links. Consider renaming this section to better distinguish it from the benchmark results, e.g., "Report Generation" or "Output Files".

📝 Suggested rename
-## Results
+## Report Generation
 
 Running `vera-bench report results/` generates `results/summary.md`...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` around lines 152 - 154, The README contains a duplicate "##
Results" heading; rename the lower "## Results" heading (the paragraph starting
"Running `vera-bench report results/`...") to a more specific title such as "##
Report Generation" or "## Output Files" and update any internal links/TOC
anchors that point to "Results" (e.g., markdown links like [Results](`#results`)
or autogenerated TOC entries) so they reference the new heading text/anchor;
ensure the section heading text you change matches the new anchor format
(kebab-case) so links work.
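
A duplicate heading of the kind flagged here can be caught mechanically. The sketch below is a hypothetical helper, not part of vera-bench or CodeRabbit: it scans Markdown text for ATX headings (`#` to `######`), skips fenced code blocks, and reports heading texts that occur more than once. The `README` string is a made-up stand-in for the real file.

```python
import re
from collections import Counter

# ATX heading: 1-6 '#' characters, whitespace, then the heading text.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*?)\s*#*\s*$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def duplicate_headings(markdown: str) -> dict[str, int]:
    """Return heading texts (compared case-insensitively) that appear more than once."""
    counts = Counter()
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # ignore '#' lines inside code fences
            continue
        if in_fence:
            continue
        match = HEADING_RE.match(line)
        if match:
            counts[match.group(2).strip().lower()] += 1
    return {text: n for text, n in counts.items() if n > 1}


# Hypothetical README excerpt with the same heading used twice.
README = """# vera-bench

## Results

Benchmark tables...

## Usage

## Results

Running `vera-bench report results/` generates `results/summary.md`...
"""

if __name__ == "__main__":
    print(duplicate_headings(README))
```

Renaming the second heading in the sample to `## Report generation` makes the function return an empty dict.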

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: f7e8e6c4-ef56-4ed9-bcc4-92bb09feeb67

📥 Commits

Reviewing files that changed from the base of the PR and between 63c1f5d and f802413.

⛔ Files ignored due to path filters (1)
  • assets/benchmark_v0.0.7.png is excluded by !**/*.png
📒 Files selected for processing (2)
  • README.md
  • scripts/plot_results.py

@codecov

codecov Bot commented Apr 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.94%. Comparing base (63c1f5d) to head (8b8213e).
⚠️ Report is 8 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main      #43   +/-   ##
=======================================
  Coverage   84.94%   84.94%           
=======================================
  Files          10       10           
  Lines        1116     1116           
=======================================
  Hits          948      948           
  Misses        168      168           
| Flag | Coverage Δ |
|---|---|
| python | 84.94% <ø> (ø) |

Explains the key context: Vera has no training data at all, models
learn it entirely from SKILL.md in context, yet multiple models
write better Vera than TypeScript. Language design matters more
than training data volume.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
README.md (1)

112-112: 🧹 Nitpick | 🔵 Trivial

Consider documenting the recommended Vera version for reproducing v0.0.7 results.

Line 112 states "v0.0.104 or later" as the minimum requirement, whilst line 14 indicates the v0.0.7 benchmark was run against Vera v0.0.108. For exact reproducibility, consider clarifying that v0.0.108 is recommended.

📌 Optional clarification
-this should return v0.0.104 or later.
+this should return v0.0.104 or later. For reproducing the v0.0.7 benchmark results, use v0.0.108 or later.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` at line 112, Update the README phrasing so the recommended Vera
version for reproducing the v0.0.7 benchmark is explicit: locate the string
"v0.0.104 or later" and change it to state that v0.0.104 is the minimum but
v0.0.108 is recommended (or simply recommend "v0.0.108 or later") to match the
benchmark note on line 14; ensure the sentence references both the minimum and
the recommended versions so readers know which exact version to use for exact
reproducibility.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@README.md`:
- Around line 16-42: The tables under the "run_correct by model (Vera vs Python
vs TypeScript)" heading are ambiguous about which evaluation mode(s) they
represent; update the README by adding a concise clarifying note directly
beneath that heading (or immediately before the tables) stating whether the
percentages are from a single mode (e.g., full-spec), averaged across all 4
modes, best-of-modes, or a different aggregation, and if relevant, add a
parenthetical indicating where readers can find per-mode breakdowns (e.g., refer
to a specific section or link). Ensure the note references the evaluation modes
by name (e.g., "full-spec", "spec-from-NL", etc.) so readers can unambiguously
interpret the table values.
- Line 14: Update the README metadata line that currently references "VeraBench
v0.0.7" and "Vera v0.0.108" to also include the SKILL.md version used (commit
SHA or tag), per-model release dates, and the LLM API versions; alternatively
create a small "Reproducibility / Metadata" section enumerating SKILL.md:
<commit-or-tag>, models: e.g., "Claude Opus 4: 2025-02-14", "gpt-4o-mini:
2025-03-01", and APIs: e.g., "OpenAI API v1", "Anthropic API v2025-02", ensuring
each model entry references its release date and API version so results tied to
VeraBench v0.0.7 and Vera v0.0.108 are fully reproducible.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7976bdbc-c85e-4b5e-81f0-a3d973a61f18

📥 Commits

Reviewing files that changed from the base of the PR and between f802413 and 1a8d8af.

📒 Files selected for processing (1)
  • README.md

Comment thread README.md
Comment thread README.md Outdated
- Rename second '## Results' to '## Report generation'
- Clarify run_correct table shows Vera full-spec mode
- Update recommended vera version from v0.0.104 to v0.0.108

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@aallan

aallan commented Apr 8, 2026

Owner Author

Addressing CodeRabbit outside-diff comments:

Duplicate ## Results heading (line 152-154): Fixed in a720292 — renamed to ## Report generation.

Vera version v0.0.104 (line 112): Fixed in a720292 — updated to v0.0.108.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
README.md (1)

16-32: 🧹 Nitpick | 🔵 Trivial

Mode clarification improved, but Python/TypeScript context could be clearer.

The heading now explicitly states "(Vera full-spec vs Python vs TypeScript)", which clarifies that the Vera column shows full-spec mode results. However, it's not explicit whether Python and TypeScript have multiple evaluation modes or represent a single standard run. Consider adding a brief note (e.g., "Python and TypeScript are evaluated in a single standard mode for each problem") to eliminate remaining ambiguity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` around lines 16 - 32, The table heading "run_correct by model
(Vera full-spec vs Python vs TypeScript)" is clearer but doesn't state whether
Python and TypeScript used multiple evaluation modes; add a short clarifying
sentence or footnote immediately under that heading — e.g., "Python and
TypeScript results reflect a single standard evaluation mode per problem" — to
explicitly state that Python/TypeScript are single-mode runs; update the README
heading area and ensure the sentence mentions "Python" and "TypeScript" by name
so readers know which columns it applies to.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: dd694d4f-be01-4144-8d11-224094a5d794

📥 Commits

Reviewing files that changed from the base of the PR and between 1a8d8af and a720292.

📒 Files selected for processing (1)
  • README.md

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/plot_results.py`:
- Around line 189-265: The save step in main() can fail if the assets directory
doesn't exist; before calling fig.savefig(out, ...), ensure the output directory
exists by creating os.path.dirname(out) with os.makedirs(..., exist_ok=True)
(and add an import os at top if not present); place this directory-creation just
before fig.savefig to guarantee the path for the out variable
("assets/benchmark_v0.0.7.png") exists.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a7fc6e7b-f577-4a48-b528-eb493d8c7c5d

📥 Commits

Reviewing files that changed from the base of the PR and between a720292 and c76dc8f.

⛔ Files ignored due to path filters (1)
  • assets/benchmark_v0.0.7.png is excluded by !**/*.png
📒 Files selected for processing (1)
  • scripts/plot_results.py

Comment thread scripts/plot_results.py
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@aallan

aallan commented Apr 8, 2026

Owner Author

Re: fig.savefig failing if assets/ doesn't exist: Fixed in c89dfde — added Path(out).parent.mkdir(parents=True, exist_ok=True) before savefig.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/plot_results.py`:
- Around line 8-10: The script imports matplotlib.pyplot before setting a
backend, which can fail in headless CI; call matplotlib.use("Agg") immediately
after importing matplotlib (before importing matplotlib.pyplot) so the backend
is pinned; update the top of scripts/plot_results.py to import matplotlib, call
matplotlib.use("Agg"), then import matplotlib.pyplot as plt and numpy as np to
ensure deterministic image generation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 62ded443-6afe-4b19-b06e-c497494865d2

📥 Commits

Reviewing files that changed from the base of the PR and between c76dc8f and c89dfde.

📒 Files selected for processing (1)
  • scripts/plot_results.py

Comment thread scripts/plot_results.py Outdated
@aallan

aallan commented Apr 8, 2026

Owner Author

@coderabbitai Re: fig.savefig failing if assets/ doesn't exist — fixed in c89dfde. Added Path(out).parent.mkdir(parents=True, exist_ok=True) before savefig.

@coderabbitai

coderabbitai Bot commented Apr 8, 2026

@aallan Thanks for confirming — Path(out).parent.mkdir(parents=True, exist_ok=True) is exactly the right call here. Using parents=True, exist_ok=True ensures the full path is created idempotently whether assets/ is missing or the script is run from a fresh clone.
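
The idempotent directory creation discussed here can be shown without matplotlib. This is a minimal sketch: `save_bytes` is a hypothetical helper in which `write_bytes` stands in for `fig.savefig`, and the output path lives in a temporary directory rather than the repository.

```python
import tempfile
from pathlib import Path


def save_bytes(out: str, payload: bytes) -> Path:
    """Write `payload` to `out`, creating any missing parent directories first."""
    path = Path(out)
    # parents=True creates intermediate directories;
    # exist_ok=True turns a repeat call into a no-op instead of an error.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(payload)  # stand-in for fig.savefig(out, ...)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = f"{tmp}/assets/benchmark_v0.0.7.png"
        save_bytes(target, b"placeholder")  # assets/ does not exist yet
        save_bytes(target, b"placeholder")  # second run: assets/ already exists
        print(Path(target).exists())
```

Without `exist_ok=True`, the second call would raise `FileExistsError` because the directory is already there.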


🧠 Learnings used
Learnt from: aallan
Repo: aallan/vera-bench PR: 43
File: README.md:14-14
Timestamp: 2026-04-08T16:11:17.628Z
Learning: In aallan/vera-bench, model identifiers (e.g., `claude-sonnet-4-20250514`, `gpt-4.1-2025-04-14`) are considered sufficient for reproducibility. Do not flag missing per-model release dates or LLM API versions in README.md — the identifiers already encode this information precisely.

Learnt from: aallan
Repo: aallan/vera-bench PR: 43
File: README.md:14-14
Timestamp: 2026-04-08T16:11:17.628Z
Learning: In aallan/vera-bench, the SKILL.md version used during evaluation is recorded as `skill_md_hash` in every JSONL result file (results/**/*.jsonl). There is no need to add a SKILL.md version reference to README.md — the JSONL files are the canonical reproducibility record.
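
Since `skill_md_hash` is recorded in each JSONL result file, a reader can check which SKILL.md versions a results directory was produced with. The sketch below is a hypothetical helper, not part of vera-bench: only the `skill_md_hash` field name comes from the learning above, and the assumption that each line is a standalone JSON object follows from the JSONL format. The demo writes its own sample files to a temporary directory.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def skill_hashes(results_dir: str) -> dict:
    """Map each skill_md_hash value to the names of the JSONL files containing it."""
    seen = defaultdict(set)
    for path in sorted(Path(results_dir).rglob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines
                record = json.loads(line)
                # Records without the field are grouped under None so they stand out.
                seen[record.get("skill_md_hash")].add(path.name)
    return dict(seen)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        run = Path(tmp) / "run1"
        run.mkdir()
        # Made-up records for illustration only.
        (run / "model_a.jsonl").write_text(
            '{"skill_md_hash": "abc123"}\n{"skill_md_hash": "abc123"}\n',
            encoding="utf-8",
        )
        for digest, files in skill_hashes(tmp).items():
            print(digest, sorted(files))
```

More than one key in the returned mapping would indicate that the results were generated against different SKILL.md versions.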

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@aallan

aallan commented Apr 8, 2026

Owner Author

@coderabbitai Re: matplotlib backend for headless CI — fixed in 8b8213e. Added matplotlib.use("Agg") immediately after importing matplotlib, before importing pyplot.

@coderabbitai

coderabbitai Bot commented Apr 8, 2026

@aallan: Perfect fix — matplotlib.use() must be called early in script execution, and crucially, before any import of pylab or matplotlib.pyplot. The Agg backend is a non-GUI backend used primarily for generating image files (like PNGs) in environments without a display server — exactly what scripts/plot_results.py needs. The placement in commit 8b8213e is exactly right.
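
The import ordering described in this exchange looks roughly like the following. This is a sketch, not the contents of scripts/plot_results.py: `render_demo` and its bar values are hypothetical placeholders, and it assumes matplotlib is installed.

```python
import tempfile
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # pin the non-interactive backend before pyplot is imported

import matplotlib.pyplot as plt  # noqa: E402  (import deliberately follows use())


def render_demo(out: str) -> Path:
    """Draw a small bar chart and save it to `out` without needing a display."""
    fig, ax = plt.subplots(figsize=(4, 3))
    ax.bar(["a", "b", "c"], [1, 2, 3])  # placeholder values, not benchmark data
    path = Path(out)
    path.parent.mkdir(parents=True, exist_ok=True)
    fig.savefig(path, dpi=100)
    plt.close(fig)  # release the figure so repeated calls do not accumulate memory
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(render_demo(f"{tmp}/assets/demo.png").name)
```

Pinning the backend in the script itself means the output does not depend on the `MPLBACKEND` environment variable or on whichever GUI toolkit happens to be installed on the machine.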


🧠 Learnings used
Learnt from: CR
Repo: aallan/vera-bench PR: 0
File: BRIEFING.md:0-0
Timestamp: 2026-03-29T19:42:24.529Z
Learning: Pin SKILL.md versions in context/ directory for reproducibility; document vera version compatibility in results metadata; track model release dates and LLM API versions

Learnt from: aallan
Repo: aallan/vera-bench PR: 36
File: tests/test_models.py:84-89
Timestamp: 2026-03-31T14:05:18.280Z
Learning: In `tests/test_models.py` in the `vera-bench` repository, the Anthropic and OpenAI client tests deliberately patch `vera_bench.models.anthropic` / `vera_bench.models.openai` (module-level references) rather than the constructors directly. This is an intentional choice: the `[llm]` extras (`pip install -e .[llm]`) are always installed in CI, so the SDK modules are guaranteed present in `sys.modules`. Do not flag this patch-target approach as incorrect during code review.

@aallan aallan merged commit d4006f3 into main Apr 8, 2026
10 checks passed
@aallan aallan deleted the feature/v007-results branch April 8, 2026 16:42