Update README with v0.0.7 multi-model benchmark results #43
Add benchmark chart (plot_results.py) and update README with results from 6 models across 3 providers: Claude Opus 4/Sonnet 4, GPT-4.1/4o, Kimi K2.5/K2 Turbo. Key finding: Kimi K2.5 achieves 100% run_correct on Vera, beating both Python (86%) and TypeScript (91%).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough
Replaces the README "Initial Results" narrative with a new "Report generation" / Results section for VeraBench v0.0.7 vs Vera v0.0.108 (50 problems, 6 models, multiple modes) and adds a new plotting CLI script.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
README.md (1)
152-154: 🧹 Nitpick | 🔵 Trivial
Consider renaming this section to avoid duplicate heading.
There are now two `## Results` headings in the README (line 10 and line 152). This could cause issues with table-of-contents generation and anchor links. Consider renaming this section to better distinguish it from the benchmark results, e.g., "Report Generation" or "Output Files".
📝 Suggested rename
-## Results
+## Report Generation
 Running `vera-bench report results/` generates `results/summary.md`...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@README.md` around lines 152 - 154, The README contains a duplicate "## Results" heading; rename the lower "## Results" heading (the paragraph starting "Running `vera-bench report results/`...") to a more specific title such as "## Report Generation" or "## Output Files" and update any internal links/TOC anchors that point to "Results" (e.g., markdown links like [Results](`#results`) or autogenerated TOC entries) so they reference the new heading text/anchor; ensure the section heading text you change matches the new anchor format (kebab-case) so links work.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@README.md`:
- Around line 152-154: The README contains a duplicate "## Results" heading;
rename the lower "## Results" heading (the paragraph starting "Running
`vera-bench report results/`...") to a more specific title such as "## Report
Generation" or "## Output Files" and update any internal links/TOC anchors that
point to "Results" (e.g., markdown links like [Results](`#results`) or
autogenerated TOC entries) so they reference the new heading text/anchor; ensure
the section heading text you change matches the new anchor format (kebab-case)
so links work.
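The duplicate-heading finding above can also be checked mechanically. The following is a minimal sketch, assuming GitHub-style anchors (lowercased, punctuation dropped, spaces turned into hyphens); the helper names `slugify` and `find_duplicate_headings` are hypothetical and are not part of the repository under review.

```python
import re
from collections import defaultdict


def slugify(heading: str) -> str:
    """Approximate a GitHub-style anchor (an assumption, not GitHub's exact algorithm)."""
    text = heading.strip().lower()
    text = re.sub(r"[^\w\- ]", "", text)  # drop punctuation; keep word chars, hyphens, spaces
    return text.replace(" ", "-")


def find_duplicate_headings(markdown: str) -> dict[str, list[int]]:
    """Map each anchor used more than once to the 1-based line numbers of its headings."""
    seen: dict[str, list[int]] = defaultdict(list)
    in_fence = False
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # skip '#' lines inside fenced code blocks
            continue
        if in_fence:
            continue
        match = re.match(r"^(#{1,6})\s+(.*?)\s*#*\s*$", line)
        if match:
            seen[slugify(match.group(2))].append(lineno)
    return {anchor: lines for anchor, lines in seen.items() if len(lines) > 1}


# Hypothetical README outline with the same heading used twice.
sample = "# VeraBench\n## Results\ntext\n## Setup\n## Results\n"
duplicates = find_duplicate_headings(sample)
print(duplicates)
```

Running a check like this against the README before and after the rename would show whether any anchor is still claimed by two headings.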
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: f7e8e6c4-ef56-4ed9-bcc4-92bb09feeb67
⛔ Files ignored due to path filters (1)
assets/benchmark_v0.0.7.png is excluded by !**/*.png
📒 Files selected for processing (2)
README.md
scripts/plot_results.py
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@           Coverage Diff           @@
##             main      #43   +/-   ##
=======================================
  Coverage   84.94%   84.94%
=======================================
  Files          10       10
  Lines        1116     1116
=======================================
  Hits          948      948
  Misses        168      168
Explains the key context: Vera has no training data at all, models learn it entirely from SKILL.md in context, yet multiple models write better Vera than TypeScript. Language design matters more than training data volume.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
README.md (1)
112-112: 🧹 Nitpick | 🔵 Trivial
Consider documenting the recommended Vera version for reproducing v0.0.7 results.
Line 112 states "v0.0.104 or later" as the minimum requirement, whilst line 14 indicates the v0.0.7 benchmark was run against Vera v0.0.108. For exact reproducibility, consider clarifying that v0.0.108 is recommended.
📌 Optional clarification
-this should return v0.0.104 or later.
+this should return v0.0.104 or later. For reproducing the v0.0.7 benchmark results, use v0.0.108 or later.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@README.md` at line 112, Update the README phrasing so the recommended Vera version for reproducing the v0.0.7 benchmark is explicit: locate the string "v0.0.104 or later" and change it to state that v0.0.104 is the minimum but v0.0.108 is recommended (or simply recommend "v0.0.108 or later") to match the benchmark note on line 14; ensure the sentence references both the minimum and the recommended versions so readers know which exact version to use for exact reproducibility.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@README.md`:
- Around line 16-42: The tables under the "run_correct by model (Vera vs Python
vs TypeScript)" heading are ambiguous about which evaluation mode(s) they
represent; update the README by adding a concise clarifying note directly
beneath that heading (or immediately before the tables) stating whether the
percentages are from a single mode (e.g., full-spec), averaged across all 4
modes, best-of-modes, or a different aggregation, and if relevant, add a
parenthetical indicating where readers can find per-mode breakdowns (e.g., refer
to a specific section or link). Ensure the note references the evaluation modes
by name (e.g., "full-spec", "spec-from-NL", etc.) so readers can unambiguously
interpret the table values.
- Line 14: Update the README metadata line that currently references "VeraBench
v0.0.7" and "Vera v0.0.108" to also include the SKILL.md version used (commit
SHA or tag), per-model release dates, and the LLM API versions; alternatively
create a small "Reproducibility / Metadata" section enumerating SKILL.md:
<commit-or-tag>, models: e.g., "Claude Opus 4: 2025-02-14", "gpt-4o-mini:
2025-03-01", and APIs: e.g., "OpenAI API v1", "Anthropic API v2025-02", ensuring
each model entry references its release date and API version so results tied to
VeraBench v0.0.7 and Vera v0.0.108 are fully reproducible.
---
Outside diff comments:
In `@README.md`:
- Line 112: Update the README phrasing so the recommended Vera version for
reproducing the v0.0.7 benchmark is explicit: locate the string "v0.0.104 or
later" and change it to state that v0.0.104 is the minimum but v0.0.108 is
recommended (or simply recommend "v0.0.108 or later") to match the benchmark
note on line 14; ensure the sentence references both the minimum and the
recommended versions so readers know which exact version to use for exact
reproducibility.
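The distinction between a minimum and a recommended version can be made concrete with a small check. This is a sketch only: it assumes the installed version is available as a tag of the form `v0.0.108`, and the names `parse_version` and `classify` are hypothetical helpers, not part of Vera or VeraBench.

```python
import re


def parse_version(tag: str) -> tuple[int, ...]:
    """Turn a tag such as 'v0.0.108' into a tuple of ints so it compares numerically."""
    match = re.fullmatch(r"v?(\d+(?:\.\d+)*)", tag.strip())
    if match is None:
        raise ValueError(f"unrecognised version tag: {tag!r}")
    return tuple(int(part) for part in match.group(1).split("."))


# Values taken from the review comment above.
MINIMUM = parse_version("v0.0.104")
RECOMMENDED = parse_version("v0.0.108")


def classify(installed: str) -> str:
    """Label an installed version relative to the minimum and recommended versions."""
    version = parse_version(installed)
    if version < MINIMUM:
        return "too old"
    if version < RECOMMENDED:
        return "supported"
    return "recommended"


for tag in ("v0.0.9", "v0.0.104", "v0.0.108"):
    print(tag, classify(tag))
```

Comparing integer tuples rather than raw strings matters here: as strings, "v0.0.9" would sort after "v0.0.104".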
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 7976bdbc-c85e-4b5e-81f0-a3d973a61f18
📒 Files selected for processing (1)
README.md
- Rename second '## Results' to '## Report generation'
- Clarify run_correct table shows Vera full-spec mode
- Update recommended vera version from v0.0.104 to v0.0.108
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
♻️ Duplicate comments (1)
README.md (1)
16-32: 🧹 Nitpick | 🔵 Trivial
Mode clarification improved, but Python/TypeScript context could be clearer.
The heading now explicitly states "(Vera full-spec vs Python vs TypeScript)", which clarifies that the Vera column shows full-spec mode results. However, it's not explicit whether Python and TypeScript have multiple evaluation modes or represent a single standard run. Consider adding a brief note (e.g., "Python and TypeScript are evaluated in a single standard mode for each problem") to eliminate remaining ambiguity.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@README.md` around lines 16 - 32, The table heading "run_correct by model (Vera full-spec vs Python vs TypeScript)" is clearer but doesn't state whether Python and TypeScript used multiple evaluation modes; add a short clarifying sentence or footnote immediately under that heading — e.g., "Python and TypeScript results reflect a single standard evaluation mode per problem" — to explicitly state that Python/TypeScript are single-mode runs; update the README heading area and ensure the sentence mentions "Python" and "TypeScript" by name so readers know which columns it applies to.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@README.md`:
- Around line 16-32: The table heading "run_correct by model (Vera full-spec vs
Python vs TypeScript)" is clearer but doesn't state whether Python and
TypeScript used multiple evaluation modes; add a short clarifying sentence or
footnote immediately under that heading — e.g., "Python and TypeScript results
reflect a single standard evaluation mode per problem" — to explicitly state
that Python/TypeScript are single-mode runs; update the README heading area and
ensure the sentence mentions "Python" and "TypeScript" by name so readers know
which columns it applies to.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: dd694d4f-be01-4144-8d11-224094a5d794
📒 Files selected for processing (1)
README.md
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/plot_results.py`:
- Around line 189-265: The save step in main() can fail if the assets directory
doesn't exist; before calling fig.savefig(out, ...), ensure the output directory
exists by creating os.path.dirname(out) with os.makedirs(..., exist_ok=True)
(and add an import os at top if not present); place this directory-creation just
before fig.savefig to guarantee the path for the out variable
("assets/benchmark_v0.0.7.png") exists.
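The directory-creation step suggested above might look like the following. This is a minimal sketch rather than the script's actual code: `ensure_parent_dir` is a hypothetical helper, and a plain file write stands in for `fig.savefig(out, ...)` so the example runs without matplotlib.

```python
import os
import tempfile


def ensure_parent_dir(path: str) -> None:
    """Create the directory that will hold `path`, if it has one and it is missing."""
    parent = os.path.dirname(path)
    if parent:  # a bare filename has no parent, and os.makedirs("") would raise
        os.makedirs(parent, exist_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    # Same relative layout as the output path named in the review comment.
    out = os.path.join(tmp, "assets", "benchmark_v0.0.7.png")
    ensure_parent_dir(out)
    # In the real script, fig.savefig(out, ...) would follow here.
    with open(out, "wb") as fh:
        fh.write(b"placeholder")
    saved = os.path.exists(out)

print(saved)
```

The guard on an empty parent is a small addition to the reviewer's suggestion: it keeps the helper safe if the output path is ever changed to a bare filename.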
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: a7fc6e7b-f577-4a48-b528-eb493d8c7c5d
⛔ Files ignored due to path filters (1)
assets/benchmark_v0.0.7.png is excluded by !**/*.png
📒 Files selected for processing (1)
scripts/plot_results.py
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/plot_results.py`:
- Around line 8-10: The script imports matplotlib.pyplot before setting a
backend, which can fail in headless CI; call matplotlib.use("Agg") immediately
after importing matplotlib (before importing matplotlib.pyplot) so the backend
is pinned; update the top of scripts/plot_results.py to import matplotlib, call
matplotlib.use("Agg"), then import matplotlib.pyplot as plt and numpy as np to
ensure deterministic image generation.
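The import order described above can be sketched as follows. It assumes matplotlib and numpy are installed, and the bar values are placeholders chosen for illustration, not benchmark results.

```python
import matplotlib

matplotlib.use("Agg")  # pin the non-interactive backend before pyplot is imported

import matplotlib.pyplot as plt  # noqa: E402
import numpy as np  # noqa: E402

# Placeholder data only; the real script plots benchmark results.
fig, ax = plt.subplots()
ax.bar(np.arange(3), [1.0, 2.0, 3.0])

backend = matplotlib.get_backend()
plt.close(fig)
print(backend)
```

With the backend pinned this way, figure creation does not depend on a display being available, which is the situation in most CI runners.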
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 62ded443-6afe-4b19-b06e-c497494865d2
📒 Files selected for processing (1)
scripts/plot_results.py
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai Re: matplotlib backend for headless CI — fixed in 8b8213e. Added |
Summary
assets/benchmark_v0.0.7.png, generated by new scripts/plot_results.py
Test plan
🤖 Generated with Claude Code