Add Initial Results section to README by aallan · Pull Request #23 · aallan/vera-bench

aallan · 2026-03-30T18:11:52Z

Front-and-centre the six-way comparison table from the first Claude Sonnet 4 benchmark run.

Highlights:

- Vera with contracts (83%) outperforms TypeScript without them (79%)
- Python remains strongest at 92%, but the gap to Vera is only 9 points
- Contract design (spec-from-NL) is harder than code generation
- Clear caveat about single-run non-determinism

Summary by CodeRabbit

Documentation
- Added an "Initial Results" section with first‑run benchmark numbers and comparison tables (Vera full‑spec vs spec‑from‑NL; Python/TypeScript LLM modes and baselines).
- Added narrative bullets summarising relative performance and a note on non‑determinism recommending multi‑run stability (pass@k).
- Clarified Results output: summary, per‑tier and per‑problem detail; generated result files are ignored in version control.

Front-and-centre the six-way comparison table from the first Sonnet 4 benchmark run. Highlights Vera outperforming TypeScript despite zero training data presence, and the contract design challenge in spec-from-NL mode. Includes caveat about single-run non-determinism.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-03-30T18:12:10Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 3f6662d7-c5be-4a4b-8b9f-beb6dc5a59c4

📥 Commits

Reviewing files that changed from the base of the PR and between 0dba889 and 87a3215.

📒 Files selected for processing (1)

README.md

📝 Walkthrough

Added an "Initial Results" section to README.md reporting single-run benchmark metrics for VeraBench v0.0.4 against Vera v0.0.104, clarified the non-determinism caveat and the need for multi-run stability, replaced the concrete results/summary.md example with a general description of vera-bench report output, and added generated results/ files to .gitignore.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Documentation `README.md`, `.gitignore` | Added "Initial Results" section with single-run benchmark table and narrative on non-determinism/pass@k; removed explicit `results/summary.md` example and replaced with a general description of `vera-bench report` outputs; added `results/` to `.gitignore`. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

docs

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Add Initial Results section to README' directly and clearly describes the main change in the pull request, which introduces a new 'Initial Results' section to the README file. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

codecov · 2026-03-30T18:13:12Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 66.29%. Comparing base (2396d48) to head (87a3215).
⚠️ Report is 5 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main      #23   +/-   ##
=======================================
  Coverage   66.29%   66.29%           
=======================================
  Files          10       10           
  Lines        1068     1068           
=======================================
  Hits          708      708           
  Misses        360      360
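The headline percentage follows directly from the hit and miss counts in the diff above. As a quick check, the arithmetic can be sketched as below; `coverage_percent` is a hypothetical helper name used only for illustration, not part of Codecov or vera-bench.

```python
def coverage_percent(hits: int, misses: int) -> float:
    """Return line coverage as a percentage of all coverable lines."""
    lines = hits + misses  # every coverable line is either a hit or a miss
    return 100 * hits / lines


# Counts taken from the coverage diff for this PR.
hits, misses = 708, 360
print(f"{hits + misses} lines, {coverage_percent(hits, misses):.2f}% covered")
# prints "1068 lines, 66.29% covered"
```

Because the PR only touches documentation, hits and misses are identical on base and head, which is why the diff shows no change.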

| Flag | Coverage Δ |
|---|---|
| python | `66.29% <ø> (ø)` |

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@README.md`:
- Around line 14-21: The Initial Results table in README.md is missing
reproducibility metadata; update the table (or add an adjacent metadata block)
to record the vera compiler version, the SKILL.md version(s) used (and pin
SKILL.md files in the context/ directory), and the exact model identifier
including release date/API (e.g., claude-sonnet-4-20250514) rather than a
generic name; also add a note indicating vera compatibility constraints for this
run so future readers can reproduce results.
- Around line 14-21: The Results table is ambiguous about whether it shows real
benchmarks or an example; update README.md to clearly label the example table
and the real-results section: add a short clarifying sentence above the example
table (e.g., "Example output format — not actual results") and/or a note near
the top of the Results section stating that actual benchmark data lives in the
results/ directory (which is currently empty), ensuring readers can distinguish
the illustrative table from real data.
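One way to picture the reproducibility metadata requested above is a small record rendered next to the results table. The sketch below is only an illustration: the `RunMetadata` class, its field names, and the output format are assumptions, not something vera-bench is known to provide. The version strings and model identifier are the ones quoted in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunMetadata:
    """Hypothetical record of what a single benchmark run was executed against."""

    bench_version: str
    compiler_version: str
    model_id: str
    skill_md_ref: Optional[str] = None  # snapshot name, hash, or date, if known

    def to_markdown(self) -> str:
        """Render a one-line header suitable for placing above a results table."""
        skill = self.skill_md_ref or "not recorded"
        return (
            f"VeraBench {self.bench_version} · Vera {self.compiler_version} · "
            f"model `{self.model_id}` · SKILL.md: {skill}"
        )


meta = RunMetadata("v0.0.4", "v0.0.104", "claude-sonnet-4-20250514")
print(meta.to_markdown())
```

Leaving `skill_md_ref` optional makes a missing language-reference version visible in the rendered header rather than silently omitted.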

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 26adb698-5c7c-472f-abec-267ffa9ddf1d

📥 Commits

Reviewing files that changed from the base of the PR and between 2396d48 and ddacec9.

📒 Files selected for processing (1)

README.md

- Link VeraBench v0.0.4, Vera v0.0.104, and Claude Sonnet 4 model page in the Initial Results introduction
- Remove dummy results table from the Results section (had fake numbers from before real benchmarks existed)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

README.md (1)
12-12: ⚠️ Potential issue | 🟡 Minor

Document the SKILL.md version or date used for the benchmark results on line 12.

Line 12 documents VeraBench v0.0.4, Vera v0.0.104, and the model identifier, but omits which SKILL.md version was used. Since SKILL.md is fetched at runtime from veralang.dev (line 126), the benchmark results cannot be reproduced without knowing which language reference was current at the time. This affects longitudinal tracking across vera compiler versions.

Either pin a SKILL.md snapshot in a context/ directory (as outlined in BRIEFING.md and DESIGN.md) and reference it in the results header, or at minimum document the date or commit hash alongside the veralang.dev URL.
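The second option (recording a hash or date for the fetched reference) could look roughly like the sketch below. It assumes the SKILL.md content has already been downloaded as bytes. The `snapshot_record` helper and its field names are hypothetical, and the exact SKILL.md URL is not given in this thread, so it is passed in as a parameter.

```python
import hashlib
from datetime import datetime, timezone
from typing import Dict, Optional


def snapshot_record(
    content: bytes,
    source_url: str,
    retrieved_at: Optional[datetime] = None,
) -> Dict[str, str]:
    """Describe a fetched document by content hash, source URL, and retrieval time."""
    when = retrieved_at or datetime.now(timezone.utc)
    return {
        "source_url": source_url,
        # A SHA-256 digest identifies the exact bytes that were in use.
        "sha256": hashlib.sha256(content).hexdigest(),
        "retrieved_at": when.isoformat(),
    }


# Usage with placeholder content; a real run would pass the fetched SKILL.md bytes.
record = snapshot_record(
    b"# placeholder SKILL.md body\n",
    "<SKILL.md URL on veralang.dev>",
    datetime(2026, 3, 30, tzinfo=timezone.utc),
)
print(record["sha256"][:12], record["retrieved_at"])
```

A content hash distinguishes two versions fetched on the same day, which a retrieval date alone would not.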
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` at line 12, Add the SKILL.md version/date used for the benchmark
to the README results header: either pin a snapshot of SKILL.md in the project
(e.g., commit a copy under context/ per BRIEFING.md and DESIGN.md) and reference
that snapshot filename or include the SKILL.md commit hash or retrieval date
next to the veralang.dev URL so the exact language reference used by the
benchmarks (SKILL.md) is reproducible.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@README.md`:
- Line 12: The Anthropic Claude documentation link in the README (the anchor URL
used for "Claude Sonnet 4") returns a permanent redirect; update that link to
the current direct models documentation path (remove the anchor-style URL and
use the canonical models page URL) while keeping the label "Claude Sonnet 4" and
the model identifier `claude-sonnet-4-20250514` unchanged so the README points
to the resolved (HTTP 200) Anthropic docs.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 12a98f7b-89a0-4a46-a282-e69ce064b5ad

📥 Commits

Reviewing files that changed from the base of the PR and between ddacec9 and 67f6d6c.

📒 Files selected for processing (1)

README.md

Updated link to Claude Sonnet 4 documentation.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@README.md`:
- Line 29: Add a brief definition of "pass@k" to the README note so unfamiliar
readers understand the metric: state that pass@k measures sampling k independent
model outputs for each problem and counts a problem as passed if at least one of
the k outputs meets the success criteria (or link to a canonical explanation of
pass@k), and insert it inline after the phrase "pass@k" in the existing note for
clarity.
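For readers who want the metric in concrete terms, here is a minimal sketch of the unbiased pass@k estimator described in the HumanEval paper: given n samples for a problem, c of which pass, it estimates the probability that at least one of k samples passes. The function name is illustrative, and this is not taken from the vera-bench code, which may compute the metric differently.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c passed: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, so the estimate
    # is 1.0 whenever every size-k subset must contain a passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 10 samples and 3 passes, pass@1 reduces to the plain pass rate.
print(round(pass_at_k(10, 3, 1), 3))  # prints 0.3
```

A single run corresponds to n = k = 1, where the estimate is simply 0 or 1 per problem. That is why the README note treats single-run numbers as unstable.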

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: e0895255-9a86-4b62-b83e-5bd94a39e5e3

📥 Commits

Reviewing files that changed from the base of the PR and between 67f6d6c and 0dba889.

📒 Files selected for processing (1)

README.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix link to Claude Sonnet 4 documentation

0dba889

Add pass@k definition with link to HumanEval paper

87a3215

aallan merged commit 81e2f12 into main Mar 30, 2026
10 checks passed

aallan deleted the docs/initial-results branch March 30, 2026 18:54

coderabbitai Bot mentioned this pull request Apr 8, 2026

Update README with v0.0.7 multi-model benchmark results #43

Merged

2 tasks

aallan merged 4 commits into main from docs/initial-results
