Add Initial Results section to README#23

Merged
aallan merged 4 commits into main from docs/initial-results on Mar 30, 2026
Conversation

@aallan (Owner) commented Mar 30, 2026

Front-and-centre the six-way comparison table from the first Claude Sonnet 4 benchmark run.

Highlights:

  • Vera with contracts (83%) outperforms TypeScript without them (79%)
  • Python remains strongest at 92%, but the gap to Vera is only 9 points
  • Contract design (spec-from-NL) is harder than code generation
  • Clear caveat about single-run non-determinism

Generated with Claude Code

Summary by CodeRabbit

  • Documentation
    • Added an "Initial Results" section with first‑run benchmark numbers and comparison tables (Vera full‑spec vs spec‑from‑NL; Python/TypeScript LLM modes and baselines).
    • Added narrative bullets summarising relative performance and a note on non‑determinism recommending multi‑run stability (pass@k).
    • Clarified Results output: summary, per‑tier and per‑problem detail; generated result files are ignored in version control.

Front-and-centre the six-way comparison table from the first Sonnet 4
benchmark run. Highlights Vera outperforming TypeScript despite zero
training data presence, and the contract design challenge in spec-from-NL
mode. Includes caveat about single-run non-determinism.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai Bot commented Mar 30, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 3f6662d7-c5be-4a4b-8b9f-beb6dc5a59c4

📥 Commits

Reviewing files that changed from the base of the PR and between 0dba889 and 87a3215.

📒 Files selected for processing (1)
  • README.md

📝 Walkthrough

Added an "Initial Results" section to README.md reporting single-run VeraBench v0.0.4 vs Vera v0.0.104 benchmark metrics, clarified non-determinism and need for multi-run stability, replaced a concrete results/summary.md example with a general vera-bench report description, and gitignored generated results/ files.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation: README.md, .gitignore | Added "Initial Results" section with single-run benchmark table and narrative on non-determinism/pass@k; removed explicit results/summary.md example and replaced with a general description of vera-bench report outputs; added results/ to .gitignore. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

docs

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Add Initial Results section to README' directly and clearly describes the main change in the pull request, which introduces a new 'Initial Results' section to the README file. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


codecov Bot commented Mar 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 66.29%. Comparing base (2396d48) to head (87a3215).
⚠️ Report is 5 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main      #23   +/-   ##
=======================================
  Coverage   66.29%   66.29%           
=======================================
  Files          10       10           
  Lines        1068     1068           
=======================================
  Hits          708      708           
  Misses        360      360           
| Flag | Coverage Δ |
| --- | --- |
| python | 66.29% <ø> (ø) |
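
As a sanity check on the report above, the 66.29% project coverage is simply hits divided by coverable lines. A minimal sketch in Python, using only the counts shown in the coverage diff (the function name `coverage_percent` is illustrative and not part of Codecov):

```python
def coverage_percent(hits: int, lines: int) -> float:
    """Return line coverage as a percentage, rounded to two decimals."""
    if lines <= 0:
        raise ValueError("lines must be positive")
    return round(100 * hits / lines, 2)


# Counts taken from the coverage diff: 708 hits + 360 misses = 1068 lines.
print(coverage_percent(708, 1068))  # 66.29
```

Because this PR only touches documentation, hits and lines are identical on base and head, which is why the diff shows no change.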


coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@README.md`:
- Around line 14-21: The Initial Results table in README.md is missing
reproducibility metadata; update the table (or add an adjacent metadata block)
to record the vera compiler version, the SKILL.md version(s) used (and pin
SKILL.md files in the context/ directory), and the exact model identifier
including release date/API (e.g., claude-sonnet-4-20250514) rather than a
generic name; also add a note indicating vera compatibility constraints for this
run so future readers can reproduce results.
- Around line 14-21: The Results table is ambiguous about whether it shows real
benchmarks or an example; update README.md to clearly label the example table
and the real-results section: add a short clarifying sentence above the example
table (e.g., "Example output format — not actual results") and/or a note near
the top of the Results section stating that actual benchmark data lives in the
results/ directory (which is currently empty), ensuring readers can distinguish
the illustrative table from real data.
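
One way to carry the reproducibility metadata requested in the first comment is a small record stored next to each results file. The sketch below is hypothetical: the `RunMetadata` class and its field names are assumptions for illustration and are not part of vera-bench. The version strings and model identifier are the ones quoted in this PR; the SKILL.md reference is left as a placeholder because the PR does not state it.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunMetadata:
    """Hypothetical record of what a benchmark run depended on."""

    bench_version: str      # VeraBench release
    compiler_version: str   # vera compiler release
    model_id: str           # exact, dated model identifier
    skill_md_ref: str       # SKILL.md snapshot, hash, or retrieval date

    def to_json(self) -> str:
        # Sorted keys keep the output stable across runs.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


meta = RunMetadata(
    bench_version="v0.0.4",
    compiler_version="v0.0.104",
    model_id="claude-sonnet-4-20250514",
    skill_md_ref="UNKNOWN",  # placeholder: not recorded in this PR
)
print(meta.to_json())
```

Keeping the record immutable (`frozen=True`) makes it harder to mutate the metadata after a run has been written out.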

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 26adb698-5c7c-472f-abec-267ffa9ddf1d

📥 Commits

Reviewing files that changed from the base of the PR and between 2396d48 and ddacec9.

📒 Files selected for processing (1)
  • README.md

Comment thread README.md
- Link VeraBench v0.0.4, Vera v0.0.104, and Claude Sonnet 4 model page
  in the Initial Results introduction
- Remove dummy results table from the Results section (had fake numbers
  from before real benchmarks existed)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
README.md (1)

12-12: ⚠️ Potential issue | 🟡 Minor

Document the SKILL.md version or date used for the benchmark results on line 12.

Line 12 documents VeraBench v0.0.4, Vera v0.0.104, and the model identifier, but omits which SKILL.md version was used. Since SKILL.md is fetched at runtime from veralang.dev (line 126), the benchmark results cannot be reproduced without knowing which language reference was current at the time. This affects longitudinal tracking across vera compiler versions.

Either pin a SKILL.md snapshot in a context/ directory (as outlined in BRIEFING.md and DESIGN.md) and reference it in the results header, or at minimum document the date or commit hash alongside the veralang.dev URL.
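
The "document the date or commit hash" option can be approximated even without a pinned snapshot by recording a content hash and retrieval time whenever SKILL.md is fetched. The following is a minimal sketch under stated assumptions: `skill_fingerprint` is a hypothetical helper, not a vera-bench function, and the bytes passed in stand in for whatever the runtime fetch returned (the exact URL path is not given here, so none is assumed).

```python
import hashlib
from datetime import datetime, timezone
from typing import Optional


def skill_fingerprint(
    content: bytes,
    source_url: str,
    retrieved_at: Optional[datetime] = None,
) -> dict:
    """Describe a fetched SKILL.md by hash and retrieval time (hypothetical helper)."""
    when = retrieved_at or datetime.now(timezone.utc)
    return {
        "source_url": source_url,
        "sha256": hashlib.sha256(content).hexdigest(),
        "retrieved_at": when.isoformat(),
    }


# Placeholder bytes stand in for the downloaded SKILL.md.
record = skill_fingerprint(b"# SKILL.md placeholder\n", "veralang.dev")
print(record["sha256"])
```

Two runs that report the same `sha256` used byte-identical language references, which is the property longitudinal comparisons need.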

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` at line 12, Add the SKILL.md version/date used for the benchmark
to the README results header: either pin a snapshot of SKILL.md in the project
(e.g., commit a copy under context/ per BRIEFING.md and DESIGN.md) and reference
that snapshot filename or include the SKILL.md commit hash or retrieval date
next to the veralang.dev URL so the exact language reference used by the
benchmarks (SKILL.md) is reproducible.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@README.md`:
- Line 12: The Anthropic Claude documentation link in the README (the anchor URL
used for "Claude Sonnet 4") returns a permanent redirect; update that link to
the current direct models documentation path (remove the anchor-style URL and
use the canonical models page URL) while keeping the label "Claude Sonnet 4" and
the model identifier `claude-sonnet-4-20250514` unchanged so the README points
to the resolved (HTTP 200) Anthropic docs.

---

Duplicate comments:
In `@README.md`:
- Line 12: Add the SKILL.md version/date used for the benchmark to the README
results header: either pin a snapshot of SKILL.md in the project (e.g., commit a
copy under context/ per BRIEFING.md and DESIGN.md) and reference that snapshot
filename or include the SKILL.md commit hash or retrieval date next to the
veralang.dev URL so the exact language reference used by the benchmarks
(SKILL.md) is reproducible.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 12a98f7b-89a0-4a46-a282-e69ce064b5ad

📥 Commits

Reviewing files that changed from the base of the PR and between ddacec9 and 67f6d6c.

📒 Files selected for processing (1)
  • README.md

Comment thread README.md Outdated
Updated link to Claude Sonnet 4 documentation.

coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@README.md`:
- Line 29: Add a brief definition of "pass@k" to the README note so unfamiliar
readers understand the metric: state that pass@k measures sampling k independent
model outputs for each problem and counts a problem as passed if at least one of
the k outputs meets the success criteria (or link to a canonical explanation of
pass@k), and insert it inline after the phrase "pass@k" in the existing note for
clarity.
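
To make the pass@k definition concrete, here is a short sketch of how it is commonly estimated. Note the swap: rather than drawing exactly k samples, this uses the widely used unbiased estimator that draws n ≥ k samples per problem, counts the c that pass, and computes 1 − C(n−c, k)/C(n, k). Nothing in this PR says vera-bench computes pass@k this way; the function name and the example numbers are illustrative assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed.

    Equals the probability that a random size-k subset of the n samples
    contains at least one passing sample.
    """
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k subset includes a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers only: 10 samples, 3 passing.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

With k = 1 the estimate reduces to the plain pass rate c/n, which is what a single-run table effectively reports; larger k shows how much repeated sampling helps.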

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: e0895255-9a86-4b62-b83e-5bd94a39e5e3

📥 Commits

Reviewing files that changed from the base of the PR and between 67f6d6c and 0dba889.

📒 Files selected for processing (1)
  • README.md

Comment thread README.md Outdated
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@aallan aallan merged commit 81e2f12 into main Mar 30, 2026
10 checks passed
@aallan aallan deleted the docs/initial-results branch March 30, 2026 18:54