Add Initial Results section to README #23
Front-and-centre the six-way comparison table from the first Sonnet 4 benchmark run. Highlights Vera outperforming TypeScript despite zero training data presence, and the contract design challenge in spec-from-NL mode. Includes caveat about single-run non-determinism.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough
Added an "Initial Results" section to README.md reporting single-run VeraBench v0.0.4 vs Vera v0.0.104 benchmark metrics, clarified non-determinism and need for multi-run stability, replaced a concrete
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ✅ 3 passed
Codecov Report
✅ All modified and coverable lines are covered by tests.

Coverage diff, main vs. #23:

| Metric | main | #23 |
|---|---|---|
| Coverage | 66.29% | 66.29% |
| Files | 10 | 10 |
| Lines | 1068 | 1068 |
| Hits | 708 | 708 |
| Misses | 360 | 360 |
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@README.md`:
- Around line 14-21: The Initial Results table in README.md is missing
reproducibility metadata; update the table (or add an adjacent metadata block)
to record the vera compiler version, the SKILL.md version(s) used (and pin
SKILL.md files in the context/ directory), and the exact model identifier
including release date/API (e.g., claude-sonnet-4-20250514) rather than a
generic name; also add a note indicating vera compatibility constraints for this
run so future readers can reproduce results.
- Around line 14-21: The Results table is ambiguous about whether it shows real
benchmarks or an example; update README.md to clearly label the example table
and the real-results section: add a short clarifying sentence above the example
table (e.g., "Example output format — not actual results") and/or a note near
the top of the Results section stating that actual benchmark data lives in the
results/ directory (which is currently empty), ensuring readers can distinguish
the illustrative table from real data.
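One way to carry the reproducibility metadata requested in the first finding is a small record rendered next to the results table. The sketch below is illustrative only: the `RunMetadata` class, its field names, and the `skill_md_ref` placeholder are hypothetical and not part of VeraBench; the version strings and model identifier are the ones quoted in this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMetadata:
    """Hypothetical record of what a benchmark run was executed against."""

    verabench_version: str
    vera_version: str
    model_id: str      # exact dated identifier, not a generic model name
    skill_md_ref: str  # commit hash, snapshot path, or retrieval date

    def to_markdown(self) -> str:
        """Render the metadata as a one-line note for the README."""
        return (
            f"VeraBench {self.verabench_version} | Vera {self.vera_version} | "
            f"model `{self.model_id}` | SKILL.md: {self.skill_md_ref}"
        )


meta = RunMetadata(
    verabench_version="v0.0.4",
    vera_version="v0.0.104",
    model_id="claude-sonnet-4-20250514",
    skill_md_ref="<not recorded>",  # placeholder: this PR does not state it
)
print(meta.to_markdown())
```

Keeping the record frozen means a results table cannot silently drift from the metadata it was generated with.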
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 26adb698-5c7c-472f-abec-267ffa9ddf1d
📒 Files selected for processing (1)
README.md
- Link VeraBench v0.0.4, Vera v0.0.104, and Claude Sonnet 4 model page in the Initial Results introduction
- Remove dummy results table from the Results section (had fake numbers from before real benchmarks existed)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 1
♻️ Duplicate comments (1)
README.md (1)
12-12: ⚠️ Potential issue | 🟡 Minor
Document the SKILL.md version or date used for the benchmark results on line 12.
Line 12 documents VeraBench v0.0.4, Vera v0.0.104, and the model identifier, but omits which SKILL.md version was used. Since SKILL.md is fetched at runtime from veralang.dev (line 126), the benchmark results cannot be reproduced without knowing which language reference was current at the time. This affects longitudinal tracking across vera compiler versions.
Either pin a SKILL.md snapshot in a `context/` directory (as outlined in BRIEFING.md and DESIGN.md) and reference it in the results header, or at minimum document the date or commit hash alongside the veralang.dev URL.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@README.md` at line 12, Add the SKILL.md version/date used for the benchmark to the README results header: either pin a snapshot of SKILL.md in the project (e.g., commit a copy under context/ per BRIEFING.md and DESIGN.md) and reference that snapshot filename or include the SKILL.md commit hash or retrieval date next to the veralang.dev URL so the exact language reference used by the benchmarks (SKILL.md) is reproducible.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@README.md`:
- Line 12: The Anthropic Claude documentation link in the README (the anchor URL
used for "Claude Sonnet 4") returns a permanent redirect; update that link to
the current direct models documentation path (remove the anchor-style URL and
use the canonical models page URL) while keeping the label "Claude Sonnet 4" and
the model identifier `claude-sonnet-4-20250514` unchanged so the README points
to the resolved (HTTP 200) Anthropic docs.
---
Duplicate comments:
In `@README.md`:
- Line 12: Add the SKILL.md version/date used for the benchmark to the README
results header: either pin a snapshot of SKILL.md in the project (e.g., commit a
copy under context/ per BRIEFING.md and DESIGN.md) and reference that snapshot
filename or include the SKILL.md commit hash or retrieval date next to the
veralang.dev URL so the exact language reference used by the benchmarks
(SKILL.md) is reproducible.
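The snapshot option from the duplicate comment could look roughly like the sketch below, which writes a copy of SKILL.md whose file name embeds the retrieval date and a short SHA-256 prefix. This is an assumption-laden illustration: `pin_skill_md`, the file-naming scheme, and the sample content and date are hypothetical, and the content is passed in as bytes so that fetching it from veralang.dev stays out of scope.

```python
import hashlib
import tempfile
from datetime import date
from pathlib import Path


def pin_skill_md(content: bytes, retrieved: date, out_dir: Path) -> Path:
    """Write a dated, content-addressed snapshot of SKILL.md and return its path."""
    digest = hashlib.sha256(content).hexdigest()[:12]  # short content fingerprint
    out_dir.mkdir(parents=True, exist_ok=True)
    snapshot = out_dir / f"SKILL-{retrieved.isoformat()}-{digest}.md"
    snapshot.write_bytes(content)
    return snapshot


# Demo with made-up content and date, written to a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    path = pin_skill_md(b"# example content\n", date(2025, 1, 1), Path(tmp) / "context")
    print(path.name)
```

The snapshot file name can then be quoted in the results header, so a later reader can tell whether the language reference changed between runs.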
Updated link to Claude Sonnet 4 documentation.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@README.md`:
- Line 29: Add a brief definition of "pass@k" to the README note so unfamiliar
readers understand the metric: state that pass@k measures sampling k independent
model outputs for each problem and counts a problem as passed if at least one of
the k outputs meets the success criteria (or link to a canonical explanation of
pass@k), and insert it inline after the phrase "pass@k" in the existing note for
clarity.
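To make the pass@k definition above concrete, here is a minimal sketch of the commonly used per-problem estimator: given n sampled outputs of which c pass, it returns the probability that a random subset of k samples contains at least one passing output. Whether VeraBench computes pass@k this way is not stated in this PR, so treat the function and its name as an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed.

    Computes 1 - C(n - c, k) / C(n, k): one minus the chance that all k
    samples drawn without replacement come from the n - c failing outputs.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical numbers: 10 samples for one problem, 3 of them passing.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

With k = 1 this reduces to the plain pass rate c / n; a benchmark-level score would average the per-problem values.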
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Front-and-centre the six-way comparison table from the first Claude Sonnet 4 benchmark run.
Highlights:
Generated with Claude Code