aver: migrate baselines + harness to Aver 0.16 Console=String #65
📝 Walkthrough (CodeRabbit)
The Aver test-wrapper in the Vera benchmark harness now prints the entry-point invocation as a string-interpolated representation (including braces) instead of printing the raw expression, updated in both existing-module and new-module wrapper code paths.
|
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #65 +/- ##
=======================================
Coverage 83.62% 83.62%
=======================================
Files 10 10
Lines 1392 1392
=======================================
Hits 1164 1164
Misses 228 228
|
Aver 0.16 ("Anneal") tightens Console.print to require String —
passing Int/Bool/List/Option is now a typecheck error. The standard
idiom is interpolation: Console.print("{x}").
- vera_bench/runner.py: the two `Console.print({entry}({args}))`
injection sites in the per-test wrapper now emit
Console.print("{<call>}"). Two-line change.
- solutions/aver/*.av: 56 baseline solutions migrated to interpolation
via a paren-balanced + string-aware parser so nested expressions
and string arguments survive intact.
- 9 baseline solutions had main() printing only a subset of
test_cases — regenerated main() to print every test case from
the problem JSON. Pre-existing coverage gap, surfaces only after
migration brings them past `aver check`.
- 3 new baselines: VB_T2_011_starts_with_prefix,
VB_T2_012_ends_with_suffix, VB_T2_013_get_char_code (using
String.startsWith / String.endsWith / String.charAt + Char.toCode).
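The harness change in the first bullet can be pictured with a minimal Python sketch. The function name `emit_print_call` and its parameters are hypothetical and are not taken from `vera_bench/runner.py`; only the emitted shape, `Console.print("{<call>}")`, comes from this commit message.

```python
def emit_print_call(entry: str, args: str) -> str:
    """Build the Aver statement that prints one test-case result.

    `entry` is the entry-point name and `args` is the already-rendered
    argument list. Both names are illustrative, not the harness's own.
    """
    call = f"{entry}({args})"
    # Aver 0.16 requires a String argument, so the call is wrapped in
    # interpolation braces inside a string literal instead of being
    # passed to Console.print directly.
    return f'Console.print("{{{call}}}")'


if __name__ == "__main__":
    print(emit_print_call("add", "1, 2"))
```

The doubled braces in the f-string produce the literal `{` and `}` that Aver treats as interpolation delimiters. How the real wrapper renders arguments is not shown here.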
Backward compatible: interpolation predates Console=String by many
versions, so Console.print("{x}") works on Aver 0.10–0.15 and on
0.16+. Nothing regresses on older Aver.
Verified locally on Aver 0.16-wasm-gc-probe:
- baselines: check@1 100%, run_correct 100% (60/60)
- Haiku 4.5 run: check@1 82%, verify@1 93%, run_correct 94% (in
line with the pre-migration ~96% historical Haiku 4.5 result on
Aver 0.12).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8d180fb to 3afbb2a
|
@coderabbitai review |
✅ Actions performed: Review triggered.
|
The Aver test-wrapper harness and 56 canonical baselines now emit
string interpolation (`Console.print("{x}")`) instead of bare
expressions — required for Aver 0.16's typed `Console.print`,
backward-compatible to Aver 0.10. Plus three restored T2 baselines
and 9 coverage-gap fixes (see aallan#65).
This is a methodology change for Aver scoring with two distinct
flavours of compatibility impact:
- On Aver 0.16+: required. Without this, every injected `aver run`
crashes at typecheck and `run_correct = 0%`.
- On Aver 0.10–0.15: scoring may differ slightly between v0.0.10 and
v0.0.11 result files, because the 9 coverage-gap fixes mean
`run_correct` is now measured against the full test_cases set
rather than the partial set the baseline `main()` happened to
print, and the 3 restored T2 baselines now contribute to the Aver
baseline `run_correct` denominator.
Aver baseline rises to 100% check@1 / 100% run_correct against
Aver 0.15.2 (previously 95%/73% on the same compiler). Vera, Vera
spec-from-NL, Python, and TypeScript scoring is unaffected.
Files touched:
- pyproject.toml: 0.0.10 -> 0.0.11
- CITATION.cff: version + date-released bumped together
- CHANGELOG.md: new [0.0.11] section with Compatibility note
documenting both flavours of impact; link references updated
- ROADMAP.md: prepended a v0.0.11 line above the v0.0.10 summary
Verified: `pip install -e .` followed by
`python -c "from importlib.metadata import version; print(version('vera-bench'))"`
reports 0.0.11, full test suite green at 494 cases, Aver baselines
100%/100% locally on Aver 0.15.2.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@jasisz Pushed 2f74785 bumping the bench version Compatibility note in the CHANGELOG calls out both flavours explicitly:
Files touched:
Verified Local verification of the 0.16+ forward-compat claim wasn't possible — This should be the last push from us — once CI goes green I'll merge and tag v0.0.11. |
aallan
left a comment
Approving — ready to merge
Substantive change is correct, well-scoped, and well-tested. Combined with the version-bump scaffolding pushed in 2f74785, this is merge-ready.
What's verified
- Backward-compat to current Aver ✅ Locally regenerated Aver baselines on Aver 0.15.2 (the latest crates.io release): 100% check@1, 100% run_correct, 60/60 across all 5 tiers. Previous baseline on the same compiler was 95%/73%, so the lift is real (3 restored T2 baselines + 9 coverage-gap fixes), not a measurement quirk.
- No regression in non-Aver paths ✅ `pytest tests/` green at 494 cases. `validate` job covers every problem JSON + canonical Vera solution and is green in CI.
- CI green wall ✅ lint, security, dependency-audit, test (3.11/3.12/3.13), validate, codecov patch + project, CodeRabbit: all pass.
- CodeRabbit happy ✅ no findings.
What's taken on faith
- Forward-compat to Aver 0.16+: `cargo search aver-lang` only goes up to 0.15.2; the `0.16-wasm-gc-probe` build you tested against isn't on crates.io. The Console=String breaking change in 0.16 is independently documented, and the interpolation form is the canonical idiom that survives it, so the underlying reason is sound. We just can't run it locally to confirm.
Notable bonus
The 9 coverage-gap fixes are a hidden win. Some Aver baselines were main()-printing only a subset of their test_cases from the problem JSON, so run_correct was being measured against a partial set. Easy to miss because everything appeared to pass, but the fix is doing real correctness work — that's the kind of thing that's valuable to surface in the CHANGELOG, which the v0.0.11 Compatibility note now does.
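A gap like this can be caught with a rough heuristic check, sketched below in Python. The only name taken from this PR is the `test_cases` key of the problem JSON; the function name and the idea of counting `Console.print(` occurrences are assumptions for illustration, not how the bench actually detected the gap.

```python
import json


def coverage_gap(problem_json: str, aver_source: str) -> int:
    """Return how many test cases have no matching Console.print call.

    Heuristic only: it counts every `Console.print(` in the file, not
    just those inside main(), so extra prints elsewhere can hide a gap.
    """
    problem = json.loads(problem_json)
    expected = len(problem["test_cases"])
    printed = aver_source.count("Console.print(")
    return max(expected - printed, 0)


if __name__ == "__main__":
    # Hypothetical problem with three cases and a baseline printing two.
    problem = json.dumps({"test_cases": [{}, {}, {}]})
    source = 'Console.print("{f(1)}")\nConsole.print("{f(2)}")\n'
    print(coverage_gap(problem, source))
```

A non-zero result means `run_correct` would be scored against fewer outputs than the problem defines, which is the situation the 9 fixes address.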
After merge
- Move the v0.0.11 tag to the merge commit, push, create the GitHub release.
- The Aver baseline numbers in the new CHANGELOG (100%/100% on 0.15.2) are now the v0.0.11 reference; if you want to re-run the model sweeps on 0.15.2 to refresh the headline chart's Aver column, that's a cheap follow-up — but not blocking.
- The `assets/results-graph.png` is still pinned to v0.0.7 content, same as before this PR. Unrelated to merge readiness.
Thanks again for two consecutive Aver-forward-compat PRs landing cleanly. The "ship before the upstream breaking change ships" cadence has been working well.
|
@jasisz Closing the loop on the forward-compat claim from this PR — Identical to the numbers on Aver 0.15.2 with this PR applied — the interpolation migration and the three restored T2 baselines all hold up cleanly on the released 0.16.0 compiler. The The "taken on faith" caveat from the v0.0.11 review is now closed. Thanks again for landing it preemptively — having the bench be 0.16-ready on day one of the release rather than scrambling after the fact has been the pattern with v0.0.10 / v0.0.11 and it's working really well. |
Items 2, 3, 4 from @aallan's consolidated review on PR aallan#70. (Item 1, extracting --parallel N into its own PR, is addressed via PR aallan#73.)
### Item 2: README headline section -> single sentence in §Overview
Removed the "AILANG: AI-designed language..." headline section (13 lines: the heading, the description paragraphs, the per-mode results table, the "full-circle finding" paragraph). The phrasing included editorial claims about VeraBench's identity that should be a project-owner call, and "added in this fork" wouldn't read correctly post-merge.
Replaced the §Overview line about baselines with the form @aallan suggested verbatim:
The same problems are also run in Python, TypeScript, [Aver](https://github.com/jasisz/aver), and [AILANG](https://ailang.sunholo.com/) as baselines. AILANG and Aver are zero-training-data languages, providing additional data points alongside Vera for the language-design-vs-training-data thesis.
Matches the existing Aver pattern: light-touch mention without results writeups in the README.
### Item 3: Delete AILANG_MAPPING.md and AILANG_RESULTS.md
Neither file is load-bearing: no code or tests reference them. Aver landed across PRs aallan#57 / aallan#62 / aallan#65 without AVER_RESULTS.md or AVER_MAPPING.md. Numbers and writeups go in PR descriptions and external content; in-repo docs are reserved for things future maintainers need.
### Item 4: .coderabbit.yaml path_filters
Added the two missing AILANG entries to mirror the existing {python, typescript, aver} pattern:
- "!**/*.ail" (alongside !**/*.vera, !**/*.av)
- "!solutions/ailang/**" (alongside the other solutions/* entries)
This stops CodeRabbit from generating speculative findings on .ail solution files in future review passes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why
Aver 0.16 ("Anneal") tightens `Console.print` to require `String`: passing `Int`, `Bool`, `List<T>`, `Option<T>` etc. is now a typecheck error. The previous form (`Console.print(<any>)`) coerced silently; that coercion is gone. The standard idiom is string interpolation: `Console.print("{x}")`.
Without this change every baseline solution and every test wrapper the harness injects fails to typecheck on Aver ≥ 0.16. On a fresh local run with `aver 0.16-wasm-gc-probe`: before the harness fix, every injected `aver run` crashed at typecheck, so LLM runs reported run_correct=0% across the board.
What
- `vera_bench/runner.py`: the two `Console.print({entry}({args}))` injection sites in the per-test wrapper now emit `Console.print("{<call>}")`. Two-line change, exact same shape as the migrated baselines.
- `solutions/aver/*.av`: 56 baseline solutions migrated to interpolation (`Console.print(EXPR)` → `Console.print("{EXPR}")`). Mechanical, applied with a paren-balanced + string-aware parser so nested expressions and string arguments survive intact.
- `solutions/aver/VB_T*_*.av`: 9 solutions where `main()` printed only a subset of `test_cases` got their `main()` regenerated to print every test case from the problem JSON. Pre-existing coverage gap, surfaces only after the interpolation migration brings them past `aver check`.
- 3 new baselines: `VB_T2_011_starts_with_prefix.av`, `VB_T2_012_ends_with_suffix.av`, `VB_T2_013_get_char_code.av` were missing from `solutions/aver/`; added using `String.startsWith` / `String.endsWith` / `String.charAt` + `Char.toCode`.
Backward compatibility
Interpolation predates the Console=String breaking change by many versions, so `Console.print("{x}")` works on Aver 0.10–0.15 and on 0.16+. Nothing regresses on older Aver: the same baselines pass on the previous releases the bench has been tracking.
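The mechanical `Console.print(EXPR)` → `Console.print("{EXPR}")` rewrite described under What could look roughly like the Python sketch below. This is not the script used for the PR. It assumes Aver string literals are double-quoted with backslash escapes, and the names `find_closing_paren`, `is_string_literal`, and `migrate_prints` are invented for illustration.

```python
PREFIX = "Console.print("


def find_closing_paren(src: str, open_idx: int) -> int:
    """Index of the ')' matching the '(' at open_idx, or -1 if unbalanced.

    Parentheses inside double-quoted string literals are ignored.
    """
    depth = 0
    in_string = False
    i = open_idx
    while i < len(src):
        ch = src[i]
        if in_string:
            if ch == "\\":
                i += 1  # skip the escaped character
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return i
        i += 1
    return -1


def is_string_literal(expr: str) -> bool:
    """True if expr is exactly one double-quoted string literal."""
    if len(expr) < 2 or expr[0] != '"':
        return False
    i = 1
    while i < len(expr):
        if expr[i] == "\\":
            i += 2
            continue
        if expr[i] == '"':
            return i == len(expr) - 1
        i += 1
    return False


def migrate_prints(src: str) -> str:
    """Wrap each non-string Console.print argument in "{...}"."""
    out = []
    pos = 0
    while True:
        start = src.find(PREFIX, pos)
        if start == -1:
            out.append(src[pos:])
            return "".join(out)
        open_idx = start + len(PREFIX) - 1
        close_idx = find_closing_paren(src, open_idx)
        if close_idx == -1:
            out.append(src[pos:])  # unbalanced: leave the rest untouched
            return "".join(out)
        expr = src[open_idx + 1:close_idx].strip()
        out.append(src[pos:start])
        if is_string_literal(expr):
            out.append(src[start:close_idx + 1])  # already a String
        else:
            out.append(f'{PREFIX}"{{{expr}}}")')
        pos = close_idx + 1


if __name__ == "__main__":
    print(migrate_prints("Console.print(f(g(1), 2))"))
```

Tracking paren depth and string state is what keeps a nested call such as `f(g(1), 2)` or a `)` inside a string argument from ending the match early. The sketch does not skip `Console.print(` text that appears inside comments or string literals, and it does not address whether quotes inside an interpolated expression need any special handling in Aver.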
Verification
```
$ vera-bench baselines --language aver
check@1: 100% run_correct: 100% (60/60 across all 5 tiers)
$ vera-bench run --model claude-haiku-4-5 --language aver
check@1 82%, verify@1 93%, run_correct 94%
```
Two run_correct misses with Haiku 4.5 are model-side, not harness/compiler:
signature and returns `Result<Int, String>`.
Result is in line with the historical Haiku 4.5 ~96% on Aver 0.12.
🤖 Generated with Claude Code