aver: migrate baselines + harness to Aver 0.16 Console=String by jasisz · Pull Request #65 · aallan/vera-bench

jasisz · 2026-05-04T18:39:48Z

Why

Aver 0.16 ("Anneal") tightens Console.print to require String: passing Int,
Bool, List<T>, Option<T> etc. is now a typecheck error. The previous form
(Console.print(<any>)) coerced its argument silently; that coercion is gone.
The standard idiom is string interpolation: Console.print("{x}").

Without this change every baseline solution and every test wrapper the harness
injects fails to typecheck on Aver ≥ 0.16. On a fresh local run with
aver 0.16-wasm-gc-probe:

| metric | before | after |
| --- | --- | --- |
| baselines check@1 | 7% | 100% |
| baselines run_correct | n/a | 100% |
| Haiku 4.5 run (60 problems) | crashed* | run_correct 94% |

*Before the harness fix, every injected `aver run` crashed at typecheck, so LLM
runs reported run_correct=0% across the board.

What

- vera_bench/runner.py: the two Console.print({entry}({args}))
  injection sites in the per-test wrapper now emit Console.print("{<call>}").
  Two-line change, exact same shape as the migrated baselines.
- solutions/aver/*.av: 56 baseline solutions migrated to interpolation
  (Console.print(EXPR) → Console.print("{EXPR}")). Mechanical, applied
  with a paren-balanced + string-aware parser so nested expressions and string
  arguments survive intact.
- solutions/aver/VB_T*_*.av: 9 solutions where main() printed only a
  subset of test_cases got their main() regenerated to print every test
  case from the problem JSON. This is a pre-existing coverage gap that surfaced
  only after the interpolation migration brought them past `aver check`.
- 3 new baselines: VB_T2_011_starts_with_prefix.av,
  VB_T2_012_ends_with_suffix.av, VB_T2_013_get_char_code.av were missing
  from solutions/aver/; added using String.startsWith / String.endsWith /
  String.charAt + Char.toCode.
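
For readers curious what "paren-balanced + string-aware" means in practice, here is a minimal sketch of such a rewrite in Python. It is not the script used for this PR: `find_matching_paren`, `is_single_string_literal` and `migrate_prints` are hypothetical names, the sketch assumes double-quoted string literals with backslash escapes, and it does not address how Aver treats quotes inside an interpolation or a `Console.print(` that appears inside a comment or string.

```python
def find_matching_paren(src: str, open_idx: int) -> int:
    """Return the index of the ')' matching the '(' at open_idx.

    Parentheses inside double-quoted string literals are ignored.
    """
    depth = 0
    in_string = False
    i = open_idx
    while i < len(src):
        ch = src[i]
        if in_string:
            if ch == "\\":
                i += 2  # skip the escaped character
                continue
            if ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return i
        i += 1
    raise ValueError("unbalanced parentheses")


def is_single_string_literal(expr: str) -> bool:
    """True if expr is exactly one double-quoted literal, e.g. "hi"."""
    if not expr.startswith('"'):
        return False
    i = 1
    while i < len(expr):
        if expr[i] == "\\":
            i += 2
            continue
        if expr[i] == '"':
            return i == len(expr) - 1
        i += 1
    return False


def migrate_prints(src: str) -> str:
    """Rewrite Console.print(EXPR) to Console.print("{EXPR}")."""
    marker = "Console.print("
    out = []
    pos = 0
    while True:
        start = src.find(marker, pos)
        if start == -1:
            out.append(src[pos:])
            return "".join(out)
        open_idx = start + len(marker) - 1
        close_idx = find_matching_paren(src, open_idx)
        expr = src[open_idx + 1:close_idx].strip()
        out.append(src[pos:open_idx + 1])
        # Arguments that are already a plain string are left alone.
        out.append(expr if is_single_string_literal(expr) else '"{' + expr + '}"')
        out.append(")")
        pos = close_idx + 1


print(migrate_prints("Console.print(add(1, mul(2, 3)))"))
```

The depth counter is what keeps nested calls intact, and the `in_string` flag is what stops a `)` inside a string argument from closing the call early.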

Backward compatibility

Interpolation predates the Console=String breaking change by many versions, so
Console.print("{x}") works on Aver 0.10–0.15 and on 0.16+. Nothing regresses
on older Aver: the same baselines pass on the previous releases the bench has been
tracking.

Verification

```
$ vera-bench baselines --language aver
check@1: 100% run_correct: 100% (60/60 across all 5 tiers)

$ vera-bench run --model claude-haiku-4-5 --language aver
check@1 82%, verify@1 93%, run_correct 94%
```

Two run_correct misses with Haiku 4.5 are model-side, not harness/compiler:

- VB-T1-007 (`safe_modulo`): Haiku consistently ignores the `@Int -> @Int`
  signature and returns `Result<Int, String>`.
- VB-T4-003 (`is_even`): flaky on retry, passed 100% when re-run.

Result is in line with the historical Haiku 4.5 ~96% on Aver 0.12.

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes
- Test execution output now prints evaluated results as properly formatted strings rather than raw expressions, ensuring clearer and more readable test run logs.

coderabbitai · 2026-05-04T18:40:02Z

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7453080d-b991-45b4-a887-d3e7c7ef543e

📥 Commits

Reviewing files that changed from the base of the PR and between 3afbb2a and 2f74785.

📒 Files selected for processing (4)

CHANGELOG.md
CITATION.cff
ROADMAP.md
pyproject.toml

Walkthrough

The Aver test-wrapper in the Vera benchmark harness now prints the entry-point invocation as a string-interpolated representation (including braces) instead of printing the raw expression, updated in both existing-module and new-module wrapper code paths.

Changes

Test Result Output Formatting

| Layer / File(s) | Summary |
| --- | --- |
| Core Change: `vera_bench/runner.py` | Replaced `Console.print({entry_point}({args_str}))` with `Console.print("{{{entry_point}({args_str})}}")` when generating `fn main() -> Unit` in both wrapper branches. |
| Context / Surrounding Logic: `vera_bench/runner.py` | Change occurs inside `_evaluate_aver_code` where the Aver wrapper code is synthesized for existing-module and wrapped-module cases (around lines 502–515). |
| Tests / Docs (none changed) | No test or documentation files altered in this diff. |
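
The tripled braces in the replacement read oddly at first. A minimal sketch, assuming the wrapper line is built with a Python f-string as the summary suggests: `{{` and `}}` are escapes for literal braces, and the inner `{entry_point}` and `{args_str}` are ordinary substitutions. `build_print_line` is a hypothetical helper used only for illustration, not a function in `vera_bench/runner.py`.

```python
def build_print_line(entry_point: str, args_str: str) -> str:
    # '{{' and '}}' produce literal '{' and '}' in an f-string;
    # {entry_point} and {args_str} are substituted as usual.
    return f'Console.print("{{{entry_point}({args_str})}}")'


line = build_print_line("add", "1, 2")
print(line)  # Console.print("{add(1, 2)}")
```

The emitted Aver line therefore wraps the whole call in one interpolation, which is the same shape as the migrated baselines.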

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

problems, harness

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately reflects the primary change: migrating Aver code to version 0.16 with string-based Console output, which is the core technical change across the codebase. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.

codecov · 2026-05-04T18:46:23Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.62%. Comparing base (100075f) to head (2f74785).

Additional details and impacted files

@@           Coverage Diff           @@
##             main      #65   +/-   ##
=======================================
  Coverage   83.62%   83.62%           
=======================================
  Files          10       10           
  Lines        1392     1392           
=======================================
  Hits         1164     1164           
  Misses        228      228

Flag	Coverage Δ
python	`83.62% <ø> (ø)`

Aver 0.16 ("Anneal") tightens Console.print to require String: passing Int/Bool/List/Option is now a typecheck error. The standard idiom is interpolation: Console.print("{x}").

- vera_bench/runner.py: the two `Console.print({entry}({args}))` injection sites in the per-test wrapper now emit Console.print("{<call>}"). Two-line change.
- solutions/aver/*.av: 56 baseline solutions migrated to interpolation via a paren-balanced + string-aware parser so nested expressions and string arguments survive intact.
- 9 baseline solutions had main() printing only a subset of test_cases; regenerated main() to print every test case from the problem JSON. Pre-existing coverage gap, surfaces only after migration brings them past `aver check`.
- 3 new baselines: VB_T2_011_starts_with_prefix, VB_T2_012_ends_with_suffix, VB_T2_013_get_char_code (using String.startsWith / String.endsWith / String.charAt + Char.toCode).

Backward compatible: interpolation predates Console=String by many versions, so Console.print("{x}") works on Aver 0.10–0.15 and on 0.16+. Nothing regresses on older Aver.

Verified locally on Aver 0.16-wasm-gc-probe:

- baselines: check@1 100%, run_correct 100% (60/60)
- Haiku 4.5 run: check@1 82%, verify@1 93%, run_correct 94% (in line with the pre-migration ~96% historical Haiku 4.5 result on Aver 0.12).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

aallan · 2026-05-04T20:34:16Z

@coderabbitai review

coderabbitai · 2026-05-04T20:34:45Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai · 2026-05-04T20:35:59Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

The Aver test-wrapper harness and 56 canonical baselines now emit string interpolation (`Console.print("{x}")`) instead of bare expressions. This is required for Aver 0.16's typed `Console.print` and is backward-compatible to Aver 0.10. Plus three restored T2 baselines and 9 coverage-gap fixes (see aallan#65).

This is a methodology change for Aver scoring with two distinct flavours of compatibility impact:

- On Aver 0.16+: required. Without this, every injected `aver run` crashes at typecheck and `run_correct = 0%`.
- On Aver 0.10–0.15: scoring may differ slightly between v0.0.10 and v0.0.11 result files, because the 9 coverage-gap fixes mean `run_correct` is now measured against the full test_cases set rather than the partial set the baseline `main()` happened to print, and the 3 restored T2 baselines now contribute to the Aver baseline `run_correct` denominator.

Aver baseline rises to 100% check@1 / 100% run_correct against Aver 0.15.2 (previously 95%/73% on the same compiler). Vera, Vera spec-from-NL, Python, and TypeScript scoring is unaffected.

Files touched:

- pyproject.toml: 0.0.10 -> 0.0.11
- CITATION.cff: version + date-released bumped together
- CHANGELOG.md: new [0.0.11] section with Compatibility note documenting both flavours of impact; link references updated
- ROADMAP.md: prepended a v0.0.11 line above the v0.0.10 summary

Verified: `pip install -e .` followed by `python -c "from importlib.metadata import version; print(version('vera-bench'))"` reports 0.0.11, full test suite green at 494 cases, Aver baselines 100%/100% locally on Aver 0.15.2.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

aallan · 2026-05-04T21:01:33Z

@jasisz Pushed 2f74785 bumping the bench version 0.0.10 → 0.0.11, same scaffolding as the v0.0.10 bump on PR #62. This is a methodology change for Aver scoring (different test-wrapper format, three restored baselines, nine coverage-gap fixes), so result files written by the post-merge harness shouldn't be tagged 0.0.10.

Compatibility note in the CHANGELOG calls out both flavours explicitly:

- On Aver 0.16+: required. Without this PR, every injected aver run crashes at typecheck and run_correct = 0%.
- On Aver 0.10–0.15: backward-compatible, but Aver scoring may differ slightly between v0.0.10 and v0.0.11 result files because (a) the 9 coverage-gap fixes change which test_cases get checked, and (b) the 3 restored T2 baselines now contribute to the run_correct denominator.

Files touched:

- pyproject.toml: 0.0.10 → 0.0.11
- CITATION.cff: version + date-released bumped together (date 2026-05-04)
- CHANGELOG.md: new [0.0.11] section with Compatibility note documenting both flavours
- ROADMAP.md: prepended a v0.0.11 line above the v0.0.10 summary

Verified importlib.metadata.version("vera-bench") == "0.0.11" after a fresh pip install -e ., full Python test suite green at 494 cases, Aver baselines 100% / 100% on Aver 0.15.2.
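
The version check above can be repeated with the standard-library `importlib.metadata` API. The sketch below is only an illustration: `installed_version` is a hypothetical helper, and it returns `None` rather than raising when the distribution is not installed in the current environment.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str):
    """Return the installed version string, or None if not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


found = installed_version("vera-bench")
print(found if found is not None else "vera-bench is not installed here")
```

After a fresh `pip install -e .` on this branch, the printed value should match the version in pyproject.toml.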

Local verification of the 0.16+ forward-compat claim wasn't possible — cargo search aver-lang only goes up to 0.15.2 and you mentioned testing with a 0.16-wasm-gc-probe build that isn't on crates.io yet. Taking your word for that one. The backward-compat to 0.15 is independently verified.

This should be the last push from us — once CI goes green I'll merge and tag v0.0.11.

aallan

Approving — ready to merge

Substantive change is correct, well-scoped, and well-tested. Combined with the version-bump scaffolding pushed in 2f74785, this is merge-ready.

What's verified

- Backward-compat to current Aver ✅ Locally regenerated Aver baselines on Aver 0.15.2 (the latest crates.io release): 100% check@1, 100% run_correct, 60/60 across all 5 tiers. Previous baseline on the same compiler was 95%/73%, so the lift is real (3 restored T2 baselines + 9 coverage-gap fixes), not a measurement quirk.
- No regression in non-Aver paths ✅ pytest tests/ green at 494 cases. validate job covers every problem JSON + canonical Vera solution and is green in CI.
- CI green wall ✅ lint, security, dependency-audit, test (3.11/3.12/3.13), validate, codecov patch + project, CodeRabbit: all pass.
- CodeRabbit happy ✅ no findings.

What's taken on faith

Forward-compat to Aver 0.16+ — cargo search aver-lang only goes up to 0.15.2; the 0.16-wasm-gc-probe build you tested against isn't on crates.io. The Console=String breaking change in 0.16 is independently documented, and the interpolation form is the canonical idiom that survives it, so the underlying reason is sound. We just can't run it locally to confirm.

Notable bonus

The 9 coverage-gap fixes are a hidden win. Some Aver baselines were main()-printing only a subset of their test_cases from the problem JSON, so run_correct was being measured against a partial set. Easy to miss because everything appeared to pass, but the fix is doing real correctness work — that's the kind of thing that's valuable to surface in the CHANGELOG, which the v0.0.11 Compatibility note now does.
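
A gap of this kind can be flagged mechanically. The following is a rough sketch, not something the harness does: it assumes each test case corresponds to exactly one `Console.print(` call in `main()`, the `coverage_gap` helper is hypothetical, and the problem data and Aver lines are made up for illustration. Only the `test_cases` key is taken from the problem JSON described above.

```python
def coverage_gap(test_cases: list, solution_src: str) -> int:
    """Return how many test cases have no corresponding Console.print call."""
    printed = solution_src.count("Console.print(")
    return max(len(test_cases) - printed, 0)


# Made-up problem data: three test cases, but only two are printed.
problem = {"test_cases": [{"args": [4]}, {"args": [7]}, {"args": [0]}]}
partial_main = (
    'Console.print("{is_even(4)}")\n'
    'Console.print("{is_even(7)}")\n'
)

print(coverage_gap(problem["test_cases"], partial_main))  # 1
```

A non-zero result means run_correct is being measured against a partial set, which is the situation the nine regenerated main() functions address.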

After merge

- Move the v0.0.11 tag to the merge commit, push, create the GitHub release.
- The Aver baseline numbers in the new CHANGELOG (100%/100% on 0.15.2) are now the v0.0.11 reference; if you want to re-run the model sweeps on 0.15.2 to refresh the headline chart's Aver column, that's a cheap follow-up, but not blocking.
- The assets/results-graph.png is still pinned to v0.0.7 content, same as before this PR. Unrelated to merge readiness.

Thanks again for two consecutive Aver-forward-compat PRs landing cleanly. The "ship before the upstream breaking change ships" cadence has been working well.

aallan · 2026-05-05T10:06:18Z

@jasisz Closing the loop on the forward-compat claim from this PR — aver-lang 0.16.0 landed on crates.io overnight. Upgraded locally and re-ran the baselines:

vera-bench baselines --language aver
  Problems       │        60
  check@1        │      100%
  run_correct    │      100%
  Tier 1 check@1 │ 100% (10)
  Tier 2 check@1 │ 100% (15)
  Tier 3 check@1 │ 100% (15)
  Tier 4 check@1 │ 100% (10)
  Tier 5 check@1 │ 100% (10)

Identical to the numbers on Aver 0.15.2 with this PR applied — the interpolation migration and the three restored T2 baselines all hold up cleanly on the released 0.16.0 compiler. The 0.16-wasm-gc-probe you tested against during development is observably the same Aver-language semantics as what shipped.

The "taken on faith" caveat from the v0.0.11 review is now closed. Thanks again for landing it preemptively — having the bench be 0.16-ready on day one of the release rather than scrambling after the fact has been the pattern with v0.0.10 / v0.0.11 and it's working really well.

@aallan

Items 2, 3, 4 from @aallan's consolidated review on PR aallan#70. (Item 1, extracting --parallel N into its own PR, addressed via PR aallan#73.)

### Item 2: README headline section -> single sentence in §Overview

Removed the "AILANG: AI-designed language..." headline section (13 lines: the heading, the description paragraphs, the per-mode results table, the "full-circle finding" paragraph). The phrasing included editorial claims about VeraBench's identity that should be a project-owner call, and "added in this fork" wouldn't read correctly post-merge. Replaced the §Overview line about baselines with the form @aallan suggested verbatim:

The same problems are also run in Python, TypeScript, [Aver](https://github.com/jasisz/aver), and [AILANG](https://ailang.sunholo.com/) as baselines. AILANG and Aver are zero-training-data languages, providing additional data points alongside Vera for the language-design-vs-training-data thesis.

Matches the existing Aver pattern: light-touch mention without results writeups in the README.

### Item 3: Delete AILANG_MAPPING.md and AILANG_RESULTS.md

Neither file is load-bearing: no code or tests reference them. Aver landed across PRs aallan#57 / aallan#62 / aallan#65 without AVER_RESULTS.md or AVER_MAPPING.md. Numbers and writeups go in PR descriptions and external content; in-repo docs are reserved for things future maintainers need.

### Item 4: .coderabbit.yaml path_filters

Added the two missing AILANG entries to mirror the existing {python, typescript, aver} pattern:

- "!**/*.ail" (alongside !**/*.vera, !**/*.av)
- "!solutions/ailang/**" (alongside the other solutions/* entries)

This stops CodeRabbit from generating speculative findings on .ail solution files in future review passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jasisz requested a review from aallan as a code owner May 4, 2026 18:39

jasisz force-pushed the aver-016-console-string-migration branch from 8d180fb to 3afbb2a Compare May 4, 2026 18:50

aallan approved these changes May 4, 2026

View reviewed changes

aallan merged commit 5b5b2a8 into aallan:main May 4, 2026
10 checks passed

aallan mentioned this pull request May 5, 2026

0.16 "Anneal" — Console=String, native wasm-gc, soundier internals jasisz/aver#14

Merged

aallan merged 2 commits into aallan:main from jasisz:aver-016-console-string-migration
