aver: migrate baselines + harness to Aver 0.16 Console=String #65

Merged
aallan merged 2 commits into aallan:main from jasisz:aver-016-console-string-migration
May 4, 2026
Conversation

@jasisz

@jasisz jasisz commented May 4, 2026

Contributor

Why

Aver 0.16 ("Anneal") tightens Console.print to require String: passing Int,
Bool, List<T>, Option<T>, etc. is now a typecheck error. The previous form
(Console.print(<any>)) coerced its argument silently; that coercion is gone.
The standard idiom is string interpolation: Console.print("{x}").

Without this change every baseline solution and every test wrapper the harness
injects fails to typecheck on Aver ≥ 0.16. On a fresh local run with
aver 0.16-wasm-gc-probe:

| metric | before | after |
|---|---|---|
| baselines check@1 | 7% | 100% |
| baselines run_correct | | 100% |
| Haiku 4.5 run (60 problems) | crashed* | run_correct 94% |

*Before the harness fix, every injected `aver run` crashed at typecheck, so LLM
runs reported run_correct=0% across the board.

What

  1. vera_bench/runner.py: the two Console.print({entry}({args}))
    injection sites in the per-test wrapper now emit Console.print("{<call>}").
    Two-line change, exact same shape as the migrated baselines.

  2. solutions/aver/*.av: 56 baseline solutions migrated to interpolation
    (Console.print(EXPR) → Console.print("{EXPR}")). Mechanical, applied
    with a paren-balanced + string-aware parser so nested expressions and string
    arguments survive intact.

  3. solutions/aver/VB_T*_*.av: 9 solutions where main() printed only a
    subset of test_cases got their main() regenerated to print every test
    case from the problem JSON. Pre-existing coverage gap, surfaces only after
    the interpolation migration brings them past `aver check`.

  4. 3 new baselines: VB_T2_011_starts_with_prefix.av,
    VB_T2_012_ends_with_suffix.av, VB_T2_013_get_char_code.av were missing
    from solutions/aver/; added using String.startsWith / String.endsWith /
    String.charAt + Char.toCode.
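
The migration script itself is not part of this description. As a rough illustration of what "paren-balanced + string-aware" could mean for item 2, here is a minimal Python sketch. The function names are hypothetical, and it assumes Aver string literals use double quotes with backslash escapes; whether Aver accepts quoted arguments nested inside an interpolation is not something this sketch verifies.

```python
def _find_closing_paren(text, pos):
    """Return the index of the ')' matching an already-open '(' before pos."""
    depth = 1
    in_string = False
    while pos < len(text):
        ch = text[pos]
        if in_string:
            if ch == "\\":
                pos += 1  # skip the escaped character (assumed escape syntax)
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return pos
        pos += 1
    return None  # unbalanced input


def _is_string_literal(expr):
    """True if expr is exactly one double-quoted literal."""
    if len(expr) < 2 or expr[0] != '"':
        return False
    pos = 1
    while pos < len(expr):
        if expr[pos] == "\\":
            pos += 2
            continue
        if expr[pos] == '"':
            return pos == len(expr) - 1
        pos += 1
    return False


def migrate_console_print(source):
    """Rewrite Console.print(EXPR) as Console.print("{EXPR}")."""
    marker = "Console.print("
    out = []
    i = 0
    while True:
        start = source.find(marker, i)
        if start == -1:
            out.append(source[i:])
            break
        arg_start = start + len(marker)
        end = _find_closing_paren(source, arg_start)
        if end is None:
            out.append(source[i:])  # leave unbalanced remainder untouched
            break
        expr = source[arg_start:end].strip()
        out.append(source[i:arg_start])
        # A lone string literal is already a String, so it is kept as-is.
        out.append(expr if _is_string_literal(expr) else '"{' + expr + '}"')
        out.append(")")
        i = end + 1
    return "".join(out)
```

Tracking string state is what keeps a `)` inside a string argument from closing the call early, and tracking depth is what keeps nested calls such as `add(1, mul(2, 3))` in one piece.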

Backward compatibility

Interpolation predates the Console=String breaking change by many versions, so
Console.print("{x}") works on Aver 0.10–0.15 and on 0.16+. Nothing regresses
on older Aver: the same baselines pass on the previous releases the bench has
been tracking.

Verification

```
$ vera-bench baselines --language aver
check@1: 100% run_correct: 100% (60/60 across all 5 tiers)

$ vera-bench run --model claude-haiku-4-5 --language aver
check@1 82%, verify@1 93%, run_correct 94%
```

Two run_correct misses with Haiku 4.5 are model-side, not harness/compiler:

  • VB-T1-007 (`safe_modulo`) — Haiku consistently ignores `@Int -> @Int`
    signature and returns `Result<Int, String>`.
  • VB-T4-003 (`is_even`) — flaky on retry, passed 100% when re-run.

Result is in line with the historical Haiku 4.5 ~96% on Aver 0.12.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Test execution output now prints evaluated results as properly formatted strings rather than raw expressions, ensuring clearer and more readable test run logs.

@jasisz jasisz requested a review from aallan as a code owner May 4, 2026 18:39
@coderabbitai

coderabbitai Bot commented May 4, 2026


Warning

Rate limit exceeded

@aallan has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 33 minutes and 20 seconds before requesting another review.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7453080d-b991-45b4-a887-d3e7c7ef543e

📥 Commits

Reviewing files that changed from the base of the PR and between 3afbb2a and 2f74785.

📒 Files selected for processing (4)
  • CHANGELOG.md
  • CITATION.cff
  • ROADMAP.md
  • pyproject.toml
Walkthrough

The Aver test-wrapper in the Vera benchmark harness now prints the entry-point invocation as a string-interpolated representation (including braces) instead of printing the raw expression, updated in both existing-module and new-module wrapper code paths.

Changes

Test Result Output Formatting

| Layer / File(s) | Summary |
|---|---|
| Core Change (vera_bench/runner.py) | Replaced Console.print({entry_point}({args_str})) with Console.print("{{{entry_point}({args_str})}}") when generating fn main() -> Unit in both wrapper branches. |
| Context / Surrounding Logic (vera_bench/runner.py) | Change occurs inside _evaluate_aver_code where the Aver wrapper code is synthesized for existing-module and wrapped-module cases (around lines 502–515). |
| Tests / Docs (none changed) | No test or documentation files altered in this diff. |
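
The triple braces in the replacement come from Python f-string escaping: `{{` and `}}` emit literal braces, and the inner `{entry_point}` is substituted. A small sketch of the before and after forms follows; the helper names and the sample arguments are illustrative and are not taken from runner.py.

```python
def render_print_call_old(entry_point, args_str):
    # Pre-migration shape: the call is passed to Console.print directly.
    return f"Console.print({entry_point}({args_str}))"


def render_print_call(entry_point, args_str):
    # "{{" and "}}" are literal braces in an f-string, so this yields
    # Console.print("{<call>}"), i.e. Aver string interpolation.
    return f'Console.print("{{{entry_point}({args_str})}}")'


print(render_print_call_old("safe_modulo", "7, 3"))
# Console.print(safe_modulo(7, 3))
print(render_print_call("safe_modulo", "7, 3"))
# Console.print("{safe_modulo(7, 3)}")
```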

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

problems, harness

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately reflects the primary change: migrating Aver code to version 0.16 with string-based Console output, which is the core technical change across the codebase. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.


@codecov

codecov Bot commented May 4, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.62%. Comparing base (100075f) to head (2f74785).

Additional details and impacted files
@@           Coverage Diff           @@
##             main      #65   +/-   ##
=======================================
  Coverage   83.62%   83.62%           
=======================================
  Files          10       10           
  Lines        1392     1392           
=======================================
  Hits         1164     1164           
  Misses        228      228           

| Flag | Coverage Δ |
|---|---|
| python | 83.62% <ø> (ø) |


Aver 0.16 ("Anneal") tightens Console.print to require String —
passing Int/Bool/List/Option is now a typecheck error. The standard
idiom is interpolation: Console.print("{x}").

- vera_bench/runner.py: the two `Console.print({entry}({args}))`
  injection sites in the per-test wrapper now emit
  Console.print("{<call>}"). Two-line change.
- solutions/aver/*.av: 56 baseline solutions migrated to interpolation
  via a paren-balanced + string-aware parser so nested expressions
  and string arguments survive intact.
- 9 baseline solutions had main() printing only a subset of
  test_cases — regenerated main() to print every test case from
  the problem JSON. Pre-existing coverage gap, surfaces only after
  migration brings them past `aver check`.
- 3 new baselines: VB_T2_011_starts_with_prefix,
  VB_T2_012_ends_with_suffix, VB_T2_013_get_char_code (using
  String.startsWith / String.endsWith / String.charAt + Char.toCode).

Backward compatible: interpolation predates Console=String by many
versions, so Console.print("{x}") works on Aver 0.10–0.15 and on
0.16+. Nothing regresses on older Aver.

Verified locally on Aver 0.16-wasm-gc-probe:
- baselines:        check@1 100%, run_correct 100% (60/60)
- Haiku 4.5 run:    check@1 82%, verify@1 93%, run_correct 94% (in
  line with the pre-migration ~96% historical Haiku 4.5 result on
  Aver 0.12).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jasisz jasisz force-pushed the aver-016-console-string-migration branch from 8d180fb to 3afbb2a Compare May 4, 2026 18:50
@aallan

aallan commented May 4, 2026

Owner

@coderabbitai review

@coderabbitai

coderabbitai Bot commented May 4, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


The Aver test-wrapper harness and 56 canonical baselines now emit
string interpolation (`Console.print("{x}")`) instead of bare
expressions — required for Aver 0.16's typed `Console.print`,
backward-compatible to Aver 0.10. Plus three restored T2 baselines
and 9 coverage-gap fixes (see aallan#65).

This is a methodology change for Aver scoring with two distinct
flavours of compatibility impact:

- On Aver 0.16+: required. Without this, every injected `aver run`
  crashes at typecheck and `run_correct = 0%`.
- On Aver 0.10–0.15: scoring may differ slightly between v0.0.10 and
  v0.0.11 result files, because the 9 coverage-gap fixes mean
  `run_correct` is now measured against the full test_cases set
  rather than the partial set the baseline `main()` happened to
  print, and the 3 restored T2 baselines now contribute to the Aver
  baseline `run_correct` denominator.

Aver baseline rises to 100% check@1 / 100% run_correct against
Aver 0.15.2 (previously 95%/73% on the same compiler). Vera, Vera
spec-from-NL, Python, and TypeScript scoring is unaffected.

Files touched:
- pyproject.toml: 0.0.10 -> 0.0.11
- CITATION.cff: version + date-released bumped together
- CHANGELOG.md: new [0.0.11] section with Compatibility note
  documenting both flavours of impact; link references updated
- ROADMAP.md: prepended a v0.0.11 line above the v0.0.10 summary

Verified: `pip install -e .` followed by
`python -c "from importlib.metadata import version; print(version('vera-bench'))"`
reports 0.0.11, full test suite green at 494 cases, Aver baselines
100%/100% locally on Aver 0.15.2.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@aallan

aallan commented May 4, 2026

Owner

@jasisz Pushed 2f74785 bumping the bench version 0.0.10 → 0.0.11, same scaffolding as the v0.0.10 bump on PR #62. This is a methodology change for Aver scoring (different test-wrapper format, three restored baselines, nine coverage-gap fixes), so result files written by the post-merge harness shouldn't be tagged 0.0.10.

Compatibility note in the CHANGELOG calls out both flavours explicitly:

  • On Aver 0.16+ — required. Without this PR, every injected aver run crashes at typecheck and run_correct = 0%.
  • On Aver 0.10–0.15 — backward-compatible, but Aver scoring may differ slightly between v0.0.10 and v0.0.11 result files because (a) the 9 coverage-gap fixes change which test_cases get checked, and (b) the 3 restored T2 baselines now contribute to the run_correct denominator.

Files touched:

  • pyproject.toml: 0.0.10 → 0.0.11
  • CITATION.cff: version + date-released bumped together (date 2026-05-04)
  • CHANGELOG.md: new [0.0.11] section with Compatibility note documenting both flavours
  • ROADMAP.md: prepended a v0.0.11 line above the v0.0.10 summary

Verified importlib.metadata.version("vera-bench") == "0.0.11" after a fresh pip install -e ., full Python test suite green at 494 cases, Aver baselines 100% / 100% on Aver 0.15.2.
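
The version check above can be wrapped in a small helper so that a missing install reports cleanly instead of raising. This is a sketch using the standard-library `importlib.metadata` API; the helper names are hypothetical and not part of the bench.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string, or None if not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def matches_expected(dist_name, expected):
    """True only if the distribution is installed at exactly `expected`."""
    return installed_version(dist_name) == expected


if __name__ == "__main__":
    # After `pip install -e .`, this is expected to print True for 0.0.11.
    print(matches_expected("vera-bench", "0.0.11"))
```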

Local verification of the 0.16+ forward-compat claim wasn't possible — cargo search aver-lang only goes up to 0.15.2 and you mentioned testing with a 0.16-wasm-gc-probe build that isn't on crates.io yet. Taking your word for that one. The backward-compat to 0.15 is independently verified.

This should be the last push from us — once CI goes green I'll merge and tag v0.0.11.

@aallan aallan left a comment

Owner


Approving — ready to merge

Substantive change is correct, well-scoped, and well-tested. Combined with the version-bump scaffolding pushed in 2f74785, this is merge-ready.

What's verified

  • Backward-compat to current Aver ✅ Locally regenerated Aver baselines on Aver 0.15.2 (the latest crates.io release): 100% check@1, 100% run_correct, 60/60 across all 5 tiers. Previous baseline on the same compiler was 95%/73%, so the lift is real (3 restored T2 baselines + 9 coverage-gap fixes), not a measurement quirk.
  • No regression in non-Aver paths ✅ pytest tests/ green at 494 cases. The validate job covers every problem JSON + canonical Vera solution and is green in CI.
  • CI green wall ✅ lint, security, dependency-audit, test (3.11/3.12/3.13), validate, codecov patch + project, CodeRabbit — all pass.
  • CodeRabbit happy ✅ no findings.

What's taken on faith

  • Forward-compat to Aver 0.16+: cargo search aver-lang only goes up to 0.15.2; the 0.16-wasm-gc-probe build you tested against isn't on crates.io. The Console=String breaking change in 0.16 is independently documented, and the interpolation form is the canonical idiom that survives it, so the underlying reason is sound. We just can't run it locally to confirm.

Notable bonus

The 9 coverage-gap fixes are a hidden win. Some Aver baselines were main()-printing only a subset of their test_cases from the problem JSON, so run_correct was being measured against a partial set. Easy to miss because everything appeared to pass, but the fix is doing real correctness work — that's the kind of thing that's valuable to surface in the CHANGELOG, which the v0.0.11 Compatibility note now does.
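
A gap like this could be flagged mechanically by comparing how many test cases a problem declares with how many print calls its baseline contains. The sketch below is only an illustration: it assumes the problem JSON exposes a top-level `test_cases` list and that main() emits one Console.print call per case, neither of which is spelled out in this thread, and it would miscount if helper functions also print.

```python
import json


def coverage_gap(problem_json, solution_source):
    """Return how many declared test cases have no matching print call.

    Assumes a top-level "test_cases" list (hypothetical schema) and one
    Console.print( call per test case in the solution source.
    """
    problem = json.loads(problem_json)
    expected = len(problem["test_cases"])
    printed = solution_source.count("Console.print(")
    return max(expected - printed, 0)


# Hypothetical problem with three cases and a baseline printing only two.
problem = json.dumps({"test_cases": [{"args": [4]}, {"args": [7]}, {"args": [0]}]})
solution = 'Console.print("{is_even(4)}")\nConsole.print("{is_even(7)}")\n'
print(coverage_gap(problem, solution))  # 1
```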

After merge

  1. Move the v0.0.11 tag to the merge commit, push, create the GitHub release.
  2. The Aver baseline numbers in the new CHANGELOG (100%/100% on 0.15.2) are now the v0.0.11 reference; if you want to re-run the model sweeps on 0.15.2 to refresh the headline chart's Aver column, that's a cheap follow-up — but not blocking.
  3. The assets/results-graph.png is still pinned to v0.0.7 content — same as before this PR. Unrelated to merge readiness.

Thanks again for two consecutive Aver-forward-compat PRs landing cleanly. The "ship before the upstream breaking change ships" cadence has been working well.

@aallan aallan merged commit 5b5b2a8 into aallan:main May 4, 2026
10 checks passed
@aallan

aallan commented May 5, 2026

Owner

@jasisz Closing the loop on the forward-compat claim from this PR — aver-lang 0.16.0 landed on crates.io overnight. Upgraded locally and re-ran the baselines:

vera-bench baselines --language aver
  Problems       │        60
  check@1        │      100%
  run_correct    │      100%
  Tier 1 check@1 │ 100% (10)
  Tier 2 check@1 │ 100% (15)
  Tier 3 check@1 │ 100% (15)
  Tier 4 check@1 │ 100% (10)
  Tier 5 check@1 │ 100% (10)

Identical to the numbers on Aver 0.15.2 with this PR applied — the interpolation migration and the three restored T2 baselines all hold up cleanly on the released 0.16.0 compiler. The 0.16-wasm-gc-probe you tested against during development is observably the same Aver-language semantics as what shipped.

The "taken on faith" caveat from the v0.0.11 review is now closed. Thanks again for landing it preemptively — having the bench be 0.16-ready on day one of the release rather than scrambling after the fact has been the pattern with v0.0.10 / v0.0.11 and it's working really well.

sunholo-voight-kampff added a commit to sunholo-voight-kampff/vera-bench that referenced this pull request May 22, 2026
Items 2, 3, 4 from @aallan's consolidated review on PR aallan#70.
(Item 1 — extracting --parallel N into its own PR — addressed via
PR aallan#73.)

### Item 2: README headline section -> single sentence in §Overview

Removed the "AILANG: AI-designed language..." headline section
(13 lines: the heading, the description paragraphs, the per-mode
results table, the "full-circle finding" paragraph). The phrasing
included editorial claims about VeraBench's identity that should
be a project-owner call, and "added in this fork" wouldn't read
correctly post-merge.

Replaced the §Overview line about baselines with the form
@aallan suggested verbatim:

  The same problems are also run in Python, TypeScript,
  [Aver](https://github.com/jasisz/aver), and [AILANG](https://ailang.sunholo.com/)
  as baselines. AILANG and Aver are zero-training-data languages,
  providing additional data points alongside Vera for the
  language-design-vs-training-data thesis.

Matches the existing Aver pattern: light-touch mention without
results writeups in the README.

### Item 3: Delete AILANG_MAPPING.md and AILANG_RESULTS.md

Neither file is load-bearing — no code or tests reference them.
Aver landed across PRs aallan#57 / aallan#62 / aallan#65 without AVER_RESULTS.md or
AVER_MAPPING.md. Numbers and writeups go in PR descriptions and
external content; in-repo docs are reserved for things future
maintainers need.

### Item 4: .coderabbit.yaml path_filters

Added the two missing AILANG entries to mirror the existing
{python, typescript, aver} pattern:

    - "!**/*.ail"             (alongside !**/*.vera, !**/*.av)
    - "!solutions/ailang/**"  (alongside the other solutions/* entries)

This stops CodeRabbit from generating speculative findings on
.ail solution files in future review passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2 participants