Commit 2f74785

aallan and claude committed
Bump version to 0.0.11 for the Aver 0.16 Console=String migration
The Aver test-wrapper harness and 56 canonical baselines now emit string interpolation (`Console.print("{x}")`) instead of bare expressions; this is required for Aver 0.16's typed `Console.print` and is backward-compatible to Aver 0.10. Plus three restored T2 baselines and 9 coverage-gap fixes (see #65).

This is a methodology change for Aver scoring with two distinct flavours of compatibility impact:

- On Aver 0.16+: required. Without this, every injected `aver run` crashes at typecheck and `run_correct = 0%`.
- On Aver 0.10–0.15: scoring may differ slightly between v0.0.10 and v0.0.11 result files, because the 9 coverage-gap fixes mean `run_correct` is now measured against the full test_cases set rather than the partial set the baseline `main()` happened to print, and the 3 restored T2 baselines now contribute to the Aver baseline `run_correct` denominator.

Aver baseline rises to 100% check@1 / 100% run_correct against Aver 0.15.2 (previously 95%/73% on the same compiler). Vera, Vera spec-from-NL, Python, and TypeScript scoring is unaffected.

Files touched:

- pyproject.toml: 0.0.10 -> 0.0.11
- CITATION.cff: version + date-released bumped together
- CHANGELOG.md: new [0.0.11] section with Compatibility note documenting both flavours of impact; link references updated
- ROADMAP.md: prepended a v0.0.11 line above the v0.0.10 summary

Verified: `pip install -e .` followed by `python -c "from importlib.metadata import version; print(version('vera-bench'))"` reports 0.0.11, full test suite green at 494 cases, Aver baselines 100%/100% locally on Aver 0.15.2.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 3afbb2a commit 2f74785
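The commit message describes the baseline migration as a mechanical rewrite of `Console.print(EXPR)` into `Console.print("{EXPR}")`. The following is a minimal sketch of how such a rewrite could be scripted, not the script used in this commit. The function name `migrate_prints` is hypothetical, Aver source is treated as opaque text, and the choice to leave arguments that already start with a double quote untouched is an assumption made so that the rewrite is idempotent.

```python
def migrate_prints(source: str) -> str:
    """Wrap the argument of each Console.print(...) call in "{...}".

    Hypothetical helper: it only balances parentheses and skips over
    double-quoted string literals; it does not parse Aver.
    """
    marker = "Console.print("
    out = []
    i = 0
    while True:
        start = source.find(marker, i)
        if start == -1:
            out.append(source[i:])  # no more calls; copy the tail
            break
        arg_start = start + len(marker)
        depth = 1
        j = arg_start
        in_string = False
        # Scan forward to the parenthesis that closes this call.
        while j < len(source) and depth > 0:
            ch = source[j]
            if in_string:
                if ch == "\\":
                    j += 1  # skip the escaped character
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            j += 1
        if depth != 0:
            out.append(source[i:])  # unbalanced call; leave the rest as-is
            break
        expr = source[arg_start:j - 1]
        out.append(source[i:arg_start])
        if expr.strip().startswith('"'):
            out.append(expr)  # assumption: already a string, keep unchanged
        else:
            out.append('"{' + expr + '}"')
        out.append(")")
        i = j
    return "".join(out)
```

Because an already-migrated call starts with a double quote, running the sketch twice gives the same result as running it once, which makes it safe to re-run over a directory of baselines.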

4 files changed

Lines changed: 64 additions & 4 deletions

CHANGELOG.md

Lines changed: 59 additions & 1 deletion
@@ -7,6 +7,63 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+## [0.0.11] - 2026-05-04
+
+### Changed
+
+- Aver test-wrapper harness emits `Console.print("{<call>}")` (string
+  interpolation) instead of `Console.print(<call>)`. Aver 0.16
+  ("Anneal") tightens `Console.print` to require `String` — the
+  previous form silently coerced `Int`, `Bool`, `List<T>`, etc. and
+  was a typecheck error from 0.16 onwards. Interpolation predates the
+  breaking change by many versions, so the same wrapper works on
+  Aver 0.10–0.15 and on 0.16+ (#65).
+- All 56 canonical Aver baseline solutions migrated from
+  `Console.print(EXPR)` to `Console.print("{EXPR}")`. Mechanical and
+  shape-preserving for nested expressions and string arguments.
+- 9 baselines whose `main()` printed only a subset of their
+  problem-JSON `test_cases` had `main()` regenerated to print every
+  test case. This was a pre-existing coverage gap that surfaces only
+  after the interpolation migration brings them past `aver check`.
+
+### Added
+
+- 3 Aver baselines restored: `VB_T2_011_starts_with_prefix.av`,
+  `VB_T2_012_ends_with_suffix.av`, `VB_T2_013_get_char_code.av`.
+  Originally added in v0.0.9 then removed during PR #57 review
+  because the Aver stdlib didn't expose `starts_with` / `ends_with` /
+  `char_at` at the time. Aver 0.15+ has `String.startsWith`,
+  `String.endsWith`, `String.charAt`, and `Char.toCode`, so the three
+  baselines are reinstated.
+
+### Compatibility note
+
+Aver scoring on Aver 0.16+ requires v0.0.11 — without this release,
+every injected `aver run` crashes at typecheck and `run_correct = 0%`
+across the board. For Aver 0.10–0.15, scoring may differ slightly
+between v0.0.10 and v0.0.11 result files for the same model on the
+same problems:
+
+- The 9 coverage-gap fixes mean `run_correct` is now measured against
+  the full set of test cases declared in each problem JSON, rather
+  than the partial set the baseline `main()` happened to print. Some
+  problems that previously appeared to pass on a partial check may
+  now fail on the full check, and vice versa.
+- The 3 restored T2 baselines (T2-011/012/013) now contribute to the
+  Aver baseline `run_correct` denominator (60 / 60), where they
+  previously contributed nothing (no canonical solution available, so
+  pre-#65 Aver baselines reported 60 problems with 3 effectively
+  excluded from scoring).
+
+The Aver baseline rises to 100% check@1, 100% run_correct against
+Aver 0.15.2 with this PR; the previous baseline was 95%/73% on the
+same compiler. The lift is real (not a definitional artefact) but
+result files are tagged with `bench_version` so cross-version
+comparisons can detect this boundary.
+
+Vera, Vera spec-from-NL, Python, and TypeScript scoring is
+unaffected.
+
 ## [0.0.10] - 2026-04-29
 
 ### Changed
@@ -191,7 +248,8 @@ Vera, Vera spec-from-NL, Python, and TypeScript scoring is unaffected.
 - Claude Sonnet 4: 96% check@1, 96% verify@1, 83% run_correct (50 problems, full-spec mode)
 - Python canonical baselines: 100% run_correct (24 testable problems)
 
-[Unreleased]: https://github.com/aallan/vera-bench/compare/v0.0.10...HEAD
+[Unreleased]: https://github.com/aallan/vera-bench/compare/v0.0.11...HEAD
+[0.0.11]: https://github.com/aallan/vera-bench/compare/v0.0.10...v0.0.11
 [0.0.10]: https://github.com/aallan/vera-bench/compare/v0.0.9...v0.0.10
 [0.0.9]: https://github.com/aallan/vera-bench/compare/v0.0.8...v0.0.9
 [0.0.8]: https://github.com/aallan/vera-bench/compare/v0.0.7...v0.0.8
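The Compatibility note in the CHANGELOG diff above says result files carry a `bench_version` tag so that cross-version comparisons can detect the v0.0.11 methodology boundary. A minimal sketch of such a check follows. The helper names and the boundary constant are hypothetical, and it assumes `bench_version` is a plain dotted version string such as `0.0.11`; the actual result-file schema is not shown in this commit.

```python
# Hypothetical constant: first release that uses the new Aver methodology.
AVER_METHODOLOGY_BOUNDARY = (0, 0, 11)


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '0.0.11' or 'v0.0.11' into (0, 0, 11) for numeric comparison."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def crosses_boundary(version_a: str, version_b: str) -> bool:
    """True when exactly one of the two versions predates the boundary."""
    before_a = parse_version(version_a) < AVER_METHODOLOGY_BOUNDARY
    before_b = parse_version(version_b) < AVER_METHODOLOGY_BOUNDARY
    return before_a != before_b
```

Comparing parsed tuples rather than raw strings matters here: as strings, `"0.0.10"` sorts before `"0.0.9"`, which would misplace v0.0.10 relative to its predecessor.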

CITATION.cff

Lines changed: 2 additions & 2 deletions
@@ -2,8 +2,8 @@ cff-version: 1.2.0
 title: "VeraBench: a benchmark suite for LLM code generation in Vera"
 message: "If you use this benchmark, please cite it as below."
 type: software
-version: "0.0.10"
-date-released: "2026-04-29"
+version: "0.0.11"
+date-released: "2026-05-04"
 authors:
   - given-names: Alasdair
     family-names: Allan

ROADMAP.md

Lines changed: 2 additions & 0 deletions
@@ -2,6 +2,8 @@
 
 ## Where we are
 
+**v0.0.11** — Aver test-wrapper and 56 canonical baselines migrated to string interpolation (`Console.print("{x}")`) for compatibility with Aver 0.16's typed `Console.print`. Three previously-removed Aver baselines (T2-011/012/013) restored using Aver 0.15+ stdlib. Coverage-gap fix in 9 baselines whose `main()` printed only a subset of test cases. Methodology change documented in CHANGELOG.
+
 **v0.0.10** — Aver evaluation harness strips module-header `effects [...]` declarations before injecting the test main, so canonical and LLM-generated solutions continue to compile under Aver 0.13's enforced effects boundary. No-op on Aver 0.12 and earlier; methodology change documented in CHANGELOG.
 
 **v0.0.9** — 60 problems across 5 tiers (10 new T2/T3 problems with testable signatures). T1–T4 `run_correct` pool expanded from 18 to 30 testable problems. New T3 problems use Int-only signatures with internal ADT construction for CLI testability.

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "vera-bench"
-version = "0.0.10"
+version = "0.0.11"
 description = "HumanEval/MBPP-style benchmark for the Vera programming language"
 readme = "README.md"
 license = "MIT"
