Commit 87a835e

aallan and claude committed
Bump version to 0.0.10 for the Aver effects-strip methodology change
The Aver evaluation harness now strips module-header `effects [...]` declarations before injecting the test main (the substantive change in this PR).

On Aver 0.12 and earlier the strip is a no-op and Aver scoring is byte-identical to v0.0.9. Once Aver 0.13 ships and models start emitting `effects [...]` per the updated docs, the strip will activate on a measurable fraction of generations and prevent the underdeclared-effects type error. Aver `run_correct` rates will then diverge between v0.0.9 and v0.0.10 on Aver 0.13+, so the bump records the methodology boundary in `bench_version` for cross-version analysis.

Vera, Vera spec-from-NL, Python, and TypeScript scoring is unaffected.

Files touched:
- pyproject.toml: 0.0.9 -> 0.0.10 (importlib.metadata source of truth)
- CITATION.cff: version + date-released bumped together
- CHANGELOG.md: new [0.0.10] section with Compatibility note explaining the no-op-until-Aver-0.13 nuance; link references updated
- ROADMAP.md: prepended a v0.0.10 line above the v0.0.9 summary

Verified: `pip install -e .` followed by `python -c "from importlib.metadata import version; print(version('vera-bench'))"` reports 0.0.10, full test suite green at 494 cases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 918c3e3 commit 87a835e
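
The commit message says the bump records a methodology boundary in `bench_version` for cross-version analysis. A minimal sketch of how an analysis script might detect that boundary is shown here. It is not code from this repository: `METHODOLOGY_BOUNDARY`, `parse_version`, and `crosses_boundary` are hypothetical names, and the sketch assumes `bench_version` is stored as a plain dotted string such as `0.0.9`. Versions are parsed into integer tuples because a plain string comparison would sort `0.0.10` before `0.0.9`.

```python
# Hypothetical helper for cross-version analysis; names are illustrative only.
METHODOLOGY_BOUNDARY = (0, 0, 10)  # first bench version with the effects strip


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version string like '0.0.10' into (0, 0, 10)."""
    return tuple(int(part) for part in text.split("."))


def crosses_boundary(version_a: str, version_b: str) -> bool:
    """Return True when exactly one of the two versions predates the boundary."""
    a_before = parse_version(version_a) < METHODOLOGY_BOUNDARY
    b_before = parse_version(version_b) < METHODOLOGY_BOUNDARY
    return a_before != b_before
```

According to the commit message, a crossing only matters for Aver results produced against Aver 0.13+; a real check would probably also look at the language and compiler version recorded alongside `bench_version`, if the result files carry them.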

4 files changed

Lines changed: 37 additions & 4 deletions

CHANGELOG.md

Lines changed: 32 additions & 1 deletion
@@ -7,6 +7,36 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+## [0.0.10] - 2026-04-29
+
+### Changed
+
+- Aver evaluation harness strips module-header `effects [...]` declarations
+before injecting the test main (#62). The injected main needs
+`! [Console.print]`, which would violate any narrower boundary the LLM
+declared (including the common `effects []` for "pure" modules) once
+Aver 0.13 ships and enforces the boundary as a hard type error.
+- The strip is window-scoped (only fires inside the module-header block,
+not on `effects [...]`-shaped lines elsewhere), tolerates arbitrary
+whitespace between `effects` and `[`, and tolerates trailing line
+comments after the closing `]`.
+
+### Compatibility note
+
+This is a methodology change for Aver scoring: the same LLM output now
+goes through an extra strip pass before reaching the compiler. On Aver
+0.12 and earlier the strip is a no-op (LLMs don't emit module-level
+`effects [...]` because the docs don't yet describe it), so today's
+Aver scores are byte-identical to v0.0.9. Once Aver 0.13 ships and the
+boundary becomes part of the doc nudge to models, Aver `run_correct`
+rates from v0.0.10 onwards will diverge from any v0.0.9-tagged Aver
+results run against Aver 0.13+ — the strip will activate on a measurable
+fraction of generations and prevent the underdeclared-effects type
+error. Result files are tagged with `bench_version` so cross-version
+comparisons can detect this boundary.
+
+Vera, Vera spec-from-NL, Python, and TypeScript scoring is unaffected.
+
 ## [0.0.9] - 2026-04-16
 
 ### Added
@@ -161,7 +191,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Claude Sonnet 4: 96% check@1, 96% verify@1, 83% run_correct (50 problems, full-spec mode)
 - Python canonical baselines: 100% run_correct (24 testable problems)
 
-[Unreleased]: https://github.com/aallan/vera-bench/compare/v0.0.9...HEAD
+[Unreleased]: https://github.com/aallan/vera-bench/compare/v0.0.10...HEAD
+[0.0.10]: https://github.com/aallan/vera-bench/compare/v0.0.9...v0.0.10
 [0.0.9]: https://github.com/aallan/vera-bench/compare/v0.0.8...v0.0.9
 [0.0.8]: https://github.com/aallan/vera-bench/compare/v0.0.7...v0.0.8
 [0.0.7]: https://github.com/aallan/vera-bench/compare/v0.0.6...v0.0.7
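
The strip behaviour described in the new CHANGELOG entry (window-scoped, tolerant of whitespace between `effects` and `[`, tolerant of a trailing line comment) can be illustrated with a short sketch. This is not the harness code from #62. It rests on several assumptions that the commit does not confirm: that the header block starts at a non-indented `module` line and ends at the first following non-indented, non-blank line; that a header `effects [...]` declaration fits on one line; and that line comments start with `//`. The function name `strip_header_effects` is hypothetical.

```python
import re

# Assumption: one-line declaration, optionally followed by a `//` comment.
_EFFECTS_LINE = re.compile(r"^\s+effects\s*\[[^\]]*\]\s*(//.*)?$")


def strip_header_effects(source: str) -> str:
    """Drop `effects [...]` lines that sit inside the module-header window."""
    kept = []
    in_header = False
    seen_header = False
    for line in source.splitlines(keepends=True):
        if not seen_header and line.startswith("module "):
            # The header window opens at the first top-level `module` line.
            in_header = True
            seen_header = True
            kept.append(line)
            continue
        if in_header:
            if line.strip() and not line[0].isspace():
                # First non-indented, non-blank line closes the window.
                in_header = False
            elif _EFFECTS_LINE.match(line.rstrip("\r\n")):
                # Inside the window: drop the declaration entirely.
                continue
        kept.append(line)
    return "".join(kept)
```

Scanning line by line with an explicit window flag, rather than running one regex over the whole file, is what keeps `effects [...]`-shaped lines outside the header untouched. On input with no header declaration the function returns the source unchanged, which matches the no-op behaviour the entry describes for Aver 0.12 and earlier.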

CITATION.cff

Lines changed: 2 additions & 2 deletions
@@ -2,8 +2,8 @@ cff-version: 1.2.0
 title: "VeraBench: a benchmark suite for LLM code generation in Vera"
 message: "If you use this benchmark, please cite it as below."
 type: software
-version: "0.0.9"
-date-released: "2026-04-16"
+version: "0.0.10"
+date-released: "2026-04-29"
 authors:
   - given-names: Alasdair
     family-names: Allan

ROADMAP.md

Lines changed: 2 additions & 0 deletions
@@ -2,6 +2,8 @@
 
 ## Where we are
 
+**v0.0.10** — Aver evaluation harness strips module-header `effects [...]` declarations before injecting the test main, so canonical and LLM-generated solutions continue to compile under Aver 0.13's enforced effects boundary. No-op on Aver 0.12 and earlier; methodology change documented in CHANGELOG.
+
 **v0.0.9** — 60 problems across 5 tiers (10 new T2/T3 problems with testable signatures). T1–T4 `run_correct` pool expanded from 18 to 30 testable problems. New T3 problems use Int-only signatures with internal ADT construction for CLI testability.
 
 **v0.0.8** — 50 problems across 5 tiers with strengthened postconditions and explicit slot ordering descriptions. Working LLM harness (Anthropic, OpenAI, Moonshot), Python, TypeScript, and Aver baseline runners, cross-language generation comparison. Full benchmark runner script. SKILL.md and Aver's llms.txt fetched at runtime. Language-neutral problem descriptions (`description_neutral`) for fair cross-language prompting.

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "vera-bench"
-version = "0.0.9"
+version = "0.0.10"
 description = "HumanEval/MBPP-style benchmark for the Vera programming language"
 readme = "README.md"
 license = "MIT"
