Bump version to 0.0.10 for the Aver effects-strip methodology change

aallan · claude · aallan · commit 87a835e19d45 · 2026-04-29T10:33:17.000+01:00
The Aver evaluation harness now strips module-header `effects [...]`
declarations before injecting the test main (the substantive change in
this PR). On Aver 0.12 and earlier the strip is a no-op and Aver
scoring is byte-identical to v0.0.9. Once Aver 0.13 ships and models
start emitting `effects [...]` per the updated docs, the strip will
activate on a measurable fraction of generations and prevent the
underdeclared-effects type error. Aver `run_correct` rates will
therefore diverge between v0.0.9 and v0.0.10 on Aver 0.13+, so the bump
records the methodology boundary in `bench_version` for cross-version
analysis.

Vera, Vera spec-from-NL, Python, and TypeScript scoring is unaffected.
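
For readers who want a concrete picture of the strip pass, here is a
minimal Python sketch. It is not the harness's actual code. The names
`strip_header_effects` and `_EFFECTS_LINE` are hypothetical, and it
assumes the module header is the `module` line plus the indented lines
that follow it, that line comments start with `//`, and that an
`effects [...]` declaration fits on a single line.

```python
import re

# Assumption: a header declaration is one indented line of the form
# `effects [ ... ]`, with any whitespace before `[` and an optional
# trailing `//` line comment after the closing `]`.
_EFFECTS_LINE = re.compile(r"^\s+effects\s*\[[^\]]*\]\s*(//.*)?$")


def strip_header_effects(source: str) -> str:
    """Drop `effects [...]` lines found inside the module-header block."""
    kept = []
    in_header = False
    for line in source.splitlines(keepends=True):
        if line.startswith("module "):
            # Assumption: the header block opens at the `module` line.
            in_header = True
            kept.append(line)
            continue
        if in_header:
            if line.strip() and not line[0].isspace():
                # First non-indented, non-blank line closes the header.
                in_header = False
            elif _EFFECTS_LINE.match(line):
                # Window-scoped: only header declarations are dropped.
                continue
        kept.append(line)
    return "".join(kept)
```

Under these assumptions, source with no header `effects [...]` line
passes through unchanged, which matches the no-op behaviour described
for Aver 0.12 and earlier.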

Files touched:
- pyproject.toml: 0.0.9 -> 0.0.10 (importlib.metadata source of truth)
- CITATION.cff: version + date-released bumped together
- CHANGELOG.md: new [0.0.10] section with Compatibility note explaining
  the no-op-until-Aver-0.13 nuance; link references updated
- ROADMAP.md: prepended a v0.0.10 line above the v0.0.9 summary

Verified: `pip install -e .` followed by
`python -c "from importlib.metadata import version; print(version('vera-bench'))"`
reports 0.0.10; the full test suite is green at 494 cases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -7,6 +7,36 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+## [0.0.10] - 2026-04-29
+
+### Changed
+
+- Aver evaluation harness strips module-header `effects [...]` declarations
+  before injecting the test main (#62). The injected main needs
+  `! [Console.print]`, which would violate any narrower boundary the LLM
+  declared (including the common `effects []` for "pure" modules) once
+  Aver 0.13 ships and enforces the boundary as a hard type error.
+- The strip is window-scoped (only fires inside the module-header block,
+  not on `effects [...]`-shaped lines elsewhere), tolerates arbitrary
+  whitespace between `effects` and `[`, and tolerates trailing line
+  comments after the closing `]`.
+
+### Compatibility note
+
+This is a methodology change for Aver scoring: the same LLM output now
+goes through an extra strip pass before reaching the compiler. On Aver
+0.12 and earlier the strip is a no-op (LLMs don't emit module-level
+`effects [...]` because the docs don't yet describe it), so today's
+Aver scores are byte-identical to v0.0.9. Once Aver 0.13 ships and the
+boundary becomes part of the doc nudge to models, Aver `run_correct`
+rates from v0.0.10 onwards will diverge from any v0.0.9-tagged Aver
+results run against Aver 0.13+ — the strip will activate on a measurable
+fraction of generations and prevent the underdeclared-effects type
+error. Result files are tagged with `bench_version` so cross-version
+comparisons can detect this boundary.
+
+Vera, Vera spec-from-NL, Python, and TypeScript scoring is unaffected.
+
 ## [0.0.9] - 2026-04-16
 
 ### Added
@@ -161,7 +191,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Claude Sonnet 4: 96% check@1, 96% verify@1, 83% run_correct (50 problems, full-spec mode)
 - Python canonical baselines: 100% run_correct (24 testable problems)
 
-[Unreleased]: https://github.com/aallan/vera-bench/compare/v0.0.9...HEAD
+[Unreleased]: https://github.com/aallan/vera-bench/compare/v0.0.10...HEAD
+[0.0.10]: https://github.com/aallan/vera-bench/compare/v0.0.9...v0.0.10
 [0.0.9]: https://github.com/aallan/vera-bench/compare/v0.0.8...v0.0.9
 [0.0.8]: https://github.com/aallan/vera-bench/compare/v0.0.7...v0.0.8
 [0.0.7]: https://github.com/aallan/vera-bench/compare/v0.0.6...v0.0.7
diff --git a/CITATION.cff b/CITATION.cff
@@ -2,8 +2,8 @@ cff-version: 1.2.0
 title: "VeraBench: a benchmark suite for LLM code generation in Vera"
 message: "If you use this benchmark, please cite it as below."
 type: software
-version: "0.0.9"
-date-released: "2026-04-16"
+version: "0.0.10"
+date-released: "2026-04-29"
 authors:
   - given-names: Alasdair
     family-names: Allan
diff --git a/ROADMAP.md b/ROADMAP.md
@@ -2,6 +2,8 @@
 
 ## Where we are
 
+**v0.0.10** — Aver evaluation harness strips module-header `effects [...]` declarations before injecting the test main, so canonical and LLM-generated solutions continue to compile under Aver 0.13's enforced effects boundary. No-op on Aver 0.12 and earlier; methodology change documented in CHANGELOG.
+
 **v0.0.9** — 60 problems across 5 tiers (10 new T2/T3 problems with testable signatures). T1–T4 `run_correct` pool expanded from 18 to 30 testable problems. New T3 problems use Int-only signatures with internal ADT construction for CLI testability.
 
 **v0.0.8** — 50 problems across 5 tiers with strengthened postconditions and explicit slot ordering descriptions. Working LLM harness (Anthropic, OpenAI, Moonshot), Python, TypeScript, and Aver baseline runners, cross-language generation comparison. Full benchmark runner script. SKILL.md and Aver's llms.txt fetched at runtime. Language-neutral problem descriptions (`description_neutral`) for fair cross-language prompting.
diff --git a/pyproject.toml b/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "vera-bench"
-version = "0.0.9"
+version = "0.0.10"
 description = "HumanEval/MBPP-style benchmark for the Vera programming language"
 readme = "README.md"
 license = "MIT"