Bump version to 0.0.10 for the Aver effects-strip methodology change
The Aver evaluation harness now strips module-header `effects [...]`
declarations before injecting the test main (the substantive change in
this PR). On Aver 0.12 and earlier the strip is a no-op and Aver
scoring is byte-identical to v0.0.9. Once Aver 0.13 ships and models
start emitting `effects [...]` per the updated docs, the strip will
activate on a measurable fraction of generations and prevent the
underdeclared-effects type error. Aver `run_correct` rates will therefore
diverge between v0.0.9 and v0.0.10 on Aver 0.13+, so the bump records
the methodology boundary in `bench_version` for cross-version analysis.
Vera, Vera spec-from-NL, Python, and TypeScript scoring is unaffected.
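
The strip described above could look roughly like the sketch below. It is not the harness's actual code: the helper name `strip_module_effects`, the regex, and the sample source are all hypothetical, and it assumes an `effects [...]` declaration starts on its own line, which may not match the real Aver grammar.

```python
import re

# Assumption: a module-header effects declaration starts on its own line,
# e.g. `    effects [Console.print]`. The bracketed list may span lines.
_EFFECTS_DECL = re.compile(
    r"^[ \t]*effects[ \t]*\[[^\]]*\][ \t]*\r?\n?",
    re.MULTILINE,
)


def strip_module_effects(source: str) -> str:
    """Remove `effects [...]` declarations from a source string.

    Hypothetical helper: sources without such a declaration are returned
    unchanged, which mirrors the no-op behaviour described for Aver 0.12.
    """
    return _EFFECTS_DECL.sub("", source)


# Illustrative input only; not verified against real Aver syntax.
sample = "module Demo\n    effects [Console.print]\n\nfn main() -> Unit\n"
stripped = strip_module_effects(sample)
```

Because the substitution only fires when a declaration is present, output for older generations stays byte-identical, which is the property the compatibility note relies on.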
Files touched:
- pyproject.toml: 0.0.9 -> 0.0.10 (importlib.metadata source of truth)
- CITATION.cff: version + date-released bumped together
- CHANGELOG.md: new [0.0.10] section with Compatibility note explaining
the no-op-until-Aver-0.13 nuance; link references updated
- ROADMAP.md: prepended a v0.0.10 line above the v0.0.9 summary
Verified: `pip install -e .` followed by
`python -c "from importlib.metadata import version; print(version('vera-bench'))"`
reports 0.0.10; the full test suite is green at 494 cases.
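
For cross-version analysis, the same `importlib.metadata` lookup could be used to stamp each result with `bench_version`. The following is a minimal sketch under assumptions: `bench_version()`, `tag_result()`, the result dict shape, and the `"unknown"` fallback are hypothetical and are not taken from the repository.

```python
from importlib.metadata import PackageNotFoundError, version


def bench_version(dist_name: str = "vera-bench") -> str:
    """Return the installed distribution version, or "unknown" if absent.

    The "unknown" fallback is an assumption made for this sketch.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return "unknown"


def tag_result(result: dict, dist_name: str = "vera-bench") -> dict:
    """Return a copy of a result record with `bench_version` attached."""
    return {**result, "bench_version": bench_version(dist_name)}
```

Filtering or grouping results on this field would then separate v0.0.9 runs from v0.0.10 runs when comparing Aver `run_correct` rates.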
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ROADMAP.md (2 additions, 0 deletions)
@@ -2,6 +2,8 @@
## Where we are
+
**v0.0.10** — Aver evaluation harness strips module-header `effects [...]` declarations before injecting the test main, so canonical and LLM-generated solutions continue to compile under Aver 0.13's enforced effects boundary. No-op on Aver 0.12 and earlier; methodology change documented in CHANGELOG.
+
**v0.0.9** — 60 problems across 5 tiers (10 new T2/T3 problems with testable signatures). T1–T4 `run_correct` pool expanded from 18 to 30 testable problems. New T3 problems use Int-only signatures with internal ADT construction for CLI testability.
**v0.0.8** — 50 problems across 5 tiers with strengthened postconditions and explicit slot ordering descriptions. Working LLM harness (Anthropic, OpenAI, Moonshot), Python, TypeScript, and Aver baseline runners, cross-language generation comparison. Full benchmark runner script. SKILL.md and Aver's llms.txt fetched at runtime. Language-neutral problem descriptions (`description_neutral`) for fair cross-language prompting.