aver: strip module-level effects before injecting test main (planned 0.13) by jasisz · Pull Request #62 · aallan/vera-bench

jasisz · 2026-04-27T11:30:50Z

Summary

Aver 0.13 (the upcoming release) introduces a module-level effects [...] boundary that the type-checker enforces: every function's ! [Effect] must be covered, or compilation fails.

The bench harness currently asks the LLM for a function only, then strips the LLM's main() and injects its own Console.print(fn(...)) main with ! [Console.print]. If the LLM declared a narrower module boundary in the original source — including the very common effects [] for a "pure" module — the injected main violates that boundary and aver run fails with an underdeclared-effects type error before any test case runs.

This PR adds _strip_module_effects(code) to the runner, called right after _strip_aver_main(code). It removes the module-header effects [...] line (inline or multi-line), so the module reverts to legacy / no-boundary mode and the injected main type-checks.

Compatibility:

No-op on Aver 0.12 and earlier — those versions don't recognise effects [...] on the module header, so LLMs don't generate it and the strip pass simply doesn't fire.
Necessary the moment Aver 0.13 ships — once the boundary is enforced, any LLM that follows the language docs and declares it will fail bench without this fix. We measured a 50pp regression on tier1 with claude-haiku-4-5 on 0.13-dev when the doc nudge for effects [] reaches the model.

So merging now is safe: it doesn't change current behaviour, and it lets the Aver path in vera-bench keep working on the day 0.13 lands, without a follow-up rush.

What changed

vera_bench/runner.py: new _strip_module_effects(code) helper, wired into the test-main injection path.
tests/test_runner.py: 4 cases on TestStripModuleEffects (inline [], inline list, multi-line list, no-op).

Test plan

pytest tests/test_runner.py — 101 passed locally
Full suite pytest tests/ — 489 passed
ruff check + ruff format --check clean
Tier1 sanity with claude-haiku-4-5 on Aver 0.13-dev: 9/10 (matches Aver 0.12 baseline)

Context

Linked discussion: this is the bench-side counterpart to a docs change in jasisz/aver where the language now documents effects [...] as a first-class module feature
VB-T1-007 (safe_modulo) remains the one tier1 fail on haiku-4.5 in both old and new world — Result-wrap vs raw Int return — unrelated to this change

Summary by CodeRabbit

Tests
- Added comprehensive unit tests covering removal of module-level effects declarations across inline, multi-line and edge-case scenarios.
Chores
- Improved the evaluation/benchmark pipeline to post-process generated code (removing module-level effects and injected mains) for cleaner, consistently type-checked bench outputs, ensuring more reliable and predictable results.

Aver 0.13 enforces module-level `effects [...]` as a hard type-check boundary: every fn's `! [Effect]` must be covered or the program fails to compile.

The benchmark prompts the LLM for a function only, then strips its `main()` and injects its own `Console.print(fn(...))` main with `! [Console.print]`. If the LLM declared a narrower module boundary in the original source (including the very common `effects []` for a "pure" module), the injected main violates the boundary and `aver run` fails with an underdeclared-effects type error before any test case even runs.

Fix: drop the module-header `effects [...]` line as part of the main-replacement pass. The module reverts to legacy / no-boundary mode, the injected main type-checks, and the rest of the harness runs unchanged.

Handles three shapes:

    effects []

    effects [Console.print, Disk.readText]

    effects [
        Console.print,
        Disk.readText,
    ]

No-op when the module declares no `effects [...]`.

Verified on tier1 with claude-haiku-4-5: 9/10 pass-rate (matches baseline). The known failure (VB-T1-007 safe_modulo) is unrelated: the model returns Result<Int, String> instead of raw Int.
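The three shapes listed above can be handled by a small line-oriented pass. What follows is a minimal sketch, not the code in `vera_bench/runner.py`: the name `strip_module_effects_sketch` and the Aver sample are illustrative assumptions, and the sketch assumes canonical `effects [` spacing with nothing after the closing `]`.

```python
def strip_module_effects_sketch(code: str) -> str:
    """Drop an `effects [...]` declaration, inline or spread over several lines.

    Hypothetical first-pass sketch: it matches the declaration wherever it
    appears and assumes the closing bracket is the last token on its line.
    """
    kept = []
    skipping = False  # True while inside a multi-line effects list
    for line in code.splitlines(keepends=True):
        stripped = line.strip()
        if skipping:
            # Keep dropping lines until the one that closes the list.
            if stripped.endswith("]"):
                skipping = False
            continue
        if stripped.startswith(("effects [", "effects[")):
            # Inline form closes on the same line; otherwise start skipping.
            if not stripped.endswith("]"):
                skipping = True
            continue
        kept.append(line)
    return "".join(kept)


# Illustrative Aver-like source (not checked against the Aver grammar).
SAMPLE = (
    "module Demo\n"
    "    effects [\n"
    "        Console.print,\n"
    "        Disk.readText,\n"
    "    ]\n"
    "\n"
    "fn f() -> Unit\n"
    "    ! [Console.print]\n"
    '    Console.print("hi")\n'
)

cleaned = strip_module_effects_sketch(SAMPLE)
```

In this sketch the function-level `! [Console.print]` line is left alone because only lines that open with the `effects` keyword are considered.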

Four test cases on TestStripModuleEffects:

- inline `effects []`
- inline `effects [Console.print, Disk.readText]`
- multi-line bracketed list
- no-op when the module declares no effects boundary

Each asserts the boundary line is gone and the rest of the module (including a `! [Console.print]` inside the function body) is left untouched.

coderabbitai · 2026-04-27T11:31:06Z

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: c6ec6233-2e85-4432-9cfe-55749f678790

📥 Commits

Reviewing files that changed from the base of the PR and between 661a0b7 and 87a835e.

📒 Files selected for processing (6)

CHANGELOG.md
CITATION.cff
ROADMAP.md
pyproject.toml
tests/test_runner.py
vera_bench/runner.py

📝 Walkthrough

Walkthrough

Remove module-level effects [...] declarations (single-line and multi-line) from the module header of LLM-generated Aver code, and run this cleanup in the Aver evaluation path before injecting the benchmark fn main(). Tests added to validate stripping and no-op scenarios.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Runner / Effects stripping `vera_bench/runner.py` | Adds private regex `_AVER_EFFECTS_OPEN_RE` and helper `_strip_module_effects` to remove `effects [...]` from the module header (handles `effects []`, `effects [...]`, multi-line bracketed forms and varying whitespace). Integrates this into the Aver evaluation flow by piping `_strip_aver_main` output through the new stripper. |
| Tests / Unit coverage `tests/test_runner.py` | Adds `TestStripModuleEffects` tests importing `_strip_module_effects` covering removal of inline and multi-line module `effects`, preservation when no module `effects` present, preservation of `effects` outside a module header or inside function bodies. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

harness

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 30.77% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding module-level effects stripping before injecting test main in the Aver evaluation path. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_runner.py`:
- Around line 886-937: Add a test in TestStripModuleEffects that verifies
_strip_module_effects is a no-op when the input lacks a module declaration (to
cover the else branch in runner.py that later wraps code); create a case where
the input contains a top-level "effects [...]" line but no "module ..." header
(e.g., code = 'effects [Console.print]\n\nfn f() -> Unit\n   
Console.print("hi")\n') and assert the result equals the original code (or at
least that the top-level effects line is preserved and function body still
contains Console.print) so we lock in that _strip_module_effects does not remove
effects when there is no module header.

In `@vera_bench/runner.py`:
- Around line 569-601: The current _strip_module_effects implementation
unconditionally drops any effects [...] block anywhere and uses
stripped.endswith("]") to find the terminator, which can mis-strip if Aver later
allows nested blocks or trailing tokens on the closing line; update
_strip_module_effects to only operate in the module header region (the first
non-blank block immediately under a `module X` declaration) and to detect an
effects list with a regex anchored to that header (e.g. match a line like
/^\s*effects\s*\[/), then drop lines up through the first line that contains a
closing ']' (allowing trailing comments/parameters on that line) while leaving
other occurrences alone; keep the function name _strip_module_effects and ensure
the logic is scoped to the module header window before skipping lines.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0aebeecf-2057-42fc-b45e-1971b9f0238a

📥 Commits

Reviewing files that changed from the base of the PR and between bd9b6d5 and 5f1611a.

📒 Files selected for processing (2)

tests/test_runner.py
vera_bench/runner.py

codecov · 2026-04-27T11:37:50Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.62%. Comparing base (988cb33) to head (87a835e).
⚠️ Report is 8 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #62      +/-   ##
==========================================
+ Coverage   83.30%   83.62%   +0.31%     
==========================================
  Files          10       10              
  Lines        1366     1392      +26     
==========================================
+ Hits         1138     1164      +26     
  Misses        228      228

Flag	Coverage Δ
python	`83.62% <100.00%> (+0.31%)`	⬆️

Addressing CodeRabbit review on PR aallan#62:

The original `_strip_module_effects` matched any line starting with `effects [` regardless of where it sat in the file. That worked for today's Aver but is fragile: if Aver later allows `effects [...]` in some other position (or an LLM emits an `effects [...]`-shaped line inside a function body), the bench would silently delete it.

Scope the strip to the module header, i.e. the indented region immediately under `module X`, ending at the next top-level item. That mirrors how the Aver parser scopes `intent` / `exposes` / `depends` / `effects`.

Two new tests lock in the behaviour:

- `test_no_op_when_no_module_declaration`: input without a `module X` line stays untouched, even if it starts with an `effects [...]` token.
- `test_only_strips_inside_module_header`: an `effects [...]`-shaped line inside a function body survives; only the header occurrence is removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
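The header-scoping rule described in this commit message can be sketched as a small state machine. This is an illustration under stated assumptions rather than the code in `vera_bench/runner.py`: the name `strip_header_effects`, the `OPEN_RE` constant, and the rule "a non-blank line at column 0 is a top-level item" are all hypothetical simplifications of how the Aver parser delimits a module header.

```python
import re

# Assumed opener pattern: the `effects` keyword followed by "[".
OPEN_RE = re.compile(r"^effects\s*\[")


def strip_header_effects(code: str) -> str:
    """Remove `effects [...]` only inside the indented block under `module X`."""
    kept = []
    in_header = False  # inside the indented region under a `module` line
    skipping = False   # inside a multi-line effects list
    for line in code.splitlines(keepends=True):
        stripped = line.strip()
        top_level = bool(stripped) and not line[0].isspace()
        if skipping:
            if "]" in stripped:
                skipping = False
            continue
        if top_level:
            # A `module` line opens the header; any other top-level item ends it.
            in_header = stripped.startswith("module ")
            kept.append(line)
            continue
        if in_header and OPEN_RE.match(stripped):
            if "]" not in stripped:
                skipping = True
            continue
        kept.append(line)
    return "".join(kept)


# Illustrative input: one header declaration, one look-alike in a function body.
SOURCE = (
    "module Demo\n"
    "    effects [Console.print]\n"
    "\n"
    "fn f() -> Unit\n"
    "    effects [Console.print]\n"
    '    Console.print("hi")\n'
)

result = strip_header_effects(SOURCE)
```

Under these assumptions only the header occurrence is dropped; the look-alike line in the function body and any input without a `module` line pass through unchanged.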

aallan · 2026-04-29T08:53:09Z

@jasisz Sorry for the delay on getting back to this — the substantive change is reviewed and looks solid (will leave detailed feedback once CI is green), but it can't merge yet because the dependency-audit job is failing on an unrelated issue.

The failure is CVE-2026-3219 in pip 26.0.1: actions/setup-python@v6 bakes that pip version into its toolchain image, and pip-audit running inside the runner reports the runner's own pip as vulnerable. Vera hit the same problem at aallan/vera#537; we've applied the same workaround in vera-bench at #64 (with #63 tracking eventual cleanup).

Once #64 lands you'll just need to rebase this branch onto the new main and CI should go green. Happy to do that for you if "allow edits from maintainers" is on, or you can do it yourself — whatever's easier.

Thanks again for getting ahead of the Aver 0.13 boundary change here. The "no-op until 0.13 ships" property of the strip pass is exactly the right way to land forward-compat work, and we want this in before models start picking up the doc nudge for effects [].

…N_ISSUES

The dependency-audit job started failing on PR aallan#62 because actions/setup-python@v6 bakes pip 26.0.1 into its Python 3.12 image, and pip 26.0.1 has CVE-2026-3219 (archive handling). The fix landed in pip 26.1 on 2026-04-26 but won't reach the runner image until GitHub refreshes the toolchain.

Workaround mirrors aallan/vera#537: a `pip install --upgrade pip` step before pip-audit runs, pulling pip 26.1 from PyPI to replace the bundled 26.0.1. Inline comment in ci.yml points at the tracking issue (aallan#63) so the workaround doesn't quietly outlive its reason.

Also opens KNOWN_ISSUES.md as the catalogue location for active workarounds, dev-env gotchas, and analytical caveats, each with an explicit "removal trigger" so cleanup is straightforward later. Initial entries:

- The CI workaround above (aallan#63)
- assets/results-graph.png pinned to v0.0.7 content until the v0.0.9 narrative writeup
- input_tokens semantic shift across PR aallan#60's prompt-caching merge (analytical caveat for cost trending across that boundary)
- /opt/homebrew/bin/vera is not the Vera programming language (dev-env collision with an unrelated Homebrew package)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@vera_bench/runner.py`:
- Around line 617-621: The module header opener check is too strict (only
matches "effects [" or "effects[") so variants like "effects    [" are missed;
add a compiled regex _AVER_EFFECTS_OPEN_RE = re.compile(r"^effects\s*\[") and
replace the long tuple check with a single match call: use
_AVER_EFFECTS_OPEN_RE.match(stripped) in the existing conditional that
references in_module_header, indent_len and stripped so the header detection
tolerates arbitrary whitespace before the '['; ensure re is imported/available
where _AVER_EFFECTS_OPEN_RE is defined and referenced.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a92443f8-032e-4650-8bd9-a535d0a0e3f4

📥 Commits

Reviewing files that changed from the base of the PR and between 5f1611a and b0545a7.

📒 Files selected for processing (2)

tests/test_runner.py
vera_bench/runner.py

The opener check used startswith("effects [", "effects["), which misses LLM output with non-canonical whitespace (multiple spaces, tab) between the keyword and the bracket; these are variants the Aver parser still accepts. Replace the prefix tuple with _AVER_EFFECTS_OPEN_RE = re.compile(r"^effects\s*\[") and cover the variants in TestStripModuleEffects.
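To make the difference between the two opener checks concrete, the short comparison below runs both against the whitespace variants mentioned in the thread. The regex is the one quoted in this commit message; the helper names `opens_with_prefix` and `opens_with_regex` and the `VARIANTS` list are hypothetical, introduced only for this illustration.

```python
import re

# Pattern quoted in the commit message: keyword, optional whitespace, bracket.
_AVER_EFFECTS_OPEN_RE = re.compile(r"^effects\s*\[")


def opens_with_prefix(stripped: str) -> bool:
    """Earlier check: exact prefixes only."""
    return stripped.startswith(("effects [", "effects["))


def opens_with_regex(stripped: str) -> bool:
    """Hardened check: any run of whitespace before the bracket."""
    return _AVER_EFFECTS_OPEN_RE.match(stripped) is not None


# No space, one space, several spaces, and a tab before the bracket.
VARIANTS = ["effects[]", "effects []", "effects   []", "effects\t[]"]

for variant in VARIANTS:
    print(repr(variant), opens_with_prefix(variant), opens_with_regex(variant))
```

The prefix check accepts only the first two variants, while the regex accepts all four and still rejects identifiers that merely begin with `effects`, such as `effectsList [`.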

jasisz · 2026-04-29T09:09:12Z

Thanks for handling the rebase yourself, and for the quick #64 — saved me a context-switch out of Aver 0.14 work, much appreciated.

Pushed CR's whitespace point as 661a0b7: _AVER_EFFECTS_OPEN_RE = re.compile(r"^effects\s*\[") plus a test covering opener variants the parser still accepts (`effects[`, `effects [`, `effects` followed by multiple spaces and `[`, `effects\t[`). CI should be green once it runs.

Glad the no-op-then-activate property landed well — happy to use the same shape for future Aver-X.Y boundary changes; that way the bench can stay continuously valid for whichever Aver version a model actually emits.

coderabbitai

♻️ Duplicate comments (1)

vera_bench/runner.py (1)

597-602: ⚠️ Potential issue | 🟠 Major

Bracket-close detection is too strict and can over-strip the file.

At Line 600 and Line 625, using stripped.endswith("]") means a closing bracket with trailing tokens is treated as unclosed. In the multi-line path, that can keep skip_until_close active and silently drop later declarations.

Proposed fix

-            if stripped.endswith("]"):
+            if "]" in stripped:
                 skip_until_close = False
             continue
@@
-            if stripped.endswith("]"):
+            if "]" in stripped:
                 continue
             skip_until_close = True
             continue

Also applies to: 625-627

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@vera_bench/runner.py` around lines 597 - 602, The multi-line bracket-close
detection using stripped.endswith("]") (the skip_until_close handling in the
loop) is too strict and misses cases where a closing ']' appears earlier on the
line with trailing tokens; update both occurrences that check
stripped.endswith("]") to instead detect a ']' anywhere on the line (e.g., check
for ']' in the original line or use a regex/search) so skip_until_close is
cleared as soon as a closing bracket is present; modify the checks where
skip_until_close is toggled (the block referencing stripped.endswith("]") and
the similar block later around the other occurrence) to use the broader presence
test and keep existing continue behavior.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1202c542-f63d-4276-9697-cad9ebf99220

📥 Commits

Reviewing files that changed from the base of the PR and between b0545a7 and 661a0b7.

📒 Files selected for processing (2)

tests/test_runner.py
vera_bench/runner.py

aallan · 2026-04-29T09:13:11Z

@jasisz Heads-up: I'd just finished writing the same regex fix locally (_AVER_EFFECTS_OPEN_RE = re.compile(r"^effects\s*\[") + a parameterized whitespace-tolerance test) when I went to push and discovered you'd already landed 661a0b7 doing exactly that. Independent convergence — same regex, same test shape, same four opener variants covered.

Your test is slightly better than what I wrote: asserting "Console.print]" not in header_part is more precise than asserting "effects" not in header_part — yours would catch a hypothetical regression where the strip removes only the keyword but leaves the bracket contents, mine wouldn't. Good instinct on the test target.

Abandoned my local commit, no duplicate push needed. Verified the suite locally on your branch — 104 in test_runner.py pass, including the new whitespace-tolerance case. CI should go green; I'll merge once it does and CodeRabbit posts its final pass.

Thanks again for getting ahead of the Aver 0.13 boundary change here. The whole strip pass — including the regex hardening — is the kind of forward-compat work that's worth landing well before the breaking change ships.

Both `stripped.endswith("]")` checks in _strip_module_effects were brittle to Aver's trailing line comments (`// ...`). For an LLM-emitted declaration like:

    effects [Console.print] // pure module

the strip pass would:

1. match the opener regex (correctly)
2. fail `endswith("]")` because the line ends in `e` (`module`)
3. enter `skip_until_close` mode
4. chew through subsequent lines until something happens to end in `]`, likely catastrophic for the function body

Same hazard on the closing line of a multi-line list:

    effects [
        Console.print,
    ] // pure module

Replace both `endswith("]")` checks with `"]" in stripped`. Module-level effect lists are simple identifier sequences with no string literals or nested brackets, so the theoretical false-match concern (a `]` inside a string in the list) doesn't apply now or in any planned Aver evolution.

Tests:

- `test_strips_inline_effects_with_trailing_comment`: the inline variant; asserts the function body is intact (i.e. NOT eaten by a runaway skip_until_close).
- `test_strips_multiline_effects_with_trailing_comment_on_close`: same hazard on the closing line of a multi-line list.

Both new tests pass; full `tests/test_runner.py` suite green (106 passed locally).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
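The runaway-skip failure described in this commit message is easy to reproduce in isolation. The sketch below is a simplified stand-in for the strip pass (header scoping is omitted for brevity) with the closing test passed in as a parameter, so the two behaviours can be compared side by side. The names `strip_effects`, `closes_strict`, and `closes_lenient`, and the Aver sample itself, are assumptions made for this illustration.

```python
import re
from typing import Callable

OPEN_RE = re.compile(r"^effects\s*\[")


def strip_effects(code: str, closes: Callable[[str], bool]) -> str:
    """Drop an effects declaration, using `closes` to detect the closing line."""
    kept = []
    skipping = False  # plays the role of skip_until_close
    for line in code.splitlines(keepends=True):
        stripped = line.strip()
        if skipping:
            if closes(stripped):
                skipping = False
            continue
        if OPEN_RE.match(stripped):
            if not closes(stripped):
                skipping = True
            continue
        kept.append(line)
    return "".join(kept)


def closes_strict(stripped: str) -> bool:
    # Earlier behaviour: the bracket must be the last character on the line.
    return stripped.endswith("]")


def closes_lenient(stripped: str) -> bool:
    # Behaviour after the fix: a bracket anywhere on the line closes the list.
    return "]" in stripped


SOURCE = (
    "module Demo\n"
    "    effects [Console.print] // pure module\n"
    "\n"
    "fn f() -> Unit\n"
    '    Console.print("hi")\n'
)

strict_result = strip_effects(SOURCE, closes_strict)
lenient_result = strip_effects(SOURCE, closes_lenient)
```

With the strict test the trailing comment hides the closing bracket, no later line ends in `]`, and everything after the `module` line is discarded; with the lenient test only the declaration line is removed.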

aallan · 2026-04-29T09:25:49Z

@coderabbitai @jasisz Pushed 918c3e3 addressing the bracket-close finding from the duplicate-comments review.

CodeRabbit was right — this isn't really a duplicate, it's the second half of an earlier two-part observation (opener strict-match + closer strict-match) where Szymon's 661a0b7 addressed the opener but not the closer. The endswith("]") check at lines 600 and 625 had a worse-than-no-op failure mode for effects [Console.print] // pure module: the opener regex would match, endswith("]") would be False (line ends in e), and skip_until_close would chew through the function body until something happened to end in ].

Applied the suggested fix verbatim (endswith("]") → "]" in stripped in both spots), with a comment explaining the trailing-comment hazard at each call site so the rationale doesn't evaporate.

Two new tests in TestStripModuleEffects:

test_strips_inline_effects_with_trailing_comment — the inline variant; explicitly asserts the function body is intact (i.e. NOT eaten by a runaway skip_until_close).
test_strips_multiline_effects_with_trailing_comment_on_close — same hazard on the closing line of a multi-line list.

Both pass; full tests/test_runner.py green at 106 cases (104 + 2 new). ruff check and ruff format --check clean.

@jasisz — sorry to push to your branch a second time today. Same convergence pattern as this morning's regex fix: CodeRabbit re-raised an unaddressed half of its earlier comment, the suggested diff was small and mechanical, and pushing was faster than waiting. If you'd prefer me to comment-and-let-you-push for future iterations on this PR, just say so — happy to switch modes.

coderabbitai · 2026-04-29T09:26:22Z

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Fetch the current state of runner.py around the relevant lines
rg -n 'endswith|in stripped|skip_until_close|AVER_EFFECTS' vera_bench/runner.py -A2 -B2

echo "---"
# Also check the new tests
rg -n 'trailing_comment' tests/test_runner.py -A20

Length of output: 3908

@aallan The fix in 918c3e3 is exactly right. Verified against the current branch:

Line 603 (skip_until_close branch) and Line 632 (inline-effects branch) both now use "]" in stripped — the explanatory comments at each site make the trailing-comment rationale durable.
The two new tests are well-targeted: asserting "fn f() -> Unit" in result and 'Console.print("hi")' in result directly catches the runaway-skip_until_close failure mode rather than just checking the effects line is gone. That's the right threat model to pin.

With _AVER_EFFECTS_OPEN_RE (whitespace-tolerant opener from 661a0b7) and the "]" in stripped closer now in place, _strip_module_effects handles the full matrix of Aver-generated forms correctly. The 106-case suite looks solid — happy to see this land.

🧠 Learnings used

Learnt from: aallan
Repo: aallan/vera-bench PR: 52
File: scripts/run_full_benchmark.py:237-245
Timestamp: 2026-04-13T18:21:35.683Z
Learning: In aallan/vera-bench, PR `#52` (docs/aver-support-docs) explicitly depends on PR `#48` for the Aver baseline runner implementation. `run_aver_baseline()` is added to `vera_bench/baseline_runner.py` in PR `#48` and wired into the CLI there. The `NotImplementedError` guard for unknown languages does NOT apply to "aver" once PR `#48` is merged/rebased. Do not flag `--language aver` usage in `scripts/run_full_benchmark.py` as broken; it is valid after the PR `#48` dependency is satisfied.

Learnt from: aallan
Repo: aallan/vera-bench PR: 48
File: solutions/aver/VB_T4_004_power.av:6-10
Timestamp: 2026-04-13T07:53:25.971Z
Learning: In vera-bench (aallan/vera-bench), Aver source files (`solutions/aver/**`, `**/*.av`) are excluded from CodeRabbit review via `path_filters` in `.coderabbit.yaml`. CodeRabbit has no training data for the Aver language and applying general-purpose heuristics to Aver code produces false positives. Aver's `verify` blocks constrain the callable input domain in the same way Vera's `requires`/`ensures` clauses do — edge cases that appear unreachable under those constraints should not be flagged as deficiencies. Do not review `*.av` files.

Learnt from: aallan
Repo: aallan/vera-bench PR: 48
File: solutions/aver/VB_T1_007_safe_modulo.av:4-12
Timestamp: 2026-04-13T07:53:20.071Z
Learning: In vera-bench (aallan/vera-bench), Aver (`.av`) source files use `verify` blocks and inline `?` doc-strings as precondition/postcondition contracts that constrain the valid input domain — exactly analogous to Vera's `requires`/`ensures` clauses. Edge cases excluded by these contracts (e.g. `b == 0` when the contract states `b != 0`) are not reachable in valid usage. Do not flag missing runtime guards for inputs that are excluded by an Aver contract. Aver solution files live under `solutions/aver/` and are excluded from automated review via `path_filters: ["!solutions/aver/**", "!**/*.av"]` in `.coderabbit.yaml`.

Learnt from: aallan
Repo: aallan/vera-bench PR: 64
File: .github/workflows/ci.yml:143-149
Timestamp: 2026-04-29T08:53:07.436Z
Learning: In aallan/vera-bench, the `dependency-audit` job in `.github/workflows/ci.yml` uses `pip install --upgrade pip` (rather than the `pip-version` input on `actions/setup-python@v6`) to work around CVE-2026-3219. This form is intentional: it mirrors aallan/vera#537 verbatim because vera and vera-bench share a CI playbook by convention. Do not suggest replacing it with `pip-version: "26.1"`; diverging the mechanism would silently fragment the shared playbook. The trade-off is documented in vera-bench issue `#63`.

Learnt from: aallan
Repo: aallan/vera-bench PR: 48
File: solutions/aver/VB_T4_005_sum_to_n.av:8-10
Timestamp: 2026-04-13T07:53:28.149Z
Learning: In vera-bench (aallan/vera-bench), Aver is a statically typed language (file extension `.av`) for which CodeRabbit has no training data. Aver's `verify` blocks serve the same role as Vera's `requires`/`ensures` clauses: they constrain the valid input domain and make out-of-contract edge cases (e.g., negative inputs to a natural-number function) unreachable in valid usage. Do not apply general-purpose type or recursion heuristics to `.av` files — all Aver solutions under `solutions/aver/` are excluded from review via `path_filters` (`!solutions/aver/**`, `!**/*.av`) in `.coderabbit.yaml`.

Learnt from: aallan
Repo: aallan/vera-bench PR: 48
File: solutions/aver/VB_T4_001_fibonacci.av:7-12
Timestamp: 2026-04-13T07:53:21.497Z
Learning: In vera-bench (aallan/vera-bench), Aver is a zero-training-data language with its own type system, contract model (`verify` blocks), and syntax (`.av` files). Do not apply Vera-specific, Haskell-style, or general FP heuristics (e.g., `Nat` types, `decreases` clauses, specific function naming conventions from the problem spec) to Aver source files in `solutions/aver/**`. Aver's `verify` blocks constrain the input domain in the same way Vera's `requires`/`ensures` clauses do. Aver solution files are excluded from automated review via `path_filters: !solutions/aver/**` and `!**/*.av` in `.coderabbit.yaml`.

Learnt from: aallan
Repo: aallan/vera-bench PR: 48
File: solutions/aver/VB_T4_003_even_odd_mutual_recursion.av:7-17
Timestamp: 2026-04-13T07:53:25.467Z
Learning: In vera-bench (aallan/vera-bench), Aver (`.av`) source files in `solutions/aver/` are excluded from CodeRabbit review via `path_filters` (`!solutions/aver/**`, `!**/*.av`), matching the existing exclusions for Vera (`!**/*.vera`), Python (`!solutions/python/**`), and TypeScript (`!solutions/typescript/**`). Do not review `.av` files.

Learnt from: aallan
Repo: aallan/vera-bench PR: 48
File: solutions/aver/VB_T5_002_greeter_io_boundary.av:8-8
Timestamp: 2026-04-13T07:53:35.548Z
Learning: In the vera-bench repository (aallan/vera-bench), Aver source files (`solutions/aver/**`, `**/*.av`) should be excluded from CodeRabbit automated reviews via `path_filters` in `.coderabbit.yaml`, matching the existing exclusions for `!**/*.vera`, `!solutions/python/**`, and `!solutions/typescript/**`. CodeRabbit has no training data for the Aver language and applying general-purpose heuristics produces false positives.

Learnt from: aallan
Repo: aallan/vera-bench PR: 36
File: tests/test_models.py:84-89
Timestamp: 2026-03-31T14:05:18.280Z
Learning: In `tests/test_models.py` in the `vera-bench` repository, the Anthropic and OpenAI client tests deliberately patch `vera_bench.models.anthropic` / `vera_bench.models.openai` (module-level references) rather than the constructors directly. This is an intentional choice: the `[llm]` extras (`pip install -e .[llm]`) are always installed in CI, so the SDK modules are guaranteed present in `sys.modules`. Do not flag this patch-target approach as incorrect during code review.

The Aver evaluation harness now strips module-header `effects [...]` declarations before injecting the test main (the substantive change in this PR). On Aver 0.12 and earlier the strip is a no-op and Aver scoring is byte-identical to v0.0.9. Once Aver 0.13 ships and models start emitting `effects [...]` per the updated docs, the strip will activate on a measurable fraction of generations and prevent the underdeclared-effects type error. Aver `run_correct` rates will diverge between v0.0.9 and v0.0.10 on Aver 0.13+, so the bump records the methodology boundary in `bench_version` for cross-version analysis. Vera, Vera spec-from-NL, Python, and TypeScript scoring is unaffected.

Files touched:

- pyproject.toml: 0.0.9 -> 0.0.10 (importlib.metadata source of truth)
- CITATION.cff: version + date-released bumped together
- CHANGELOG.md: new [0.0.10] section with Compatibility note explaining the no-op-until-Aver-0.13 nuance; link references updated
- ROADMAP.md: prepended a v0.0.10 line above the v0.0.9 summary

Verified: `pip install -e .` followed by `python -c "from importlib.metadata import version; print(version('vera-bench'))"` reports 0.0.10, full test suite green at 494 cases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

aallan · 2026-04-29T09:33:41Z

@jasisz One more touch on the branch: pushed 87a835e bumping the bench version 0.0.9 → 0.0.10. Reasoning is that this PR is a methodology change (post-processing on LLM output before compilation) — same input, different output to the compiler — so result files written by the post-merge harness aren't directly comparable to v0.0.9 result files once Aver 0.13 ships and the strip starts firing on real generations.

On Aver 0.9.5/0.12 the strip is a no-op and scores are byte-identical, so the bump is preemptive — but it records the methodology boundary in bench_version now, rather than retroactively when the divergence becomes observable.
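One way to respect that boundary during analysis is to group results by `bench_version` before comparing `run_correct` rates. The sketch below assumes each result record is a plain dict carrying those two keys; the actual result-file schema is not shown in this thread, so treat the record shape and the function name as hypothetical.

```python
from collections import defaultdict


def run_correct_rate_by_version(records):
    """Return the share of run_correct records for each bench_version.

    Assumes every record is a dict with `bench_version` and `run_correct`
    keys (a hypothetical shape, not the harness's documented schema).
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for rec in records:
        version = rec["bench_version"]
        totals[version] += 1
        if rec["run_correct"]:
            passed[version] += 1
    return {version: passed[version] / totals[version] for version in totals}
```

Keeping the rates separate per version avoids mixing results produced before and after the strip pass was introduced.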

Files touched:

pyproject.toml: 0.0.9 → 0.0.10
CITATION.cff: version + date-released bumped together (date 2026-04-29)
CHANGELOG.md: new [0.0.10] section with a Compatibility note explaining the no-op-until-Aver-0.13 nuance
ROADMAP.md: prepended a v0.0.10 summary line

Verified importlib.metadata.version("vera-bench") == "0.0.10" after a fresh pip install -e ., full test suite green at 494 cases.
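The version check can also be scripted. A minimal sketch, assuming only the standard-library `importlib.metadata` API; the helper name is hypothetical, and it returns `None` when the distribution is not installed instead of raising.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version string of a distribution, or None."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # The distribution is not installed in this environment.
        return None
```

In an environment where the package has been installed with `pip install -e .`, calling `installed_version("vera-bench")` would be expected to return the version declared in `pyproject.toml`.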

This should be the last push from us — once CI goes green I'll merge and tag v0.0.10.

aallan

CI is all green. Ready to merge.

@aallan

Items 2, 3, 4 from @aallan's consolidated review on PR aallan#70. (Item 1, extracting --parallel N into its own PR, addressed via PR aallan#73.)

### Item 2: README headline section -> single sentence in §Overview

Removed the "AILANG: AI-designed language..." headline section (13 lines: the heading, the description paragraphs, the per-mode results table, the "full-circle finding" paragraph). The phrasing included editorial claims about VeraBench's identity that should be a project-owner call, and "added in this fork" wouldn't read correctly post-merge.

Replaced the §Overview line about baselines with the form @aallan suggested verbatim:

> The same problems are also run in Python, TypeScript, [Aver](https://github.com/jasisz/aver), and [AILANG](https://ailang.sunholo.com/) as baselines. AILANG and Aver are zero-training-data languages, providing additional data points alongside Vera for the language-design-vs-training-data thesis.

Matches the existing Aver pattern: light-touch mention without results writeups in the README.

### Item 3: Delete AILANG_MAPPING.md and AILANG_RESULTS.md

Neither file is load-bearing: no code or tests reference them. Aver landed across PRs aallan#57 / aallan#62 / aallan#65 without AVER_RESULTS.md or AVER_MAPPING.md. Numbers and writeups go in PR descriptions and external content; in-repo docs are reserved for things future maintainers need.

### Item 4: .coderabbit.yaml path_filters

Added the two missing AILANG entries to mirror the existing {python, typescript, aver} pattern:

- "!**/*.ail" (alongside !**/*.vera, !**/*.av)
- "!solutions/ailang/**" (alongside the other solutions/* entries)

This stops CodeRabbit from generating speculative findings on .ail solution files in future review passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jasisz added 2 commits April 27, 2026 13:12

jasisz requested a review from aallan as a code owner April 27, 2026 11:30

coderabbitai Bot reviewed Apr 27, 2026

Comment thread tests/test_runner.py

Comment thread vera_bench/runner.py

This was referenced Apr 29, 2026

CI: drop pip --upgrade in dependency-audit once setup-python ships pip 26.1+ #63

Closed

ci: work around CVE-2026-3219 in setup-python's bundled pip #64

Merged

Merge branch 'main' into aver-strip-module-effects

b0545a7

coderabbitai Bot reviewed Apr 29, 2026

Comment thread vera_bench/runner.py

coderabbitai Bot reviewed Apr 29, 2026

aallan approved these changes Apr 29, 2026

aallan merged commit 100075f into aallan:main Apr 29, 2026
8 checks passed

aallan mentioned this pull request May 4, 2026

aver: migrate baselines + harness to Aver 0.16 Console=String #65

Merged
