
aver: strip module-level effects before injecting test main (planned 0.13)#62

Merged
aallan merged 7 commits into aallan:main from jasisz:aver-strip-module-effects
Apr 29, 2026

Conversation

@jasisz

@jasisz jasisz commented Apr 27, 2026

Contributor

Summary

Aver 0.13 (the upcoming release) introduces a module-level effects [...] boundary that the type-checker enforces: every function's ! [Effect] must be covered, or compilation fails.

The bench harness currently asks the LLM for a function only, then strips the LLM's main() and injects its own Console.print(fn(...)) main with ! [Console.print]. If the LLM declared a narrower module boundary in the original source — including the very common effects [] for a "pure" module — the injected main violates that boundary and aver run fails with an underdeclared-effects type error before any test case runs.

This PR adds _strip_module_effects(code) to the runner, called right after _strip_aver_main(code). It removes the module-header effects [...] line (inline or multi-line), so the module reverts to legacy / no-boundary mode and the injected main type-checks.

Compatibility:

  • No-op on Aver 0.12 and earlier — those versions don't recognise effects [...] on the module header, so LLMs don't generate it and the strip pass simply doesn't fire.
  • Necessary the moment Aver 0.13 ships — once the boundary is enforced, any LLM that follows the language docs and declares it will fail bench without this fix. We measured a 50pp regression on tier1 with claude-haiku-4-5 on 0.13-dev when the doc nudge for effects [] reaches the model.

So merging now is safe: it doesn't change current behaviour, and it lets aver-bench continue to work on the day 0.13 lands without a follow-up rush.

What changed

  • vera_bench/runner.py: new _strip_module_effects(code) helper, wired into the test-main injection path.
  • tests/test_runner.py: 4 cases on TestStripModuleEffects (inline [], inline list, multi-line list, no-op).

Test plan

  • pytest tests/test_runner.py — 101 passed locally
  • Full suite pytest tests/ — 489 passed
  • ruff check + ruff format --check clean
  • Tier1 sanity with claude-haiku-4-5 on Aver 0.13-dev: 9/10 (matches Aver 0.12 baseline)

Context

  • Linked discussion: this is the bench-side counterpart to a docs change in jasisz/aver where the language now documents effects [...] as a first-class module feature
  • VB-T1-007 (safe_modulo) remains the one tier1 fail on haiku-4.5 in both old and new world — Result-wrap vs raw Int return — unrelated to this change

Summary by CodeRabbit

  • Tests

    • Added comprehensive unit tests covering removal of module-level effects declarations across inline, multi-line and edge-case scenarios.
  • Chores

    • Improved the evaluation/benchmark pipeline to post-process generated code (removing module-level effects and injected mains) for cleaner, consistently type-checked bench outputs, ensuring more reliable and predictable results.

jasisz added 2 commits April 27, 2026 13:12
Aver 0.13 enforces module-level `effects [...]` as a hard
type-check boundary: every fn's `! [Effect]` must be covered or the
program fails to compile. The benchmark prompts the LLM for a
function only, then strips its `main()` and injects its own
`Console.print(fn(...))` main with `! [Console.print]`.

If the LLM declared a narrower module boundary in the original
source — including the very common `effects []` for a "pure"
module — the injected main violates the boundary and `aver run`
fails with an underdeclared-effects type error before any test case
even runs.

Fix: drop the module-header `effects [...]` line as part of the
main-replacement pass. The module reverts to legacy / no-boundary
mode, the injected main type-checks, and the rest of the harness
runs unchanged. Handles three shapes:

  effects []
  effects [Console.print, Disk.readText]
  effects [
      Console.print,
      Disk.readText,
  ]

No-op when the module declares no `effects [...]`.

Verified on tier1 with claude-haiku-4-5: 9/10 pass-rate (matches
baseline). The known failure (VB-T1-007 safe_modulo) is unrelated —
the model returns Result<Int, String> instead of raw Int.
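
As a rough illustration of the pass this commit message describes, the three shapes could be handled by a line-oriented filter like the one below. This is a minimal sketch, not the code in vera_bench/runner.py: the name `strip_effects_naive` and the sample inputs are assumptions made for the example, and the Aver snippets are plain strings that have not been checked against a compiler. Like the first version of the PR, it matches an `effects [` line anywhere in the file.

```python
def strip_effects_naive(code: str) -> str:
    """Drop every `effects [...]` declaration, wherever it appears."""
    kept = []
    skipping = False  # True while inside an unclosed multi-line list
    for line in code.splitlines(keepends=True):
        stripped = line.strip()
        if skipping:
            # Swallow list entries up to and including the closing line.
            if stripped.endswith("]"):
                skipping = False
            continue
        if stripped.startswith(("effects [", "effects[")):
            # Inline forms close on the same line; otherwise keep skipping.
            skipping = not stripped.endswith("]")
            continue
        kept.append(line)
    return "".join(kept)


# Illustrative inputs only (hypothetical module and function names).
BODY = 'fn f() -> Unit\n    ! [Console.print]\n    Console.print("hi")\n'
INLINE_EMPTY = "module M\n    effects []\n\n" + BODY
INLINE_LIST = "module M\n    effects [Console.print, Disk.readText]\n\n" + BODY
MULTI_LINE = (
    "module M\n"
    "    effects [\n"
    "        Console.print,\n"
    "        Disk.readText,\n"
    "    ]\n\n" + BODY
)
NO_BOUNDARY = "module M\n\n" + BODY
```

All four inputs reduce to the same module without a boundary line, and the function-level `! [Console.print]` is left alone because it does not start with `effects`.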
Four test cases on TestStripModuleEffects:
- inline `effects []`
- inline `effects [Console.print, Disk.readText]`
- multi-line bracketed list
- no-op when the module declares no effects boundary

Each asserts the boundary line is gone and the rest of the module
(including a `! [Console.print]` inside the function body) is left
untouched.
@jasisz jasisz requested a review from aallan as a code owner April 27, 2026 11:30
@coderabbitai

coderabbitai Bot commented Apr 27, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: c6ec6233-2e85-4432-9cfe-55749f678790

📥 Commits

Reviewing files that changed from the base of the PR and between 661a0b7 and 87a835e.

📒 Files selected for processing (6)
  • CHANGELOG.md
  • CITATION.cff
  • ROADMAP.md
  • pyproject.toml
  • tests/test_runner.py
  • vera_bench/runner.py
📝 Walkthrough

Walkthrough

Remove module-level effects [...] declarations (single-line and multi-line) from the module header of LLM-generated Aver code, and run this cleanup in the Aver evaluation path before injecting the benchmark fn main(). Tests are added to validate the stripping and no-op scenarios.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Runner / Effects stripping: vera_bench/runner.py | Adds private regex _AVER_EFFECTS_OPEN_RE and helper _strip_module_effects to remove effects [...] from the module header (handles effects [], effects [...], multi-line bracketed forms and varying whitespace). Integrates this into the Aver evaluation flow by piping _strip_aver_main output through the new stripper. |
| Tests / Unit coverage: tests/test_runner.py | Adds TestStripModuleEffects tests importing _strip_module_effects covering removal of inline and multi-line module effects, preservation when no module effects present, preservation of effects outside a module header or inside function bodies. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

harness

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 30.77% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding module-level effects stripping before injecting test main in the Aver evaluation path. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_runner.py`:
- Around line 886-937: Add a test in TestStripModuleEffects that verifies
_strip_module_effects is a no-op when the input lacks a module declaration (to
cover the else branch in runner.py that later wraps code); create a case where
the input contains a top-level "effects [...]" line but no "module ..." header
(e.g., code = 'effects [Console.print]\n\nfn f() -> Unit\n   
Console.print("hi")\n') and assert the result equals the original code (or at
least that the top-level effects line is preserved and function body still
contains Console.print) so we lock in that _strip_module_effects does not remove
effects when there is no module header.

In `@vera_bench/runner.py`:
- Around line 569-601: The current _strip_module_effects implementation
unconditionally drops any effects [...] block anywhere and uses
stripped.endswith("]") to find the terminator, which can mis-strip if Aver later
allows nested blocks or trailing tokens on the closing line; update
_strip_module_effects to only operate in the module header region (the first
non-blank block immediately under a `module X` declaration) and to detect an
effects list with a regex anchored to that header (e.g. match a line like
/^\s*effects\s*\[/), then drop lines up through the first line that contains a
closing ']' (allowing trailing comments/parameters on that line) while leaving
other occurrences alone; keep the function name _strip_module_effects and ensure
the logic is scoped to the module header window before skipping lines.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0aebeecf-2057-42fc-b45e-1971b9f0238a

📥 Commits

Reviewing files that changed from the base of the PR and between bd9b6d5 and 5f1611a.

📒 Files selected for processing (2)
  • tests/test_runner.py
  • vera_bench/runner.py

Comment thread tests/test_runner.py
Comment thread vera_bench/runner.py
@codecov

codecov Bot commented Apr 27, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.62%. Comparing base (988cb33) to head (87a835e).
⚠️ Report is 8 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #62      +/-   ##
==========================================
+ Coverage   83.30%   83.62%   +0.31%     
==========================================
  Files          10       10              
  Lines        1366     1392      +26     
==========================================
+ Hits         1138     1164      +26     
  Misses        228      228              
Flag Coverage Δ
python 83.62% <100.00%> (+0.31%) ⬆️


Addressing CodeRabbit review on PR aallan#62:

The original `_strip_module_effects` matched any line starting with
`effects [` regardless of where it sat in the file. That worked for
today's Aver but is fragile: if Aver later allows `effects [...]` in
some other position (or an LLM emits an `effects [...]`-shaped line
inside a function body), the bench would silently delete it.

Scope the strip to the module header — i.e. the indented region
immediately under `module X`, ending at the next top-level item.
That mirrors how the Aver parser scopes `intent` / `exposes` /
`depends` / `effects`. Two new tests lock in the behaviour:

- `test_no_op_when_no_module_declaration`: input without a `module
  X` line stays untouched, even if it starts with an `effects [...]`
  token.
- `test_only_strips_inside_module_header`: an `effects [...]`-shaped
  line inside a function body survives; only the header occurrence
  is removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
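
The header scoping described in this commit can be pictured with the sketch below. It is an approximation under stated assumptions rather than the merged code: the function name `strip_header_effects` is hypothetical, the header is taken to be every indented line between a top-level `module ...` line and the next non-blank top-level line, and the Aver inputs are illustrative strings only.

```python
def strip_header_effects(code: str) -> str:
    """Drop `effects [...]` only inside the indented block under `module X`."""
    kept = []
    in_module_header = False
    skip_until_close = False
    for line in code.splitlines(keepends=True):
        stripped = line.strip()
        indent_len = len(line) - len(line.lstrip(" \t"))
        if skip_until_close:
            # Inside a multi-line list: drop lines through the closing bracket.
            if "]" in stripped:
                skip_until_close = False
            continue
        if stripped and indent_len == 0:
            # A top-level item either opens the module header or ends it.
            in_module_header = stripped.startswith("module ")
            kept.append(line)
            continue
        is_opener = stripped.startswith(("effects [", "effects["))
        if in_module_header and indent_len > 0 and is_opener:
            skip_until_close = "]" not in stripped
            continue
        kept.append(line)
    return "".join(kept)


# No `module` line: the top-level effects token must survive untouched.
NO_MODULE = 'effects [Console.print]\n\nfn f() -> Unit\n    Console.print("hi")\n'

# One occurrence in the header, one inside a function body.
BOTH = (
    "module M\n"
    "    effects [Console.print]\n\n"
    "fn f() -> Unit\n"
    "    effects [Console.print]\n"
    '    Console.print("hi")\n'
)
```

With these inputs, `NO_MODULE` comes back unchanged and `BOTH` loses only the header occurrence, which mirrors the two tests named above.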
@aallan

aallan commented Apr 29, 2026

Owner

@jasisz Sorry for the delay on getting back to this — the substantive change is reviewed and looks solid (will leave detailed feedback once CI is green), but it can't merge yet because the dependency-audit job is failing on an unrelated issue.

The failure is CVE-2026-3219 in pip 26.0.1: actions/setup-python@v6 bakes that pip version into its toolchain image, and pip-audit running inside the runner reports the runner's own pip as vulnerable. Vera hit the same problem at aallan/vera#537; we've applied the same workaround in vera-bench at #64 (with #63 tracking eventual cleanup).

Once #64 lands you'll just need to rebase this branch onto the new main and CI should go green. Happy to do that for you if "allow edits from maintainers" is on, or you can do it yourself — whatever's easier.

Thanks again for getting ahead of the Aver 0.13 boundary change here. The "no-op until 0.13 ships" property of the strip pass is exactly the right way to land forward-compat work, and we want this in before models start picking up the doc nudge for effects [].

aallan added a commit to jasisz/vera-bench that referenced this pull request Apr 29, 2026
…N_ISSUES

The dependency-audit job started failing on PR aallan#62 because
actions/setup-python@v6 bakes pip 26.0.1 into its Python 3.12 image,
and pip 26.0.1 has CVE-2026-3219 (archive handling). The fix landed
in pip 26.1 on 2026-04-26 but won't reach the runner image until
GitHub refreshes the toolchain.

Workaround mirrors aallan/vera#537: a `pip install --upgrade pip`
step before pip-audit runs, pulling pip 26.1 from PyPI to replace
the bundled 26.0.1. Inline comment in ci.yml points at the tracking
issue (aallan#63) so the workaround doesn't quietly outlive its reason.

Also opens KNOWN_ISSUES.md as the catalogue location for active
workarounds, dev-env gotchas, and analytical caveats — each with an
explicit "removal trigger" so cleanup is straightforward later.

Initial entries:
- The CI workaround above (aallan#63)
- assets/results-graph.png pinned to v0.0.7 content until the
  v0.0.9 narrative writeup
- input_tokens semantic shift across PR aallan#60's prompt-caching merge
  (analytical caveat for cost trending across that boundary)
- /opt/homebrew/bin/vera is not the Vera programming language
  (dev-env collision with an unrelated Homebrew package)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@vera_bench/runner.py`:
- Around line 617-621: The module header opener check is too strict (only
matches "effects [" or "effects[") so variants like "effects    [" are missed;
add a compiled regex _AVER_EFFECTS_OPEN_RE = re.compile(r"^effects\s*\[") and
replace the long tuple check with a single match call: use
_AVER_EFFECTS_OPEN_RE.match(stripped) in the existing conditional that
references in_module_header, indent_len and stripped so the header detection
tolerates arbitrary whitespace before the '['; ensure re is imported/available
where _AVER_EFFECTS_OPEN_RE is defined and referenced.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a92443f8-032e-4650-8bd9-a535d0a0e3f4

📥 Commits

Reviewing files that changed from the base of the PR and between 5f1611a and b0545a7.

📒 Files selected for processing (2)
  • tests/test_runner.py
  • vera_bench/runner.py

Comment thread vera_bench/runner.py
The opener check used startswith("effects [", "effects["), which misses
LLM output with non-canonical whitespace (multiple spaces, tab) between
the keyword and the bracket — variants the Aver parser still accepts.
Replace the prefix tuple with _AVER_EFFECTS_OPEN_RE = re.compile(
r"^effects\s*\[") and cover the variants in TestStripModuleEffects.
@jasisz

jasisz commented Apr 29, 2026

Contributor Author

Thanks for handling the rebase yourself, and for the quick #64 — saved me a context-switch out of Aver 0.14 work, much appreciated.

Pushed CR's whitespace point as 661a0b7: _AVER_EFFECTS_OPEN_RE = re.compile(r"^effects\s*\[") plus a test covering opener variants the parser still accepts (effects[, effects [, effects followed by several spaces and [, effects\t[). CI should be green once it runs.
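
To make the difference between the prefix tuple and the regex concrete, here is a small self-contained comparison. The pattern is the one quoted in this thread; the helper names `opens_with_prefix` and `opens_with_regex` are invented for the example and do not exist in the runner.

```python
import re

# Same pattern as quoted above; compiled once at module level.
AVER_EFFECTS_OPEN_RE = re.compile(r"^effects\s*\[")


def opens_with_prefix(line: str) -> bool:
    """The earlier check: only the two canonical spellings."""
    return line.strip().startswith(("effects [", "effects["))


def opens_with_regex(line: str) -> bool:
    """The hardened check: any run of whitespace before the bracket."""
    return AVER_EFFECTS_OPEN_RE.match(line.strip()) is not None


VARIANTS = ["effects[", "effects [", "effects    [", "effects\t["]
missed_by_prefix = [v for v in VARIANTS if not opens_with_prefix(v)]
missed_by_regex = [v for v in VARIANTS if not opens_with_regex(v)]
```

Here `missed_by_prefix` holds the multi-space and tab variants, while `missed_by_regex` is empty. Because the pattern is anchored with `^`, a line such as `side effects [` is still not treated as an opener.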

Glad the no-op-then-activate property landed well — happy to use the same shape for future Aver-X.Y boundary changes; that way the bench can stay continuously valid for whichever Aver version a model actually emits.

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
vera_bench/runner.py (1)

597-602: ⚠️ Potential issue | 🟠 Major

Bracket-close detection is too strict and can over-strip the file.

At Line 600 and Line 625, using stripped.endswith("]") means a closing bracket with trailing tokens is treated as unclosed. In the multi-line path, that can keep skip_until_close active and silently drop later declarations.

Proposed fix
-            if stripped.endswith("]"):
+            if "]" in stripped:
                 skip_until_close = False
             continue
@@
-            if stripped.endswith("]"):
+            if "]" in stripped:
                 continue
             skip_until_close = True
             continue

Also applies to: 625-627

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vera_bench/runner.py` around lines 597 - 602, The multi-line bracket-close
detection using stripped.endswith("]") (the skip_until_close handling in the
loop) is too strict and misses cases where a closing ']' appears earlier on the
line with trailing tokens; update both occurrences that check
stripped.endswith("]") to instead detect a ']' anywhere on the line (e.g., check
for ']' in the original line or use a regex/search) so skip_until_close is
cleared as soon as a closing bracket is present; modify the checks where
skip_until_close is toggled (the block referencing stripped.endswith("]") and
the similar block later around the other occurrence) to use the broader presence
test and keep existing continue behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1202c542-f63d-4276-9697-cad9ebf99220

📥 Commits

Reviewing files that changed from the base of the PR and between b0545a7 and 661a0b7.

📒 Files selected for processing (2)
  • tests/test_runner.py
  • vera_bench/runner.py

@aallan

aallan commented Apr 29, 2026

Owner

@jasisz Heads-up: I'd just finished writing the same regex fix locally (_AVER_EFFECTS_OPEN_RE = re.compile(r"^effects\s*\[") + a parameterized whitespace-tolerance test) when I went to push and discovered you'd already landed 661a0b7 doing exactly that. Independent convergence — same regex, same test shape, same four opener variants covered.

Your test is slightly better than what I wrote: asserting "Console.print]" not in header_part is more precise than asserting "effects" not in header_part — yours would catch a hypothetical regression where the strip removes only the keyword but leaves the bracket contents, mine wouldn't. Good instinct on the test target.
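
The difference between the two assertions can be shown with a deliberately broken strip. Everything in this sketch is hypothetical: `buggy_strip` stands in for the regression described above (keyword removed, bracket contents left behind), and `header_part` is assumed to be the text before the first `fn `, which may differ from how the real test computes it.

```python
def buggy_strip(code: str) -> str:
    """Hypothetical regression: removes the keyword but keeps the list."""
    return code.replace("effects ", "", 1)


def header_of(code: str) -> str:
    """Assumed definition of the header: everything before the first `fn `."""
    return code.split("fn ", 1)[0]


SOURCE = (
    "module M\n"
    "    effects [Console.print]\n\n"
    "fn f() -> Unit\n"
    "    ! [Console.print]\n"
    '    Console.print("hi")\n'
)

header_part = header_of(buggy_strip(SOURCE))

# Weaker assertion: passes even though the list is still in the header.
weak_assertion_passes = "effects" not in header_part
# Stronger assertion: fails, so the regression would be caught.
strong_assertion_passes = "Console.print]" not in header_part
```

Against this regression the keyword-only check still passes, while the check on the bracket contents fails as intended.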

Abandoned my local commit, no duplicate push needed. Verified the suite locally on your branch — 104 in test_runner.py pass, including the new whitespace-tolerance case. CI should go green; I'll merge once it does and CodeRabbit posts its final pass.

Thanks again for getting ahead of the Aver 0.13 boundary change here. The whole strip pass — including the regex hardening — is the kind of forward-compat work that's worth landing well before the breaking change ships.

Both `stripped.endswith("]")` checks in _strip_module_effects were
brittle to Aver's trailing line comments (`// ...`). For an LLM-emitted
declaration like:

    effects [Console.print] // pure module

the strip pass would:
  1. match the opener regex (correctly)
  2. fail `endswith("]")` because the line ends in `e` (`pure`)
  3. enter `skip_until_close` mode
  4. chew through subsequent lines until something happens to end
     in `]` — likely catastrophic for the function body

Same hazard on the closing line of a multi-line list:

    effects [
        Console.print,
    ] // pure module

Replace both `endswith("]")` checks with `"]" in stripped`. Module-level
effect lists are simple identifier sequences with no string literals or
nested brackets, so the theoretical false-match concern (a `]` inside a
string in the list) doesn't apply now or in any planned Aver evolution.

Tests:
- `test_strips_inline_effects_with_trailing_comment` — the inline
  variant; asserts the function body is intact (i.e. NOT eaten by a
  runaway skip_until_close).
- `test_strips_multiline_effects_with_trailing_comment_on_close` — same
  hazard on the closing line of a multi-line list.

Both new tests pass; full `tests/test_runner.py` suite green
(106 passed locally).
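
The runaway skip can be reproduced in isolation. The sketch below is a simplified stand-in for the real helper: it is not scoped to the module header, the name `strip_effects` and its `closes` parameter are assumptions for the demonstration, and the Aver input is an illustrative string.

```python
import re
from typing import Callable

OPEN_RE = re.compile(r"^effects\s*\[")


def strip_effects(code: str, closes: Callable[[str], bool]) -> str:
    """Strip an effects list, using `closes` to detect the closing line."""
    kept = []
    skip_until_close = False
    for line in code.splitlines(keepends=True):
        stripped = line.strip()
        if skip_until_close:
            if closes(stripped):
                skip_until_close = False
            continue
        if OPEN_RE.match(stripped):
            # If the opener line is not seen as closed, keep skipping.
            skip_until_close = not closes(stripped)
            continue
        kept.append(line)
    return "".join(kept)


SOURCE = (
    "module M\n"
    "    effects [Console.print] // pure module\n\n"
    "fn f() -> Unit\n"
    '    Console.print("hi")\n'
)

# Old closer: the trailing comment hides the bracket, so the skip never ends.
old_result = strip_effects(SOURCE, lambda s: s.endswith("]"))
# New closer: any `]` on the line ends the list.
new_result = strip_effects(SOURCE, lambda s: "]" in s)
```

With the old closer only `module M` survives, because no later line ends in `]`. With the new closer the function body is intact.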

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@aallan

aallan commented Apr 29, 2026

Owner

@coderabbitai @jasisz Pushed 918c3e3 addressing the bracket-close finding from the duplicate-comments review.

CodeRabbit was right — this isn't really a duplicate, it's the second half of an earlier two-part observation (opener strict-match + closer strict-match) where Szymon's 661a0b7 addressed the opener but not the closer. The endswith("]") check at lines 600 and 625 had a worse-than-no-op failure mode for effects [Console.print] // pure module: the opener regex would match, endswith("]") would be False (line ends in e), and skip_until_close would chew through the function body until something happened to end in ].

Applied the suggested fix verbatim (endswith("]") -> "]" in stripped in both spots), with a comment explaining the trailing-comment hazard at each call site so the rationale doesn't evaporate.

Two new tests in TestStripModuleEffects:

  • test_strips_inline_effects_with_trailing_comment — the inline variant; explicitly asserts the function body is intact (i.e. NOT eaten by a runaway skip_until_close).
  • test_strips_multiline_effects_with_trailing_comment_on_close — same hazard on the closing line of a multi-line list.

Both pass; full tests/test_runner.py green at 106 cases (104 + 2 new). ruff check and ruff format --check clean.

@jasisz — sorry to push to your branch a second time today. Same convergence pattern as this morning's regex fix: CodeRabbit re-raised an unaddressed half of its earlier comment, the suggested diff was small and mechanical, and pushing was faster than waiting. If you'd prefer me to comment-and-let-you-push for future iterations on this PR, just say so — happy to switch modes.

@coderabbitai

coderabbitai Bot commented Apr 29, 2026


🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Fetch the current state of runner.py around the relevant lines
rg -n 'endswith|in stripped|skip_until_close|AVER_EFFECTS' vera_bench/runner.py -A2 -B2

echo "---"
# Also check the new tests
rg -n 'trailing_comment' tests/test_runner.py -A20

Length of output: 3908


@aallan The fix in 918c3e3 is exactly right. Verified against the current branch:

  • Line 603 (skip_until_close branch) and Line 632 (inline-effects branch) both now use "]" in stripped — the explanatory comments at each site make the trailing-comment rationale durable.
  • The two new tests are well-targeted: asserting "fn f() -> Unit" in result and 'Console.print("hi")' in result directly catches the runaway-skip_until_close failure mode rather than just checking the effects line is gone. That's the right threat model to pin.

With _AVER_EFFECTS_OPEN_RE (whitespace-tolerant opener from 661a0b7) and the "]" in stripped closer now in place, _strip_module_effects handles the full matrix of Aver-generated forms correctly. The 106-case suite looks solid — happy to see this land.


🧠 Learnings used
Learnt from: aallan
Repo: aallan/vera-bench PR: 52
File: scripts/run_full_benchmark.py:237-245
Timestamp: 2026-04-13T18:21:35.683Z
Learning: In aallan/vera-bench, PR `#52` (docs/aver-support-docs) explicitly depends on PR `#48` for the Aver baseline runner implementation. `run_aver_baseline()` is added to `vera_bench/baseline_runner.py` in PR `#48` and wired into the CLI there. The `NotImplementedError` guard for unknown languages does NOT apply to "aver" once PR `#48` is merged/rebased. Do not flag `--language aver` usage in `scripts/run_full_benchmark.py` as broken; it is valid after the PR `#48` dependency is satisfied.

Learnt from: aallan
Repo: aallan/vera-bench PR: 48
File: solutions/aver/VB_T4_004_power.av:6-10
Timestamp: 2026-04-13T07:53:25.971Z
Learning: In vera-bench (aallan/vera-bench), Aver source files (`solutions/aver/**`, `**/*.av`) are excluded from CodeRabbit review via `path_filters` in `.coderabbit.yaml`. CodeRabbit has no training data for the Aver language and applying general-purpose heuristics to Aver code produces false positives. Aver's `verify` blocks constrain the callable input domain in the same way Vera's `requires`/`ensures` clauses do — edge cases that appear unreachable under those constraints should not be flagged as deficiencies. Do not review `*.av` files.

Learnt from: aallan
Repo: aallan/vera-bench PR: 48
File: solutions/aver/VB_T1_007_safe_modulo.av:4-12
Timestamp: 2026-04-13T07:53:20.071Z
Learning: In vera-bench (aallan/vera-bench), Aver (`.av`) source files use `verify` blocks and inline `?` doc-strings as precondition/postcondition contracts that constrain the valid input domain — exactly analogous to Vera's `requires`/`ensures` clauses. Edge cases excluded by these contracts (e.g. `b == 0` when the contract states `b != 0`) are not reachable in valid usage. Do not flag missing runtime guards for inputs that are excluded by an Aver contract. Aver solution files live under `solutions/aver/` and are excluded from automated review via `path_filters: ["!solutions/aver/**", "!**/*.av"]` in `.coderabbit.yaml`.

Learnt from: aallan
Repo: aallan/vera-bench PR: 64
File: .github/workflows/ci.yml:143-149
Timestamp: 2026-04-29T08:53:07.436Z
Learning: In aallan/vera-bench, the `dependency-audit` job in `.github/workflows/ci.yml` uses `pip install --upgrade pip` (rather than the `pip-version` input on `actions/setup-python@v6`) to work around CVE-2026-3219. This form is intentional: it mirrors aallan/vera#537 verbatim because vera and vera-bench share a CI playbook by convention. Do not suggest replacing it with `pip-version: "26.1"`; diverging the mechanism would silently fragment the shared playbook. The trade-off is documented in vera-bench issue `#63`.

Learnt from: aallan
Repo: aallan/vera-bench PR: 48
File: solutions/aver/VB_T4_005_sum_to_n.av:8-10
Timestamp: 2026-04-13T07:53:28.149Z
Learning: In vera-bench (aallan/vera-bench), Aver is a statically typed language (file extension `.av`) for which CodeRabbit has no training data. Aver's `verify` blocks serve the same role as Vera's `requires`/`ensures` clauses: they constrain the valid input domain and make out-of-contract edge cases (e.g., negative inputs to a natural-number function) unreachable in valid usage. Do not apply general-purpose type or recursion heuristics to `.av` files — all Aver solutions under `solutions/aver/` are excluded from review via `path_filters` (`!solutions/aver/**`, `!**/*.av`) in `.coderabbit.yaml`.

Learnt from: aallan
Repo: aallan/vera-bench PR: 48
File: solutions/aver/VB_T4_001_fibonacci.av:7-12
Timestamp: 2026-04-13T07:53:21.497Z
Learning: In vera-bench (aallan/vera-bench), Aver is a zero-training-data language with its own type system, contract model (`verify` blocks), and syntax (`.av` files). Do not apply Vera-specific, Haskell-style, or general FP heuristics (e.g., `Nat` types, `decreases` clauses, specific function naming conventions from the problem spec) to Aver source files in `solutions/aver/**`. Aver's `verify` blocks constrain the input domain in the same way Vera's `requires`/`ensures` clauses do. Aver solution files are excluded from automated review via `path_filters: !solutions/aver/**` and `!**/*.av` in `.coderabbit.yaml`.

Learnt from: aallan
Repo: aallan/vera-bench PR: 48
File: solutions/aver/VB_T4_003_even_odd_mutual_recursion.av:7-17
Timestamp: 2026-04-13T07:53:25.467Z
Learning: In vera-bench (aallan/vera-bench), Aver (`.av`) source files in `solutions/aver/` are excluded from CodeRabbit review via `path_filters` (`!solutions/aver/**`, `!**/*.av`), matching the existing exclusions for Vera (`!**/*.vera`), Python (`!solutions/python/**`), and TypeScript (`!solutions/typescript/**`). Do not review `.av` files.

Learnt from: aallan
Repo: aallan/vera-bench PR: 48
File: solutions/aver/VB_T5_002_greeter_io_boundary.av:8-8
Timestamp: 2026-04-13T07:53:35.548Z
Learning: In the vera-bench repository (aallan/vera-bench), Aver source files (`solutions/aver/**`, `**/*.av`) should be excluded from CodeRabbit automated reviews via `path_filters` in `.coderabbit.yaml`, matching the existing exclusions for `!**/*.vera`, `!solutions/python/**`, and `!solutions/typescript/**`. CodeRabbit has no training data for the Aver language and applying general-purpose heuristics produces false positives.

Learnt from: aallan
Repo: aallan/vera-bench PR: 36
File: tests/test_models.py:84-89
Timestamp: 2026-03-31T14:05:18.280Z
Learning: In `tests/test_models.py` in the `vera-bench` repository, the Anthropic and OpenAI client tests deliberately patch `vera_bench.models.anthropic` / `vera_bench.models.openai` (module-level references) rather than the constructors directly. This is an intentional choice: the `[llm]` extras (`pip install -e .[llm]`) are always installed in CI, so the SDK modules are guaranteed present in `sys.modules`. Do not flag this patch-target approach as incorrect during code review.

The Aver evaluation harness now strips module-header `effects [...]`
declarations before injecting the test main (the substantive change in
this PR). On Aver 0.12 and earlier the strip is a no-op and Aver
scoring is byte-identical to v0.0.9. Once Aver 0.13 ships and models
start emitting `effects [...]` per the updated docs, the strip will
activate on a measurable fraction of generations and prevent the
underdeclared-effects type error — Aver `run_correct` rates will
diverge between v0.0.9 and v0.0.10 on Aver 0.13+, so the bump records
the methodology boundary in `bench_version` for cross-version analysis.

Vera, Vera spec-from-NL, Python, and TypeScript scoring is unaffected.

Files touched:
- pyproject.toml: 0.0.9 -> 0.0.10 (importlib.metadata source of truth)
- CITATION.cff: version + date-released bumped together
- CHANGELOG.md: new [0.0.10] section with Compatibility note explaining
  the no-op-until-Aver-0.13 nuance; link references updated
- ROADMAP.md: prepended a v0.0.10 line above the v0.0.9 summary

Verified: `pip install -e .` followed by
`python -c "from importlib.metadata import version; print(version('vera-bench'))"`
reports 0.0.10, full test suite green at 494 cases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@aallan

aallan commented Apr 29, 2026

Owner

@jasisz One more touch on the branch: pushed 87a835e bumping the bench version 0.0.9 → 0.0.10. Reasoning is that this PR is a methodology change (post-processing on LLM output before compilation): same input, different output to the compiler. Result files written by the post-merge harness therefore aren't directly comparable to v0.0.9 result files once Aver 0.13 ships and the strip starts firing on real generations.

On Aver 0.9.5/0.12 the strip is a no-op and scores are byte-identical, so the bump is preemptive, but it records the methodology boundary in `bench_version` now, rather than retroactively when the divergence becomes observable.
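
For readers unfamiliar with the strip pass, here is a minimal sketch of the idea in Python. It is not the code merged in `vera_bench/runner.py`: the function name `strip_module_effects_sketch` and the regular expression are illustrative assumptions, and it assumes the module-level declaration starts a line with the keyword `effects` followed by a bracketed list that may span several lines.

```python
import re

# Hypothetical pattern (an assumption, not the merged implementation):
# optional indentation, the keyword `effects`, then a bracketed list.
# `[^\]]*` also matches newlines, so a multi-line list is covered.
_MODULE_EFFECTS_RE = re.compile(
    r"^[ \t]*effects[ \t]*\[[^\]]*\][ \t]*\r?\n?",
    re.MULTILINE,
)


def strip_module_effects_sketch(code: str) -> str:
    """Remove the first module-level `effects [...]` declaration, if any.

    Function-level annotations of the form `! [Effect]` are left alone
    because they do not begin with the `effects` keyword.
    """
    # count=1: only one module header is expected per source file.
    return _MODULE_EFFECTS_RE.sub("", code, count=1)
```

On source that contains no such declaration the function returns its input unchanged, which corresponds to the no-op case on older Aver versions. A line-anchored regex like this could also match an ordinary expression that happens to begin a line with `effects[`, so a real implementation would probably want to restrict the match to the module header.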

Files touched:

  • pyproject.toml: 0.0.9 → 0.0.10
  • CITATION.cff: version + date-released bumped together (date 2026-04-29)
  • CHANGELOG.md: new [0.0.10] section with a Compatibility note explaining the no-op-until-Aver-0.13 nuance
  • ROADMAP.md: prepended a v0.0.10 summary line

Verified `importlib.metadata.version("vera-bench") == "0.0.10"` after a fresh `pip install -e .`; full test suite green at 494 cases.

This should be the last push from us — once CI goes green I'll merge and tag v0.0.10.
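
One way to respect that methodology boundary during analysis is to group result records by `bench_version` before computing rates, so v0.0.9 and v0.0.10 runs are never pooled. The sketch below assumes each record is a dict with a `bench_version` string and a boolean `run_correct` key; that schema and the function name are hypothetical, not the harness's actual result format.

```python
from collections import defaultdict
from typing import Iterable


def run_correct_rate_by_version(records: Iterable[dict]) -> dict[str, float]:
    """Return the fraction of run-correct records for each bench_version."""
    totals: dict[str, int] = defaultdict(int)
    passed: dict[str, int] = defaultdict(int)
    for record in records:
        # Hypothetical schema: records without the field are grouped
        # under "unknown" rather than being dropped silently.
        version = record.get("bench_version", "unknown")
        totals[version] += 1
        if record.get("run_correct"):
            passed[version] += 1
    return {version: passed[version] / totals[version] for version in totals}
```

Keeping the versions as separate keys means any cross-version comparison has to be made explicitly by the analyst instead of happening by accident.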

@aallan aallan left a comment

Owner


CI is all green. Ready to merge.

@aallan aallan merged commit 100075f into aallan:main Apr 29, 2026
8 checks passed
sunholo-voight-kampff added a commit to sunholo-voight-kampff/vera-bench that referenced this pull request May 22, 2026
Items 2, 3, 4 from @aallan's consolidated review on PR aallan#70.
(Item 1 — extracting --parallel N into its own PR — addressed via
PR aallan#73.)

### Item 2: README headline section -> single sentence in §Overview

Removed the "AILANG: AI-designed language..." headline section
(13 lines: the heading, the description paragraphs, the per-mode
results table, the "full-circle finding" paragraph). The phrasing
included editorial claims about VeraBench's identity that should
be a project-owner call, and "added in this fork" wouldn't read
correctly post-merge.

Replaced the §Overview line about baselines with the form
@aallan suggested verbatim:

  The same problems are also run in Python, TypeScript,
  [Aver](https://github.com/jasisz/aver), and [AILANG](https://ailang.sunholo.com/)
  as baselines. AILANG and Aver are zero-training-data languages,
  providing additional data points alongside Vera for the
  language-design-vs-training-data thesis.

Matches the existing Aver pattern: light-touch mention without
results writeups in the README.

### Item 3: Delete AILANG_MAPPING.md and AILANG_RESULTS.md

Neither file is load-bearing — no code or tests reference them.
Aver landed across PRs aallan#57 / aallan#62 / aallan#65 without AVER_RESULTS.md or
AVER_MAPPING.md. Numbers and writeups go in PR descriptions and
external content; in-repo docs are reserved for things future
maintainers need.

### Item 4: .coderabbit.yaml path_filters

Added the two missing AILANG entries to mirror the existing
{python, typescript, aver} pattern:

    - "!**/*.ail"             (alongside !**/*.vera, !**/*.av)
    - "!solutions/ailang/**"  (alongside the other solutions/* entries)

This stops CodeRabbit from generating speculative findings on
.ail solution files in future review passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>