aver: strip module-level effects before injecting test main (planned 0.13)#62
Conversation
Aver 0.13 enforces module-level `effects [...]` as a hard
type-check boundary: every fn's `! [Effect]` must be covered or the
program fails to compile. The benchmark prompts the LLM for a
function only, then strips its `main()` and injects its own
`Console.print(fn(...))` main with `! [Console.print]`.
If the LLM declared a narrower module boundary in the original
source — including the very common `effects []` for a "pure"
module — the injected main violates the boundary and `aver run`
fails with an underdeclared-effects type error before any test case
even runs.
Fix: drop the module-header `effects [...]` line as part of the
main-replacement pass. The module reverts to legacy / no-boundary
mode, the injected main type-checks, and the rest of the harness
runs unchanged. Handles three shapes:
effects []
effects [Console.print, Disk.readText]
effects [
Console.print,
Disk.readText,
]
No-op when the module declares no `effects [...]`.
Verified on tier1 with claude-haiku-4-5: 9/10 pass-rate (matches
baseline). The known failure (VB-T1-007 safe_modulo) is unrelated —
the model returns Result<Int, String> instead of raw Int.
Four test cases on TestStripModuleEffects: - inline `effects []` - inline `effects [Console.print, Disk.readText]` - multi-line bracketed list - no-op when the module declares no effects boundary Each asserts the boundary line is gone and the rest of the module (including a `! [Console.print]` inside the function body) is left untouched.
|
Warning Rate limit exceeded
To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (6)
📝 WalkthroughWalkthroughRemove module-level Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Review rate limit: 0/1 reviews remaining, refill in 22 minutes and 44 seconds.Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/test_runner.py`:
- Around line 886-937: Add a test in TestStripModuleEffects that verifies
_strip_module_effects is a no-op when the input lacks a module declaration (to
cover the else branch in runner.py that later wraps code); create a case where
the input contains a top-level "effects [...]" line but no "module ..." header
(e.g., code = 'effects [Console.print]\n\nfn f() -> Unit\n
Console.print("hi")\n') and assert the result equals the original code (or at
least that the top-level effects line is preserved and function body still
contains Console.print) so we lock in that _strip_module_effects does not remove
effects when there is no module header.
In `@vera_bench/runner.py`:
- Around line 569-601: The current _strip_module_effects implementation
unconditionally drops any effects [...] block anywhere and uses
stripped.endswith("]") to find the terminator, which can mis-strip if Aver later
allows nested blocks or trailing tokens on the closing line; update
_strip_module_effects to only operate in the module header region (the first
non-blank block immediately under a `module X` declaration) and to detect an
effects list with a regex anchored to that header (e.g. match a line like
/^\s*effects\s*\[/), then drop lines up through the first line that contains a
closing ']' (allowing trailing comments/parameters on that line) while leaving
other occurrences alone; keep the function name _strip_module_effects and ensure
the logic is scoped to the module header window before skipping lines.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 0aebeecf-2057-42fc-b45e-1971b9f0238a
📒 Files selected for processing (2)
tests/test_runner.pyvera_bench/runner.py
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #62 +/- ##
==========================================
+ Coverage 83.30% 83.62% +0.31%
==========================================
Files 10 10
Lines 1366 1392 +26
==========================================
+ Hits 1138 1164 +26
Misses 228 228
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Addressing CodeRabbit review on PR aallan#62: The original `_strip_module_effects` matched any line starting with `effects [` regardless of where it sat in the file. That worked for today's Aver but is fragile: if Aver later allows `effects [...]` in some other position (or an LLM emits an `effects [...]`-shaped line inside a function body), the bench would silently delete it. Scope the strip to the module header — i.e. the indented region immediately under `module X`, ending at the next top-level item. That mirrors how the Aver parser scopes `intent` / `exposes` / `depends` / `effects`. Two new tests lock in the behaviour: - `test_no_op_when_no_module_declaration`: input without a `module X` line stays untouched, even if it starts with an `effects [...]` token. - `test_only_strips_inside_module_header`: an `effects [...]`-shaped line inside a function body survives; only the header occurrence is removed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@jasisz Sorry for the delay on getting back to this — the substantive change is reviewed and looks solid (will leave detailed feedback once CI is green), but it can't merge yet because the The failure is CVE-2026-3219 in pip 26.0.1: Once #64 lands you'll just need to rebase this branch onto the new main and CI should go green. Happy to do that for you if "allow edits from maintainers" is on, or you can do it yourself — whatever's easier. Thanks again for getting ahead of the Aver 0.13 boundary change here. The "no-op until 0.13 ships" property of the strip pass is exactly the right way to land forward-compat work, and we want this in before models start picking up the doc nudge for |
…N_ISSUES The dependency-audit job started failing on PR aallan#62 because actions/setup-python@v6 bakes pip 26.0.1 into its Python 3.12 image, and pip 26.0.1 has CVE-2026-3219 (archive handling). The fix landed in pip 26.1 on 2026-04-26 but won't reach the runner image until GitHub refreshes the toolchain. Workaround mirrors aallan/vera#537: a `pip install --upgrade pip` step before pip-audit runs, pulling pip 26.1 from PyPI to replace the bundled 26.0.1. Inline comment in ci.yml points at the tracking issue (aallan#63) so the workaround doesn't quietly outlive its reason. Also opens KNOWN_ISSUES.md as the catalogue location for active workarounds, dev-env gotchas, and analytical caveats — each with an explicit "removal trigger" so cleanup is straightforward later. Initial entries: - The CI workaround above (aallan#63) - assets/results-graph.png pinned to v0.0.7 content until the v0.0.9 narrative writeup - input_tokens semantic shift across PR aallan#60's prompt-caching merge (analytical caveat for cost trending across that boundary) - /opt/homebrew/bin/vera is not the Vera programming language (dev-env collision with an unrelated Homebrew package) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@vera_bench/runner.py`:
- Around line 617-621: The module header opener check is too strict (only
matches "effects [" or "effects[") so variants like "effects [" are missed;
add a compiled regex _AVER_EFFECTS_OPEN_RE = re.compile(r"^effects\s*\[") and
replace the long tuple check with a single match call: use
_AVER_EFFECTS_OPEN_RE.match(stripped) in the existing conditional that
references in_module_header, indent_len and stripped so the header detection
tolerates arbitrary whitespace before the '['; ensure re is imported/available
where _AVER_EFFECTS_OPEN_RE is defined and referenced.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: a92443f8-032e-4650-8bd9-a535d0a0e3f4
📒 Files selected for processing (2)
tests/test_runner.pyvera_bench/runner.py
The opener check used startswith("effects [", "effects["), which misses
LLM output with non-canonical whitespace (multiple spaces, tab) between
the keyword and the bracket — variants the Aver parser still accepts.
Replace the prefix tuple with _AVER_EFFECTS_OPEN_RE = re.compile(
r"^effects\\s*\\[") and cover the variants in TestStripModuleEffects.
|
Thanks for handling the rebase yourself, and for the quick #64 — saved me a context-switch out of Aver 0.14 work, much appreciated. Pushed CR's whitespace point as Glad the no-op-then-activate property landed well — happy to use the same shape for future Aver-X.Y boundary changes; that way the bench can stay continuously valid for whichever Aver version a model actually emits. |
There was a problem hiding this comment.
♻️ Duplicate comments (1)
vera_bench/runner.py (1)
597-602:⚠️ Potential issue | 🟠 MajorBracket-close detection is too strict and can over-strip the file.
At Line 600 and Line 625, using
stripped.endswith("]")means a closing bracket with trailing tokens is treated as unclosed. In the multi-line path, that can keepskip_until_closeactive and silently drop later declarations.Proposed fix
- if stripped.endswith("]"): + if "]" in stripped: skip_until_close = False continue @@ - if stripped.endswith("]"): + if "]" in stripped: continue skip_until_close = True continueAlso applies to: 625-627
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@vera_bench/runner.py` around lines 597 - 602, The multi-line bracket-close detection using stripped.endswith("]") (the skip_until_close handling in the loop) is too strict and misses cases where a closing ']' appears earlier on the line with trailing tokens; update both occurrences that check stripped.endswith("]") to instead detect a ']' anywhere on the line (e.g., check for ']' in the original line or use a regex/search) so skip_until_close is cleared as soon as a closing bracket is present; modify the checks where skip_until_close is toggled (the block referencing stripped.endswith("]") and the similar block later around the other occurrence) to use the broader presence test and keep existing continue behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@vera_bench/runner.py`:
- Around line 597-602: The multi-line bracket-close detection using
stripped.endswith("]") (the skip_until_close handling in the loop) is too strict
and misses cases where a closing ']' appears earlier on the line with trailing
tokens; update both occurrences that check stripped.endswith("]") to instead
detect a ']' anywhere on the line (e.g., check for ']' in the original line or
use a regex/search) so skip_until_close is cleared as soon as a closing bracket
is present; modify the checks where skip_until_close is toggled (the block
referencing stripped.endswith("]") and the similar block later around the other
occurrence) to use the broader presence test and keep existing continue
behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 1202c542-f63d-4276-9697-cad9ebf99220
📒 Files selected for processing (2)
tests/test_runner.pyvera_bench/runner.py
|
@jasisz Heads-up: I'd just finished writing the same regex fix locally ( Your test is slightly better than what I wrote: asserting Abandoned my local commit, no duplicate push needed. Verified the suite locally on your branch — 104 in Thanks again for getting ahead of the Aver 0.13 boundary change here. The whole strip pass — including the regex hardening — is the kind of forward-compat work that's worth landing well before the breaking change ships. |
Both `stripped.endswith("]")` checks in _strip_module_effects were
brittle to Aver's trailing line comments (`// ...`). For an LLM-emitted
declaration like:
effects [Console.print] // pure module
the strip pass would:
1. match the opener regex (correctly)
2. fail `endswith("]")` because the line ends in `e` (`pure`)
3. enter `skip_until_close` mode
4. chew through subsequent lines until something happens to end
in `]` — likely catastrophic for the function body
Same hazard on the closing line of a multi-line list:
effects [
Console.print,
] // pure module
Replace both `endswith("]")` checks with `"]" in stripped`. Module-level
effect lists are simple identifier sequences with no string literals or
nested brackets, so the theoretical false-match concern (a `]` inside a
string in the list) doesn't apply now or in any planned Aver evolution.
Tests:
- `test_strips_inline_effects_with_trailing_comment` — the inline
variant; asserts the function body is intact (i.e. NOT eaten by a
runaway skip_until_close).
- `test_strips_multiline_effects_with_trailing_comment_on_close` — same
hazard on the closing line of a multi-line list.
Both new tests pass; full `tests/test_runner.py` suite green
(106 passed locally).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@coderabbitai @jasisz Pushed 918c3e3 addressing the bracket-close finding from the duplicate-comments review. CodeRabbit was right — this isn't really a duplicate, it's the second half of an earlier two-part observation (opener strict-match + closer strict-match) where Szymon's 661a0b7 addressed the opener but not the closer. The Applied the suggested fix verbatim ( Two new tests in
Both pass; full @jasisz — sorry to push to your branch a second time today. Same convergence pattern as this morning's regex fix: CodeRabbit re-raised an unaddressed half of its earlier comment, the suggested diff was small and mechanical, and pushing was faster than waiting. If you'd prefer me to comment-and-let-you-push for future iterations on this PR, just say so — happy to switch modes. |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: Length of output: 3908
With 🧠 Learnings used |
The Aver evaluation harness now strips module-header `effects [...]`
declarations before injecting the test main (the substantive change in
this PR). On Aver 0.12 and earlier the strip is a no-op and Aver
scoring is byte-identical to v0.0.9. Once Aver 0.13 ships and models
start emitting `effects [...]` per the updated docs, the strip will
activate on a measurable fraction of generations and prevent the
underdeclared-effects type error — Aver `run_correct` rates will
diverge between v0.0.9 and v0.0.10 on Aver 0.13+, so the bump records
the methodology boundary in `bench_version` for cross-version analysis.
Vera, Vera spec-from-NL, Python, and TypeScript scoring is unaffected.
Files touched:
- pyproject.toml: 0.0.9 -> 0.0.10 (importlib.metadata source of truth)
- CITATION.cff: version + date-released bumped together
- CHANGELOG.md: new [0.0.10] section with Compatibility note explaining
the no-op-until-Aver-0.13 nuance; link references updated
- ROADMAP.md: prepended a v0.0.10 line above the v0.0.9 summary
Verified: `pip install -e .` followed by
`python -c "from importlib.metadata import version; print(version('vera-bench'))"`
reports 0.0.10, full test suite green at 494 cases.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@jasisz One more touch on the branch: pushed 87a835e bumping the bench version On Aver 0.9.5/0.12 the strip is a no-op and scores are byte-identical, so the bump is preemptive — but it records the methodology boundary in Files touched:
Verified This should be the last push from us — once CI goes green I'll merge and tag v0.0.10. |
aallan
left a comment
There was a problem hiding this comment.
CI is all green. Ready to merge.
Items 2, 3, 4 from @aallan's consolidated review on PR aallan#70. (Item 1 — extracting --parallel N into its own PR — addressed via PR aallan#73.) ### Item 2: README headline section -> single sentence in §Overview Removed the "AILANG: AI-designed language..." headline section (13 lines: the heading, the description paragraphs, the per-mode results table, the "full-circle finding" paragraph). The phrasing included editorial claims about VeraBench's identity that should be a project-owner call, and "added in this fork" wouldn't read correctly post-merge. Replaced the §Overview line about baselines with the form @aallan suggested verbatim: The same problems are also run in Python, TypeScript, [Aver](https://github.com/jasisz/aver), and [AILANG](https://ailang.sunholo.com/) as baselines. AILANG and Aver are zero-training-data languages, providing additional data points alongside Vera for the language-design-vs-training-data thesis. Matches the existing Aver pattern: light-touch mention without results writeups in the README. ### Item 3: Delete AILANG_MAPPING.md and AILANG_RESULTS.md Neither file is load-bearing — no code or tests reference them. Aver landed across PRs aallan#57 / aallan#62 / aallan#65 without AVER_RESULTS.md or AVER_MAPPING.md. Numbers and writeups go in PR descriptions and external content; in-repo docs are reserved for things future maintainers need. ### Item 4: .coderabbit.yaml path_filters Added the two missing AILANG entries to mirror the existing {python, typescript, aver} pattern: - "!**/*.ail" (alongside !**/*.vera, !**/*.av) - "!solutions/ailang/**" (alongside the other solutions/* entries) This stops CodeRabbit from generating speculative findings on .ail solution files in future review passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Aver 0.13 (the upcoming release) introduces a module-level
effects [...]boundary that the type-checker enforces: every function's! [Effect]must be covered, or compilation fails.The bench harness currently asks the LLM for a function only, then strips the LLM's
main()and injects its ownConsole.print(fn(...))main with! [Console.print]. If the LLM declared a narrower module boundary in the original source — including the very commoneffects []for a "pure" module — the injected main violates that boundary andaver runfails with an underdeclared-effects type error before any test case runs.This PR adds
_strip_module_effects(code)to the runner, called right after_strip_aver_main(code). It removes the module-headereffects [...]line (inline or multi-line), so the module reverts to legacy / no-boundary mode and the injected main type-checks.Compatibility:
effects [...]on the module header, so LLMs don't generate it and the strip pass simply doesn't fire.effects []reaches the model.So merging now is safe: it doesn't change current behaviour, and it lets
aver-benchcontinue to work on the day 0.13 lands without a follow-up rush.What changed
vera_bench/runner.py: new_strip_module_effects(code)helper, wired into the test-main injection path.tests/test_runner.py: 4 cases onTestStripModuleEffects(inline[], inline list, multi-line list, no-op).Test plan
pytest tests/test_runner.py— 101 passed locallypytest tests/— 489 passedruff check+ruff format --checkcleanContext
effects [...]as a first-class module featuresafe_modulo) remains the one tier1 fail on haiku-4.5 in both old and new world — Result-wrap vs raw Int return — unrelated to this changeSummary by CodeRabbit
Tests
Chores