Add Aver language support + language-neutral problem descriptions by jasisz · Pull Request #48 · aallan/vera-bench

jasisz · 2026-04-11T21:18:36Z

Summary

Add Aver as a comparison language (50 canonical solutions, runner, baselines, CLI)
Add description_neutral field to all 50 problem JSONs

Language-neutral descriptions

The existing description references Vera builtins (string_upper, array_fold), types (Nat), and concepts (handle[State<Int>], Exn<E>). Models writing Python/TypeScript ignore these from training data, but any zero-training-data language takes them literally.

description_neutral has only the functional requirement. Python, TypeScript, and Aver use it. Vera still uses the original description. Relevant for Go (#21) too.

What is Aver

Aver — statically typed, zero training data, learns from llms.txt in the prompt. Named params instead of De Bruijn indices, verify blocks instead of contracts, Result<T,E> instead of effect handlers.

Usage

vera-bench baselines --language aver
vera-bench run --model claude-sonnet-4-20250514 --language aver \
  --skill-md /path/to/llms.txt

Test plan

50/50 solutions pass aver check + aver run
Baselines: 100% check, 100% run_correct (24 testable)
Existing tests: 339 passed (2 pre-existing failures — vera not on PATH)

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- End-to-end Aver language support: generate, verify, run and attempt fixes from errors; ~50 worked solution examples added across tiers 1–5.
Documentation
- Added neutral alternative descriptions to many problem statements for more consistent prompts.
Chores
- CLI and tooling now detect/report Aver version, include Aver in baseline runs, and normalise Aver outputs for evaluation.

coderabbitai · 2026-04-11T21:19:07Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Added a neutral description field to many problem JSONs, introduced Aver (*.av) solution modules (functions, verify blocks, main), and extended the vera_bench CLI, prompts, runner and baseline_runner to generate, check, run and attempt fixes for Aver artifacts.

Changes

Cohort / File(s)	Summary
Problem metadata `problems/.../VB_T_.json`	Added a top-level `description_neutral` string to ~50 problem JSONs across tiers 1–5. No signatures, contracts, test cases or behavioural fields were changed.
Aver solutions `solutions/aver/*.av`	Added ~50 new Aver modules (one per problem) implementing the reference function(s), `verify` blocks with assertions and `main()` entrypoints. Introduces exported functions/types in many modules.
Baseline runner `vera_bench/baseline_runner.py`	Added Aver baseline support: `.av` discovery, `run_aver_baseline()` invoking `aver check`/`aver run`, env filtering, timeouts, output parsing and Aver↔expected normalization helpers.
Runner / evaluation `vera_bench/runner.py`, `tests/test_runner.py`	Integrated Aver into generation/evaluation: recognise `aver` fences, `_evaluate_aver_code` flow (write `.av`, `aver check`/`aver verify`/per-test `aver run`), `_strip_aver_main`, `_aver_literal`, output-matching and a fix-from-error attempt path. Updated/added unit tests for these helpers.
Prompts `vera_bench/prompts.py`	Added Aver-specific constants and loader (`AVER_LLMS_TXT_URL`, `load_aver_llms_txt`), `AVER_SYSTEM_PROMPT`, `build_aver_prompt`/`build_aver_fix_prompt`, and made Python/TypeScript prompts prefer `description_neutral` when present.
CLI `vera_bench/cli.py`	Extended `--language` choices to include `aver`; added Aver version detection (`aver --version`) and version-based output slugging; adjusted warnings and metrics rendering to include Aver; baselines fast-fail when `aver` missing.
Regex / code-fence handling & parser `vera_bench/runner.py`	Extended fence regex to recognise `aver` code blocks and route them to the new evaluator.
Tests `tests/test_runner.py`	Added tests for `_aver_literal`, `_strip_aver_main`, `load_aver_llms_txt`, and `_neutral_description`; expanded imports to exercise new helpers.
Config / tooling `.coderabbit.yaml`, `CHANGELOG.md`	Excluded `.av` and `solutions/aver/**` from automated review paths; updated CHANGELOG with Aver support and `description_neutral` metadata addition.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Strengthen problem descriptions and postconditions (v0.0.5) #32 — overlaps many of the same problem JSON edits (adds/edits problem descriptions and metadata).
Add --language python for cross-language LLM comparison #11 — touches CLI/runner/prompt logic that this change extends for Aver support.
Implement baseline runner (Phase 3) #8 — earlier baseline/runner additions for other languages; related baseline_runner/runner wiring that this PR expands.

Suggested labels

problems, harness

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately captures the two main objectives: adding Aver language support and introducing language-neutral problem descriptions. It is concise, specific, and directly reflects the primary changes in the changeset.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 16

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

vera_bench/cli.py (1)

97-108: ⚠️ Potential issue | 🟠 Major

Warn or reject --mode for Aver, because it is currently ignored.

run() accepts --language aver --mode spec-from-nl, prints that mode, and passes it on, but vera_bench/runner.py always selects build_aver_prompt() for Aver. That means the requested experiment configuration is silently not honoured.

Suggested fix

-    # Warn on Vera-specific flags in non-Vera/non-Aver mode
-    if language not in ("vera", "aver"):
+    # Warn on flags that are ignored for the selected language
+    if language not in ("vera", "aver"):
         if skill_md is not None:
             console.print(
                 f"[yellow]Warning: --skill-md is ignored "
                 f"with --language {language}[/yellow]"
             )
         if mode != "full-spec":
             console.print(
                 f"[yellow]Warning: --mode is ignored "
                 f"with --language {language}[/yellow]"
             )
+    elif language == "aver" and mode != "full-spec":
+        raise click.BadParameter(
+            "--mode spec-from-nl is not supported for --language aver",
+            param_hint="--mode",
+        )

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@vera_bench/cli.py` around lines 97 - 108, The CLI currently only warns about
Vera-specific flags when language not in ("vera","aver"), but for language ==
"aver" the provided --mode is silently ignored because runner.py always uses
build_aver_prompt(); update the validation in run() so that when language ==
"aver" and mode is provided (and not the single supported mode), the CLI either
prints a clear error and exits or raises a parameter error (e.g.,
Click.BadParameter) instead of accepting and passing it along; locate the
validation block using the variables language and mode in cli.py and reference
runner.py's build_aver_prompt to justify rejecting/ warning the --mode flag for
Aver.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@problems/tier5/VB_T5_005_checked_index.json`:
- Around line 6-7: The neutral description uses "list element" which conflicts
with the array-based signature; update the description_neutral string to use
"array element" (or "array index") instead so it matches the signature public fn
safe_index(`@Array`<Int>, `@Nat` -> `@Int`) and array-focused notes—e.g., change
"Safely access a list element by index" to "Safely access an array element by
index" and ensure the rest of the wording still returns -1 when the index is out
of bounds.

In `@problems/tier5/VB_T5_010_safe_head.json`:
- Around line 6-7: Update the neutral description for the safe_head function to
use "array" instead of "list" so it matches the signature; locate the JSON entry
for the function with "signature": "public fn safe_head(`@Array`<Int> -> `@Int`)"
and change "description_neutral": "Return the first element of a list, or -1 if
the list is empty." to wording that uses "array" (e.g., "Return the first
element of an array, or -1 if the array is empty.").

In `@solutions/aver/VB_T1_007_safe_modulo.av`:
- Around line 4-12: The function safe_modulo currently divides by b
unconditionally which can cause a runtime crash when b == 0; update safe_modulo
to guard that case by checking if b == 0 and returning a deterministic value
(e.g., return a) when true, otherwise compute the existing expression a - (a /
b) * b; adjust the function's docstring to state the defined behavior for b == 0
and ensure the guard appears before any division to avoid divide-by-zero errors.

In `@solutions/aver/VB_T4_001_fibonacci.av`:
- Around line 7-12: Rename the function fib to fibonacci, change its parameter
type from Int to `@Nat`, and add a decreases clause to prove termination;
specifically locate the function declaration named fib and replace it with
fibonacci(n: `@Nat`) and append a decreases n (or equivalent decreases clause
accepted by the language) so the signature and termination proof match the
problem spec and the harness can find --fn fibonacci.

In `@solutions/aver/VB_T4_003_even_odd_mutual_recursion.av`:
- Around line 7-17: The mutual recursion in is_even and is_odd can loop for
negative n because they only check 0; update both functions (is_even, is_odd) to
handle negative inputs up-front by normalizing n (e.g., use its absolute value)
or by adding a base/guard for n < 0 that redirects to the positive case, then
continue the existing match logic so recursion always moves toward the 0 base
case and cannot diverge for negative integers.

In `@solutions/aver/VB_T4_004_power.av`:
- Around line 6-10: The power function's recursion never terminates for negative
exp; update fn power(base: Int, exp: Int) -> Int to handle exp < 0 explicitly
(e.g., add a match arm or a guard before recursion) so it doesn't recurse
forever—either raise/return an error for negative exponents or convert to a
supported behavior (such as returning 0 or computing using reciprocal if you
change the signature), by adding a branch like "exp < 0 -> /* error or
alternative handling */" before the existing cases in power.

In `@solutions/aver/VB_T4_005_sum_to_n.av`:
- Around line 8-10: The function sum_to_n currently accepts an Int which allows
negative inputs and can cause unbounded recursion; change its type signature to
use the Nat annotation (fn sum_to_n(n: `@Nat`) -> `@Nat`) so the compiler enforces
non-negative inputs, update any related annotations or callers to pass/return
`@Nat` values, and ensure the recursive call sum_to_n(n - 1) is type-correct under
`@Nat` (e.g., rely on the type system or adjust the decrement expression) so the
base case 0 -> 0 is always reachable.

In `@solutions/aver/VB_T4_007_count_digits.av`:
- Around line 7-11: The function count_digits currently treats any n < 10 as
base case, so negative inputs (e.g., count_digits(-42)) incorrectly return 1;
update count_digits to explicitly handle negatives by either rejecting them
(raise an error/return an Option/Result) or normalizing the sign before
recursion (apply absolute value to n at start, e.g., use abs(n) or a guard that
converts n to non-negative), and then proceed with the existing base case and
recursive call (ensure references to count_digits remain unchanged).

In `@solutions/aver/VB_T4_010_div_natural.av`:
- Around line 6-10: The div_natural function can enter non-terminating recursion
when b <= 0; add a divisor guard at the top of div_natural to handle
non-positive divisors (b <= 0) before the existing a < b match — e.g., return an
error/exception or a defined sentinel value for invalid divisor — so that
div_natural(a: Int, b: Int) short-circuits for b <= 0 and only performs the
repeated-subtraction branch for b > 0.

In `@solutions/aver/VB_T5_002_greeter_io_boundary.av`:
- Line 8: The Console.print call is using unsupported `{name}` interpolation;
update the call in the greeter to use Aver's string concatenation by replacing
the literal with a concatenated string expression (e.g., build the greeting by
concatenating "Hello, ", the name variable, and "!" in the Console.print
invocation).

In `@solutions/aver/VB_T5_007_exn_negate.av`:
- Around line 1-15: The current file lacks the required exception-effect and
handler flow: define an effect type Exn<E> (or Exn) and implement
require_non_negative(x: Int) that throws an Exn when x < 0; change
safe_non_negative to call require_non_negative inside an exception handler that
catches Exn and returns 0 (otherwise returns the original value), and add the
postcondition ensures(`@Int.result` >= 0) to safe_non_negative to satisfy the
specification; reference the symbols Exn (effect), require_non_negative
(throwing function), safe_non_negative (handler), and ensures (postcondition).

In `@solutions/aver/VB_T5_010_safe_head.av`:
- Around line 11-14: Add a new verification case to ensure safe_head treats
non-empty lists with negative heads correctly: update the test block for
safe_head to include a case like safe_head([-1, 0]) => -1 (or any negative
number as head) so the implementation (safe_head) is not conflating negative
head values with the empty-list sentinel; ensure the new assertion follows the
existing verify format and keeps the empty-list check (safe_head([]) => 0 - 1)
intact.

In `@vera_bench/baseline_runner.py`:
- Around line 377-386: The current early-return in run_aver_baseline that
creates a passing ProblemResult when test_cases is empty must be removed:
instead of returning success immediately for empty test_cases, resolve the Aver
solution file and invoke the Aver checker (the same path used when test_cases
exist) to validate the baseline; update run_aver_baseline (and the branch that
constructs the ProblemResult for empty cases) to run aver check/validation and
set ProblemResult.check_pass and run_correct based on that check outcome (use
the existing ProblemResult fields and the same logging/error handling as other
paths), so Array/String/ADT problems without runtime cases still get their Aver
solution validated.
- Around line 513-547: Add explicit type hints for the expected parameter on
both helper functions: change _normalize_aver_expected(expected) to
_normalize_aver_expected(expected: Any) -> str and change
_aver_output_matches(actual: str, expected) to _aver_output_matches(actual: str,
expected: Any) -> bool; also ensure Any is imported (from typing import Any) at
the top of the module if not already present so the signatures are consistent
with the codebase.

In `@vera_bench/runner.py`:
- Around line 709-753: The retry/fix branch for language "aver" currently
triggers whenever eval_result["check_pass"] is False, which causes wasted model
calls for tooling/host failures; change the condition in the Attempt 2 block to
only proceed when the original failure is a real compiler/diagnostic error
(e.g., inspect eval_result for a compiler error field or specific diagnostic
flag rather than generic check_pass), so before calling build_aver_fix_prompt /
client.complete and before invoking _evaluate_aver_code again, gate the retry on
a concrete diagnostic (for example require eval_result.get("diagnostics") or
eval_result.get("compiler_error") to be present and not a tooling/timeout
error), and if it is a tooling/host failure, append a ProblemResult marking
attempt 2 as not attempted or a tooling failure and return without making
another model call.
- Around line 576-623: The three Aver helper functions lack explicit parameter
type annotations; update _aver_literal(value), _normalize_aver_output(expected),
and _aver_output_matches(actual, expected) to use explicit Union types for the
allowed inputs (e.g. value: bool | int | float | str | list[object] or expected:
bool | int | float | str | list[object]) and keep their return annotations (->
str and -> bool). Add any necessary typing imports (Union or use Python 3.10+
union syntax) and ensure _aver_output_matches declares actual: str, expected:
bool|int|float|str|list[...] so static checkers accept these functions.

---

Outside diff comments:
In `@vera_bench/cli.py`:
- Around line 97-108: The CLI currently only warns about Vera-specific flags
when language not in ("vera","aver"), but for language == "aver" the provided
--mode is silently ignored because runner.py always uses build_aver_prompt();
update the validation in run() so that when language == "aver" and mode is
provided (and not the single supported mode), the CLI either prints a clear
error and exits or raises a parameter error (e.g., Click.BadParameter) instead
of accepting and passing it along; locate the validation block using the
variables language and mode in cli.py and reference runner.py's
build_aver_prompt to justify rejecting/ warning the --mode flag for Aver.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 68345561-d5c3-4d24-80ae-cc4c4d96b676

📥 Commits

Reviewing files that changed from the base of the PR and between 4f012fe and 9941928.

📒 Files selected for processing (104)

problems/tier1/VB_T1_001_absolute_value.json
problems/tier1/VB_T1_002_clamp.json
problems/tier1/VB_T1_003_signum.json
problems/tier1/VB_T1_004_max_of_two.json
problems/tier1/VB_T1_005_min_of_two.json
problems/tier1/VB_T1_006_is_positive.json
problems/tier1/VB_T1_007_safe_modulo.json
problems/tier1/VB_T1_008_distance.json
problems/tier1/VB_T1_009_max_of_three.json
problems/tier1/VB_T1_010_double_or_nothing.json
problems/tier2/VB_T2_001_sum_array.json
problems/tier2/VB_T2_002_filter_positives.json
problems/tier2/VB_T2_003_greeting.json
problems/tier2/VB_T2_004_is_empty_string.json
problems/tier2/VB_T2_005_contains_substring.json
problems/tier2/VB_T2_006_join_strings.json
problems/tier2/VB_T2_007_double_elements.json
problems/tier2/VB_T2_008_count_positives.json
problems/tier2/VB_T2_009_to_upper.json
problems/tier2/VB_T2_010_sum_positives.json
problems/tier3/VB_T3_001_list_length.json
problems/tier3/VB_T3_002_tree_depth.json
problems/tier3/VB_T3_003_expression_evaluator.json
problems/tier3/VB_T3_004_list_sum.json
problems/tier3/VB_T3_005_tree_sum.json
problems/tier3/VB_T3_006_option_unwrap_or.json
problems/tier3/VB_T3_007_list_contains.json
problems/tier3/VB_T3_008_tree_count_leaves.json
problems/tier3/VB_T3_009_list_append.json
problems/tier3/VB_T3_010_list_last.json
problems/tier4/VB_T4_001_fibonacci.json
problems/tier4/VB_T4_002_greatest_common_divisor.json
problems/tier4/VB_T4_003_even_odd_mutual_recursion.json
problems/tier4/VB_T4_004_power.json
problems/tier4/VB_T4_005_sum_to_n.json
problems/tier4/VB_T4_006_list_reverse.json
problems/tier4/VB_T4_007_count_digits.json
problems/tier4/VB_T4_008_multiply.json
problems/tier4/VB_T4_009_list_nth.json
problems/tier4/VB_T4_010_div_natural.json
problems/tier5/VB_T5_001_counter.json
problems/tier5/VB_T5_002_greeter_io_boundary.json
problems/tier5/VB_T5_003_safe_division_exceptions.json
problems/tier5/VB_T5_004_accumulator.json
problems/tier5/VB_T5_005_checked_index.json
problems/tier5/VB_T5_006_state_double.json
problems/tier5/VB_T5_007_exn_negate.json
problems/tier5/VB_T5_008_print_numbers.json
problems/tier5/VB_T5_009_state_max.json
problems/tier5/VB_T5_010_safe_head.json
solutions/aver/VB_T1_001_absolute_value.av
solutions/aver/VB_T1_002_clamp.av
solutions/aver/VB_T1_003_signum.av
solutions/aver/VB_T1_004_max_of_two.av
solutions/aver/VB_T1_005_min_of_two.av
solutions/aver/VB_T1_006_is_positive.av
solutions/aver/VB_T1_007_safe_modulo.av
solutions/aver/VB_T1_008_distance.av
solutions/aver/VB_T1_009_max_of_three.av
solutions/aver/VB_T1_010_double_or_nothing.av
solutions/aver/VB_T2_001_sum_array.av
solutions/aver/VB_T2_002_filter_positives.av
solutions/aver/VB_T2_003_greeting.av
solutions/aver/VB_T2_004_is_empty_string.av
solutions/aver/VB_T2_005_contains_substring.av
solutions/aver/VB_T2_006_join_strings.av
solutions/aver/VB_T2_007_double_elements.av
solutions/aver/VB_T2_008_count_positives.av
solutions/aver/VB_T2_009_to_upper.av
solutions/aver/VB_T2_010_sum_positives.av
solutions/aver/VB_T3_001_list_length.av
solutions/aver/VB_T3_002_tree_depth.av
solutions/aver/VB_T3_003_expression_evaluator.av
solutions/aver/VB_T3_004_list_sum.av
solutions/aver/VB_T3_005_tree_sum.av
solutions/aver/VB_T3_006_option_unwrap_or.av
solutions/aver/VB_T3_007_list_contains.av
solutions/aver/VB_T3_008_tree_count_leaves.av
solutions/aver/VB_T3_009_list_append.av
solutions/aver/VB_T3_010_list_last.av
solutions/aver/VB_T4_001_fibonacci.av
solutions/aver/VB_T4_002_greatest_common_divisor.av
solutions/aver/VB_T4_003_even_odd_mutual_recursion.av
solutions/aver/VB_T4_004_power.av
solutions/aver/VB_T4_005_sum_to_n.av
solutions/aver/VB_T4_006_list_reverse.av
solutions/aver/VB_T4_007_count_digits.av
solutions/aver/VB_T4_008_multiply.av
solutions/aver/VB_T4_009_list_nth.av
solutions/aver/VB_T4_010_div_natural.av
solutions/aver/VB_T5_001_counter.av
solutions/aver/VB_T5_002_greeter_io_boundary.av
solutions/aver/VB_T5_003_safe_division_exceptions.av
solutions/aver/VB_T5_004_accumulator.av
solutions/aver/VB_T5_005_checked_index.av
solutions/aver/VB_T5_006_state_double.av
solutions/aver/VB_T5_007_exn_negate.av
solutions/aver/VB_T5_008_print_numbers.av
solutions/aver/VB_T5_009_state_max.av
solutions/aver/VB_T5_010_safe_head.av
vera_bench/baseline_runner.py
vera_bench/cli.py
vera_bench/prompts.py
vera_bench/runner.py

coderabbitai

Actionable comments posted: 12

♻️ Duplicate comments (16)

solutions/aver/VB_T5_010_safe_head.av (1)

11-14: 🧹 Nitpick | 🔵 Trivial

Add one verify case for a non-empty negative head.

A negative-head case would tighten confidence that non-empty lists are never mistaken for the empty-list sentinel path.

Suggested addition

 verify safe_head
     safe_head([10, 20, 30]) => 10
     safe_head([42])         => 42
+    safe_head([0 - 5, 99])  => 0 - 5
     safe_head([])           => 0 - 1

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T5_010_safe_head.av` around lines 11 - 14, Add a verify
case to ensure safe_head returns a negative head value for a non-empty list (to
distinguish from the empty-list sentinel path). Update the test block for
safe_head by adding a line like safe_head([-5, 1, 2]) => -5 (or any negative
integer) alongside the existing cases so the function safe_head is exercised
with a negative non-empty head without modifying existing tests or behavior.

solutions/aver/VB_T4_005_sum_to_n.av (1)

6-10: ⚠️ Potential issue | 🔴 Critical

Prevent non-terminating recursion for negative inputs.

Line 10 recurses indefinitely when n < 0 because the only base case is 0. Add a guard before recursion so negative values terminate deterministically.

Proposed fix

 fn sum_to_n(n: Int) -> Int
     ? "Returns the sum of integers from 0 to n."
-    match n
-        0 -> 0
-        _ -> n + sum_to_n(n - 1)
+    match n <= 0
+        true  -> 0
+        false -> n + sum_to_n(n - 1)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T4_005_sum_to_n.av` around lines 6 - 10, The function
sum_to_n currently only handles 0 and positive n and will recurse forever for
negative inputs; update sum_to_n's match to include a branch for negative n
(e.g., add a pattern like n when n < 0 -> 0 or otherwise return/raise a
deterministic result) so that the negative case short-circuits before the
recursive case and prevents non-terminating recursion in the match expression.

solutions/aver/VB_T1_007_safe_modulo.av (1)

4-6: ⚠️ Potential issue | 🔴 Critical

Guard b == 0 before division.

Line 6 performs division unconditionally; this can crash at runtime when b is zero.

Proposed fix

 fn safe_modulo(a: Int, b: Int) -> Int
-    ? "Returns a modulo b. Assumes b != 0."
-    a - (a / b) * b
+    ? "Returns a modulo b, or 0 when b == 0."
+    match b == 0
+        true  -> 0
+        false -> a - (a / b) * b

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T1_007_safe_modulo.av` around lines 4 - 6, The function
safe_modulo currently performs a division unguarded (the expression a - (a / b)
* b), which will crash when b == 0; modify safe_modulo to check b == 0 at the
start and handle it explicitly—either raise a descriptive runtime error (e.g.,
"division by zero") or return a defined sentinel value—then only perform the
existing computation a - (a / b) * b when b != 0.

solutions/aver/VB_T4_004_power.av (1)

6-10: ⚠️ Potential issue | 🔴 Critical

Handle negative exponents to avoid runaway recursion.

Line 10 recurses forever for exp < 0. Add an explicit negative-exponent branch before the recursive path.

Proposed fix

 fn power(base: Int, exp: Int) -> Int
     ? "Returns base^exp using recursive repeated multiplication."
-    match exp
-        0 -> 1
-        _ -> base * power(base, exp - 1)
+    match exp < 0
+        true  -> 0
+        false ->
+            match exp
+                0 -> 1
+                _ -> base * power(base, exp - 1)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T4_004_power.av` around lines 6 - 10, The power function
currently recurses forever for exp < 0; update the match on exp inside fn power
to add an explicit negative-exponent branch (e.g., a case matching exp < 0)
before the recursive "_" branch so negative exponents are handled without
recursion; in that branch return or signal the appropriate behavior per project
conventions (throw/raise an error, return a sentinel, or convert to a
Result/Option) instead of falling through to base * power(...).

solutions/aver/VB_T4_007_count_digits.av (1)

7-11: ⚠️ Potential issue | 🟡 Minor

Guard negative inputs to match the declared domain.

At Line 9, any negative number satisfies n < 10, so invalid inputs like -42 collapse to 1. Please reject negatives or normalise sign before recursion so behaviour is explicit.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T4_007_count_digits.av` around lines 7 - 11, The function
count_digits currently treats any n < 10 (including negatives) as a single
digit; update count_digits to enforce the declared domain by either rejecting
negative inputs (e.g., raise/error when n < 0) or normalizing the sign before
processing (e.g., use abs(n) at the start), then use the absolute value for the
recursion (1 + count_digits(abs(n) / 10)) so negative values no longer collapse
to 1; ensure the guard/branch that checks n < 10 operates on the non-negative
value or add an explicit negative check that returns/raises an error.

solutions/aver/VB_T4_010_div_natural.av (1)

6-10: ⚠️ Potential issue | 🔴 Critical

Add a non-positive divisor guard to prevent non-terminating recursion.

Line [6]–Line [10] still permit infinite recursion when b <= 0, which can hang aver run/baseline execution.

Suggested patch

 fn div_natural(a: Int, b: Int) -> Int
     ? "Returns a divided by b using repeated subtraction."
-    match a < b
-        true  -> 0
-        false -> 1 + div_natural(a - b, b)
+    match b <= 0
+        true  -> 0
+        false ->
+            match a < b
+                true  -> 0
+                false -> 1 + div_natural(a - b, b)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T4_010_div_natural.av` around lines 6 - 10, The function
div_natural has no guard for non-positive divisors and will recurse forever when
b <= 0; update div_natural to check for b <= 0 at the start (before the match)
and handle it explicitly (e.g., return an error/raise an exception/abort
execution or otherwise short-circuit) so the recursion only runs for b > 0;
modify the div_natural definition to include that precondition check and ensure
the existing match on a < b remains for the normal positive-divisor case.

problems/tier5/VB_T5_005_checked_index.json (1)

6-6: ⚠️ Potential issue | 🟡 Minor

Use “array” terminology to match the signature.

Line 6 says “list element”, but Line 7 defines an array input. Please align wording to avoid prompt ambiguity.

Suggested fix

-  "description_neutral": "Safely access a list element by index. Return -1 if the index is out of bounds.",
+  "description_neutral": "Safely access an array element by index. Return -1 if the index is out of bounds.",

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@problems/tier5/VB_T5_005_checked_index.json` at line 6, Update the
"description_neutral" string to use "array" instead of "list" so it matches the
function signature; specifically edit the JSON key "description_neutral" in
VB_T5_005_checked_index.json to read something like "Safely access an array
element by index. Return -1 if the index is out of bounds." to align with the
array input defined elsewhere in the spec.

problems/tier5/VB_T5_010_safe_head.json (1)

6-7: ⚠️ Potential issue | 🟡 Minor

Use “array” in the neutral description to match the signature.

Line 6 says “list”, but Line 7 defines @Array<Int>. This can mislead cross-language prompt generation and should be aligned.
Suggested fix
-  "description_neutral": "Return the first element of a list, or -1 if the list is empty.",
+  "description_neutral": "Return the first element of an array, or -1 if the array is empty.",
As per coding guidelines: “problems/**/*.json: Problem definitions. Review for schema completeness, consistency between contracts in JSON and the canonical solution, correct test case expected outputs, and appropriate tier assignment.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@problems/tier5/VB_T5_010_safe_head.json` around lines 6 - 7, Update the
neutral description to use the term "array" to match the function signature:
change the value of "description_neutral" so it reads "Return the first element
of an array, or -1 if the array is empty." to align with the signature "public
fn safe_head(`@Array`<Int> -> `@Int`)" and maintain consistency between the problem
text and the contract.

solutions/aver/VB_T4_001_fibonacci.av (1)

5-12: ⚠️ Potential issue | 🔴 Critical

Entry-point/signature drift will break harness invocation if spec expects fibonacci.

Line 7 defines fib(n: Int) -> Int; if the problem spec still declares entry_point: "fibonacci", the generated test wrapper will call a missing symbol.

#!/bin/bash
set -euo pipefail

python - <<'PY'
import json, pathlib, re

spec = next(pathlib.Path("problems").rglob("VB_T4_001_fibonacci.json"))
sol = next(pathlib.Path("solutions/aver").rglob("VB_T4_001_fibonacci.av"))

spec_j = json.loads(spec.read_text())
text = sol.read_text()

m = re.search(r'^\s*fn\s+([A-Za-z_]\w*)\s*\((.*?)\)\s*->\s*([A-Za-z_@<>]+)', text, re.M)
print("spec.entry_point =", spec_j.get("entry_point"))
print("spec.signature   =", spec_j.get("signature"))
print("solution.fn      =", m.group(1) if m else None)
print("solution.params  =", m.group(2) if m else None)
print("solution.return  =", m.group(3) if m else None)
PY

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T4_001_fibonacci.av` around lines 5 - 12, The solution
defines fn fib(...) but the spec/harness expects the entry-point named
"fibonacci" (and its signature); update the solution to match the spec by
renaming the function from fib to fibonacci and ensuring the parameter and
return types/signature match the spec's "signature" (or alternatively change the
spec's entry_point to "fib" if intended) so the harness calls the correct
symbol; specifically locate the fn fib declaration and rename it to fn fibonacci
with the same parameter list and return type.

solutions/aver/VB_T5_002_greeter_io_boundary.av (1)

8-8: ⚠️ Potential issue | 🔴 Critical

Replace unsupported interpolation form on Line 8.

"Hello, {name}!" should be assembled via concatenation for reliable Aver compatibility.

Suggested fix

-    Console.print("Hello, {name}!")
+    Console.print("Hello, " + name + "!")

What is the official Aver syntax for inserting a variable into a string literal? Is "Hello, {name}!" valid, or should it be written as "Hello, " + name + "!"?

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T5_002_greeter_io_boundary.av` at line 8, The string uses
unsupported interpolation; change the Console.print call in Console.print(...)
to build the greeting with concatenation instead of "{name}" interpolation —
e.g., assemble "Hello, " + name + "!" (update the Console.print invocation that
currently passes "Hello, {name}!").

solutions/aver/VB_T5_007_exn_negate.av (1)

5-10: ⚠️ Potential issue | 🔴 Critical

Tier 5 exception-handler semantics are bypassed.

Line 5-10 implements direct clamping only; it does not exercise the Exn throw/handle flow this Tier 5 task is meant to validate, so benchmark signal for effect handling is lost.

Based on learnings: Tier 5 problems are intentionally about correctly wiring effect handlers (State, Exn, IO), not just functional end values.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T5_007_exn_negate.av` around lines 5 - 10, The current
safe_non_negative function bypasses Tier 5 effect handling by clamping directly;
change it so that safe_non_negative throws an Exn when x < 0 (use the
Exn.throw/raise symbol available in the language runtime) and then handle that
exception with an Exn.handle (or equivalent try/handle) so the handler returns 0
for the negative-case exception and returns x otherwise; keep the function name
safe_non_negative and ensure the thrown exception is specific enough to be
caught by your handler (e.g., a tag or message like "NegativeInput").

solutions/aver/VB_T4_003_even_odd_mutual_recursion.av (1)

9-17: ⚠️ Potential issue | 🔴 Critical

Handle negative inputs before recursing.

is_even(-1) and is_odd(-1) never reach the 0 base case here; both functions keep stepping further negative, so this canonical solution can hang on valid Int inputs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T4_003_even_odd_mutual_recursion.av` around lines 9 - 17,
The mutual recursion in is_even and is_odd lacks handling for negative n,
causing infinite recursion for negative inputs; update both functions (is_even
and is_odd) to handle n < 0 before recursing (for example, normalize by using
the absolute value or redirect negative inputs to the same function with -n) so
that negative integers reach the base case instead of recursing further into
negatives.

vera_bench/runner.py (2)

576-623: ⚠️ Potential issue | 🟡 Minor

Type the new Aver helper interfaces.

These helpers sit on the evaluation path but still accept untyped parameters. Please declare the supported input union explicitly so the new harness code stays aligned with the repo’s Python typing requirement. As per coding guidelines, "Use Python 3.11+ with type hints everywhere in the VeraBench harness".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@vera_bench/runner.py` around lines 576 - 623, Declare a typed union for
supported Aver inputs (e.g., AverValue = Union[bool, int, float, str,
list['AverValue']]) and add necessary typing imports (from typing import Union,
List). Then annotate the three helpers: change def _aver_literal(value) -> str
to def _aver_literal(value: AverValue) -> str, def
_normalize_aver_output(expected) -> str to def _normalize_aver_output(expected:
AverValue) -> str, and def _aver_output_matches(actual: str, expected) -> bool
to def _aver_output_matches(actual: str, expected: AverValue) -> bool, keeping
function bodies unchanged so list recursion uses the AverValue alias.

709-713: ⚠️ Potential issue | 🟠 Major

Skip attempt 2 for tooling failures.

This still retries whenever check_pass is false, including aver not found on PATH and aver check timed out. Those are not fixable by the model, so attempt 2 just burns tokens and records a misleading second attempt.

Proposed fix

-    if language == "aver" and not eval_result["check_pass"] and max_fix_attempts > 0:
+    retryable_aver_failure = (
+        not eval_result["check_pass"]
+        and eval_result.get("error_message")
+        not in {"aver not found on PATH", "aver check timed out"}
+    )
+    if language == "aver" and retryable_aver_failure and max_fix_attempts > 0:

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@vera_bench/runner.py` around lines 709 - 713, The current Aver fix retry
triggers whenever eval_result["check_pass"] is false, which also retries on
tooling failures (e.g., "aver not found on PATH", "aver check timed out");
update the conditional around build_aver_fix_prompt in runner.py to skip attempt
2 for tooling failures by adding a guard that checks
eval_result.get("error_message", "") (or a small helper like
is_tooling_failure(error_msg)) and only calls build_aver_fix_prompt when the
error is not a tooling failure and max_fix_attempts > 0; reference the existing
symbols language, eval_result["check_pass"], eval_result.get("error_message",
""), max_fix_attempts, and build_aver_fix_prompt to locate and implement the
change.

vera_bench/baseline_runner.py (2)

513-547: ⚠️ Potential issue | 🟡 Minor

Add type hints to the Aver output helpers.

The new helper signatures still leave expected untyped, which breaks the harness-wide typing rule and makes these comparison contracts harder to follow. Annotate expected with Any or a precise union in both helpers.

As per coding guidelines, "Use Python 3.11+ with type hints everywhere in the VeraBench harness".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vera_bench/baseline_runner.py` around lines 513 - 547, Annotate the untyped
expected parameter in both helper functions: change
_normalize_aver_expected(expected) to _normalize_aver_expected(expected: Any) ->
str and change _aver_output_matches(actual: str, expected) -> bool to
_aver_output_matches(actual: str, expected: Any) -> bool; add the required
import "from typing import Any" at the top of the module. This keeps the
existing return types and behavior but satisfies the harness typing rule; you
may alternatively use a precise union type for expected (e.g.,
Union[bool,int,float,str,list[Any]]) if you prefer stricter typing.
377-386: ⚠️ Potential issue | 🟠 Major

Do not treat empty test_cases as an automatic Aver pass.

Many VeraBench problems intentionally leave test_cases empty, so this short-circuit means a missing or broken Aver baseline is never even checked. Please keep those problems in the Aver baseline run and still execute aver check, with run_correct=None when there is nothing to execute.

Based on learnings, "test_cases should be left empty for problems whose entry point takes Array or String parameters (or returns them)" and ADT-input problems are likewise validated without runtime cases.

Also applies to: 571-583
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vera_bench/baseline_runner.py` around lines 377 - 386, The current
early-return when test_cases is empty incorrectly marks Aver as passed; remove
that short-circuit so we still perform the Aver baseline check. Specifically, in
the block that checks if not test_cases (the code that returns a ProblemResult
with check_pass=True), stop returning early and instead call the Aver check
routine (the same code path used when test_cases exist), set run_correct=None
when there are no runtime cases to execute, and populate
ProblemResult.check_pass from the Aver check result; apply the same change to
the analogous block around the other occurrence mentioned (the block at lines
~571-583).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@problems/tier3/VB_T3_008_tree_count_leaves.json`:
- Line 6: Update the "description_neutral" string to explicitly state the
non-empty-tree lower bound (that the count is at least 1) so it matches the
function signature, contracts.ensures, and test_cases; locate the JSON field
"description_neutral" in this file and add phrasing like "the result is at least
1" or "returns an integer >= 1" to align the neutral description with the
verifier-facing contract guarantees.

In `@solutions/aver/VB_T2_005_contains_substring.av`:
- Around line 5-7: The contains_substring function currently calls a
non-existent String.contains; replace that call by implementing a manual
substring check inside contains_substring: compute haystack and needle lengths
(using String.len), return false if needle is longer, then iterate over possible
start indices in haystack and compare each haystack slice of needle length to
needle (using the language's slice/substr operation or character-by-character
equality), returning true on a match and false if none found; ensure you
reference contains_substring and avoid any use of String.contains.

In `@solutions/aver/VB_T2_010_sum_positives.av`:
- Around line 19-23: The main function's printed testcases don't include the
zero-containing input used by the verify harness, causing mismatch; update fn
main to add a Console.print call that invokes sum_positives with the same
zero-including list used in the verify cases so manual runs mirror automated
tests (i.e., add a Console.print(sum_positives(<the verify zero-containing
list>)) alongside the existing Console.print calls for sum_positives([]),
sum_positives([1, -2, 3, -4, 5]) and sum_positives([-1, -2, -3])).

In `@solutions/aver/VB_T4_002_greatest_common_divisor.av`:
- Around line 13-17: Add an explicit test for the gcd(0, 0) case in the verify
block to pin the convention (use gcd(0, 0) => 0) so the expected behavior is
unambiguous; update the verify section that currently lists gcd(12, 8), gcd(7,
0), gcd(0, 5), gcd(100, 75) to include gcd(0, 0) => 0 and ensure the test
harness that uses verify/gcd continues to pass.

In `@solutions/aver/VB_T4_009_list_nth.av`:
- Around line 10-12: The function list_nth currently falls into the [h, ..t]
case and recursively traverses the list even when index is negative; add an
early guard that checks the index (e.g., if index < 0) and immediately returns
-1 before the pattern match so negative indices short-circuit and do not trigger
deep recursion in list_nth.

In `@solutions/aver/VB_T5_001_counter.av`:
- Around line 5-18: The function count allows non-terminating recursion when n >
target; update the base case in count to stop when n >= target (i.e., match n >=
target and return n) so recursion only occurs for n < target, and add a
regression verify case such as count(5, 3) => 5 (or any n > target) under verify
count to ensure the inclusive base case works.

In `@solutions/aver/VB_T5_004_accumulator.av`:
- Around line 5-13: The implementation bypasses the Tier-5 state effect: replace
the pure recursive accumulator (sum_loop) usage in sum_with_state with explicit
State-effect operations so the benchmark exercises effect wiring; specifically,
keep sum_loop but change it to perform its accumulation by reading and writing a
State variable (e.g., use State.get/State.set or the project's state-effect
primitives) instead of passing acc as an argument, and make sum_with_state
initialize the state (0), invoke the effectful loop (starting from 1 to n), and
finally return the state value; reference sum_with_state and sum_loop when
locating where to swap the pure-accumulator for state.get/state.set calls and to
add the state initialization and final read.

In `@solutions/aver/VB_T5_008_print_numbers.av`:
- Around line 14-16: The trailing Console.print("") in the match branch causes
an extra blank line after output; remove that extra terminal print so that when
match in the file (the match handling i <= n and the false branch) finishes it
does not emit a blank line. Locate the match expression that calls
print_and_continue(i, n) and replace/eliminate the false -> Console.print("")
branch so the function returns without printing an extra newline; ensure
print_and_continue remains responsible for all required output.

In `@solutions/aver/VB_T5_009_state_max.av`:
- Around line 33-36: The verify block only tests positive inputs; add boundary
tests for the non-positive branch by adding verify cases for state_max(0) and
state_max(-1) in the verify state_max block, using the expected results that
correspond to the n < 1 behaviour defined at line 20 so the n<1 branch is
exercised and protected from regressions.
- Around line 24-27: The test suite only exercises the true branch of max_step;
add a verify case that exercises the false branch where i <= current_max so the
accumulator remains unchanged. Update the verify block to include a line calling
max_step with i less than or equal to current_max (for example use max_step(2,
5, 1) => 5 or similar) so the code path in the max_step function that skips
updating current_max is covered.

In `@vera_bench/cli.py`:
- Around line 162-177: The code currently swallows
FileNotFoundError/_sp.TimeoutExpired when calling ["aver","--version"] and falls
back to aver_ver="unknown"; change this to fail fast by importing click and
raising click.ClickException when the subprocess raises FileNotFoundError or
_sp.TimeoutExpired (include the original exception message) and also treat a
non-zero returncode from _av_proc as a failure that raises click.ClickException
with stdout/stderr content; update the block around aver_ver and the subprocess
call to raise instead of assigning "unknown" so the CLI aborts before running
the full benchmark.

In `@vera_bench/runner.py`:
- Around line 469-479: The subprocess invocation that runs Aver verification is
using the wrong CLI form; update the subprocess.run call that creates
verify_proc (the list currently ["aver", "run", "--verify", str(code_path)]) to
call the documented CLI ["aver", "verify", str(code_path")] while preserving
capture_output, text, timeout, check, env (run_env) and the assignment to
result["verify_pass"] so the verify returncode is evaluated correctly.

---

Duplicate comments:
In `@problems/tier5/VB_T5_005_checked_index.json`:
- Line 6: Update the "description_neutral" string to use "array" instead of
"list" so it matches the function signature; specifically edit the JSON key
"description_neutral" in VB_T5_005_checked_index.json to read something like
"Safely access an array element by index. Return -1 if the index is out of
bounds." to align with the array input defined elsewhere in the spec.

In `@problems/tier5/VB_T5_010_safe_head.json`:
- Around line 6-7: Update the neutral description to use the term "array" to
match the function signature: change the value of "description_neutral" so it
reads "Return the first element of an array, or -1 if the array is empty." to
align with the signature "public fn safe_head(`@Array`<Int> -> `@Int`)" and maintain
consistency between the problem text and the contract.

In `@solutions/aver/VB_T1_007_safe_modulo.av`:
- Around line 4-6: The function safe_modulo currently performs a division
unguarded (the expression a - (a / b) * b), which will crash when b == 0; modify
safe_modulo to check b == 0 at the start and handle it explicitly—either raise a
descriptive runtime error (e.g., "division by zero") or return a defined
sentinel value—then only perform the existing computation a - (a / b) * b when b
!= 0.

In `@solutions/aver/VB_T4_001_fibonacci.av`:
- Around line 5-12: The solution defines fn fib(...) but the spec/harness
expects the entry-point named "fibonacci" (and its signature); update the
solution to match the spec by renaming the function from fib to fibonacci and
ensuring the parameter and return types/signature match the spec's "signature"
(or alternatively change the spec's entry_point to "fib" if intended) so the
harness calls the correct symbol; specifically locate the fn fib declaration and
rename it to fn fibonacci with the same parameter list and return type.

In `@solutions/aver/VB_T4_003_even_odd_mutual_recursion.av`:
- Around line 9-17: The mutual recursion in is_even and is_odd lacks handling
for negative n, causing infinite recursion for negative inputs; update both
functions (is_even and is_odd) to handle n < 0 before recursing (for example,
normalize by using the absolute value or redirect negative inputs to the same
function with -n) so that negative integers reach the base case instead of
recursing further into negatives.

In `@solutions/aver/VB_T4_004_power.av`:
- Around line 6-10: The power function currently recurses forever for exp < 0;
update the match on exp inside fn power to add an explicit negative-exponent
branch (e.g., a case matching exp < 0) before the recursive "_" branch so
negative exponents are handled without recursion; in that branch return or
signal the appropriate behavior per project conventions (throw/raise an error,
return a sentinel, or convert to a Result/Option) instead of falling through to
base * power(...).

In `@solutions/aver/VB_T4_005_sum_to_n.av`:
- Around line 6-10: The function sum_to_n currently only handles 0 and positive
n and will recurse forever for negative inputs; update sum_to_n's match to
include a branch for negative n (e.g., add a pattern like n when n < 0 -> 0 or
otherwise return/raise a deterministic result) so that the negative case
short-circuits before the recursive case and prevents non-terminating recursion
in the match expression.

In `@solutions/aver/VB_T4_007_count_digits.av`:
- Around line 7-11: The function count_digits currently treats any n < 10
(including negatives) as a single digit; update count_digits to enforce the
declared domain by either rejecting negative inputs (e.g., raise/error when n <
0) or normalizing the sign before processing (e.g., use abs(n) at the start),
then use the absolute value for the recursion (1 + count_digits(abs(n) / 10)) so
negative values no longer collapse to 1; ensure the guard/branch that checks n <
10 operates on the non-negative value or add an explicit negative check that
returns/raises an error.

In `@solutions/aver/VB_T4_010_div_natural.av`:
- Around line 6-10: The function div_natural has no guard for non-positive
divisors and will recurse forever when b <= 0; update div_natural to check for b
<= 0 at the start (before the match) and handle it explicitly (e.g., return an
error/raise an exception/abort execution or otherwise short-circuit) so the
recursion only runs for b > 0; modify the div_natural definition to include that
precondition check and ensure the existing match on a < b remains for the normal
positive-divisor case.

In `@solutions/aver/VB_T5_002_greeter_io_boundary.av`:
- Line 8: The string uses unsupported interpolation; change the Console.print
call in Console.print(...) to build the greeting with concatenation instead of
"{name}" interpolation — e.g., assemble "Hello, " + name + "!" (update the
Console.print invocation that currently passes "Hello, {name}!").

In `@solutions/aver/VB_T5_007_exn_negate.av`:
- Around line 5-10: The current safe_non_negative function bypasses Tier 5
effect handling by clamping directly; change it so that safe_non_negative throws
an Exn when x < 0 (use the Exn.throw/raise symbol available in the language
runtime) and then handle that exception with an Exn.handle (or equivalent
try/handle) so the handler returns 0 for the negative-case exception and returns
x otherwise; keep the function name safe_non_negative and ensure the thrown
exception is specific enough to be caught by your handler (e.g., a tag or
message like "NegativeInput").

In `@solutions/aver/VB_T5_010_safe_head.av`:
- Around line 11-14: Add a verify case to ensure safe_head returns a negative
head value for a non-empty list (to distinguish from the empty-list sentinel
path). Update the test block for safe_head by adding a line like safe_head([-5,
1, 2]) => -5 (or any negative integer) alongside the existing cases so the
function safe_head is exercised with a negative non-empty head without modifying
existing tests or behavior.

In `@vera_bench/baseline_runner.py`:
- Around line 513-547: Annotate the untyped expected parameter in both helper
functions: change _normalize_aver_expected(expected) to
_normalize_aver_expected(expected: Any) -> str and change
_aver_output_matches(actual: str, expected) -> bool to
_aver_output_matches(actual: str, expected: Any) -> bool; add the required
import "from typing import Any" at the top of the module. This keeps the
existing return types and behavior but satisfies the harness typing rule; you
may alternatively use a precise union type for expected (e.g.,
Union[bool,int,float,str,list[Any]]) if you prefer stricter typing.
- Around line 377-386: The current early-return when test_cases is empty
incorrectly marks Aver as passed; remove that short-circuit so we still perform
the Aver baseline check. Specifically, in the block that checks if not
test_cases (the code that returns a ProblemResult with check_pass=True), stop
returning early and instead call the Aver check routine (the same code path used
when test_cases exist), set run_correct=None when there are no runtime cases to
execute, and populate ProblemResult.check_pass from the Aver check result; apply
the same change to the analogous block around the other occurrence mentioned
(the block at lines ~571-583).

In `@vera_bench/runner.py`:
- Around line 576-623: Declare a typed union for supported Aver inputs (e.g.,
AverValue = Union[bool, int, float, str, list['AverValue']]) and add necessary
typing imports (from typing import Union, List). Then annotate the three
helpers: change def _aver_literal(value) -> str to def _aver_literal(value:
AverValue) -> str, def _normalize_aver_output(expected) -> str to def
_normalize_aver_output(expected: AverValue) -> str, and def
_aver_output_matches(actual: str, expected) -> bool to def
_aver_output_matches(actual: str, expected: AverValue) -> bool, keeping function
bodies unchanged so list recursion uses the AverValue alias.
- Around line 709-713: The current Aver fix retry triggers whenever
eval_result["check_pass"] is false, which also retries on tooling failures
(e.g., "aver not found on PATH", "aver check timed out"); update the conditional
around build_aver_fix_prompt in runner.py to skip attempt 2 for tooling failures
by adding a guard that checks eval_result.get("error_message", "") (or a small
helper like is_tooling_failure(error_msg)) and only calls build_aver_fix_prompt
when the error is not a tooling failure and max_fix_attempts > 0; reference the
existing symbols language, eval_result["check_pass"],
eval_result.get("error_message", ""), max_fix_attempts, and
build_aver_fix_prompt to locate and implement the change.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0054a5cd-1c45-4f92-b30e-c85aa4417531

📥 Commits

Reviewing files that changed from the base of the PR and between 9941928 and 366023b.

📒 Files selected for processing (104)

problems/tier1/VB_T1_001_absolute_value.json
problems/tier1/VB_T1_002_clamp.json
problems/tier1/VB_T1_003_signum.json
problems/tier1/VB_T1_004_max_of_two.json
problems/tier1/VB_T1_005_min_of_two.json
problems/tier1/VB_T1_006_is_positive.json
problems/tier1/VB_T1_007_safe_modulo.json
problems/tier1/VB_T1_008_distance.json
problems/tier1/VB_T1_009_max_of_three.json
problems/tier1/VB_T1_010_double_or_nothing.json
problems/tier2/VB_T2_001_sum_array.json
problems/tier2/VB_T2_002_filter_positives.json
problems/tier2/VB_T2_003_greeting.json
problems/tier2/VB_T2_004_is_empty_string.json
problems/tier2/VB_T2_005_contains_substring.json
problems/tier2/VB_T2_006_join_strings.json
problems/tier2/VB_T2_007_double_elements.json
problems/tier2/VB_T2_008_count_positives.json
problems/tier2/VB_T2_009_to_upper.json
problems/tier2/VB_T2_010_sum_positives.json
problems/tier3/VB_T3_001_list_length.json
problems/tier3/VB_T3_002_tree_depth.json
problems/tier3/VB_T3_003_expression_evaluator.json
problems/tier3/VB_T3_004_list_sum.json
problems/tier3/VB_T3_005_tree_sum.json
problems/tier3/VB_T3_006_option_unwrap_or.json
problems/tier3/VB_T3_007_list_contains.json
problems/tier3/VB_T3_008_tree_count_leaves.json
problems/tier3/VB_T3_009_list_append.json
problems/tier3/VB_T3_010_list_last.json
problems/tier4/VB_T4_001_fibonacci.json
problems/tier4/VB_T4_002_greatest_common_divisor.json
problems/tier4/VB_T4_003_even_odd_mutual_recursion.json
problems/tier4/VB_T4_004_power.json
problems/tier4/VB_T4_005_sum_to_n.json
problems/tier4/VB_T4_006_list_reverse.json
problems/tier4/VB_T4_007_count_digits.json
problems/tier4/VB_T4_008_multiply.json
problems/tier4/VB_T4_009_list_nth.json
problems/tier4/VB_T4_010_div_natural.json
problems/tier5/VB_T5_001_counter.json
problems/tier5/VB_T5_002_greeter_io_boundary.json
problems/tier5/VB_T5_003_safe_division_exceptions.json
problems/tier5/VB_T5_004_accumulator.json
problems/tier5/VB_T5_005_checked_index.json
problems/tier5/VB_T5_006_state_double.json
problems/tier5/VB_T5_007_exn_negate.json
problems/tier5/VB_T5_008_print_numbers.json
problems/tier5/VB_T5_009_state_max.json
problems/tier5/VB_T5_010_safe_head.json
solutions/aver/VB_T1_001_absolute_value.av
solutions/aver/VB_T1_002_clamp.av
solutions/aver/VB_T1_003_signum.av
solutions/aver/VB_T1_004_max_of_two.av
solutions/aver/VB_T1_005_min_of_two.av
solutions/aver/VB_T1_006_is_positive.av
solutions/aver/VB_T1_007_safe_modulo.av
solutions/aver/VB_T1_008_distance.av
solutions/aver/VB_T1_009_max_of_three.av
solutions/aver/VB_T1_010_double_or_nothing.av
solutions/aver/VB_T2_001_sum_array.av
solutions/aver/VB_T2_002_filter_positives.av
solutions/aver/VB_T2_003_greeting.av
solutions/aver/VB_T2_004_is_empty_string.av
solutions/aver/VB_T2_005_contains_substring.av
solutions/aver/VB_T2_006_join_strings.av
solutions/aver/VB_T2_007_double_elements.av
solutions/aver/VB_T2_008_count_positives.av
solutions/aver/VB_T2_009_to_upper.av
solutions/aver/VB_T2_010_sum_positives.av
solutions/aver/VB_T3_001_list_length.av
solutions/aver/VB_T3_002_tree_depth.av
solutions/aver/VB_T3_003_expression_evaluator.av
solutions/aver/VB_T3_004_list_sum.av
solutions/aver/VB_T3_005_tree_sum.av
solutions/aver/VB_T3_006_option_unwrap_or.av
solutions/aver/VB_T3_007_list_contains.av
solutions/aver/VB_T3_008_tree_count_leaves.av
solutions/aver/VB_T3_009_list_append.av
solutions/aver/VB_T3_010_list_last.av
solutions/aver/VB_T4_001_fibonacci.av
solutions/aver/VB_T4_002_greatest_common_divisor.av
solutions/aver/VB_T4_003_even_odd_mutual_recursion.av
solutions/aver/VB_T4_004_power.av
solutions/aver/VB_T4_005_sum_to_n.av
solutions/aver/VB_T4_006_list_reverse.av
solutions/aver/VB_T4_007_count_digits.av
solutions/aver/VB_T4_008_multiply.av
solutions/aver/VB_T4_009_list_nth.av
solutions/aver/VB_T4_010_div_natural.av
solutions/aver/VB_T5_001_counter.av
solutions/aver/VB_T5_002_greeter_io_boundary.av
solutions/aver/VB_T5_003_safe_division_exceptions.av
solutions/aver/VB_T5_004_accumulator.av
solutions/aver/VB_T5_005_checked_index.av
solutions/aver/VB_T5_006_state_double.av
solutions/aver/VB_T5_007_exn_negate.av
solutions/aver/VB_T5_008_print_numbers.av
solutions/aver/VB_T5_009_state_max.av
solutions/aver/VB_T5_010_safe_head.av
vera_bench/baseline_runner.py
vera_bench/cli.py
vera_bench/prompts.py
vera_bench/runner.py

coderabbitai

Actionable comments posted: 4

♻️ Duplicate comments (16)

solutions/aver/VB_T5_004_accumulator.av (1)

5-13: ⚠️ Potential issue | 🟠 Major

sum_with_state currently bypasses the Tier-5 State-effect path.

Line 11 advertises state-oriented behaviour, but Lines 5-13 are a pure recursive accumulator with no effect operations/handler wiring. This shifts the task away from the Tier-5 benchmark objective.

Please verify with this read-only check (expected: this file should contain State wiring, but currently likely does not):
#!/bin/bash
set -euo pipefail

# 1) Inspect State-effect usage across Tier-5 Aver solutions
rg -n --iglob 'solutions/aver/VB_T5_*.av' 'handle\[State|State\.(get|set|put|modify)|!\s*\[State'

# 2) Inspect the target file's implementation details
rg -n 'fn sum_loop|fn sum_with_state|State|handle' solutions/aver/VB_T5_004_accumulator.av
Based on learnings: In vera-bench, Tier 5 problems intentionally evaluate correct wiring of effect handlers (State, Exn, IO) rather than pure arithmetic recursion.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T5_004_accumulator.av` around lines 5 - 13, The
implementation currently uses pure recursion (fn sum_loop and fn sum_with_state)
and must be rewritten to exercise the Tier-5 State effect: change sum_loop to
use State operations (e.g., State.get/State.put or State.modify) instead of an
explicit acc parameter, and implement sum_with_state to invoke that loop under a
State handler (e.g., handle[State] ... with initial state 0) so the final result
is returned from the handled state; update references to sum_loop and
sum_with_state accordingly so the file contains explicit State wiring and
handler usage.

problems/tier3/VB_T3_008_tree_count_leaves.json (1)

6-6: ⚠️ Potential issue | 🟡 Minor

Include the “at least 1” guarantee in description_neutral.

The neutral text currently drops the explicit lower bound guaranteed by contracts.ensures (@Nat.result >= 1).
Suggested fix
-  "description_neutral": "Define a binary tree type with leaf and branch constructors. Write a function that counts the number of leaf nodes.",
+  "description_neutral": "Define a binary tree type with leaf and branch constructors. Write a function that counts the number of leaf nodes; the result is at least 1.",
As per coding guidelines: ensure problem definitions remain consistent, and keep description_neutral aligned with existing contract semantics.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@problems/tier3/VB_T3_008_tree_count_leaves.json` at line 6, Update the
"description_neutral" string to explicitly state the lower bound guarantee from
the contract (mention that the function returns at least 1 leaf), e.g. include
phrasing like "returns the number of leaf nodes (at least 1)"; ensure this
aligns with the contract comment "@Nat.result >= 1" so the neutral description
mirrors the ensures clause and the guaranteed lower bound.

problems/tier5/VB_T5_010_safe_head.json (1)

6-6: ⚠️ Potential issue | 🟡 Minor

Use “array” in the neutral description to match the signature.

description_neutral says “list”, but the function contract is array-based (@Array<Int>). Please align wording to avoid prompt ambiguity for non-Vera languages.
Suggested fix
-  "description_neutral": "Return the first element of a list, or -1 if the list is empty.",
+  "description_neutral": "Return the first element of an array, or -1 if the array is empty.",
As per coding guidelines: “For this PR, problem JSONs add a new top-level field description_neutral ... Keep all other behavior ... unchanged.” and “ensure description_neutral restates the task ... while preserving the original functional meaning.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@problems/tier5/VB_T5_010_safe_head.json` at line 6, The neutral description
uses “list” but the function signature expects an array; update the
description_neutral field to use “array” while preserving meaning (e.g., change
"Return the first element of a list, or -1 if the list is empty." to "Return the
first element of an array, or -1 if the array is empty.") — locate and edit the
description_neutral value in the JSON (field name: description_neutral) to make
this single-word substitution.

solutions/aver/VB_T4_002_greatest_common_divisor.av (1)

13-17: 🧹 Nitpick | 🔵 Trivial

Pin the gcd(0, 0) convention in verify.

Adding this case will make the canonical behaviour explicit and reduce ambiguity in future refactors.

Suggested verify addition

 verify gcd
     gcd(12, 8)   => 4
     gcd(7, 0)    => 7
     gcd(0, 5)    => 5
+    gcd(0, 0)    => 0
     gcd(100, 75) => 25

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T4_002_greatest_common_divisor.av` around lines 13 - 17,
Add an explicit test for the gcd(0, 0) convention inside the verify block to pin
canonical behaviour: update the verify cases (the verify section containing
gcd(12, 8), gcd(7, 0), etc.) to include a line for gcd(0, 0) with the chosen
canonical result (e.g., => 0) so the gcd function's behaviour for (0,0) is
unambiguous for future refactors.

solutions/aver/VB_T4_005_sum_to_n.av (1)

6-10: ⚠️ Potential issue | 🔴 Critical

Guard negative input to prevent non-terminating recursion.

Line 10 recurses with n - 1 for every non-zero input, so any negative n never reaches the 0 base case and can run indefinitely.

Proposed fix

 fn sum_to_n(n: Int) -> Int
     ? "Returns the sum of integers from 0 to n."
-    match n
-        0 -> 0
-        _ -> n + sum_to_n(n - 1)
+    match n <= 0
+        true  -> 0
+        false -> n + sum_to_n(n - 1)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T4_005_sum_to_n.av` around lines 6 - 10, The function
sum_to_n currently recurses for every non-zero n causing non-termination on
negative inputs; update the pattern matching in sum_to_n to handle negative n
explicitly (for example add a branch like n < 0 -> 0 or return an error/raise an
exception depending on intended behavior) so recursion only proceeds for
positive integers and the base case is reachable; locate the match in sum_to_n
and add the negative-case branch (or a guard) before the recursive '_' branch to
prevent infinite recursion.

solutions/aver/VB_T4_004_power.av (1)

6-10: ⚠️ Potential issue | 🔴 Critical

Handle negative exponents to avoid runaway recursion.

With exp: Int, any negative exponent never reaches the 0 branch and the recursion can fail/hang.

Proposed fix

 fn power(base: Int, exp: Int) -> Int
     ? "Returns base^exp using recursive repeated multiplication."
-    match exp
-        0 -> 1
-        _ -> base * power(base, exp - 1)
+    match exp < 0
+        true  -> 0
+        false ->
+            match exp
+                0 -> 1
+                _ -> base * power(base, exp - 1)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T4_004_power.av` around lines 6 - 10, The power function
(power(base: Int, exp: Int)) doesn’t guard against negative exp and will recurse
forever; add a check at the start of power to handle exp < 0 (either
raise/return an error indicating negative exponents are unsupported or change
the API to accept fractional results and compute reciprocal by using positive
exponent on a Float/Double/base conversion). Update the function body to
short-circuit when exp < 0 and document the behavior so recursion only proceeds
for exp >= 0.

solutions/aver/VB_T5_010_safe_head.av (1)

11-14: 🧹 Nitpick | 🔵 Trivial

Add a negative-head verify case to lock sentinel behaviour.

Current checks are good, but one non-empty negative-head case helps ensure -1 is only the empty-list sentinel and not accidentally conflated in future edits.

Suggested verify addition

 verify safe_head
     safe_head([10, 20, 30]) => 10
     safe_head([42])         => 42
+    safe_head([0 - 5, 99])  => 0 - 5
     safe_head([])           => 0 - 1

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T5_010_safe_head.av` around lines 11 - 14, Add a non-empty
negative-head verify case to ensure the empty-list sentinel -1 isn't conflated
with a real head value: in the verify block for safe_head, add a line like
safe_head([-1, 99]) => -1 (or any non-empty list whose first element is -1) so
safe_head(/function/) returns the actual head -1 for non-empty lists rather than
treating -1 as the empty-list sentinel.

solutions/aver/VB_T1_007_safe_modulo.av (1)

4-6: ⚠️ Potential issue | 🔴 Critical

Guard b == 0 before division.

Line 6 divides by b unconditionally, so b == 0 can hard-fail execution.

Proposed fix

 fn safe_modulo(a: Int, b: Int) -> Int
-    ? "Returns a modulo b. Assumes b != 0."
-    a - (a / b) * b
+    ? "Returns a modulo b; returns 0 when b == 0."
+    match b == 0
+        true  -> 0
+        false -> a - (a / b) * b

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T1_007_safe_modulo.av` around lines 4 - 6, The function
safe_modulo performs a division by b without checking b; add an explicit guard
at the top of safe_modulo that tests if b == 0 and handles it (for example,
raise/return a descriptive error like "division by zero" or return a documented
default) before computing a - (a / b) * b so the division never executes with b
== 0.

solutions/aver/VB_T4_009_list_nth.av (1)

8-12: ⚠️ Potential issue | 🟡 Minor

Short-circuit negative indices before list recursion.

At Line 8, negative index values still recurse through the entire list before returning -1. Add an early guard to avoid unnecessary deep recursion and potential timeout/stack-risk on long inputs.

Suggested patch

 fn list_nth(xs: List<Int>, index: Int) -> Int
     ? "Returns the element at index, or -1 if out of bounds."
-    match xs
-        [] -> 0 - 1
-        [h, ..t] -> match index
-            0 -> h
-            _ -> list_nth(t, index - 1)
+    match index < 0
+        true -> 0 - 1
+        false -> match xs
+            [] -> 0 - 1
+            [h, ..t] -> match index
+                0 -> h
+                _ -> list_nth(t, index - 1)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T4_009_list_nth.av` around lines 8 - 12, The function
list_nth currently recurses for negative index values; add an early guard that
checks if index < 0 and immediately returns -1 before doing the pattern match on
xs to avoid deep recursion/stack risk. Locate function list_nth (the match xs
block) and insert the negative-index check (e.g., if index < 0 -> -1) prior to
the existing match so only non-negative indices proceed to the [] / [h, ..t]
branches.

solutions/aver/VB_T5_008_print_numbers.av (1)

14-16: ⚠️ Potential issue | 🟠 Major

Remove terminal blank-line print in the stop branch.

At Line 16, Console.print("") emits an extra trailing line. In benchmark harness output matching, this can cause false wrong-answer results even when numeric lines are correct.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T5_008_print_numbers.av` around lines 14 - 16, The stop
branch of the match currently calls Console.print("") which emits an unwanted
blank line; change the `false` branch in the match (the stop branch paired with
`match i <= n`) to a no-op instead of printing (e.g., replace `false ->
Console.print("")` with a unit/no-op branch such as `false -> ()` or simply omit
the print call) so the routine stops without emitting a trailing blank line.

solutions/aver/VB_T4_007_count_digits.av (1)

7-11: ⚠️ Potential issue | 🟡 Minor

Handle negative input explicitly to match declared domain.

At Line 9, negatives satisfy n < 10, so e.g. count_digits(-42) returns 1. That silently conflicts with the stated non-negative domain.

Suggested patch

 fn count_digits(n: Int) -> Int
     ? "Returns the number of digits in n. Handles 0 as having 1 digit."
-    match n < 10
-        true  -> 1
-        false -> 1 + count_digits(n / 10)
+    match n < 0
+        true  -> count_digits(0 - n)
+        false -> match n < 10
+            true  -> 1
+            false -> 1 + count_digits(n / 10)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T4_007_count_digits.av` around lines 7 - 11, count_digits
currently treats negative n as having 1 digit because the pattern `n < 10`
matches negatives; add an explicit negative-input branch at the start of fn
count_digits (e.g., check `n < 0`) and handle it consistently with the declared
non-negative domain—either raise/return an error/exception or normalize via
abs(n) before proceeding; update the match/guards so negatives are not lumped
into the `n < 10` case and ensure recursive calls (count_digits) use the same
non-negative precondition.

solutions/aver/VB_T2_010_sum_positives.av (1)

19-23: 🧹 Nitpick | 🔵 Trivial

Keep main examples aligned with verify cases.

verify includes sum_positives([0, 1, 2]) (Line [17]), but main omits it, so manual runs and verifier coverage can drift.

Suggested tidy-up

 fn main() -> Unit
     ! [Console.print]
     Console.print(sum_positives([]))
     Console.print(sum_positives([1, -2, 3, -4, 5]))
     Console.print(sum_positives([-1, -2, -3]))
+    Console.print(sum_positives([0, 1, 2]))

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T2_010_sum_positives.av` around lines 19 - 23, The examples
in main are missing the case used by verify; update the main function (the
Console.print calls inside fn main) to include Console.print(sum_positives([0,
1, 2])) so examples match the verify cases for sum_positives and remain aligned
for manual runs and testing.

solutions/aver/VB_T5_001_counter.av (1)

5-18: ⚠️ Potential issue | 🟠 Major

Use an inclusive stop condition in count to prevent runaway recursion.

At Line [7], stopping only on equality means count(n, target) does not terminate when n > target, which can stall run jobs.

Suggested fix

 fn count(n: Int, target: Int) -> Int
     ? "Increments n until it reaches target."
-    match n == target
+    match n >= target
         true  -> n
         false -> count(n + 1, target)

 verify count
     count(0, 3) => 3
     count(0, 0) => 0
     count(2, 5) => 5
+    count(5, 2) => 5

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T5_001_counter.av` around lines 5 - 18, The match in
function count currently only stops when n == target which causes infinite
recursion if n > target; update the stop condition in count to be inclusive
(e.g., check n >= target) and return target in that branch, and ensure
count_three continues to call count(0, 3) unchanged; locate the match inside fn
count and replace the equality branch with an inclusive comparison so the
recursion terminates when n has reached or exceeded target.

solutions/aver/VB_T4_003_even_odd_mutual_recursion.av (1)

7-17: ⚠️ Potential issue | 🔴 Critical

Guard negative inputs before mutual recursion.

At Line [11] and Line [17], negative n keeps decreasing and never reaches the 0 base case, so execution can diverge and hit harness timeouts.

Suggested fix

 fn is_even(n: Int) -> Bool
     ? "Returns true if n is even, using mutual recursion with is_odd."
-    match n
-        0 -> true
-        _ -> is_odd(n - 1)
+    match n < 0
+        true -> is_even(0 - n)
+        false -> match n
+            0 -> true
+            _ -> is_odd(n - 1)

 fn is_odd(n: Int) -> Bool
     ? "Returns true if n is odd, using mutual recursion with is_even."
-    match n
-        0 -> false
-        _ -> is_even(n - 1)
+    match n < 0
+        true -> is_odd(0 - n)
+        false -> match n
+            0 -> false
+            _ -> is_even(n - 1)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T4_003_even_odd_mutual_recursion.av` around lines 7 - 17,
The mutual-recursive functions is_even and is_odd lack a guard for negative
integers which causes infinite recursion; update both functions (is_even and
is_odd) to handle n < 0 by converting to a non-negative value before recursing
(e.g., call is_even(-n) / is_odd(-n) or dispatch to the other function with -n)
so recursion proceeds toward the 0 base case rather than decreasing
indefinitely.

solutions/aver/VB_T4_010_div_natural.av (1)

6-18: ⚠️ Potential issue | 🔴 Critical

Add a non-positive divisor guard to prevent non-termination.

At Line [8]-Line [10], b <= 0 can lead to infinite recursion and hang execution under timeout-constrained runs.

Suggested fix

 fn div_natural(a: Int, b: Int) -> Int
     ? "Returns a divided by b using repeated subtraction."
-    match a < b
-        true  -> 0
-        false -> 1 + div_natural(a - b, b)
+    match b <= 0
+        true  -> 0
+        false ->
+            match a < b
+                true  -> 0
+                false -> 1 + div_natural(a - b, b)

 verify div_natural
     div_natural(17, 5) => 3
     div_natural(10, 3) => 3
     div_natural(0, 1)  => 0
     div_natural(7, 7)  => 1
     div_natural(3, 10) => 0
+    div_natural(3, 0)  => 0

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T4_010_div_natural.av` around lines 6 - 18, The function
div_natural risks infinite recursion when b <= 0; update div_natural to first
handle non-positive divisors by adding a guard for b <= 0 and returning or
raising a clear failure instead of recursing (e.g., produce an error/exception
or a defined sentinel) so the match branch that does repeated subtraction only
runs for b > 0; locate the div_natural definition and add the b <= 0 case before
the existing match on a < b.

vera_bench/cli.py (1)

172-186: ⚠️ Potential issue | 🟠 Major

Treat non-zero aver --version exits as fatal too.

If the binary exists but aver --version returns non-zero, Line 180 still carries on with an empty or bogus version string and the run only fails later after generation has already started.

Suggested guard

         try:
             _av_proc = _sp.run(  # noqa: S603
                 ["aver", "--version"],
                 capture_output=True,
                 text=True,
                 timeout=5,
                 check=False,
             )
+            if _av_proc.returncode != 0:
+                err = (_av_proc.stderr or _av_proc.stdout).strip()
+                raise click.ClickException(err or "aver --version failed")
             aver_ver = _av_proc.stdout.strip().replace("aver ", "")
-        except (FileNotFoundError, _sp.TimeoutExpired):
-            console.print(
-                "[red]Error: aver not found on PATH. "
-                "Install with: cargo install aver-lang[/red]"
-            )
-            raise SystemExit(1)
+        except FileNotFoundError as e:
+            raise click.ClickException(
+                "aver not found on PATH. Install with: cargo install aver-lang"
+            ) from e
+        except _sp.TimeoutExpired as e:
+            raise click.ClickException("aver --version timed out") from e

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@vera_bench/cli.py` around lines 172 - 186, The try block that runs ["aver",
"--version"] currently ignores non-zero exits; after running _sp.run (the
_av_proc result), check _av_proc.returncode and if it's non-zero treat it as
fatal: print a helpful error via console.print that includes _av_proc.stderr
(and/or stdout) and exit with SystemExit(1). Update the logic around the
existing _av_proc/_av_proc.stdout handling in vera_bench/cli.py so you only set
aver_ver from _av_proc.stdout when returncode == 0, otherwise log the command
failure and exit.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@problems/tier2/VB_T2_003_greeting.json`:
- Line 6: Update the description_neutral value to preserve the original
precondition that the input name must be non-empty: change the text for the
description_neutral key to something like "Given a non-empty name, return
'Hello, <name>!'." so it restates the task plainly while keeping the non-empty
input constraint; locate and edit the description_neutral entry in the JSON (the
"description_neutral" string) to include the "non-empty" condition.

In `@solutions/aver/VB_T1_008_distance.av`:
- Around line 1-2: The module header for `module Distance` is missing an export
declaration for the `distance` function; update the header to include `exposes
[distance]` (i.e., change `module Distance` to include the exposes clause so the
`distance` function is exported), keeping the existing `intent = "Computes the
absolute distance between two integers."` line intact and ensuring the
`distance` symbol is listed exactly in the exposes clause.

In `@solutions/aver/VB_T4_008_multiply.av`:
- Around line 6-10: The multiply function can infinite-recurse when b is
negative; update fn multiply(a: Int, b: Int) to guard against negative b (e.g.,
add an early check in multiply: if b < 0 -> handle it deterministically —
throw/return an error value or convert to non-negative via sign handling), or
change the parameter type to a natural/nat type if Aver supports it; ensure the
new guard is placed before the match on b so multiply and its recursive case
(multiply(a, b - 1)) cannot be invoked with negative b.

In `@vera_bench/runner.py`:
- Around line 576-580: The _aver_literal helper returns raw f'"{value}"' for
strings which breaks .av syntax when value contains quotes, backslashes or
newlines; update _aver_literal to properly escape string contents before
embedding (e.g., replace backslashes, double quotes and newline/tab characters
or use a safe serializer like json.dumps on the string) so the returned literal
is a valid quoted .av string; ensure the list branch that calls _aver_literal
still works unchanged.

---

Duplicate comments:
In `@problems/tier3/VB_T3_008_tree_count_leaves.json`:
- Line 6: Update the "description_neutral" string to explicitly state the lower
bound guarantee from the contract (mention that the function returns at least 1
leaf), e.g. include phrasing like "returns the number of leaf nodes (at least
1)"; ensure this aligns with the contract comment "@Nat.result >= 1" so the
neutral description mirrors the ensures clause and the guaranteed lower bound.

In `@problems/tier5/VB_T5_010_safe_head.json`:
- Line 6: The neutral description uses “list” but the function signature expects
an array; update the description_neutral field to use “array” while preserving
meaning (e.g., change "Return the first element of a list, or -1 if the list is
empty." to "Return the first element of an array, or -1 if the array is empty.")
— locate and edit the description_neutral value in the JSON (field name:
description_neutral) to make this single-word substitution.

In `@solutions/aver/VB_T1_007_safe_modulo.av`:
- Around line 4-6: The function safe_modulo performs a division by b without
checking b; add an explicit guard at the top of safe_modulo that tests if b == 0
and handles it (for example, raise/return a descriptive error like "division by
zero" or return a documented default) before computing a - (a / b) * b so the
division never executes with b == 0.

In `@solutions/aver/VB_T2_010_sum_positives.av`:
- Around line 19-23: The examples in main are missing the case used by verify;
update the main function (the Console.print calls inside fn main) to include
Console.print(sum_positives([0, 1, 2])) so examples match the verify cases for
sum_positives and remain aligned for manual runs and testing.

In `@solutions/aver/VB_T4_002_greatest_common_divisor.av`:
- Around line 13-17: Add an explicit test for the gcd(0, 0) convention inside
the verify block to pin canonical behaviour: update the verify cases (the verify
section containing gcd(12, 8), gcd(7, 0), etc.) to include a line for gcd(0, 0)
with the chosen canonical result (e.g., => 0) so the gcd function's behaviour
for (0,0) is unambiguous for future refactors.

In `@solutions/aver/VB_T4_003_even_odd_mutual_recursion.av`:
- Around line 7-17: The mutual-recursive functions is_even and is_odd lack a
guard for negative integers which causes infinite recursion; update both
functions (is_even and is_odd) to handle n < 0 by converting to a non-negative
value before recursing (e.g., call is_even(-n) / is_odd(-n) or dispatch to the
other function with -n) so recursion proceeds toward the 0 base case rather than
decreasing indefinitely.

In `@solutions/aver/VB_T4_004_power.av`:
- Around line 6-10: The power function (power(base: Int, exp: Int)) doesn’t
guard against negative exp and will recurse forever; add a check at the start of
power to handle exp < 0 (either raise/return an error indicating negative
exponents are unsupported or change the API to accept fractional results and
compute reciprocal by using positive exponent on a Float/Double/base
conversion). Update the function body to short-circuit when exp < 0 and document
the behavior so recursion only proceeds for exp >= 0.

In `@solutions/aver/VB_T4_005_sum_to_n.av`:
- Around line 6-10: The function sum_to_n currently recurses for every non-zero
n causing non-termination on negative inputs; update the pattern matching in
sum_to_n to handle negative n explicitly (for example add a branch like n < 0 ->
0 or return an error/raise an exception depending on intended behavior) so
recursion only proceeds for positive integers and the base case is reachable;
locate the match in sum_to_n and add the negative-case branch (or a guard)
before the recursive '_' branch to prevent infinite recursion.

In `@solutions/aver/VB_T4_007_count_digits.av`:
- Around line 7-11: count_digits currently treats negative n as having 1 digit
because the pattern `n < 10` matches negatives; add an explicit negative-input
branch at the start of fn count_digits (e.g., check `n < 0`) and handle it
consistently with the declared non-negative domain—either raise/return an
error/exception or normalize via abs(n) before proceeding; update the
match/guards so negatives are not lumped into the `n < 10` case and ensure
recursive calls (count_digits) use the same non-negative precondition.

In `@solutions/aver/VB_T4_009_list_nth.av`:
- Around line 8-12: The function list_nth currently recurses for negative index
values; add an early guard that checks if index < 0 and immediately returns -1
before doing the pattern match on xs to avoid deep recursion/stack risk. Locate
function list_nth (the match xs block) and insert the negative-index check
(e.g., if index < 0 -> -1) prior to the existing match so only non-negative
indices proceed to the [] / [h, ..t] branches.

In `@solutions/aver/VB_T4_010_div_natural.av`:
- Around line 6-18: The function div_natural risks infinite recursion when b <=
0; update div_natural to first handle non-positive divisors by adding a guard
for b <= 0 and returning or raising a clear failure instead of recursing (e.g.,
produce an error/exception or a defined sentinel) so the match branch that does
repeated subtraction only runs for b > 0; locate the div_natural definition and
add the b <= 0 case before the existing match on a < b.

In `@solutions/aver/VB_T5_001_counter.av`:
- Around line 5-18: The match in function count currently only stops when n ==
target which causes infinite recursion if n > target; update the stop condition
in count to be inclusive (e.g., check n >= target) and return target in that
branch, and ensure count_three continues to call count(0, 3) unchanged; locate
the match inside fn count and replace the equality branch with an inclusive
comparison so the recursion terminates when n has reached or exceeded target.

In `@solutions/aver/VB_T5_004_accumulator.av`:
- Around line 5-13: The implementation currently uses pure recursion (fn
sum_loop and fn sum_with_state) and must be rewritten to exercise the Tier-5
State effect: change sum_loop to use State operations (e.g., State.get/State.put
or State.modify) instead of an explicit acc parameter, and implement
sum_with_state to invoke that loop under a State handler (e.g., handle[State]
... with initial state 0) so the final result is returned from the handled
state; update references to sum_loop and sum_with_state accordingly so the file
contains explicit State wiring and handler usage.

In `@solutions/aver/VB_T5_008_print_numbers.av`:
- Around line 14-16: The stop branch of the match currently calls
Console.print("") which emits an unwanted blank line; change the `false` branch
in the match (the stop branch paired with `match i <= n`) to a no-op instead of
printing (e.g., replace `false -> Console.print("")` with a unit/no-op branch
such as `false -> ()` or simply omit the print call) so the routine stops
without emitting a trailing blank line.

In `@solutions/aver/VB_T5_010_safe_head.av`:
- Around line 11-14: Add a non-empty negative-head verify case to ensure the
empty-list sentinel -1 isn't conflated with a real head value: in the verify
block for safe_head, add a line like safe_head([-1, 99]) => -1 (or any non-empty
list whose first element is -1) so safe_head(/function/) returns the actual head
-1 for non-empty lists rather than treating -1 as the empty-list sentinel.

In `@vera_bench/cli.py`:
- Around line 172-186: The try block that runs ["aver", "--version"] currently
ignores non-zero exits; after running _sp.run (the _av_proc result), check
_av_proc.returncode and if it's non-zero treat it as fatal: print a helpful
error via console.print that includes _av_proc.stderr (and/or stdout) and exit
with SystemExit(1). Update the logic around the existing
_av_proc/_av_proc.stdout handling in vera_bench/cli.py so you only set aver_ver
from _av_proc.stdout when returncode == 0, otherwise log the command failure and
exit.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7b5bd709-ce44-4b4e-b45e-0abad13cb93e

📥 Commits

Reviewing files that changed from the base of the PR and between 366023b and 8521dcd.

📒 Files selected for processing (104)

problems/tier1/VB_T1_001_absolute_value.json
problems/tier1/VB_T1_002_clamp.json
problems/tier1/VB_T1_003_signum.json
problems/tier1/VB_T1_004_max_of_two.json
problems/tier1/VB_T1_005_min_of_two.json
problems/tier1/VB_T1_006_is_positive.json
problems/tier1/VB_T1_007_safe_modulo.json
problems/tier1/VB_T1_008_distance.json
problems/tier1/VB_T1_009_max_of_three.json
problems/tier1/VB_T1_010_double_or_nothing.json
problems/tier2/VB_T2_001_sum_array.json
problems/tier2/VB_T2_002_filter_positives.json
problems/tier2/VB_T2_003_greeting.json
problems/tier2/VB_T2_004_is_empty_string.json
problems/tier2/VB_T2_005_contains_substring.json
problems/tier2/VB_T2_006_join_strings.json
problems/tier2/VB_T2_007_double_elements.json
problems/tier2/VB_T2_008_count_positives.json
problems/tier2/VB_T2_009_to_upper.json
problems/tier2/VB_T2_010_sum_positives.json
problems/tier3/VB_T3_001_list_length.json
problems/tier3/VB_T3_002_tree_depth.json
problems/tier3/VB_T3_003_expression_evaluator.json
problems/tier3/VB_T3_004_list_sum.json
problems/tier3/VB_T3_005_tree_sum.json
problems/tier3/VB_T3_006_option_unwrap_or.json
problems/tier3/VB_T3_007_list_contains.json
problems/tier3/VB_T3_008_tree_count_leaves.json
problems/tier3/VB_T3_009_list_append.json
problems/tier3/VB_T3_010_list_last.json
problems/tier4/VB_T4_001_fibonacci.json
problems/tier4/VB_T4_002_greatest_common_divisor.json
problems/tier4/VB_T4_003_even_odd_mutual_recursion.json
problems/tier4/VB_T4_004_power.json
problems/tier4/VB_T4_005_sum_to_n.json
problems/tier4/VB_T4_006_list_reverse.json
problems/tier4/VB_T4_007_count_digits.json
problems/tier4/VB_T4_008_multiply.json
problems/tier4/VB_T4_009_list_nth.json
problems/tier4/VB_T4_010_div_natural.json
problems/tier5/VB_T5_001_counter.json
problems/tier5/VB_T5_002_greeter_io_boundary.json
problems/tier5/VB_T5_003_safe_division_exceptions.json
problems/tier5/VB_T5_004_accumulator.json
problems/tier5/VB_T5_005_checked_index.json
problems/tier5/VB_T5_006_state_double.json
problems/tier5/VB_T5_007_exn_negate.json
problems/tier5/VB_T5_008_print_numbers.json
problems/tier5/VB_T5_009_state_max.json
problems/tier5/VB_T5_010_safe_head.json
solutions/aver/VB_T1_001_absolute_value.av
solutions/aver/VB_T1_002_clamp.av
solutions/aver/VB_T1_003_signum.av
solutions/aver/VB_T1_004_max_of_two.av
solutions/aver/VB_T1_005_min_of_two.av
solutions/aver/VB_T1_006_is_positive.av
solutions/aver/VB_T1_007_safe_modulo.av
solutions/aver/VB_T1_008_distance.av
solutions/aver/VB_T1_009_max_of_three.av
solutions/aver/VB_T1_010_double_or_nothing.av
solutions/aver/VB_T2_001_sum_array.av
solutions/aver/VB_T2_002_filter_positives.av
solutions/aver/VB_T2_003_greeting.av
solutions/aver/VB_T2_004_is_empty_string.av
solutions/aver/VB_T2_005_contains_substring.av
solutions/aver/VB_T2_006_join_strings.av
solutions/aver/VB_T2_007_double_elements.av
solutions/aver/VB_T2_008_count_positives.av
solutions/aver/VB_T2_009_to_upper.av
solutions/aver/VB_T2_010_sum_positives.av
solutions/aver/VB_T3_001_list_length.av
solutions/aver/VB_T3_002_tree_depth.av
solutions/aver/VB_T3_003_expression_evaluator.av
solutions/aver/VB_T3_004_list_sum.av
solutions/aver/VB_T3_005_tree_sum.av
solutions/aver/VB_T3_006_option_unwrap_or.av
solutions/aver/VB_T3_007_list_contains.av
solutions/aver/VB_T3_008_tree_count_leaves.av
solutions/aver/VB_T3_009_list_append.av
solutions/aver/VB_T3_010_list_last.av
solutions/aver/VB_T4_001_fibonacci.av
solutions/aver/VB_T4_002_greatest_common_divisor.av
solutions/aver/VB_T4_003_even_odd_mutual_recursion.av
solutions/aver/VB_T4_004_power.av
solutions/aver/VB_T4_005_sum_to_n.av
solutions/aver/VB_T4_006_list_reverse.av
solutions/aver/VB_T4_007_count_digits.av
solutions/aver/VB_T4_008_multiply.av
solutions/aver/VB_T4_009_list_nth.av
solutions/aver/VB_T4_010_div_natural.av
solutions/aver/VB_T5_001_counter.av
solutions/aver/VB_T5_002_greeter_io_boundary.av
solutions/aver/VB_T5_003_safe_division_exceptions.av
solutions/aver/VB_T5_004_accumulator.av
solutions/aver/VB_T5_005_checked_index.av
solutions/aver/VB_T5_006_state_double.av
solutions/aver/VB_T5_007_exn_negate.av
solutions/aver/VB_T5_008_print_numbers.av
solutions/aver/VB_T5_009_state_max.av
solutions/aver/VB_T5_010_safe_head.av
vera_bench/baseline_runner.py
vera_bench/cli.py
vera_bench/prompts.py
vera_bench/runner.py

Aver is a statically typed language designed for AI-assisted development, with zero LLM training data — a direct peer to Vera for zero-shot code generation benchmarks. ## Aver language support - 50 canonical solutions in solutions/aver/ (100% baseline pass rate) - prompts.py: build_aver_prompt() with averlang.dev/llms.txt as spec - runner.py: _evaluate_aver_code() with aver check + aver run pipeline, including fix-from-error retry loop - baseline_runner.py: run_aver_baseline() for canonical solution validation - cli.py: --language aver on both run and baselines commands ## Language-neutral descriptions Each problem JSON now has a `description_neutral` field: a purely functional description with no Vera-specific syntax, builtins, types, or implementation details. Python, TypeScript, and Aver prompt builders use description_neutral; Vera prompts still use the original description with contracts and signatures. This levels the playing field for cross-language comparison: all non-Vera languages receive the same language-neutral problem statement, rather than descriptions that reference Vera builtins (string_upper, array_fold) and Vera-specific concepts (handle[State<Int>], Exn<E>, decreases clauses). ## Initial results (single run, Claude Sonnet 4) | Metric | Aver | Vera | Python | TypeScript | |------------|------|------|--------|------------| | run_correct| 91% | 79% | 96% | 88% | Usage: vera-bench run --model claude-sonnet-4-20250514 --language aver \ --skill-md /path/to/llms.txt vera-bench baselines --language aver Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 3

♻️ Duplicate comments (25)

solutions/aver/VB_T4_001_fibonacci.av (1)

5-12: ⚠️ Potential issue | 🔴 Critical

Entry-point name does not match the problem specification.

Line [7] defines fib, but the problem entry point is fibonacci (problems/tier4/VB_T4_001_fibonacci.json, Line [17]). That mismatch can cause harness lookup/run failures for this canonical solution.

🔧 Proposed fix

 module Fibonacci
@@
-    exposes [fib]
+    exposes [fibonacci]
@@
-fn fib(n: Int) -> Int
+fn fibonacci(n: Int) -> Int
@@
-        _ -> fib(n - 1) + fib(n - 2)
+        _ -> fibonacci(n - 1) + fibonacci(n - 2)
@@
-verify fib
-    fib(0)  => 0
-    fib(1)  => 1
-    fib(5)  => 5
-    fib(10) => 55
+verify fibonacci
+    fibonacci(0)  => 0
+    fibonacci(1)  => 1
+    fibonacci(5)  => 5
+    fibonacci(10) => 55
@@
-    Console.print(fib(0))
-    Console.print(fib(1))
-    Console.print(fib(5))
-    Console.print(fib(10))
+    Console.print(fibonacci(0))
+    Console.print(fibonacci(1))
+    Console.print(fibonacci(5))
+    Console.print(fibonacci(10))

Also applies to: 14-18, 22-25

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T4_001_fibonacci.av` around lines 5 - 12, The entry-point
is named fib but the canonical problem expects fibonacci; update the exposed
list and the function name from fib to fibonacci (rename the function definition
fn fib(...) -> ... to fn fibonacci(...) -> ... and change exposes [fib] to
exposes [fibonacci]) and also update any other occurrences (the recursive calls
inside the function and other references shown around lines 14-18 and 22-25) to
call fibonacci so the harness can locate the correct entry-point.

solutions/aver/VB_T5_002_greeter_io_boundary.av (1)

8-8: ⚠️ Potential issue | 🔴 Critical

Use supported string composition for the greeting output.

Line [8] uses {name} interpolation, which appears non-standard for Aver and risks either literal output or a parse error. Prefer explicit concatenation for deterministic harness behaviour.

🔧 Proposed fix

-    Console.print("Hello, {name}!")
+    Console.print("Hello, " + name + "!")

What is the official Aver language syntax for inserting a variable into a string when printing (for example, producing "Hello, Alice!")? Is "{name}" interpolation supported, or should string concatenation with + be used?

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T5_002_greeter_io_boundary.av` at line 8, The Console.print
call currently uses non-supported "{name}" interpolation which can produce
literal text or errors; update the Console.print invocation (Console.print) to
build the greeting via explicit concatenation using the variable name (e.g.,
join "Hello, ", name, and "!" with +) so the runtime receives a single correctly
composed string.

solutions/aver/VB_T2_010_sum_positives.av (1)

19-23: 🧹 Nitpick | 🔵 Trivial

Keep main examples aligned with the verify cases.

For harness consistency, add the zero-containing input from verify to main as well.

Suggested patch

 fn main() -> Unit
     ! [Console.print]
     Console.print(sum_positives([]))
     Console.print(sum_positives([1, -2, 3, -4, 5]))
     Console.print(sum_positives([-1, -2, -3]))
+    Console.print(sum_positives([0, 1, 2]))

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T2_010_sum_positives.av` around lines 19 - 23, Main
currently prints three test calls to sum_positives but is missing the
zero-containing input used by verify; update the main function to include the
same zero-containing case that verify uses. Locate the main function and add a
Console.print(sum_positives(...)) call with the exact zero-containing list from
the verify cases so examples align with verification (keep using the
sum_positives symbol and the existing Console.print pattern).

solutions/aver/VB_T4_009_list_nth.av (1)

6-12: ⚠️ Potential issue | 🟡 Minor

Short-circuit negative indices before recursive descent.

At Line 12, negative indices still recurse through the full list before returning -1. Add an early guard to avoid needless depth and timeout risk on long inputs.

Suggested patch

 fn list_nth(xs: List<Int>, index: Int) -> Int
     ? "Returns the element at index, or -1 if out of bounds."
-    match xs
-        [] -> 0 - 1
-        [h, ..t] -> match index
-            0 -> h
-            _ -> list_nth(t, index - 1)
+    match index < 0
+        true -> 0 - 1
+        false -> match xs
+            [] -> 0 - 1
+            [h, ..t] -> match index
+                0 -> h
+                _ -> list_nth(t, index - 1)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T4_009_list_nth.av` around lines 6 - 12, The function
list_nth currently recurses even for negative index values; add an early guard
in fn list_nth (before descending into the list) to check if index < 0 and
return -1 immediately. Concretely, update the top-level matching logic in
list_nth (or add a new branch before recursing inside [h, ..t] -> ...) to handle
index < 0 -> -1 so negative indices don't traverse the list.

solutions/aver/VB_T4_007_count_digits.av (1)

7-11: ⚠️ Potential issue | 🟡 Minor

Handle negative input explicitly in count_digits.

At Line 9, negatives currently return 1 immediately (e.g. count_digits(-42)), which silently violates the stated non-negative domain.

Suggested patch

 fn count_digits(n: Int) -> Int
     ? "Returns the number of digits in n. Handles 0 as having 1 digit."
-    match n < 10
-        true  -> 1
-        false -> 1 + count_digits(n / 10)
+    match n < 0
+        true  -> count_digits(0 - n)
+        false -> match n < 10
+            true  -> 1
+            false -> 1 + count_digits(n / 10)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T4_007_count_digits.av` around lines 7 - 11, The function
count_digits currently treats negative n as valid—add an explicit negative-input
guard at the start of count_digits that detects n < 0 and raises or returns an
error/exception (e.g., ArgumentError or similar) so the function enforces the
non-negative domain; keep the existing logic for n == 0 and positive values
(match n < 10 / recursive case) but ensure negative values are handled first
with a clear error and message indicating input must be non-negative.

solutions/aver/VB_T2_005_contains_substring.av (1)

5-7: ⚠️ Potential issue | 🔴 Critical

Verify String.contains exists in current Aver stdlib before merging.

At Line 7, this is a compile-time blocker if the API is not supported. Please confirm the exact built-in and signature for substring lookup.

Does the latest Aver language/runtime provide `String.contains(haystack: String, needle: String)`? If not, what is the canonical substring-check API and signature?

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T2_005_contains_substring.av` around lines 5 - 7, The code
relies on String.contains(haystack, needle) which may not exist in the Aver
stdlib; verify the actual substring-check API and update the function
accordingly (either call the correct builtin or implement a fallback). Locate
the contains_substring function and replace the String.contains call with the
canonical API (e.g., string_instance.contains(needle) or
String.indexOf(haystack, needle) != -1) or implement a simple search using
String.indexOf/position API to return a Bool if no direct contains(haystack,
needle) exists.

solutions/aver/VB_T4_008_multiply.av (1)

6-10: ⚠️ Potential issue | 🟠 Major

Guard negative multipliers to prevent unbounded recursion.

At Line 10, multiply(a, b - 1) never reaches the 0 base case when b < 0, so this can recurse indefinitely.

Suggested patch

 fn multiply(a: Int, b: Int) -> Int
     ? "Returns a*b using recursive repeated addition."
-    match b
-        0 -> 0
-        _ -> a + multiply(a, b - 1)
+    match b < 0
+        true -> 0 - multiply(a, 0 - b)
+        false -> match b
+            0 -> 0
+            _ -> a + multiply(a, b - 1)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T4_008_multiply.av` around lines 6 - 10, The multiply
function currently recurses only decrementing b, which never terminates for
negative b; update multiply(a, b) to handle b < 0 by adding a guard branch
(e.g., when b < 0) that converts the problem to a non-negative multiplier and
applies the correct sign (for example call multiply(a, -b) and negate the result
or otherwise flip sign), so that the recursive case only runs for b > 0 and the
base case b == 0 is always reachable.

solutions/aver/VB_T5_004_accumulator.av (1)

5-14: ⚠️ Potential issue | 🟠 Major

Tier-5 effect wiring is currently bypassed.

sum_with_state (Line [11]-Line [14]) is implemented as pure recursion via acc, so the solution does not validate State-handler integration, which is the core Tier-5 signal.

Based on learnings: Tier 5 tasks are intentionally meant to assess correct wiring of effect handlers (State, Exn, IO), not only final arithmetic output.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T5_004_accumulator.av` around lines 5 - 14, The
implementation of sum_with_state bypasses the Tier-5 State effect by using an
explicit accumulator recursion (sum_loop); replace the pure-recursive approach
with a State-handler-based implementation: remove or stop using sum_loop for
accumulation and instead implement sum_with_state to perform a loop (or
recursion) that updates and reads a State counter via the language's State
effect operations (get/put or modify) and return the final state, ensuring the
code exercises the State handler integration points; update or add any helper
function names used for stepping (e.g., a step function) but keep sum_with_state
as the entry that drives State updates so the State handler is invoked.

solutions/aver/VB_T5_001_counter.av (1)

7-18: ⚠️ Potential issue | 🟠 Major

Use an inclusive base case to avoid divergence for n > target.

At Line [7], equality-only termination means count(5, 2) never returns, creating a timeout risk in harness execution paths that call count directly.

Suggested patch

 fn count(n: Int, target: Int) -> Int
     ? "Increments n until it reaches target."
-    match n == target
+    match n >= target
         true  -> n
         false -> count(n + 1, target)

 verify count
     count(0, 3) => 3
     count(0, 0) => 0
     count(2, 5) => 5
+    count(5, 2) => 5

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T5_001_counter.av` around lines 7 - 18, The termination
check in function count currently only matches equality (match n == target)
which diverges for inputs where n > target (e.g., count(5,2)); change the base
condition in count to use an inclusive comparison (n >= target) so the matcher
returns n for any n at or beyond target and the recursive branch only runs when
n < target; update the match in count (the same function referenced by
count_three and the verify cases) accordingly.

solutions/aver/VB_T4_003_even_odd_mutual_recursion.av (1)

7-17: ⚠️ Potential issue | 🔴 Critical

Handle negative inputs to prevent non-terminating mutual recursion.

At Line [11] and Line [17], n < 0 never converges to the 0 base case, so execution can hang and hit harness timeouts.

Suggested patch

 fn is_even(n: Int) -> Bool
     ? "Returns true if n is even, using mutual recursion with is_odd."
-    match n
-        0 -> true
-        _ -> is_odd(n - 1)
+    match n < 0
+        true -> is_even(0 - n)
+        false -> match n
+            0 -> true
+            _ -> is_odd(n - 1)

 fn is_odd(n: Int) -> Bool
     ? "Returns true if n is odd, using mutual recursion with is_even."
-    match n
-        0 -> false
-        _ -> is_even(n - 1)
+    match n < 0
+        true -> is_odd(0 - n)
+        false -> match n
+            0 -> false
+            _ -> is_even(n - 1)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T4_003_even_odd_mutual_recursion.av` around lines 7 - 17,
The mutual recursion in is_even and is_odd doesn't handle negative n and can
loop forever; update both functions (is_even and is_odd) to detect negative
inputs and normalize them (e.g., call the corresponding function with -n or use
abs(n)) before proceeding to the existing match so the recursion always reaches
the 0 base case for negative numbers as well.

problems/tier5/VB_T5_005_checked_index.json (1)

6-6: ⚠️ Potential issue | 🟡 Minor

Use array terminology to match the actual interface.

Line 6 says “list element”, but the task contract is array-based (@Array<Int>). Please switch to “array element” to avoid prompt drift.
Suggested patch
-  "description_neutral": "Safely access a list element by index. Return -1 if the index is out of bounds.",
+  "description_neutral": "Safely access an array element by index. Return -1 if the index is out of bounds.",
As per coding guidelines, problems/**/*.json should be reviewed for consistency in problem definitions.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@problems/tier5/VB_T5_005_checked_index.json` at line 6, Update the
"description_neutral" field in the JSON (the string currently containing "Safely
access a list element by index. Return -1 if the index is out of bounds.") to
use "array element" wording to match the task contract's `@Array`<Int> interface;
specifically replace "list element" with "array element" in the
description_neutral value so the problem text consistently references arrays.

problems/tier3/VB_T3_008_tree_count_leaves.json (1)

6-6: ⚠️ Potential issue | 🟡 Minor

Include the lower-bound guarantee in the neutral description.

Line 6 currently drops the result >= 1 semantic that is present in contracts.ensures, so prompts for non-Vera languages lose part of the problem contract.
Suggested patch
-  "description_neutral": "Define a binary tree type with leaf and branch constructors. Write a function that counts the number of leaf nodes.",
+  "description_neutral": "Define a binary tree type with leaf and branch constructors. Write a function that counts the number of leaf nodes; the result is at least 1.",
As per coding guidelines: “Problem definitions. Review for schema completeness, consistency between contracts in JSON and the canonical solution, correct test case expected outputs, and appropriate tier assignment.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@problems/tier3/VB_T3_008_tree_count_leaves.json` at line 6, The neutral
description dropped the lower-bound guarantee; update the "description_neutral"
field to reintroduce the semantic (result >= 1) present in contracts.ensures so
non-Vera prompts keep the same contract; locate the "description_neutral" string
in the JSON and append or rephrase it to state that the function returns at
least one leaf (e.g., "counts the number of leaf nodes (result >= 1)"),
mirroring the guarantee in contracts.ensures.

problems/tier5/VB_T5_010_safe_head.json (1)

6-7: ⚠️ Potential issue | 🟡 Minor

Use “array” in description_neutral to match the function signature.

description_neutral currently says “list”, but the entry-point contract is explicitly array-based. This can skew generated prompts for non-Vera targets.

Suggested fix

-  "description_neutral": "Return the first element of a list, or -1 if the list is empty.",
+  "description_neutral": "Return the first element of an array, or -1 if the array is empty.",

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@problems/tier5/VB_T5_010_safe_head.json` around lines 6 - 7, Update the
description_neutral field to use the word "array" instead of "list" to match the
entry-point signature for safe_head; locate the JSON object containing
"signature": "public fn safe_head(`@Array`<Int> -> `@Int`)" and replace the phrase
"list" in the "description_neutral" value so it reads something like "Return the
first element of an array, or -1 if the array is empty."

problems/tier2/VB_T2_003_greeting.json (1)

6-6: ⚠️ Potential issue | 🟡 Minor

Preserve the non-empty input precondition in the neutral description.

The neutral text drops a key constraint from the original task (“name is non-empty”), which can change model behaviour in language-neutral prompting.

Suggested fix

-  "description_neutral": "Build a greeting string. Given a name, return 'Hello, <name>!'.",
+  "description_neutral": "Build a greeting string. Given a non-empty name, return 'Hello, <name>!'.",

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@problems/tier2/VB_T2_003_greeting.json` at line 6, The neutral description
removed the precondition that the input name must be non-empty; update the
"description_neutral" field to preserve that constraint (e.g., change "Build a
greeting string. Given a name, return 'Hello, <name>!'." to include "name is
non-empty" so it matches the original intent), referencing the JSON key
"description_neutral" so reviewers can locate and modify that value accordingly.

solutions/aver/VB_T4_005_sum_to_n.av (1)

6-10: ⚠️ Potential issue | 🔴 Critical

Prevent unbounded recursion for negative inputs.

Line 10 always recurses for non-zero n; with negative values, the function moves away from 0 and never terminates. Please constrain the input domain or add a non-positive base case.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T4_005_sum_to_n.av` around lines 6 - 10, The function
sum_to_n currently recurses for any non-zero n and will not terminate for
negative inputs; update sum_to_n to prevent unbounded recursion by either
constraining the domain or adding an explicit non-positive/base case: modify the
match in sum_to_n to handle negative values (e.g., add a branch like n <= 0 -> 0
or n < 0 -> error/raise) or change the signature to accept only natural numbers
and validate input before recursion so sum_to_n(n - 1) cannot diverge for
negative n.

solutions/aver/VB_T5_007_exn_negate.av (1)

1-10: ⚠️ Potential issue | 🔴 Critical

This bypasses the Tier 5 exception-handler requirement.

Line 2 and Line 7-9 implement clamping directly, but this Tier 5 task is meant to model exception throw/handle flow. Please implement the required Exn-based path (require_non_negative + handler-backed safe_non_negative) rather than replacing it with plain pattern matching.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T5_007_exn_negate.av` around lines 1 - 10, The current
implementation in module ExnNegate uses direct pattern matching in
safe_non_negative; instead implement the Tier-5 exception flow by adding a
function require_non_negative(x: Int) that raises/throws an exception when x < 0
and returns x otherwise, then rewrite safe_non_negative to call
require_non_negative inside an exception handler/catch block so that the handler
intercepts the thrown exception and returns 0; remove the direct match-based
clamping and ensure symbols require_non_negative and safe_non_negative (and the
module ExnNegate) are the locations of the change.

solutions/aver/VB_T1_008_distance.av (1)

1-3: ⚠️ Potential issue | 🟡 Minor

Add explicit export for distance.

The module header does not export distance; add exposes [distance] for consistency and predictable entry-point visibility.

In Aver, if a module defines `fn distance(...)` without `exposes [distance]`, is the function private by default, and is `exposes` required for external entry-point access?

Proposed fix

 module Distance
     intent = "Computes the absolute distance between two integers."
+    exposes [distance]

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T1_008_distance.av` around lines 1 - 3, The module
'Distance' defines fn distance but does not export it; add an explicit export by
changing the module header to include exposes [distance] so the function becomes
a public entry point. Locate the module declaration (module Distance) and add
exposes [distance] to the header to ensure the symbol distance is visible
externally.

solutions/aver/VB_T4_004_power.av (1)

6-10: ⚠️ Potential issue | 🔴 Critical

Guard negative exponents before recursion.

Line 10 keeps recursing for exp < 0, so this path never reaches the 0 base case and can time out in the benchmark runner.

Proposed fix

 fn power(base: Int, exp: Int) -> Int
     ? "Returns base^exp using recursive repeated multiplication."
-    match exp
-        0 -> 1
-        _ -> base * power(base, exp - 1)
+    match exp < 0
+        true  -> 0
+        false ->
+            match exp
+                0 -> 1
+                _ -> base * power(base, exp - 1)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T4_004_power.av` around lines 6 - 10, The power function
can recurse infinitely for exp < 0; update fn power(base: Int, exp: Int) -> Int
to guard negative exponents before recursion (e.g., add a match arm or an
initial if-check for exp < 0 that raises an error or returns a defined value) so
that power(base, exp - 1) is only called for exp > 0 and the 0 -> 1 base case is
reachable; reference the power function and the match on exp when adding the new
negative-exponent branch.

solutions/aver/VB_T1_007_safe_modulo.av (1)

4-6: ⚠️ Potential issue | 🟠 Major

Guard b == 0 before division.

Line 6 performs division with no zero check, which can hard-fail execution instead of returning a deterministic result.

Proposed fix

 fn safe_modulo(a: Int, b: Int) -> Int
-    ? "Returns a modulo b. Assumes b != 0."
-    a - (a / b) * b
+    ? "Returns a modulo b; returns a when b is 0."
+    match b == 0
+        true  -> a
+        false -> a - (a / b) * b

#!/bin/bash
set -euo pipefail

spec_file="$(fd -i 'VB_T1_007_safe_modulo.json' problems | head -n1)"
jq '{id, description, description_neutral, contracts, test_cases, notes}' "$spec_file"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T1_007_safe_modulo.av` around lines 4 - 6, The function
safe_modulo performs a division by b without checking for zero; update the
function safe_modulo(a: Int, b: Int) to guard b == 0 before performing (a / b)
and (a - (a / b) * b) — if b == 0 return a deterministic value (e.g., 0) or
another project-standard sentinel, otherwise compute and return the existing
expression; ensure the check is the first operation so no division by zero can
occur.

solutions/aver/VB_T5_008_print_numbers.av (1)

14-16: ⚠️ Potential issue | 🟠 Major

Remove the terminal blank print in print_loop.

false -> Console.print("") emits an extra stdout line when recursion ends, which can break harness output matching and produces incorrect behaviour for n < 1 (should print nothing).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T5_008_print_numbers.av` around lines 14 - 16, The false
branch in print_loop currently prints an empty line causing extra stdout; change
the match in print_loop so that the false case does not call Console.print("") —
instead return/stop without emitting output (e.g., no-op or simply return) so
recursion termination produces no output; update the false -> branch that
references print_and_continue(i, n) / Console.print("") accordingly.

solutions/aver/VB_T4_010_div_natural.av (1)

6-10: ⚠️ Potential issue | 🔴 Critical

Add a non-positive divisor guard to prevent non-terminating recursion.

At Line 8-10, div_natural can recurse indefinitely when b <= 0 (e.g. b == 0 never changes a). This is a runtime blocker for the runner pipeline.

Proposed fix

 fn div_natural(a: Int, b: Int) -> Int
     ? "Returns a divided by b using repeated subtraction."
-    match a < b
-        true  -> 0
-        false -> 1 + div_natural(a - b, b)
+    match b <= 0
+        true  -> 0
+        false ->
+            match a < b
+                true  -> 0
+                false -> 1 + div_natural(a - b, b)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T4_010_div_natural.av` around lines 6 - 10, The function
div_natural(a, b) can recurse forever when b <= 0; add an explicit guard at the
start of div_natural to handle non-positive divisors (b <= 0) before the
existing match: return or raise a clear error/exception (or otherwise handle the
invalid input) when b is non-positive, otherwise continue with the current logic
(match a < b ... 1 + div_natural(a - b, b)).

vera_bench/runner.py (2)

566-580: ⚠️ Potential issue | 🟠 Major

Escape string arguments before generating Aver literals.

Raw interpolation at Line 577 will break wrapper syntax for quotes, backslashes, or newlines, so valid string test cases can fail at aver check.

Suggested hardening

     if isinstance(value, float):
         return str(value)
     if isinstance(value, str):
-        return f'"{value}"'
+        return json.dumps(value)
     if isinstance(value, list):
         items = ", ".join(_aver_literal(v) for v in value)
         return f"[{items}]"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@vera_bench/runner.py` around lines 566 - 580, The _aver_literal function
currently injects raw strings with f'"{value}"' which breaks Aver wrapper syntax
for quotes, backslashes and newlines; replace that branch to emit a properly
escaped double-quoted string by serializing/escaping the Python string (e.g. use
json.dumps(value) or an equivalent escaping routine) so special characters are
escaped consistently; update the string case in _aver_literal to call
json.dumps(value) (and add the import if needed).

674-718: ⚠️ Potential issue | 🟠 Major

Do not spend attempt 2 on host/tool failures.

This retry still fires for non-fixable cases such as "aver not found on PATH" or "aver check timed out", which wastes a model call and records a misleading second attempt.

Suggested guard

-    if language == "aver" and not eval_result["check_pass"] and max_fix_attempts > 0:
+    retryable_aver_failure = (
+        not eval_result["check_pass"]
+        and eval_result.get("error_message")
+        not in {"aver not found on PATH", "aver check timed out"}
+    )
+    if language == "aver" and retryable_aver_failure and max_fix_attempts > 0:

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@vera_bench/runner.py` around lines 674 - 718, The retry for "aver" should not
run when the failure is a host/tool issue; before calling build_aver_fix_prompt
and client.complete, inspect eval_result.get("error_message", "") for tool-host
failure indicators (e.g., "not found", "not on PATH", "timed out", "No such file
or directory", "failed to start") and if any match, skip the attempt2 flow (do
not call client.complete or _evaluate_aver_code) and immediately append/return a
ProblemResult marking check_pass=False and include the original error_message
and timestamps (use the same fields used elsewhere: ProblemResult with
problem_id, model=llm_response.model, language, attempt=2, check_pass=False,
error_message=err, timestamp=_now(), bench_version, vera_version) so we don't
spend a model call on non-fixable host/tool failures.

solutions/aver/VB_T5_009_state_max.av (2)

33-36: ⚠️ Potential issue | 🟡 Minor

Exercise the n < 1 boundary in state_max.

The implementation has a dedicated non-positive branch, but the verify block still only covers positive inputs.

Suggested verify additions

 verify state_max
     state_max(5)  => 5
     state_max(1)  => 1
     state_max(10) => 10
+    state_max(0)  => 0
+    state_max(-1) => 0

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T5_009_state_max.av` around lines 33 - 36, The verify block
for state_max only tests positive inputs but the implementation has a dedicated
non-positive branch; update the verify section for state_max to include boundary
cases where n < 1 (e.g., add tests such as state_max(0) and at least one
negative input like state_max(-3)) so the non-positive branch is exercised and
validated; keep the existing positive tests and place the new cases alongside
them in the same verify block referencing the state_max symbol.

24-27: ⚠️ Potential issue | 🟡 Minor

Cover the i <= current_max branch in max_step.

The current verify cases only prove the update path. Add one carry-forward case so the false branch is locked in as well.

Suggested verify addition

 verify max_step
     max_step(3, 5, 2) => 5
     max_step(1, 1, 0) => 1
+    max_step(2, 5, 3) => 3

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@solutions/aver/VB_T5_009_state_max.av` around lines 24 - 27, Add a verify
case that exercises the carry-forward (false) branch of max_step so the i <=
current_max path is covered; specifically, in the verify block for max_step add
a test where i is less than or equal to current_max (for example max_step(2, 5,
3) => 5 or any equivalent where i <= current_max) to lock in the non-update
branch and complement the existing update-path examples.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@problems/tier2/VB_T2_006_join_strings.json`:
- Line 6: Update the "description_neutral" field to use the phrase "array of
strings" to match the `@Array`<String> interface; specifically, replace "Join a
list of strings with a separator string." with wording like "Join an array of
strings with a separator string." so the description_neutral value aligns with
the array-based signature.

In `@problems/tier2/VB_T2_008_count_positives.json`:
- Line 6: The phrase "integer list" in the description_neutral field is
ambiguous and may steer generation toward an ADT List; update the value of
"description_neutral" to use "integer array" (or "array of integers") to match
the required signature `@Array`<Int> so the benchmark clearly expects an array
type.

In `@vera_bench/cli.py`:
- Around line 172-186: The try block invoking _sp.run(["aver", "--version"],
...) currently only catches missing binary/timeouts and then uses
_av_proc.stdout to set aver_ver; if the aver process returns a non-zero exit
code (e.g., broken install) aver_ver can be empty and execution continues. After
running _sp.run (symbol: _av_proc), check _av_proc.returncode for a non-zero
value and treat it like an error: print a descriptive message via console.print
(include _av_proc.stderr or _av_proc.stdout for details) and raise
SystemExit(1); ensure aver_ver is only set from _av_proc.stdout when returncode
== 0.

---

Duplicate comments:
In `@problems/tier2/VB_T2_003_greeting.json`:
- Line 6: The neutral description removed the precondition that the input name
must be non-empty; update the "description_neutral" field to preserve that
constraint (e.g., change "Build a greeting string. Given a name, return 'Hello,
<name>!'." to include "name is non-empty" so it matches the original intent),
referencing the JSON key "description_neutral" so reviewers can locate and
modify that value accordingly.

In `@problems/tier3/VB_T3_008_tree_count_leaves.json`:
- Line 6: The neutral description dropped the lower-bound guarantee; update the
"description_neutral" field to reintroduce the semantic (result >= 1) present in
contracts.ensures so non-Vera prompts keep the same contract; locate the
"description_neutral" string in the JSON and append or rephrase it to state that
the function returns at least one leaf (e.g., "counts the number of leaf nodes
(result >= 1)"), mirroring the guarantee in contracts.ensures.

In `@problems/tier5/VB_T5_005_checked_index.json`:
- Line 6: Update the "description_neutral" field in the JSON (the string
currently containing "Safely access a list element by index. Return -1 if the
index is out of bounds.") to use "array element" wording to match the task
contract's `@Array`<Int> interface; specifically replace "list element" with
"array element" in the description_neutral value so the problem text
consistently references arrays.

In `@problems/tier5/VB_T5_010_safe_head.json`:
- Around line 6-7: Update the description_neutral field to use the word "array"
instead of "list" to match the entry-point signature for safe_head; locate the
JSON object containing "signature": "public fn safe_head(`@Array`<Int> -> `@Int`)"
and replace the phrase "list" in the "description_neutral" value so it reads
something like "Return the first element of an array, or -1 if the array is
empty."

In `@solutions/aver/VB_T1_007_safe_modulo.av`:
- Around line 4-6: The function safe_modulo performs a division by b without
checking for zero; update the function safe_modulo(a: Int, b: Int) to guard b ==
0 before performing (a / b) and (a - (a / b) * b) — if b == 0 return a
deterministic value (e.g., 0) or another project-standard sentinel, otherwise
compute and return the existing expression; ensure the check is the first
operation so no division by zero can occur.

In `@solutions/aver/VB_T1_008_distance.av`:
- Around line 1-3: The module 'Distance' defines fn distance but does not export
it; add an explicit export by changing the module header to include exposes
[distance] so the function becomes a public entry point. Locate the module
declaration (module Distance) and add exposes [distance] to the header to ensure
the symbol distance is visible externally.

In `@solutions/aver/VB_T2_005_contains_substring.av`:
- Around line 5-7: The code relies on String.contains(haystack, needle) which
may not exist in the Aver stdlib; verify the actual substring-check API and
update the function accordingly (either call the correct builtin or implement a
fallback). Locate the contains_substring function and replace the
String.contains call with the canonical API (e.g.,
string_instance.contains(needle) or String.indexOf(haystack, needle) != -1) or
implement a simple search using String.indexOf/position API to return a Bool if
no direct contains(haystack, needle) exists.

In `@solutions/aver/VB_T2_010_sum_positives.av`:
- Around line 19-23: Main currently prints three test calls to sum_positives but
is missing the zero-containing input used by verify; update the main function to
include the same zero-containing case that verify uses. Locate the main function
and add a Console.print(sum_positives(...)) call with the exact zero-containing
list from the verify cases so examples align with verification (keep using the
sum_positives symbol and the existing Console.print pattern).

In `@solutions/aver/VB_T4_001_fibonacci.av`:
- Around line 5-12: The entry-point is named fib but the canonical problem
expects fibonacci; update the exposed list and the function name from fib to
fibonacci (rename the function definition fn fib(...) -> ... to fn
fibonacci(...) -> ... and change exposes [fib] to exposes [fibonacci]) and also
update any other occurrences (the recursive calls inside the function and other
references shown around lines 14-18 and 22-25) to call fibonacci so the harness
can locate the correct entry-point.

In `@solutions/aver/VB_T4_003_even_odd_mutual_recursion.av`:
- Around line 7-17: The mutual recursion in is_even and is_odd doesn't handle
negative n and can loop forever; update both functions (is_even and is_odd) to
detect negative inputs and normalize them (e.g., call the corresponding function
with -n or use abs(n)) before proceeding to the existing match so the recursion
always reaches the 0 base case for negative numbers as well.

In `@solutions/aver/VB_T4_004_power.av`:
- Around line 6-10: The power function can recurse infinitely for exp < 0;
update fn power(base: Int, exp: Int) -> Int to guard negative exponents before
recursion (e.g., add a match arm or an initial if-check for exp < 0 that raises
an error or returns a defined value) so that power(base, exp - 1) is only called
for exp > 0 and the 0 -> 1 base case is reachable; reference the power function
and the match on exp when adding the new negative-exponent branch.

In `@solutions/aver/VB_T4_005_sum_to_n.av`:
- Around line 6-10: The function sum_to_n currently recurses for any non-zero n
and will not terminate for negative inputs; update sum_to_n to prevent unbounded
recursion by either constraining the domain or adding an explicit
non-positive/base case: modify the match in sum_to_n to handle negative values
(e.g., add a branch like n <= 0 -> 0 or n < 0 -> error/raise) or change the
signature to accept only natural numbers and validate input before recursion so
sum_to_n(n - 1) cannot diverge for negative n.

In `@solutions/aver/VB_T4_007_count_digits.av`:
- Around line 7-11: The function count_digits currently treats negative n as
valid—add an explicit negative-input guard at the start of count_digits that
detects n < 0 and raises or returns an error/exception (e.g., ArgumentError or
similar) so the function enforces the non-negative domain; keep the existing
logic for n == 0 and positive values (match n < 10 / recursive case) but ensure
negative values are handled first with a clear error and message indicating
input must be non-negative.

In `@solutions/aver/VB_T4_008_multiply.av`:
- Around line 6-10: The multiply function currently recurses only decrementing
b, which never terminates for negative b; update multiply(a, b) to handle b < 0
by adding a guard branch (e.g., when b < 0) that converts the problem to a
non-negative multiplier and applies the correct sign (for example call
multiply(a, -b) and negate the result or otherwise flip sign), so that the
recursive case only runs for b > 0 and the base case b == 0 is always reachable.

In `@solutions/aver/VB_T4_009_list_nth.av`:
- Around line 6-12: The function list_nth currently recurses even for negative
index values; add an early guard in fn list_nth (before descending into the
list) to check if index < 0 and return -1 immediately. Concretely, update the
top-level matching logic in list_nth (or add a new branch before recursing
inside [h, ..t] -> ...) to handle index < 0 -> -1 so negative indices don't
traverse the list.

In `@solutions/aver/VB_T4_010_div_natural.av`:
- Around line 6-10: The function div_natural(a, b) can recurse forever when b <=
0; add an explicit guard at the start of div_natural to handle non-positive
divisors (b <= 0) before the existing match: return or raise a clear
error/exception (or otherwise handle the invalid input) when b is non-positive,
otherwise continue with the current logic (match a < b ... 1 + div_natural(a -
b, b)).

In `@solutions/aver/VB_T5_001_counter.av`:
- Around line 7-18: The termination check in function count currently only
matches equality (match n == target) which diverges for inputs where n > target
(e.g., count(5,2)); change the base condition in count to use an inclusive
comparison (n >= target) so the matcher returns n for any n at or beyond target
and the recursive branch only runs when n < target; update the match in count
(the same function referenced by count_three and the verify cases) accordingly.

In `@solutions/aver/VB_T5_002_greeter_io_boundary.av`:
- Line 8: The Console.print call currently uses non-supported "{name}"
interpolation which can produce literal text or errors; update the Console.print
invocation (Console.print) to build the greeting via explicit concatenation
using the variable name (e.g., join "Hello, ", name, and "!" with +) so the
runtime receives a single correctly composed string.

In `@solutions/aver/VB_T5_004_accumulator.av`:
- Around line 5-14: The implementation of sum_with_state bypasses the Tier-5
State effect by using an explicit accumulator recursion (sum_loop); replace the
pure-recursive approach with a State-handler-based implementation: remove or
stop using sum_loop for accumulation and instead implement sum_with_state to
perform a loop (or recursion) that updates and reads a State counter via the
language's State effect operations (get/put or modify) and return the final
state, ensuring the code exercises the State handler integration points; update
or add any helper function names used for stepping (e.g., a step function) but
keep sum_with_state as the entry that drives State updates so the State handler
is invoked.

In `@solutions/aver/VB_T5_007_exn_negate.av`:
- Around line 1-10: The current implementation in module ExnNegate uses direct
pattern matching in safe_non_negative; instead implement the Tier-5 exception
flow by adding a function require_non_negative(x: Int) that raises/throws an
exception when x < 0 and returns x otherwise, then rewrite safe_non_negative to
call require_non_negative inside an exception handler/catch block so that the
handler intercepts the thrown exception and returns 0; remove the direct
match-based clamping and ensure symbols require_non_negative and
safe_non_negative (and the module ExnNegate) are the locations of the change.

In `@solutions/aver/VB_T5_008_print_numbers.av`:
- Around line 14-16: The false branch in print_loop currently prints an empty
line causing extra stdout; change the match in print_loop so that the false case
does not call Console.print("") — instead return/stop without emitting output
(e.g., no-op or simply return) so recursion termination produces no output;
update the false -> branch that references print_and_continue(i, n) /
Console.print("") accordingly.

In `@solutions/aver/VB_T5_009_state_max.av`:
- Around line 33-36: The verify block for state_max only tests positive inputs
but the implementation has a dedicated non-positive branch; update the verify
section for state_max to include boundary cases where n < 1 (e.g., add tests
such as state_max(0) and at least one negative input like state_max(-3)) so the
non-positive branch is exercised and validated; keep the existing positive tests
and place the new cases alongside them in the same verify block referencing the
state_max symbol.
- Around line 24-27: Add a verify case that exercises the carry-forward (false)
branch of max_step so the i <= current_max path is covered; specifically, in the
verify block for max_step add a test where i is less than or equal to
current_max (for example max_step(2, 5, 3) => 5 or any equivalent where i <=
current_max) to lock in the non-update branch and complement the existing
update-path examples.

In `@vera_bench/runner.py`:
- Around line 566-580: The _aver_literal function currently injects raw strings
with f'"{value}"' which breaks Aver wrapper syntax for quotes, backslashes and
newlines; replace that branch to emit a properly escaped double-quoted string by
serializing/escaping the Python string (e.g. use json.dumps(value) or an
equivalent escaping routine) so special characters are escaped consistently;
update the string case in _aver_literal to call json.dumps(value) (and add the
import if needed).
- Around line 674-718: The retry for "aver" should not run when the failure is a
host/tool issue; before calling build_aver_fix_prompt and client.complete,
inspect eval_result.get("error_message", "") for tool-host failure indicators
(e.g., "not found", "not on PATH", "timed out", "No such file or directory",
"failed to start") and if any match, skip the attempt2 flow (do not call
client.complete or _evaluate_aver_code) and immediately append/return a
ProblemResult marking check_pass=False and include the original error_message
and timestamps (use the same fields used elsewhere: ProblemResult with
problem_id, model=llm_response.model, language, attempt=2, check_pass=False,
error_message=err, timestamp=_now(), bench_version, vera_version) so we don't
spend a model call on non-fixable host/tool failures.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6173d5a9-33be-4345-9e0e-e2918c0c1e68

📥 Commits

Reviewing files that changed from the base of the PR and between 8521dcd and 69a6fa7.

📒 Files selected for processing (104)

problems/tier1/VB_T1_001_absolute_value.json
problems/tier1/VB_T1_002_clamp.json
problems/tier1/VB_T1_003_signum.json
problems/tier1/VB_T1_004_max_of_two.json
problems/tier1/VB_T1_005_min_of_two.json
problems/tier1/VB_T1_006_is_positive.json
problems/tier1/VB_T1_007_safe_modulo.json
problems/tier1/VB_T1_008_distance.json
problems/tier1/VB_T1_009_max_of_three.json
problems/tier1/VB_T1_010_double_or_nothing.json
problems/tier2/VB_T2_001_sum_array.json
problems/tier2/VB_T2_002_filter_positives.json
problems/tier2/VB_T2_003_greeting.json
problems/tier2/VB_T2_004_is_empty_string.json
problems/tier2/VB_T2_005_contains_substring.json
problems/tier2/VB_T2_006_join_strings.json
problems/tier2/VB_T2_007_double_elements.json
problems/tier2/VB_T2_008_count_positives.json
problems/tier2/VB_T2_009_to_upper.json
problems/tier2/VB_T2_010_sum_positives.json
problems/tier3/VB_T3_001_list_length.json
problems/tier3/VB_T3_002_tree_depth.json
problems/tier3/VB_T3_003_expression_evaluator.json
problems/tier3/VB_T3_004_list_sum.json
problems/tier3/VB_T3_005_tree_sum.json
problems/tier3/VB_T3_006_option_unwrap_or.json
problems/tier3/VB_T3_007_list_contains.json
problems/tier3/VB_T3_008_tree_count_leaves.json
problems/tier3/VB_T3_009_list_append.json
problems/tier3/VB_T3_010_list_last.json
problems/tier4/VB_T4_001_fibonacci.json
problems/tier4/VB_T4_002_greatest_common_divisor.json
problems/tier4/VB_T4_003_even_odd_mutual_recursion.json
problems/tier4/VB_T4_004_power.json
problems/tier4/VB_T4_005_sum_to_n.json
problems/tier4/VB_T4_006_list_reverse.json
problems/tier4/VB_T4_007_count_digits.json
problems/tier4/VB_T4_008_multiply.json
problems/tier4/VB_T4_009_list_nth.json
problems/tier4/VB_T4_010_div_natural.json
problems/tier5/VB_T5_001_counter.json
problems/tier5/VB_T5_002_greeter_io_boundary.json
problems/tier5/VB_T5_003_safe_division_exceptions.json
problems/tier5/VB_T5_004_accumulator.json
problems/tier5/VB_T5_005_checked_index.json
problems/tier5/VB_T5_006_state_double.json
problems/tier5/VB_T5_007_exn_negate.json
problems/tier5/VB_T5_008_print_numbers.json
problems/tier5/VB_T5_009_state_max.json
problems/tier5/VB_T5_010_safe_head.json
solutions/aver/VB_T1_001_absolute_value.av
solutions/aver/VB_T1_002_clamp.av
solutions/aver/VB_T1_003_signum.av
solutions/aver/VB_T1_004_max_of_two.av
solutions/aver/VB_T1_005_min_of_two.av
solutions/aver/VB_T1_006_is_positive.av
solutions/aver/VB_T1_007_safe_modulo.av
solutions/aver/VB_T1_008_distance.av
solutions/aver/VB_T1_009_max_of_three.av
solutions/aver/VB_T1_010_double_or_nothing.av
solutions/aver/VB_T2_001_sum_array.av
solutions/aver/VB_T2_002_filter_positives.av
solutions/aver/VB_T2_003_greeting.av
solutions/aver/VB_T2_004_is_empty_string.av
solutions/aver/VB_T2_005_contains_substring.av
solutions/aver/VB_T2_006_join_strings.av
solutions/aver/VB_T2_007_double_elements.av
solutions/aver/VB_T2_008_count_positives.av
solutions/aver/VB_T2_009_to_upper.av
solutions/aver/VB_T2_010_sum_positives.av
solutions/aver/VB_T3_001_list_length.av
solutions/aver/VB_T3_002_tree_depth.av
solutions/aver/VB_T3_003_expression_evaluator.av
solutions/aver/VB_T3_004_list_sum.av
solutions/aver/VB_T3_005_tree_sum.av
solutions/aver/VB_T3_006_option_unwrap_or.av
solutions/aver/VB_T3_007_list_contains.av
solutions/aver/VB_T3_008_tree_count_leaves.av
solutions/aver/VB_T3_009_list_append.av
solutions/aver/VB_T3_010_list_last.av
solutions/aver/VB_T4_001_fibonacci.av
solutions/aver/VB_T4_002_greatest_common_divisor.av
solutions/aver/VB_T4_003_even_odd_mutual_recursion.av
solutions/aver/VB_T4_004_power.av
solutions/aver/VB_T4_005_sum_to_n.av
solutions/aver/VB_T4_006_list_reverse.av
solutions/aver/VB_T4_007_count_digits.av
solutions/aver/VB_T4_008_multiply.av
solutions/aver/VB_T4_009_list_nth.av
solutions/aver/VB_T4_010_div_natural.av
solutions/aver/VB_T5_001_counter.av
solutions/aver/VB_T5_002_greeter_io_boundary.av
solutions/aver/VB_T5_003_safe_division_exceptions.av
solutions/aver/VB_T5_004_accumulator.av
solutions/aver/VB_T5_005_checked_index.av
solutions/aver/VB_T5_006_state_double.av
solutions/aver/VB_T5_007_exn_negate.av
solutions/aver/VB_T5_008_print_numbers.av
solutions/aver/VB_T5_009_state_max.av
solutions/aver/VB_T5_010_safe_head.av
vera_bench/baseline_runner.py
vera_bench/cli.py
vera_bench/prompts.py
vera_bench/runner.py

jasisz · 2026-04-12T06:22:16Z

Initial benchmark results

Ran on two Anthropic models with description_neutral prompts. Aver spec provided via --skill-md pointing to llms.txt (376 lines). Single run, non-deterministic.

Aver results

Model	check@1	check best-of-2	fix@1	run_correct
Claude Haiku 4.5	80% (40/50)	90% (45/50)	50% (5/10)	95% (19/20)
Claude Sonnet 4	44% (22/50)	68% (34/50)	42% (12/28)	93% (14/15)

Cross-language comparison (same models, `description_neutral`)

Model	Aver	Python
Claude Haiku 4.5	95% run_correct	83% (20/24)
Claude Sonnet 4	93% run_correct	100% (24/24)

Haiku writes better Aver than Python on these problems. Sonnet is the opposite — perfect on Python, slightly lower on Aver.

Interesting finding: Haiku outperforms Sonnet on Aver across both check and run_correct. Hypothesis: Sonnet over-engineers solutions (adding verify laws, guard clauses, helper abstractions) which fail the parser, while Haiku follows the llms.txt spec more literally.

TypeScript results excluded — tsx not available on the test machine, so run_correct was 0% (infra issue, not model issue).

These are single-run results. Stable rates will require pass@k evaluation.

aallan

Review: Add Aver language support + language-neutral problem descriptions

Welcome to the project, @jasisz — this is a really impressive first contribution. The architecture is clean, the pattern consistency with our existing Python/TypeScript integration is excellent, and the description_neutral addition is independently valuable. Having Aver in VeraBench is going to make the Vera-vs-Aver comparison (named variables vs De Bruijn indices, verify blocks vs SMT contracts) one of the most interesting data points in the benchmark.

A few things to address before we can merge. Nothing here is a show-stopper — most of this is polish and housekeeping.

1. Add `.coderabbit.yaml` exclusions for Aver solutions

CodeRabbit doesn't understand Aver and has generated 17 false-positive comments on the .av solution files (I've replied to all of them inline). Please add these two lines to path_filters in .coderabbit.yaml:

- "!**/*.av"
- "!solutions/aver/**"

This matches our existing exclusions for Vera (!**/*.vera), Python (!solutions/python/**), and TypeScript (!solutions/typescript/**).

2. Tests

The PR adds substantial new code to baseline_runner.py, runner.py, prompts.py, and cli.py but no new test cases. We're currently at 85% coverage and have a CI gate at 80%. While the baselines were manually verified (50/50 — great work), we'll need at minimum:

Unit test for _aver_literal() — especially the negative integer (0 - N) conversion and string values
Unit test for _strip_aver_main()
Unit test for _neutral_description() fallback behaviour
Unit test for load_aver_llms_txt() (can mirror the existing load_skill_md tests)

These don't need to be exhaustive — a handful of cases each is fine.

3. Harness issues (valid CodeRabbit findings)

Three CodeRabbit comments on the harness code are legitimate and should be addressed:

a) Fail fast when aver is not on PATH (cli.py:177)
The version detection in cli.py catches FileNotFoundError and raises SystemExit(1) — good. But the baseline runner path should do the same rather than silently failing per-problem.

b) Don't retry on tooling failures (runner.py:753)
The fix-from-error retry currently fires even when the failure was "aver not found" or a timeout, not a compiler diagnostic. This wastes an API call and records a misleading attempt 2. Please gate the retry on actual aver check failures only.

c) Empty test_cases baseline validation (baseline_runner.py:386)
Problems with no test_cases currently report as passing without any validation. At minimum, aver check should still run on the canonical solution for these problems. This matches how the Vera baseline validates all problems even when there are no runnable test cases.

4. Tier 5 effect fidelity

This is a discussion point, not a blocker. CodeRabbit noticed that some Tier 5 solutions (VB-T5-001 counter, VB-T5-004 accumulator) implement state as pure recursion rather than using effect handlers. The briefing document notes that Aver uses Result<T,E> rather than algebraic effect handlers.

If Aver doesn't have State<T> / Exn<T> handlers, that's completely fine — but we should note it in the methodology. Options:

Option A: Add a note to the problem JSONs or the report that Aver Tier 5 tests pure-recursive equivalents rather than effect handling
Option B: Skip Tier 5 for Aver in the comparison tables (report separately)
Option C: Leave as-is and note the difference in the write-up

What's your preference? Happy to discuss.

5. Description wording (minor, from CodeRabbit)

A few description_neutral fields have minor wording issues. Please reply to these CodeRabbit comments inline and either fix or dismiss:

VB-T2-003 (greeting): neutral description drops the non-empty input constraint that the original has
VB-T5-005 (checked_index): uses "list element" but the problem works with arrays
VB-T5-010 (safe_head): uses "list" but problem is array-based
VB-T3-008 (tree_count_leaves): missing the non-empty-tree constraint

6. Minor code issues

a) String escaping in _aver_literal() (runner.py) — string values aren't escaped for double quotes. If a test case ever contains ", this would produce invalid Aver. Low risk with current test cases, but a one-line fix:

escaped = value.replace('\\\\', '\\\\\\\\').replace('"', '\\\\"')
return f'"{{escaped}}"'

b) VB-T5-008 extra print — CodeRabbit flagged that Console.print("") at line 16 emits a trailing blank line that could cause output mismatches. Worth checking if this affects baseline runs.

7. `description_neutral` and existing Python/TypeScript prompts

I want to flag that this PR changes what Python and TypeScript prompts receive — they now get description_neutral instead of the Vera-flavoured description. This is the right thing to do for fairness, but it means our v0.0.7 results (which used the Vera-flavoured descriptions) aren't directly comparable to future runs. We should note this methodology change in the CHANGELOG when we version-bump.

8. Reply to remaining CodeRabbit comments

There are 13 non-Aver CodeRabbit comments still open on this PR (the ones on harness code, description wording, and Tier 5 solutions). Please reply to each one inline — either with a fix commit or an explanation of why the current behaviour is correct. This helps us track what's been addressed and keeps the review thread clean for future contributors to reference.

Summary

This is a well-structured contribution that follows the existing architecture closely. The description_neutral field is a genuine improvement that benefits all non-Vera languages. The main asks are: CodeRabbit config, tests, three harness fixes, and a discussion about Tier 5. Looking forward to getting this merged.

aallan

Review: Add Aver language support + language-neutral problem descriptions

Welcome to the project, @jasisz — this is a really impressive first contribution. The architecture is clean, the pattern consistency with our existing Python/TypeScript integration is excellent, and the description_neutral addition is independently valuable. Having Aver in VeraBench is going to make the Vera-vs-Aver comparison (named variables vs De Bruijn indices, verify blocks vs SMT contracts) one of the most interesting data points in the benchmark.

A few things to address before we can merge. Nothing here is a show-stopper — most of this is polish and housekeeping.

1. Add `.coderabbit.yaml` exclusions for Aver solutions

CodeRabbit doesn't understand Aver and has generated 17 false-positive comments on the .av solution files (I've replied to all of them inline). Please add these two lines to path_filters in .coderabbit.yaml:

- "!**/*.av"
- "!solutions/aver/**"

This matches our existing exclusions for Vera (!**/*.vera), Python (!solutions/python/**), and TypeScript (!solutions/typescript/**).

2. Tests

The PR adds substantial new code to baseline_runner.py, runner.py, prompts.py, and cli.py but no new test cases. We're currently at 85% coverage and have a CI gate at 80%. While the baselines were manually verified (50/50 — great work), we'll need at minimum:

Unit test for _aver_literal() — especially the negative integer (0 - N) conversion and string values
Unit test for _strip_aver_main()
Unit test for _neutral_description() fallback behaviour
Unit test for load_aver_llms_txt() (can mirror the existing load_skill_md tests)

These don't need to be exhaustive — a handful of cases each is fine.

3. Harness issues (valid CodeRabbit findings)

Three CodeRabbit comments on the harness code are legitimate and should be addressed:

a) Fail fast when aver is not on PATH (cli.py:177)
The version detection in cli.py catches FileNotFoundError and raises SystemExit(1) — good. But the baseline runner path should do the same rather than silently failing per-problem.

b) Don't retry on tooling failures (runner.py:753)
The fix-from-error retry currently fires even when the failure was "aver not found" or a timeout, not a compiler diagnostic. This wastes an API call and records a misleading attempt 2. Please gate the retry on actual aver check failures only.

c) Empty test_cases baseline validation (baseline_runner.py:386)
Problems with no test_cases currently report as passing without any validation. At minimum, aver check should still run on the canonical solution for these problems. This matches how the Vera baseline validates all problems even when there are no runnable test cases.

4. Tier 5 effect fidelity

This is a discussion point, not a blocker. CodeRabbit noticed that some Tier 5 solutions (VB-T5-001 counter, VB-T5-004 accumulator) implement state as pure recursion rather than using effect handlers. The briefing document notes that Aver uses Result<T,E> rather than algebraic effect handlers.

If Aver doesn't have State<T> / Exn<T> handlers, that's completely fine — but we should note it in the methodology. Options:

Option A: Add a note to the problem JSONs or the report that Aver Tier 5 tests pure-recursive equivalents rather than effect handling
Option B: Skip Tier 5 for Aver in the comparison tables (report separately)
Option C: Leave as-is and note the difference in the write-up

What's your preference? Happy to discuss.

5. Description wording (minor, from CodeRabbit)

A few description_neutral fields have minor wording issues. Please reply to these CodeRabbit comments inline and either fix or dismiss:

VB-T2-003 (greeting): neutral description drops the non-empty input constraint that the original has
VB-T5-005 (checked_index): uses "list element" but the problem works with arrays
VB-T5-010 (safe_head): uses "list" but problem is array-based
VB-T3-008 (tree_count_leaves): missing the non-empty-tree constraint

6. Minor code issues

a) String escaping in _aver_literal() (runner.py) — string values aren't escaped for double quotes. If a test case ever contains a double-quote character, this would produce invalid Aver. Low risk with current test cases, but worth adding an escape step.

b) VB-T5-008 extra print — CodeRabbit flagged that Console.print("") at line 16 emits a trailing blank line that could cause output mismatches. Worth checking if this affects baseline runs.

7. `description_neutral` and existing Python/TypeScript prompts

I want to flag that this PR changes what Python and TypeScript prompts receive — they now get description_neutral instead of the Vera-flavoured description. This is the right thing to do for fairness, but it means our v0.0.7 results (which used the Vera-flavoured descriptions) aren't directly comparable to future runs. We should note this methodology change in the CHANGELOG when we version-bump.

8. Reply to remaining CodeRabbit comments

There are 13 non-Aver CodeRabbit comments still open on this PR (the ones on harness code, description wording, and Tier 5 solutions). Please reply to each one inline — either with a fix commit or an explanation of why the current behaviour is correct. This helps us track what's been addressed and keeps the review thread clean for future contributors to reference.

Summary

This is a well-structured contribution that follows the existing architecture closely. The description_neutral field is a genuine improvement that benefits all non-Vera languages. The main asks are: CodeRabbit config, tests, three harness fixes, and a discussion about Tier 5. Looking forward to getting this merged.

- Add .coderabbit.yaml exclusions for *.av and solutions/aver/ - Add unit tests for _aver_literal, _strip_aver_main, _neutral_description, load_aver_llms_txt (20 test cases) - Fix string escaping in _aver_literal() for quotes and backslashes - Fix VB-T5-008 trailing blank line: restructure loop so false branch prints last number instead of empty string - Fix description_neutral wording: T2-003 non-empty constraint, T5-005/T5-010 list→array, T3-008 add leaf count guarantee - Harness: fail fast when aver not on PATH in baselines command - Harness: skip fix-from-error retry on tooling failures (aver not found, timeout) — only retry on actual compiler diagnostics - Harness: explicit elif for language dispatch instead of catch-all else in baseline_runner.py and runner.py - Add CHANGELOG entry for description_neutral methodology change Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

vera_bench/cli.py (1)
203-216: 🛠️ Refactor suggestion | 🟠 Major

Persist the Aver toolchain version in each JSONL row.

These lines use aver_ver for the filename and console output only. The result schema still serialises no Aver version field, so once the JSONL is copied, renamed, or merged, you lose the execution provenance for that run.

A small cross-file follow-up would be to add aver_version to ProblemResult and thread it through run_benchmark() / run_single_problem(), mirroring the existing vera_version path. Based on learnings, the JSONL files are the canonical reproducibility record.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vera_bench/cli.py` around lines 203 - 216, Add an aver_version field to the
result schema and thread it through the benchmark run so each JSONL row persists
the Aver toolchain version: update the ProblemResult dataclass/structure to
include aver_version, modify run_benchmark() and run_single_problem() to accept
and populate aver_version (sourced from the existing aver_ver variable used for
filenames/console), and ensure the serialization code that writes each JSONL row
includes this new aver_version field just like the existing vera_version path.

♻️ Duplicate comments (4)

problems/tier2/VB_T2_006_join_strings.json (1)

6-6: ⚠️ Potential issue | 🟡 Minor

Use “array” instead of “list” to match the declared interface.
description_neutral currently says “list of strings”, but the signature is explicitly @Array<String>. Please align the wording.
Proposed patch
-  "description_neutral": "Join a list of strings with a separator string.",
+  "description_neutral": "Join an array of strings with a separator string.",
As per coding guidelines, problems/**/*.json should maintain consistency within problem definitions.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@problems/tier2/VB_T2_006_join_strings.json` at line 6, Update the problem
JSON's description_neutral string to use "array" instead of "list" to match the
declared signature; locate the "description_neutral" field (currently "Join a
list of strings with a separator string.") and change the text to "Join an array
of strings with a separator string." so wording is consistent with the
`@Array`<String> interface.

problems/tier2/VB_T2_008_count_positives.json (1)

6-7: ⚠️ Potential issue | 🟡 Minor

Use “array” in the neutral description to match the declared type.

At Line [6], “integer list” is inconsistent with the @Array<Int> signature at Line [7] and can misdirect generation towards ADT List.
Suggested fix
-  "description_neutral": "Count how many positive elements are in an integer list.",
+  "description_neutral": "Count how many positive elements are in an integer array.",
As per coding guidelines, problem definitions must be consistent across fields (schema completeness/consistency for problems/**/*.json).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@problems/tier2/VB_T2_008_count_positives.json` around lines 6 - 7, Update the
neutral description to use the term "array" to match the signature's `@Array`<Int>
type: change the "description_neutral" value from "integer list" to "integer
array" (or similar) so it aligns with the function signature "public fn
count_positives(`@Array`<Int> -> `@Nat`)" and avoids implying an ADT List.

vera_bench/cli.py (1)

172-180: ⚠️ Potential issue | 🟠 Major

Fail fast when aver --version exits non-zero.

Line 173 only handles missing binaries and timeouts. A broken install that returns a non-zero status leaves aver_ver blank and the benchmark still burns model calls before failing deeper in the runner.

💡 Suggested fix

         try:
             _av_proc = _sp.run(  # noqa: S603
                 ["aver", "--version"],
                 capture_output=True,
                 text=True,
                 timeout=5,
                 check=False,
             )
+            if _av_proc.returncode != 0:
+                err = (_av_proc.stderr or _av_proc.stdout).strip()
+                console.print(
+                    f"[red]Error: aver --version failed. {err or 'No output'}[/red]"
+                )
+                raise SystemExit(1)
             aver_ver = _av_proc.stdout.strip().replace("aver ", "")
         except (FileNotFoundError, _sp.TimeoutExpired):
             console.print(
                 "[red]Error: aver not found on PATH. "
                 "Install with: cargo install aver-lang[/red]"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@vera_bench/cli.py` around lines 172 - 180, The current subprocess invocation
that captures "aver --version" sets aver_ver from _av_proc.stdout but doesn't
fail when the command exits non-zero; update the check after running
subprocess.run (the _av_proc result) to detect a non-zero _av_proc.returncode
and immediately raise an error or exit (e.g., log and sys.exit(1)) so the CLI
fails fast instead of continuing with an empty aver_ver; specifically modify the
block that sets _av_proc and aver_ver to validate _av_proc.returncode and handle
it like missing binary/timeouts before using aver_ver.

vera_bench/runner.py (1)

566-589: 🛠️ Refactor suggestion | 🟠 Major

Annotate the new Aver helper inputs.

_aver_literal(value) and _aver_output_matches(actual, expected) introduce untyped parameters on the harness path, which falls short of the repo rule for typed Python in vera_bench/.

💡 Suggested fix

-def _aver_literal(value) -> str:
+def _aver_literal(value: bool | int | float | str | list[object]) -> str:
     """Convert a Python value to an Aver literal."""
     if isinstance(value, bool):
         return "true" if value else "false"
@@
-def _aver_output_matches(actual: str, expected) -> bool:
+def _aver_output_matches(
+    actual: str,
+    expected: bool | int | float | str | list[object],
+) -> bool:
     """Check if aver output matches expected, handling bool normalization."""
     from vera_bench.baseline_runner import _aver_output_matches as _match

As per coding guidelines, "Use Python 3.11+ with type hints everywhere in the VeraBench harness".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@vera_bench/runner.py` around lines 566 - 589, Both helper functions lack type
annotations; add explicit Python 3.11+ type hints: annotate _aver_literal(value:
Any) -> str and _aver_output_matches(actual: str, expected: Any) -> bool,
importing Any (and optionally list/Iterable types) from typing, and update any
internal collections (e.g., treat list inputs as list[Any]) so the functions and
their parameters are fully typed (refer to function names _aver_literal and
_aver_output_matches).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@problems/tier2/VB_T2_001_sum_array.json`:
- Line 6: Update the JSON field description_neutral to use “array” terminology
instead of “list”: change the phrase "integer list" to "integer array" (or
"array of integers") so the description matches the problem's `@Array`<Int> type
and tests array built-ins (array_fold); modify the value of description_neutral
accordingly.

In `@problems/tier2/VB_T2_002_filter_positives.json`:
- Line 6: Update the JSON value for "description_neutral" to use the term
"array" instead of "list" to match the declared `@Array`<Int> signature; locate
the "description_neutral" field in the problem definition (the current value
"Return a new list containing only the positive elements from an integer list.")
and change both occurrences of "list" to "array" so the description reads
consistently with the interface.

In `@problems/tier2/VB_T2_007_double_elements.json`:
- Line 6: Update the description_neutral value to use array terminology instead
of "integer list" so it matches the function signature/type contract
(`@Array`<Int>); for example replace "Double every element in an integer list."
with "Double every element in an integer array." and scan other keys in this
JSON (e.g., description_neutral) across problems/**/*.json to ensure consistent
use of "array" or "array of integers" wherever list/linked-list wording appears.

In `@vera_bench/baseline_runner.py`:
- Around line 558-572: The initial guard in baseline_runner.py already raises
NotImplementedError for any language not in ("python", "typescript", "aver"), so
the final else branch after selecting runner is unreachable; remove that else
branch (the final raise NotImplementedError block) and keep the
language-to-runner mapping via if language == "python": runner =
run_python_baseline, elif language == "typescript": runner =
run_typescript_baseline, elif language == "aver": runner = run_aver_baseline to
eliminate dead code and rely on the initial validation.

In `@vera_bench/runner.py`:
- Around line 457-470: The except block handling subprocess.TimeoutExpired and
FileNotFoundError should mark tool failures as failures instead of missing data:
in the try/except around subprocess.run (the verify_proc call in runner.py)
assign result["verify_pass"] = False in the except clause so the key remains
present (counted as a failed verify) rather than setting it to None or omitting
it; keep existing capture_output/text/env behavior unchanged.

---

Outside diff comments:
In `@vera_bench/cli.py`:
- Around line 203-216: Add an aver_version field to the result schema and thread
it through the benchmark run so each JSONL row persists the Aver toolchain
version: update the ProblemResult dataclass/structure to include aver_version,
modify run_benchmark() and run_single_problem() to accept and populate
aver_version (sourced from the existing aver_ver variable used for
filenames/console), and ensure the serialization code that writes each JSONL row
includes this new aver_version field just like the existing vera_version path.

---

Duplicate comments:
In `@problems/tier2/VB_T2_006_join_strings.json`:
- Line 6: Update the problem JSON's description_neutral string to use "array"
instead of "list" to match the declared signature; locate the
"description_neutral" field (currently "Join a list of strings with a separator
string.") and change the text to "Join an array of strings with a separator
string." so wording is consistent with the `@Array`<String> interface.

In `@problems/tier2/VB_T2_008_count_positives.json`:
- Around line 6-7: Update the neutral description to use the term "array" to
match the signature's `@Array`<Int> type: change the "description_neutral" value
from "integer list" to "integer array" (or similar) so it aligns with the
function signature "public fn count_positives(`@Array`<Int> -> `@Nat`)" and avoids
implying an ADT List.

In `@vera_bench/cli.py`:
- Around line 172-180: The current subprocess invocation that captures "aver
--version" sets aver_ver from _av_proc.stdout but doesn't fail when the command
exits non-zero; update the check after running subprocess.run (the _av_proc
result) to detect a non-zero _av_proc.returncode and immediately raise an error
or exit (e.g., log and sys.exit(1)) so the CLI fails fast instead of continuing
with an empty aver_ver; specifically modify the block that sets _av_proc and
aver_ver to validate _av_proc.returncode and handle it like missing
binary/timeouts before using aver_ver.

In `@vera_bench/runner.py`:
- Around line 566-589: Both helper functions lack type annotations; add explicit
Python 3.11+ type hints: annotate _aver_literal(value: Any) -> str and
_aver_output_matches(actual: str, expected: Any) -> bool, importing Any (and
optionally list/Iterable types) from typing, and update any internal collections
(e.g., treat list inputs as list[Any]) so the functions and their parameters are
fully typed (refer to function names _aver_literal and _aver_output_matches).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9cad44fc-48e2-4788-a9db-d7a0cd1ea0f3

📥 Commits

Reviewing files that changed from the base of the PR and between 69a6fa7 and c154b23.

📒 Files selected for processing (107)

.coderabbit.yaml
CHANGELOG.md
problems/tier1/VB_T1_001_absolute_value.json
problems/tier1/VB_T1_002_clamp.json
problems/tier1/VB_T1_003_signum.json
problems/tier1/VB_T1_004_max_of_two.json
problems/tier1/VB_T1_005_min_of_two.json
problems/tier1/VB_T1_006_is_positive.json
problems/tier1/VB_T1_007_safe_modulo.json
problems/tier1/VB_T1_008_distance.json
problems/tier1/VB_T1_009_max_of_three.json
problems/tier1/VB_T1_010_double_or_nothing.json
problems/tier2/VB_T2_001_sum_array.json
problems/tier2/VB_T2_002_filter_positives.json
problems/tier2/VB_T2_003_greeting.json
problems/tier2/VB_T2_004_is_empty_string.json
problems/tier2/VB_T2_005_contains_substring.json
problems/tier2/VB_T2_006_join_strings.json
problems/tier2/VB_T2_007_double_elements.json
problems/tier2/VB_T2_008_count_positives.json
problems/tier2/VB_T2_009_to_upper.json
problems/tier2/VB_T2_010_sum_positives.json
problems/tier3/VB_T3_001_list_length.json
problems/tier3/VB_T3_002_tree_depth.json
problems/tier3/VB_T3_003_expression_evaluator.json
problems/tier3/VB_T3_004_list_sum.json
problems/tier3/VB_T3_005_tree_sum.json
problems/tier3/VB_T3_006_option_unwrap_or.json
problems/tier3/VB_T3_007_list_contains.json
problems/tier3/VB_T3_008_tree_count_leaves.json
problems/tier3/VB_T3_009_list_append.json
problems/tier3/VB_T3_010_list_last.json
problems/tier4/VB_T4_001_fibonacci.json
problems/tier4/VB_T4_002_greatest_common_divisor.json
problems/tier4/VB_T4_003_even_odd_mutual_recursion.json
problems/tier4/VB_T4_004_power.json
problems/tier4/VB_T4_005_sum_to_n.json
problems/tier4/VB_T4_006_list_reverse.json
problems/tier4/VB_T4_007_count_digits.json
problems/tier4/VB_T4_008_multiply.json
problems/tier4/VB_T4_009_list_nth.json
problems/tier4/VB_T4_010_div_natural.json
problems/tier5/VB_T5_001_counter.json
problems/tier5/VB_T5_002_greeter_io_boundary.json
problems/tier5/VB_T5_003_safe_division_exceptions.json
problems/tier5/VB_T5_004_accumulator.json
problems/tier5/VB_T5_005_checked_index.json
problems/tier5/VB_T5_006_state_double.json
problems/tier5/VB_T5_007_exn_negate.json
problems/tier5/VB_T5_008_print_numbers.json
problems/tier5/VB_T5_009_state_max.json
problems/tier5/VB_T5_010_safe_head.json
solutions/aver/VB_T1_001_absolute_value.av
solutions/aver/VB_T1_002_clamp.av
solutions/aver/VB_T1_003_signum.av
solutions/aver/VB_T1_004_max_of_two.av
solutions/aver/VB_T1_005_min_of_two.av
solutions/aver/VB_T1_006_is_positive.av
solutions/aver/VB_T1_007_safe_modulo.av
solutions/aver/VB_T1_008_distance.av
solutions/aver/VB_T1_009_max_of_three.av
solutions/aver/VB_T1_010_double_or_nothing.av
solutions/aver/VB_T2_001_sum_array.av
solutions/aver/VB_T2_002_filter_positives.av
solutions/aver/VB_T2_003_greeting.av
solutions/aver/VB_T2_004_is_empty_string.av
solutions/aver/VB_T2_005_contains_substring.av
solutions/aver/VB_T2_006_join_strings.av
solutions/aver/VB_T2_007_double_elements.av
solutions/aver/VB_T2_008_count_positives.av
solutions/aver/VB_T2_009_to_upper.av
solutions/aver/VB_T2_010_sum_positives.av
solutions/aver/VB_T3_001_list_length.av
solutions/aver/VB_T3_002_tree_depth.av
solutions/aver/VB_T3_003_expression_evaluator.av
solutions/aver/VB_T3_004_list_sum.av
solutions/aver/VB_T3_005_tree_sum.av
solutions/aver/VB_T3_006_option_unwrap_or.av
solutions/aver/VB_T3_007_list_contains.av
solutions/aver/VB_T3_008_tree_count_leaves.av
solutions/aver/VB_T3_009_list_append.av
solutions/aver/VB_T3_010_list_last.av
solutions/aver/VB_T4_001_fibonacci.av
solutions/aver/VB_T4_002_greatest_common_divisor.av
solutions/aver/VB_T4_003_even_odd_mutual_recursion.av
solutions/aver/VB_T4_004_power.av
solutions/aver/VB_T4_005_sum_to_n.av
solutions/aver/VB_T4_006_list_reverse.av
solutions/aver/VB_T4_007_count_digits.av
solutions/aver/VB_T4_008_multiply.av
solutions/aver/VB_T4_009_list_nth.av
solutions/aver/VB_T4_010_div_natural.av
solutions/aver/VB_T5_001_counter.av
solutions/aver/VB_T5_002_greeter_io_boundary.av
solutions/aver/VB_T5_003_safe_division_exceptions.av
solutions/aver/VB_T5_004_accumulator.av
solutions/aver/VB_T5_005_checked_index.av
solutions/aver/VB_T5_006_state_double.av
solutions/aver/VB_T5_007_exn_negate.av
solutions/aver/VB_T5_008_print_numbers.av
solutions/aver/VB_T5_009_state_max.av
solutions/aver/VB_T5_010_safe_head.av
tests/test_runner.py
vera_bench/baseline_runner.py
vera_bench/cli.py
vera_bench/prompts.py
vera_bench/runner.py

Solutions: - VB-T5-001 counter: n == target → n >= target (prevent non-termination) - VB-T4-008 multiply: guard b <= 0 (prevent negative-b recursion) - VB-T4-009 list_nth: guard idx < 0 (short-circuit negative indices) - VB-T1-008 distance: add missing exposes [distance] - VB-T5-010 safe_head: add verify case for negative head value - VB-T5-009 state_max: add verify for false branch and boundary input - VB-T4-002 gcd: add verify gcd(0, 0) => 0 - VB-T2-010 sum_positives: align main with verify cases Harness: - cli.py: abort when aver --version returns non-zero exit code - runner.py: count aver verify timeout/missing as failure, not None - baseline_runner.py: remove unreachable else clause Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jasisz · 2026-04-13T09:38:35Z

Tier 5 effect fidelity (point 4)

Yeah, to be honest, Tier 5 was bothering me too.

I don't think it's really an Aver-specific issue though. Python and TypeScript already seem to have the same mismatch there: Vera is testing algebraic effect handlers, while the baseline languages are really solving the same end requirement with their own native mechanisms.

So maybe the fairest thing would be to keep the T5 runs, but not treat them as part of the main cross-language headline score. T1–T4 feels like the more apples-to-apples aggregate, with T5 reported separately as functional-equivalent results.

Happy to help with a follow-up PR for the reporting side if that's useful.

aallan · 2026-04-13T12:43:19Z

This is great work — merging now. Thank you for the thorough follow-up on the review feedback, the tests are comprehensive and the CodeRabbit comments are all resolved.

We'll follow up immediately with a documentation/housekeeping PR on our side to bring the rest of the repo in line with the new Aver support. For reference, here's what that will cover:

README.md — mention Aver in the quick start, supported languages, and a note in the results section
CLAUDE.md — add Aver to the problem structure docs, description_neutral field, gotchas
DESIGN.md — add Aver to the comparison languages section
ROADMAP.md — mark Aver as done, add MoonBit (#49) to the language expansion line
CONTRIBUTING.md — add a "how to add a new comparison language" section using this PR as the template
Version bump to v0.0.8 (pyproject.toml, CITATION.cff, CHANGELOG)
scripts/run_full_benchmark.py — add Aver to the target list

For future PRs: if you're adding a language or major feature, including updates to these docs in the same PR is ideal — it keeps everything atomic. But for a first contribution this was already excellent, and we're happy to handle the docs sweep this time.

Also filed #49 for MoonBit support — same integration pattern as yours. And we'll file an issue for the Tier 5 cross-language methodology question so we can track it separately.

aallan · 2026-04-13T12:43:53Z

Re: your benchmark results — the Haiku outperforming Sonnet finding is fascinating. We've seen hints of this pattern in other zero-training-data contexts: smaller models can be better at following novel in-context specifications because they have less "prior knowledge" competing with the instruction. Sonnet has stronger priors about how code should look (from billions of lines of training data), which helps enormously for Python but can actively hurt when the spec says "do it this unfamiliar way." Haiku, with weaker priors, follows the llms.txt more literally and gets fewer parser failures as a result.

This is actually evidence for the core thesis of both Vera and Aver — that language design (clear spec, structural guardrails) can matter more than model scale when training data is absent. Would love to see this replicated across more models when we do the full Aver benchmark run.

aallan

All review feedback addressed. Tests comprehensive, harness fixes solid, CodeRabbit comments resolved. Merging — follow-up docs PR coming immediately.

aallan · 2026-04-13T16:12:14Z

Hey @jasisz — really close to merging! CI is flagging two things, both straightforward to fix:

1. Lint: two lines too long (E501)

vera_bench/baseline_runner.py:484  E501 Line too long (94 > 88)
vera_bench/baseline_runner.py:???  E501 Line too long (93 > 88)

Our ruff config enforces 88-character line length. You can find and fix these with:

ruff check vera_bench/baseline_runner.py

Or auto-fix with ruff format vera_bench/baseline_runner.py (but double-check the result — ruff format can sometimes reflow display strings in unexpected ways).

2. Coverage dropped below 80% gate (75% vs 80% required)

The new Aver code in runner.py (65% coverage) and cli.py (61%) brought overall coverage from 85% down to 75%. The tests you added for _aver_literal, _strip_aver_main, _neutral_description, and load_aver_llms_txt are excellent — but the main evaluation pipeline (_evaluate_aver_code) and the CLI Aver paths are large blocks of uncovered code.

The quickest way to get past the 80% gate without needing the Aver toolchain installed in CI:

Mock-based tests for _evaluate_aver_code — mock the subprocess calls to aver check / aver run and verify the function returns the right dict structure. We do similar things for the Vera pipeline.
CLI path tests — test that --language aver is accepted, that the version detection works (mock subprocess), and that the mode warning fires for non-full-spec.

You can check coverage locally with:

pytest --cov=vera_bench --cov-report=term-missing -q

The CI gate is --cov-fail-under=80. If you want to see examples of how we structure mock tests, tests/test_models.py has good patterns — and happy to help if you get stuck on anything!

- Fix two lines exceeding 88-char limit in baseline_runner.py - Add mock tests for _evaluate_aver_code (check/fail/timeout/test_cases) - Add mock tests for run_single_problem Aver path (retry gate, tooling errors) - Add tests for Aver prompt builder and fix prompt - Add CLI acceptance tests for --language aver - Add tests for _aver_output_matches bool normalization - Coverage: 75% → 84% (above 80% CI gate) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jasisz · 2026-04-13T16:25:53Z

Thanks for the CI ping — both issues fixed in 8489469:

Ruff E501: two long lines in baseline_runner.py reformatted
Coverage: added mock tests for _evaluate_aver_code, run_single_problem Aver path, Aver prompt builders, CLI acceptance, and _aver_output_matches — coverage up from 75% to 84%

CI run needs first-time contributor approval on your end.

codecov · 2026-04-13T16:33:20Z

Codecov Report

❌ Patch coverage is 73.36066% with 65 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.06%. Comparing base (4f012fe) to head (8d512ad).
⚠️ Report is 9 commits behind head on main.

Files with missing lines	Patch %	Lines
vera_bench/baseline_runner.py	41.89%	43 Missing ⚠️
vera_bench/runner.py	88.79%	13 Missing ⚠️
vera_bench/cli.py	75.00%	8 Missing ⚠️
vera_bench/prompts.py	95.45%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #48      +/-   ##
==========================================
- Coverage   84.94%   83.06%   -1.89%     
==========================================
  Files          10       10              
  Lines        1116     1346     +230     
==========================================
+ Hits          948     1118     +170     
- Misses        168      228      +60

Flag	Coverage Δ
python	`83.06% <73.36%> (-1.89%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jasisz · 2026-04-13T17:00:33Z

Fixed — sorry for the CI dance, should have checked the workflow config before pushing. All three ruff steps pass now.

aallan · 2026-04-13T18:07:20Z

Fixed — sorry for the CI dance, should have checked the workflow config before pushing. All three ruff steps pass now.

@jasisz Looking good. The only remaining flag is the Codecov project check, which isn't a merge blocker — it compares against the base branch's ~85% average and flags the ~3pp regression. The actual CI coverage gate (--cov-fail-under=80) passes fine at 81.94%.

That said, if you'd like to push coverage closer to the base, the biggest opportunities are:

runner.py (dropped to ~65%) — _evaluate_aver_code() and _strip_aver_main() could use a few more test cases covering error paths (e.g., Aver code that fails to compile, timeout on aver run)
baseline_runner.py — run_aver_baseline() edge cases (missing solution file, aver not on PATH)
prompts.py — load_aver_llms_txt() fetch failure path, _neutral_description() fallback when description_neutral is missing

But this is absolutely not a blocker — happy to merge as-is and we can improve coverage incrementally? Let me know!

aallan · 2026-04-13T18:10:40Z

Docs PR #52 is queued and ready to merge after this one lands. It adds Aver documentation to README, CLAUDE.md, DESIGN.md, CONTRIBUTING.md, ROADMAP.md, and adds Aver targets to the full benchmark runner script. Only overlap is CHANGELOG.md which will need a quick rebase.

Coverage: 82% → 85% (cli.py: 67% → 77%) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The pchj/aver-lang URL is a 404. The correct repo is github.com/jasisz/aver (the PR #48 contributor repo). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jasisz · 2026-04-13T18:30:08Z

Pushed coverage from 82% to 85% — added CLI tests for aver version detection, mode warnings, and error paths. Should bring the Codecov project check closer to the baseline.

The pchj/aver-lang URL is a 404. The correct repo is github.com/jasisz/aver (the PR #48 contributor repo). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jasisz requested a review from aallan as a code owner April 11, 2026 21:18

jasisz force-pushed the add-aver-language branch from 9941928 to 366023b Compare April 11, 2026 21:26

coderabbitai Bot reviewed Apr 11, 2026

View reviewed changes

jasisz force-pushed the add-aver-language branch 3 times, most recently from 8521dcd to 69a6fa7 Compare April 12, 2026 06:01

coderabbitai Bot reviewed Apr 12, 2026

View reviewed changes

Comment thread problems/tier2/VB_T2_003_greeting.json Outdated

Comment thread solutions/aver/VB_T1_008_distance.av

Comment thread solutions/aver/VB_T4_008_multiply.av Outdated

Comment thread vera_bench/runner.py

jasisz force-pushed the add-aver-language branch from 69a6fa7 to 8e867ae Compare April 12, 2026 06:16

coderabbitai Bot reviewed Apr 12, 2026

View reviewed changes

Comment thread problems/tier2/VB_T2_006_join_strings.json

Comment thread problems/tier2/VB_T2_008_count_positives.json

Comment thread vera_bench/cli.py

aallan requested changes Apr 13, 2026

View reviewed changes

coderabbitai Bot reviewed Apr 13, 2026

View reviewed changes

Comment thread problems/tier2/VB_T2_001_sum_array.json

Comment thread problems/tier2/VB_T2_002_filter_positives.json

Comment thread problems/tier2/VB_T2_007_double_elements.json

Comment thread vera_bench/baseline_runner.py Outdated

Comment thread vera_bench/runner.py

aallan mentioned this pull request Apr 13, 2026

Add MoonBit as a comparison language #49

Open

7 tasks

aallan previously approved these changes Apr 13, 2026

View reviewed changes

aallan mentioned this pull request Apr 13, 2026

Tier 5 cross-language methodology: report T1-T4 aggregate separately #50

Closed

4 tasks

jasisz dismissed aallan’s stale review via 8489469 April 13, 2026 16:23

Fix E501 in test: split long inline string

fd40a76

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jasisz and others added 2 commits April 13, 2026 18:49

Run ruff format on all changed files

5ce93e2

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix S607 security lint: add noqa for partial executable paths

c2e76e9

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

aallan mentioned this pull request Apr 13, 2026

Document Aver support and add comparison language guide #52

Merged

3 tasks

Add CLI coverage tests: warnings, version detection, error paths

8d512ad

Coverage: 82% → 85% (cli.py: 67% → 77%) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

aallan added a commit that referenced this pull request Apr 13, 2026

Fix Aver repo URL: jasisz/aver not pchj/aver-lang

8265b3e

The pchj/aver-lang URL is a 404. The correct repo is github.com/jasisz/aver (the PR #48 contributor repo). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

aallan merged commit 89e2689 into aallan:main Apr 13, 2026
8 checks passed

aallan added a commit that referenced this pull request Apr 13, 2026

Fix Aver repo URL: jasisz/aver not pchj/aver-lang

226744a

The pchj/aver-lang URL is a 404. The correct repo is github.com/jasisz/aver (the PR #48 contributor repo). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

aallan mentioned this pull request Apr 13, 2026

Release v0.0.8: Aver support + description_neutral #53

Merged

2 tasks

jasisz mentioned this pull request Apr 14, 2026

Report T1-T4 aggregate separately for cross-language comparison #56

Merged

4 tasks

coderabbitai Bot mentioned this pull request Apr 16, 2026

Add 10 new T2/T3 problems with testable signatures #57

Merged

7 tasks

coderabbitai Bot mentioned this pull request May 21, 2026

Add AILANG as a baseline target language #70

Merged

Conversation

jasisz commented Apr 11, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Language-neutral descriptions

What is Aver

Usage

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jasisz commented Apr 12, 2026

Initial benchmark results

Aver results

Cross-language comparison (same models, description_neutral)

Uh oh!

aallan left a comment

Choose a reason for hiding this comment

Review: Add Aver language support + language-neutral problem descriptions

1. Add .coderabbit.yaml exclusions for Aver solutions

2. Tests

3. Harness issues (valid CodeRabbit findings)

4. Tier 5 effect fidelity

5. Description wording (minor, from CodeRabbit)

6. Minor code issues

7. description_neutral and existing Python/TypeScript prompts

8. Reply to remaining CodeRabbit comments

Summary

Uh oh!

aallan left a comment

Choose a reason for hiding this comment

Review: Add Aver language support + language-neutral problem descriptions

1. Add .coderabbit.yaml exclusions for Aver solutions

jasisz commented Apr 11, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 11, 2026 •

edited

Loading

Cross-language comparison (same models, `description_neutral`)

1. Add `.coderabbit.yaml` exclusions for Aver solutions

7. `description_neutral` and existing Python/TypeScript prompts

1. Add `.coderabbit.yaml` exclusions for Aver solutions

7. `description_neutral` and existing Python/TypeScript prompts

aallan commented Apr 13, 2026 •

edited

Loading

codecov Bot commented Apr 13, 2026 •

edited

Loading