Commit 5e79d45

aallan and claude committed
Merge main into PR #70 — resolve tests/test_runner.py conflict
Positional conflict only: both #73 (TestRunBenchmarkParallel) and #70 (TestAilangLiteral / TestStripAilangMain / TestEvaluateAilangCode / TestLoadAilangPrompt / TestAilangPrompt / TestAilangCLI) appended new test classes at the end of tests/test_runner.py. Resolved by keeping both groups in order: TestRunBenchmarkIntegration -> TestRunBenchmarkParallel (from #73) -> AILANG test classes (from #70).

No logical conflict between the PRs. PR #73 modified run_benchmark (with new _crash_result / _record helpers at lines ~1242-1280); PR #70 modified the AILANG evaluator paths (lines ~554-831) and added the AILANG dispatch branch in run_single_problem (lines ~975, 1017, 1107). The runner.py three-way merge resolved cleanly because the regions are disjoint; only the test file needed manual stitching.

Verification:
- ruff check . / ruff format --check . both clean
- AST parse OK on merged test file
- All three target classes present exactly once (no duplicates)
- Final structure: TestRunBenchmarkIntegration -> TestRunBenchmarkParallel -> AILANG classes, separated by header comments

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
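The "AST parse" and "present exactly once" checks listed in the commit message could be scripted roughly as follows. This is a minimal sketch, not the script actually used for this merge: the function name `find_class_count_problems` is hypothetical, and the sample source is a stand-in for the real tests/test_runner.py.

```python
import ast
from collections import Counter


def find_class_count_problems(source: str, expected: list[str]) -> dict[str, int]:
    """Return {class_name: count} for every expected class that does not
    appear exactly once at module top level. An empty dict means all good.

    ast.parse raises SyntaxError if the file is not valid Python, which
    also catches leftover conflict markers such as '<<<<<<< HEAD'.
    """
    tree = ast.parse(source)
    counts = Counter(
        node.name for node in tree.body if isinstance(node, ast.ClassDef)
    )
    return {name: counts[name] for name in expected if counts[name] != 1}


# Stand-in for the merged test file (hypothetical content).
sample = """
class TestRunBenchmarkIntegration: ...
class TestRunBenchmarkParallel: ...
class TestAilangLiteral: ...
"""

targets = [
    "TestRunBenchmarkIntegration",
    "TestRunBenchmarkParallel",
    "TestAilangLiteral",
]
print(find_class_count_problems(sample, targets))  # prints {}
```

A duplicated class, the typical symptom of keeping both sides of a conflict twice, would show up with a count of 2; a class dropped during stitching would show up with a count of 0.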
2 parents 6db02f4 + 6e7b726 commit 5e79d45

72 files changed

Lines changed: 3074 additions & 12 deletions

.coderabbit.yaml

Lines changed: 2 additions & 0 deletions
@@ -48,11 +48,13 @@ reviews:
   path_filters:
     - "!**/*.vera"
     - "!**/*.av"
+    - "!**/*.ail"
     - "!context/**"
     - "!results/**/*.jsonl"
     - "!solutions/python/**"
     - "!solutions/typescript/**"
     - "!solutions/aver/**"
+    - "!solutions/ailang/**"
 
   path_instructions:
     - path: "vera_bench/**/*.py"

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -34,3 +34,6 @@ assets/benchmark_*.png
 # Script is committed; rendered PNGs are talk-prep ephemera and go into
 # the speaker's slide deck rather than the repo.
 assets/vera-bench_slide_*.png
+
+# AILANG runtime cache (created by `ailang run` in solutions/ailang/)
+solutions/ailang/.ailang/

README.md

Lines changed: 5 additions & 2 deletions
@@ -70,7 +70,7 @@ For each problem, we measure:
 - **fix@1** — Given the error message, can the model fix it in one turn?
 - **run_correct** — Does execution produce the correct output?
 
-The same problems are also run in Python, TypeScript, and [Aver](https://github.com/jasisz/aver) as baselines. Aver is a Haskell-inspired language with zero LLM training data, providing a second data point alongside Vera for the zero-training-data thesis.
+The same problems are also run in Python, TypeScript, [Aver](https://github.com/jasisz/aver), and [AILANG](https://ailang.sunholo.com/) as baselines. AILANG and Aver are zero-training-data languages, providing additional data points alongside Vera for the language-design-vs-training-data thesis.
 
 > **Cross-language comparison:** For cross-language headline rates, use the T1–T4 aggregate. Tier 5 tests Vera's algebraic effect handlers, which other languages solve with fundamentally different native idioms. See [#50](https://github.com/aallan/vera-bench/issues/50).
 
@@ -80,6 +80,7 @@ The same problems are also run in Python, TypeScript, and [Aver](https://github.
 * Git
 * Node.js 22+ *(optional, for TypeScript baselines and generation)*
 * [Aver](https://github.com/jasisz/aver) *(optional, for Aver baselines and generation)*
+* [AILANG](https://ailang.sunholo.com/) *(optional, for AILANG baselines and generation)*
 
 ## Installation
 
@@ -135,15 +136,17 @@ vera-bench run --model claude-sonnet-4-20250514 --problem VB-T1-001
 # Spec-from-NL mode (agent writes its own contracts)
 vera-bench run --model claude-sonnet-4-20250514 --mode spec-from-nl
 
-# Ask the same model to write Python, TypeScript, or Aver for comparison
+# Ask the same model to write Python, TypeScript, Aver, or AILANG for comparison
 vera-bench run --model claude-sonnet-4-20250514 --language python
 vera-bench run --model claude-sonnet-4-20250514 --language typescript
 vera-bench run --model claude-sonnet-4-20250514 --language aver
+vera-bench run --model claude-sonnet-4-20250514 --language ailang
 
 # Run canonical baselines as a reference
 vera-bench baselines
 vera-bench baselines --language typescript
 vera-bench baselines --language aver
+vera-bench baselines --language ailang
 
 # Generate a combined report
 vera-bench report results/

ROADMAP.md

Lines changed: 1 addition & 0 deletions
@@ -29,6 +29,7 @@
 - [x] Strengthen postconditions to catch slot-swap bugs (issue #14)
 - [ ] Improve SKILL.md coverage of where blocks (issue #15)
 - [x] Test coverage ([issue #5](https://github.com/aallan/vera-bench/issues/5), ongoing — target 90%) — CI enforces 80% floor via `--cov-fail-under=80` in [ci.yml](.github/workflows/ci.yml), current coverage shown by [![codecov](https://codecov.io/gh/aallan/vera-bench/graph/badge.svg)](https://codecov.io/gh/aallan/vera-bench)
+- [ ] Per-test subprocess-failure diagnostics — Aver and AILANG evaluators currently `continue` on per-test failures without capturing stderr, unlike the Python/TypeScript paths. Small shared-helper refactor (issue [#72](https://github.com/aallan/vera-bench/issues/72))
 
 ## Milestone 2: Longitudinal tracking
 
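
The roadmap item added above asks for a shared helper so that per-test subprocess failures keep their stderr instead of being skipped with a bare `continue`. One possible shape for such a helper is sketched below. The names `SubprocessOutcome` and `run_captured` are hypothetical and do not come from vera_bench; the real refactor tracked in issue #72 may look quite different.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional


@dataclass
class SubprocessOutcome:
    """Result of one per-test subprocess run (hypothetical type)."""
    ok: bool
    stdout: str
    stderr: str
    returncode: Optional[int]  # None if the process never ran to completion


def run_captured(cmd: list[str], timeout: float = 10.0) -> SubprocessOutcome:
    """Run cmd, always returning a record that preserves diagnostics.

    Instead of silently skipping a failed test, the caller gets stderr
    (or a synthetic message for timeouts and launch errors) to log.
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return SubprocessOutcome(False, "", f"timed out after {timeout}s", None)
    except OSError as exc:  # e.g. the toolchain binary is not installed
        return SubprocessOutcome(False, "", f"could not start: {exc}", None)
    return SubprocessOutcome(
        proc.returncode == 0, proc.stdout, proc.stderr, proc.returncode
    )


# Usage: a deliberately failing child process, standing in for a test run.
outcome = run_captured(
    [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(3)"]
)
if not outcome.ok:
    print(f"test failed (rc={outcome.returncode}): {outcome.stderr}")
```

Returning a record rather than raising keeps the evaluator loop simple: each language path can still move on to the next test, but with the failure reason attached to the result.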

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+module benchmark/solution
+
+-- VB-T1-001: Absolute Value
+-- Compute |x|. AILANG: simple if-as-expression, no Nat subtype so we return int.
+
+export func absolute_value(x: int) -> int =
+  if x < 0 then -x else x
+
+export func main() -> () ! {IO} {
+  println(show(absolute_value(0)));
+  println(show(absolute_value(42)));
+  println(show(absolute_value(-42)));
+  println(show(absolute_value(1)));
+  println(show(absolute_value(-1)))
+}
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+module benchmark/solution
+
+export func clamp(value: int, lo: int, hi: int) -> int =
+  if value < lo then lo
+  else if value > hi then hi
+  else value
+
+export func main() -> () ! {IO} {
+  println(show(clamp(50, 0, 100)));
+  println(show(clamp(-5, 0, 100)));
+  println(show(clamp(150, 0, 100)));
+  println(show(clamp(0, 0, 0)))
+}
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+module benchmark/solution
+
+export func signum(x: int) -> int =
+  if x < 0 then -1
+  else if x > 0 then 1
+  else 0
+
+export func main() -> () ! {IO} {
+  println(show(signum(42)));
+  println(show(signum(-7)));
+  println(show(signum(0)))
+}
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+module benchmark/solution
+
+export func max_of_two(a: int, b: int) -> int =
+  if a > b then a else b
+
+export func main() -> () ! {IO} {
+  println(show(max_of_two(3, 7)));
+  println(show(max_of_two(7, 3)));
+  println(show(max_of_two(5, 5)));
+  println(show(max_of_two(-1, -5)));
+  println(show(max_of_two(0, 0)))
+}
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+module benchmark/solution
+
+export func min_of_two(a: int, b: int) -> int =
+  if a < b then a else b
+
+export func main() -> () ! {IO} {
+  println(show(min_of_two(3, 7)));
+  println(show(min_of_two(7, 3)));
+  println(show(min_of_two(5, 5)));
+  println(show(min_of_two(-1, -5)))
+}
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+module benchmark/solution
+
+export func is_positive(x: int) -> bool = x > 0
+
+export func main() -> () ! {IO} {
+  println(show(is_positive(5)));
+  println(show(is_positive(-3)));
+  println(show(is_positive(0)));
+  println(show(is_positive(1)))
+}
