Merge main into PR #70 — resolve tests/test_runner.py conflict

Positional conflict only: both #73 (TestRunBenchmarkParallel) and #70
(TestAilangLiteral / TestStripAilangMain / TestEvaluateAilangCode /
TestLoadAilangPrompt / TestAilangPrompt / TestAilangCLI) appended new
test classes at the end of tests/test_runner.py. Resolved by keeping
both groups in order: TestRunBenchmarkIntegration -> TestRunBenchmarkParallel
(from #73) -> AILANG test classes (from #70).

No logical conflict between the PRs. PR #73 modified run_benchmark
(with new _crash_result / _record helpers at lines ~1242-1280);
PR #70 modified the AILANG evaluator paths (lines ~554-831) and added
the AILANG dispatch branch in run_single_problem (lines ~975, 1017,
1107). The runner.py three-way merge resolved cleanly because the
regions are disjoint; only the test file needed manual stitching.

Verification:
- ruff check . / ruff format --check . both clean
- AST parse OK on merged test file
- All three target classes present exactly once (no duplicates)
- Final structure: TestRunBenchmarkIntegration -> TestRunBenchmarkParallel ->
  AILANG classes, separated by header comments

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
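
The "AST parse OK" and "present exactly once" checks in the verification list can be reproduced with a few lines of standard-library Python. The following is a minimal sketch, not the script that was actually used for this merge: the helper names `count_top_level_classes` and `find_merge_problems` are hypothetical, and the inline `merged` string is a tiny stand-in for the real tests/test_runner.py.

```python
import ast
from collections import Counter


def count_top_level_classes(source: str) -> Counter:
    """Count how often each top-level class name is defined in `source`.

    ast.parse raises SyntaxError if the text is not valid Python, which is
    what happens when leftover conflict markers survive a merge.
    """
    tree = ast.parse(source)
    return Counter(
        node.name for node in tree.body if isinstance(node, ast.ClassDef)
    )


def find_merge_problems(source: str, expected: list[str]) -> list[str]:
    """Return one message per expected class that is missing or duplicated."""
    counts = count_top_level_classes(source)
    return [
        f"{name}: defined {counts[name]} times"
        for name in expected
        if counts[name] != 1
    ]


# Hypothetical stand-in for the merged test file (assumption: the real file
# defines these classes at module level).
merged = """
class TestRunBenchmarkIntegration: pass
class TestRunBenchmarkParallel: pass
class TestAilangLiteral: pass
"""

expected = [
    "TestRunBenchmarkIntegration",
    "TestRunBenchmarkParallel",
    "TestAilangLiteral",
]
print(find_merge_problems(merged, expected))
```

Against the real file, `source` would be read with `pathlib.Path("tests/test_runner.py").read_text()`. A duplicated class shows up as a "defined 2 times" message, and a stray `<<<<<<<` marker fails earlier with a `SyntaxError`.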
README.md: 5 additions & 2 deletions
@@ -70,7 +70,7 @@ For each problem, we measure:
 - **fix@1** — Given the error message, can the model fix it in one turn?
 - **run_correct** — Does execution produce the correct output?
 
-The same problems are also run in Python, TypeScript, and [Aver](https://github.com/jasisz/aver) as baselines. Aver is a Haskell-inspired language with zero LLM training data, providing a second data point alongside Vera for the zero-training-data thesis.
+The same problems are also run in Python, TypeScript, [Aver](https://github.com/jasisz/aver), and [AILANG](https://ailang.sunholo.com/) as baselines. AILANG and Aver are zero-training-data languages, providing additional data points alongside Vera for the language-design-vs-training-data thesis.
 
 > **Cross-language comparison:** For cross-language headline rates, use the T1–T4 aggregate. Tier 5 tests Vera's algebraic effect handlers, which other languages solve with fundamentally different native idioms. See [#50](https://github.com/aallan/vera-bench/issues/50).
 
@@ -80,6 +80,7 @@ The same problems are also run in Python, TypeScript, and [Aver](https://github.
 * Git
 * Node.js 22+ *(optional, for TypeScript baselines and generation)*
 * [Aver](https://github.com/jasisz/aver) *(optional, for Aver baselines and generation)*
+* [AILANG](https://ailang.sunholo.com/) *(optional, for AILANG baselines and generation)*
 
 ## Installation
 
@@ -135,15 +136,17 @@ vera-bench run --model claude-sonnet-4-20250514 --problem VB-T1-001
 # Spec-from-NL mode (agent writes its own contracts)
 vera-bench run --model claude-sonnet-4-20250514 --mode spec-from-nl
 
-# Ask the same model to write Python, TypeScript, or Aver for comparison
+# Ask the same model to write Python, TypeScript, Aver, or AILANG for comparison
 vera-bench run --model claude-sonnet-4-20250514 --language python
 vera-bench run --model claude-sonnet-4-20250514 --language typescript
 vera-bench run --model claude-sonnet-4-20250514 --language aver
+vera-bench run --model claude-sonnet-4-20250514 --language ailang
ROADMAP.md: 1 addition & 0 deletions
@@ -29,6 +29,7 @@
 - [x] Strengthen postconditions to catch slot-swap bugs (issue #14)
 - [ ] Improve SKILL.md coverage of where blocks (issue #15)
 - [x] Test coverage ([issue #5](https://github.com/aallan/vera-bench/issues/5), ongoing — target 90%) — CI enforces 80% floor via `--cov-fail-under=80` in [ci.yml](.github/workflows/ci.yml), current coverage shown by [](https://codecov.io/gh/aallan/vera-bench)
+- [ ] Per-test subprocess-failure diagnostics — Aver and AILANG evaluators currently `continue` on per-test failures without capturing stderr, unlike the Python/TypeScript paths. Small shared-helper refactor (issue [#72](https://github.com/aallan/vera-bench/issues/72))
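
To make the new roadmap item concrete, here is one possible shape for the shared helper it mentions. This is a sketch under stated assumptions, not the design from issue #72 and not code from runner.py: `CaseOutcome` and `run_test_case` are hypothetical names, and the pass criterion (exit code 0 plus matching stdout) is assumed for illustration. The idea is that a failing test still skips ahead, but its stderr is recorded first.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaseOutcome:
    """Result of one per-test subprocess run (hypothetical structure)."""

    passed: bool
    stdout: str
    stderr: str
    returncode: Optional[int]  # None if the process timed out or never started


def run_test_case(
    cmd: list[str], expected_stdout: str, timeout: float = 10.0
) -> CaseOutcome:
    """Run one test command and keep stderr even when the test fails."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return CaseOutcome(False, "", f"timed out after {timeout}s", None)
    except OSError as exc:  # e.g. the language toolchain is not installed
        return CaseOutcome(False, "", f"could not start process: {exc}", None)
    passed = (
        proc.returncode == 0
        and proc.stdout.strip() == expected_stdout.strip()
    )
    return CaseOutcome(passed, proc.stdout, proc.stderr, proc.returncode)


# Usage: the loop still moves on after a failure, but the diagnostics survive.
cases = [
    ([sys.executable, "-c", "print(42)"], "42"),
    ([sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(3)"], "42"),
]
failures = []
for cmd, expected in cases:
    outcome = run_test_case(cmd, expected)
    if not outcome.passed:
        failures.append(outcome)  # stderr and exit code are retained here
        continue
```

The demo uses `sys.executable` only so that it runs anywhere. In the evaluators, `cmd` would be whatever Aver or AILANG invocation they already build.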