Commit 62b5285

Address CodeRabbit review on PR #70

- AILANG_RESULTS.md: reconcile conflicting LLM-eval status (line 7 said wired, line 110 said not-wired); annotate Haiku 100% as post-fix run, Kimi K2.5 stays 97%, both date-stamped.
- vera_bench/prompts.py: catch subprocess.TimeoutExpired on `ailang prompt` and surface as RuntimeError instead of letting the exception escape unstructured.
- vera_bench/runner.py: treat ALL non-zero `ailang check` exits as failures (except the explicit missing-main allowance); previously untagged compile errors could be misclassified as check_pass=True. Removes now-dead _is_ailang_compile_error helper.

Three other CodeRabbit comments are intentional non-fixes:

- VB_T2_009/T2_010/T4_009 empty `main = ()`: these problems have `test_cases: []`, so the baseline runner correctly uses check-only mode (baseline_runner.py:593). No-op main is right.
- VB_T1_007 safe_modulo / VB_T4_010 div_natural defensive programming: none of the published test_cases exercise b=0 or b<0, so the baselines match spec output. Hardening could ship as a follow-up.
- VB_T5_009 state_max: the spec description mentions State<Int>/handler, but test_cases verify only the observable behavior (n -> n). Baseline matches tests; a state-handler implementation is a follow-up.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 204f93f commit 62b5285

3 files changed

Lines changed: 15 additions & 27 deletions

AILANG_RESULTS.md

Lines changed: 5 additions & 5 deletions
@@ -13,10 +13,10 @@ This PR adds AILANG as a baseline target language AND as an LLM-eval target. Hea
 | Mode | Coverage | check@1 | run_correct@1 |
 |------|----------|---------|---------------|
 | **AI-authored references** (Claude Opus 4.7, iterated) | 60 problems | **100% (36/36 testable)** | **100% (36/36 testable)** |
-| **Single-shot LLM-eval** (Claude Haiku 4.5) | 60 problems | 90% (54/60) | **97% (35/36 testable)** |
-| **Single-shot LLM-eval** (Kimi K2.5 via OpenRouter) | 60 problems | 77% (46/60) | **97% (35/36 testable)** |
+| **Single-shot LLM-eval (post-fix)** (Claude Haiku 4.5, 2026-05-21) | 60 problems | 90% (54/60) | **100% (36/36 testable)** |
+| **Single-shot LLM-eval** (Kimi K2.5 via OpenRouter, 2026-05-21) | 60 problems | 77% (46/60) | **97% (35/36 testable)** |
 
-Both modes hit 97-100% run_correct. The reference run is the ceiling; the LLM-eval rows are the floor. AILANG can be written by very-different-sized models (cheap Haiku 4.5 + flagship Kimi K2.5) at comparable quality.
+Both modes hit 97-100% run_correct. The reference run is the ceiling; the LLM-eval rows are the floor. AILANG can be written by very-different-sized models (cheap Haiku 4.5 + flagship Kimi K2.5) at comparable quality. The Haiku "post-fix" row reflects results AFTER shipping the upstream prompt fix surfaced by this benchmark (letrec-needs-block-body example); the pre-fix Haiku run was 97% (35/36).
 
 | Tier | Tests | check@1 | run_correct@1 |
 |------|-------|---------|---------------|
@@ -36,7 +36,7 @@ Both the AI-authored reference run and the single-shot LLM-eval rows for AILANG,
 | Run | Mode | check@1 | run_correct@1 |
 |-----|------|---------|---------------|
 | **AILANG (this work)** | **AI-authored + iterated** (Claude Opus 4.7) | **100%** | **100%** |
-| **AILANG + Claude Haiku 4.5** | **LLM single-shot** (this work) | **90%** | **97%** |
+| **AILANG + Claude Haiku 4.5 (post-fix)** | **LLM single-shot** (this work) | **90%** | **100%** |
 | **AILANG + Kimi K2.5 (OpenRouter)** | **LLM single-shot** (this work) | **77%** | **97%** |
 | Vera + Kimi K2.5 | LLM single-shot (published) | 100% | 100% |
 | Vera + GPT-4.1 | LLM single-shot (published) || 91% |
@@ -107,7 +107,7 @@ For perf-honest reporting, the metric of interest is `tests_passed / tests_total
 
 ## Known limitations & follow-ups
 
-- **LLM-eval mode (`vera-bench run --language ailang`)** is not yet wired. The `run` subcommand currently only handles vera, python, typescript, and aver. Adding AILANG to that path requires (a) extending the click choice in `cli.py`, (b) implementing a prompt loader that fetches AILANG's teaching prompt via `ailang prompt`, and (c) hooking the LLM-generated output into the same `run_ailang_baseline` execution path. Tracked in [AILANG's M-VERA-BENCH-INTEGRATION design doc](https://github.com/sunholo-data/ailang/blob/dev/design_docs/planned/v0_23_0/m-vera-bench-integration.md) Phase 2.
+- **LLM-eval mode (`vera-bench run --language ailang`)**: wired in this PR (2026-05-21). Both Claude Haiku 4.5 and Kimi K2.5 (via OpenRouter) are tested above. See the LLM-eval rows in the Headline results table. Tracked in [AILANG's M-VERA-BENCH-INTEGRATION design doc](https://github.com/sunholo-data/ailang/blob/dev/design_docs/planned/v0_23_0/m-vera-bench-integration.md) Phase 2.
 - **`verify@1` parity**: Vera's `verify_tier1`/`verify_tier3` columns report Z3 contract verification. AILANG has Z3-backed `requires`/`ensures` via `ailang verify` but the current AILANG solutions don't ship with VeraBench's contract translations. Phase 2 of the design doc covers translating `contracts.requires`/`ensures` from problem JSON into AILANG syntax and reporting `verify@1` per-problem.
 - **`get_char_code`**: required adding `std/bytes.byteAt` upstream in AILANG. The benchmark surfaced a real stdlib gap; tracked + shipped in [AILANG's M-BYTES-TOINTS-BYTEAT design doc](https://github.com/sunholo-data/ailang/blob/dev/design_docs/planned/v0_23_0/m-bytes-toints-byteAt.md). Solutions older than 2026-05-21 use a placeholder; current solution uses `byteAt`.
 
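The check@1 and run_correct@1 percentages in the AILANG_RESULTS.md headline rows can be sanity-checked against the raw counts they quote. The sketch below assumes whole-percent rounding, which the file does not state explicitly; `pct` and `rows` are illustrative names, not part of vera-bench.

```python
# Raw (passed, total) counts copied from the post-change headline table rows.
rows = {
    "Haiku 4.5 check@1": (54, 60),
    "Kimi K2.5 check@1": (46, 60),
    "Haiku 4.5 run_correct@1 (post-fix)": (36, 36),
    "Kimi K2.5 run_correct@1": (35, 36),
}


def pct(passed: int, total: int) -> int:
    """Whole-percent pass rate; the rounding rule is an assumption."""
    return round(100 * passed / total)


for label, (passed, total) in rows.items():
    print(f"{label}: {pct(passed, total)}% ({passed}/{total})")
```

Under that assumption the counts reproduce the 90%, 77%, 100%, and 97% figures shown in the table.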
vera_bench/prompts.py

Lines changed: 5 additions & 0 deletions
@@ -273,6 +273,11 @@ def load_ailang_prompt(source: str | Path | None = None) -> str:
             "ailang not found on PATH. "
             "Install AILANG: https://github.com/sunholo-data/ailang"
         ) from e
+    except subprocess.TimeoutExpired as e:
+        raise RuntimeError(
+            "`ailang prompt --source embedded` timed out after 10s. "
+            "Check your ailang installation."
+        ) from e
     if result.returncode != 0:
         raise RuntimeError(
             f"`ailang prompt` failed: {result.stderr[:200]}"
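The pattern in this hunk, converting both a missing binary and a timeout into one structured `RuntimeError`, can be sketched in isolation. This is a minimal illustration, not the code in `vera_bench/prompts.py`: `run_cli` and `timeout_s` are hypothetical names, and the 10-second default simply echoes the message in the diff.

```python
import subprocess


def run_cli(cmd: list[str], timeout_s: float = 10.0) -> str:
    """Run a CLI command and return its stdout.

    Hypothetical helper: launch failures, timeouts, and non-zero exits
    are all surfaced as RuntimeError so callers handle a single type.
    """
    shown = " ".join(cmd)
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except FileNotFoundError as e:
        # The executable is not on PATH.
        raise RuntimeError(f"{cmd[0]} not found on PATH.") from e
    except subprocess.TimeoutExpired as e:
        # subprocess.run kills the child before raising TimeoutExpired.
        raise RuntimeError(f"`{shown}` timed out after {timeout_s:g}s.") from e
    if result.returncode != 0:
        raise RuntimeError(f"`{shown}` failed: {result.stderr[:200]}")
    return result.stdout
```

Chaining with `from e` keeps the original exception available as `__cause__`, so the structured message does not hide the underlying failure.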

vera_bench/runner.py

Lines changed: 5 additions & 22 deletions
@@ -546,24 +546,6 @@ def _evaluate_aver_code(
     return result
 
 
-_AILANG_COMPILE_ERROR_TAGS = (
-    "Error PAR",
-    "Error TC",
-    "Error MOD",
-    "PAR_",
-    "TC_",
-    "MOD_",
-    "EFF_",
-    "parse error",
-    "module loading error",
-    "type error",
-)
-
-
-def _is_ailang_compile_error(err: str) -> bool:
-    return any(tag in err for tag in _AILANG_COMPILE_ERROR_TAGS)
-
-
 def _strip_ailang_main(code: str) -> str:
     """Remove any top-level `main` function from AILANG code.
 
@@ -727,12 +709,13 @@ def _evaluate_ailang_code(
         return result
 
     # When checking without a main, AILANG may complain about a missing
-    # entrypoint or pure-only module — that's fine if the only error is
-    # the missing main. We treat real compile errors (PAR/TC/MOD) as
-    # failures but tolerate the missing-main case.
+    # entrypoint or pure-only module — that's the ONE non-zero exit we
+    # tolerate, since the harness adds a per-test-case main below. Every
+    # other non-zero exit (tagged compile error OR untagged failure) is
+    # treated as check failure.
     if check_proc.returncode != 0:
         err = (check_proc.stderr or check_proc.stdout)[:500]
-        if _is_ailang_compile_error(err) and "missing main" not in err.lower():
+        if "missing main" not in err.lower():
             result["check_pass"] = False
             result["error_message"] = err
             if not test_cases:
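To make the behavioral change concrete, the two classification rules can be compared side by side. The sketch below is an assumption-laden reduction of the hunk, not the real `_evaluate_ailang_code`: `old_check_pass` and `new_check_pass` are hypothetical names, the tag tuple is copied from the removed `_AILANG_COMPILE_ERROR_TAGS`, and the sample error strings are invented for illustration rather than taken from real `ailang check` output.

```python
# Tags copied from the removed _AILANG_COMPILE_ERROR_TAGS tuple.
OLD_TAGS = (
    "Error PAR", "Error TC", "Error MOD", "PAR_", "TC_", "MOD_", "EFF_",
    "parse error", "module loading error", "type error",
)


def old_check_pass(returncode: int, err: str) -> bool:
    """Pre-change rule: fail only on a tagged error that is not missing-main."""
    if returncode == 0:
        return True
    tagged = any(tag in err for tag in OLD_TAGS)
    return not (tagged and "missing main" not in err.lower())


def new_check_pass(returncode: int, err: str) -> bool:
    """Post-change rule: every non-zero exit fails except missing-main."""
    if returncode == 0:
        return True
    return "missing main" in err.lower()


# Invented sample messages (assumptions, not real ailang output).
samples = [
    (0, ""),
    (1, "Error TC: cannot unify Int with String"),
    (1, "internal failure: unexpected state"),  # untagged failure
    (1, "Missing main entrypoint"),
]
for code, err in samples:
    print(code, repr(err), old_check_pass(code, err), new_check_pass(code, err))
```

Only the untagged failure changes classification: the old rule let it through as a pass, while the new rule reports it as a check failure, which is the misclassification the commit message describes.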
