Skip to content

Commit 6e7b726

Browse files
Address @aallan's deep-read review on PR #70 (C1-C3 + I1-I3 + I6)
Three Critical bugs and four Important items from the four-agent review at 2026-05-22T20:27Z. Six negotiable items (I4, I5, I7, I8, I9, I10) deferred to follow-up per @aallan's "could be follow-ups" framing. ### C1 — _strip_ailang_main brace-counter bug (priority blocker) Old code: `if "{" in line and "}" in line:` fired on the canonical AILANG main signature `export func main() -> () ! {IO} {` because `{IO}` provides balanced braces; the function then treated it as a single-line block and only skipped the def line, leaving the body as orphan code. Three review agents converged on this. My own xfail(strict=True) test was documenting the bug. New code: drop brace counting entirely. After matching the main def, swallow body lines using indentation + structural rules: - blank lines are part of the body - lines strictly more indented than the def line are the body - a bare `}` (block-close, possibly with trailing `-- comment`) ends the swallow loop - any other line at def-indent ends the swallow loop (preserves comments attached to the next definition) Removed the xfail; replaced with two positive tests (block form + equals form, both with `! {IO}` annotation) plus a preserves-comment-attached-to-next-def edge case test. 12 strip tests pass. ### C2 — AILANG fix-retry dispatch was dead code `build_ailang_fix_prompt` was imported, tested, and exported, but the `language == "ailang"` branch in `run_single_problem`'s retry path was missing — so `--max-fix-attempts > 0` was silently no-op for AILANG, undercounting it vs Aver/Vera by the entire attempt-2 contribution. Added the branch mirroring the Aver retry path. Extended `_is_tooling_error` to also match `"ailang not found"`. Added `TestRunSingleProblemAilang` (I6) with 4 cases pinning the dispatch + retry behavior: - ailang_language_dispatches_to_evaluate - ailang_no_retry_on_tooling_error (FileNotFoundError, max_attempts=2) - ailang_retry_on_check_failure (verifies client.complete called 2x with the fix prompt containing the original error) - ailang_no_retry_when_max_fix_attempts_zero ### C3 — Runtime errors lose all diagnostic info The per-test-case loop in `_evaluate_ailang_code` silently `continue`d on both TimeoutExpired and non-zero returncode. When ALL tests failed at runtime, the row was `check_pass=True, run_correct=False, tests_passed=0, error_message=None` — indistinguishable from "compiled but outputs were wrong". Now capture the first non-zero stderr (or stdout fallback, or explicit "exit N (no output)" marker) into `last_run_error` and attach to `error_message` IF no upstream check error already set it. Truncates to 400 chars to keep JSONL rows readable. Issue #72's full shared-helper refactor will land separately. ### I1 — Subprocess argv/env contract tests Without test pinning, a regression dropping `--quiet` would cause AILANG's standard tracing to escape onto stdout → silent miscount in the line-counting parser. A regression dropping `*_API_KEY` scrubbing could leak credentials into the AILANG subprocess. Added `test_check_subprocess_contract` + `test_run_subprocess_contract` in TestEvaluateAilangCode. Each sets a real `ANTHROPIC_API_KEY` / `OPENAI_API_KEY` in env, runs the function, then asserts: - argv contains the required flags (`--quiet`, `--caps IO`, `--entry main`, `--relax-modules`) - env contains `AILANG_TRACE=off` - env does NOT contain `*_API_KEY` (the scrubbing happened) ### I2 — Regex tag classification for compile vs runtime Old: `any(tag in err for tag in ("Error PAR", "Error TC", "Error MOD"))` — substring match. A future AILANG release adding `Error PARSER_` would silently match `Error PAR` and reclassify; `Error RT_` would silently classify as runtime; a tag rename flips classifications across the suite. New: `re.search(r"\bError ([A-Z]+)_", err)` with a `\b` word boundary plus an explicit `compile_tags = ("PAR", "TC", "MOD", "ELB", "LINK", "TY")` allow-list. New AILANG categories default to runtime (the safer classification) and the allow-list documents what we know. ### I3 — OpenRouter error handling Pre-fix, only `APITimeoutError` was caught; everything else propagated raw → multi-line openai-repr blobs landed in JSONL rows, blamed on the model. Now explicitly handle: - AuthenticationError → EnvironmentError (abort: retrying 60 problems with a bad key is waste) - RateLimitError → RuntimeError with clear "slow the sweep" message - BadRequestError → RuntimeError with "model id wrong or context exceeded" hint - APIStatusError → RuntimeError catch-all for 5xx, with status code - Empty `choices` array → RuntimeError (was returning text="", blamed on model as "did not define entry point") - Empty content (content-filter, tool-call-only) → RuntimeError with finish_reason in message Two existing tests refactored, three new tests added: - empty_choices_raises, empty_content_raises (was 1 graceful test) - authentication_error_aborts, rate_limit_error 23 model tests pass. ### Local verification - All 12 strip tests pass (including the previously-xfailed `{IO}`) - All 14 evaluate tests pass (including 2 new I1 contract tests) - All 4 new TestRunSingleProblemAilang tests pass - All 23 model tests pass (5 new OpenRouter) - All 13 AILANG baseline tests pass - TOTAL: 550 passed, 27 skipped, 3 vera-binary-dependent failures (CI has vera; will pass there) - Coverage: 80.00% (was 79.49%) - ruff check / format --check / S: all clean ### Deferred to follow-up Per @aallan's "could be follow-ups" framing on I4-I10: - I4 (module-synthesis position validation), I5 (_ailang_literal None/dict/tuple), I7 (missing-main substring guard tag), I8 (stdout/test-case line-count mismatch detection), I9 (--relax-modules comment), I10 (numeric rationale comments) Will land in a small follow-up PR. None of these are gating. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent d6769c4 commit 6e7b726

5 files changed

Lines changed: 571 additions & 69 deletions

File tree

tests/test_models.py

Lines changed: 89 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -263,8 +263,11 @@ def test_complete_mock(self, monkeypatch):
263263
assert called_kwargs["messages"][1]["role"] == "user"
264264
assert called_kwargs["messages"][1]["content"] == "user"
265265

266-
def test_complete_empty_response(self, monkeypatch):
267-
"""OpenRouter handles empty/missing content gracefully."""
266+
def test_complete_empty_choices_raises(self, monkeypatch):
267+
"""I3 (PR #70): empty `choices` is API-side failure, not the
268+
model's fault. Must raise rather than return text="" — silent
269+
empty-string return was getting blamed on the model as "did not
270+
define entry point" in JSONL output."""
268271
try:
269272
import openai # noqa: F401
270273

@@ -276,19 +279,46 @@ def test_complete_empty_response(self, monkeypatch):
276279
client = OpenRouterClient("or/test/model")
277280

278281
mock_resp = MagicMock()
279-
mock_resp.choices = [] # empty choices
280-
mock_resp.usage = None # missing usage
282+
mock_resp.choices = []
283+
mock_resp.usage = None
281284
mock_resp.model = "test/model"
282285

283286
mock_inner = MagicMock()
284287
mock_inner.chat.completions.create.return_value = mock_resp
285288
client._client = MagicMock()
286289
client._client.with_options.return_value = mock_inner
287290

288-
result = client.complete("sys", "user")
289-
assert result.text == ""
290-
assert result.input_tokens == 0
291-
assert result.output_tokens == 0
291+
with pytest.raises(RuntimeError, match="returned no choices"):
292+
client.complete("sys", "user")
293+
294+
def test_complete_empty_content_raises(self, monkeypatch):
295+
"""I3: choice exists but content is None (content-filter,
296+
tool-call-only, etc.) — surface as RuntimeError with finish_reason
297+
rather than silently text=""."""
298+
try:
299+
import openai # noqa: F401
300+
301+
from vera_bench.models import OpenRouterClient
302+
except ImportError:
303+
pytest.skip("openai not installed")
304+
305+
monkeypatch.setenv("OPENROUTER_API_KEY", "test-key")
306+
client = OpenRouterClient("or/test/model")
307+
308+
mock_choice = MagicMock()
309+
mock_choice.message.content = None
310+
mock_choice.finish_reason = "content_filter"
311+
mock_resp = MagicMock()
312+
mock_resp.choices = [mock_choice]
313+
mock_resp.model = "test/model"
314+
315+
mock_inner = MagicMock()
316+
mock_inner.chat.completions.create.return_value = mock_resp
317+
client._client = MagicMock()
318+
client._client.with_options.return_value = mock_inner
319+
320+
with pytest.raises(RuntimeError, match="empty content.*content_filter"):
321+
client.complete("sys", "user")
292322

293323
def test_complete_api_timeout(self, monkeypatch):
294324
"""OpenRouter timeout propagates as TimeoutError."""
@@ -311,3 +341,54 @@ def test_complete_api_timeout(self, monkeypatch):
311341

312342
with pytest.raises(TimeoutError, match="OpenRouter API timed out"):
313343
client.complete("sys", "user")
344+
345+
def test_complete_authentication_error_aborts(self, monkeypatch):
346+
"""I3: AuthenticationError must raise EnvironmentError so the
347+
caller aborts the run — retrying 60 problems with a bad key is
348+
pure token waste."""
349+
try:
350+
import openai
351+
352+
from vera_bench.models import OpenRouterClient
353+
except ImportError:
354+
pytest.skip("openai not installed")
355+
356+
monkeypatch.setenv("OPENROUTER_API_KEY", "test-key")
357+
client = OpenRouterClient("or/test/model")
358+
359+
mock_inner = MagicMock()
360+
mock_inner.chat.completions.create.side_effect = openai.AuthenticationError(
361+
message="Invalid API key",
362+
response=MagicMock(),
363+
body=None,
364+
)
365+
client._client = MagicMock()
366+
client._client.with_options.return_value = mock_inner
367+
368+
with pytest.raises(EnvironmentError, match="OpenRouter authentication"):
369+
client.complete("sys", "user")
370+
371+
def test_complete_rate_limit_error(self, monkeypatch):
372+
"""I3: RateLimitError surfaces with a clear "slow the sweep"
373+
message rather than the raw openai repr."""
374+
try:
375+
import openai
376+
377+
from vera_bench.models import OpenRouterClient
378+
except ImportError:
379+
pytest.skip("openai not installed")
380+
381+
monkeypatch.setenv("OPENROUTER_API_KEY", "test-key")
382+
client = OpenRouterClient("or/test/model")
383+
384+
mock_inner = MagicMock()
385+
mock_inner.chat.completions.create.side_effect = openai.RateLimitError(
386+
message="Rate limit hit",
387+
response=MagicMock(),
388+
body=None,
389+
)
390+
client._client = MagicMock()
391+
client._client.with_options.return_value = mock_inner
392+
393+
with pytest.raises(RuntimeError, match="rate-limited"):
394+
client.complete("sys", "user")

0 commit comments

Comments
 (0)