Commit 6e7b726
Three Critical bugs and four Important items from the four-agent
review at 2026-05-22T20:27Z. Six negotiable items (I4, I5, I7, I8,
I9, I10) deferred to follow-up per @aallan's "could be follow-ups"
framing.
### C1 — _strip_ailang_main brace-counter bug (priority blocker)
Old code: `if "{" in line and "}" in line:` fired on the canonical
AILANG main signature `export func main() -> () ! {IO} {` because
`{IO}` provides balanced braces; the function then treated it as a
single-line block and only skipped the def line, leaving the body
as orphan code. Three review agents converged on this. My own
xfail(strict=True) test was documenting the bug.
New code: drop brace counting entirely. After matching the main
def, swallow body lines using indentation + structural rules:
- blank lines are part of the body
- lines strictly more indented than the def line are the body
- a bare `}` (block-close, possibly with trailing `-- comment`)
ends the swallow loop
- any other line at def-indent ends the swallow loop (preserves
comments attached to the next definition)
Removed the xfail; replaced with two positive tests (block form +
equals form, both with `! {IO}` annotation) plus a
preserves-comment-attached-to-next-def edge case test.
12 strip tests pass.
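The swallow rules above can be sketched roughly as follows. This is a minimal illustration, not the code in this commit: `strip_main`, `_MAIN_DEF`, and `_BARE_CLOSE` are hypothetical names, and the regex for the main definition is an assumption based only on the signature quoted above.

```python
import re

# Assumed shape of the main definition; the real matcher may differ.
_MAIN_DEF = re.compile(r"^(\s*)(export\s+)?func\s+main\s*\(")
# A bare closing brace, optionally followed by a `-- comment`.
_BARE_CLOSE = re.compile(r"^\s*\}\s*(--.*)?$")


def strip_main(source: str) -> str:
    """Drop the `main` definition and its body without counting braces."""
    lines = source.splitlines()
    out = []
    i = 0
    while i < len(lines):
        match = _MAIN_DEF.match(lines[i])
        if match is None:
            out.append(lines[i])
            i += 1
            continue
        def_indent = len(match.group(1))
        i += 1  # skip the def line itself
        while i < len(lines):
            line = lines[i]
            indent = len(line) - len(line.lstrip())
            if not line.strip():
                i += 1  # blank lines are part of the body
            elif indent > def_indent:
                i += 1  # strictly more indented: body
            elif _BARE_CLOSE.match(line):
                i += 1  # block close: swallow it and stop
                break
            else:
                break  # other line at def indent: keep it for the next def
    return "\n".join(out)
```

Because the `{IO}` effect annotation is never inspected, the balanced braces on the def line can no longer be mistaken for a single-line block.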
### C2 — AILANG fix-retry dispatch was dead code
`build_ailang_fix_prompt` was imported, tested, and exported, but
the `language == "ailang"` branch in `run_single_problem`'s retry
path was missing, so `--max-fix-attempts > 0` was silently a no-op
for AILANG, undercounting it vs Aver/Vera by the entire attempt-2
contribution.
Added the branch mirroring the Aver retry path. Extended
`_is_tooling_error` to also match `"ailang not found"`. Added
`TestRunSingleProblemAilang` (I6) with 4 cases pinning the
dispatch + retry behavior:
- ailang_language_dispatches_to_evaluate
- ailang_no_retry_on_tooling_error (FileNotFoundError, max_attempts=2)
- ailang_retry_on_check_failure (verifies client.complete called 2x
with the fix prompt containing the original error)
- ailang_no_retry_when_max_fix_attempts_zero
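The behavior pinned by these four cases can be sketched as a small retry loop. This is an illustration under assumptions, not the body of `run_single_problem`: `evaluate_with_retry`, the dict-shaped result, and the callables passed in are all hypothetical stand-ins.

```python
def is_tooling_error(message):
    """True when the failure comes from the toolchain, not the model."""
    return message is not None and "ailang not found" in message.lower()


def evaluate_with_retry(code, evaluate, complete, build_fix_prompt, max_fix_attempts):
    """Evaluate once, then ask the model for fixes while retries remain."""
    result = evaluate(code)
    attempts = 0
    while (
        attempts < max_fix_attempts
        and not result["check_pass"]
        and not is_tooling_error(result["error_message"])
    ):
        # The fix prompt carries the original error back to the model.
        prompt = build_fix_prompt(code, result["error_message"])
        code = complete(prompt)
        result = evaluate(code)
        attempts += 1
    return result, attempts
```

A tooling error short-circuits the loop because no amount of model retries can install a missing binary.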
### C3 — Runtime errors lose all diagnostic info
The per-test-case loop in `_evaluate_ailang_code` silently
`continue`d on both TimeoutExpired and non-zero returncode. When
ALL tests failed at runtime, the row was `check_pass=True,
run_correct=False, tests_passed=0, error_message=None` —
indistinguishable from "compiled but outputs were wrong".
Now capture stderr from the first non-zero exit (or stdout fallback,
or explicit "exit N (no output)" marker) into `last_run_error` and
attach it to `error_message` only if no upstream check error already
set it. The message is truncated to 400 chars to keep JSONL rows
readable. Issue #72's full shared-helper refactor will land separately.
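A minimal sketch of the capture logic, assuming each test case yields a `subprocess.CompletedProcess` with text output. The helper names (`describe_failure`, `first_run_error`, `merge_error`) are hypothetical, and timeout handling is left out.

```python
import subprocess

MAX_ERROR_CHARS = 400  # keeps JSONL rows readable


def describe_failure(proc):
    """Prefer stderr, fall back to stdout, then to an explicit marker."""
    text = proc.stderr.strip() or proc.stdout.strip()
    if not text:
        text = f"exit {proc.returncode} (no output)"
    return text[:MAX_ERROR_CHARS]


def first_run_error(results):
    """Return the description of the first non-zero exit, or None."""
    for proc in results:
        if proc.returncode != 0:
            return describe_failure(proc)
    return None


def merge_error(check_error, run_error):
    """An upstream check error wins; the run error only fills a gap."""
    return check_error if check_error is not None else run_error
```

With this in place, a row where every test crashed carries a message, so it no longer looks identical to a row whose outputs were merely wrong.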
### I1 — Subprocess argv/env contract tests
Without test pinning, a regression dropping `--quiet` would cause
AILANG's standard tracing to escape onto stdout → silent miscount
in the line-counting parser. A regression dropping `*_API_KEY`
scrubbing could leak credentials into the AILANG subprocess.
Added `test_check_subprocess_contract` + `test_run_subprocess_contract`
in TestEvaluateAilangCode. Each sets a real `ANTHROPIC_API_KEY` /
`OPENAI_API_KEY` in env, runs the function, then asserts:
- argv contains the required flags (`--quiet`, `--caps IO`,
`--entry main`, `--relax-modules`)
- env contains `AILANG_TRACE=off`
- env does NOT contain `*_API_KEY` (the scrubbing happened)
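One way to pin such a contract is to patch `subprocess.run` and inspect the recorded call. The sketch below is illustrative only: `run_ailang` is a hypothetical stand-in for the function under test, and the `ailang run` subcommand and the `prog.ail` filename are assumptions; only the flags and the env expectations come from the list above.

```python
import os
import subprocess
from unittest import mock


def run_ailang(path):
    """Hypothetical runner: scrub API keys, silence tracing, invoke AILANG."""
    env = {k: v for k, v in os.environ.items() if not k.endswith("_API_KEY")}
    env["AILANG_TRACE"] = "off"
    argv = ["ailang", "run", "--quiet", "--caps", "IO",
            "--entry", "main", "--relax-modules", path]
    return subprocess.run(argv, env=env, capture_output=True, text=True)


def recorded_contract():
    """Run with real-looking keys in the env and return what was passed on."""
    fake = subprocess.CompletedProcess(args=[], returncode=0, stdout="", stderr="")
    keys = {"ANTHROPIC_API_KEY": "sk-test", "OPENAI_API_KEY": "sk-test"}
    with mock.patch.dict(os.environ, keys), \
            mock.patch("subprocess.run", return_value=fake) as run:
        run_ailang("prog.ail")
    return run.call_args.args[0], run.call_args.kwargs["env"]


argv, env = recorded_contract()
assert "--quiet" in argv and "--relax-modules" in argv
assert argv[argv.index("--caps") + 1] == "IO"
assert argv[argv.index("--entry") + 1] == "main"
assert env["AILANG_TRACE"] == "off"
assert not any(name.endswith("_API_KEY") for name in env)
```

Setting the keys inside the test matters: without them in the parent env, the scrubbing assertion would pass even if the scrubbing code were deleted.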
### I2 — Regex tag classification for compile vs runtime
Old: `any(tag in err for tag in ("Error PAR", "Error TC", "Error MOD"))`
— substring match. A future AILANG release adding `Error PARSER_`
would silently match `Error PAR` and reclassify; `Error RT_` would
silently classify as runtime; a tag rename flips classifications
across the suite.
New: `re.search(r"\bError ([A-Z]+)_", err)` with a `\b` word boundary
plus an explicit `compile_tags = ("PAR", "TC", "MOD", "ELB", "LINK",
"TY")` allow-list. New AILANG categories default to runtime (the
safer classification) and the allow-list documents what we know.
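The classification can be sketched as below. The regex and the allow-list are taken from the description above; the function name `classify_error` and the sample error strings are hypothetical.

```python
import re

COMPILE_TAGS = ("PAR", "TC", "MOD", "ELB", "LINK", "TY")
_TAG_RE = re.compile(r"\bError ([A-Z]+)_")


def classify_error(err):
    """Return "compile" for allow-listed tags, "runtime" for everything else."""
    match = _TAG_RE.search(err)
    if match and match.group(1) in COMPILE_TAGS:
        return "compile"
    return "runtime"


# Sample messages are made up for illustration.
assert classify_error("Error PAR_001: unexpected token") == "compile"
assert classify_error("Error PARSER_001: new category") == "runtime"
assert classify_error("something went wrong") == "runtime"
```

Because the captured tag is compared whole against the allow-list, a longer tag that merely starts with `PAR` no longer matches.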
### I3 — OpenRouter error handling
Pre-fix, only `APITimeoutError` was caught; everything else
propagated raw → multi-line openai-repr blobs landed in JSONL
rows, blamed on the model.
Now explicitly handle:
- AuthenticationError → EnvironmentError (abort: retrying 60
problems with a bad key is waste)
- RateLimitError → RuntimeError with clear "slow the sweep" message
- BadRequestError → RuntimeError with "model id wrong or context
exceeded" hint
- APIStatusError → RuntimeError catch-all for 5xx, with status code
- Empty `choices` array → RuntimeError (was returning text="",
blamed on model as "did not define entry point")
- Empty content (content-filter, tool-call-only) → RuntimeError
with finish_reason in message
Two existing tests refactored, three new tests added:
- empty_choices_raises, empty_content_raises (was 1 graceful test)
- authentication_error_aborts, rate_limit_error
23 model tests pass.
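The two response-shape checks (empty `choices`, empty content) can be sketched without the client library. This is an illustration, not the commit's code: `extract_text` is a hypothetical name, and the response is assumed to follow the chat-completion shape (`choices[0].message.content`, `choices[0].finish_reason`). The exception mapping for the API errors is not shown.

```python
from types import SimpleNamespace


def extract_text(response):
    """Return the completion text, or raise instead of blaming the model."""
    if not response.choices:
        raise RuntimeError("OpenRouter returned an empty choices array")
    choice = response.choices[0]
    content = choice.message.content
    if not content:
        raise RuntimeError(
            f"OpenRouter returned empty content (finish_reason={choice.finish_reason!r})"
        )
    return content


def fake_response(content, finish_reason="stop"):
    """Build a stand-in object with the assumed response shape."""
    message = SimpleNamespace(content=content)
    return SimpleNamespace(
        choices=[SimpleNamespace(message=message, finish_reason=finish_reason)]
    )
```

Raising here keeps an infrastructure failure from being recorded as an empty program that "did not define entry point".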
### Local verification
- All 12 strip tests pass (including the previously-xfailed `{IO}`)
- All 14 evaluate tests pass (including 2 new I1 contract tests)
- All 4 new TestRunSingleProblemAilang tests pass
- All 23 model tests pass (5 new OpenRouter)
- All 13 AILANG baseline tests pass
- TOTAL: 550 passed, 27 skipped, 3 vera-binary-dependent failures
(CI has vera; will pass there)
- Coverage: 80.00% (was 79.49%)
- ruff check / format --check / S: all clean
### Deferred to follow-up
Per @aallan's "could be follow-ups" framing on I4-I10 (I6 is covered under C2 above):
- I4 (module-synthesis position validation), I5 (_ailang_literal
None/dict/tuple), I7 (missing-main substring guard tag),
I8 (stdout/test-case line-count mismatch detection),
I9 (--relax-modules comment), I10 (numeric rationale comments)
Will land in a small follow-up PR. None of these are gating.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent d6769c4 commit 6e7b726
5 files changed
Lines changed: 571 additions & 69 deletions