Add --parallel N flag to vera-bench run#73
Conversation
Run N problems concurrently via ThreadPoolExecutor. Each worker is I/O-bound on its LLM HTTP call + subprocess-based check/run, so the GIL is not a bottleneck. Use case: slow models like Kimi K2.5 averaged 49s/problem sequentially across the 60-problem AILANG sweep (~50 min total). With --parallel 10 the same sweep should drop to ~5 min, which makes release-time re-evals practical. Implementation: - ThreadPoolExecutor with max_workers=parallel - Per-problem futures collected via as_completed - threading.Lock around the JSONL append so concurrent writes don't interleave. Lines are still self-contained (carry problem_id) so completion-order writes are fine. - Workers share the same work_dir; per-problem temp files are uniquified by problem_id (existing behavior). - Exception per worker is caught and logged; the sweep continues. Default parallel=1 preserves the existing sequential path with no behavior change. Smoke-tested with claude-haiku-4-5 --tier 1 --parallel 4: 10/10 problems, no duplicates, 100%/100% run_correct. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds TestRunBenchmarkParallel covering the ThreadPoolExecutor path: - test_parallel_one_uses_sequential_path: parallel=1 (default) does NOT touch ThreadPoolExecutor at all (patched to raise on use) - test_parallel_two_runs_all_problems: every problem completes, every result is collected (order may differ — completion order) - test_parallel_worker_exception_continues: one worker raising doesn't abort the sweep; sibling problems still complete - test_parallel_writes_are_serialised: 20 problems × 8 workers, every JSONL line is parseable JSON (no torn writes from the write_lock failing to serialise) - test_parallel_no_output_path_still_collects_results: skipping the write block is a valid code path - test_run_command_accepts_parallel_flag: Click accepts --parallel N - test_run_command_parallel_default_is_one: help text confirms default All 7 pass; local ruff check / format --check / S all clean. Coverage on vera_bench/runner.py: 83% locally (CI lifts further when vera-binary-dependent paths are reachable). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Warning Review limit reached
Your plan currently allows 1 review/hour. Refill in 52 minutes and 42 seconds. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more review capacity refills, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than trial, open-source, and free plans. In all cases, review capacity refills continuously over time. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThis PR adds configurable parallel execution to the ChangesParallel problem execution
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Items 2, 3, 4 from @aallan's consolidated review on PR aallan#70. (Item 1 — extracting --parallel N into its own PR — addressed via PR aallan#73.) ### Item 2: README headline section -> single sentence in §Overview Removed the "AILANG: AI-designed language..." headline section (13 lines: the heading, the description paragraphs, the per-mode results table, the "full-circle finding" paragraph). The phrasing included editorial claims about VeraBench's identity that should be a project-owner call, and "added in this fork" wouldn't read correctly post-merge. Replaced the §Overview line about baselines with the form @aallan suggested verbatim: The same problems are also run in Python, TypeScript, [Aver](https://github.com/jasisz/aver), and [AILANG](https://ailang.sunholo.com/) as baselines. AILANG and Aver are zero-training-data languages, providing additional data points alongside Vera for the language-design-vs-training-data thesis. Matches the existing Aver pattern: light-touch mention without results writeups in the README. ### Item 3: Delete AILANG_MAPPING.md and AILANG_RESULTS.md Neither file is load-bearing — no code or tests reference them. Aver landed across PRs aallan#57 / aallan#62 / aallan#65 without AVER_RESULTS.md or AVER_MAPPING.md. Numbers and writeups go in PR descriptions and external content; in-repo docs are reserved for things future maintainers need. ### Item 4: .coderabbit.yaml path_filters Added the two missing AILANG entries to mirror the existing {python, typescript, aver} pattern: - "!**/*.ail" (alongside !**/*.vera, !**/*.av) - "!solutions/ailang/**" (alongside the other solutions/* entries) This stops CodeRabbit from generating speculative findings on .ail solution files in future review passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@coderabbitai review (Re-triggering — the initial review at 13:32Z was rate-limited.) |
🧩 Analysis chain🏁 Script executed: Length of output: 15126 🏁 Script executed: Length of output: 18516 🏁 Script executed: Length of output: 26437
Review🟡 Conceptual:
|
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
vera_bench/cli.py (1)
90-101:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAdd an explicit
-> Nonereturn type torun
vera_bench/cli.py: run(...)currently has no return annotation even though it only performs side effects (prints/writes results) and doesn’t return a value. Update the signature to:def run( model: str, tier: int | None, problem: str | None, language: str, mode: str, skill_md: Path | None, output_dir: Path | None, max_tokens: int, keep_temps: bool, parallel: int, ) -> None:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@vera_bench/cli.py` around lines 90 - 101, The run function lacks an explicit return annotation; update the function signature for run(...) to declare it returns None by adding -> None to the definition so its signature becomes def run(... ) -> None:, ensuring the type hint reflects that it only performs side effects.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/test_runner.py`:
- Around line 1711-1713: The test in tests/test_runner.py is brittle because it
checks for the substring "invalid" in result.output; replace that substring
check with Click usage-error semantics by asserting on result.exit_code (use
result.exit_code == 2 for a parse/usage failure, or result.exit_code != 2 for a
successful parse) and remove the "invalid" substring assertion that can
false-fail; locate the assertion lines that reference result.output and the
variable result to make this change.
In `@vera_bench/cli.py`:
- Around line 79-83: Update the Click option for "--parallel" so values <1 are
rejected at parse time (replace type=int with click.IntRange(min=1) or add an
explicit callback validator) to prevent 0/negative values from being passed to
run_benchmark; also add the missing return type annotation "-> None" on the CLI
entry function run(...) to follow the repo's type-hinting guideline and clarify
it returns nothing. Ensure the call site that passes parallel to run_benchmark
remains unchanged (run_benchmark(parallel=parallel)).
In `@vera_bench/runner.py`:
- Around line 915-919: The redundant threading.Lock instance named write_lock
should be removed from the parallel path: delete the write_lock creation and any
acquisitions/releases around JSONL writes (references: write_lock), since JSONL
writes are already serialized by the main-thread as_completed loop (references:
as_completed, ThreadPoolExecutor) and the lock provides no additional
protection; also remove the now-unused threading import if it's only present for
this lock so the imports remain correct.
---
Outside diff comments:
In `@vera_bench/cli.py`:
- Around line 90-101: The run function lacks an explicit return annotation;
update the function signature for run(...) to declare it returns None by adding
-> None to the definition so its signature becomes def run(... ) -> None:,
ensuring the type hint reflects that it only performs side effects.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 57028d49-ac03-482c-9980-64392affeb86
📒 Files selected for processing (3)
tests/test_runner.pyvera_bench/cli.pyvera_bench/runner.py
…test brittleness) ### cli.py:83 — IntRange(min=1) for --parallel `type=int` silently accepted 0 and negative values; `run_benchmark` then treated `parallel <= 1` as sequential, masking the bug. Switched to `click.IntRange(min=1)` so 0/negative fail at parse time with Click's standard usage error (exit_code=2). Skipped CR's suggested `-> None` annotation on `def run(...)` — per @aallan's prior comment on PR aallan#70 (commit aa13f25's description), "missing `-> None` applies to ALL Click handlers in cli.py — pre- existing project-wide consistency issue, not specific to this PR". Annotating just `run` would break that consistency; out of scope. ### runner.py:919 — remove redundant write_lock CR correctly observed that JSONL writes are already serialised by the main-thread `for fut in as_completed(...)` loop. Workers only run `_run_one` (LLM/subprocess work) and never touch `output_path`, so `threading.Lock()` was protection without need. Removed the lock, the `import threading`, and the lock acquisition. Added a comment explaining where serialisation actually comes from so a future reader doesn't re-add the lock thinking it was load-bearing. Updated the docstring on `test_parallel_writes_are_serialised` to credit the loop structure (not the lock) as the serialisation mechanism — the property holds whether the lock is there or not, because workers never write. ### tests/test_runner.py:1713 — exit_code != 2 over substring check Replaced the brittle `"invalid" not in result.output` substring check with `result.exit_code != 2`. Click's parse/usage errors return exit_code 2 cleanly; substring checks could false-fail on unrelated runtime output (e.g. an API-key error message containing the word "invalid"). Also added two new tests pinning the new IntRange behaviour: - test_run_command_rejects_zero_parallel: --parallel 0 -> exit 2 - test_run_command_rejects_negative_parallel: --parallel -5 -> exit 2 All 9 TestRunBenchmarkParallel cases pass locally; ruff check / format --check / S all clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #73 +/- ##
==========================================
+ Coverage 83.65% 84.49% +0.84%
==========================================
Files 10 10
Lines 1395 1432 +37
==========================================
+ Hits 1167 1210 +43
+ Misses 228 222 -6
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
aallan
left a comment
There was a problem hiding this comment.
Detailed review — --parallel N
Ran a four-agent parallel review (code-reviewer, pr-test-analyzer, silent-failure-hunter, comment-analyzer). Findings are deduped; convergence noted where multiple agents independently agreed.
TL;DR
No critical merge-blockers — threading model is sound, JSONL serialisation is correct, CI clean across all 9 checks, all three CodeRabbit inline findings substantively addressed in bfbfae2. But the four agents converged on five Important issues worth a second iteration before merge, plus five negotiable suggestions. The standout is I1 (worker crashes vanish from JSONL) — flagged by silent-failure-hunter and arguably the closest thing to a blocker, since it makes parallel sweeps silently under-report scope to downstream vera-bench report.
Important Issues (5)
I1 — Worker crashes vanish from JSONL ★ priority
vera_bench/runner.py:945-950. When fut.result() raises, the handler logs [red]Worker failed on {pid}: {exc}[/red] to stdout and continues — no ProblemResult is written to output_path. Downstream consequences:
- A 60-problem sweep with 2 worker crashes produces 58 JSONL lines.
vera-bench reportthen computes pass/fail metrics over those 58 lines and reports "58/58 succeeded (100%)" — the 2 crashes are invisible to anyone reading results files later.- The sequential path doesn't have this defect because crashes abort the whole sweep loudly with a traceback.
For a benchmark tool, silently changing the denominator on operator error is arguably worse than diagnostic loss. Fix:
except Exception as exc: # noqa: BLE001
pid = futures[fut].get("id", "?")
tb = traceback.format_exc()
console.print(f"[red]Worker failed on {pid}: {exc!r}[/red]")
console.print(f"[dim]{tb}[/dim]")
crash_result = ProblemResult(
problem_id=pid,
model=getattr(client, "model", "unknown"),
language=language,
attempt=0,
check_pass=False,
run_correct=False,
error_message=f"Worker crash: {exc!r}\n{tb}",
timestamp=datetime.now(timezone.utc).isoformat(),
# ... other required fields
)
all_results.append(crash_result)
if output_path:
with open(output_path, "a", encoding="utf-8") as f:
f.write(crash_result.to_jsonl() + "\n")
progress.advance(task)
continueThis single change addresses I1 and I3 (traceback) at once.
I2 — Sequential/parallel error-handling asymmetry
Two agents converged. runner.py:885-906 (sequential, no try/except) vs runner.py:945-952 (parallel, swallows Exception). vera-bench run --parallel 1 and vera-bench run --parallel 2 have different fault semantics on the same input — a transient NoneType from a bad model response kills the whole sweep at --parallel 1 but is logged-and-continued at --parallel 2.
Either align them (recommended — wrap the sequential body in the same try/except so a 4-hour sweep survives problem 47 of 60 regardless of parallelism), or document the asymmetry in the run_benchmark docstring alongside the existing "JSONL output ordering" note.
I3 — Lost traceback in worker exception handler
Same diagnostic-loss pattern as C3 on PR #70. runner.py:947 reports str(exc) only. On a 1-hour parallel sweep with hundreds of problems, debugging Worker failed on VB-T3-007: 'NoneType' object has no attribute 'foo' with no file/line is painful.
Subsumed by the I1 fix above (capture traceback.format_exc() into the JSONL row's error_message).
I4 — progress.advance(task) on exception path is untested
tests/test_runner.py:1591-1620 (test_parallel_worker_exception_continues) verifies the loop continues (3 of 4 results return), but does not pin that the progress bar advanced on the crashed problem. The current code does call advance before continue at runner.py:951, but a refactor moving it into an else: branch would silently regress the bar to hang at N-1/N with no test failure.
Fix: inject/patch progress (or pass a fake) and assert advance.call_count == len(problems) even with a worker exception.
I5 — Version propagation through _run_one closure is untested
runner.py:922-936. bench_version and vera_version are captured by closure into _run_one. If a future refactor drops them from the kwargs forwarded to run_single_problem, JSONL would silently get empty version strings. The existing mock at test_runner.py:1546 uses lambda problem, **kw: ... which swallows the kwargs without inspection.
Fix: one test that calls with bench_version="0.1.0", vera_version="0.0.103" and asserts the mock saw them.
Suggestions (5, negotiable)
- S1 — Output-write block has no error handling.
runner.py:949— disk full or NFS hiccup mid-sweep raisesOSErroron the main thread, killing the whole sweep. Pre-existing in sequential too, but parallel makes the blast radius bigger (more in-flight uncommitted results). Consider try/except around the write with a fallback log. - S2 — Patch-target fragility.
tests/test_runner.py:1538-1541patchesconcurrent.futures.ThreadPoolExecutor. Works because the import is lazy in the parallel branch, but would silently fail if someone moves the import to module scope. Robustness alternative: assertthreading.current_thread() is threading.main_thread()inside_run_one— asserts behaviour (no thread spawn) rather than implementation (no class lookup). - S3 — Anecdotal Kimi K2.5 performance figures.
runner.py:877-879andcli.py:85cite "~50s → ~5s" with no committed benchmark, and the model name will age out. Either commit a measurement or soften to generic "slow models". - S4 — Misleading POSIX-atomicity claim in test docstring.
tests/test_runner.py:1623's "Python's GIL doesn't make file writes atomic — partial writes are observable" is wrong for the actual case here: short writes (<PIPE_BUF ~4096B) withO_APPENDare atomic on POSIX, and the production code serialises writes on the main thread anyway. The test still verifies useful properties; the docstring oversells. - S5 — 20×8 stress level is overkill post-lock-removal. With writes now serialised on main thread, the test only catches the regression of moving writes back into workers — a single race at much lower scale would catch it. Lower N if test suite gets slow.
Strengths
Substantial — calling out specifically because four agents converged on these:
runner.py:919-922"no write lock needed" comment — load-bearing and exemplary (called out by three agents). The kind of comment that survives refactors and actively prevents a future contributor from re-introducing a redundant lock thinking it was protective. Best comment in the PR.cli.py:79-89click.IntRange(min=1)withshow_default=True— closes the silent-fall-through-to-sequential bug class that plaintype=intwould have allowed. Right pin at the parse boundary.tests/test_runner.py:1518-1564test_parallel_one_uses_sequential_path— clever negative-space test. PatchingThreadPoolExecutorwithside_effect=AssertionError("must not be used in sequential path")proves the absence of an import path, which is genuinely harder than asserting presence.runner.py:946# noqa: BLE001correctly catchesExceptionnotBaseException— Ctrl-C /SystemExitstill propagate and cancel the sweep, which is exactly what you want. Two agents called this out as the right narrow choice.- Set-equality assertions at
tests/test_runner.py:1581,1611correctly accommodate completion-order nondeterminism. Avoids the trap of list-equality assertions that would be flaky under parallel scheduling. - CLI tests assert
exit_code != 2rather than message-string matching — resilient to Click phrasing changes, the right semantic boundary for parse-time validation. - Inline closure
_run_oneatrunner.py:922-936is the right abstraction level — captures 9 invariants locally without polluting module namespace. A module-level_dispatch_onewould add boilerplate for no real win. run_benchmarkdocstring atrunner.py:872-883honestly documents completion-order JSONL ordering with mitigation (problem_idis self-contained, consumers can sort). Setting expectations correctly upfront.- Review-iteration discipline — applied 3/3 CodeRabbit inline findings verbatim with extra coverage (the two new
IntRangepin tests went beyond what CR asked for), and declined the-> Noneannotation with documented rationale rather than silently ignoring. Exactly the response pattern that makes review productive.
Recommended action
This is close to merge-ready. I'd ask for one more iteration on I1-I5:
- I1 + I3 in one change — synthesise a
ProblemResulton worker crash withtraceback.format_exc()inerror_message. The patch sketch above is ~10 lines and lands both fixes. - I2 — wrap the sequential body in the same try/except so semantics match, OR document the asymmetry. Either is defensible.
- I4 + I5 — small test additions in the existing
TestRunBenchmarkParallelclass. - S1-S5 are negotiable — your call. None are blocking.
After those land I'd approve and merge. Thanks for the careful work on this — the CR-iteration discipline and the load-bearing concurrency comment are both first-rate.
— Reviewed with a four-agent parallel pass (general code review, test coverage, silent-failure hunting, comment accuracy). Convergence noted above where multiple agents flagged the same issue.
Five Important issues and two of five suggestions from the 2026-05-22T20:44 CHANGES_REQUESTED review. ### I1 + I3 — Worker crashes vanish from JSONL (priority blocker) Before this change, a worker exception was logged to stdout and the loop `continue`d — no `ProblemResult` was written. A 60-problem sweep with 2 crashes produced 58 JSONL rows; downstream `vera-bench report` then showed "58/58 (100%)", silently shrinking the denominator. New `_crash_result(problem, exc, tb)` helper synthesises a `ProblemResult` with `check_pass=False`, `run_correct=False`, and the full `traceback.format_exc()` embedded in `error_message`. Wired into both sequential and parallel paths via the new `_record` helper so successes and crashes hit the same persistence machinery. ### I2 — Sequential / parallel error-handling asymmetry Pre-fix: `--parallel 1` aborted on any worker exception, `--parallel 2+` logged-and-continued. A transient bad model response would kill a 4-hour sweep on the sequential path but not the parallel one. Now both paths wrap `run_single_problem` in the same `try/except` and route crashes through `_crash_result` + `_record`. Same fault semantics regardless of N. ### I4 — `progress.advance(task)` on exception path is now tested `test_progress_advances_on_crash_path` patches `Progress` and asserts `advance.call_count == len(problems)` even when one problem raises, in both the sequential and parallel paths. A refactor that moved `advance` into an `else:` branch would now fail this test cleanly. ### I5 — Version propagation through `_run_one` closure is now tested `test_bench_and_vera_version_propagate_to_workers` captures the kwargs `run_single_problem` actually receives under `parallel=3` and asserts both `bench_version` and `vera_version` came through. Catches a future refactor that drops them from the kwargs forwarded through the closure. ### S2 — Replace ThreadPoolExecutor patch with thread-identity test `test_parallel_one_uses_sequential_path` now asserts behavior (every call ran on `threading.main_thread()`) instead of patching `concurrent.futures.ThreadPoolExecutor`. The test is robust to a future refactor hoisting the import to module scope. Added a counterpoint test (`test_parallel_two_actually_spawns_worker_threads`) that confirms `parallel>1` does spawn workers. ### S4 — Fix incorrect POSIX-atomicity claim in test docstring The old docstring on `test_parallel_writes_are_serialised` said "Python's GIL doesn't make file writes atomic — partial writes are observable", which was wrong: short writes (< PIPE_BUF ~4096B) with O_APPEND ARE atomic on POSIX. Replaced with an honest explanation that the test proves serialisation comes from the main-thread `as_completed` loop (not the lock that no longer exists, and not POSIX guarantees we don't depend on). ### Updated existing test for new behavior `test_parallel_worker_exception_continues` previously asserted `len(results) == 3` (the crashed problem vanished). Now asserts `len(results) == 4` (success rows + crash row) and verifies the crash row carries `Worker crash:`, the original exception's repr, and a traceback in `error_message`. Added a parallel test for the sequential path's crash semantics. ### Deferred (negotiable suggestions) - **S1** (no error handling on output write): file-write failures on the main thread still abort the sweep. Deferred — pre-existing on the sequential path too, and a sensible operator response (resume from JSONL) doesn't exist yet. - **S3** (Kimi K2.5 anecdotal figures): kept as-is; they're motivating context, not a load-bearing claim. - **S5** (20×8 stress overkill): kept — test runtime is sub-second and the larger scale catches more refactor failures. All 13 TestRunBenchmarkParallel cases pass; ruff check / format --check / S all clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@aallan — addressed in fa8d1ce. Going item-by-item against your four-agent review: Important (all in)
Suggestions
Updated existing test for new behavior
13 Thanks again for the substantive review — the I1 finding (silent denominator change) was a real one I would not have caught, and the four-agent convergence on the strengths section was good signal too. |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/test_runner.py`:
- Around line 1558-1560: The local helper _record_thread should be annotated
with precise Python 3.11+ type hints: add parameter and return types (e.g.,
typing.Callable-compatible signature) so the function signature explicitly types
`problem: dict` (or a more specific TypedDict if available) and `**kw: Any`, and
the return type as list of the test result type (e.g., list[ResultType] or
list[dict] if ResultType isn't defined); update the other similar helpers at the
indicated locations (around lines 1599-1601, 1660-1663, 1708-1711, 1812-1815,
1868-1870) with matching annotations, importing Any and other typing names as
needed to satisfy repository typing rules.
- Line 1688: Replace the substring-based selection for crash_row with JSON-first
selection: parse all log lines (e.g., records = [json.loads(ln) for ln in
lines]) and find the record whose "problem_id" equals the project's worker-crash
problem id (use the actual constant/name used elsewhere), assign that record to
crash_row, then assert the message content separately (e.g., assert "Worker
crash" in crash_row["message"]); apply the same change to the other occurrence
around the second instance. Ensure you reference the existing crash_row and
lines variables when making the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 65240d5f-05bc-4818-988e-ab91d0fceae3
📒 Files selected for processing (3)
tests/test_runner.pyvera_bench/cli.pyvera_bench/runner.py
…ral row select)
Two new CodeRabbit findings posted 2026-05-23T04:07Z after the
I1-I5 commit:
### tests/test_runner.py — Type hints on 6 inner helpers
Test-side closures (`_record_thread` ×2, `_side_effect` ×3, `_capture`)
were untyped. Per the project's "Python 3.11+, type hints everywhere"
rule, annotated all six with:
def _xyz(
problem: dict[str, object], **kw: object
) -> list[ProblemResult]
`ProblemResult` was already imported at module scope.
### tests/test_runner.py — Crash row selection by problem_id, not substring
Replaced the brittle filter:
crash_row = next(json.loads(ln) for ln in lines if "Worker crash" in ln)
with a structural selector:
rows = [json.loads(ln) for ln in lines]
crash_row = next(row for row in rows if row["problem_id"] == "VB-X-2")
The message-content assertions ("simulated worker crash", "RuntimeError",
"Traceback") remain — they're now testing the message-content contract
explicitly rather than relying on it implicitly through the selector.
Applied to both `test_parallel_worker_exception_continues` and
`test_sequential_worker_exception_also_continues`.
All 13 TestRunBenchmarkParallel cases pass; lint clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
aallan
left a comment
There was a problem hiding this comment.
Approving — everything in scope is delivered
All five Important issues from the 2026-05-22T20:44 review are resolved, two of five suggestions taken with documented rationale on the rest, and both follow-up CodeRabbit findings on the response commit have already been addressed and confirmed by CR. CI is green across all 9 checks on 7a161f84.
Verification
| Ask | Delivered | Location |
|---|---|---|
| I1 + I3 — worker crashes vanish from JSONL + lost traceback | Unified _crash_result() synthesises a ProblemResult with traceback.format_exc() in error_message; new _record() helper unifies in-memory + JSONL persistence between success and crash paths |
runner.py:894-924 |
| I2 — sequential / parallel error-handling asymmetry | Both paths now wrap run_single_problem in identical try/except and route crashes through _crash_result + _record. Same fault semantics regardless of --parallel N |
runner.py:927-1001 |
I4 — progress.advance(task) untested on exception path |
New test_progress_advances_on_crash_path patches Progress, runs both parallel=2 and parallel=1 back-to-back, asserts advance.call_count == 4 (3 successes + 1 crash) on each path |
tests/test_runner.py:1814-1868 |
| I5 — version propagation through closure untested | New test_bench_and_vera_version_propagate_to_workers captures the kwargs run_single_problem actually receives via a _capture closure under parallel=3 and asserts both versions arrive on every call |
tests/test_runner.py:1870-1908 |
| S2 — patch-target fragility | Replaced ThreadPoolExecutor patch with threading.current_thread() is threading.main_thread() assertion (behaviour over implementation); added counterpoint test_parallel_two_actually_spawns_worker_threads |
tests/test_runner.py:1543-1625 |
| S4 — misleading POSIX-atomicity docstring | Rewritten to honest "main-thread as_completed serialisation" explanation |
tests/test_runner.py:1749-1791 |
Deferred suggestions (S1, S3, S5) all have documented rationale in the commit message — file-write error handling is a pre-existing cross-cutting concern, the Kimi K2.5 anecdote is motivating context, and the 20×8 stress test runs sub-second so the larger scale is free.
Worth calling out
The response on I1 + I3 went structurally tighter than my review prompted. The minimal fix would have been "add traceback.format_exc() to the existing log line." Instead you refactored both paths through _crash_result() + _record() so successes and crashes hit the same persistence machinery — that abstraction makes "did this crash get recorded?" a property of the helper rather than a branching obligation at every call site. Future contributors who add a new branch (e.g., a timeout path) inherit the persistence semantics for free.
The I4 test design is also tighter than the brief: I asked for an assertion on the parallel path; you patched Progress, exercised both parallel=2 and parallel=1 back-to-back with a reset between, and pinned advance.call_count == 4 on both. So I2's symmetric-fault-semantics design gets defended by I4's test on both paths in one go.
The S2 thread-identity replacement is the right kind of trade — implementation tests (patching imports) become brittle the moment someone moves an import; behaviour tests (assert this code runs on the main thread) survive that refactor cleanly. The counterpoint test ensures the inversion still holds.
And the two follow-up CodeRabbit findings on the response commit (type hints on six inner helpers, structural row selection by problem_id instead of message substring) were addressed in 7a161f84 within seven minutes of CR posting them, with CR confirming both. That iteration discipline keeps review productive.
Approved
CI green, every priority ask delivered, deferred items explicitly tracked. Ready to merge.
Positional conflict only: both aallan#73 (TestRunBenchmarkParallel) and aallan#70 (TestAilangLiteral / TestStripAilangMain / TestEvaluateAilangCode / TestLoadAilangPrompt / TestAilangPrompt / TestAilangCLI) appended new test classes at the end of tests/test_runner.py. Resolved by keeping both groups in order: TestRunBenchmarkIntegration -> TestRunBenchmarkParallel (from aallan#73) -> AILANG test classes (from aallan#70). No logical conflict between the PRs. PR aallan#73 modified run_benchmark (with new _crash_result / _record helpers at lines ~1242-1280); PR aallan#70 modified the AILANG evaluator paths (lines ~554-831) and added the AILANG dispatch branch in run_single_problem (lines ~975, 1017, 1107). The runner.py three-way merge resolved cleanly because the regions are disjoint; only the test file needed manual stitching. Verification: - ruff check . / ruff format --check . both clean - AST parse OK on merged test file - All three target classes present exactly once (no duplicates) - Final structure: TestRunBenchmarkIntegration -> TestRunBenchmarkParallel -> AILANG classes, separated by header comments Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Version bump ============ - pyproject.toml: 0.0.11 -> 0.0.12 - vera_bench/__init__.py fallback: 0.1.0 -> 0.0.12 (the fallback only fires when the package isn't installed via metadata; the canonical source is still pyproject.toml + importlib.metadata) - vera_bench/prompts.py _USER_AGENT: "vera-bench/0.0.9" -> "vera-bench/0.0.12" (was stuck at 0.0.9 since that release) Documentation consistency ========================= CHANGELOG.md - New [0.0.12] section covering the AILANG + --parallel work from #70 and #73, plus the worker-crash JSONL fix, the tag-classification regex, and the sequential/parallel symmetry fix - Compatibility note: 0.0.12 is purely additive for Vera, Python, TypeScript, and Aver scoring CLAUDE.md - Project description now mentions AILANG alongside Aver - solutions/ directory list updated to include ailang - New AILANG subsection documenting CLI flag conventions (--quiet/--caps IO/--entry main/--relax-modules, AILANG_TRACE=off, *_API_KEY scrubbing) - New "Adding more comparison languages" subsection noting OpenRouter / MOONSHOT / OPENROUTER env var support - Commands list adds --language ailang for both `run` and `baselines`, plus --parallel N with explanatory paragraph ROADMAP.md - "Where we are" prepended with v0.0.12 summary - Milestone 1 checks off AILANG language support and --parallel N README.md - Quick start adds --parallel N example - Supported providers list adds OpenRouter and OPENROUTER_API_KEY KNOWN_ISSUES.md - Chart-pin section dropped stale "v0.0.9" references in favor of generic "current-version" phrasing — the warning is the same shape regardless of which version is current - Removal trigger updated to reflect that the trigger is "when README is rewritten against current data", not a specific version scripts/README.md - Same chart-pin staleness fix as KNOWN_ISSUES.md Out of scope ============ `scripts/run_full_benchmark.py` was not updated to include AILANG targets — PR #70 added the language support but missed the sweep script. That's a real gap but it's a code change, not a docs change. Spawned a follow-up task to extend the sweep script to 10 targets (LLM + baseline for AILANG) plus the matching scripts/README.md updates. The fixture values "0.0.11" / "0.0.108" in tests/test_runner.py (I5 propagation test) are arbitrary strings used to verify kwargs forwarding through the parallel-path closure — they're not assertions about the current package version. Left as-is. Verification ============ - ruff check . / ruff format --check . both clean - 229 tests pass under pytest (1 known-flaky Rich console-width test unrelated to these changes; CI runners use wider console width) - importlib.metadata.version("vera-bench") still resolves correctly (the fallback at __init__.py is only hit when the package metadata isn't installed, e.g., a raw git checkout without `pip install -e .`) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds a
--parallel Nflag tovera-bench runthat dispatches problems to aThreadPoolExecutorwithNworkers. Each worker is I/O-bound on its LLM call + subprocesscheck/run, so the GIL is not a bottleneck.Use case: slow models like Kimi K2.5 averaged ~50s/problem sequentially across the 60-problem sweep (~50 min total). With
--parallel 10that should drop to ~5 min, which makes release-time re-evals practical.Default
parallel=1preserves the existing sequential path with no behavior change.Scope note
This was originally bundled into PR #70 (the AILANG baseline). It snuck in during release-time benchmarking and was always intended to be a separate PR — concurrent sweep is independent of AILANG support and deserves its own review. Reverted out of #70 (revert commit) and surfacing here.
Implementation
ThreadPoolExecutorwithmax_workers=parallel, futures collected viaas_completedthreading.Lockaround the JSONL append so concurrent writes don't interleave. Lines are still self-contained (carryproblem_id) so completion-order writes are fine for downstream consumerswork_dir; per-problem temp files are uniquified byproblem_id(existing behaviour, unchanged)Smoke-tested with
claude-haiku-4-5 --tier 1 --parallel 4: 10/10 problems, no duplicates, 100%/100% run_correct.Tests
TestRunBenchmarkParallelintests/test_runner.pyadds 7 cases covering:parallel=1does NOT useThreadPoolExecutor(patched to raise on use)parallel>1runs every problem and collects every resultwrite_lockserialises JSONL writes (20 problems × 8 workers, every line parseable JSON)output_path=Noneis a valid code path--parallel Nand defaults to 1All 7 pass locally;
ruff check ./ruff format --check ./ruff check --select S vera_bench/all clean. Local coverage onvera_bench/runner.py: 83% (CI lifts further when vera-binary-dependent paths run).Test plan
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
rungains a--paralleloption (default 1) to run problems concurrently; progress shows chosen parallelism, worker errors are recorded per problem without aborting the run, and metadata versions are propagated.output_path=Nonereturns in-memory results without writing files.Tests
--parallel, and progress updates for every problem.