Add --parallel N flag to vera-bench run #73

Merged
aallan merged 5 commits into aallan:main from sunholo-voight-kampff:feature/parallel-benchmark
May 25, 2026

Conversation

@sunholo-voight-kampff

@sunholo-voight-kampff sunholo-voight-kampff commented May 22, 2026

Contributor

Summary

Adds a --parallel N flag to vera-bench run that dispatches problems to a ThreadPoolExecutor with N workers. Each worker is I/O-bound on its LLM call + subprocess check/run, so the GIL is not a bottleneck.

Use case: slow models like Kimi K2.5 averaged ~50s/problem sequentially across the 60-problem sweep (~50 min total). With --parallel 10 that should drop to ~5 min, which makes release-time re-evals practical.

Default parallel=1 preserves the existing sequential path with no behavior change.

Scope note

This was originally bundled into PR #70 (the AILANG baseline). It snuck in during release-time benchmarking and was always intended to be a separate PR — concurrent sweep is independent of AILANG support and deserves its own review. Reverted out of #70 (revert commit) and surfacing here.

Implementation

  • ThreadPoolExecutor with max_workers=parallel, futures collected via as_completed
  • threading.Lock around the JSONL append so concurrent writes don't interleave. Lines are still self-contained (carry problem_id) so completion-order writes are fine for downstream consumers
  • Workers share work_dir; per-problem temp files are uniquified by problem_id (existing behaviour, unchanged)
  • Exception in any worker is caught, logged red, and the sweep continues — one crashed problem doesn't kill the rest

Smoke-tested with claude-haiku-4-5 --tier 1 --parallel 4: 10/10 problems, no duplicates, 100%/100% run_correct.
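
The dispatch pattern in the Implementation list can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code in vera_bench/runner.py: `solve`, `run_sweep`, and the dict-shaped rows are hypothetical stand-ins for the real worker and ProblemResult, and the sequential parallel=1 branch is left out.

```python
import json
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def solve(problem: dict) -> dict:
    """Hypothetical worker: stands in for the LLM call plus check/run."""
    if problem.get("boom"):
        raise RuntimeError(f"simulated failure on {problem['id']}")
    return {"problem_id": problem["id"], "run_correct": True}


def run_sweep(problems: list[dict], output_path: Path, parallel: int = 2) -> list[dict]:
    """Run every problem on a pool and append one JSONL row per success."""
    results: list[dict] = []
    write_lock = threading.Lock()  # lock around the append, as the list above states
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        # Map each future back to its problem so failures can be attributed.
        futures = {pool.submit(solve, p): p for p in problems}
        for fut in as_completed(futures):  # yields futures in completion order
            try:
                row = fut.result()  # re-raises the worker's exception here
            except Exception as exc:
                print(f"Worker failed on {futures[fut]['id']}: {exc}")
                continue  # one crashed problem does not stop the sweep
            results.append(row)
            with write_lock, open(output_path, "a", encoding="utf-8") as f:
                f.write(json.dumps(row) + "\n")
    return results
```

Because every row carries its own problem_id, a consumer that needs a stable order can sort the lines after the sweep finishes.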

Tests

TestRunBenchmarkParallel in tests/test_runner.py adds 7 cases covering:

  • parallel=1 does NOT use ThreadPoolExecutor (patched to raise on use)
  • parallel>1 runs every problem and collects every result
  • One worker raising doesn't abort the sweep
  • write_lock serialises JSONL writes (20 problems × 8 workers, every line parseable JSON)
  • output_path=None is a valid code path
  • Click accepts --parallel N and defaults to 1

All 7 pass locally; ruff check . / ruff format --check . / ruff check --select S vera_bench/ all clean. Local coverage on vera_bench/runner.py: 83% (CI lifts further when vera-binary-dependent paths run).

Test plan

  • CI green
  • Code review

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • run gains a --parallel option (default 1) to run problems concurrently; progress shows chosen parallelism, worker errors are recorded per problem without aborting the run, and metadata versions are propagated. output_path=None returns in-memory results without writing files.
  • Tests

    • Added tests covering sequential and parallel execution, concurrency-safe output integrity, error isolation and reporting, CLI validation for --parallel, and progress updates for every problem.

MarkEdmondson1234 and others added 2 commits May 22, 2026 14:58
Run N problems concurrently via ThreadPoolExecutor. Each worker
is I/O-bound on its LLM HTTP call + subprocess-based check/run,
so the GIL is not a bottleneck.

Use case: slow models like Kimi K2.5 averaged 49s/problem
sequentially across the 60-problem AILANG sweep (~50 min total).
With --parallel 10 the same sweep should drop to ~5 min, which
makes release-time re-evals practical.

Implementation:
- ThreadPoolExecutor with max_workers=parallel
- Per-problem futures collected via as_completed
- threading.Lock around the JSONL append so concurrent writes
  don't interleave. Lines are still self-contained (carry
  problem_id) so completion-order writes are fine.
- Workers share the same work_dir; per-problem temp files are
  uniquified by problem_id (existing behavior).
- Exception per worker is caught and logged; the sweep continues.

Default parallel=1 preserves the existing sequential path with
no behavior change.

Smoke-tested with claude-haiku-4-5 --tier 1 --parallel 4:
10/10 problems, no duplicates, 100%/100% run_correct.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds TestRunBenchmarkParallel covering the ThreadPoolExecutor path:

- test_parallel_one_uses_sequential_path: parallel=1 (default) does
  NOT touch ThreadPoolExecutor at all (patched to raise on use)
- test_parallel_two_runs_all_problems: every problem completes,
  every result is collected (order may differ — completion order)
- test_parallel_worker_exception_continues: one worker raising
  doesn't abort the sweep; sibling problems still complete
- test_parallel_writes_are_serialised: 20 problems × 8 workers,
  every JSONL line is parseable JSON (no torn writes from the
  write_lock failing to serialise)
- test_parallel_no_output_path_still_collects_results: skipping
  the write block is a valid code path
- test_run_command_accepts_parallel_flag: Click accepts --parallel N
- test_run_command_parallel_default_is_one: help text confirms default

All 7 pass; local ruff check / format --check / S all clean.
Coverage on vera_bench/runner.py: 83% locally (CI lifts further when
vera-binary-dependent paths are reachable).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented May 22, 2026

Warning

Review limit reached

@sunholo-voight-kampff, we couldn't start this review because you've used your available PR reviews for now.

Your plan currently allows 1 review/hour. Refill in 52 minutes and 42 seconds.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more review capacity refills, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than trial, open-source, and free plans. In all cases, review capacity refills continuously over time.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8de82082-0b06-49e3-b36b-c896aa350e9e

📥 Commits

Reviewing files that changed from the base of the PR and between fa8d1ce and 7a161f8.

📒 Files selected for processing (1)
  • tests/test_runner.py

Walkthrough

This PR adds configurable parallel execution to the vera_bench benchmark runner. The core run_benchmark function now accepts a parallel parameter that controls whether problems execute sequentially (default) or concurrently via ThreadPoolExecutor. Thread-safe JSONL output and graceful exception handling ensure correct results even under high contention. The feature is exposed through a new --parallel CLI option.

Changes

Parallel problem execution

| Layer / File(s) | Summary |
| --- | --- |
| Runner implementation (signature, helpers, parallel path): vera_bench/runner.py | run_benchmark signature extended with parallel: int = 1. Added traceback import and helpers to record results and synthesise crash ProblemResult. Implemented parallel > 1 dispatch using ThreadPoolExecutor + as_completed, with completion-order JSONL writes and per-worker exception capture. |
| CLI integration: vera_bench/cli.py | Added --parallel Click option (min 1, default 1). Extended run(...) signature with parallel: int and forwarded it to run_benchmark. |
| Test suite and helpers: tests/test_runner.py | Introduced TestRunBenchmarkParallel class with helpers used across multiple tests. |
| Sequential-run tests: tests/test_runner.py | Tests that parallel=1 runs on main thread, continues sweep on exceptions (crash rows written), respects output_path=None, and calls progress.advance once per problem. |
| Parallel-run tests: tests/test_runner.py | Tests that parallel>1 spawns worker threads, collects one ProblemResult per problem, isolates worker failures into crash JSONL rows containing traceback, ensures concurrent JSONL writes are valid and complete, and verifies bench_version/vera_version propagation into workers. |
| CLI parsing & help tests: tests/test_runner.py | Tests that run --parallel <positive> is accepted, --parallel 0 and negative values are rejected with Click errors, and run --help shows --parallel with default: 1. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

harness

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 51.85% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding a --parallel N flag to the vera-bench run command, which is the primary feature across all modified files. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

sunholo-voight-kampff added a commit to sunholo-voight-kampff/vera-bench that referenced this pull request May 22, 2026
Items 2, 3, 4 from @aallan's consolidated review on PR aallan#70.
(Item 1 — extracting --parallel N into its own PR — addressed via
PR aallan#73.)

### Item 2: README headline section -> single sentence in §Overview

Removed the "AILANG: AI-designed language..." headline section
(13 lines: the heading, the description paragraphs, the per-mode
results table, the "full-circle finding" paragraph). The phrasing
included editorial claims about VeraBench's identity that should
be a project-owner call, and "added in this fork" wouldn't read
correctly post-merge.

Replaced the §Overview line about baselines with the form
@aallan suggested verbatim:

  The same problems are also run in Python, TypeScript,
  [Aver](https://github.com/jasisz/aver), and [AILANG](https://ailang.sunholo.com/)
  as baselines. AILANG and Aver are zero-training-data languages,
  providing additional data points alongside Vera for the
  language-design-vs-training-data thesis.

Matches the existing Aver pattern: light-touch mention without
results writeups in the README.

### Item 3: Delete AILANG_MAPPING.md and AILANG_RESULTS.md

Neither file is load-bearing — no code or tests reference them.
Aver landed across PRs aallan#57 / aallan#62 / aallan#65 without AVER_RESULTS.md or
AVER_MAPPING.md. Numbers and writeups go in PR descriptions and
external content; in-repo docs are reserved for things future
maintainers need.

### Item 4: .coderabbit.yaml path_filters

Added the two missing AILANG entries to mirror the existing
{python, typescript, aver} pattern:

    - "!**/*.ail"             (alongside !**/*.vera, !**/*.av)
    - "!solutions/ailang/**"  (alongside the other solutions/* entries)

This stops CodeRabbit from generating speculative findings on
.ail solution files in future review passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sunholo-voight-kampff

Contributor Author

@coderabbitai review

(Re-triggering — the initial review at 13:32Z was rate-limited.)

@coderabbitai

coderabbitai Bot commented May 22, 2026

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Get the PR diff
gh pr view 73 --repo aallan/vera-bench --json files,additions,deletions,headRefName

# Get the diff
gh pr diff 73 --repo aallan/vera-bench

Length of output: 15126


🏁 Script executed:

#!/bin/bash
# Check how run_single_problem creates temp files to verify work_dir concurrency safety
rg -n "work_dir\|tmp\|tempfile\|mktemp\|problem_id" vera_bench/runner.py | head -80

# Check run_benchmark full signature
ast-grep --pattern 'def run_benchmark($$$)'

# Check run_single_problem signature and temp file usage
ast-grep --pattern 'def run_single_problem($$$)'

Length of output: 18516


🏁 Script executed:

#!/bin/bash
# Check how evaluation functions create temp files — specifically if they use problem_id
rg -n "problem_id\|problem\[.id.\]\|work_dir\|\.vera\|\.py\|\.ts\|stem\|suffix\|Path(" vera_bench/runner.py | grep -v "^vera_bench/runner.py:6[6-8]" | grep -E "_(evaluate|check|run)|work_dir|problem\[" | head -60

# Find _evaluate_* function definitions and their temp file creation
ast-grep --pattern 'def _evaluate_python_code($$$)'
ast-grep --pattern 'def _evaluate_aver_code($$$)'
ast-grep --pattern 'def _evaluate_typescript_code($$$)'
ast-grep --pattern 'def _evaluate_code($$$)'

Length of output: 26437


@sunholo-voight-kampff — clean addition. The sequential path is untouched, temp file isolation is solid across all four evaluators, and the exception-survives pattern is exactly right for a sweep tool. Three findings below, roughly in priority order.


Review

🟡 Conceptual: write_lock guards nothing in this pattern

vera_bench/runner.py, Lines 952–957

The write_lock is described in the PR as preventing interleaved JSONL writes from concurrent threads, but the concurrency model here doesn't produce concurrent writes:

  • Worker threads only execute _run_one — they return list[ProblemResult] through the Future object and never touch output_path.
  • The main thread drives the for fut in as_completed(futures) loop. Every statement in that loop body — all_results.extend, the open(output_path, "a"), the f.write(...), the progress.advance(task) — runs sequentially on the main thread.

There is no second thread that could race on the file. The lock is acquired and released with no contention and protects nothing.

This is harmless today, but the accompanying comment ("serialise JSONL writes") will mislead a future reader into thinking there's something concurrent happening in the write path. If a later refactor does move writes into the workers, a reader might assume the lock already handles it correctly (it wouldn't — the with (write_lock, open(...) as f) syntax acquires both context managers together, but the file open is not re-entrant).

# Current — lock is a no-op; all writes are main-thread via as_completed
with (
    write_lock,
    open(output_path, "a", encoding="utf-8") as f,
):

# Cleaner — drop the lock, add a comment explaining why it isn't needed
# File writes are serialised by the as_completed loop running on the main thread.
with open(output_path, "a", encoding="utf-8") as f:

And write_lock/import threading can be removed entirely.
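
The claim that the loop body never leaves the calling thread can be checked with a small experiment. The sketch below is illustrative only: `work` and `where_things_run` are hypothetical names, and the set update marks the spot where the runner's JSONL append would sit.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


def work(n: int) -> tuple[int, str]:
    """Hypothetical worker: returns its input and the thread it ran on."""
    return n, threading.current_thread().name


def where_things_run(count: int = 8, workers: int = 4) -> tuple[set[str], set[str]]:
    """Return (threads that ran workers, threads that ran the loop body)."""
    worker_threads: set[str] = set()
    loop_threads: set[str] = set()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(work, i) for i in range(count)]
        for fut in as_completed(futures):
            _, name = fut.result()
            worker_threads.add(name)
            # A file append placed here runs on the thread that drives the loop.
            loop_threads.add(threading.current_thread().name)
    return worker_threads, loop_threads
```

The second set always holds exactly one name, the thread that called `where_things_run`, which is why a lock around a write in that position has nothing to contend with.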


🟡 CLI: --parallel 0 silently falls through to sequential

vera_bench/cli.py / vera_bench/runner.py Line 889

parallel <= 1 routes zero and negatives to the sequential path without any user feedback. A user who typos --parallel 0 will run a perfectly normal sequential sweep with no indication something was wrong.

click.IntRange rejects out-of-range values before run_benchmark is ever called and produces a standard Click error message:

@click.option(
    "--parallel",
    type=click.IntRange(min=1),   # rejects 0 and negatives at parse time
    default=1,
    show_default=True,
    help=...,
)

If you want parallel=1 to explicitly mean "sequential" and 0 to also be valid as an alias, that's fine — but it should be documented in the help text. The current behaviour is just silent.


🟠 Test fragility: "invalid" is too broad a sentinel

tests/test_runner.py, test_run_command_accepts_parallel_flag

assert "invalid" not in (result.output or "").lower()

The string "invalid" can appear in non-option-error output: "invalid API key", "invalid token", "invalid model", etc. If the downstream path emits any such message, this assertion fires a false positive and the test fails even though Click parsed --parallel 4 correctly.

Click uses exit code 2 for all parse/usage errors (UsageError, BadParameter, NoSuchOption). That's the reliable sentinel here:

# Click exits with code 2 on parse errors (NoSuchOption, BadParameter, etc.)
# Downstream failures (missing API key, etc.) use code 1 or non-zero != 2.
assert result.exit_code != 2, f"Click parse error:\n{result.output}"
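
Both CLI findings can be exercised together with Click's test runner. The sketch assumes Click is installed; the `run` command here is a hypothetical one-option stand-in for the real handler in vera_bench/cli.py, and the help text is invented for illustration.

```python
import click
from click.testing import CliRunner


@click.command()
@click.option(
    "--parallel",
    type=click.IntRange(min=1),  # 0 and negatives are rejected at parse time
    default=1,
    show_default=True,
    help="Number of problems to run concurrently.",  # illustrative wording
)
def run(parallel: int) -> None:
    """Hypothetical stand-in for the real run command."""
    click.echo(f"parallel={parallel}")


runner = CliRunner()
ok = runner.invoke(run, ["--parallel", "4"])
zero = runner.invoke(run, ["--parallel", "0"])
negative = runner.invoke(run, ["--parallel", "-5"])

# Exit code 2 is Click's usage-error code, so it separates parse failures
# from anything that goes wrong later inside the command body.
print(ok.exit_code, zero.exit_code, negative.exit_code)
```

Asserting on the exit code keeps the test independent of the exact wording Click uses in its error message.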

✅ What looks good

  • Temp file isolation is airtight. All four evaluators (_evaluate_code, _evaluate_python_code, _evaluate_typescript_code, _evaluate_aver_code) name their files {safe_id}_attempt{attempt}.* and {safe_id}_test{i}_attempt{attempt}.*. Parallel workers sharing work_dir will never collide as long as problem IDs are unique (which is a benchmark invariant).
  • Exception handling is correct. fut.result() re-raises worker exceptions on the main thread; the except Exception catches them, logs red, and continues — one bad LLM response can't abort the sweep.
  • Sequential path is byte-for-byte unchanged. parallel <= 1 falls straight into the original block with no refactor churn.
  • test_parallel_one_uses_sequential_path correctly proves the invariant: since parallel=1 never enters the else branch, the patched ThreadPoolExecutor that raises AssertionError is never reached.
  • test_parallel_writes_are_serialised (20 problems × 8 workers, every line re-parsed) is a well-motivated test even if the lock it's testing turns out to be redundant — it validates the JSONL output contract under load.
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
vera_bench/cli.py (1)

90-101: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add an explicit -> None return type to run

vera_bench/cli.py: run(...) currently has no return annotation even though it only performs side effects (prints/writes results) and doesn’t return a value. Update the signature to:

def run(
    model: str,
    tier: int | None,
    problem: str | None,
    language: str,
    mode: str,
    skill_md: Path | None,
    output_dir: Path | None,
    max_tokens: int,
    keep_temps: bool,
    parallel: int,
) -> None:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@vera_bench/cli.py` around lines 90 - 101, The run function lacks an explicit
return annotation; update the function signature for run(...) to declare it
returns None by adding -> None to the definition so its signature becomes def
run(... ) -> None:, ensuring the type hint reflects that it only performs side
effects.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/test_runner.py`:
- Around line 1711-1713: The test in tests/test_runner.py is brittle because it
checks for the substring "invalid" in result.output; replace that substring
check with Click usage-error semantics by asserting on result.exit_code (use
result.exit_code == 2 for a parse/usage failure, or result.exit_code != 2 for a
successful parse) and remove the "invalid" substring assertion that can
false-fail; locate the assertion lines that reference result.output and the
variable result to make this change.

In `@vera_bench/cli.py`:
- Around line 79-83: Update the Click option for "--parallel" so values <1 are
rejected at parse time (replace type=int with click.IntRange(min=1) or add an
explicit callback validator) to prevent 0/negative values from being passed to
run_benchmark; also add the missing return type annotation "-> None" on the CLI
entry function run(...) to follow the repo's type-hinting guideline and clarify
it returns nothing. Ensure the call site that passes parallel to run_benchmark
remains unchanged (run_benchmark(parallel=parallel)).

In `@vera_bench/runner.py`:
- Around line 915-919: The redundant threading.Lock instance named write_lock
should be removed from the parallel path: delete the write_lock creation and any
acquisitions/releases around JSONL writes (references: write_lock), since JSONL
writes are already serialized by the main-thread as_completed loop (references:
as_completed, ThreadPoolExecutor) and the lock provides no additional
protection; also remove the now-unused threading import if it's only present for
this lock so the imports remain correct.

---

Outside diff comments:
In `@vera_bench/cli.py`:
- Around line 90-101: The run function lacks an explicit return annotation;
update the function signature for run(...) to declare it returns None by adding
-> None to the definition so its signature becomes def run(... ) -> None:,
ensuring the type hint reflects that it only performs side effects.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 57028d49-ac03-482c-9980-64392affeb86

📥 Commits

Reviewing files that changed from the base of the PR and between 237ca81 and b936bc5.

📒 Files selected for processing (3)
  • tests/test_runner.py
  • vera_bench/cli.py
  • vera_bench/runner.py

…test brittleness)

### cli.py:83 — IntRange(min=1) for --parallel

`type=int` silently accepted 0 and negative values; `run_benchmark`
then treated `parallel <= 1` as sequential, masking the bug. Switched
to `click.IntRange(min=1)` so 0/negative fail at parse time with
Click's standard usage error (exit_code=2).

Skipped CR's suggested `-> None` annotation on `def run(...)` — per
@aallan's prior comment on PR aallan#70 (commit aa13f25's description),
"missing `-> None` applies to ALL Click handlers in cli.py — pre-
existing project-wide consistency issue, not specific to this PR".
Annotating just `run` would break that consistency; out of scope.

### runner.py:919 — remove redundant write_lock

CR correctly observed that JSONL writes are already serialised by
the main-thread `for fut in as_completed(...)` loop. Workers only
run `_run_one` (LLM/subprocess work) and never touch `output_path`,
so `threading.Lock()` was protection without need. Removed the lock,
the `import threading`, and the lock acquisition. Added a comment
explaining where serialisation actually comes from so a future
reader doesn't re-add the lock thinking it was load-bearing.

Updated the docstring on `test_parallel_writes_are_serialised` to
credit the loop structure (not the lock) as the serialisation
mechanism — the property holds whether the lock is there or not,
because workers never write.

### tests/test_runner.py:1713 — exit_code != 2 over substring check

Replaced the brittle `"invalid" not in result.output` substring
check with `result.exit_code != 2`. Click's parse/usage errors
return exit_code 2 cleanly; substring checks could false-fail on
unrelated runtime output (e.g. an API-key error message containing
the word "invalid").

Also added two new tests pinning the new IntRange behaviour:
- test_run_command_rejects_zero_parallel: --parallel 0 -> exit 2
- test_run_command_rejects_negative_parallel: --parallel -5 -> exit 2

All 9 TestRunBenchmarkParallel cases pass locally; ruff check /
format --check / S all clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented May 22, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.49%. Comparing base (237ca81) to head (7a161f8).

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #73      +/-   ##
==========================================
+ Coverage   83.65%   84.49%   +0.84%     
==========================================
  Files          10       10              
  Lines        1395     1432      +37     
==========================================
+ Hits         1167     1210      +43     
+ Misses        228      222       -6     
| Flag | Coverage Δ |
| --- | --- |
| python | 84.49% <100.00%> (+0.84%) ⬆️ |

@aallan aallan left a comment

Owner

Detailed review — --parallel N

Ran a four-agent parallel review (code-reviewer, pr-test-analyzer, silent-failure-hunter, comment-analyzer). Findings are deduped; convergence noted where multiple agents independently agreed.

TL;DR

No critical merge-blockers — threading model is sound, JSONL serialisation is correct, CI clean across all 9 checks, all three CodeRabbit inline findings substantively addressed in bfbfae2. But the four agents converged on five Important issues worth a second iteration before merge, plus five negotiable suggestions. The standout is I1 (worker crashes vanish from JSONL) — flagged by silent-failure-hunter and arguably the closest thing to a blocker, since it makes parallel sweeps silently under-report scope to downstream vera-bench report.


Important Issues (5)

I1 — Worker crashes vanish from JSONL ★ priority

vera_bench/runner.py:945-950. When fut.result() raises, the handler logs [red]Worker failed on {pid}: {exc}[/red] to stdout and continues — no ProblemResult is written to output_path. Downstream consequences:

  • A 60-problem sweep with 2 worker crashes produces 58 JSONL lines.
  • vera-bench report then computes pass/fail metrics over those 58 lines and reports "58/58 succeeded (100%)" — the 2 crashes are invisible to anyone reading results files later.
  • The sequential path doesn't have this defect because crashes abort the whole sweep loudly with a traceback.

For a benchmark tool, silently changing the denominator on operator error is arguably worse than diagnostic loss. Fix:

except Exception as exc:  # noqa: BLE001
    pid = futures[fut].get("id", "?")
    tb = traceback.format_exc()
    console.print(f"[red]Worker failed on {pid}: {exc!r}[/red]")
    console.print(f"[dim]{tb}[/dim]")
    crash_result = ProblemResult(
        problem_id=pid,
        model=getattr(client, "model", "unknown"),
        language=language,
        attempt=0,
        check_pass=False,
        run_correct=False,
        error_message=f"Worker crash: {exc!r}\n{tb}",
        timestamp=datetime.now(timezone.utc).isoformat(),
        # ... other required fields
    )
    all_results.append(crash_result)
    if output_path:
        with open(output_path, "a", encoding="utf-8") as f:
            f.write(crash_result.to_jsonl() + "\n")
    progress.advance(task)
    continue

This single change addresses I1 and I3 (traceback) at once.
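
A self-contained version of the same idea, small enough to run, might look like the sketch below. It is an assumption-laden miniature: `Row` is a hypothetical, trimmed-down stand-in for ProblemResult (whose full field list is not shown in this thread), and `solve`, `crash_row`, and `sweep` are illustrative names.

```python
import json
import traceback
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class Row:
    """Hypothetical, trimmed-down stand-in for ProblemResult."""

    problem_id: str
    check_pass: bool
    run_correct: bool
    error_message: str = ""
    timestamp: str = ""

    def to_jsonl(self) -> str:
        return json.dumps(asdict(self))


def solve(problem: dict) -> Row:
    """Hypothetical worker that crashes on problems marked 'boom'."""
    if problem.get("boom"):
        raise ValueError("bad model response")
    return Row(problem["id"], True, True)


def crash_row(problem_id: str, exc: BaseException) -> Row:
    """Turn a worker exception into a failed row that keeps the traceback."""
    tb = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    return Row(
        problem_id=problem_id,
        check_pass=False,
        run_correct=False,
        error_message=f"Worker crash: {exc!r}\n{tb}",
        timestamp=datetime.now(timezone.utc).isoformat(),
    )


def sweep(problems: list[dict], parallel: int = 2) -> list[str]:
    """Return one JSONL line per problem, crashed or not."""
    lines: list[str] = []
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        futures = {pool.submit(solve, p): p for p in problems}
        for fut in as_completed(futures):
            try:
                row = fut.result()
            except Exception as exc:
                row = crash_row(futures[fut].get("id", "?"), exc)
            lines.append(row.to_jsonl())
    return lines
```

With this shape the number of output lines always equals the number of problems, so a report computed from the file keeps the right denominator.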

I2 — Sequential/parallel error-handling asymmetry

Two agents converged. runner.py:885-906 (sequential, no try/except) vs runner.py:945-952 (parallel, swallows Exception). vera-bench run --parallel 1 and vera-bench run --parallel 2 have different fault semantics on the same input — a transient NoneType from a bad model response kills the whole sweep at --parallel 1 but is logged-and-continued at --parallel 2.

Either align them (recommended — wrap the sequential body in the same try/except so a 4-hour sweep survives problem 47 of 60 regardless of parallelism), or document the asymmetry in the run_benchmark docstring alongside the existing "JSONL output ordering" note.
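
One way to align the two paths is to route both through the same guard, sketched below. This is a design variation rather than the PR's code: it moves the try/except into a shared `guarded` helper instead of wrapping `fut.result()`, and `guarded`, `run_all`, and `flaky` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


def guarded(fn: Callable[[dict], dict], problem: dict) -> dict:
    """Run one problem; convert any Exception into a failed result."""
    try:
        return fn(problem)
    except Exception as exc:
        return {"problem_id": problem.get("id", "?"), "ok": False, "error": repr(exc)}


def run_all(fn: Callable[[dict], dict], problems: list[dict], parallel: int = 1) -> list[dict]:
    """Sequential and pooled paths share the same fault semantics."""
    if parallel <= 1:
        return [guarded(fn, p) for p in problems]
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        futures = [pool.submit(guarded, fn, p) for p in problems]
        return [f.result() for f in as_completed(futures)]


def flaky(problem: dict) -> dict:
    """Hypothetical worker that fails on one specific problem."""
    if problem["id"] == "P2":
        raise AttributeError("'NoneType' object has no attribute 'foo'")
    return {"problem_id": problem["id"], "ok": True}
```

Apart from ordering, `run_all(flaky, problems, parallel=1)` and `run_all(flaky, problems, parallel=3)` then yield the same rows, including the failed one.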

I3 — Lost traceback in worker exception handler

Same diagnostic-loss pattern as C3 on PR #70. runner.py:947 reports str(exc) only. On a 1-hour parallel sweep with hundreds of problems, debugging Worker failed on VB-T3-007: 'NoneType' object has no attribute 'foo' with no file/line is painful.

Subsumed by the I1 fix above (capture traceback.format_exc() into the JSONL row's error_message).

I4 — progress.advance(task) on exception path is untested

tests/test_runner.py:1591-1620 (test_parallel_worker_exception_continues) verifies the loop continues (3 of 4 results return), but does not pin that the progress bar advanced on the crashed problem. The current code does call advance before continue at runner.py:951, but a refactor moving it into an else: branch would silently regress the bar to hang at N-1/N with no test failure.

Fix: inject/patch progress (or pass a fake) and assert advance.call_count == len(problems) even with a worker exception.
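
A test along those lines could be shaped like the sketch below. `mini_sweep` is a hypothetical miniature of the parallel loop with the progress object injected as a parameter; the real run_benchmark may construct its progress bar differently, in which case patching would replace the injection.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from unittest.mock import Mock


def mini_sweep(problems, solve, progress, task, parallel=2):
    """Hypothetical miniature of the parallel loop, with injected progress."""
    results = []
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        futures = {pool.submit(solve, p): p for p in problems}
        for fut in as_completed(futures):
            try:
                results.append(fut.result())
            except Exception:
                progress.advance(task)  # the call the review wants pinned
                continue
            progress.advance(task)
    return results


def test_progress_advances_once_per_problem_even_on_crash():
    problems = [{"id": f"P{i}"} for i in range(4)]

    def solve(problem):
        if problem["id"] == "P1":
            raise RuntimeError("simulated worker crash")
        return problem["id"]

    progress = Mock()  # advance() is only ever called from the loop thread
    results = mini_sweep(problems, solve, progress, task="t")
    assert len(results) == 3
    assert progress.advance.call_count == len(problems)
```

If the advance call on the exception path were dropped, the count would fall to three and the final assertion would fail.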

I5 — Version propagation through _run_one closure is untested

runner.py:922-936. bench_version and vera_version are captured by closure into _run_one. If a future refactor drops them from the kwargs forwarded to run_single_problem, JSONL would silently get empty version strings. The existing mock at test_runner.py:1546 uses lambda problem, **kw: ... which swallows the kwargs without inspection.

Fix: one test that calls with bench_version="0.1.0", vera_version="0.0.103" and asserts the mock saw them.
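
Such a test might be sketched as follows. `dispatch` is a hypothetical miniature of the closure-based forwarding, and the recorder uses a plain list instead of a mock so that calls arriving from several worker threads are captured safely.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def dispatch(problems, run_single, *, bench_version, vera_version, parallel=2):
    """Hypothetical miniature: a closure forwards versions to each worker call."""

    def _run_one(problem):
        return run_single(
            problem, bench_version=bench_version, vera_version=vera_version
        )

    with ThreadPoolExecutor(max_workers=parallel) as pool:
        futures = [pool.submit(_run_one, p) for p in problems]
        return [f.result() for f in as_completed(futures)]


def test_versions_reach_every_worker_call():
    seen = []

    def run_single(problem, **kwargs):
        seen.append(kwargs)  # inspect the kwargs instead of swallowing them
        return "ok"

    problems = [{"id": "A"}, {"id": "B"}, {"id": "C"}]
    dispatch(problems, run_single, bench_version="0.1.0", vera_version="0.0.103")
    assert len(seen) == len(problems)
    for kwargs in seen:
        assert kwargs["bench_version"] == "0.1.0"
        assert kwargs["vera_version"] == "0.0.103"
```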


Suggestions (5, negotiable)

  • S1 — Output-write block has no error handling. runner.py:949 — disk full or NFS hiccup mid-sweep raises OSError on the main thread, killing the whole sweep. Pre-existing in sequential too, but parallel makes the blast radius bigger (more in-flight uncommitted results). Consider try/except around the write with a fallback log.
  • S2 — Patch-target fragility. tests/test_runner.py:1538-1541 patches concurrent.futures.ThreadPoolExecutor. Works because the import is lazy in the parallel branch, but would silently fail if someone moves the import to module scope. Robustness alternative: assert threading.current_thread() is threading.main_thread() inside _run_one — asserts behaviour (no thread spawn) rather than implementation (no class lookup).
  • S3 — Anecdotal Kimi K2.5 performance figures. runner.py:877-879 and cli.py:85 cite "~50s → ~5s" with no committed benchmark, and the model name will age out. Either commit a measurement or soften to generic "slow models".
  • S4 — Misleading POSIX-atomicity claim in test docstring. tests/test_runner.py:1623's "Python's GIL doesn't make file writes atomic — partial writes are observable" is wrong for the actual case here: short writes (<PIPE_BUF ~4096B) with O_APPEND are atomic on POSIX, and the production code serialises writes on the main thread anyway. The test still verifies useful properties; the docstring oversells.
  • S5 — 20×8 stress level is overkill post-lock-removal. With writes now serialised on main thread, the test only catches the regression of moving writes back into workers — a single race at much lower scale would catch it. Lower N if test suite gets slow.
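
The behavioural check suggested in S2 can be sketched as below. It compares against the calling thread rather than `threading.main_thread()`, a small variation that also holds when a test runner executes tests off the main thread; `run_sequential`, `run_pooled`, and `threads_used` are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_sequential(problems, solve):
    """Hypothetical sequential path: calls solve directly, no pool."""
    return [solve(p) for p in problems]


def run_pooled(problems, solve, workers=2):
    """Hypothetical parallel path, included only for contrast."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(solve, problems))


def threads_used(runner, problems):
    """Return the set of threads on which solve was called."""
    seen = set()

    def solve(problem):
        seen.add(threading.current_thread())
        return problem

    runner(problems, solve)
    return seen
```

A test built on `threads_used` does not care where ThreadPoolExecutor is imported, so moving the import to module scope would not weaken it.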

Strengths

Substantial — calling out specifically because four agents converged on these:

  • runner.py:919-922 "no write lock needed" comment — load-bearing and exemplary (called out by three agents). The kind of comment that survives refactors and actively prevents a future contributor from re-introducing a redundant lock thinking it was protective. Best comment in the PR.
  • cli.py:79-89 click.IntRange(min=1) with show_default=True — closes the silent-fall-through-to-sequential bug class that plain type=int would have allowed. Right pin at the parse boundary.
  • tests/test_runner.py:1518-1564 test_parallel_one_uses_sequential_path — clever negative-space test. Patching ThreadPoolExecutor with side_effect=AssertionError("must not be used in sequential path") proves the absence of an import path, which is genuinely harder than asserting presence.
  • runner.py:946 # noqa: BLE001 correctly catches Exception not BaseException — Ctrl-C / SystemExit still propagate and cancel the sweep, which is exactly what you want. Two agents called this out as the right narrow choice.
  • Set-equality assertions at tests/test_runner.py:1581, 1611 correctly accommodate completion-order nondeterminism. Avoids the trap of list-equality assertions that would be flaky under parallel scheduling.
  • CLI tests assert exit_code != 2 rather than message-string matching — resilient to Click phrasing changes, the right semantic boundary for parse-time validation.
  • Inline closure _run_one at runner.py:922-936 is the right abstraction level — captures 9 invariants locally without polluting module namespace. A module-level _dispatch_one would add boilerplate for no real win.
  • run_benchmark docstring at runner.py:872-883 honestly documents completion-order JSONL ordering with mitigation (problem_id is self-contained, consumers can sort). Setting expectations correctly upfront.
  • Review-iteration discipline — applied 3/3 CodeRabbit inline findings verbatim with extra coverage (the two new IntRange pin tests went beyond what CR asked for), and declined the -> None annotation with documented rationale rather than silently ignoring. Exactly the response pattern that makes review productive.
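
To illustrate the Exception-versus-BaseException point from the runner.py:946 bullet, here is a small standalone sketch. `run_guarded` is a hypothetical stand-in for the worker wrapper, not the project's code; it shows that `except Exception` absorbs an ordinary failure while `KeyboardInterrupt` still propagates.

```python
def run_guarded(fn):
    """Return fn()'s result, or a crash marker for ordinary exceptions only."""
    try:
        return fn()
    except Exception as exc:  # deliberately not BaseException
        return f"crashed: {exc!r}"


def raise_value_error():
    raise ValueError("bad model response")


def raise_interrupt():
    raise KeyboardInterrupt


# An ordinary failure is converted into a value the sweep can record.
outcome = run_guarded(raise_value_error)

# KeyboardInterrupt derives from BaseException, so it escapes the guard.
interrupted = False
try:
    run_guarded(raise_interrupt)
except KeyboardInterrupt:
    interrupted = True
```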

Recommended action

This is close to merge-ready. I'd ask for one more iteration on I1-I5:

  1. I1 + I3 in one change — synthesise a ProblemResult on worker crash with traceback.format_exc() in error_message. The patch sketch above is ~10 lines and lands both fixes.
  2. I2 — wrap the sequential body in the same try/except so semantics match, OR document the asymmetry. Either is defensible.
  3. I4 + I5 — small test additions in the existing TestRunBenchmarkParallel class.
  4. S1-S5 are negotiable — your call. None are blocking.

After those land I'd approve and merge. Thanks for the careful work on this — the CR-iteration discipline and the load-bearing concurrency comment are both first-rate.

— Reviewed with a four-agent parallel pass (general code review, test coverage, silent-failure hunting, comment accuracy). Convergence noted above where multiple agents flagged the same issue.

Five Important issues and two of five suggestions from the
2026-05-22T20:44 CHANGES_REQUESTED review.

### I1 + I3 — Worker crashes vanish from JSONL (priority blocker)

Before this change, a worker exception was logged to stdout and the
loop `continue`d — no `ProblemResult` was written. A 60-problem
sweep with 2 crashes produced 58 JSONL rows; downstream
`vera-bench report` then showed "58/58 (100%)", silently shrinking
the denominator.

New `_crash_result(problem, exc, tb)` helper synthesises a
`ProblemResult` with `check_pass=False`, `run_correct=False`, and
the full `traceback.format_exc()` embedded in `error_message`.
Wired into both sequential and parallel paths via the new `_record`
helper so successes and crashes hit the same persistence machinery.
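
A rough sketch of the crash-row idea described above. The `ProblemResult` dataclass here is a reduced, hypothetical stand-in (the real class carries more fields), the `problem["id"]` key is an assumption, and `run_or_crash` is an illustrative wrapper rather than the code in runner.py.

```python
import traceback
from dataclasses import dataclass


@dataclass
class ProblemResult:  # hypothetical, reduced stand-in for the real class
    problem_id: str
    check_pass: bool
    run_correct: bool
    error_message: str = ""


def crash_result(problem: dict, exc: BaseException, tb: str) -> ProblemResult:
    """Build a visible failure row so the crash still counts in the denominator."""
    return ProblemResult(
        problem_id=problem["id"],  # assumed key name
        check_pass=False,
        run_correct=False,
        error_message=f"Worker crash: {exc!r}\n{tb}",
    )


def run_or_crash(problem: dict, worker) -> ProblemResult:
    """Run one problem; turn any ordinary exception into a crash row."""
    try:
        return worker(problem)
    except Exception as exc:
        # format_exc() must be called inside the except block to see the traceback.
        return crash_result(problem, exc, traceback.format_exc())


def exploding_worker(problem: dict) -> ProblemResult:
    raise RuntimeError("simulated worker crash")


row = run_or_crash({"id": "VB-X-2"}, exploding_worker)
```

With this shape, a 60-problem sweep with 2 crashes yields 60 rows, 2 of which are explicit failures.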

### I2 — Sequential / parallel error-handling asymmetry

Pre-fix: `--parallel 1` aborted on any worker exception, `--parallel
2+` logged-and-continued. A transient bad model response would kill
a 4-hour sweep on the sequential path but not the parallel one.

Now both paths wrap `run_single_problem` in the same `try/except`
and route crashes through `_crash_result` + `_record`. Same fault
semantics regardless of N.
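
The symmetry can be sketched as follows. This is an illustration under assumptions, not the actual `run_benchmark` body: `sweep`, `guarded`, and `record` are hypothetical names, and the guard is placed inside the submitted callable for brevity, whereas the real code may catch around `future.result()` instead.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def sweep(problems, worker, parallel=1):
    """Run every problem; a crash becomes a row on both paths."""
    rows = []

    def record(problem, outcome):
        rows.append((problem, outcome))

    def guarded(problem):
        try:
            return worker(problem)
        except Exception as exc:
            return f"crash: {exc!r}"

    if parallel == 1:
        for problem in problems:
            record(problem, guarded(problem))
    else:
        with ThreadPoolExecutor(max_workers=parallel) as pool:
            futures = {pool.submit(guarded, p): p for p in problems}
            for fut in as_completed(futures):
                record(futures[fut], fut.result())
    return rows


def flaky(problem):
    if problem == "VB-X-2":
        raise RuntimeError("simulated worker crash")
    return f"ok: {problem}"


problems = ["VB-X-1", "VB-X-2", "VB-X-3", "VB-X-4"]
sequential_rows = sweep(problems, flaky, parallel=1)
parallel_rows = sweep(problems, flaky, parallel=3)
```

Row order differs between the two paths (completion order under `parallel>1`), so comparisons should sort or use sets.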

### I4 — `progress.advance(task)` on exception path is now tested

`test_progress_advances_on_crash_path` patches `Progress` and asserts
`advance.call_count == len(problems)` even when one problem raises,
in both the sequential and parallel paths. A refactor that moved
`advance` into an `else:` branch would now fail this test cleanly.
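
The property under test can be shown in isolation. In this sketch a `MagicMock` stands in for the patched `Progress` object, and `run_all` is a hypothetical loop rather than the project's code; placing `advance` in `finally` is one way to get the behaviour the test pins.

```python
from unittest.mock import MagicMock


def run_all(problems, worker, progress, task):
    """Advance the progress bar once per problem, crash or not."""
    results = []
    for problem in problems:
        try:
            results.append(worker(problem))
        except Exception as exc:
            results.append(f"crash: {exc!r}")
        finally:
            progress.advance(task)  # runs on success and crash paths alike
    return results


def flaky(problem):
    if problem == "VB-X-2":
        raise RuntimeError("simulated worker crash")
    return f"ok: {problem}"


progress = MagicMock()
problems = ["VB-X-1", "VB-X-2", "VB-X-3", "VB-X-4"]
results = run_all(problems, flaky, progress, task="sweep")
assert progress.advance.call_count == len(problems)
```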

### I5 — Version propagation through `_run_one` closure is now tested

`test_bench_and_vera_version_propagate_to_workers` captures the
kwargs `run_single_problem` actually receives under `parallel=3`
and asserts both `bench_version` and `vera_version` came through.
Catches a future refactor that drops them from the kwargs forwarded
through the closure.
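
A self-contained sketch of the capture technique. `run_with_versions` is a hypothetical miniature of the closure-forwarding pattern, not the real `run_benchmark`; the version strings are arbitrary fixture values.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_versions(problems, run_single, *, bench_version, vera_version, parallel):
    """Forward both version strings to every worker call via a closure."""

    def _run_one(problem):
        return run_single(
            problem, bench_version=bench_version, vera_version=vera_version
        )

    with ThreadPoolExecutor(max_workers=parallel) as pool:
        return list(pool.map(_run_one, problems))


seen: list[dict] = []


def _capture(problem, **kw):
    # list.append is safe to call from several threads in CPython.
    seen.append(kw)
    return problem


run_with_versions(
    ["VB-X-1", "VB-X-2", "VB-X-3"],
    _capture,
    bench_version="0.0.11",
    vera_version="0.0.108",
    parallel=3,
)
expected = {"bench_version": "0.0.11", "vera_version": "0.0.108"}
assert len(seen) == 3
assert all(kw == expected for kw in seen)
```

Unlike a `lambda problem, **kw: ...` mock that swallows its kwargs, the capturing stub fails as soon as either version is dropped from the forwarded call.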

### S2 — Replace ThreadPoolExecutor patch with thread-identity test

`test_parallel_one_uses_sequential_path` now asserts behavior (every
call ran on `threading.main_thread()`) instead of patching
`concurrent.futures.ThreadPoolExecutor`. The test is robust to a
future refactor hoisting the import to module scope. Added a
counterpoint test (`test_parallel_two_actually_spawns_worker_threads`)
that confirms `parallel>1` does spawn workers.
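
The behaviour-over-implementation idea can be sketched like this. `dispatch` is a hypothetical stand-in for the two execution paths, and the sketch compares against the calling thread's ident, which coincides with `threading.main_thread()` when the test itself runs on the main thread.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def dispatch(problems, worker, parallel=1):
    """Sequential for parallel=1, thread pool otherwise."""
    if parallel == 1:
        return [worker(p) for p in problems]
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        return list(pool.map(worker, problems))


def thread_ident(_problem):
    """Report which thread executed this call."""
    return threading.get_ident()


caller = threading.get_ident()

# parallel=1: every call ran on the calling thread, no pool involved.
sequential_idents = set(dispatch(range(4), thread_ident, parallel=1))
assert sequential_idents == {caller}

# parallel=2: the counterpoint, calls ran on pool threads instead.
pooled_idents = set(dispatch(range(4), thread_ident, parallel=2))
assert caller not in pooled_idents
```

Neither assertion depends on where `ThreadPoolExecutor` is imported, which is what makes the check survive an import being hoisted to module scope.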

### S4 — Fix incorrect POSIX-atomicity claim in test docstring

The old docstring on `test_parallel_writes_are_serialised` said
"Python's GIL doesn't make file writes atomic — partial writes are
observable", which was wrong: short writes (< PIPE_BUF ~4096B)
with O_APPEND ARE atomic on POSIX. Replaced with an honest
explanation that the test proves serialisation comes from the
main-thread `as_completed` loop (not the lock that no longer
exists, and not POSIX guarantees we don't depend on).
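
As a sketch of the serialisation source named above: workers only compute and return rows, while the thread that iterates `as_completed` is the only one that writes. `sweep_to_jsonl` and the row shape are hypothetical; an in-memory buffer stands in for the output file.

```python
import io
import json
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


def sweep_to_jsonl(problem_ids, out, parallel):
    """Compute rows in workers; write them only from the collecting thread."""
    writer_threads = set()

    def work(pid):
        return {"problem_id": pid, "run_correct": True}

    with ThreadPoolExecutor(max_workers=parallel) as pool:
        futures = [pool.submit(work, pid) for pid in problem_ids]
        for fut in as_completed(futures):
            writer_threads.add(threading.get_ident())
            out.write(json.dumps(fut.result()) + "\n")
    return writer_threads


buf = io.StringIO()
ids = [f"VB-X-{i}" for i in range(20)]
writers = sweep_to_jsonl(ids, buf, parallel=8)
rows = [json.loads(line) for line in buf.getvalue().splitlines()]

# Exactly one thread ever wrote, and every line parsed as a whole row.
assert writers == {threading.get_ident()}
assert {row["problem_id"] for row in rows} == set(ids)
```

Because only one thread writes, no lock is needed and no file-system atomicity guarantee is relied on.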

### Updated existing test for new behavior

`test_parallel_worker_exception_continues` previously asserted
`len(results) == 3` (the crashed problem vanished). Now asserts
`len(results) == 4` (success rows + crash row) and verifies the
crash row carries `Worker crash:`, the original exception's repr,
and a traceback in `error_message`. Added a matching test
(`test_sequential_worker_exception_also_continues`) for the
sequential path's crash semantics.

### Deferred (negotiable suggestions)

- **S1** (no error handling on output write): file-write failures
  on the main thread still abort the sweep. Deferred — pre-existing
  on the sequential path too, and a sensible operator response
  (resume from JSONL) doesn't exist yet.
- **S3** (Kimi K2.5 anecdotal figures): kept as-is; they're motivating
  context, not a load-bearing claim.
- **S5** (20×8 stress overkill): kept — test runtime is sub-second
  and the larger scale catches more refactor failures.

All 13 TestRunBenchmarkParallel cases pass; `ruff check`,
`ruff format --check`, and `ruff check --select S` are all clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sunholo-voight-kampff

Contributor Author

@aallan — addressed in fa8d1ce. Going item-by-item against your four-agent review:

Important (all in)

  • I1 + I3 (priority) — Worker crashes now synthesise a visible ProblemResult row with check_pass=False, run_correct=False, and traceback.format_exc() embedded in error_message. Extracted into a _crash_result(problem, exc, tb) helper and a _record(problem_results) helper (both nested in run_benchmark to capture the closure variables cleanly) so successes and crashes hit the same persistence path. Your diff sketch caught both findings in one change — landed essentially verbatim, with the helper extracted so the sequential path can reuse it.
  • I2 (sequential / parallel symmetry) — Took the alignment option (recommended). Sequential run_single_problem call is now wrapped in the same try/except, routing crashes through _crash_result + _record. A 4-hour sweep now survives problem 47 of 60 regardless of --parallel N.
  • I4 (progress.advance on exception path): test_progress_advances_on_crash_path patches Progress and asserts advance.call_count == len(problems) even when one problem raises, in both sequential and parallel paths. A refactor that moves advance into an else: branch would now fail this test.
  • I5 (version propagation through closure): test_bench_and_vera_version_propagate_to_workers captures the kwargs run_single_problem actually receives under parallel=3 and asserts both bench_version and vera_version came through. Catches a future refactor dropping them.

Suggestions

  • S2 (patch-target fragility) — Replaced the ThreadPoolExecutor side_effect=AssertionError(...) patch with a thread-identity assertion: under parallel=1, every run_single_problem call must run on threading.main_thread(). Robust to a future refactor hoisting the import to module scope. Added a counterpoint test (test_parallel_two_actually_spawns_worker_threads) confirming parallel>1 actually spawns workers, since the negative-only test could otherwise pass if the parallel path was silently broken too.

  • S4 (incorrect POSIX-atomicity claim) — Fixed. The old docstring claimed "Python's GIL doesn't make file writes atomic — partial writes are observable", which is wrong: short O_APPEND writes < PIPE_BUF are atomic on POSIX. Replaced with an honest description: the test proves the main-thread as_completed loop is the serialisation source, not POSIX guarantees we don't depend on.

  • S1 (no error handling on output write) — Deferred. Pre-existing on the sequential path too, and a sensible operator response (resume from JSONL) doesn't exist yet, so wrapping in try/except just to log-and-continue past a disk-full would lose data without a recovery path. Happy to revisit alongside a resume-from-jsonl feature.

  • S3 (anecdotal Kimi K2.5 figures) — Kept as-is. They're motivating context in the docstring, not a load-bearing claim downstream code relies on. If you'd prefer them softened or removed I'll do that.

  • S5 (20×8 stress overkill) — Kept. Test runtime is sub-second and the larger scale catches more refactor failure modes than a single race would.

Updated existing test for new behavior

test_parallel_worker_exception_continues previously asserted len(results) == 3 (crashed problem vanished) and 3 JSONL lines. Now asserts len(results) == 4 and verifies the crash row carries Worker crash:, the original exception's repr, and a Traceback in error_message. Plus a new test_sequential_worker_exception_also_continues pinning the I2 symmetry.

13 TestRunBenchmarkParallel cases pass locally; ruff check . / ruff format --check . / ruff check --select S vera_bench/ all clean. Pushing now; CI will run momentarily.

Thanks again for the substantive review — the I1 finding (silent denominator change) was a real one I would not have caught, and the four-agent convergence on the strengths section was good signal too.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/test_runner.py`:
- Around line 1558-1560: The local helper _record_thread should be annotated
with precise Python 3.11+ type hints: add parameter and return types (e.g.,
typing.Callable-compatible signature) so the function signature explicitly types
`problem: dict` (or a more specific TypedDict if available) and `**kw: Any`, and
the return type as list of the test result type (e.g., list[ResultType] or
list[dict] if ResultType isn't defined); update the other similar helpers at the
indicated locations (around lines 1599-1601, 1660-1663, 1708-1711, 1812-1815,
1868-1870) with matching annotations, importing Any and other typing names as
needed to satisfy repository typing rules.
- Line 1688: Replace the substring-based selection for crash_row with JSON-first
selection: parse all log lines (e.g., records = [json.loads(ln) for ln in
lines]) and find the record whose "problem_id" equals the project's worker-crash
problem id (use the actual constant/name used elsewhere), assign that record to
crash_row, then assert the message content separately (e.g., assert "Worker
crash" in crash_row["message"]); apply the same change to the other occurrence
around the second instance. Ensure you reference the existing crash_row and
lines variables when making the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 65240d5f-05bc-4818-988e-ab91d0fceae3

📥 Commits

Reviewing files that changed from the base of the PR and between b936bc5 and fa8d1ce.

📒 Files selected for processing (3)
  • tests/test_runner.py
  • vera_bench/cli.py
  • vera_bench/runner.py

…ral row select)

Two new CodeRabbit findings posted 2026-05-23T04:07Z after the
I1-I5 commit:

### tests/test_runner.py — Type hints on 6 inner helpers

Test-side closures (`_record_thread` ×2, `_side_effect` ×3, `_capture`)
were untyped. Per the project's "Python 3.11+, type hints everywhere"
rule, annotated all six with:

    def _xyz(
        problem: dict[str, object], **kw: object
    ) -> list[ProblemResult]

`ProblemResult` was already imported at module scope.

### tests/test_runner.py — Crash row selection by problem_id, not substring

Replaced the brittle filter:

    crash_row = next(json.loads(ln) for ln in lines if "Worker crash" in ln)

with a structural selector:

    rows = [json.loads(ln) for ln in lines]
    crash_row = next(row for row in rows if row["problem_id"] == "VB-X-2")

The message-content assertions ("simulated worker crash", "RuntimeError",
"Traceback") remain — they're now testing the message-content contract
explicitly rather than relying on it implicitly through the selector.
Applied to both `test_parallel_worker_exception_continues` and
`test_sequential_worker_exception_also_continues`.
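
To make the brittleness concrete, here is a small runnable comparison of the two selectors. The two JSONL rows are invented for illustration: the first is a non-crash row whose message happens to contain the phrase "Worker crash".

```python
import json

# Illustrative rows only; not real benchmark output.
lines = [
    json.dumps(
        {"problem_id": "VB-X-1", "error_message": "stdout mentioned Worker crash handling"}
    ),
    json.dumps(
        {
            "problem_id": "VB-X-2",
            "error_message": "Worker crash: RuntimeError('simulated worker crash')",
        }
    ),
]

# Substring filter: picks whichever line mentions the phrase first.
by_substring = next(json.loads(ln) for ln in lines if "Worker crash" in ln)

# Structural selector: parse first, then pick by problem_id.
rows = [json.loads(ln) for ln in lines]
by_id = next(row for row in rows if row["problem_id"] == "VB-X-2")

assert by_substring["problem_id"] == "VB-X-1"  # wrong row selected
assert by_id["problem_id"] == "VB-X-2"
assert "Worker crash" in by_id["error_message"]  # content asserted separately
```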

All 13 TestRunBenchmarkParallel cases pass; lint clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@aallan aallan left a comment

Owner

Approving — everything in scope is delivered

All five Important issues from the 2026-05-22T20:44 review are resolved, two of five suggestions taken with documented rationale on the rest, and both follow-up CodeRabbit findings on the response commit have already been addressed and confirmed by CR. CI is green across all 9 checks on 7a161f84.

Verification

| Ask | Delivered | Location |
| --- | --- | --- |
| I1 + I3: worker crashes vanish from JSONL + lost traceback | Unified _crash_result() synthesises a ProblemResult with traceback.format_exc() in error_message; new _record() helper unifies in-memory + JSONL persistence between success and crash paths | runner.py:894-924 |
| I2: sequential / parallel error-handling asymmetry | Both paths now wrap run_single_problem in identical try/except and route crashes through _crash_result + _record. Same fault semantics regardless of --parallel N | runner.py:927-1001 |
| I4: progress.advance(task) untested on exception path | New test_progress_advances_on_crash_path patches Progress, runs both parallel=2 and parallel=1 back-to-back, asserts advance.call_count == 4 (3 successes + 1 crash) on each path | tests/test_runner.py:1814-1868 |
| I5: version propagation through closure untested | New test_bench_and_vera_version_propagate_to_workers captures the kwargs run_single_problem actually receives via a _capture closure under parallel=3 and asserts both versions arrive on every call | tests/test_runner.py:1870-1908 |
| S2: patch-target fragility | Replaced ThreadPoolExecutor patch with threading.current_thread() is threading.main_thread() assertion (behaviour over implementation); added counterpoint test_parallel_two_actually_spawns_worker_threads | tests/test_runner.py:1543-1625 |
| S4: misleading POSIX-atomicity docstring | Rewritten to honest "main-thread as_completed serialisation" explanation | tests/test_runner.py:1749-1791 |

Deferred suggestions (S1, S3, S5) all have documented rationale in the commit message — file-write error handling is a pre-existing cross-cutting concern, the Kimi K2.5 anecdote is motivating context, and the 20×8 stress test runs sub-second so the larger scale is free.

Worth calling out

The response on I1 + I3 went structurally tighter than my review prompted. The minimal fix would have been "add traceback.format_exc() to the existing log line." Instead you refactored both paths through _crash_result() + _record() so successes and crashes hit the same persistence machinery — that abstraction makes "did this crash get recorded?" a property of the helper rather than a branching obligation at every call site. Future contributors who add a new branch (e.g., a timeout path) inherit the persistence semantics for free.

The I4 test design is also tighter than the brief: I asked for an assertion on the parallel path; you patched Progress, exercised both parallel=2 and parallel=1 back-to-back with a reset between, and pinned advance.call_count == 4 on both. So I2's symmetric-fault-semantics design gets defended by I4's test on both paths in one go.

The S2 thread-identity replacement is the right kind of trade — implementation tests (patching imports) become brittle the moment someone moves an import; behaviour tests (assert this code runs on the main thread) survive that refactor cleanly. The counterpoint test ensures the inversion still holds.

And the two follow-up CodeRabbit findings on the response commit (type hints on six inner helpers, structural row selection by problem_id instead of message substring) were addressed in 7a161f84 within seven minutes of CR posting them, with CR confirming both. That iteration discipline keeps review productive.

Approved

CI green, every priority ask delivered, deferred items explicitly tracked. Ready to merge.

@aallan aallan merged commit 6db02f4 into aallan:main May 25, 2026
10 checks passed
aallan added a commit to sunholo-voight-kampff/vera-bench that referenced this pull request May 25, 2026
Positional conflict only: both aallan#73 (TestRunBenchmarkParallel) and aallan#70
(TestAilangLiteral / TestStripAilangMain / TestEvaluateAilangCode /
TestLoadAilangPrompt / TestAilangPrompt / TestAilangCLI) appended new
test classes at the end of tests/test_runner.py. Resolved by keeping
both groups in order: TestRunBenchmarkIntegration -> TestRunBenchmarkParallel
(from aallan#73) -> AILANG test classes (from aallan#70).

No logical conflict between the PRs. PR aallan#73 modified run_benchmark
(with new _crash_result / _record helpers at lines ~1242-1280);
PR aallan#70 modified the AILANG evaluator paths (lines ~554-831) and added
the AILANG dispatch branch in run_single_problem (lines ~975, 1017,
1107). The runner.py three-way merge resolved cleanly because the
regions are disjoint; only the test file needed manual stitching.

Verification:
- ruff check . / ruff format --check . both clean
- AST parse OK on merged test file
- All three target classes present exactly once (no duplicates)
- Final structure: TestRunBenchmarkIntegration -> TestRunBenchmarkParallel ->
  AILANG classes, separated by header comments

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
aallan added a commit that referenced this pull request May 25, 2026
Version bump
============

- pyproject.toml: 0.0.11 -> 0.0.12
- vera_bench/__init__.py fallback: 0.1.0 -> 0.0.12 (the fallback only
  fires when the package isn't installed via metadata; the canonical
  source is still pyproject.toml + importlib.metadata)
- vera_bench/prompts.py _USER_AGENT: "vera-bench/0.0.9" -> "vera-bench/0.0.12"
  (was stuck at 0.0.9 since that release)

Documentation consistency
=========================

CHANGELOG.md
- New [0.0.12] section covering the AILANG + --parallel work from
  #70 and #73, plus the worker-crash JSONL fix, the tag-classification
  regex, and the sequential/parallel symmetry fix
- Compatibility note: 0.0.12 is purely additive for Vera, Python,
  TypeScript, and Aver scoring

CLAUDE.md
- Project description now mentions AILANG alongside Aver
- solutions/ directory list updated to include ailang
- New AILANG subsection documenting CLI flag conventions
  (--quiet/--caps IO/--entry main/--relax-modules, AILANG_TRACE=off,
  *_API_KEY scrubbing)
- New "Adding more comparison languages" subsection noting OpenRouter
  / MOONSHOT / OPENROUTER env var support
- Commands list adds --language ailang for both `run` and `baselines`,
  plus --parallel N with explanatory paragraph

ROADMAP.md
- "Where we are" prepended with v0.0.12 summary
- Milestone 1 checks off AILANG language support and --parallel N

README.md
- Quick start adds --parallel N example
- Supported providers list adds OpenRouter and OPENROUTER_API_KEY

KNOWN_ISSUES.md
- Chart-pin section dropped stale "v0.0.9" references in favor of
  generic "current-version" phrasing — the warning is the same shape
  regardless of which version is current
- Removal trigger updated to reflect that the trigger is "when README
  is rewritten against current data", not a specific version

scripts/README.md
- Same chart-pin staleness fix as KNOWN_ISSUES.md

Out of scope
============

`scripts/run_full_benchmark.py` was not updated to include AILANG
targets — PR #70 added the language support but missed the sweep
script. That's a real gap but it's a code change, not a docs change.
Spawned a follow-up task to extend the sweep script to 10 targets
(LLM + baseline for AILANG) plus the matching scripts/README.md
updates.

The fixture values "0.0.11" / "0.0.108" in tests/test_runner.py
(I5 propagation test) are arbitrary strings used to verify kwargs
forwarding through the parallel-path closure — they're not assertions
about the current package version. Left as-is.

Verification
============

- ruff check . / ruff format --check . both clean
- 229 tests pass under pytest (1 known-flaky Rich console-width test
  unrelated to these changes; CI runners use wider console width)
- importlib.metadata.version("vera-bench") still resolves correctly
  (the fallback at __init__.py is only hit when the package metadata
  isn't installed, e.g., a raw git checkout without `pip install -e .`)
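
The fallback behaviour described above can be sketched as follows. `resolve_version` is a hypothetical helper written for illustration; how `vera_bench/__init__.py` actually structures this is an assumption.

```python
from importlib.metadata import PackageNotFoundError, version


def resolve_version(dist_name: str, fallback: str) -> str:
    """Prefer installed package metadata; use the fallback only if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # e.g. a raw git checkout that was never pip-installed
        return fallback


# A distribution name that is assumed not to be installed anywhere.
resolved = resolve_version("surely-not-an-installed-dist-xyz", "0.0.12")
```

When the package is installed, the metadata value wins, so the fallback string only needs to be roughly current.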

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>