docs: document all scripts in scripts/README.md; make plot_results data-driven by aallan · Pull Request #59 · aallan/vera-bench

aallan · 2026-04-17T10:19:25Z

Summary

Add scripts/README.md covering all three operational scripts (run_full_benchmark.py, plot_results.py, validate_problems.py)
Refactor plot_results.py to read metrics from JSONL files (no more hand-copied percentages) and add an --extra flag for opt-in comparison languages (Aver is the first; more are pluggable)
Add assets/benchmark_v0.0.9_with-aver.png as the 5-mode variant; assets/benchmark_v0.0.9.png regenerated and byte-equivalent to the previous version

Why now

Previous state: scripts/ had no README, and the claim that each script was "self-documenting via --help" was thin. For example, run_full_benchmark.py --help doesn't mention the 8-target list, the per-provider env var mapping, how to sweep multiple models, expected timings, or the output filename convention.

Hand-copied chart data was also a foot-gun — if percentages in plot_results.py drifted from the JSONL files, the headline chart would silently misrepresent results.

What's in the README

run_full_benchmark.py — what the eight targets are, env var per provider, sweep recipe for all six models, timing table, output file naming, exit-code behaviour
plot_results.py — usage, file-discovery glob patterns, how to add a new model or comparison language, colour palette, reproducibility notes
validate_problems.py — column-by-column interpretation, when to run

Test plan

python scripts/plot_results.py --version 0.0.9 → regenerates default chart (byte-equivalent to previous benchmark_v0.0.9.png)
python scripts/plot_results.py --version 0.0.9 --extra aver → produces benchmark_v0.0.9_with-aver.png with Aver as a 5th mode
python scripts/plot_results.py --version 0.0.7 → historical chart reproduces from archived JSONL files
ruff check scripts/plot_results.py passes
Delta-chart title + x-axis auto-adapt when extras are added

Not tagged as a release — scripts/docs only.

Generated with Claude Code

Summary by CodeRabbit

Documentation
- Added a scripts README documenting benchmark, plotting and validation workflows, outputs, resume/failure behaviour; updated README/ROADMAP links to the new results-graph image.
New Features
- CLI-driven chart generator with versioned outputs and optional Aver comparison; improved data extraction across multiple comparison modes.
- Benchmark runner now writes resumable JSONL results, consolidated timing/status outputs and continues remaining targets on failures.
Chores
- Updated .gitignore to exclude local timing and generated chart files.

Previously the scripts/ directory had no README — each script was claimed to be "self-documenting via --help", but --help coverage was thin (run_full_benchmark.py omits the 8-target list, env-var mapping, multi-model sweeps, timing, and output filenames). scripts/README.md now covers all three: - run_full_benchmark.py: target list, provider env vars, sweep recipe, timing expectations, output filenames, exit-code semantics - plot_results.py: usage, file-discovery pattern, model-registry extension, colour palette, reproducibility, new --extra flag - validate_problems.py: column-by-column output interpretation, when to run Related plot_results.py refactor: - Read run_correct from JSONL files via compute_metrics() instead of hand-copied percentages (eliminates transcription errors) - Auto-detect Vera compiler version and problem count from filenames - Add MODELS registry keyed on file_prefix; replace hardcoded FLAGSHIP/SONNET dicts - Add --extra CLI flag (opt-in comparison languages; Aver registered today, others pluggable) - Output suffix _with-<lang> when extras used; default path unchanged - Delta-chart title + x-axis auto-adapt to the comparison-language list Regenerate assets/benchmark_v0.0.9.png (byte-equivalent to previous); add assets/benchmark_v0.0.9_with-aver.png as the 5-mode variant. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai · 2026-04-17T10:19:56Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds an in-repo scripts README; refactors scripts/plot_results.py into a data-driven, CLI-driven renderer with ModelSpec, JSONL ingestion, version detection and Aver support; updates .gitignore and README/ROADMAP chart references.

Changes

Cohort / File(s)	Summary
Documentation `scripts/README.md`	New README describing `run_full_benchmark.py`, `plot_results.py`, and `validate_problems.py`: execution matrix, provider key resolution, resumable JSONL results, failure recording, output locations, and chart-generation conventions.
Plotting & Mode Extraction `scripts/plot_results.py`	Major refactor: introduce `ModelSpec` and `MODELS`/`MODE_PATTERNS`; add JSONL loaders (`_load_jsonl`, `extract_data`, `_find_result_file`); add version/problem detectors (`_detect_vera_version`, `_detect_problem_count`, `_default_version`, `_version_to_filename`); generalise plotting (`plot_tier`, `plot_all_modes`, `plot_vera_vs_comparison`), accept CLI flags (`--version`, `--results-dir`, `--output`, `--extra`), add Aver mode and change output filename/version semantics.
Repository metadata & docs links `.gitignore`, `README.md`, `ROADMAP.md`	Ignore `results/timing.json` and generated chart assets (`assets/results-graph_.png`, `assets/benchmark_.png`); update README/ROADMAP references from `assets/benchmark_v0.0.7.png` to `assets/results-graph.png` and to versioned snapshot links.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Update README with v0.0.7 multi-model benchmark results #43: Refactors and extensions to scripts/plot_results.py — directly related to the plotting redesign and ModelSpec work.
Implement LLM runner harness (Phase 2) #3: Adds/updates vera_bench.metrics.compute_metrics and JSONL metric shapes consumed by the new extractor.
Increase test coverage to 83%, version in filenames (v0.0.6) #36: Introduces versioned result filenames / Vera-version metadata consumed by the new detection and filename logic.

Suggested labels

ci, docs

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarises the two main objectives: documenting scripts in scripts/README.md and refactoring plot_results.py to be data-driven.
Docstring Coverage	✅ Passed	Docstring coverage is 83.33% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch docs/scripts-readme

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

- Add results/timing.json to .gitignore. It's transient per-run state (wall-clock + status per target) and was never previously committed — but nothing was stopping `git add .` from slurping it in. - Correct the scripts/README.md reproducibility note: results/*.jsonl are gitignored, not committed. Old wording implied the JSONL files live in git, which was wrong. Reproducing a historical chart means rerunning `vera-bench run` / `baselines` to regenerate the JSONL, then running plot_results.py. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/plot_results.py`:
- Around line 148-149: The _load_jsonl function should skip malformed JSON lines
instead of crashing: wrap the json.loads call in a try/except that catches
json.JSONDecodeError, log a warning including the file path and the offending
line (use Python's logging or warnings) and continue to the next line, returning
only successfully parsed dicts; keep function name _load_jsonl and the Path
parameter 'path' as the location to read and report in the warning.
- Around line 262-267: The code computes n = len(comparison_modes) and uses it
to compute height = 0.7 / n and offsets, which throws ZeroDivisionError if
comparison_modes is empty; add a defensive guard at the start of the plotting
block or function (referencing comparison_modes, n, height, offsets, and models)
to handle an empty list: either return early (no-op) or raise a clear
ValueError, or set n = max(1, n) before computing height so division by zero
cannot occur, and ensure offsets remain valid for downstream plotting.
- Around line 423-429: The _default_version function currently parses
pyproject.toml by string matching which is fragile; replace it with proper TOML
parsing using tomllib (stdlib in Python 3.11+) inside scripts/plot_results.py:
open and load the TOML with tomllib.load, then read the version from well-known
locations (first check data.get("project", {}).get("version"), then fall back to
data.get("tool", {}).get("poetry", {}).get("version") and similar tool-specific
keys), and return that value or "0.0.0" as a final fallback; ensure the code
handles environments without tomllib (if you must support older Pythons, use a
conditional import or an optional dependency) and update the _default_version
function accordingly.

In `@scripts/README.md`:
- Around line 255-258: In the README.md fenced code block that shows the warning
example (the block starting with ``` and the lines "Warnings: Kimi K2.5 / Vera
NL: no file matching bench-0.0.9"), add a language specifier such as text or
console by changing the opening fence from ``` to ```text so the snippet renders
consistently; locate the code block text in the scripts/README.md and update the
opening fence accordingly.
- Around line 288-292: Update the README text that says "You'll also need to
update the three `plot_*` functions, which currently assume four modes." to
reflect the refactor: removing that claim and noting that `plot_results.py`'s
plotting functions accept dynamic mode lists via the `comparison_modes` and
`modes` parameters so you only need to add entries to `MODE_PATTERNS` and
`COLORS` (no edits to `plot_*` functions required). Reference the
`plot_results.py` module and the `comparison_modes`/`modes` parameters in the
updated sentence to guide readers.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7a5df7a8-2d0d-45c8-931d-161422090b74

📥 Commits

Reviewing files that changed from the base of the PR and between efcfd3a and a82afae.

⛔ Files ignored due to path filters (2)

assets/benchmark_v0.0.9.png is excluded by !**/*.png
assets/benchmark_v0.0.9_with-aver.png is excluded by !**/*.png

📒 Files selected for processing (2)

scripts/README.md
scripts/plot_results.py

codecov · 2026-04-17T10:28:11Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.27%. Comparing base (efcfd3a) to head (5c2c290).
⚠️ Report is 9 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main      #59   +/-   ##
=======================================
  Coverage   83.27%   83.27%           
=======================================
  Files          10       10           
  Lines        1363     1363           
=======================================
  Hits         1135     1135           
  Misses        228      228

Flag	Coverage Δ
python	`83.27% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Only one chart should be version-controlled: the one embedded in the top-level README. Prior state committed v0.0.7 and v0.0.9 variants alongside each other, which invites confusion about which is "the" benchmark chart. Changes: - plot_results.py: default output is now assets/results-graph.png. Suffix conventions unchanged: _v{VER} for historical versions, _with-{lang} for --extra. Suffixed variants are always gitignored. - .gitignore: ignore assets/results-graph_*.png (any variant with a suffix) and assets/benchmark_*.png (catches old-convention files). - assets/: delete benchmark_v0.0.7.png, benchmark_v0.0.9.png, benchmark_v0.0.9_with-aver.png. Add results-graph.png pinned to v0.0.7 content (generated via --version 0.0.7) so it matches the v0.0.7 narrative below it in the top-level README. The pin will be released to v0.0.9 content along with the v0.0.9 writeup. - README.md: update image path to assets/results-graph.png; keep v0.0.7 caption and narrative unchanged. - ROADMAP.md: swap broken assets/benchmark_v0.0.7.png references for v0.0.7-tag permalinks to the GitHub blob view. - scripts/README.md: update filename convention table, add overwrite-gotcha callout (default invocation regenerates from current pyproject version, which would clobber the v0.0.7 pin). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai

♻️ Duplicate comments (1)

scripts/README.md (1)
255-258: 🧹 Nitpick | 🔵 Trivial

Add language specifier to fenced code block.

The warning example code block lacks a language specifier. Adding text or console improves rendering consistency.
📝 Suggested fix
-```
+```text
 Warnings:
   Kimi K2.5 / Vera NL: no file matching bench-0.0.9
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>
Verify each finding against the current code and only fix it if needed.

In @scripts/README.md around lines 255 - 258, Update the fenced code block
containing the example starting with the line "Warnings:" so it includes a
language specifier (e.g., add text or console) at the top of the block to
improve rendering; locate the block that currently shows "Warnings: Kimi K2.5 /
Vera NL: no file matching bench-0.0.9" and prefix the opening fence with the
chosen language.
</details>

</blockquote></details>

</blockquote></details>

<details>
<summary>🤖 Prompt for all review comments with AI agents</summary>
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In @scripts/README.md:

Around line 255-258: Update the fenced code block containing the example
starting with the line "Warnings:" so it includes a language specifier (e.g.,
add text or console) at the top of the block to improve rendering; locate
the block that currently shows "Warnings: Kimi K2.5 / Vera NL: no file matching
bench-0.0.9" and prefix the opening fence with the chosen language.
</details>

---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: Path: .coderabbit.yaml

**Review profile**: ASSERTIVE

**Plan**: Pro

**Run ID**: `a810fa54-225d-42b6-b083-8e0264853e42`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between a82afae0e8c77e8c99b7c710914317e0867fc92a and 9296c4d0ace7c276fdd3ab2febe4ab29cef6fd9b.

</details>

<details>
<summary>📒 Files selected for processing (2)</summary>

* `.gitignore`
* `scripts/README.md`

</details>

</details>

coderabbitai

♻️ Duplicate comments (5)

scripts/README.md (2)
275-278: 🧹 Nitpick | 🔵 Trivial

Add a language specifier to the fenced code block.

The warning example lacks a language specifier (e.g. text or console), which affects rendering consistency.
-```
+```text
 Warnings:
   Kimi K2.5 / Vera NL: no file matching bench-0.0.9
<details>
<summary>🤖 Prompt for AI Agents</summary>
Verify each finding against the current code and only fix it if needed.

In @scripts/README.md around lines 275 - 278, The fenced code block showing the
warning example is missing a language specifier; update that fenced block (the
"Warnings: Kimi K2.5 / Vera NL: no file matching bench-0.0.9" example) to
include a language tag such as "text" (i.e., change totext) so it
renders consistently.
</details>

---

`308-312`: _⚠️ Potential issue_ | _🟡 Minor_

**Documentation inconsistency: plot functions now accept dynamic mode lists.**

Lines 310–312 claim you need to update the three `plot_*` functions when adding a new mode, but the refactored `plot_results.py` accepts mode lists dynamically via `comparison_modes` and `modes` parameters—no function edits required.

```diff
 ### Adding a new mode
 
 Add an entry to `MODE_PATTERNS` *and* add the colour to `COLORS`. You'll
-also need to update the three `plot_*` functions, which currently assume
-four modes.
+also need to register it in `OPTIONAL_COMPARISON_MODES` if it should be
+opt-in via `--extra`, or add it to `DEFAULT_COMPARISON_MODES` if it should
+always appear. The plot functions accept dynamic mode lists and require
+no changes.
```

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@scripts/README.md` around lines 308 - 312, Update the README to remove the
incorrect claim that the three plot_* functions must be edited when adding a new
mode: state that adding a new entry to MODE_PATTERNS and a colour to COLORS is
sufficient because plot_results.py handles dynamic mode lists via the
comparison_modes and modes parameters (reference MODE_PATTERNS, COLORS,
plot_results.py, and the comparison_modes/modes parameters).
```

</details>

</blockquote></details>
<details>
<summary>scripts/plot_results.py (3)</summary><blockquote>

`148-149`: _🧹 Nitpick_ | _🔵 Trivial_

**Consider adding resilience for malformed JSONL lines (optional).**

If a result file contains a malformed JSON line, `json.loads` will raise `JSONDecodeError` and abort chart generation. For operational tooling this fail-fast behaviour is defensible, but logging a warning and skipping bad lines would be more graceful.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@scripts/plot_results.py` around lines 148 - 149, The helper _load_jsonl
currently calls json.loads on every line which will raise JSONDecodeError and
abort when encountering malformed lines; change _load_jsonl to iterate lines,
try to json.loads each one, catch JSONDecodeError (and any other ValueError),
log a warning via the module logger with context (file path and line number or
snippet), skip the bad line, and continue returning only successfully parsed
dicts so chart generation is resilient to malformed JSONL entries.
```

</details>

---

`262-267`: _🧹 Nitpick_ | _🔵 Trivial_

**Minor edge case: empty `comparison_modes` causes division by zero (optional guard).**

If `comparison_modes` is empty, `n = 0` leads to `ZeroDivisionError` at line 264. The current `main()` flow always includes at least Python + TypeScript, so this won't occur in practice, but a defensive guard would make the function safer for reuse.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@scripts/plot_results.py` around lines 262 - 267, The code computes n =
len(comparison_modes) and then divides by n, which will raise ZeroDivisionError
when comparison_modes is empty; update the plotting logic in this block (where
variables y, n, height, offsets are defined) to guard against n == 0 by either
returning/skipping plotting early when comparison_modes is empty or treating n
as 1 and using a single centered offset (e.g., offsets = [0]) so height
calculation and offsets don't divide by zero; make the chosen behavior
consistent with the rest of the function (skip plotting or render a single bar)
and adjust downstream code that expects offsets accordingly.
```

</details>

---

`423-429`: _🧹 Nitpick_ | _🔵 Trivial_

**Fragile pyproject.toml parsing (optional improvement).**

The version extraction assumes the first line starting with `version` is the project version. If `pyproject.toml` gains additional `version` keys in tool sections (e.g. `[tool.black]`), this could pick up the wrong one. Consider using `tomllib` (stdlib in 3.11+) for robust parsing.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@scripts/plot_results.py` around lines 423 - 429, The current _default_version
function is fragile because it string-parses pyproject.toml and may pick up
non-project "version" keys; change it to parse pyproject.toml with the TOML
library (use stdlib tomllib on Python 3.11+) and extract the canonical project
version (e.g., check keys like project.version and, if you need Poetry support,
tool.poetry.version) instead of line-based matching; update the function
_default_version to open and tomllib.load the pyproject file, return the
appropriate version field if present, and keep the existing "0.0.0" fallback if
parsing or keys are missing (or fallback to the current string-scan logic only
as a last resort).
```

</details>

</blockquote></details>

</blockquote></details>

<details>
<summary>🤖 Prompt for all review comments with AI agents</summary>
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In @scripts/plot_results.py:

Around line 148-149: The helper _load_jsonl currently calls json.loads on
every line which will raise JSONDecodeError and abort when encountering
malformed lines; change _load_jsonl to iterate lines, try to json.loads each
one, catch JSONDecodeError (and any other ValueError), log a warning via the
module logger with context (file path and line number or snippet), skip the bad
line, and continue returning only successfully parsed dicts so chart generation
is resilient to malformed JSONL entries.

Around line 262-267: The code computes n = len(comparison_modes) and then
divides by n, which will raise ZeroDivisionError when comparison_modes is empty;
update the plotting logic in this block (where variables y, n, height, offsets
are defined) to guard against n == 0 by either returning/skipping plotting early
when comparison_modes is empty or treating n as 1 and using a single centered
offset (e.g., offsets = [0]) so height calculation and offsets don't divide by
zero; make the chosen behavior consistent with the rest of the function (skip
plotting or render a single bar) and adjust downstream code that expects offsets
accordingly.

Around line 423-429: The current _default_version function is fragile because
it string-parses pyproject.toml and may pick up non-project "version" keys;
change it to parse pyproject.toml with the TOML library (use stdlib tomllib on
Python 3.11+) and extract the canonical project version (e.g., check keys like
project.version and, if you need Poetry support, tool.poetry.version) instead of
line-based matching; update the function _default_version to open and
tomllib.load the pyproject file, return the appropriate version field if
present, and keep the existing "0.0.0" fallback if parsing or keys are missing
(or fallback to the current string-scan logic only as a last resort).

In @scripts/README.md:

Around line 275-278: The fenced code block showing the warning example is
missing a language specifier; update that fenced block (the "Warnings: Kimi K2.5
/ Vera NL: no file matching bench-0.0.9" example) to include a language tag such
as "text" (i.e., change totext) so it renders consistently.

Around line 308-312: Update the README to remove the incorrect claim that the
three plot_* functions must be edited when adding a new mode: state that adding
a new entry to MODE_PATTERNS and a colour to COLORS is sufficient because
plot_results.py handles dynamic mode lists via the comparison_modes and modes
parameters (reference MODE_PATTERNS, COLORS, plot_results.py, and the
comparison_modes/modes parameters).
</details>

---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: Path: .coderabbit.yaml

**Review profile**: ASSERTIVE

**Plan**: Pro

**Run ID**: `3e744ba8-02e6-4f52-9abf-3ecc1569fbaf`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between 9296c4d0ace7c276fdd3ab2febe4ab29cef6fd9b and 74fa6abb6a734d5d3ceb89aa449e2bd5bff3682c.

</details>

<details>
<summary>⛔ Files ignored due to path filters (2)</summary>

* `assets/benchmark_v0.0.7.png` is excluded by `!**/*.png`
* `assets/results-graph.png` is excluded by `!**/*.png`

</details>

<details>
<summary>📒 Files selected for processing (5)</summary>

* `.gitignore`
* `README.md`
* `ROADMAP.md`
* `scripts/README.md`
* `scripts/plot_results.py`

</details>

</details>

- plot_results.py: parse pyproject.toml with tomllib (stdlib in 3.11+ per CLAUDE.md) instead of line-by-line string matching. Robust against unrelated `version` keys in other sections. Falls back to PEP 621 project.version then poetry's tool.poetry.version, then "0.0.0". - scripts/README.md: add `text` language to the warning-example fenced block so GitHub's syntax highlighter doesn't guess. - scripts/README.md: the "Adding a new mode" section claimed the three plot_* functions would need edits — that was stale. After the dynamic mode-list refactor they take comparison_modes/modes parameters and need no edits. Also note _COMPILER_SUFFIXED for compiler-versioned filenames. Rejected CodeRabbit findings (commented inline on PR): - _load_jsonl error-swallowing: inputs are machine-written by vera-bench; failing loud on malformed lines is correct. - ZeroDivisionError guard on comparison_modes: list is constructed from a hardcoded non-empty default; the empty case is unreachable. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Whitespace-only — CI lint includes 'ruff format --check'. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/plot_results.py`:
- Around line 2-15: Update the module docstring to match the current output
convention used by main(): replace the outdated reference to
`assets/benchmark_v{VERSION}.png` with the new default path (e.g.
`assets/results-graph...`) so that ArgumentParser(description=__doc__) and the
script's --help text are accurate; locate the top-level docstring in
scripts/plot_results.py and edit the sentence(s) describing the generated file
name/output path accordingly, keeping the usage examples consistent with the
current --output/default behavior.
- Around line 400-421: The title detection functions currently rescan all
matching Vera files and can pick stale runs; update them to derive metadata from
the actual file chosen by _find_result_file() instead of globbing. Modify
_detect_vera_version() and _detect_problem_count() (or replace them) so they
accept the single selected Path (the value returned by _find_result_file()) and
parse that file's filename/stem for the compiler version and read its contents
to count unique problem_id values; alternatively change the call sites to call
_find_result_file() first and pass its Path into these functions rather than
letting those functions perform their own glob().
- Around line 182-184: compute_metrics is being called without excluding Tier 5,
which mixes Tier‑5 language-specific tasks into the headline cross-language
rates; update the call where metrics is computed (the compute_metrics(...,
_load_jsonl(path)) invocation near row[mode]) to pass the exclude_tiers argument
to omit tier 5 (e.g., exclude_tiers=[5] or the codebase's expected container) so
rate = metrics.run_correct_rate reflects only T1–T4; keep the rest of the
rounding/assignment to row[mode] unchanged.
- Around line 176-180: Replace the placeholder numeric 0 for missing JSONL
results with a proper missing-value sentinel (e.g., use numpy.nan or pandas.NA)
so the delta chart can skip unavailable data: change the assignment row[mode] =
0 to row[mode] = np.nan (ensure import numpy as np) in the shown branch where
warnings are appended (reference variables: row, mode, warnings, model.display),
and make the analogous change for the other missing-data blocks flagged in the
same file (the region you noted around lines 257–292). Also verify any
downstream delta/comparison logic (the code that computes deltas from the
DataFrame) uses pd.isna / np.isnan checks so it skips or marks comparisons when
values are NaN instead of treating them as 0.

In `@scripts/README.md`:
- Around line 317-325: Update the README to describe panel 3 as dynamic rather
than a fixed "Does Vera beat Python / TypeScript?" by noting that
plot_vera_vs_comparison() appends every --extra comparison mode; change the
panel description to explain it renders horizontal delta bars for each
comparison mode (including any --extra values like "aver") and will vary
depending on invocation, and mention that default invocation shows Vera vs
Python/TypeScript while extra flags add additional comparison columns.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 39307976-318c-4dfa-85ef-4b2ebf9e50e2

📥 Commits

Reviewing files that changed from the base of the PR and between 74fa6ab and d249f5c.

📒 Files selected for processing (2)

scripts/README.md
scripts/plot_results.py

Uploaded benchmark_v0.0.7.png as a release asset on the v0.0.7 GitHub release, then switched ROADMAP.md and scripts/README.md to use the release-asset URL instead of a tag-scoped blob URL. Release assets survive tag/repo restructuring and live on the release page, so they're more discoverable than deep-linked tag views. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Three CodeRabbit findings this round: 1. Module docstring still referenced the old `benchmark_v{VERSION}.png` filename convention. Updated to reflect the current `assets/results-graph.png` default plus suffix variants. 2. `_detect_vera_version` and `_detect_problem_count` re-globbed `results/*-bench-{ver}-vera-*.jsonl` independently of which files `_find_result_file` actually selected. With _find_result_file's "newest-by-mtime" tie-breaker, the two lookups could disagree: the chart would plot one set of files while the subtitle described another. In practice no drift exists today, but it's a real bug waiting to happen. Fix: extract_data() now also returns `used_paths` — the JSONL files it actually consulted. The two _detect_* functions take that list instead of globbing, so subtitle metadata is consistent with the plotted data by construction. 3. README "Chart layout" section described panel 3 as a fixed "Does Vera beat Python / TypeScript?" and panel 4 as "four modes". Both are actually dynamic — the panel-3 title is rebuilt from the active comparison set, and panel-4's bar count grows with --extra flags. Updated to reflect the dynamic behaviour. Rejected findings (commented inline on PR): - `compute_metrics(..., exclude_tiers={5})`: this is a design choice, not a bug. The chart has always used all-tier numbers to match the v0.0.7 chart convention; excluding T5 would alter the published historical content and create a mismatch with the README narrative. Worth a separate issue to discuss chart metric semantics. - `row[mode] = 0` → `np.nan` for missing files: the warning channel already flags missing data, and a visible 0-bar is a louder "data is absent" signal than an invisible bar. In practice we don't plot with missing data; worth revisiting only if that changes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (4)

scripts/plot_results.py (3)
398-419: ⚠️ Potential issue | 🟡 Minor

Subtitle metadata may not match the files actually plotted.

_detect_vera_version() and _detect_problem_count() glob all matching Vera files for the bench version, but _find_result_file() selects only the most recently modified match. If stale reruns remain in results/, the chart title could advertise a compiler version or problem count from a file that wasn't used for the bars.

Consider deriving these values from the paths returned by _find_result_file() during extract_data(), or passing the selected paths into these helpers.

,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/plot_results.py` around lines 398 - 419, The subtitle metadata is
computed from all matching Vera files but the plotting uses the single path
selected by _find_result_file; update extract_data to pass the actual selected
Vera result path(s) into _detect_vera_version and _detect_problem_count (or
change those helpers to accept a list of paths) so they derive the version and
problem count from the exact file(s) chosen by _find_result_file rather than
from a fresh glob; specifically, modify extract_data to capture the path(s)
returned by _find_result_file and forward them to
_detect_vera_version/_detect_problem_count (or refactor those functions to
accept paths and compute counts/versions from those paths).
2-15: ⚠️ Potential issue | 🟡 Minor

Module docstring still references the old output path.

The docstring says the script "produces a run_correct comparison chart at assets/benchmark_v{VERSION}.png" but main() now defaults to assets/results-graph{...}.png. Since ArgumentParser(description=__doc__) reuses this text, --help output is stale.
📝 Suggested fix
 """Generate benchmark comparison charts from VeraBench results.
 
 Reads JSONL files in `results/` (via `vera_bench.metrics.compute_metrics`) and
-produces a run_correct comparison chart at `assets/benchmark_v{VERSION}.png`.
+produces a run_correct comparison chart at `assets/results-graph.png` (or a
+suffixed variant for historical versions / optional comparison languages).
 
 Usage:
     python scripts/plot_results.py                    # uses pyproject version
     python scripts/plot_results.py --version 0.0.9    # explicit bench version
     python scripts/plot_results.py --version 0.0.7    # re-render historical chart
     python scripts/plot_results.py --output my.png    # custom output path
+    python scripts/plot_results.py --extra aver       # include Aver comparison
,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/plot_results.py` around lines 2 - 15, The module docstring is stale
and still says the output is "assets/benchmark_v{VERSION}.png" while
main()/ArgumentParser(description=__doc__) now defaults to
"assets/results-graph{...}.png"; update the top-level docstring to describe the
current default output filename/pattern (matching the value used in
main/argument parsing) so the --help text is correct (look for the docstring and
the main() / ArgumentParser(description=__doc__) usage to ensure the wording and
example usages reflect the new assets/results-graph...png path).
174-182: ⚠️ Potential issue | 🟠 Major

Missing JSONL files are recorded as 0, which fabricates delta-chart wins.

When a result file is missing, row[mode] = 0 is used as a placeholder. This causes the delta panel to compute Vera - 0 and display an inflated positive delta. A distinct sentinel (None or np.nan) would let downstream plotting skip or annotate unavailable comparisons rather than overstating Vera's margin.

Additionally, compute_metrics() is called without exclude_tiers={5}, so Tier 5's language-specific tasks are folded into the cross-language percentages. The exclude_tiers parameter exists precisely for T1–T4 apples-to-apples comparison.
🛠️ Proposed fix for Tier 5 exclusion
-            metrics = compute_metrics(_load_jsonl(path))
+            metrics = compute_metrics(_load_jsonl(path), exclude_tiers={5})
,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/plot_results.py` around lines 174 - 182, When a result file is
missing, don't set row[mode] = 0 (which fabricates deltas); instead set a
sentinel (e.g., None or np.nan) so downstream plotting can skip or annotate
missing values, and ensure you import numpy if using np.nan; also call
compute_metrics with exclude_tiers={5} (i.e., compute_metrics(_load_jsonl(path),
exclude_tiers={5})) so Tier‑5 tasks are excluded from the T1–T4 cross-language
percentages—update the logic around path existence (the block referencing
model.display, mode, version, warnings, _load_jsonl, compute_metrics, and row)
accordingly.
scripts/README.md (1)
317-325: ⚠️ Potential issue | 🟡 Minor

Panel 3 description is stale — now dynamic based on --extra flags.

The description states the third panel is "Does Vera beat Python / TypeScript?" but plot_vera_vs_comparison() now dynamically includes any --extra comparison modes. When invoked with --extra aver, the panel title becomes "Does Vera beat Python / TypeScript / Aver?" and includes an additional delta bar group.
📝 Suggested fix
-3. **Does Vera beat Python / TypeScript?** — horizontal delta bars per model
+3. **Does Vera beat [comparison languages]?** — horizontal delta bars per model,
+   one group per comparison language (Python, TypeScript by default; additional
+   languages added via `--extra`)
,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/README.md` around lines 317 - 325, The README's Chart layout text is
stale: update the Panel 3 description that currently reads "Does Vera beat
Python / TypeScript?" to reflect that plot_vera_vs_comparison() builds the panel
dynamically based on the --extra flags; change the wording to indicate the panel
shows Vera vs the comparison modes specified via --extra (e.g., "Does Vera beat
Python / TypeScript / <extra>?" or "Dynamic: Vera vs comparison modes from
--extra") and mention that additional bars/groups appear when extra modes are
passed so readers know the title and contents are generated at runtime.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/README.md`:
- Around line 227-241: The README has fenced code blocks inside a numbered list
without blank lines, causing MD031; update the sections that show "Rust":
"rust-", the `_COMPILER_SUFFIXED = {"Vera": "vera", "Vera NL": "vera", "Aver":
"aver", "Rust": "rust"}` snippet, the COLORS note, and the
`OPTIONAL_COMPARISON_MODES = {"aver": "Aver", "rust": "Rust"}` snippet to ensure
there is a blank line before and after each ```python fenced code block so
parsers/linters stop flagging MD031 and the blocks render correctly.

---

Duplicate comments:
In `@scripts/plot_results.py`:
- Around line 398-419: The subtitle metadata is computed from all matching Vera
files but the plotting uses the single path selected by _find_result_file;
update extract_data to pass the actual selected Vera result path(s) into
_detect_vera_version and _detect_problem_count (or change those helpers to
accept a list of paths) so they derive the version and problem count from the
exact file(s) chosen by _find_result_file rather than from a fresh glob;
specifically, modify extract_data to capture the path(s) returned by
_find_result_file and forward them to _detect_vera_version/_detect_problem_count
(or refactor those functions to accept paths and compute counts/versions from
those paths).
- Around line 2-15: The module docstring is stale and still says the output is
"assets/benchmark_v{VERSION}.png" while
main()/ArgumentParser(description=__doc__) now defaults to
"assets/results-graph{...}.png"; update the top-level docstring to describe the
current default output filename/pattern (matching the value used in
main/argument parsing) so the --help text is correct (look for the docstring and
the main() / ArgumentParser(description=__doc__) usage to ensure the wording and
example usages reflect the new assets/results-graph...png path).
- Around line 174-182: When a result file is missing, don't set row[mode] = 0
(which fabricates deltas); instead set a sentinel (e.g., None or np.nan) so
downstream plotting can skip or annotate missing values, and ensure you import
numpy if using np.nan; also call compute_metrics with exclude_tiers={5} (i.e.,
compute_metrics(_load_jsonl(path), exclude_tiers={5})) so Tier‑5 tasks are
excluded from the T1–T4 cross-language percentages—update the logic around path
existence (the block referencing model.display, mode, version, warnings,
_load_jsonl, compute_metrics, and row) accordingly.

In `@scripts/README.md`:
- Around line 317-325: The README's Chart layout text is stale: update the Panel
3 description that currently reads "Does Vera beat Python / TypeScript?" to
reflect that plot_vera_vs_comparison() builds the panel dynamically based on the
--extra flags; change the wording to indicate the panel shows Vera vs the
comparison modes specified via --extra (e.g., "Does Vera beat Python /
TypeScript / <extra>?" or "Dynamic: Vera vs comparison modes from --extra") and
mention that additional bars/groups appear when extra modes are passed so
readers know the title and contents are generated at runtime.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1574b988-37a6-45f3-9150-793860e391a0

📥 Commits

Reviewing files that changed from the base of the PR and between d249f5c and 5c8dcd7.

📒 Files selected for processing (3)

ROADMAP.md
scripts/README.md
scripts/plot_results.py

Strict Markdown parsers (and markdownlint MD031) require blank lines around fenced code blocks inside numbered lists. GitHub renders it fine either way, but silencing the linter is cheap. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai Bot reviewed Apr 17, 2026

View reviewed changes

Comment thread scripts/plot_results.py

Comment thread scripts/plot_results.py

Comment thread scripts/plot_results.py Outdated

Comment thread scripts/README.md Outdated

Comment thread scripts/README.md Outdated

coderabbitai Bot reviewed Apr 17, 2026

View reviewed changes

aallan and others added 2 commits April 17, 2026 15:22

Apply ruff format to plot_results.py

15e0902

Whitespace-only — CI lint includes 'ruff format --check'. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai Bot reviewed Apr 17, 2026

View reviewed changes

Comment thread scripts/plot_results.py

Comment thread scripts/plot_results.py

Comment thread scripts/plot_results.py

Comment thread scripts/plot_results.py Outdated

Comment thread scripts/README.md

aallan and others added 2 commits April 17, 2026 15:30

coderabbitai Bot reviewed Apr 17, 2026

View reviewed changes

Comment thread scripts/README.md

aallan merged commit c82d28b into main Apr 17, 2026
10 checks passed

aallan deleted the docs/scripts-readme branch April 17, 2026 14:47

coderabbitai Bot mentioned this pull request May 7, 2026

Migrate kimi-k2-turbo-preview → kimi-k2.6 (closes #68) #69

Merged

4 tasks

aallan mentioned this pull request May 22, 2026

Add scripts/plot_slide.py — v0.0.7 talk-slide renderer at 16:9 #71

Merged

5 tasks

coderabbitai Bot mentioned this pull request May 29, 2026

Migrate Anthropic defaults: Opus 4 → Opus 4.8, Sonnet 4 → Sonnet 4.6 #76

Merged

7 tasks

Conversation

aallan commented Apr 17, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why now

What's in the README

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov Bot commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

aallan commented Apr 17, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 17, 2026 •

edited

Loading

codecov Bot commented Apr 17, 2026 •

edited

Loading