docs: document all scripts in scripts/README.md; make plot_results data-driven#59
Conversation
Previously the scripts/ directory had no README — each script was claimed to be "self-documenting via --help", but --help coverage was thin (run_full_benchmark.py omits the 8-target list, env-var mapping, multi-model sweeps, timing, and output filenames). scripts/README.md now covers all three: - run_full_benchmark.py: target list, provider env vars, sweep recipe, timing expectations, output filenames, exit-code semantics - plot_results.py: usage, file-discovery pattern, model-registry extension, colour palette, reproducibility, new --extra flag - validate_problems.py: column-by-column output interpretation, when to run Related plot_results.py refactor: - Read run_correct from JSONL files via compute_metrics() instead of hand-copied percentages (eliminates transcription errors) - Auto-detect Vera compiler version and problem count from filenames - Add MODELS registry keyed on file_prefix; replace hardcoded FLAGSHIP/SONNET dicts - Add --extra CLI flag (opt-in comparison languages; Aver registered today, others pluggable) - Output suffix _with-<lang> when extras used; default path unchanged - Delta-chart title + x-axis auto-adapt to the comparison-language list Regenerate assets/benchmark_v0.0.9.png (byte-equivalent to previous); add assets/benchmark_v0.0.9_with-aver.png as the 5-mode variant. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds an in-repo scripts README; refactors Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
- Add results/timing.json to .gitignore. It's transient per-run state (wall-clock + status per target) and was never previously committed — but nothing was stopping `git add .` from slurping it in. - Correct the scripts/README.md reproducibility note: results/*.jsonl are gitignored, not committed. Old wording implied the JSONL files live in git, which was wrong. Reproducing a historical chart means rerunning `vera-bench run` / `baselines` to regenerate the JSONL, then running plot_results.py. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/plot_results.py`:
- Around line 148-149: The _load_jsonl function should skip malformed JSON lines
instead of crashing: wrap the json.loads call in a try/except that catches
json.JSONDecodeError, log a warning including the file path and the offending
line (use Python's logging or warnings) and continue to the next line, returning
only successfully parsed dicts; keep function name _load_jsonl and the Path
parameter 'path' as the location to read and report in the warning.
- Around line 262-267: The code computes n = len(comparison_modes) and uses it
to compute height = 0.7 / n and offsets, which throws ZeroDivisionError if
comparison_modes is empty; add a defensive guard at the start of the plotting
block or function (referencing comparison_modes, n, height, offsets, and models)
to handle an empty list: either return early (no-op) or raise a clear
ValueError, or set n = max(1, n) before computing height so division by zero
cannot occur, and ensure offsets remain valid for downstream plotting.
- Around line 423-429: The _default_version function currently parses
pyproject.toml by string matching which is fragile; replace it with proper TOML
parsing using tomllib (stdlib in Python 3.11+) inside scripts/plot_results.py:
open and load the TOML with tomllib.load, then read the version from well-known
locations (first check data.get("project", {}).get("version"), then fall back to
data.get("tool", {}).get("poetry", {}).get("version") and similar tool-specific
keys), and return that value or "0.0.0" as a final fallback; ensure the code
handles environments without tomllib (if you must support older Pythons, use a
conditional import or an optional dependency) and update the _default_version
function accordingly.
In `@scripts/README.md`:
- Around line 255-258: In the README.md fenced code block that shows the warning
example (the block starting with ``` and the lines "Warnings: Kimi K2.5 / Vera
NL: no file matching bench-0.0.9"), add a language specifier such as text or
console by changing the opening fence from ``` to ```text so the snippet renders
consistently; locate the code block text in the scripts/README.md and update the
opening fence accordingly.
- Around line 288-292: Update the README text that says "You'll also need to
update the three `plot_*` functions, which currently assume four modes." to
reflect the refactor: removing that claim and noting that `plot_results.py`'s
plotting functions accept dynamic mode lists via the `comparison_modes` and
`modes` parameters so you only need to add entries to `MODE_PATTERNS` and
`COLORS` (no edits to `plot_*` functions required). Reference the
`plot_results.py` module and the `comparison_modes`/`modes` parameters in the
updated sentence to guide readers.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 7a5df7a8-2d0d-45c8-931d-161422090b74
⛔ Files ignored due to path filters (2)
assets/benchmark_v0.0.9.pngis excluded by!**/*.pngassets/benchmark_v0.0.9_with-aver.pngis excluded by!**/*.png
📒 Files selected for processing (2)
scripts/README.mdscripts/plot_results.py
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #59 +/- ##
=======================================
Coverage 83.27% 83.27%
=======================================
Files 10 10
Lines 1363 1363
=======================================
Hits 1135 1135
Misses 228 228
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Only one chart should be version-controlled: the one embedded in the
top-level README. Prior state committed v0.0.7 and v0.0.9 variants
alongside each other, which invites confusion about which is "the"
benchmark chart.
Changes:
- plot_results.py: default output is now assets/results-graph.png.
Suffix conventions unchanged: _v{VER} for historical versions,
_with-{lang} for --extra. Suffixed variants are always gitignored.
- .gitignore: ignore assets/results-graph_*.png (any variant with a
suffix) and assets/benchmark_*.png (catches old-convention files).
- assets/: delete benchmark_v0.0.7.png, benchmark_v0.0.9.png,
benchmark_v0.0.9_with-aver.png. Add results-graph.png pinned to
v0.0.7 content (generated via --version 0.0.7) so it matches the
v0.0.7 narrative below it in the top-level README. The pin will be
released to v0.0.9 content along with the v0.0.9 writeup.
- README.md: update image path to assets/results-graph.png; keep
v0.0.7 caption and narrative unchanged.
- ROADMAP.md: swap broken assets/benchmark_v0.0.7.png references for
v0.0.7-tag permalinks to the GitHub blob view.
- scripts/README.md: update filename convention table, add
overwrite-gotcha callout (default invocation regenerates from
current pyproject version, which would clobber the v0.0.7 pin).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
♻️ Duplicate comments (1)
scripts/README.md (1)
255-258: 🧹 Nitpick | 🔵 TrivialAdd language specifier to fenced code block.
The warning example code block lacks a language specifier. Adding
textorconsoleimproves rendering consistency.📝 Suggested fix
-``` +```text Warnings: Kimi K2.5 / Vera NL: no file matching bench-0.0.9</details> <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against the current code and only fix it if needed.
In
@scripts/README.mdaround lines 255 - 258, Update the fenced code block
containing the example starting with the line "Warnings:" so it includes a
language specifier (e.g., addtext orconsole) at the top of the block to
improve rendering; locate the block that currently shows "Warnings: Kimi K2.5 /
Vera NL: no file matching bench-0.0.9" and prefix the opening fence with the
chosen language.</details> </blockquote></details> </blockquote></details> <details> <summary>🤖 Prompt for all review comments with AI agents</summary>Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In@scripts/README.md:
- Around line 255-258: Update the fenced code block containing the example
starting with the line "Warnings:" so it includes a language specifier (e.g.,
addtext orconsole) at the top of the block to improve rendering; locate
the block that currently shows "Warnings: Kimi K2.5 / Vera NL: no file matching
bench-0.0.9" and prefix the opening fence with the chosen language.</details> --- <details> <summary>ℹ️ Review info</summary> <details> <summary>⚙️ Run configuration</summary> **Configuration used**: Path: .coderabbit.yaml **Review profile**: ASSERTIVE **Plan**: Pro **Run ID**: `a810fa54-225d-42b6-b083-8e0264853e42` </details> <details> <summary>📥 Commits</summary> Reviewing files that changed from the base of the PR and between a82afae0e8c77e8c99b7c710914317e0867fc92a and 9296c4d0ace7c276fdd3ab2febe4ab29cef6fd9b. </details> <details> <summary>📒 Files selected for processing (2)</summary> * `.gitignore` * `scripts/README.md` </details> </details> <!-- This is an auto-generated comment by CodeRabbit for review status -->
There was a problem hiding this comment.
♻️ Duplicate comments (5)
scripts/README.md (2)
275-278: 🧹 Nitpick | 🔵 TrivialAdd a language specifier to the fenced code block.
The warning example lacks a language specifier (e.g.
textorconsole), which affects rendering consistency.-``` +```text Warnings: Kimi K2.5 / Vera NL: no file matching bench-0.0.9<details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against the current code and only fix it if needed.
In
@scripts/README.mdaround lines 275 - 278, The fenced code block showing the
warning example is missing a language specifier; update that fenced block (the
"Warnings: Kimi K2.5 / Vera NL: no file matching bench-0.0.9" example) to
include a language tag such as "text" (i.e., changetotext) so it
renders consistently.</details> --- `308-312`: _⚠️ Potential issue_ | _🟡 Minor_ **Documentation inconsistency: plot functions now accept dynamic mode lists.** Lines 310–312 claim you need to update the three `plot_*` functions when adding a new mode, but the refactored `plot_results.py` accepts mode lists dynamically via `comparison_modes` and `modes` parameters—no function edits required. ```diff ### Adding a new mode Add an entry to `MODE_PATTERNS` *and* add the colour to `COLORS`. You'll -also need to update the three `plot_*` functions, which currently assume -four modes. +also need to register it in `OPTIONAL_COMPARISON_MODES` if it should be +opt-in via `--extra`, or add it to `DEFAULT_COMPARISON_MODES` if it should +always appear. The plot functions accept dynamic mode lists and require +no changes. ``` <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@scripts/README.md` around lines 308 - 312, Update the README to remove the incorrect claim that the three plot_* functions must be edited when adding a new mode: state that adding a new entry to MODE_PATTERNS and a colour to COLORS is sufficient because plot_results.py handles dynamic mode lists via the comparison_modes and modes parameters (reference MODE_PATTERNS, COLORS, plot_results.py, and the comparison_modes/modes parameters). ``` </details> </blockquote></details> <details> <summary>scripts/plot_results.py (3)</summary><blockquote> `148-149`: _🧹 Nitpick_ | _🔵 Trivial_ **Consider adding resilience for malformed JSONL lines (optional).** If a result file contains a malformed JSON line, `json.loads` will raise `JSONDecodeError` and abort chart generation. For operational tooling this fail-fast behaviour is defensible, but logging a warning and skipping bad lines would be more graceful. <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@scripts/plot_results.py` around lines 148 - 149, The helper _load_jsonl currently calls json.loads on every line which will raise JSONDecodeError and abort when encountering malformed lines; change _load_jsonl to iterate lines, try to json.loads each one, catch JSONDecodeError (and any other ValueError), log a warning via the module logger with context (file path and line number or snippet), skip the bad line, and continue returning only successfully parsed dicts so chart generation is resilient to malformed JSONL entries. ``` </details> --- `262-267`: _🧹 Nitpick_ | _🔵 Trivial_ **Minor edge case: empty `comparison_modes` causes division by zero (optional guard).** If `comparison_modes` is empty, `n = 0` leads to `ZeroDivisionError` at line 264. The current `main()` flow always includes at least Python + TypeScript, so this won't occur in practice, but a defensive guard would make the function safer for reuse. <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@scripts/plot_results.py` around lines 262 - 267, The code computes n = len(comparison_modes) and then divides by n, which will raise ZeroDivisionError when comparison_modes is empty; update the plotting logic in this block (where variables y, n, height, offsets are defined) to guard against n == 0 by either returning/skipping plotting early when comparison_modes is empty or treating n as 1 and using a single centered offset (e.g., offsets = [0]) so height calculation and offsets don't divide by zero; make the chosen behavior consistent with the rest of the function (skip plotting or render a single bar) and adjust downstream code that expects offsets accordingly. ``` </details> --- `423-429`: _🧹 Nitpick_ | _🔵 Trivial_ **Fragile pyproject.toml parsing (optional improvement).** The version extraction assumes the first line starting with `version` is the project version. If `pyproject.toml` gains additional `version` keys in tool sections (e.g. `[tool.black]`), this could pick up the wrong one. Consider using `tomllib` (stdlib in 3.11+) for robust parsing. <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@scripts/plot_results.py` around lines 423 - 429, The current _default_version function is fragile because it string-parses pyproject.toml and may pick up non-project "version" keys; change it to parse pyproject.toml with the TOML library (use stdlib tomllib on Python 3.11+) and extract the canonical project version (e.g., check keys like project.version and, if you need Poetry support, tool.poetry.version) instead of line-based matching; update the function _default_version to open and tomllib.load the pyproject file, return the appropriate version field if present, and keep the existing "0.0.0" fallback if parsing or keys are missing (or fallback to the current string-scan logic only as a last resort). ``` </details> </blockquote></details> </blockquote></details> <details> <summary>🤖 Prompt for all review comments with AI agents</summary>Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In@scripts/plot_results.py:
- Around line 148-149: The helper _load_jsonl currently calls json.loads on
every line which will raise JSONDecodeError and abort when encountering
malformed lines; change _load_jsonl to iterate lines, try to json.loads each
one, catch JSONDecodeError (and any other ValueError), log a warning via the
module logger with context (file path and line number or snippet), skip the bad
line, and continue returning only successfully parsed dicts so chart generation
is resilient to malformed JSONL entries.- Around line 262-267: The code computes n = len(comparison_modes) and then
divides by n, which will raise ZeroDivisionError when comparison_modes is empty;
update the plotting logic in this block (where variables y, n, height, offsets
are defined) to guard against n == 0 by either returning/skipping plotting early
when comparison_modes is empty or treating n as 1 and using a single centered
offset (e.g., offsets = [0]) so height calculation and offsets don't divide by
zero; make the chosen behavior consistent with the rest of the function (skip
plotting or render a single bar) and adjust downstream code that expects offsets
accordingly.- Around line 423-429: The current _default_version function is fragile because
it string-parses pyproject.toml and may pick up non-project "version" keys;
change it to parse pyproject.toml with the TOML library (use stdlib tomllib on
Python 3.11+) and extract the canonical project version (e.g., check keys like
project.version and, if you need Poetry support, tool.poetry.version) instead of
line-based matching; update the function _default_version to open and
tomllib.load the pyproject file, return the appropriate version field if
present, and keep the existing "0.0.0" fallback if parsing or keys are missing
(or fallback to the current string-scan logic only as a last resort).In
@scripts/README.md:
- Around line 275-278: The fenced code block showing the warning example is
missing a language specifier; update that fenced block (the "Warnings: Kimi K2.5
/ Vera NL: no file matching bench-0.0.9" example) to include a language tag such
as "text" (i.e., changetotext) so it renders consistently.- Around line 308-312: Update the README to remove the incorrect claim that the
three plot_* functions must be edited when adding a new mode: state that adding
a new entry to MODE_PATTERNS and a colour to COLORS is sufficient because
plot_results.py handles dynamic mode lists via the comparison_modes and modes
parameters (reference MODE_PATTERNS, COLORS, plot_results.py, and the
comparison_modes/modes parameters).</details> --- <details> <summary>ℹ️ Review info</summary> <details> <summary>⚙️ Run configuration</summary> **Configuration used**: Path: .coderabbit.yaml **Review profile**: ASSERTIVE **Plan**: Pro **Run ID**: `3e744ba8-02e6-4f52-9abf-3ecc1569fbaf` </details> <details> <summary>📥 Commits</summary> Reviewing files that changed from the base of the PR and between 9296c4d0ace7c276fdd3ab2febe4ab29cef6fd9b and 74fa6abb6a734d5d3ceb89aa449e2bd5bff3682c. </details> <details> <summary>⛔ Files ignored due to path filters (2)</summary> * `assets/benchmark_v0.0.7.png` is excluded by `!**/*.png` * `assets/results-graph.png` is excluded by `!**/*.png` </details> <details> <summary>📒 Files selected for processing (5)</summary> * `.gitignore` * `README.md` * `ROADMAP.md` * `scripts/README.md` * `scripts/plot_results.py` </details> </details> <!-- This is an auto-generated comment by CodeRabbit for review status -->
- plot_results.py: parse pyproject.toml with tomllib (stdlib in 3.11+ per CLAUDE.md) instead of line-by-line string matching. Robust against unrelated `version` keys in other sections. Falls back to PEP 621 project.version then poetry's tool.poetry.version, then "0.0.0". - scripts/README.md: add `text` language to the warning-example fenced block so GitHub's syntax highlighter doesn't guess. - scripts/README.md: the "Adding a new mode" section claimed the three plot_* functions would need edits — that was stale. After the dynamic mode-list refactor they take comparison_modes/modes parameters and need no edits. Also note _COMPILER_SUFFIXED for compiler-versioned filenames. Rejected CodeRabbit findings (commented inline on PR): - _load_jsonl error-swallowing: inputs are machine-written by vera-bench; failing loud on malformed lines is correct. - ZeroDivisionError guard on comparison_modes: list is constructed from a hardcoded non-empty default; the empty case is unreachable. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Whitespace-only — CI lint includes 'ruff format --check'. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/plot_results.py`:
- Around line 2-15: Update the module docstring to match the current output
convention used by main(): replace the outdated reference to
`assets/benchmark_v{VERSION}.png` with the new default path (e.g.
`assets/results-graph...`) so that ArgumentParser(description=__doc__) and the
script's --help text are accurate; locate the top-level docstring in
scripts/plot_results.py and edit the sentence(s) describing the generated file
name/output path accordingly, keeping the usage examples consistent with the
current --output/default behavior.
- Around line 400-421: The title detection functions currently rescan all
matching Vera files and can pick stale runs; update them to derive metadata from
the actual file chosen by _find_result_file() instead of globbing. Modify
_detect_vera_version() and _detect_problem_count() (or replace them) so they
accept the single selected Path (the value returned by _find_result_file()) and
parse that file's filename/stem for the compiler version and read its contents
to count unique problem_id values; alternatively change the call sites to call
_find_result_file() first and pass its Path into these functions rather than
letting those functions perform their own glob().
- Around line 182-184: compute_metrics is being called without excluding Tier 5,
which mixes Tier‑5 language-specific tasks into the headline cross-language
rates; update the call where metrics is computed (the compute_metrics(...,
_load_jsonl(path)) invocation near row[mode]) to pass the exclude_tiers argument
to omit tier 5 (e.g., exclude_tiers=[5] or the codebase's expected container) so
rate = metrics.run_correct_rate reflects only T1–T4; keep the rest of the
rounding/assignment to row[mode] unchanged.
- Around line 176-180: Replace the placeholder numeric 0 for missing JSONL
results with a proper missing-value sentinel (e.g., use numpy.nan or pandas.NA)
so the delta chart can skip unavailable data: change the assignment row[mode] =
0 to row[mode] = np.nan (ensure import numpy as np) in the shown branch where
warnings are appended (reference variables: row, mode, warnings, model.display),
and make the analogous change for the other missing-data blocks flagged in the
same file (the region you noted around lines 257–292). Also verify any
downstream delta/comparison logic (the code that computes deltas from the
DataFrame) uses pd.isna / np.isnan checks so it skips or marks comparisons when
values are NaN instead of treating them as 0.
In `@scripts/README.md`:
- Around line 317-325: Update the README to describe panel 3 as dynamic rather
than a fixed "Does Vera beat Python / TypeScript?" by noting that
plot_vera_vs_comparison() appends every --extra comparison mode; change the
panel description to explain it renders horizontal delta bars for each
comparison mode (including any --extra values like "aver") and will vary
depending on invocation, and mention that default invocation shows Vera vs
Python/TypeScript while extra flags add additional comparison columns.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 39307976-318c-4dfa-85ef-4b2ebf9e50e2
📒 Files selected for processing (2)
scripts/README.mdscripts/plot_results.py
Uploaded benchmark_v0.0.7.png as a release asset on the v0.0.7 GitHub release, then switched ROADMAP.md and scripts/README.md to use the release-asset URL instead of a tag-scoped blob URL. Release assets survive tag/repo restructuring and live on the release page, so they're more discoverable than deep-linked tag views. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three CodeRabbit findings this round:
1. Module docstring still referenced the old `benchmark_v{VERSION}.png`
filename convention. Updated to reflect the current
`assets/results-graph.png` default plus suffix variants.
2. `_detect_vera_version` and `_detect_problem_count` re-globbed
`results/*-bench-{ver}-vera-*.jsonl` independently of which files
`_find_result_file` actually selected. With _find_result_file's
"newest-by-mtime" tie-breaker, the two lookups could disagree: the
chart would plot one set of files while the subtitle described
another. In practice no drift exists today, but it's a real bug
waiting to happen.
Fix: extract_data() now also returns `used_paths` — the JSONL files
it actually consulted. The two _detect_* functions take that list
instead of globbing, so subtitle metadata is consistent with the
plotted data by construction.
3. README "Chart layout" section described panel 3 as a fixed
"Does Vera beat Python / TypeScript?" and panel 4 as "four modes".
Both are actually dynamic — the panel-3 title is rebuilt from the
active comparison set, and panel-4's bar count grows with --extra
flags. Updated to reflect the dynamic behaviour.
Rejected findings (commented inline on PR):
- `compute_metrics(..., exclude_tiers={5})`: this is a design choice,
not a bug. The chart has always used all-tier numbers to match the
v0.0.7 chart convention; excluding T5 would alter the published
historical content and create a mismatch with the README narrative.
Worth a separate issue to discuss chart metric semantics.
- `row[mode] = 0` → `np.nan` for missing files: the warning channel
already flags missing data, and a visible 0-bar is a louder "data
is absent" signal than an invisible bar. In practice we don't plot
with missing data; worth revisiting only if that changes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (4)
scripts/plot_results.py (3)
398-419:⚠️ Potential issue | 🟡 MinorSubtitle metadata may not match the files actually plotted.
_detect_vera_version()and_detect_problem_count()glob all matching Vera files for the bench version, but_find_result_file()selects only the most recently modified match. If stale reruns remain inresults/, the chart title could advertise a compiler version or problem count from a file that wasn't used for the bars.Consider deriving these values from the paths returned by
_find_result_file()duringextract_data(), or passing the selected paths into these helpers.,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/plot_results.py` around lines 398 - 419, The subtitle metadata is computed from all matching Vera files but the plotting uses the single path selected by _find_result_file; update extract_data to pass the actual selected Vera result path(s) into _detect_vera_version and _detect_problem_count (or change those helpers to accept a list of paths) so they derive the version and problem count from the exact file(s) chosen by _find_result_file rather than from a fresh glob; specifically, modify extract_data to capture the path(s) returned by _find_result_file and forward them to _detect_vera_version/_detect_problem_count (or refactor those functions to accept paths and compute counts/versions from those paths).
2-15:⚠️ Potential issue | 🟡 MinorModule docstring still references the old output path.
The docstring says the script "produces a run_correct comparison chart at
assets/benchmark_v{VERSION}.png" butmain()now defaults toassets/results-graph{...}.png. SinceArgumentParser(description=__doc__)reuses this text,--helpoutput is stale.📝 Suggested fix
"""Generate benchmark comparison charts from VeraBench results. Reads JSONL files in `results/` (via `vera_bench.metrics.compute_metrics`) and -produces a run_correct comparison chart at `assets/benchmark_v{VERSION}.png`. +produces a run_correct comparison chart at `assets/results-graph.png` (or a +suffixed variant for historical versions / optional comparison languages). Usage: python scripts/plot_results.py # uses pyproject version python scripts/plot_results.py --version 0.0.9 # explicit bench version python scripts/plot_results.py --version 0.0.7 # re-render historical chart python scripts/plot_results.py --output my.png # custom output path + python scripts/plot_results.py --extra aver # include Aver comparison,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/plot_results.py` around lines 2 - 15, The module docstring is stale and still says the output is "assets/benchmark_v{VERSION}.png" while main()/ArgumentParser(description=__doc__) now defaults to "assets/results-graph{...}.png"; update the top-level docstring to describe the current default output filename/pattern (matching the value used in main/argument parsing) so the --help text is correct (look for the docstring and the main() / ArgumentParser(description=__doc__) usage to ensure the wording and example usages reflect the new assets/results-graph...png path).
174-182:⚠️ Potential issue | 🟠 MajorMissing JSONL files are recorded as
0, which fabricates delta-chart wins.When a result file is missing,
row[mode] = 0is used as a placeholder. This causes the delta panel to computeVera - 0and display an inflated positive delta. A distinct sentinel (Noneornp.nan) would let downstream plotting skip or annotate unavailable comparisons rather than overstating Vera's margin.Additionally,
compute_metrics()is called withoutexclude_tiers={5}, so Tier 5's language-specific tasks are folded into the cross-language percentages. Theexclude_tiersparameter exists precisely for T1–T4 apples-to-apples comparison.🛠️ Proposed fix for Tier 5 exclusion
- metrics = compute_metrics(_load_jsonl(path)) + metrics = compute_metrics(_load_jsonl(path), exclude_tiers={5}),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/plot_results.py` around lines 174 - 182, When a result file is missing, don't set row[mode] = 0 (which fabricates deltas); instead set a sentinel (e.g., None or np.nan) so downstream plotting can skip or annotate missing values, and ensure you import numpy if using np.nan; also call compute_metrics with exclude_tiers={5} (i.e., compute_metrics(_load_jsonl(path), exclude_tiers={5})) so Tier‑5 tasks are excluded from the T1–T4 cross-language percentages—update the logic around path existence (the block referencing model.display, mode, version, warnings, _load_jsonl, compute_metrics, and row) accordingly.scripts/README.md (1)
317-325:⚠️ Potential issue | 🟡 MinorPanel 3 description is stale — now dynamic based on
--extraflags.The description states the third panel is "Does Vera beat Python / TypeScript?" but
plot_vera_vs_comparison()now dynamically includes any--extracomparison modes. When invoked with--extra aver, the panel title becomes "Does Vera beat Python / TypeScript / Aver?" and includes an additional delta bar group.📝 Suggested fix
-3. **Does Vera beat Python / TypeScript?** — horizontal delta bars per model +3. **Does Vera beat [comparison languages]?** — horizontal delta bars per model, + one group per comparison language (Python, TypeScript by default; additional + languages added via `--extra`),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/README.md` around lines 317 - 325, The README's Chart layout text is stale: update the Panel 3 description that currently reads "Does Vera beat Python / TypeScript?" to reflect that plot_vera_vs_comparison() builds the panel dynamically based on the --extra flags; change the wording to indicate the panel shows Vera vs the comparison modes specified via --extra (e.g., "Does Vera beat Python / TypeScript / <extra>?" or "Dynamic: Vera vs comparison modes from --extra") and mention that additional bars/groups appear when extra modes are passed so readers know the title and contents are generated at runtime.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/README.md`:
- Around line 227-241: The README has fenced code blocks inside a numbered list
without blank lines, causing MD031; update the sections that show "Rust":
"rust-", the `_COMPILER_SUFFIXED = {"Vera": "vera", "Vera NL": "vera", "Aver":
"aver", "Rust": "rust"}` snippet, the COLORS note, and the
`OPTIONAL_COMPARISON_MODES = {"aver": "Aver", "rust": "Rust"}` snippet to ensure
there is a blank line before and after each ```python fenced code block so
parsers/linters stop flagging MD031 and the blocks render correctly.
---
Duplicate comments:
In `@scripts/plot_results.py`:
- Around line 398-419: The subtitle metadata is computed from all matching Vera
files but the plotting uses the single path selected by _find_result_file;
update extract_data to pass the actual selected Vera result path(s) into
_detect_vera_version and _detect_problem_count (or change those helpers to
accept a list of paths) so they derive the version and problem count from the
exact file(s) chosen by _find_result_file rather than from a fresh glob;
specifically, modify extract_data to capture the path(s) returned by
_find_result_file and forward them to _detect_vera_version/_detect_problem_count
(or refactor those functions to accept paths and compute counts/versions from
those paths).
- Around line 2-15: The module docstring is stale and still says the output is
"assets/benchmark_v{VERSION}.png" while
main()/ArgumentParser(description=__doc__) now defaults to
"assets/results-graph{...}.png"; update the top-level docstring to describe the
current default output filename/pattern (matching the value used in
main/argument parsing) so the --help text is correct (look for the docstring and
the main() / ArgumentParser(description=__doc__) usage to ensure the wording and
example usages reflect the new assets/results-graph...png path).
- Around line 174-182: When a result file is missing, don't set row[mode] = 0
(which fabricates deltas); instead set a sentinel (e.g., None or np.nan) so
downstream plotting can skip or annotate missing values, and ensure you import
numpy if using np.nan; also call compute_metrics with exclude_tiers={5} (i.e.,
compute_metrics(_load_jsonl(path), exclude_tiers={5})) so Tier‑5 tasks are
excluded from the T1–T4 cross-language percentages—update the logic around path
existence (the block referencing model.display, mode, version, warnings,
_load_jsonl, compute_metrics, and row) accordingly.
In `@scripts/README.md`:
- Around line 317-325: The README's Chart layout text is stale: update the Panel
3 description that currently reads "Does Vera beat Python / TypeScript?" to
reflect that plot_vera_vs_comparison() builds the panel dynamically based on the
--extra flags; change the wording to indicate the panel shows Vera vs the
comparison modes specified via --extra (e.g., "Does Vera beat Python /
TypeScript / <extra>?" or "Dynamic: Vera vs comparison modes from --extra") and
mention that additional bars/groups appear when extra modes are passed so
readers know the title and contents are generated at runtime.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 1574b988-37a6-45f3-9150-793860e391a0
📒 Files selected for processing (3)
ROADMAP.mdscripts/README.mdscripts/plot_results.py
Strict Markdown parsers (and markdownlint MD031) require blank lines around fenced code blocks inside numbered lists. GitHub renders it fine either way, but silencing the linter is cheap. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
scripts/README.mdcovering all three operational scripts (run_full_benchmark.py,plot_results.py,validate_problems.py)plot_results.pyto read metrics from JSONL files (no more hand-copied percentages) and add an--extraflag for opt-in comparison languages (Aver is the first; more are pluggable)assets/benchmark_v0.0.9_with-aver.pngas the 5-mode variant;assets/benchmark_v0.0.9.pngregenerated and byte-equivalent to the previous versionWhy now
Previous state:
scripts/had no README, and the claim that each script was "self-documenting via--help" was thin. For example,run_full_benchmark.py --helpdoesn't mention the 8-target list, the per-provider env var mapping, how to sweep multiple models, expected timings, or the output filename convention.Hand-copied chart data was also a foot-gun — if percentages in
plot_results.pydrifted from the JSONL files, the headline chart would silently misrepresent results.What's in the README
run_full_benchmark.py— what the eight targets are, env var per provider, sweep recipe for all six models, timing table, output file naming, exit-code behaviourplot_results.py— usage, file-discovery glob patterns, how to add a new model or comparison language, colour palette, reproducibility notesvalidate_problems.py— column-by-column interpretation, when to runTest plan
python scripts/plot_results.py --version 0.0.9→ regenerates default chart (byte-equivalent to previousbenchmark_v0.0.9.png)python scripts/plot_results.py --version 0.0.9 --extra aver→ producesbenchmark_v0.0.9_with-aver.pngwith Aver as a 5th modepython scripts/plot_results.py --version 0.0.7→ historical chart reproduces from archived JSONL filesruff check scripts/plot_results.pypassesNot tagged as a release — scripts/docs only.
Generated with Claude Code
Summary by CodeRabbit
Documentation
New Features
Chores