Increase test coverage to 83%, version in filenames (v0.0.6) by aallan · Pull Request #36 · aallan/vera-bench

aallan · 2026-03-31T13:29:55Z

Summary

Two issues bundled: #20 (version in filenames) and #5 (test coverage).

Coverage: 66% → 83%

File	Before	After
validate.py	12%	82%
cli.py	48%	71%
prompts.py	79%	100%
vera_runner.py	65%	91%
runner.py	68%	77%
baseline_runner.py	91%	92%

4 new test files, 52 new tests, 376 total (was 324).
CI coverage threshold raised from 35% to 80%.

Version tracking (#20)

Filenames now include bench + vera versions:

model-bench-0-0-6-vera-0-0-105.jsonl

Each JSONL record carries bench_version and vera_version fields.

Closes #20. Progress on #5.

Generated with Claude Code

Summary by CodeRabbit

Tests
- Added extensive unit and integration tests across CLI, models, runner and validation, raising coverage to ~83% (52 new tests).
New Features
- Added a public version-reporting method for the runner API.
Chores
- Bumped project version to 0.0.6.
- Raised CI coverage gate from 35% to 80%.
- Updated changelog and roadmap to reflect the new release.

New test files: - test_vera_runner_integration.py: real vera subprocess tests (check, verify, run_fn, version, _vera_bin edge cases) - test_validate_integration.py: real validation pipeline tests (find_vera_file, normalize_output, validate_problem, run_validation) - test_cli.py: Click CliRunner tests for all commands - test_models.py: LLM client creation, missing API keys, mock complete() Expanded existing tests: - test_runner.py: Python eval error paths (syntax, runtime, wrong output), run_benchmark JSONL writing, version fields in ProblemResult Coverage improvements: - validate.py: 12% → 82% - cli.py: 48% → 71% - prompts.py: 79% → 100% - vera_runner.py: 65% → 91% - runner.py: 68% → 77% CI coverage threshold raised from 35% to 80%. 376 tests passing (was 324). Closes #20. Progress on #5. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-03-31T13:30:14Z

📝 Walkthrough

Walkthrough

Bumps package to v0.0.6, raises CI Python 3.12 coverage gate from 35% to 80%, and adds extensive test coverage (multiple new unit and integration test files) along with changelog and roadmap updates; no production API or exported symbols were changed.

Changes

Cohort / File(s)	Summary
Version & Release Metadata `pyproject.toml`, `CHANGELOG.md`, `ROADMAP.md`	Package version updated to `0.0.6`; changelog and roadmap updated to document release, benchmark/vera versioning metadata, new `VeraRunner.version()` mention, and adjusted coverage target/metrics.
CI Configuration `.github/workflows/ci.yml`	Increased coverage-fail threshold for Python 3.12 tests from `--cov-fail-under=35` to `--cov-fail-under=80`.
CLI Tests `tests/test_cli.py`	New Click integration tests for `validate`, `run`, `baselines`, and `report` subcommands, asserting exit codes, output strings, warnings, and JSONL baseline generation.
Model Client Tests `tests/test_models.py`	New tests for `create_client`, `AnthropicClient`, `OpenAIClient` and `LLMResponse`, covering missing API keys, unknown models, and mocked SDK completions.
Runner & Validator Tests `tests/test_runner.py`, `tests/test_validate.py`, `tests/test_validate_integration.py`	New unit and integration tests for Python evaluation errors, JSONL output writing, skill markdown loading, `validate_problem` behaviour, normalization and error categories; some tests gated on external `vera` availability.
VeraRunner Integration Tests `tests/test_vera_runner_integration.py`	Integration tests validating `vera` binary discovery, `VeraRunner.version()`, `check()`/`verify()` outcomes and exported-function execution against the system `vera` binary (module skipped if `vera` absent).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~35 minutes

Possibly related issues

Increase test coverage to >90% #5: Test additions and CI coverage threshold increase directly map to the issue's objective to raise test coverage.
Include vera compiler version in output JSONL filename #20: Changelog entries and tests around VeraRunner.version() and JSONL filename/recording align with the request to include vera compiler version in result filenames and records.

Possibly related PRs

Include bench and vera versions in filenames and JSONL records (#20) #35: Tests here exercise VeraRunner.version() and vera-version metadata introduced in that PR.
Implement baseline runner (Phase 3) #8: CLI baselines tests validate behaviour (single JSONL output naming/content) implemented in that PR.
Implement LLM runner harness (Phase 2) #3: New tests exercise runner, client, CLI and output behaviours added in that PR.

Suggested labels

ci

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Linked Issues check	⚠️ Warning	Issue `#20` requires vera compiler version in output JSONL filenames and records, but the raw_summary provides no evidence of implementation in cli.py, runner.py, or JSONL output logic.	Verify that cli.py appends vera version to filenames when language=='vera', and that JSONL records include bench_version and vera_version fields as required by `#20`.
Docstring Coverage	⚠️ Warning	Docstring coverage is 8.47% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarises the main changes: increased test coverage to 83% and version bump to 0.0.6 reflected in filenames and documentation.
Out of Scope Changes check	✅ Passed	All changes align with PR objectives: test coverage expansion (4 new test files, 52 tests) and version documentation updates (pyproject.toml, CHANGELOG, ROADMAP) directly support issues `#20` and `#5`.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feature/test-coverage

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov · 2026-03-31T13:35:09Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.93%. Comparing base (e723cb7) to head (4bdb888).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff             @@
##             main      #36       +/-   ##
===========================================
+ Coverage   65.68%   82.93%   +17.24%     
===========================================
  Files          10       10               
  Lines        1090     1090               
===========================================
+ Hits          716      904      +188     
+ Misses        374      186      -188

Flag	Coverage Δ
python	`82.93% <ø> (+17.24%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

coderabbitai

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@CHANGELOG.md`:
- Around line 16-17: Update the release note sentence that currently reads "52
new tests across 3 new test files (test_cli.py, test_models.py,
test_vera_runner_integration.py)" to reflect the correct number "4" and ensure
the parenthetical lists all four new test filenames (add the missing test file
name or mirror the PR summary); update the numeric count ("3" → "4") and adjust
the parenthetical to include the fourth file so the changelog is factually
accurate.

In `@ROADMAP.md`:
- Line 19: Update the checklist item text that currently reads "[x] Increase
test coverage to >83% (issue `#5`, ongoing)" in ROADMAP.md to accurately reflect
the achieved coverage by replacing ">83%" with either "83%" or ">=83%"; locate
the exact string in the file and edit it to "83%" (or ">=83%" if you prefer to
express a lower bound) so the roadmap is factually correct.

In `@tests/test_cli.py`:
- Around line 79-92: The test test_typescript_baselines currently always asserts
exit_code == 0 but may require the external "tsx" runtime; modify
test_typescript_baselines to detect presence of the TypeScript runtime before
invoking CliRunner (e.g., use shutil.which("tsx") or similar) and call
pytest.skip("tsx not found") when missing so the test is skipped rather than
failing; update the test in tests/test_cli.py (inside test_typescript_baselines,
before invoking main/CliRunner) to perform this check and skip behavior.
- Around line 11-15: The test_runs_successfully test hardcodes "50/50" which
will break when the corpus size changes; update the assertion on result.output
(from the CliRunner(...).invoke(main, ["validate"]) call) to check for a dynamic
passed/total pattern instead (e.g., use a regex like \d+/\d+ via re.search) or
parse the output to extract numeric counts and assert the format and that
exit_code == 0 remains true so the test is resilient to changes in problem
count.

In `@tests/test_models.py`:
- Around line 84-89: The current test patches vera_bench.models.anthropic/openai
but local imports inside AnthropicClient.__init__ and OpenAIClient.__init__
still import the real SDKs; instead patch the constructors themselves: replace
patch("vera_bench.models.anthropic") with
patch.object(vera_bench.models.AnthropicClient, "__init__", return_value=None)
and similarly patch.object(vera_bench.models.OpenAIClient, "__init__",
return_value=None); after patching the __init__s, set up MagicMock instances for
the client behaviors you need and, if your code references SDK exceptions like
anthropic.APITimeoutError or openai.error.Timeout, inject mock exception
attributes onto the mocked objects or modules used by the clients so tests use
the mocked exceptions rather than real SDK classes.

In `@tests/test_validate_integration.py`:
- Line 1: Add a module-level skip when the external "vera" binary is not on
PATH: import pytest and shutil, set pytestmark =
pytest.mark.skipif(shutil.which("vera") is None, reason="vera not available") at
the top of tests/test_validate_integration.py so all tests in this module are
skipped if shutil.which("vera") returns None; reference the pytestmark symbol
and the use of shutil.which("vera") to locate where to add the guard.

In `@tests/test_validate.py`:
- Around line 177-201: These tests rely on real network calls to veralang.dev
(tests test_load_from_url, test_load_default, test_bad_url) which makes them
flaky; update them to mock the URL fetch used by load_skill_md (or replace
SKILL_MD_URL) so network I/O is deterministic: patch the HTTP client/function
load_skill_md uses (e.g., requests.get or your internal fetcher) to return a
fixed response body for the success cases and a controlled error/HTTP status for
the failure case, and assert against that mocked content (keep
test_load_from_file unchanged); reference load_skill_md, SKILL_MD_URL, and the
tests test_load_from_url/test_load_default/test_bad_url when implementing the
mocks.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2dd86e15-3402-4b28-af7b-c6aa0a913582

📥 Commits

Reviewing files that changed from the base of the PR and between e723cb7 and de4ef1a.

📒 Files selected for processing (10)

.github/workflows/ci.yml
CHANGELOG.md
ROADMAP.md
pyproject.toml
tests/test_cli.py
tests/test_models.py
tests/test_runner.py
tests/test_validate.py
tests/test_validate_integration.py
tests/test_vera_runner_integration.py

- CHANGELOG: fix test file count (3 -> 4) - ROADMAP: fix coverage percentage (>83% -> 83%) - test_cli.py: skip TS baselines when tsx missing, use regex for problem count assertion - test_validate_integration.py: skip all when vera not on PATH - test_vera_runner_integration.py: skip all when vera not on PATH - test_validate.py: mock URL fetch in load_skill_md tests Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_cli.py`:
- Around line 83-99: The inline comment "# May skip if tsx not available" in the
test_typescript_baselines test is stale because the pytest.mark.skipif decorator
(pytest.mark.skipif(...)) already handles skipping when tsx/npx are missing;
remove that comment to avoid misleading phrasing and keep the test doc accurate,
leaving the decorator and the assert result.exit_code == 0 unchanged.

In `@tests/test_validate.py`:
- Around line 212-216: test_bad_url currently performs a real HTTP request;
update the test_bad_url test to patch urllib.request.urlopen so it raises
urllib.error.URLError (e.g., via unittest.mock.patch or pytest monkeypatch) when
load_skill_md is called, preserving the pytest.raises(RuntimeError,
match="Failed to fetch") assertion and referencing the load_skill_md function
and urllib.request.urlopen to locate the change.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 5e71ab2d-be2d-4557-bc12-018f49ff875d

📥 Commits

Reviewing files that changed from the base of the PR and between de4ef1a and 2879585.

📒 Files selected for processing (6)

CHANGELOG.md
ROADMAP.md
tests/test_cli.py
tests/test_validate.py
tests/test_validate_integration.py
tests/test_vera_runner_integration.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai Bot reviewed Mar 31, 2026

View reviewed changes

Comment thread CHANGELOG.md Outdated

Comment thread ROADMAP.md Outdated

Comment thread tests/test_cli.py

Comment thread tests/test_cli.py

Comment thread tests/test_models.py

Comment thread tests/test_validate_integration.py

Comment thread tests/test_validate.py

coderabbitai Bot reviewed Mar 31, 2026

View reviewed changes

Comment thread tests/test_cli.py

Comment thread tests/test_validate.py

Remove stale comment, mock bad URL test

4bdb888

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

aallan merged commit e0c542c into main Mar 31, 2026
9 checks passed

aallan deleted the feature/test-coverage branch March 31, 2026 15:16

coderabbitai Bot mentioned this pull request Apr 7, 2026

Moonshot provider support + full benchmark script (v0.0.7) #38

Merged

coderabbitai Bot mentioned this pull request Apr 17, 2026

docs: document all scripts in scripts/README.md; make plot_results data-driven #59

Merged

5 tasks

coderabbitai Bot mentioned this pull request May 5, 2026

Populate bench_version on baseline JSONL output (closes #66) #67

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Increase test coverage to 83%, version in filenames (v0.0.6)#36

Increase test coverage to 83%, version in filenames (v0.0.6)#36
aallan merged 3 commits into
mainfrom
feature/test-coverage

aallan commented Mar 31, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Mar 31, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Possibly related PRs

Suggested labels

❌ Failed checks (2 warnings)

Uh oh!

codecov Bot commented Mar 31, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

aallan commented Mar 31, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Coverage: 66% → 83%

Version tracking (#20)

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Possibly related PRs

Suggested labels

❌ Failed checks (2 warnings)

Uh oh!

codecov Bot commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

aallan commented Mar 31, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Mar 31, 2026 •

edited

Loading

codecov Bot commented Mar 31, 2026 •

edited

Loading