Report T1-T4 aggregate separately for cross-language comparison by jasisz · Pull Request #56 · aallan/vera-bench

jasisz · 2026-04-14T07:15:57Z

Summary

Add exclude_tiers parameter to compute_metrics() for tier-filtered aggregation
Report shows "All Tiers (T1–T5)" and "Comparable (T1–T4)" sections in summary table
Methodology note explains why T5 is reported separately
README note directs readers to T1–T4 for cross-language headline rates

Context

Tier 5 tests algebraic effect handlers (State, Exn, IO) unique to Vera. Python uses try/except, TypeScript uses try/catch, Aver uses Result<T,E> and recursion. Cross-language T5 comparison is apples-to-oranges.

Discussed in PR #48, formalized in #50.

Changes

vera_bench/metrics.py — compute_metrics(results, *, exclude_tiers={5}) filters problems before aggregation
vera_bench/report.py — two summary sections + methodology note; extracted _SUMMARY_HEADER, _SUMMARY_SEP, _summary_row() to avoid duplication
README.md — cross-language comparison note
tests/test_runner.py — tests for tier filtering, backward compatibility, report rendering

Test plan

exclude_tiers={5} correctly removes T5 from aggregates
exclude_tiers=None and exclude_tiers=set() preserve existing behavior
Report renders both sections with correct problem counts
434 tests pass, 85.5% coverage, triple ruff clean

Closes #50

🤖 Generated with Claude Code

Summary by CodeRabbit

Documentation
- Clarified how to compute cross-language headline rates and explained why Tier 5 is reported separately; changelog notes Comparable section suppression when no T1–T4 problems exist.
New Features
- Reports show separate metric tables for All Tiers (T1–T5) and Comparable (T1–T4).
- Added option to exclude specific tiers from metric aggregation.
Tests
- Expanded tests for tier-exclusion behaviour and report rendering scenarios.

coderabbitai · 2026-04-14T07:16:13Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4fa0d3f5-71ff-457d-bf3b-d5acc668663f

📥 Commits

Reviewing files that changed from the base of the PR and between b3e0a25 and 237bc75.

📒 Files selected for processing (5)

CHANGELOG.md
README.md
tests/test_runner.py
vera_bench/metrics.py
vera_bench/report.py

📝 Walkthrough

Walkthrough

Adds tier-filtered metrics and reporting so the primary cross-language headline uses a T1–T4 "Comparable" aggregate while Tier 5 is reported separately; implements filtering in metrics, dual-section report generation, test coverage, and documentation updates. (50 words)

Changes

Cohort / File(s)	Summary
Documentation `README.md`, `CHANGELOG.md`	Adds methodology note and changelog entries describing T1–T4 headline reporting and separate Tier‑5 reporting.
Metrics Filtering `vera_bench/metrics.py`	`compute_metrics(results, *, exclude_tiers: set[int]
Report Generation `vera_bench/report.py`	`generate_report` now computes both full and comparable (T1–T4, via `exclude_tiers={5}`) metric sets; `_summary_table` refactored to optionally render a "Comparable (T1–T4)" section, new `_summary_row` and shared formatting constants.
Tests `tests/test_runner.py`	Adds tests for `exclude_tiers` behavior and report generation asserting both "All Tiers" and "Comparable" sections, plus scenarios where comparable section is omitted (T5‑only).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~28 minutes

Possibly related PRs

Repo housekeeping: hero image, verify fix, codecov, pre-commit, templates, DESIGN/ROADMAP #16 — Modifies compute_metrics; likely overlaps with verify/counting and filtering semantics.
Implement LLM runner harness (Phase 2) #3 — Adds core metrics/reporting infra that this change extends to support tier filtering and dual-table output.
Add --language python for cross-language LLM comparison #11 — Related cross-language reporting changes touching metrics, report generation and tests.

Suggested labels

harness, docs

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 21.43% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately reflects the main objective: reporting T1-T4 aggregate separately for cross-language comparison, which is the core change across metrics, reporting, and documentation.
Linked Issues check	✅ Passed	All coding objectives from issue `#50` are met: compute_metrics accepts exclude_tiers parameter [vera_bench/metrics.py], report renders dual sections T1-T5 and T1-T4 with methodology note [vera_bench/report.py], and README updated with T1-T4 guidance [README.md]. Plotting script updates are deferred (acknowledged in PR context as separate work).
Out of Scope Changes check	✅ Passed	All changes directly support the stated objectives: tier filtering logic, dual-section reporting, test coverage for new behaviour, and documentation updates. No unrelated modifications detected.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_runner.py`:
- Around line 345-348: The current assertions in tests/test_runner.py check
generic fragments ("| 2 |", "| 1 |") on the shared variable report, which can
match unrelated table cells; update the assertions to scope to the correct
section and assert full model rows. Locate the assertions referencing report and
replace them with checks that first isolate the section header text (e.g., "All
tiers" and "T1-T4 comparable") and then assert the exact row or use a anchored
regex that matches the full row containing the expected count (for example
assert re.search(r"^All tiers.*\|\s*2\s*\|", section_text, re.M) and similarly
for the "T1-T4 comparable" section to ensure you are matching the intended row
and count).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1792b57c-fdb7-4603-8c63-bc73aebad63a

📥 Commits

Reviewing files that changed from the base of the PR and between acf1693 and b3e0a25.

📒 Files selected for processing (4)

README.md
tests/test_runner.py
vera_bench/metrics.py
vera_bench/report.py

codecov · 2026-04-14T07:20:51Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.30%. Comparing base (acf1693) to head (237bc75).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #56      +/-   ##
==========================================
+ Coverage   83.06%   83.30%   +0.24%     
==========================================
  Files          10       10              
  Lines        1346     1360      +14     
==========================================
+ Hits         1118     1133      +15     
+ Misses        228      227       -1

Flag	Coverage Δ
python	`83.30% <100.00%> (+0.24%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…an#50) Tier 5 tests algebraic effect handlers unique to Vera. Other languages solve these with native idioms (try/except, Result types, recursion), making cross-language T5 comparison apples-to-oranges. - metrics.py: add exclude_tiers parameter to compute_metrics() - report.py: show "All Tiers (T1-T5)" and "Comparable (T1-T4)" sections in summary table, with methodology note - README.md: add cross-language comparison note referencing aallan#50 - tests: add tests for exclude_tiers filtering, backward compatibility, and report T1-T4 section rendering Closes aallan#50 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

aallan

@jasisz Overall this is a clean, well-scoped PR — backward-compatible compute_metrics signature, correct filter placement, and the rendered report looks right against our real results. A few notes:

1. Empty "Comparable (T1–T4)" row when only T5 problems are present (cosmetic)

If someone runs vera-bench run --model X --tier 5 and then reports, the Comparable section renders a row with all dashes and | 0 | for Problems. Not wrong, but looks odd. Consider suppressing rows where total_problems == 0 in the comparable section.

2. Hard-coded row assertion in `test_report_t1t4_shows_different_counts` (minor brittleness)

The test asserts exact strings like "| test-model | 100% | - | - | 100% | 2 |". Any future cosmetic change to _summary_row (new column, padding, em-dash) will break this without a real regression. Consider checking for the key data points separately (model name, problem count) rather than the full formatted row.

3. No CHANGELOG entry

Worth adding an [Unreleased] entry for this change. We can stamp the version when we merge — we are planning to merge this alongside #25 (expanding the test-case pool) since the two are complementary: this PR splits the headline by tiers, and #25 ensures the T1–T4 pool has enough testable problems to be statistically meaningful.

We are starting work on #25 now (adding test_cases to T2-004/T2-005 and new T2/T3 problems with testable signatures). Will link the PR here once it is up — plan is to merge them back-to-back.

aallan · 2026-04-16T07:30:18Z

PR #57 is up — adds 10 new T2/T3 problems with testable signatures, bringing the T1–T4 testable pool from 18 to 30 problems. Plan is to merge these back-to-back so the T1–T4 headline rate in the report has a meaningful denominator.

- Suppress Comparable (T1–T4) section when no T1–T4 problems exist - Replace brittle full-row assertions with key data point checks - Add test for T5-only report hiding Comparable section - Add CHANGELOG [Unreleased] entry Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jasisz · 2026-04-16T08:13:05Z

Thanks for the review! All three points addressed in 237bc75:

Comparable section is now suppressed when no T1–T4 problems are present
Test assertions check key data points instead of full formatted rows
CHANGELOG [Unreleased] entry added

Looking forward to #57 landing — merging them back-to-back sounds like the right plan.

jasisz requested a review from aallan as a code owner April 14, 2026 07:16

coderabbitai Bot reviewed Apr 14, 2026

View reviewed changes

Comment thread tests/test_runner.py Outdated

jasisz force-pushed the tier5-methodology branch from b3e0a25 to 1e40379 Compare April 14, 2026 07:25

aallan mentioned this pull request Apr 14, 2026

Tier 5 cross-language methodology: report T1-T4 aggregate separately #50

Closed

4 tasks

aallan reviewed Apr 16, 2026

View reviewed changes

aallan mentioned this pull request Apr 16, 2026

Add 10 new T2/T3 problems with testable signatures #57

Merged

7 tasks

aallan merged commit b59fdc6 into aallan:main Apr 16, 2026
10 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Report T1-T4 aggregate separately for cross-language comparison#56

Report T1-T4 aggregate separately for cross-language comparison#56
aallan merged 2 commits into
aallan:mainfrom
jasisz:tier5-methodology

jasisz commented Apr 14, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 14, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

codecov Bot commented Apr 14, 2026 •

edited

Loading

Uh oh!

aallan left a comment •

edited

Loading

Uh oh!

aallan commented Apr 16, 2026

Uh oh!

Uh oh!

jasisz commented Apr 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jasisz commented Apr 14, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Context

Changes

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

codecov Bot commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

aallan left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

1. Empty "Comparable (T1–T4)" row when only T5 problems are present (cosmetic)

2. Hard-coded row assertion in test_report_t1t4_shows_different_counts (minor brittleness)

3. No CHANGELOG entry

Uh oh!

aallan commented Apr 16, 2026

Uh oh!

Uh oh!

jasisz commented Apr 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jasisz commented Apr 14, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 14, 2026 •

edited

Loading

codecov Bot commented Apr 14, 2026 •

edited

Loading

aallan left a comment •

edited

Loading

2. Hard-coded row assertion in `test_report_t1t4_shows_different_counts` (minor brittleness)